Sunday, August 13, 2006

Correlation Clustering

Last year Ailon, Charikar and Newman designed a very simple algorithm for correlation clustering. Suppose you have a set S of n data items, and for each pair of items, you know whether they are similar or dissimilar. How do you partition S into clusters according to similarity? Ideally, any two similar items should be in the same cluster and any two dissimilar items should be in different clusters. This is not always possible: for example, with three items a,b,c, if a and b are similar, b and c are similar, but a and c are dissimilar, there is no way you can cluster them satisfactorily (call this an inconsistent triangle). In correlation clustering, one aims to find a clustering minimizing the number of disagreeing edges.

The algorithm is wonderfully simple: pick an item at random, form a cluster consisting of that item and of all the items which are similar to it, remove the items of that cluster from S, and recurse.

The analysis is wonderfully simple. If you just picked item a, you will create a disagreement on edge {b,c} exactly when triangle abc is inconsistent. For any inconsistent triangle abc, let A(ab,c) denote the event that a,b,c all stay together in S until a is the vertex chosen at random by the algorithm, and let p(bc,a) denote its probability. The expected cost of the algorithm is exactly the sum over inconsistent triangles abc of p(ab,c)+p(ac,b)+p(bc,a). So, if we write q(abc)=p(ab,c)=p(ac,b)=p(bc,a), the cost is 3 times the sum of q(.) over all inconsistent triangles.

Notice that A(ab,c) and A(ab,d) are disjoint events. Thus the sum over c of p(ab,c) is at most 1. So, q defines a fractional packing of inconsistent triangles, and any clustering will have cost at least equal to the sum of q(.) over all inconsistent triangles. This proves that the algorithm is a 3-approximation.

This result is perhaps not a great leap forward in our understanding of TCS in general, but it is a very beautiful illustration of linear programming duality, a little gem of simplicity and elegance that is a pleasure to look at.

Comments page(The usual comments link does not work properly.)


No comments:

Post a Comment