{"title": "Parallel Correlation Clustering on Big Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 82, "page_last": 90, "abstract": "Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, in practice KwikCluster requires a large number of clustering rounds, a potential bottleneck for large graphs.We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds, and provably achieve nearly linear speedups. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3 approximation ratio.We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.", "full_text": "Parallel Correlation Clustering on Big Graphs\n\nXinghao Pan\u21b5,\u270f, Dimitris Papailiopoulos\u21b5,\u270f, Samet Oymak\u21b5,\u270f,\n\nBenjamin Recht\u21b5,\u270f,, Kannan Ramchandran\u270f, and Michael I. Jordan\u21b5,\u270f,\n\n\u21b5AMPLab, \u270fEECS at UC Berkeley, Statistics at UC Berkeley\n\nAbstract\n\nGiven a similarity graph between items, correlation clustering (CC) groups similar\nitems together and dissimilar ones apart. One of the most popular CC algorithms\nis KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and\nobtains a 3-approximation ratio. 
Unfortunately, in practice KwikCluster requires a large number of clustering rounds, a potential bottleneck for large graphs.
We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds, and provably achieve nearly linear speedups. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination-free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3-approximation ratio.
We demonstrate experimentally that both algorithms outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15× speedup.

1 Introduction

Clustering items according to some notion of similarity is a major primitive in machine learning. Correlation clustering serves as a basic means to achieve this goal: given a similarity measure between items, the goal is to group similar items together and dissimilar items apart. In contrast to other clustering approaches, the number of clusters is not determined a priori, and good solutions aim to balance the tension between grouping all items together versus isolating them.
The simplest CC variant can be described on a complete signed graph. Our input is a graph G on n vertices, with +1 weights on edges between similar items, and −1 edges between dissimilar ones. Our goal is to generate a partition of vertices into disjoint sets that minimizes the number of disagreeing edges: this equals the number of "+" edges cut by the clusters plus the number of "−" edges inside the clusters. This metric is commonly called the number of disagreements.
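The disagreements metric is simple to compute directly. The following is a small worked example (our own Python sketch, not code from the paper), counting the cost of a clustering of a 5-vertex complete signed graph:

```python
# Toy illustration: counting disagreements for a clustering of a small
# complete signed graph (all names here are ours, for illustration only).

def disagreements(n, positive_edges, cluster):
    """Cost = (# "-" edges inside clusters) + (# "+" edges across clusters).

    positive_edges: set of frozensets {u, v} carrying a +1 weight;
    every other pair is a "-" edge.  cluster: dict vertex -> cluster id.
    """
    cost = 0
    for u in range(n):
        for v in range(u + 1, n):
            positive = frozenset((u, v)) in positive_edges
            same = cluster[u] == cluster[v]
            if positive and not same:    # "+" edge cut by the clusters
                cost += 1
            elif not positive and same:  # "-" edge inside a cluster
                cost += 1
    return cost

# 5 vertices: {0,1,2} mutually similar, {3,4} similar, plus one noisy
# "+" edge (2,3) linking the two groups.
plus = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (3, 4), (2, 3)]}
clustering = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
print(disagreements(5, plus, clustering))  # 1: only the "+" edge (2,3) is cut
```

Note that grouping all five vertices into one cluster would instead pay for every "−" pair inside the single cluster, illustrating the tension the objective balances.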
In Figure 1, we give a toy example of a CC instance.

Figure 1: In the above graph, solid edges denote similarity and dashed dissimilarity. The number of disagreeing edges in the above clustering is 2; we color the bad edges with red. cost = (# "−" edges inside clusters) + (# "+" edges across clusters) = 2.

Entity deduplication is the archetypal motivating example for correlation clustering, with applications in chat disentanglement, co-reference resolution, and spam detection [1, 2, 3, 4, 5, 6]. The input is a set of entities (say, results of a keyword search), and a pairwise classifier that indicates, with some error, similarities between entities. Two results of a keyword search might refer to the same item, but might look different if they come from different sources. By building a similarity graph between entities and then applying CC, the hope is to cluster duplicate entities in the same group; in the context of keyword search, this implies a more meaningful and compact list of results. CC has been further applied to finding communities in signed networks, classifying missing edges in opinion or trust networks [7, 8], gene clustering [9], and consensus clustering [3].
KwikCluster is the simplest CC algorithm that achieves a provable 3-approximation ratio [10], and works in the following way: pick a vertex v at random (a cluster center), create a cluster for v and its positive neighborhood N(v) (i.e., vertices connected to v with positive edges), peel these vertices and their associated edges from the graph, and repeat until all vertices are clustered. Beyond its theoretical guarantees, experimentally KwikCluster performs well when combined with local heuristics [3].
KwikCluster seems like an inherently sequential algorithm, and in most cases of interest it requires many peeling rounds. This happens because a small number of vertices are clustered per round.
This can be a bottleneck for large graphs. Recently, there have been efforts to develop scalable variants of KwikCluster [5, 6]. In [6], a distributed peeling algorithm was presented in the context of MapReduce. Using an elegant analysis, the authors establish a (3 + ε)-approximation in a polylogarithmic number of rounds. The algorithm employs a simple step that rejects vertices that are executed in parallel but are "conflicting"; however, as we see in our experiments, this seemingly minor coordination step hinders scale-ups in a parallel core setting. In [5], a sketch of a distributed algorithm was presented. This algorithm achieves the same approximation as KwikCluster, in a logarithmic number of rounds, in expectation. However, it performs significant redundant work, per iteration, in its effort to detect in parallel which vertices should become cluster centers.

Our contributions We present C4 and ClusterWild!, two parallel CC algorithms with provable performance guarantees that in practice outperform the state of the art, both in terms of running time and clustering accuracy. C4 is a parallel version of KwikCluster that uses concurrency control to establish a 3-approximation ratio. ClusterWild! is a simple-to-implement, coordination-free algorithm that abandons consistency for the benefit of better scaling, while having a provably small loss in the 3-approximation ratio.
C4 achieves a 3-approximation ratio, in a polylogarithmic number of rounds, by enforcing consistency between concurrently running peeling threads. Consistency is enforced using concurrency control, a notion extensively studied for database transactions, which was recently used to parallelize inherently sequential machine learning algorithms [11].
ClusterWild! is a coordination-free parallel CC algorithm that waives consistency in favor of speed. The cost we pay is an arbitrarily small loss in ClusterWild!'s accuracy.
We show that ClusterWild! achieves a (3 + ε)·OPT + O(ε·n·log² n) approximation, in a polylogarithmic number of rounds, with provable nearly linear speedups. Our main theoretical innovation for ClusterWild! is analyzing the coordination-free algorithm as a serial variant of KwikCluster that runs on a "noisy" graph.
In our experimental evaluation, we demonstrate that both algorithms gracefully scale up to graphs with billions of edges. On these large graphs, our algorithms output a valid clustering in less than 5 seconds, on 32 threads, up to an order of magnitude faster than KwikCluster. We observe how, not unexpectedly, ClusterWild! is faster than C4, and, quite surprisingly, that abandoning coordination in this parallel setting only amounts to a 1% relative loss in clustering accuracy. Furthermore, we compare against state-of-the-art parallel CC algorithms, showing that we consistently outperform them in terms of both running time and clustering accuracy.

Notation G denotes a graph with n vertices and m edges. G is complete and only has ±1 edges. We denote by d_v the positive degree of a vertex v, i.e., the number of vertices connected to v with positive edges. Δ denotes the maximum positive degree of G, and N(v) denotes the positive neighborhood of v; moreover, let C_v = {v, N(v)}. Two vertices u, v are termed "friends" if u ∈ N(v) and vice versa. We denote by π a permutation of {1, . . . , n}.

2 Two Parallel Algorithms for Correlation Clustering

The formal definition of correlation clustering is given below.
Correlation Clustering. Given a graph G on n vertices, partition the vertices into an arbitrary number k of disjoint subsets C_1, . . . , C_k such that the sum of negative edges within the subsets plus the sum of positive edges across the subsets is minimized:

    OPT = min_{1 ≤ k ≤ n}  min_{C_i ∩ C_j = ∅ ∀ i ≠ j, ∪_{i=1}^{k} C_i = {1,...,n}}  Σ_{i=1}^{k} |E−(C_i, C_i)| + Σ_{i=1}^{k} Σ_{j=i+1}^{k} |E+(C_i, C_j)|,

where E+ and E− are the sets of positive and negative edges in G.

KwikCluster is a remarkably simple algorithm that approximately solves the above combinatorial problem, and operates as follows. A random vertex v is picked, a cluster C_v is created with v and its positive neighborhood, then the vertices in C_v are peeled from the graph, and this process is repeated until all vertices are clustered. KwikCluster can be equivalently executed, as noted by [5], if we substitute the random choice of a vertex per peeling round with a random order π preassigned to the vertices (see Alg. 1). That is, select a random permutation on vertices, then peel the vertex indexed by π(1), and its friends. Remove from π the vertices in C_v and repeat this process. Having an order among vertices makes the discussion of parallel algorithms more convenient.

Algorithm 1 KwikCluster with π
1: π = a random permutation of {1, . . . , n}
2: while V ≠ ∅ do
3:   select the vertex v indexed by π(1)
4:   C_v = {v, N(v)}
5:   Remove clustered vertices from G and π
6: end while

C4: Parallel CC using Concurrency Control. Suppose we now wish to run a parallel version of KwikCluster, say on two threads: one thread picks vertex v indexed by π(1) and the other thread picks u indexed by π(2), concurrently. Can both vertices be cluster centers? They can, iff they are not friends in G. If v and u are connected with a positive edge, then the vertex with the smallest order wins. This is our concurrency rule no. 1.
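To make the permutation-driven formulation of Alg. 1 concrete, here is a minimal serial sketch (a Python illustration of ours; the paper's own implementation is in Scala):

```python
import random

def kwikcluster(n, neighbors, seed=0):
    """Serial KwikCluster with a preassigned random order pi (Alg. 1 sketch).

    neighbors[v] is the set of positive neighbors ("friends") of v.
    Returns a dict mapping each vertex to its cluster center.
    """
    pi = list(range(n))
    random.Random(seed).shuffle(pi)      # random permutation of the vertices
    cluster_of = {}
    for v in pi:                         # scan vertices in pi-order
        if v in cluster_of:
            continue                     # already peeled in an earlier round
        cluster_of[v] = v                # v becomes a cluster center...
        for u in neighbors[v]:           # ...and grabs its unclustered friends
            if u not in cluster_of:
                cluster_of[u] = v
    return cluster_of

# Two "+"-triangles joined by the single "+" edge (2, 3):
nbrs = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(kwikcluster(5, nbrs))
```

The output depends on the sampled permutation, but every vertex is always assigned either to itself (a center) or to a friend that preceded it in π, which is exactly the invariant the parallel variants below must preserve.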
Now, assume that v and u are not friends in G, and both v and u become cluster centers. Moreover, assume that v and u have a common, unclustered friend, say w: should w be clustered with v, or u? We need to follow what would happen with KwikCluster in Alg. 1: w will go with the vertex that has the smallest permutation number, in this case v. This is concurrency rule no. 2. Following the above simple rules, we develop C4, our serializable parallel CC algorithm. Since C4 constructs the same clusters as KwikCluster (for a given ordering π), it inherits its 3-approximation. The above idea of identifying the cluster centers in rounds was first used in [12] to obtain a parallel algorithm for maximal independent set (MIS).
C4, shown as Alg. 2, starts by assigning a random permutation π to the vertices; it then samples an active set A of ε·n/Δ unclustered vertices, taken from the prefix of π. After sampling A, each of the P threads picks the vertex with the smallest order remaining in A, then checks if that vertex can become a cluster center. We first enforce concurrency rule no. 1: adjacent vertices cannot be cluster centers at the same time. C4 enforces it by making each thread check the friends of the vertex, say v, that it picked from A. A thread will check in attemptCluster whether its vertex v has any preceding friends that are cluster centers. If there are none, it will go ahead and label v as a cluster center, and proceed with creating a cluster. If a preceding friend of v is a cluster center, then v is labeled as not being a cluster center. If a preceding friend of v, call it u, has not yet received a label (i.e., u is currently being processed and is not yet labeled as cluster center or not), then the thread processing v will wait on u to receive a label.
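Concurrency rule no. 2 reduces to an order-respecting "min" update on each vertex's cluster label: whichever centers reach a shared friend w, w keeps the center with the smallest permutation number. A minimal sketch of that update (our own Python illustration with a coarse lock standing in for per-vertex atomics; names are ours, not the paper's code):

```python
import threading

INF = float("inf")

def create_cluster(center, pi, neighbors, cluster_id, lock):
    """Concurrency rule no. 2 as an atomic 'min' on cluster labels: a
    shared friend keeps whichever center has the smallest pi-order."""
    with lock:
        cluster_id[center] = min(cluster_id[center], pi[center])
    for u in neighbors[center]:
        with lock:
            cluster_id[u] = min(cluster_id[u], pi[center])

# w = 2 is a common friend of the two concurrently chosen centers 0 and 1.
pi = {0: 1, 1: 2, 2: 3}              # permutation order of each vertex
neighbors = {0: {2}, 1: {2}, 2: {0, 1}}
cluster_id = {v: INF for v in pi}
lock = threading.Lock()

threads = [threading.Thread(target=create_cluster,
                            args=(c, pi, neighbors, cluster_id, lock))
           for c in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()

print(cluster_id[2])  # 1: w joins center 0 (smaller pi-order), either way
```

Because "min" is commutative, the final labels are independent of the thread interleaving, which is what makes this rule cheap to enforce.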
The major technical detail is in showing that this wait time is bounded; we show that no more than O(log n) threads can be in conflict at the same time, using a new subgraph sampling lemma [13]. Since C4 is serializable, it has to respect concurrency rule no. 2: if a vertex u is adjacent to two cluster centers, then it gets assigned to the one with the smaller permutation order. This is accomplished in createCluster. After processing all vertices in A, all threads are synchronized in bulk, the clustered vertices are removed, a new active set is sampled, and the same process is repeated until everything has been clustered. In the following section, we present the theoretical guarantees for C4.

Algorithm 2 C4 & ClusterWild!
1: Input: G, ε
2: clusterID(1) = . . . = clusterID(n) = ∞
3: π = a random permutation of {1, . . . , n}
4: while V ≠ ∅ do
5:   Δ = maximum vertex degree in G(V)
6:   A = the first ε · n/Δ vertices in V[π]
7:   while A ≠ ∅ do in parallel
8:     v = first element in A
9:     A = A − {v}
10:    if C4 then // concurrency control
11:      attemptCluster(v)
12:    else if ClusterWild! then // coordination free
13:      createCluster(v)
14:    end if
15:  end while
16:  Remove clustered vertices from V and π
17: end while
18: Output: {clusterID(1), . . . , clusterID(n)}.

createCluster(v):
  clusterID(v) = π(v)
  for u ∈ Γ(v) ∩ A do
    clusterID(u) = min(clusterID(u), π(v))
  end for

attemptCluster(v):
  if clusterID(v) = ∞ and isCenter(v) then
    createCluster(v)
  end if

isCenter(v):
  for u ∈ Γ(v) do // check friends (in order of π)
    if π(u) < π(v) then // if they precede you, wait
      wait until clusterID(u) ≠ ∞ // till clustered
      if isCenter(u) then
        return 0 // a friend is center, so you can't be
      end if
    end if
  end for
  return 1 // no earlier friends are centers, so you are

ClusterWild!: Coordination-free Correlation Clustering. ClusterWild!
speeds up computation by ignoring the first concurrency rule. It uniformly samples unclustered vertices, and builds clusters around all of them, without respecting the rule that cluster centers cannot be friends in G. In ClusterWild!, threads bypass the attemptCluster routine; this eliminates the "waiting" part of C4. ClusterWild! samples a set A of vertices from the prefix of π. Each thread picks the first ordered vertex remaining in A and, using that vertex as a cluster center, creates a cluster around it. It peels away the clustered vertices and repeats the same process on the next remaining vertex in A. At the end of processing all vertices in A, all threads are synchronized in bulk, the clustered vertices are removed, a new active set is sampled, and the parallel clustering is repeated. A careful analysis along the lines of [6] shows that the number of rounds (i.e., bulk synchronization steps) is only polylogarithmic.
Quite unsurprisingly, ClusterWild! is faster than C4. Interestingly, abandoning consistency does not incur much loss in the approximation ratio. We show how the error introduced in the accuracy of the solution can be bounded. We characterize this error theoretically, and show that in practice it translates to only a relative 1% loss in the objective. The main intuition for why ClusterWild! does not introduce too much error is that the chance of two randomly selected vertices being friends is small, hence the concurrency rules are infrequently broken.

3 Theoretical Guarantees

In this section, we bound the number of rounds required for each algorithm, and establish the theoretical speedup one can obtain with P parallel threads. We then proceed to present our approximation guarantees. We would like to remind the reader that, as in the relevant literature, we consider graphs that are complete, signed, and unweighted.
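The coordination-free round just described can be sketched in a few lines. The following is our own serialized Python illustration of one ClusterWild! BSP round (it performs sequentially the per-vertex work the threads do in parallel; all names are ours):

```python
import random

INF = float("inf")

def clusterwild_round(V, pi, neighbors, cluster_id, eps):
    """One BSP round of ClusterWild! (sketch): sample an active set from the
    prefix of pi, let every active vertex act as a cluster center (concurrency
    rule no. 1 is ignored), and give shared friends the smallest-order center."""
    delta = max(len(neighbors[v] & V) for v in V)    # max positive degree in G(V)
    batch = max(1, int(eps * len(V) / max(delta, 1)))
    active = sorted(V, key=lambda v: pi[v])[:batch]  # prefix of pi
    for v in active:                                 # every active vertex...
        cluster_id[v] = min(cluster_id[v], pi[v])    # ...becomes a center
        for u in neighbors[v] & V:
            cluster_id[u] = min(cluster_id[u], pi[v])  # min-order center wins
    return {v for v in V if cluster_id[v] == INF}    # still-unclustered vertices

nbrs = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
pi = {v: r for r, v in enumerate(random.Random(0).sample(range(5), 5))}
cluster_id = {v: INF for v in nbrs}
V = set(nbrs)
while V:
    V = clusterwild_round(V, pi, nbrs, cluster_id, eps=0.9)
print(cluster_id)  # every vertex ends with a finite center-order label
```

Adjacent active vertices may both become centers here, which is precisely the "noise" the analysis in Section 3 charges against the approximation ratio.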
The omitted proofs can be found in the Appendix.

3.1 Number of rounds and running time

Our analysis follows those of [12] and [6]. The main idea is to track how fast the maximum degree decreases in the remaining graph at the end of each round.

Lemma 1. C4 and ClusterWild! terminate after O((1/ε) · log n · log Δ) rounds w.h.p.

We now analyze the running time of both algorithms under a simplified BSP model. The main idea is that the running time of each "super step" (i.e., round) is determined by the "straggling" thread (i.e., the one that gets assigned the most work), plus the time needed for synchronization at the end of each round.
Assumption 1. We assume that threads operate asynchronously within a round and synchronize at the end of a round. A memory cell can be written/read concurrently by multiple threads. The time spent per round of the algorithm is proportional to the time of the slowest thread. The cost of thread synchronization at the end of each batch takes time O(P), where P is the number of threads. The total computation cost is proportional to the sum of the time spent for all rounds, plus the time spent during the bulk synchronization steps.

Under this simplified model, we show that both algorithms obtain nearly linear speedup, with ClusterWild! being faster than C4, precisely due to the lack of coordination. Our main tool for analyzing C4 is a recent graph-theoretic result from [13] (Theorem 1), which guarantees that if one samples an O(n/Δ) subset of vertices in a graph, the sampled subgraph has a connected component of size at most O(log n). Combining the above, in the appendix we show the following result.
Theorem 2. The theoretical running time of C4 on P cores is upper bounded by O(((m + n log n)/P + P) · log n · log Δ), as long as the number of cores P is smaller than min_i n_i/Δ_i, where n_i is the size of the batch and Δ_i the maximum positive degree in the i-th round of each algorithm. The running time of ClusterWild! on P cores is upper bounded by O((m + n)/P + P · log n · log Δ).

3.2 Approximation ratio

We now proceed with establishing the approximation ratios of C4 and ClusterWild!.

C4 is serializable. It is straightforward that C4 obtains precisely the same approximation ratio as KwikCluster. One simply has to show that, for any permutation π, KwikCluster and C4 output the same clustering. This is indeed true, as the two simple concurrency rules mentioned in the previous section are sufficient for C4 to be equivalent to KwikCluster.
Theorem 3. C4 achieves a 3-approximation ratio, in expectation.

ClusterWild! as a serial procedure on a noisy graph. Analyzing ClusterWild! is a bit more involved. Our guarantees are based on the fact that ClusterWild! can be treated as if one were running a peeling algorithm on a "noisy" graph. Since adjacent active vertices can still become cluster centers in ClusterWild!, one can view the edges between them as "deleted," by a somewhat unconventional adversary. We analyze this new, noisy graph and establish our theoretical result.
Theorem 4. ClusterWild! achieves a (3 + ε) · OPT + O(ε · n · log² n) approximation, in expectation.
We provide a sketch of the proof, and delegate the details to the appendix. Since ClusterWild! ignores the edges among active vertices, we treat these edges as deleted. In our main result, we quantify the loss of clustering accuracy that is caused by ignoring these edges. Before we proceed, we define bad triangles, a combinatorial structure that is used to measure the clustering quality of a peeling algorithm.
Definition 1.
A bad triangle in G is a set of three vertices, such that two pairs are joined with a positive edge, and one pair is joined with a negative edge. Let T_b denote the set of bad triangles in G.

To quantify the cost of ClusterWild!, we make the observation below.
Lemma 5. The cost of any greedy algorithm that picks a vertex v (irrespective of the sampling order), creates C_v, peels it away and repeats, is equal to the number of bad triangles adjacent to each cluster center v.
Lemma 6. Let Ĝ denote the random graph induced by deleting all edges between active vertices per round, for a given run of ClusterWild!, and let τ_new denote the number of additional bad triangles that Ĝ has compared to G. Then, the expected cost of ClusterWild! can be upper bounded as E{Σ_{t ∈ T_b} 1_{P_t} + τ_new}, where P_t is the event that triangle t, with end points i, j, k, is bad, and at least one of its end points becomes active while t is still part of the original unclustered graph.

Proof. We begin by bounding the second term, E{τ_new}, by considering the number of new bad triangles τ_{new,i} created at each round i:

    E{τ_{new,i}} ≤ Σ_{(u,v) ∈ E} P(u, v ∈ A_i) · |N(u) ∪ N(v)| ≤ Σ_{(u,v) ∈ E} (ε/Δ_i)² · 2 · Δ_i ≤ 2 · ε² · |E|/Δ_i ≤ 2 · ε² · n.

Using the result that ClusterWild! terminates after at most O((1/ε) · log n · log Δ) rounds, we get that¹ E{τ_new} ≤ O(ε · n · log² n).
We are left to bound E{Σ_{t ∈ T_b} 1_{P_t}} = Σ_{t ∈ T_b} p_t. To do that we use the following lemma.
Lemma 7. If p_t satisfies, for every edge e, Σ_{t: e ⊂ t ∈ T_b} p_t/α ≤ 1, then Σ_{t ∈ T_b} p_t ≤ α · OPT.
Proof. Let B* be one (of the possibly many) sets of edges that attribute a +1 to the cost of an optimal algorithm. Then, OPT = Σ_{e ∈ B*} 1 ≥ Σ_{e ∈ B*} Σ_{t: e ⊂ t ∈ T_b} p_t/α = Σ_{t ∈ T_b} |B* ∩ t| · p_t/α ≥ Σ_{t ∈ T_b} p_t/α, since |B* ∩ t| ≥ 1 for every bad triangle t.

Now, as with [6], we simply have to bound the expected number of bad triangles adjacent to an edge (u, v). Let S_{u,v} = ∪_{{u,v} ⊂ t ∈ T_b} t be the union of the sets of nodes of the bad triangles that contain both vertices u and v. Observe that if some w ∈ S\{u, v} becomes active before u and v, then a cost of 1 (i.e., the cost of the bad triangle {u, v, w}) is incurred. On the other hand, if either u or v, or both, are selected as pivots in some round, then C_{u,v} can be as high as |S| − 2, i.e., at most equal to all bad triangles containing the edge {u, v}. Let A_{u,v} = {u or v are activated before any other vertices in S_{u,v}}. Then,

    E[C_{u,v}] = E[C_{u,v} | A_{u,v}] · P(A_{u,v}) + E[C_{u,v} | A^c_{u,v}] · P(A^c_{u,v})
               ≤ 1 + (|S| − 2) · P({u, v} ∩ A ≠ ∅ | S ∩ A ≠ ∅) ≤ 1 + 2|S| · P(v ∈ A | S ∩ A ≠ ∅),

where the last inequality is obtained by a union bound over u and v. We now bound the following probability:

    P{v ∈ A | S ∩ A ≠ ∅} = P{v ∈ A} · P{S ∩ A ≠ ∅ | v ∈ A} / P{S ∩ A ≠ ∅} = P{v ∈ A} / P{S ∩ A ≠ ∅}.

Observe that P{v ∈ A} = ε/Δ, hence we need to upper bound P{S ∩ A = ∅} = 1 − P{S ∩ A ≠ ∅}. The probability, per round, that no vertex in S becomes activated is upper bounded by

    P{S ∩ A = ∅} = Π_{t=1}^{|S|} (1 − (ε · n/Δ)/(n − |S| + t)) ≤ (1 − ε/Δ)^{|S|} ≤ e^{−ε·|S|/Δ}.

Hence, the probability can be upper bounded as

    P{v ∈ A | S ∩ A ≠ ∅} ≤ (ε/Δ) · 1/(1 − e^{−ε·|S|/Δ}).

We know that |S| ≤ 2 · Δ + 2 and also ε ≤ 1.
Then, 0 ≤ ε · |S|/Δ ≤ ε · (2Δ + 2)/Δ ≤ 4 · ε. Hence, we have

    E[C_{u,v}] ≤ 1 + 2|S| · P{v ∈ A | S ∩ A ≠ ∅} ≤ 1 + 2 · (ε · |S|/Δ)/(1 − e^{−ε·|S|/Δ}) ≤ 1 + (2 · 4 · ε)/(1 − e^{−4·ε}).

The overall expectation is then bounded by E{Σ_{t ∈ T_b} 1_{P_t} + τ_new} ≤ (1 + (2 · 4 · ε)/(1 − e^{−4·ε})) · OPT + ε · n · ln(n/Δ) · log Δ ≤ (3 + ε) · OPT + O(ε · n · log² n), which establishes our approximation ratio for ClusterWild!.

3.3 BSP Algorithms as a Proxy for Asynchronous Algorithms

We would like to note that the analysis under the BSP model can be a useful proxy for the performance of completely asynchronous variants of our algorithms. Specifically, see Alg. 3, where we remove the synchronization barriers.
The only difference of the asynchronous execution in Alg. 3, compared to Alg. 2, is the complete lack of bulk synchronization at the end of the processing of each active set A. Although the analysis of the BSP variants of the algorithms is tractable, unfortunately, analyzing precisely the speedup of the asynchronous C4 and the approximation guarantees of the asynchronous ClusterWild! is challenging. However, in our experimental section we test the completely asynchronous algorithms against the BSP algorithms of the previous section, and observe that they perform quite similarly both in terms of clustering accuracy and running times.

¹We skip the constants to simplify the presentation; however, they are all smaller than 10.

4 Related Work

Algorithm 3 C4 & ClusterWild! (asynchronous execution)
1: Input: G
2: clusterID(1) = . . . = clusterID(n) = ∞
3: π = a random permutation of {1, . . . , n}
4: while V ≠ ∅ do
5:   v = first element in V
6:   V = V − {v}
7:   if C4 then // concurrency control
8:     attemptCluster(v)
9:   else if ClusterWild! then // coordination free
10:    createCluster(v)
11:  end if
12:  Remove clustered vertices from V and π
13: end while
14: Output: {clusterID(1), . . . , clusterID(n)}.

Correlation clustering was formally introduced by Bansal et al. [14]. In the general case, minimizing disagreements is NP-hard and hard to approximate within an arbitrarily small constant (APX-hard) [14, 15]. There are two variations of the problem: i) CC on complete graphs where all edges are present and all weights are ±1, and ii) CC on general graphs with arbitrary edge weights. Both problems are hard; however, the general-graph setup seems fundamentally harder. The best known approximation ratio for the latter is O(log n), and a reduction to the minimum multicut problem indicates that any improvement to that requires fundamental breakthroughs in theoretical algorithms [16].
In the case of complete unweighted graphs, a long series of results establishes a 2.5 approximation via a rounded linear program (LP) [10]. A recent result establishes a 2.06 approximation using an elegant rounding of the same LP relaxation [17].
Avoiding the expensive LP, and using just the rounding procedure of [10] as the basis for a greedy algorithm, yields KwikCluster: a 3-approximation for CC on complete unweighted graphs.
Variations of the cost metric for CC change the algorithmic landscape: maximizing agreements (the dual measure of disagreements) [14, 18, 19], or maximizing the difference between the number of agreements and disagreements [20, 21], come with different hardness and approximation results. There are also several variants: chromatic CC [22], overlapping CC [23], and small-number-of-clusters CC with added constraints that are suitable for some biology applications [24].
The way C4 finds the cluster centers can be seen as a variation of the MIS algorithm of [12]; the main difference is that in our case, we "passively" detect the MIS, by locking on memory variables and by waiting on preceding ordered threads. This means that a vertex only "pushes" its cluster ID and status (cluster center/clustered/unclustered) to its friends, versus "pulling" (or asking) for its friends' cluster status. This saves a substantial amount of computational effort.

5 Experiments

Our parallel algorithms were all implemented² in Scala; we defer a full discussion of the implementation details to Appendix C. We ran all our experiments on Amazon EC2's r3.8xlarge (32 vCPUs, 244GB memory) instances, using 1-32 threads.
The real graphs listed in Table 1 were each tested with 100 different random π orderings.

Graph        | # vertices  | # edges       | Description
DBLP-2011    | 986,324     | 6,707,236     | 2011 DBLP co-authorship network [25, 26, 27].
ENWiki-2013  | 4,206,785   | 101,355,853   | 2013 link graph of English part of Wikipedia [25, 26, 27].
UK-2005      | 39,459,925  | 921,345,078   | 2005 crawl of the .uk domain [25, 26, 27].
IT-2004      | 41,291,594  | 1,135,718,909 | 2004 crawl of the .it domain [25, 26, 27].
WebBase-2001 | 118,142,155 | 1,019,903,190 | 2001 crawl by WebBase crawler [25, 26, 27].
Table 1: Graphs used in the evaluation of our parallel algorithms.

We measured the runtimes, speedups (ratio of runtime on 1 thread to runtime on p threads), and objective values obtained by our parallel algorithms. For comparison, we also implemented the algorithm presented in [6], which we denote as CDK for short³. Values of ε = 0.1, 0.5, 0.9 were used for C4 BSP, ClusterWild! BSP, and CDK. In the interest of space, we present only representative plots of our results; full results are given in our appendix.

²Code available at https://github.com/pxinghao/ParallelCorrelationClustering.
³CDK was only tested on the smaller graphs of DBLP-2011 and ENWiki-2013, because CDK was prohibitively slow, often 2-3 orders of magnitude slower than C4, ClusterWild!, and even KwikCluster.

[Figure 2 plots omitted: (a) Mean runtimes, UK-2005, ε = 0.1; (b) Mean runtimes, IT-2004, ε = 0.5; (c) Mean speedup, WebBase-2001, ε = 0.9; (d) Mean number of synchronization rounds for BSP algorithms; (e) Percent of blocked vertices for C4, ENWiki-2013, BSP run with ε = 0.9; (f) Median objective values, DBLP-2011, CW BSP and CDK run with ε = 0.9.]
Figure 2: In the above figures, 'CW' is short for ClusterWild!, 'BSP' is short for the bulk-synchronous variants of the parallel algorithms, and 'As' is short for the asynchronous variants.

Runtimes & Speedups: C4 and ClusterWild!
are initially slower than serial, due to the overheads required for atomic operations in the parallel setting. However, all our parallel algorithms outperform KwikCluster with 3-4 threads. As more threads are added, the asynchronous variants become faster than their BSP counterparts, as they have no synchronization barriers. The difference between the BSP and asynchronous variants is greater for smaller ε. ClusterWild! is also always faster than C4, since it incurs no coordination overheads. The asynchronous algorithms achieve a speedup of 13-15x on 32 threads. The BSP algorithms have a poorer speedup ratio, but nevertheless achieve a 10x speedup with ε = 0.9.
Synchronization rounds: The main overhead of the BSP algorithms lies in the need for synchronization rounds. As ε increases, the amount of synchronization decreases; with ε = 0.9, our algorithms use fewer than 1000 synchronization rounds, which is small considering the size of the graphs and our multicore setting.
Blocked vertices: Additionally, C4 incurs an overhead from vertices that are blocked waiting for earlier vertices to complete. We note that this overhead is extremely small in practice: on all graphs, less than 0.2% of vertices are blocked. On the larger and sparser graphs, this drops to less than 0.02% (i.e., 1 in 5000) of vertices.
Objective value: By design, C4 returns the same output (and thus objective value) as KwikCluster. We find that ClusterWild! BSP is at most 1% worse than serial across all graphs and values of ε. Asynchronous ClusterWild! worsens as threads are added, reaching 15% worse than serial on one of the graphs. Finally, on the smaller graphs on which we were able to test CDK, CDK returns a worse median objective value than both ClusterWild!
variants.

6 Conclusions and Future Directions

In this paper, we have presented two parallel algorithms for correlation clustering with nearly linear speedups and provable approximation ratios. Overall, the two approaches complement each other: when C4 is nearly as fast as ClusterWild!, we may prefer C4 for its accuracy guarantees, and when ClusterWild! is nearly as accurate as C4, we may prefer ClusterWild! for its speed.
In the future, we intend to implement our algorithms in the distributed environment, where synchronization and communication often account for the highest cost. Both C4 and ClusterWild! are well-suited for the distributed setting, since they run in a polylogarithmic number of rounds.

References
[1] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[2] Arvind Arasu, Christopher Ré, and Dan Suciu. Large-scale deduplication with constraints using Dedupalog. In Proceedings of the IEEE 25th International Conference on Data Engineering (ICDE), pages 952–963. IEEE, 2009.
[3] Micha Elsner and Warren Schudy. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pages 19–27. Association for Computational Linguistics, 2009.
[4] Bilal Hussain, Oktie Hassanzadeh, Fei Chiang, Hyun Chul Lee, and Renée J. Miller. An evaluation of clustering algorithms in duplicate detection. Technical report, 2013.
[5] Francesco Bonchi, David Garcia-Soriano, and Edo Liberty. Correlation clustering: from theory to practice. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1972–1972. ACM, 2014.
[6] Flavio Chierichetti, Nilesh Dalvi, and Ravi Kumar. Correlation clustering in MapReduce.
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 641–650. ACM, 2014.
[7] Bo Yang, William K. Cheung, and Jiming Liu. Community mining from signed social networks. IEEE Transactions on Knowledge and Data Engineering, 19(10):1333–1348, 2007.
[8] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. A correlation clustering approach to link classification in signed networks. In Annual Conference on Learning Theory (COLT), pages 34.1–34.20. Microtome, 2012.
[9] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.
[10] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):23, 2008.
[11] Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan. Optimistic concurrency control for distributed unsupervised learning. In Advances in Neural Information Processing Systems, pages 1403–1411, 2013.
[12] Guy E. Blelloch, Jeremy T. Fineman, and Julian Shun. Greedy sequential maximal independent set and matching are parallel on average. In Proceedings of the 24th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 308–317. ACM, 2012.
[13] Michael Krivelevich. The phase transition in site percolation on pseudo-random graphs. arXiv preprint arXiv:1404.5731, 2014.
[14] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, pages 238–247. IEEE Computer Society, 2002.
[15] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 524–533.
IEEE, 2003.
[16] Erik D. Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2):172–187, 2006.
[17] Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, STOC '15, pages 219–228, 2015.
[18] Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 526–527. Society for Industrial and Applied Mathematics, 2004.
[19] Ioannis Giotis and Venkatesan Guruswami. Correlation clustering with a fixed number of clusters. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1167–1176. ACM, 2006.
[20] Moses Charikar and Anthony Wirth. Maximizing quadratic programs: extending Grothendieck's inequality. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 54–60. IEEE, 2004.
[21] Noga Alon, Konstantin Makarychev, Yury Makarychev, and Assaf Naor. Quadratic forms on graphs. Inventiones Mathematicae, 163(3):499–522, 2006.
[22] Francesco Bonchi, Aristides Gionis, Francesco Gullo, and Antti Ukkonen. Chromatic correlation clustering. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1321–1329. ACM, 2012.
[23] Francesco Bonchi, Aristides Gionis, and Antti Ukkonen. Overlapping correlation clustering. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), pages 51–60. IEEE, 2011.
[24] Gregory J. Puleo and Olgica Milenkovic. Correlation clustering with constrained cluster sizes and extended weights bounds.
arXiv preprint arXiv:1411.0547, 2014.
[25] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, 2004.
[26] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW. ACM Press, 2011.
[27] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711–726, 2004.