{"title": "Affinity Clustering: Hierarchical Clustering at Scale", "book": "Advances in Neural Information Processing Systems", "page_first": 6864, "page_last": 6874, "abstract": "Graph clustering is a fundamental task in many data-mining and machine-learning pipelines. In particular, identifying a good hierarchical structure is at the same time a fundamental and challenging problem for several applications. The amount of data to analyze is increasing at an astonishing rate each day. Hence there is a need for new solutions to efficiently compute effective hierarchical clusterings on such huge data.  The main focus of this paper is on minimum spanning tree (MST) based clusterings. In particular, we propose affinity, a novel hierarchical clustering based on Boruvka's MST algorithm. We prove certain theoretical guarantees for affinity (as well as some other classic algorithms) and show that in practice it is superior to several other state-of-the-art clustering algorithms.   Furthermore, we present two MapReduce implementations for affinity. The first one works for the case where the input graph is dense and takes constant rounds. It is based on a Massively Parallel MST algorithm for dense graphs that improves upon the state-of-the-art algorithm of Lattanzi et al. (SPAA 2011). Our second algorithm has no assumption on the density of the input graph and finds the affinity clustering in $O(\\log n)$ rounds using Distributed Hash Tables (DHTs). 
We show experimentally that our algorithms are scalable for huge data sets, e.g., for graphs with trillions of edges.", "full_text": "Af\ufb01nity Clustering: Hierarchical Clustering at Scale\n\nMohammadHossein Bateni\n\nGoogle Research\n\nbateni@google.com\n\nSoheil Behnezhad\u2217\nUniversity of Maryland\nsoheil@cs.umd.edu\n\nMahsa Derakhshan\u2217\nUniversity of Maryland\nmahsaa@cs.umd.edu\n\nMohammadTaghi Hajiaghayi\u2217\n\nUniversity of Maryland\nhajiagha@cs.umd.edu\n\nRaimondas Kiveris\n\nGoogle Research\n\nrkiveris@google.com\n\nSilvio Lattanzi\nGoogle Research\n\nsilviol@google.com\n\nVahab Mirrokni\nGoogle Research\n\nmirrokni@google.com\n\nAbstract\n\nGraph clustering is a fundamental task in many data-mining and machine-learning\npipelines. In particular, identifying a good hierarchical structure is at the same time\na fundamental and challenging problem for several applications. The amount of\ndata to analyze is increasing at an astonishing rate each day. Hence there is a need\nfor new solutions to ef\ufb01ciently compute effective hierarchical clusterings on such\nhuge data.\nThe main focus of this paper is on minimum spanning tree (MST) based clusterings.\nIn particular, we propose af\ufb01nity, a novel hierarchical clustering based on Bor\u02dauvka\u2019s\nMST algorithm. We prove certain theoretical guarantees for af\ufb01nity (as well as\nsome other classic algorithms) and show that in practice it is superior to several\nother state-of-the-art clustering algorithms.\nFurthermore, we present two MapReduce implementations for af\ufb01nity. The \ufb01rst\none works for the case where the input graph is dense and takes constant rounds. It\nis based on a Massively Parallel MST algorithm for dense graphs that improves\nupon the state-of-the-art algorithm of Lattanzi et al. [34]. Our second algorithm has\nno assumption on the density of the input graph and \ufb01nds the af\ufb01nity clustering in\nO(log n) rounds using Distributed Hash Tables (DHTs). 
We show experimentally\nthat our algorithms are scalable for huge data sets, e.g., for graphs with trillions of\nedges.\n\n1\n\nIntroduction\n\nClustering is a classic unsupervised learning problem with many applications in information retrieval,\ndata mining, and machine learning. In hierarchical clustering the goal is to detect a nested hierarchy\nof clusters that unveils the full clustering structure of the input data set. In this work we study the\nhierarchical clustering problem on real-world graphs. This problem has received a lot of attention\nin recent years [13, 16, 41] and new elegant formulations and algorithms have been introduced.\nNevertheless many of the newly proposed techniques are sequential, hence dif\ufb01cult to apply on large\ndata sets.\n\n\u2217Supported in part by NSF CAREER award CCF-1053605, NSF BIGDATA grant IIS-1546108, NSF\nAF:Medium grant CCF-1161365, DARPA GRAPHS/AFOSR grant FA9550-12-1-0423, and another DARPA\nSIMPLEX grant.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fWith the constant increase in the size of data sets to analyze, it is crucial to design ef\ufb01cient large-scale\nsolutions that can be easily implemented in distributed computing platforms (such as Spark [45]\nand Hadoop [43] as well as MapReduce and its extension Flume [17]), and cloud services (such as\nAmazon Cloud or Google Cloud). For this reason in the past decade several papers proposed new\ndistributed algorithms for classic computer science and machine learning problems [3, 4, 7, 14, 15, 19].\nDespite these efforts not much is known about distributed algorithms for hierarchical clustering. 
There\nare only two works analyzing these problems [27, 28], and neither gives any theoretical guarantees\non the quality of their algorithms or on the round complexity of their solutions.\nIn this work we propose new parallel algorithms in the MapReduce model to compute hierarchical\nclustering and we analyze them from both theoretical and experimental perspectives. The main idea\nbehind our algorithms is to adapt clustering techniques based on classic minimum spanning tree\nalgorithms such as Bor\u02dauvka\u2019s algorithm [11] and Kruskal\u2019s algorithm [33] to run ef\ufb01ciently in parallel.\nFurthermore we also provide a new theoretical framework to compare different clustering algorithms\nbased on the concept of a \u201ccerti\ufb01cate\u201d and show new interesting properties of our algorithms.\nWe can summarize our contribution in four main points.\nFirst, we focus on the distributed implementations of two important clustering techniques based\non classic minimum spanning tree algorithms. In particular we consider linkage-based clusterings\ninspired by Kruskal\u2019s algorithm and a novel clustering called af\ufb01nity clustering based on Bor\u02dauvka\u2019s\nalgorithm. We provide new theoretical frameworks to compare different clustering algorithms based\non the concept of a \u201ccerti\ufb01cate\u201d as a proof of having a good clustering and show new interesting\nproperties of both af\ufb01nity and single-linkage clustering algorithms.\nThen, using a connection between linkage-based clustering, af\ufb01nity clustering and the minimum\nspanning tree problem, we present new ef\ufb01cient distributed algorithms for the hierarchical clustering\nproblem in a MapReduce model. In our analysis we consider the most restrictive model for distributed\ncomputing, called Massively Parallel Communication, among previously studied MapReduce-like\nmodels [10, 23, 30]. 
Along the way, we obtain a constant round MapReduce algorithm for minimum\nspanning tree (MST) of dense graphs (in Section 5). Our algorithm for graphs with \u0398(n^{1+c}) edges\nand for any given \u03b5 with 0 < \u03b5 < c < 1, finds the MST in \u2308log(c/\u03b5)\u2309 + 1 rounds using \u00d5(n^{1+\u03b5}) space\nper machine and O(n^{c\u2212\u03b5}) machines (i.e., optimal total space). This improves the round complexity of\nthe state-of-the-art MST algorithm of Lattanzi et al. [34] for dense graphs which requires up to \u2308c/\u03b5\u2309\nrounds using the same number of machines and space. Prior to our work, no hierarchical clustering\nalgorithm was known in this model.\nThen we turn our attention to real world applications and we introduce efficient implementations of\naffinity clustering as well as classic single-linkage clustering that leverage Distributed Hash Tables\n(DHTs) [12, 31] to speed up computation for huge data sets.\nLast but not least, we present an experimental study where we analyze the scalability and effectiveness\nof our newly introduced algorithms and we observe that, in most cases, affinity clustering outperforms\nall state-of-the-art algorithms from both quality and scalability standpoints.2\n\n2 Related Work\n\nClustering and, in particular, hierarchical clustering techniques have been studied by hundreds of\nresearchers [16, 20, 22, 32]. In social networks, detecting the hierarchical clustering structure is a\nbasic primitive for studying the interaction between nodes [36, 39]. Other relevant applications of\nhierarchical clustering can be found in bioinformatics, image analysis and text classification.\nOur paper is closely related to two main lines of research. The first one focuses on studying\ntheoretical properties of clustering approaches based on minimum spanning trees (MSTs). 
Linkage-\nbased clusterings (often based on Kruskal\u2019s algorithm) have been extensively studied as basic\ntechniques for clustering datasets. The most common linkage-based clustering algorithms are single-\nlinkage, average-linkage and complete-linkage algorithms. In [44], Zadeh and Ben-David gave a\ncharacterization of the single-linkage algorithm. Their result has been then generalized to linkage-\nbased algorithms in [1]. Furthermore single-linkage algorithms are known to provably recover a\nground truth clustering if the similarity function has some stability properties [6]. In this paper we\n\n2Implementations are available at https://github.com/MahsaDerakhshan/AffinityClustering.\n\n2\n\n\fintroduce a new technique to compare clustering algorithms based on \u201ccerti\ufb01cates.\u201d Furthermore we\nintroduce and analyze a new algorithm\u2014af\ufb01nity\u2014based on Bor\u02dauvka\u2019s well-known algorithm. We\nshow that af\ufb01nity is not only scalable for huge data sets but also its performance is superior to several\nstate-of-the-art clustering algorithms. To the best of our knowledge though Bor\u02dauvka\u2019s algorithm is a\nwell-known and classic algorithm, not many clustering algorithms have been considered based on\nBor\u02dauvka\u2019s.\nThe second line of work is closely related to distributed algorithms for clustering problems. Several\nmodels of MapReduce computation have been introduced in the past few years [10, 23, 30]. The \ufb01rst\npaper that studied clustering problems in these models is by Ene et al. [18], where the authors prove\nthat any \u03b1 approximation algorithm for the k-center or k-median problems can produce 4\u03b1 + 2 and\n10\u03b1 + 3 approximation factors, respectively, for the k-center or k-median problems in the MapReduce\nmodel. Subsequently several papers [5, 7, 8] studied similar problems in the MapReduce model. 
A\nlot of efforts also went into studying ef\ufb01cient algorithms on graphs [3, 4, 7, 15, 14, 19]. However the\nproblem of hierarchical clustering did not receive a lot of attention. To the best of our knowledge\nthere are only two papers [27, 28] on this topic, and neither analyzes the problem formally or proves\nany guarantee in any MapReduce model.\n\n3 Minimum Spanning Tree-Based Clusterings\n\nWe begin by going over two famous algorithms for minimum spanning tree and de\ufb01ne the corre-\nsponding algorithms for clustering.\nBor\u02dauvka\u2019s algorithm and af\ufb01nity clustering: Bor\u02dauvka\u2019s algorithm [11], \ufb01rst published in 1926, is\nan algorithm for \ufb01nding a minimum spanning tree (MST)3. The algorithm was rediscovered a few\ntimes, in particular by Sollin [42] in 1965 in the parallel computing literature. Initially each vertex\nforms a group (cluster) by itself. The algorithm begins by picking the cheapest edge going out of\neach cluster, in each round (in parallel) joins these clusters to form larger clusters and continues\njoining in a similar manner until a tree spanning all vertices is formed. Since the size of the smallest\ncluster at least doubles each time, the number of rounds is at most O(log n). In af\ufb01nity clustering, we\nstop Bor\u02dauvka\u2019s algorithm after r > 0 rounds when for the \ufb01rst time we have at most k clusters for a\ndesired number k > 0. In case the number of clusters is strictly less than k, we delete the edges that\nwe added in the last round in a non-increasing order (i.e., we delete the edge with the highest weight\n\ufb01rst) to obtain exactly k clusters. To the best of our knowledge, although Bor\u02dauvka\u2019s algorithm is a\nwell-known and classic algorithm, clustering algorithms based on it have not been considered much.\nA natural hierarchy of nodes can be obtained by continuing Bor\u02dauvka\u2019s algorithm: each cluster here\nwill be a subset of future clusters. 
We call this hierarchical affinity clustering.\nWe present distributed implementations of Bor\u016fvka/affinity in Section 5 and show its scalability even\nfor huge graphs. We also show in Section 6 that affinity clustering, in most cases, works much better than several\nwell-known clustering algorithms.\nKruskal\u2019s algorithm and single-linkage clustering: Kruskal\u2019s algorithm [33], first introduced in\n1956, is another famous algorithm for finding an MST. The algorithm is highly sequential and iteratively\npicks an edge of the least possible weight that connects any two trees (clusters) in the forest.4 Though\nthe number of iterations in Kruskal\u2019s algorithm is n \u2212 1 (the number of edges of any tree on n nodes),\nthe algorithm can be implemented in O(m log n) time with simple data structures (m is the number\nof edges) and in O(m \u03b1(n)) time using a more sophisticated disjoint-set data structure, where \u03b1(\u00b7) is\nthe extremely slowly growing inverse of the single-valued Ackermann function.\nIn single-linkage clustering, we stop Kruskal\u2019s algorithm when we have at least k clusters (trees) for\na desired number k > 0. 
Again if we desire to obtain a corresponding hierarchical single-linkage\nclustering, by adding further edges which will be added in Kruskal\u2019s algorithm later, we can obtain a\nnatural hierarchical clustering (each cluster here will be a subset of future clusters).\nAs mentioned above, Kruskal\u2019s Algorithm and single-linkage clustering are highly sequential, however\nas we show in Section 5 thinking backward once we have an ef\ufb01cient implementation of Bor\u02dauvka\u2019s\n\n3More precisely the algorithm works when there is a unique MST, in particular, when all edge weights are\ndistinct; however this can be easily achieved by either perturbing the edge weights by an \u0001 > 0 amount or have a\ntie-breaking ordering for edges with the same weights\n\n4Unlike Bor\u02dauvka\u2019s method, this greedy algorithm has no limitations on the distinctness of edge weights.\n\n3\n\n\f(or any MST algorithm) in Map-Reduce and using Distributed Hash Tables (DHTs), we can achieve\nan ef\ufb01cient parallel implementation of single-linkage clustering as well. We show scalability of this\nimplementation even for huge graphs in Section 5 and its performance in experiments in Section 6.\n\n4 Guaranteed Properties of Clustering Algorithms\n\nAn important property of af\ufb01nity clustering is that it produces clusters that are roughly of the same\nsize. This is intuitively correct since at each round of the algorithm, each cluster is merged to at\nleast one other cluster and as a result, the size of even the smallest cluster is at least doubled. In fact\nlinkage based algorithms (and specially single linkage) are often criticized for producing uneven\nclusters; therefore it is tempting to give a theoretical guarantee for the size ratio of the clusters that\naf\ufb01nity produces. Unfortunately, as it is illustrated in Figure 1, we cannot give any worst case bounds\nsince even in one round we may end up having a cluster of size \u2126(n) and another cluster of size\nO(1). 
As the \ufb01rst property, we show that at least in the \ufb01rst round, this does not happen when the\nobservations are randomly distributed. Our empirical results on real world data sets in Section 6.1,\nfurther con\ufb01rm this property for all rounds, and on real data sets.\n\nFigure 1: An example of how af\ufb01nity may produce a large component in one round.\n\nWe start by de\ufb01ning the nearest neighbor graph.\nDe\ufb01nition 1 (Nearest Neighbor Graph). Let S be a set of points in a metric space. The nearest\nneighbor graph of S, denoted by GS, has |S| vertices, each corresponding to an element in S and if\na \u2208 S is the nearest element to b \u2208 S in S, graph GS contains an edge between the corresponding\nvertices of a and b.\n\nAt each round of af\ufb01nity clustering, all the vertices that are in the same connected component of the\nnearest neighbor graph will be merged together5. Thus, it suf\ufb01ces to bound the connected components\u2019\nsize.\nFor a random model of points, consider a Poisson point process X in Rd (d \u2265 1) with density 1.\nIt has two main properties. First, the number of points in any \ufb01nite region of volume V is Poisson\ndistributed with mean V . Second, the number of points in any two disjoint regions are independent\nof each other.\nTheorem 1 (H\u00e4ggstr\u00f6m et al. [38]). For any d \u2265 2, consider the (Euclidean distance) nearest\nneighbor graph G of a realization of a Poisson point process in Rd with density 1. All connected\ncomponents of G are \ufb01nite almost surely.\n\nTheorem 1 implies that the size of the maximum connected component of the points within any \ufb01nite\nregion in Rd is bounded by almost a constant number. 
This is a very surprising result compared to\nthe worst case scenario of having a connected component that contains all the points.\nNote that although the aforementioned bound holds for the \ufb01rst round of af\ufb01nity, after the connected\ncomponents are contracted, we cannot necessarily assume that the new points are Poisson distributed\nand the same argument cannot be used for the rest of the rounds.\nNext we present further properties of af\ufb01nity clustering. Let us begin by introducing the concept of\n\u201ccost\u201d for a clustering solution to be able to compare clustering algorithms.\nDe\ufb01nition 2. The cost of a cluster is the sum of edge lengths (weights) of a minimum Steiner tree\nconnecting all vertices inside the cluster. The cost of a clustering is the sum of the costs of its clusters.\nFinally a non-singleton clustering of a graph is a partition of its vertices into clusters of size at least\ntwo.\n\nEven one round of af\ufb01nity clustering often produces good solutions for several applications. Now we\nare ready to present the following extra property of the result of the \ufb01rst round of af\ufb01nity clustering.\n\n5Depending on the variant of af\ufb01nity that we use, the distance function will be updated.\n\n4\n\n\fTheorem 2. The cost of any non-singleton clustering is at least half of that of the clustering obtained\nafter the \ufb01rst round of af\ufb01nity clustering.\n\nBefore presenting the proof of Theorem 2, we need to demonstrate the concept of disc painting\nintroduced previously in [29, 2, 21, 9, 25]. In this setting, we consider a topological structure of\na graph metric in which each edge is a curve connecting its endpoints whose length is equal to its\nweight. We assume each vertex has its own color. A disc painting is simply a set of disjoint disks\ncentered at terminals (with the same colors of the center vertices). 
A disk of radius r centered at\nvertex v paints all edges (or portions of them) which are within distance r of vertex v with the color of\nv. Thus we paint (portions of) edges by different disks each corresponding to a vertex and each edge\ncan be painted by at most two disks. With this definition of disk painting, we now demonstrate the\nproof of Theorem 2.\nNext we turn our focus to obtaining structural properties for single-linkage clustering. We denote by F_k\nthe set of edges added after k iterations of Kruskal, i.e., when we have n \u2212 k clusters in single-linkage\nclustering. Note that F_k is a forest, i.e., a set of edges with no cycle. First we start with an important\nobservation whose proof comes directly from the description of the single-linkage algorithm.\nProposition 3. Suppose we run single-linkage clustering until we have n \u2212 k clusters. Let d_outside\nbe the minimum distance between any two clusters and d_inside be the maximum distance of any edge\nadded to forest F_k. Then d_outside \u2265 d_inside.\nWe note that Proposition 3 demonstrates the following important property of single-linkage clustering:\nEach vertex of a cluster at any time has a neighbor inside its cluster to which it is closer than to any vertex\noutside of its cluster.\nNext we define another criterion for desirability of a clustering algorithm. This generalizes\nProposition 3.\nDefinition 3. 
An \u03b1-certi\ufb01cate for a clustering algorithm, where \u03b1 \u2265 1, is an assignment of shares to\neach vertex of the graph with the following two properties: (1) The cost of each cluster is at most\n\u03b1 times the sum of shares of vertices inside the cluster; (2) For any set S of vertices containing at\nmost one from each cluster in our solution, the imaginary cluster S costs at least the sum of shares of\nvertices in S.\n\nNote that intuitively the \ufb01rst property guarantees that vertices inside each cluster can pay the cost of\ntheir corresponding cluster and that there is no free-rider. The second property intuitively implies we\ncannot \ufb01nd any better clustering by combining vertices from different clusters in our solution.\nNext we show that there always exists a 2-certi\ufb01cate for single-linkage clustering guaranteeing its\nworst-case performance.\nTheorem 4. Single-linkage always produces a clustering solution that has a 2-certi\ufb01cate.\n\n5 Distributed Algorithms\n\n5.1 Constant Round Algorithm For Dense Graphs\n\nUnsurprisingly, \ufb01nding the af\ufb01nity clustering of a given graph G is closely related to the problem of\n\ufb01nding its Minimum Spanning Tree (MST). In fact, we show the data that is encoded in the MST of G\nis suf\ufb01cient for \ufb01nding its af\ufb01nity clustering (Theorem 9). This property is also known to be true for\nsingle linkage [24]. For MapReduce algorithms this is particularly useful because the MST requires a\nsubstantially smaller space than the original graph and can be stored in one machine. Therefore, once\nwe have the MST, we can obtain af\ufb01nity or single linkage in one round.\nThe main contribution of this section is an algorithm for \ufb01nding the MST (and therefore the af\ufb01nity\nclustering) of dense graphs in constant rounds of MapReduce which improves upon prior known\ndense graph MST algorithms of Karloff et al. [30] and Lattanzi et al. [34].\nTheoretical Model. 
Let N denote the input size. There are a total number of M machines and each\nof them has a space of size S. Both S and M must be substantially sublinear in N. In each round,\nthe machines can run an arbitrary polynomial time algorithm on their local data. No communication\nis allowed during the rounds but any two machines can communicate with each other between the\nrounds as long as the total communication size of each machine does not exceed its memory size.\n\nAlgorithm 1 MST of Dense Graphs\nInput: A weighted graph G\nOutput: The minimum spanning tree of G\nfunction MST(G = (V, E), \u03b5)\n    c \u2190 log_n(m/n)    \u25b7 Since G is assumed to be dense we know c > 0.\n    while |E| > O(n^{1+\u03b5}) do\n        REDUCEEDGES(G, c)\n        c \u2190 (c \u2212 \u03b5)/2\n    Move all the edges to one machine and find MST of G in there.\nfunction REDUCEEDGES(G = (V, E), c)\n    k \u2190 n^{(c\u2212\u03b5)/2}\n    Independently and u.a.r. partition V into k subsets {V_1, . . . , V_k}.\n    Independently and u.a.r. partition V into k subsets {U_1, . . . , U_k}.\n    Let G_{i,j} be a subgraph of G with vertex set V_i \u222a U_j containing any edge (v, u) \u2208 E(G) where v \u2208 V_i and u \u2208 U_j.\n    for any i, j \u2208 {1, . . . , k} do\n        Send all the edges of G_{i,j} to the same machine and find its MST in there.\n        Remove an edge e from E(G), if e \u2208 G_{i,j} and it is not in MST of G_{i,j}.\n\nThis model is called Massively Parallel Communication (MPC) in the literature and is \u201carguably the\nmost popular one\u201d [26] among MapReduce like models.\nTheorem 5. Let G = (V, E) be a graph with n vertices and n^{1+c} edges for any constant c > 0 and\nlet w : E \u21a6 R^+ be its edge weights. 
For any given \u03b5 such that 0 < \u03b5 < c, there exists a randomized\nalgorithm for finding the MST of G that runs in at most \u2308log(c/\u03b5)\u2309 + 1 rounds of MPC where\nevery machine uses a space of size \u00d5(n^{1+\u03b5}) with high probability and the total number of required\nmachines is O(n^{c\u2212\u03b5}).\n\nOur algorithm, therefore, uses only enough total space (\u00d5(n^{1+c})) on all machines to store the input.\nThe following observation is mainly used by Algorithm 1 to iteratively remove the edges that are not\npart of the final MST.\nLemma 6. Let G' = (V', E') be a (not necessarily connected) subgraph of the input graph G. If an\nedge e \u2208 E' is not in the MST of G', then it is not in the MST of G either.\nTo be more specific, we iteratively divide G into its subgraphs, such that each edge of G is in at least\none subgraph. Then, we handle each subgraph in one machine and throw away the edges that are not\nin their MST. We repeat this until there are only O(n^{1+\u03b5}) edges left in G. Then we can handle all\nthese edges in one machine and find the MST of G. Algorithm 1 formalizes this process.\nLemma 7. Algorithm 1 correctly finds the MST of the input graph in \u2308log(c/\u03b5)\u2309 + 1 rounds.\nBy Lemma 6 we know any edge that is removed from G is not part of the MST; therefore it suffices to\nprove the while loop in Algorithm 1 takes \u2308log(c/\u03b5)\u2309 + 1 iterations.\nLemma 8. In Algorithm 1, every machine uses a space of size \u00d5(n^{1+\u03b5}) with high probability.\nThe combination of Lemma 7 and Lemma 8 implies that Algorithm 1 is indeed in MPC and\nTheorem 5 holds. See supplementary material for omitted proofs.\nThe next step is to prove all the information that is required for affinity clustering is indeed contained\nin the MST.\nTheorem 9. 
Let G = (V, E) denote an arbitrary graph, and let G' = (V, E') denote the minimum\nspanning tree of G. Running affinity clustering algorithm on G gives the same clustering of V as\nrunning this algorithm on G'.\nBy combining the MST algorithm given for Theorem 5 and the sufficiency of MST for computing\naffinity clustering (Theorem 9) and single linkage ([24]) we get the following corollary.\nCorollary 10. Let G = (V, E) be a graph with n vertices and n^{1+c} edges for any constant c > 0\nand let w : E \u21a6 R^+ be its edge weights. For any given \u03b5 such that 0 < \u03b5 < c, there exists a\nrandomized algorithm for affinity clustering and single linkage that runs in \u2308log(c/\u03b5)\u2309 + 1 rounds of\nMPC where every machine uses a space of size \u00d5(n^{1+\u03b5}) with high probability and the total number\nof required machines is O(n^{c\u2212\u03b5}).\n\n5.2 Logarithmic Round Algorithm For Sparse Graphs\nConsider a graph G(V, E) on n = |V| vertices, with edge weights w : E \u21a6 R. We assume that the\nedge weights denote distances. (The discussion applies mutatis mutandis to the case where edge\nweights signify similarity.)\nThe algorithm works for a fixed number of synchronous rounds, or until no further progress is made,\nsay, by reaching a single cluster of all vertices. Each round consists of two steps: First, every vertex\npicks its best edge (i.e., that of the minimum weight) at each round; and then the graph is contracted\nalong the selected edges. (See Algorithm 2 in the appendix.)\nFor a connected graph, the algorithm continues until a single cluster of all vertices is obtained. 
The\nsupernodes at different rounds can be thought of as a hierarchical clustering of the vertices.\nWhile the \ufb01rst step of each round has a trivial implementation in MapReduce, the latter might\ntake \u2126(log n) MapReduce rounds to implement, as it is an instance of the connected components\nproblem. Using a DHT was shown to signi\ufb01cantly improve the running time here, by implementing\nthe operation in one round of MapReduce [31]. Basically we have a read-only random-access table\nmapping each vertex to its best neighbor. Repeated lookups in the table allows each vertex to follow\nthe chain of best neighbors until a loop (of length two) is encountered. This assigns a unique name for\neach connected component; then all the vertices in the same component are reduced into a supernode.\nTheorem 11. The af\ufb01nity clustering algorithm runs in O(log n) rounds of MapReduce when we have\naccess to a distributed hash table (DHT). Without the DHT, the algorithm takes O(log2 n) rounds.\n\n6 Experiments\n\n6.1 Quality Analysis\n\nIn this section, we compare well known hierarchical and \ufb02at clustering algorithms, such as k-means,\nsingle linkage, complete linkage and average linkage with different variants of af\ufb01nity clustering,\nsuch as single af\ufb01nity, complete af\ufb01nity and average af\ufb01nity. We run our experiments on several data\nsets from the UCI database [37] and use Euclidean distance6.\nTo evaluate the outputs of these algorithms we use Rand index which is de\ufb01ned as follows.\nDe\ufb01nition 4 (Rand index [40]). Given a set V = {v1, . . . , vn} of n points and two clusterings\nX = {X1, . . . , Xr} and Y = {Y1, . . . , Ys} of V . 
Define the following.\n\n\u2022 a: the number of pairs in V that are in the same cluster in X and in the same cluster in Y.\n\u2022 b: the number of pairs in V that are in different clusters in X and in different clusters in Y.\n\nThe Rand index r(X, Y) is defined to be (a + b)/C(n, 2), where C(n, 2) = n(n \u2212 1)/2 is the number of pairs in V. By having the ground truth clustering T of a\ndata set, we define the Rand index score of a clustering X to be r(X, T).\n\nThe Rand index based scores are in range [0, 1] and a higher number implies a better clustering.\nFor a hierarchical clustering, the level of its corresponding tree with the highest score is used for\nevaluations.\nFigure 2 (a) compares the Rand index score of different clustering algorithms for different data sets.\nWe observe that single affinity generally performs really well and is among the top two algorithms\nfor most of the datasets (all except Glass). Average affinity also seems to perform well and in some\ncases (e.g., for Soybean data set) it produces a very high quality clustering compared to others. To\nsummarize, linkage based algorithms do not seem to be as good as affinity based algorithms but in\nsome cases k-means could be close.\n\n6We consider Iris, Wine, Soybean, Digits and Glass data sets.\n\nFigure 2: Comparison of clustering algorithms based on their Rand index score (a) and clusters size\nratio (b).\n\nTable 1: Statistics about datasets used. (Numbers for ImageGraph are approximate.) 
The fifth column\nshows the relative running time of affinity clustering, and the last column is the speedup obtained by\na ten-fold increase in parallelism.\n\nDataset       # nodes       # edges             max degree   running time   speedup\nLiveJournal   4,846,609     7,861,383,690       444,522      1.0            4.3\nOrkut         3,072,441     42,687,055,644      893,056      2.4            9.2\nFriendster    65,608,366    1,092,793,541,014   2,151,462    54             5.9\nImageGraph    2 \u00d7 10^10     10^12               14000        142            4.1\n\nAnother property of the algorithms that we measure is the clusters\u2019 size ratio. Let X = {X_1, . . . , X_r}\nbe a clustering. We define the size ratio of X to be min_{i,j \u2208 [r]} |X_i|/|X_j|. As it is visualized in Figure 2\n(b), affinity based algorithms have a much higher size ratio (i.e., the clusters are more balanced)\ncompared to linkage based algorithms. This confirms the property that we proved for Poisson\ndistributions in Section 4 for real world data sets. Hence we believe affinity clustering is superior\nto (or at least as good as) the other methods when the dataset under consideration is not extremely\nunbalanced.\n\n6.2 Scalability\n\nHere we demonstrate the scalability of our implementation of affinity clustering. A collection of\npublic and private graphs of varying sizes are studied. These graphs have between 4 million and 20\nbillion vertices and from 4 billion to one trillion edges. The first three graphs in Table 1 are based on\npublic graphs [35]. As most public graphs are unweighted, we use the number of common neighbors\nbetween a pair of nodes as the weight of the edge between them. 
(This is computed for all pairs,\nwhether or not they form an edge in the original graph, and then new edges of weight zero are removed.)\nThe last graph is based on (a subset of) an internal corpus of public images found on the web and\ntheir similarities.\nWe note that we use the \u201cmaximum\u201d spanning tree variant of affinity clustering; here edge weights\ndenote similarity rather than distance.\nWhile we cannot reveal the exact running times and number of machines used in the experiments, we\nreport these quantities in \u201cnormalized form.\u201d We only run one round of affinity clustering (consisting\nof a \u201cFind Best Neighbors\u201d and a \u201cContract Graph\u201d step). Two settings are used in the experiments.\nWe first use W MapReduce workers and D machines for the DHT, and compare this to the case with\n10W MapReduce workers and D machines for the DHT. This ten-fold increase in the number of\nMapReduce workers leads to a four- to ten-fold decrease in the total running time for different datasets.\nEach running time is itself the average over three runs to reduce the effect of external network events.\nTable 1 also shows how the running time changes with the size of the graph. With a modest number\nof MapReduce workers, affinity clustering runs in less than an hour for all the graphs.\n\nReferences\n[1] Margareta Ackerman, Shai Ben-David, and David Loker. Characterization of linkage-based clustering. In\nCOLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 270\u2013281,\n2010.\n\n[2] Ajit Agrawal, Philip N. Klein, and R. Ravi. 
When trees collide: An approximation algorithm for the\ngeneralized Steiner problem on networks. SIAM J. Comput., 24(3):440\u2013456, 1995.\n\n[3] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements.\nIn Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 459\u2013467,\n2012.\n\n[4] Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev. Parallel algorithms for\ngeometric graph problems. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing,\npages 574\u2013583. ACM, 2014.\n\n[5] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable\nk-means++. PVLDB, 5(7):622\u2013633, 2012.\n\n[6] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering\nvia similarity functions. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing,\nVictoria, British Columbia, Canada, May 17-20, 2008, pages 671\u2013680, 2008.\n\n[7] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on\ngeneral communication topologies. In Advances in Neural Information Processing Systems 26: 27th Annual\nConference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8,\n2013, Lake Tahoe, Nevada, United States, pages 1995\u20132003, 2013.\n\n[8] MohammadHossein Bateni, Aditya Bhaskara, Silvio Lattanzi, and Vahab S. Mirrokni. Distributed balanced\nclustering via mapping coresets. In Advances in Neural Information Processing Systems 27: Annual\nConference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec,\nCanada, pages 2591\u20132599, 2014.\n\n[9] MohammadHossein Bateni, Mohammad Taghi Hajiaghayi, and D\u00e1niel Marx. Approximation schemes for\nSteiner forest on planar graphs and graphs of bounded treewidth. J. 
ACM, 58(5):21:1\u201321:37, 2011.\n\n[10] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In\nProceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI symposium on Principles of database systems,\npages 273\u2013284. ACM, 2013.\n\n[11] Otakar Bor\u016fvka. O jist\u00e9m probl\u00e9mu minim\u00e1ln\u00edm. Pr\u00e1ce Moravsk\u00e9 p\u0159\u00edrodov\u011bdeck\u00e9 spole\u010dnosti. Mor.\np\u0159\u00edrodov\u011bdeck\u00e1 spole\u010dnost, 1926.\n\n[12] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows,\nTushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured\ndata. ACM Trans. Comput. Syst., 26(2):4:1\u20134:26, 2008.\n\n[13] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and\nspreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete\nAlgorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 841\u2013854, 2017.\n\n[14] Rajesh Chitnis, Graham Cormode, Hossein Esfandiari, MohammadTaghi Hajiaghayi, Andrew McGregor,\nMorteza Monemizadeh, and Sofya Vorotnikova. Kernelization via sampling with applications to finding\nmatchings and related problems in dynamic graph streams. In Proceedings of the Twenty-Seventh Annual\nACM-SIAM Symposium on Discrete Algorithms, pages 1326\u20131344, 2016.\n\n[15] Rajesh Hemant Chitnis, Graham Cormode, Mohammad Taghi Hajiaghayi, and Morteza Monemizadeh.\nParameterized streaming: Maximal matching and vertex cover. In Proceedings of the Twenty-Sixth Annual\nACM-SIAM Symposium on Discrete Algorithms, pages 1234\u20131251, 2015.\n\n[16] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. 
In Proceedings of the 48th\nAnnual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21,\n2016, pages 118\u2013127, 2016.\n\n[17] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.\nCommunications of the ACM, 51(1):107\u2013113, 2008.\n\n[18] Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using mapreduce. In Proceedings of the\n17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA,\nUSA, August 21-24, 2011, pages 681\u2013689, 2011.\n\n[19] Hossein Esfandiari, Mohammad Taghi Hajiaghayi, Vahid Liaghat, Morteza Monemizadeh, and Krzysztof\nOnak. Streaming algorithms for estimating the matching size in planar graphs and beyond. In Proceedings\nof the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1217\u20131233, 2015.\n\n[20] Assaf Glazer, Omer Weissbrod, Michael Lindenbaum, and Shaul Markovitch. Approximating hierarchical\nmv-sets for hierarchical clustering. In Advances in Neural Information Processing Systems 27: Annual\nConference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec,\nCanada, pages 999\u20131007, 2014.\n\n[21] Michel X. Goemans and David P. Williamson. A general approximation technique for constrained forest\nproblems. SIAM J. Comput., 24(2):296\u2013317, 1995.\n\n[22] Jacob Goldberger and Sam T. Roweis. Hierarchical clustering of a mixture model. In Advances in Neural\nInformation Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13-18,\n2004, Vancouver, British Columbia, Canada], pages 505\u2013512, 2004.\n\n[23] Michael T Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, searching, and simulation in the\nmapreduce framework. In International Symposium on Algorithms and Computation, pages 374\u2013383.\nSpringer, 2011.\n\n[24] John C Gower and GJS Ross. Minimum spanning trees and single linkage cluster analysis. 
Applied\nstatistics, pages 54\u201364, 1969.\n\n[25] Mohammad Taghi Hajiaghayi, Vahid Liaghat, and Debmalya Panigrahi. Online node-weighted Steiner\nforest and extensions via disk paintings. In 54th Annual IEEE Symposium on Foundations of Computer\nScience, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 558\u2013567, 2013.\n\n[26] Sungjin Im, Benjamin Moseley, and Xiaorui Sun. Efficient massively parallel methods for dynamic\nprogramming. In Proceedings of the 49th Annual ACM Symposium on Theory of Computing. ACM, 2017.\n\n[27] Chen Jin, Ruoqian Liu, Zhengzhang Chen, William Hendrix, Ankit Agrawal, and Alok N. Choudhary. A\nscalable hierarchical clustering algorithm using spark. In First IEEE International Conference on Big Data\nComputing Service and Applications, BigDataService 2015, Redwood City, CA, USA, March 30 - April 2,\n2015, pages 418\u2013426, 2015.\n\n[28] Chen Jin, Md Mostofa Ali Patwary, Ankit Agrawal, William Hendrix, Wei-keng Liao, and Alok Choudhary.\nDisc: A distributed single-linkage hierarchical clustering algorithm using mapreduce. In Proceedings of\nthe 4th International SC Workshop on Data Intensive Computing in the Clouds, 2013.\n\n[29] Michael J\u00fcnger and William R. Pulleyblank. New primal and dual matching heuristics. Algorithmica,\n13(4):357\u2013386, 1995.\n\n[30] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for mapreduce. In\nProceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 938\u2013948.\nSociety for Industrial and Applied Mathematics, 2010.\n\n[31] Raimondas Kiveris, Silvio Lattanzi, Vahab S. Mirrokni, Vibhor Rastogi, and Sergei Vassilvitskii. Connected\ncomponents in MapReduce and beyond. In Proceedings of the ACM Symposium on Cloud Computing,\nSeattle, WA, USA, November 03 - 05, 2014, pages 18:1\u201318:13, 2014.\n\n[32] Akshay Krishnamurthy, Sivaraman Balakrishnan, Min Xu, and Aarti Singh. 
Efficient active algorithms for\nhierarchical clustering. In Proceedings of the 29th International Conference on Machine Learning, ICML\n2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.\n\n[33] Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. In\nProceedings of the American Mathematical Society, volume 7, pages 48\u201350, 1956.\n\n[34] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a method for solving\ngraph problems in mapreduce. In Proceedings of the twenty-third annual ACM symposium on Parallelism\nin algorithms and architectures, pages 85\u201394. ACM, 2011.\n\n[35] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection.\nhttp://snap.stanford.edu/data, June 2014.\n\n[36] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge\nUniversity Press, 2014.\n\n[37] Moshe Lichman. UCI machine learning repository, 2013.\n\n[38] Ronald Meester et al. Nearest neighbor and hard sphere models in continuum percolation. Random\nstructures and algorithms, 9(3):295\u2013315, 1996.\n\n[39] Mark Newman. Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA, 2010.\n\n[40] William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American\nStatistical association, 66(336):846\u2013850, 1971.\n\n[41] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Advances in Neural\nInformation Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016,\nDecember 5-10, 2016, Barcelona, Spain, pages 2316\u20132324, 2016.\n\n[42] M. Sollin. Le trac\u00e9 de canalisation. Programming, Games, and Transportation Networks (in French), 1965.\n\n[43] Tom White. Hadoop: The Definitive Guide. O\u2019Reilly Media, Inc., 2012.\n\n[44] Reza Zadeh and Shai Ben-David. 
A uniqueness theorem for clustering. In UAI 2009, Proceedings of the\nTwenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21,\n2009, pages 639\u2013646, 2009.\n\n[45] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster\ncomputing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud\nComputing, pages 10\u201310, 2010.", "award": [], "sourceid": 3447, "authors": [{"given_name": "Mohammadhossein", "family_name": "Bateni", "institution": "Google research"}, {"given_name": "Soheil", "family_name": "Behnezhad", "institution": "University of Maryland"}, {"given_name": "Mahsa", "family_name": "Derakhshan", "institution": "University of Maryland"}, {"given_name": "MohammadTaghi", "family_name": "Hajiaghayi", "institution": "University of Maryland"}, {"given_name": "Raimondas", "family_name": "Kiveris", "institution": "Google research"}, {"given_name": "Silvio", "family_name": "Lattanzi", "institution": "Google Research"}, {"given_name": "Vahab", "family_name": "Mirrokni", "institution": "Google Research NYC"}]}