{"title": "Distributed Balanced Clustering via Mapping Coresets", "book": "Advances in Neural Information Processing Systems", "page_first": 2591, "page_last": 2599, "abstract": "Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. For many applications, we face explicit or implicit size constraints for each cluster which leads to the problem of clustering under capacity constraints or the ``balanced clustering'' problem. Although the balanced clustering problem has been widely studied, developing a theoretically sound distributed algorithm remains an open problem. In the present paper we develop a general framework based on ``mapping coresets'' to tackle this issue. For a wide range of clustering objective functions such as k-center, k-median, and k-means, our techniques give distributed algorithms for balanced clustering that match the best known single machine approximation ratios.", "full_text": "Distributed Balanced Clustering via Mapping\n\nCoresets\n\nMohammadHossein Bateni\n\nGoogle NYC\n\nbateni@google.com\n\nAditya Bhaskara\n\nGoogle NYC\n\nbhaskaraaditya@google.com\n\nSilvio Lattanzi\nGoogle NYC\n\nsilviol@google.com\n\nVahab Mirrokni\n\nGoogle NYC\n\nmirrokni@google.com\n\nAbstract\n\nLarge-scale clustering of data points in metric spaces is an important problem in\nmining big data sets. For many applications, we face explicit or implicit size con-\nstraints for each cluster which leads to the problem of clustering under capacity\nconstraints or the \u201cbalanced clustering\u201d problem. Although the balanced cluster-\ning problem has been widely studied, developing a theoretically sound distributed\nalgorithm remains an open problem. In this paper we develop a new framework\nbased on \u201cmapping coresets\u201d to tackle this issue. 
Our technique yields the first distributed approximation algorithms for balanced clustering problems for a wide range of clustering objective functions such as k-center, k-median, and k-means.

1 Introduction

Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. Many variants of such clustering problems have been studied, spanning, for instance, a wide range of ℓp objective functions including the k-means, k-median, and k-center problems. Motivated by a variety of big data applications, distributed clustering has attracted significant attention in the literature [11, 4, 5]. In many of these applications, an explicit or implicit size constraint is imposed for each cluster; e.g., if we cluster the points such that each cluster fits on one machine, the size constraint is enforced by the storage constraint on each machine. We refer to this as balanced clustering. In the setting of network location problems, these are referred to as capacitated clustering problems [6, 16, 17, 10, 3]. The distributed balanced clustering problem is also well studied, and several distributed algorithms have been developed for it in the context of large-scale graph partitioning [21, 20].¹ Despite this extensive literature, none of the distributed algorithms developed for the balanced version of the problem have theoretical approximation guarantees. The present work presents the first such distributed algorithms for a wide range of balanced clustering problems with provable approximation guarantees. To achieve this goal, we develop a new technique based on mapping coresets.

A coreset for a set of points in a metric space is a subset of these points with the property that an approximate solution to the whole point set can be obtained given the coreset alone. 
An augmented concept for coresets is the notion of composable coresets, which have the following property: for a collection of sets, an approximate solution to the union of the sets in the collection can be obtained given the union of the composable coresets for the point sets in the collection. This notion was formally defined in a recent paper by Indyk et al. [14].

¹ A main difference between the balanced graph partitioning problems and the balanced clustering problems considered here is that in graph partitioning a main objective function is to minimize the cut function.

MapReduce model

Problem                          Approximation   Rounds
L-balanced k-center              O(1)            O(1)
k-clustering in ℓp               O(p)            O(1)
L-balanced k-clustering in ℓp    (O(p), 2)       O(1)

Streaming model

Problem                          Approximation   Passes
L-balanced k-center              O(1)            O(1)
k-clustering in ℓp               O(p)            O(1)
L-balanced k-clustering in ℓp    (O(p), 2)       O(1)

Table 1: Our contributions. All results hold for k < n^{1/2−ε}, for constant ε > 0. Note that for the general L-balanced k-clustering (p) problem we obtain a bicriteria approximation (we may open up to 2k centers in our solutions).

In this paper, we augment the notion of composable coresets further and introduce the concept of mapping coresets. A mapping coreset is a coreset with an additional mapping of points in the original space to points in the coreset. As we will see, this helps us solve balanced clustering problems for a wide range of objective functions and a variety of massive data processing applications, including streaming algorithms and MapReduce computations. Roughly speaking, this is how a mapping coreset is used to develop a distributed algorithm for balanced clustering problems: we first partition the data set into several blocks in a specific manner. We then compute a coreset for each block. 
In addition, we compute a mapping of points in the original space to points in the coreset. Finally, we collect all these coresets and solve the clustering problem for the union of the coresets. We can then use the (inverse) map to get back a clustering for the original points.

Our Contributions. In this paper, we introduce a framework for solving distributed clustering problems. Using the concept of mapping coresets as described above, our framework applies to balanced clustering problems, which are much harder than their unrestricted counterparts in terms of approximation.

The rough template of our results is the following: given a single-machine α-approximation algorithm for a clustering problem (with or without balance constraints), we give a distributed algorithm for the problem with an O(α) approximation guarantee. Our results also imply streaming algorithms for such clustering problems, using sublinear memory and a constant number of passes. More precisely, we consider balanced clustering problems with an ℓp objective. For specific choices of p, this captures the commonly used k-center, k-median, and k-means objectives. Our results are also very robust: for instance, bicriteria approximations (violating either the number of clusters or the cluster sizes) on a single machine can be used to give distributed bicriteria approximation algorithms, with a constant loss in the cost. This is particularly important for balanced versions of k-median and k-means, for which we know constant-factor approximations to the cost only if we allow violating one of the constraints. 
(Moreover, a mild violation might not be terribly bad in certain applications, as long as we obtain small cost.)

Finally, beyond presenting the first distributed approximations for balanced clustering, our general framework also implies constant-factor distributed approximations for a general class of uncapacitated clustering problems (for which we are not aware of distributed algorithms with formal guarantees). We summarize our new results in Table 1.

Related Work. The notion of coresets was introduced in [2]. In this paper, we use the term coresets to refer to an augmented notion of coresets, referred to as “composable coresets” [14]. The notion of (composable) coresets is also related to the concept of mergeable summaries that has been studied in the literature [1]. The main difference between the two is that aggregating mergeable summaries does not increase the approximation error, while in the case of coresets the error amplifies. The idea of using coresets has been applied, either explicitly or implicitly, in the streaming model [12, 2] and in the MapReduce framework [15, 18, 5, 14]. However, none of the previous work applies these ideas to balanced clustering problems.

There has been a lot of work on designing efficient distributed algorithms for clustering problems in metric spaces. A formal computation model for the MapReduce framework was introduced by Karloff et al. [15]. The first paper to study clustering problems in this model is by Ene et al. [11], where the authors prove that one can use an α-approximation algorithm for the k-center or k-median problem to obtain a (4α + 2)- or a (10α + 3)-approximation, respectively, in the MapReduce model. Subsequently, Bahmani et al. [4] showed how to implement k-means++ efficiently in the MapReduce model. Finally, very recently, Balcan et al. 
[5] demonstrated how one can use an α-approximation algorithm for the k-means or k-median problem to obtain coresets in the distributed (and MapReduce) setting. They, however, do not consider balanced clustering problems or the general class of clustering problems with ℓp objective functions.

The literature on clustering in the streaming model is also very rich. The first paper we are aware of is due to Charikar et al. [7], who study the k-center problem in the classic streaming setting. Subsequently, Guha et al. [12] gave the first single-pass constant-factor approximation algorithm for the k-median problem. Following up on this, the memory requirements and the approximation factors of their result were further improved by Charikar et al. in [8].

Finally, capacitated (or balanced) clustering is well studied in approximation algorithms [6, 16, 9], with constant factors known in some cases and only bicriteria approximations in others. Our results may be interpreted as saying that the capacity constraints may be a barrier to approximation, but are not a barrier to parallelizability. This is the reason our approximation guarantees are bicriteria.

2 Preliminaries

In all the problems we study, we denote by (V, d) the metric space we are working with. We write n = |V| for the number of points in V, and d_{uv} as shorthand for d(u, v). Given points u, v, we assume oracle access to d_{uv} (or that we can compute it, as in geometric settings). Formally, a clustering C of a set of points V is a collection of sets C_1, C_2, ..., C_r which partition V. 
Each cluster C_i has a center v_i, and we define the ‘ℓ_p cost’ of this clustering as

cost_p(C) := \left( \sum_i \sum_{v \in C_i} d(v, v_i)^p \right)^{1/p}.   (1)

When p is clear from the context, we simply refer to this quantity as the cost of the clustering and denote it cost(C). Let us now define the L-balanced k-clustering problem with ℓ_p cost.

Definition 1 (L-balanced k-clustering (p)). Given (V, d) and a size bound L, find a clustering C of V which has at most k clusters, at most L points in each cluster, and cluster centers v_1, ..., v_k, so as to minimize cost_p(C), the ℓ_p cost defined in Eq. (1).

The case p = 1 is the capacitated k-median problem, and the case p = ∞ is also known as the capacitated k-center problem (with uniform capacities).

Definition 2 (Mapping and mapping cost). Given a multiset S and a set V, we call a bijective function f : V → S a mapping from V to S, and we define the cost of a mapping as \sum_{v \in V} d(v, f(v))^p.

Definition 3 (Clustering and optimal solution). Given a clustering problem P with an ℓ_p objective, we define OPT_P as the cost of the optimal solution to P.

3 Mapping coreset framework

The main idea behind our distributed framework is a new family of coresets that help in dealing with balanced clustering.

Definition 4 (δ-mapping coreset). Given a set of points V, a δ-mapping coreset for a clustering problem P consists of a multiset S with elements from V, and a mapping from V to S such that the total cost of the mapping is upper bounded by δ · OPT_P^p. We define the size of a δ-mapping coreset as the number of distinct elements in S.

Note that our definition does not prescribe the size of the mapping coreset; this can be a parameter we choose. 
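For concreteness, the ℓ_p cost of Eq. (1) can be computed directly from a clustering. The sketch below is ours (an illustration, not from the paper); it also treats p = ∞ in the k-center sense, i.e., as the maximum point-to-center distance.

```python
import math

def cost_p(clusters, centers, d, p):
    """l_p cost of a clustering: (sum_i sum_{v in C_i} d(v, v_i)^p)^(1/p).

    clusters: list of lists of points; centers: one center per cluster;
    d: metric function; p: exponent (math.inf gives the k-center objective).
    """
    if math.isinf(p):
        # p = infinity: the cost degenerates to the maximum cluster radius.
        return max(d(v, c) for C, c in zip(clusters, centers) for v in C)
    total = sum(d(v, c) ** p for C, c in zip(clusters, centers) for v in C)
    return total ** (1.0 / p)

# Tiny example on the real line, with d(u, v) = |u - v|.
dist = lambda u, v: abs(u - v)
clusters = [[0.0, 1.0], [10.0, 12.0]]
centers = [0.0, 10.0]
print(cost_p(clusters, centers, dist, 1))         # k-median-style cost: 0+1+0+2 = 3.0
print(cost_p(clusters, centers, dist, math.inf))  # k-center-style cost: 2.0
```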
We now define the composability of coresets.

Definition 5 (Composable δ-mapping coreset). Given disjoint sets of points V_1, V_2, ..., V_m and corresponding δ-mapping coresets S_1, S_2, ..., S_m, the coresets are said to be composable if ∪_i S_i is a 2^p δ-mapping coreset for ∪_i V_i (the overall map is the union of those for V_1, ..., V_m).

Remark. The non-trivial aspect of showing that coresets compose comes from the fact that we compare the cost of the mapping to the cost of OPT_P on the union of the V_i (which we need to show is not too small). Our main theorem is now the following.

Theorem 1. Let V be a set of points and suppose L, k, p ≥ 1 are parameters. Then for any U ⊆ V, there exists an algorithm that takes U as input and produces a 2^p-mapping coreset for the L-balanced k-clustering (p) problem for U. The size of this coreset is Õ(k),² and the algorithm uses space that is quadratic in |U|. Furthermore, for any partition V_1, V_2, ..., V_r of V, the mapping coresets produced by the algorithm on V_1, V_2, ..., V_r compose.

3.1 Clustering via δ-mapping coresets

The theorem implies a simple general framework for distributed clustering:

1. Split the input into m chunks arbitrarily (such that each chunk fits on a machine), and compute a (composable) 2^p-mapping coreset for each of the chunks. For each point in the coreset, assign a multiplicity equal to the number of points mapped to it (including itself).

2. Gather all the coresets (and the multiplicities of their points) on one machine, and compute a k-clustering of this multiset.

3. Once clusters for the points (and their copies) are found, we can ‘map back’ and find a clustering of the original points.

The idea is that in each chunk the size of the coreset is small, so the union of the coresets is small (and hence fits on one machine). 
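The three steps above can be sketched end to end. The toy code below is ours, not the paper's implementation: a greedy farthest-point traversal (the p = ∞ option mentioned in Section 4) stands in both for the per-chunk coreset selection and for the single-machine clustering algorithm, and all function names are our own.

```python
def farthest_point_centers(points, k, d):
    """Greedy farthest-point traversal: the classic 2-approximation for k-center,
    used here as a stand-in coreset selector and single-machine solver."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda v: min(d(v, c) for c in centers)))
    return centers

def mapping_coreset(points, k, d):
    """Step 1: select coreset points, and map each point to its nearest one.
    Multiplicities are implicit in how many points map to each coreset point."""
    centers = farthest_point_centers(points, k, d)
    return {v: min(centers, key=lambda c: d(v, c)) for v in points}

def distributed_clustering(chunks, k, d):
    # Step 1: per-chunk mapping coresets (in a real system, one per machine).
    maps = [mapping_coreset(chunk, k, d) for chunk in chunks]
    # Step 2: gather the union of the coresets and cluster it on one machine.
    union = sorted({s for m in maps for s in m.values()})
    centers = farthest_point_centers(union, k, d)
    assign = {s: min(centers, key=lambda c: d(s, c)) for s in union}
    # Step 3: map back -- each original point joins its coreset point's cluster.
    return {v: assign[m[v]] for m in maps for v in m}

d = lambda u, v: abs(u - v)
chunks = [[0.0, 1.0, 2.0, 50.0], [51.0, 52.0, 100.0, 101.0]]
clustering = distributed_clustering(chunks, k=3, d=d)
# Every input point ends up assigned to one of at most k centers.
```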
The second step requires care: the clustering algorithm should work when the points have associated multiplicities, and it should use limited memory. This is captured as follows.

Definition 6 (Space-efficient algorithm). Given an instance (V, d) of a k-clustering problem in which V has N distinct points, each with some multiplicity, a sequential α-approximation algorithm is called space-efficient if the space used by the algorithm is O(N² · poly(k)).

The framework itself is very natural; the key steps are thus finding mapping coresets that (a) have small mapping cost and (b) compose well under arbitrary partitions of the input, and finding space-efficient algorithms. Sections 4 and 5 give the details of these two steps. Further, because the framework is general, we can apply many ‘levels’ of it. This is illustrated below in Section 3.2.

To prove the correctness of the framework, we also need to prove that moving from the original points in a chunk to a coreset with multiplicities (as described in step 1) does not cost us too much in the approximation. We prove this using a general theorem:

Theorem 2. Let f : V → S be a bijection. Let C be any clustering of V, and let C′ denote the clustering of S obtained by applying the bijection f to the clustering C. Then there exists a choice of centers for C′ such that cost(C′)^p ≤ 2^{2p−1}(cost(C)^p + µ_total), where µ_total denotes \sum_{v \in V} d(v, f(v))^p.

In our case, if we consider the set of points in the coreset with multiplicities, the mapping gives a bijection; thus the above theorem applies in showing that the cost of clustering is not much more than the ‘mapping cost’ given by the bijection. The theorem can also be used in the opposite direction, as will be crucial in obtaining an approximation guarantee.

Preserving the balanced property. The above theorem allows us to move back and forth (algorithmically) between clusterings of V and of S (the coreset with multiplicities) as long as there is a small-cost mapping. Furthermore, since f is a bijection, we have the property that if the clustering was balanced in V, the corresponding one in S is balanced as well, and vice versa.

Putting things together. Let us now see how to use the theorems to obtain approximation guarantees. Suppose we have a mapping f from V to the union of the coresets of the chunks (called S, which is a multiset), with total mapping cost µ_total. Suppose also that we have a space-efficient α-approximation algorithm for clustering S. By Theorem 2, there exists a clustering of S whose cost, raised to the p-th power, is at most 2^{2p−1}(cost(C)^p + µ_total). This means that the approximation algorithm on S gives a clustering of cost (to the p-th power) at most 2^{2p−1} α^p (cost(C)^p + µ_total). Finally, using Theorem 2 in the opposite direction, we can map the clusters back from S to V and get an upper bound on the clustering cost (to the p-th power) of 2^{2p−1}(2^{2p−1} α^p (cost(C)^p + µ_total) + µ_total). But by Theorem 1, for the f in our algorithm we have µ_total ≤ 2^p cost(C)^p. Plugging this into the bound above, after some manipulation (and taking p-th roots), we obtain that the cost of the final clustering is at most 32 α cost(C). The details of this calculation can be found in the supplementary material.

² Here and elsewhere below, Õ(·) is used to hide a logarithmic factor.

Remark. The approximation ratio (i.e., 32α) seems quite pessimistic. 
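The constant 32 can be traced through the chain of inequalities above; the following is our reconstruction of the calculation deferred to the supplementary material, using µ_total ≤ 2^p cost(C)^p and α ≥ 1:

```latex
\begin{align*}
\mathrm{cost}_{\mathrm{final}}^p
  &\le 2^{2p-1}\bigl(2^{2p-1}\alpha^p(\mathrm{cost}(C)^p + \mu_{\mathrm{total}}) + \mu_{\mathrm{total}}\bigr) \\
  &\le 2^{2p-1}\bigl(2^{2p-1}\alpha^p(1 + 2^p)\,\mathrm{cost}(C)^p + 2^p\,\mathrm{cost}(C)^p\bigr)
      && (\mu_{\mathrm{total}} \le 2^p\,\mathrm{cost}(C)^p) \\
  &\le 2^{2p-1}\cdot 2^{3p+1}\,\alpha^p\,\mathrm{cost}(C)^p
      && (1 + 2^p \le 2^{p+1},\ 2^p \le 2^{3p}\alpha^p) \\
  &= 2^{5p}\,\alpha^p\,\mathrm{cost}(C)^p,
\end{align*}
```

so taking p-th roots gives cost_final ≤ 2^5 α cost(C) = 32 α cost(C).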
In our experiments, we have observed (if we randomly partition the points initially) that the constants are much better (often at most 1.5). The slack in our analysis arises mainly from Theorem 2, for which the worst case of the analysis is very unlikely to occur in practice.

3.2 Mapping Coresets for Clustering in MapReduce

The above distributed algorithm can be placed in the formal model for MapReduce introduced by Karloff et al. [15]. The model has two main restrictions, one on the total number of machines and another on the memory available on each machine. In particular, given an input of size N and a sufficiently small γ > 0, the model allows N^{1−γ} machines, each with N^{1−γ} memory available for the computation. As a result, the total amount of memory available to the entire system is O(N^{2−2γ}). In each round, a computation is executed on each machine in parallel, and then the outputs of the computations are shuffled between the machines.

In this model, the efficiency of an algorithm is measured by the number of ‘rounds’ of MapReduce in the algorithm. A class of algorithms of particular interest are the ones that run in a constant number of rounds; this class is denoted MRC⁰.

The high-level idea is to use the coreset construction and a sequential space-efficient α-approximation algorithm (as outlined above). Unfortunately, this approach does not work as such in the MapReduce model, because both the coreset construction algorithm and the space-efficient algorithm require memory quadratic in the size of their input. Therefore we perform multiple ‘levels’ of our framework. Given an instance (V, d), the MapReduce algorithm proceeds as follows:

1. Partition the points arbitrarily into 2n^{(1+γ)/2} sets.
2. Compute a composable 2^p-mapping coreset on each of the machines (in parallel) to obtain f and the multisets S_1, S_2, ..., S_{2n^{(1+γ)/2}}, each with roughly Õ(k) distinct points.
3. Partition the computed coresets again into n^{1/4} sets.
4. Compute composable 2^p-mapping coresets on each of the machines (in parallel) to obtain f′ and the multisets S′_1, S′_2, ..., S′_{n^{1/4}}, each with Õ(k) distinct points.
5. Merge S′_1, S′_2, ..., S′_{n^{1/4}} on a single machine and compute a clustering using the sequential space-efficient α-approximation algorithm.
6. Map back the points in S′_1, S′_2, ..., S′_{n^{1/4}} to the points in S_1, S_2, ..., S_{2n^{(1+γ)/2}} using the function f′^{−1}, obtaining a clustering of the points in S_1, S_2, ..., S_{2n^{(1+γ)/2}}.
7. Map back the points in S_1, S_2, ..., S_{2n^{(1+γ)/2}} to the points in V using the function f^{−1}, thus obtaining a clustering of the initial set of points.

Note that if k < n^{1/4−ε}, for constant ε > γ, at every step of the MapReduce computation the input size on each machine is bounded by n^{(1−γ)/2}, and thus we can run our coreset reduction and a space-efficient algorithm (in which we think of the poly(k) as constant; otherwise we need minor modifications). Furthermore, if n^{1/4−ε} ≤ k < n^{(1−ε)/2}, for constant ε > γ, we can exploit the trade-off between the number of rounds and the approximation factor to get a similar result (refer to the supplement for details).

Figure 1: We split the input into m parts, compute mapping coresets for each part, and aggregate them. We then compute a solution to this aggregate and map the clustering back to the input.

We are now ready to state our main theorem in the MapReduce framework:

Theorem 3. 
Given an instance (V, d) of a k-clustering problem, with |V| = n, and a sequential space-efficient α-approximation algorithm for the (L-balanced) k-clustering (p) problem, there exists a MapReduce algorithm that runs in O(1) rounds and obtains an O(α) approximation for the (L-balanced) k-clustering (p) problem, for L, p ≥ 1 and 0 < k < n^{(1−ε)/2} (constant ε > 0).

The previous theorem, combined with the results of Section 5, gives us the results presented in Table 1. Furthermore, it is possible to extend this approach to obtain streaming algorithms via the same techniques. We defer the details to the supplementary material.

4 Coresets and Analysis

We now come to the proof of our main result, Theorem 1. We give an algorithm to construct coresets, and then show that coresets constructed this way compose.

Constructing composable coresets. Suppose we are given a set of points V. We first show how to select a set of points S that are close to each vertex in V, and use this set as a coreset with a good mapping f. The selection of S uses a modification of the algorithm of Lin and Vitter [19] for k-median. We remark that any approximation algorithm for k-median with an ℓ_p objective can be used in place of the linear program (as we did in our experiments for p = ∞, where a greedy farthest-point traversal can be used). Consider a solution (x, y) to the following linear programming (LP) relaxation:

min  \sum_u \sum_v d(u, v)^p x_{uv}
subject to
  \sum_v x_{uv} = 1       for all u        (every u assigned to a center)
  x_{uv} ≤ y_v            for all u, v     (assigned only to centers)
  \sum_u y_u ≤ k                           (at most k centers)
  0 ≤ x_{uv}, y_u ≤ 1     for all u, v.

In the above algorithms, we can always treat p as at most log n; in particular, the case p = ∞ can be handled as p = log n. 
This introduces only negligible error in our computations but makes them tractable. More specifically, when working with p = log n, the power operations do not increase the size of the input by more than a factor of log n.

Rounding. We perform a simple randomized rounding with weights scaled up by O(log n): round each y_u to 1 with probability min{1, y_u (4 log n)/ε}. Let us denote this probability by y′_u, and the set of “centers” thus obtained by S. We prove the following (proof in the supplement).

Lemma 4. With probability 1 − 1/n, the set S of selected centers satisfies the following properties.
1. Each vertex has a relatively close selected center. In particular, for every u ∈ V, there is a center opened at distance at most [(1 + ε) \sum_v d(u, v)^p x_{uv}]^{1/p}.
2. Not too many centers are selected; i.e., |S| < 8k log n / ε.

Mapping and multiplicity. Once we have a set S of centers, we map every v ∈ V to the center closest to it, i.e., f(v) = argmin_{s ∈ S} d(v, s). If m_s points in V are mapped to some s ∈ S, we set its multiplicity to m_s. This defines a bijection from V to the resulting multiset.

Composability of the coresets. We now come to the crucial step: the proof of composability for the mapping coresets constructed above, i.e., the ‘furthermore’ part of Theorem 1. To show this, we consider any vertex sets V_1, V_2, ..., V_m, and mapping coresets S_1, S_2, ..., S_m obtained by the rounding algorithm above. We have to prove that the total moving cost is at most (1 + ε) 2^p OPT_P, where the optimum value is for the instance ∪_i V_i. We denote by LP(V_i) the optimum value of the linear program above when the set of points involved is V_i. Finally, we write µ_v := d(v, f(v))^p and µ_total := \sum_{v ∈ V} µ_v. We now have:

Lemma 5. 
Let LP_i denote the objective value of the optimum solution to LP(V_i), i.e., the LP relaxation written above when only the vertices in V_i are considered. Then we have

µ_total ≤ (1 + ε) \sum_i LP_i.

The proof follows directly from Lemma 4 and the definition of f. The next lemma is crucial: it shows that LP(V) cannot be too small. The proof is deferred to the supplement.

Lemma 6. In the notation above, we have \sum_i LP_i ≤ 2^p · LP(V).

The two lemmas imply that the total mapping cost is at most (1 + ε) 2^p OPT_P, because LP(V) is clearly at most OPT_P. This completes the proof of Theorem 1.

5 Space-efficient algorithms on a single machine

Our framework ultimately reduces distributed computation to a sequential computation on a compressed instance. For this, we need to adapt the known algorithms for balanced k-clustering to handle compressed instances. We now give a high-level overview and defer the details to the supplementary material.

For balanced k-center, we modify the linear programming (LP) based algorithm of [16], and its analysis, to deal with compressed instances. This involves the following trick: if we have a compressed instance with N points, since there are only k centers to open, at most k “copies” of each point are candidate centers. We believe this trick can be applied more generally to LP-based algorithms.

For balanced k-clustering with other ℓ_p objectives (even p = 1), it is not known how to obtain constant-factor approximation algorithms (even without the space-efficiency restriction). Thus we consider bicriteria approximations, in which we respect the cluster size constraints but open up to 2k clusters. This can be done for all ℓ_p objectives as follows: first solve the problem approximately without enforcing the balance constraint, then post-process the clusters obtained. If a cluster contains n_i points for n_i > L, subdivide it into ⌈n_i/L⌉ clusters. The division should be done carefully (see the supplement).

The post-processing step involves only the counts of the vertices in different clusters, and hence can be done in a space-efficient manner. Thus the crucial part is to find the ‘unconstrained’ k-clustering in a space-efficient way. For this, the typical algorithms are either based on local search (e.g., due to [13]) or based on rounding linear programs. The former can easily be seen to be space-efficient (we only need to keep track of the number of centers picked at each location). The latter can be made space-efficient using the same trick we used for k-center.

Graph   Relative size of sequential instance   Relative increase in radius
US      0.33%                                  +52%
World   0.1%                                   +58%

Table 2: Quality degradation due to the two-round approach.

Figure 2: Scalability of parallel implementation.

6 Empirical study

In order to gauge its practicality, we implement our algorithm. We are interested in measuring its scalability, in addition to the effect of having several rounds on the quality of the solution. In particular, we compare the quality of the solution (i.e., the maximum radius, from the k-center objective) produced by the parallel implementation to that of the sequential one-machine implementation of the farthest-seed heuristic. In some sense, our algorithm is a parallel implementation of this algorithm. However, the instance is too big for the sequential algorithm to be feasible; as a result, we run the sequential algorithm on a small sample of the instance, hence a potentially easier instance.

Our experiments use two instances to test this effect: the larger instance is the world graph, with hundreds of millions of nodes, and the smaller one is the graph of US road networks, with tens of millions of nodes. 
Each node has coordinate locations, which we use to compute great-circle distances, i.e., the shortest distance between two points along the surface of the earth. We always look for 1000 clusters, and we run our parallel algorithms on a few hundred machines.

Table 2 shows that the quality of the solution does not degrade substantially if we use the two-round algorithm, which is better suited to parallel implementation. The last column shows the increase in the maximum radius of the clusters due to computing the k-centers in two rounds, as described in the paper. Note that the radius-increase numbers quoted in the table are upper bounds, since the sequential algorithm could only be run on a simpler instance; in reality, the quality reduction may be even smaller. In the case of the US Graph, the sequential algorithm was run on a random 1/300 subset of the actual graph, whereas a random 1/1000 subset was used for the World Graph.

We next investigate how the running time of our algorithm scales with the size of the instance. We focus on the bigger instance (the World Graph) and once again take random samples of different sizes (10% up to 100%). This yields varying instance sizes without changing the structure of the problem significantly, and is well suited to measuring scalability. Figure 2 shows that the increase in running time is sublinear; in particular, a ten-fold increase in instance size leads to only a factor 3.6 increase in running time.

References

[1] P. K. AGARWAL, G. CORMODE, Z. HUANG, J. PHILLIPS, Z. WEI, AND K. YI, Mergeable summaries, in Proceedings of the 31st Symposium on Principles of Database Systems, ACM, 2012, pp. 23–34.

[2] P. K. AGARWAL, S. HAR-PELED, AND K. R. VARADARAJAN, Approximating extent measures of points, Journal of the ACM (JACM), 51 (2004), pp. 606–635.

[3] H.-C. AN, A. BHASKARA, AND O. SVENSSON, Centrality of trees for capacitated k-center, CoRR, abs/1304.2983 (2013).

[4] B. BAHMANI, B. MOSELEY, A. 
VATTANI, R. KUMAR, AND S. VASSILVITSKII, Scalable k-means++, PVLDB, 5 (2012), pp. 622–633.

[5] M.-F. BALCAN, S. EHRLICH, AND Y. LIANG, Distributed clustering on graphs, in NIPS, 2013, to appear.

[6] J. BAR-ILAN, G. KORTSARZ, AND D. PELEG, How to allocate network centers, J. Algorithms, 15 (1993), pp. 385–415.

[7] M. CHARIKAR, C. CHEKURI, T. FEDER, AND R. MOTWANI, Incremental clustering and dynamic information retrieval, in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, STOC '97, New York, NY, USA, 1997, ACM, pp. 626–635.

[8] M. CHARIKAR, L. O'CALLAGHAN, AND R. PANIGRAHY, Better streaming algorithms for clustering problems, in Proc. of the 35th ACM Symposium on Theory of Computing (STOC), 2003, pp. 30–39.

[9] J. CHUZHOY AND Y. RABANI, Approximating k-median with non-uniform capacities, in SODA, 2005, pp. 952–958.

[10] M. CYGAN, M. HAJIAGHAYI, AND S. KHULLER, LP rounding for k-centers with non-uniform hard capacities, in FOCS, 2012, pp. 273–282.

[11] A. ENE, S. IM, AND B. MOSELEY, Fast clustering using MapReduce, in KDD, 2011, pp. 681–689.

[12] S. GUHA, N. MISHRA, R. MOTWANI, AND L. O'CALLAGHAN, Clustering data streams, STOC, (2001).

[13] A. GUPTA AND K. TANGWONGSAN, Simpler analyses of local search algorithms for facility location, CoRR, abs/0809.2554 (2008).

[14] P. INDYK, S. MAHABADI, M. MAHDIAN, AND V. MIRROKNI, Composable core-sets for diversity and coverage maximization, unpublished, 2014.

[15] H. J. KARLOFF, S. SURI, AND S. VASSILVITSKII, A model of computation for MapReduce, in SODA, 2010, pp. 938–948.

[16] S. KHULLER AND Y. J. SUSSMANN, The capacitated k-center problem, SIAM J. Discrete Math., 13 (2000), pp. 403–418.

[17] M. R. KORUPOLU, C. G. PLAXTON, AND R. RAJARAMAN, Analysis of a local search heuristic for facility location problems, in SODA, 1998, pp. 1–10.

[18] S. 
LATTANZI, B. MOSELEY, S. SURI, AND S. VASSILVITSKII, Filtering: a method for solving graph problems in MapReduce, in SPAA, 2011, pp. 85–94.

[19] J.-H. LIN AND J. S. VITTER, Approximation algorithms for geometric median problems, Inf. Process. Lett., 44 (1992), pp. 245–249.

[20] F. RAHIMIAN, A. H. PAYBERAH, S. GIRDZIJAUSKAS, M. JELASITY, AND S. HARIDI, Ja-be-ja: A distributed algorithm for balanced graph partitioning, in SASO, 2013, pp. 51–60.

[21] J. UGANDER AND L. BACKSTROM, Balanced label propagation for partitioning massive graphs, in WSDM, 2013, pp. 507–516.
", "award": [], "sourceid": 1350, "authors": [{"given_name": "Mohammadhossein", "family_name": "Bateni", "institution": null}, {"given_name": "Aditya", "family_name": "Bhaskara", "institution": "Google Research NYC"}, {"given_name": "Silvio", "family_name": "Lattanzi", "institution": "Google Research NYC"}, {"given_name": "Vahab", "family_name": "Mirrokni", "institution": "Google Research NYC"}]}