{"title": "Fast Distributed k-Center Clustering with Outliers on Massive Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1063, "page_last": 1071, "abstract": "Clustering large data is a fundamental problem with a vast number of applications. Due to the increasing size of data, practitioners interested in clustering have turned to distributed computation methods. In this work, we consider the widely used k-center clustering problem and its variant used to handle noisy data, k-center with outliers. In the noise-free setting we demonstrate how a previously-proposed distributed method is actually an O(1)-approximation algorithm, which accurately explains its strong empirical performance. Additionally, in the noisy setting, we develop a novel distributed algorithm that is also an O(1)-approximation. These algorithms are highly parallel and lend themselves to virtually any distributed computing framework. We compare both empirically against the best known noisy sequential clustering methods and show that both distributed algorithms are consistently close to their sequential versions. The algorithms are all one can hope for in distributed settings: they are fast, memory efficient and they match their sequential counterparts.", "full_text": "Fast Distributed k-Center Clustering with Outliers on\n\nMassive Data\n\nGustavo Malkomes, Matt J. Kusner, Wenlin Chen\n\nDepartment of Computer Science and Engineering\n\nWashington University in St. Louis\n\nSt. Louis, MO 63130\n\n{luizgustavo,mkusner,wenlinchen}@wustl.edu\nBenjamin Moseley\n\nKilian Q. Weinberger\n\nDepartment of Computer Science\n\nDepartment of Computer Science and Engineering\n\nCornell University\nIthaca, NY 14850\n\nkqw4@cornell.edu\n\nWashington University in St. Louis\n\nSt. 
Louis, MO 63130\n\nbmoseley@wustl.edu\n\nAbstract\n\nClustering large data is a fundamental problem with a vast number of applications.\nDue to the increasing size of data, practitioners interested in clustering have turned\nto distributed computation methods. In this work, we consider the widely used k-\ncenter clustering problem and its variant used to handle noisy data, k-center with\noutliers. In the noise-free setting we demonstrate how a previously-proposed dis-\ntributed method is actually an O(1)-approximation algorithm, which accurately\nexplains its strong empirical performance. Additionally, in the noisy setting, we\ndevelop a novel distributed algorithm that is also an O(1)-approximation. These\nalgorithms are highly parallel and lend themselves to virtually any distributed\ncomputing framework. We compare each empirically against the best known se-\nquential clustering methods and show that both distributed algorithms are con-\nsistently close to their sequential versions. The algorithms are all one can hope\nfor in distributed settings: they are fast, memory ef\ufb01cient and they match their\nsequential counterparts.\n\n1\n\nIntroduction\n\nClustering is a fundamental machine learning problem with widespread applications. Example ap-\nplications include grouping documents or webpages by their similarity for search engines [30] or\ngrouping web users by their demographics for targeted advertising [2]. In a clustering problem one\nis given as input a set U of n data points, characterized by a set of features, and is asked to cluster\n(partition) points so that points in a cluster are similar by some measure. Clustering is a well un-\nderstood task on modestly sized data sets; however, today practitioners seek to cluster datasets of\nmassive size. Once data becomes too voluminous, sequential algorithms become ineffective due to\ntheir running time and insuf\ufb01cient memory to store the data. 
Practitioners have turned to distributed methods, in particular MapReduce [13], to efficiently process massive data sets.\nOne of the most fundamental clustering problems is the k-center problem. Here, it is assumed that for any two input points a pair-wise distance can be computed that reflects their dissimilarity (typically these arise from a metric space). The objective is to choose a subset of k points (called centers) that give rise to a clustering of the input set into k clusters. Each input point is assigned to the cluster defined by its closest center (out of the k center points). The k-center objective selects these centers to minimize the farthest distance of any point to its cluster center.\nThe k-center problem has been studied for over three decades and is a fundamental task used for exemplar based clustering [22]. It is known to be NP-Hard and, further, no polynomial-time algorithm can achieve a (2 \u2212 \u03b5)-approximation for any \u03b5 > 0 unless P=NP [16, 20]. In the sequential setting, there are algorithms which match this bound, achieving a 2-approximation [16, 20].\nThe k-center problem is popular for clustering datasets which are not subject to noise, since the objective is sensitive to error in the data: the worst case (maximum) distance of a point to the centers is used for the objective. In the case where data can be noisy [1, 18, 19], previous work has considered the k-center with outliers problem [10]. In this problem, the objective is the same, but additionally one may discard a set of z points from the input. These z points are the outliers and are ignored in the objective. Here, the best known algorithm is a 3-approximation [10].\nOnce datasets become large, the known algorithms for these two problems become ineffective. Due to this, previous work on clustering has resorted to alternative algorithmic approaches. There have been several works on streaming algorithms [3, 17, 24, 26]. 
Others have focused on distributed computing [6, 7, 14, 25]. The work in the distributed setting has focused on algorithms which are implementable in MapReduce, but are also inherently parallel and work in virtually any distributed computing framework. The work of [14] was the first to consider k-center clustering in the distributed setting. Their work gave an O(1)-round O(1)-approximate MapReduce algorithm. Their algorithm is a sampling based MapReduce algorithm which can be used for a variety of clustering objectives. Unfortunately, as the authors point out in their paper, the algorithm does not always perform well empirically for the k-center objective since the objective function is very sensitive to missing data points and the sampling can cause large errors in the solution.\nThe work of Kumar et al. [23] gave a (1 \u2212 1/e)-approximation algorithm for submodular function maximization subject to a cardinality constraint in the MapReduce setting; however, their algorithm requires a non-constant number of MapReduce rounds. In contrast, Mirzasoleiman et al. [25] (recently extended in [8]) gave a two-round MapReduce algorithm, but their approximation ratio is not constant. It is known that an exact algorithm for submodular maximization subject to a cardinality constraint gives an exact algorithm for the k-center problem. Unfortunately, both problems are NP-Hard and the reduction is not approximation preserving. Therefore, their theoretical results do not imply a nontrivial approximation for the k-center problem.\nFor these problems, the following questions loom: What can be achieved for k-center clustering with or without outliers in the large-scale distributed setting? What underlying algorithmic ideas are needed for the k-center with outliers problem to be solved in the distributed setting? The k-center with outliers problem has not been studied in the distributed setting. 
Given the complexity of the sequential algorithm, it is not clear what such an algorithm would look like.\nContributions. In this work, we consider the k-center and k-center with outliers problems in the distributed computing setting. Although the algorithms are highly parallel and work in virtually any distributed computing framework, they are particularly well suited to MapReduce [13], as they require only small amounts of inter-machine communication and very little memory on each machine. We therefore state our results for the MapReduce framework [13]. We will assume throughout the paper that our algorithm is given some number of machines, m, to process the data.\nWe begin by considering a natural interpretation of the algorithm of Mirzasoleiman et al. [25] on submodular optimization for the k-center problem. The algorithm we introduce runs in two MapReduce rounds and achieves a small constant approximation.\nTheorem 1.1. There is a two-round MapReduce algorithm which achieves a 4-approximation for the k-center problem and communicates O(km) data, assuming the data is already partitioned across the machines. The algorithm uses O(max{n/m, mk}) memory on each machine.\nNext we consider the k-center with outliers problem. This problem is far more challenging and previous distributed techniques do not lend themselves to it. Here we combine the algorithm developed for the problem without outliers with the sequential algorithm for k-center with outliers. We show a two-round MapReduce algorithm that achieves an O(1)-approximation.\nTheorem 1.2. There is a two-round MapReduce algorithm which achieves a 13-approximation for the k-center with outliers problem and communicates O((k + z)m log n) data, assuming the data is already partitioned across the machines. The algorithm uses O(max{n/m, m(k + z) log n}) memory on each machine.\n\nFinally, we perform experiments with both algorithms on real world datasets. 
For k-center we observe that the quality of the solutions is effectively the same as that of the sequential algorithm for all values of k, the best one could hope for. For the k-center problem with outliers our algorithm matches the sequential algorithm as the values of k and z vary, and it significantly outperforms the algorithm which does not explicitly consider outliers. Somewhat surprisingly, our algorithm achieves an order of magnitude speed-up over the sequential algorithm even if it is run sequentially.\n\n2 Preliminaries\n\nMapReduce. We will consider algorithms in the distributed setting where our algorithms are given m machines. We define our algorithms in a general distributed manner, but they are particularly suited to the MapReduce model [21]. This model has become widely used both in theory and in applied machine learning [4, 5, 9, 12, 15, 21, 25, 27, 31]. In the MapReduce setting, algorithms run in rounds. In each round the machines are allowed to run a sequential computation without machine communication. Between rounds, data is distributed amongst the machines in preparation for new computation. The goal is to design an algorithm which runs in a small number of rounds, since the main running time bottleneck is distributing the data amongst the machines between rounds. Generally it is assumed that each of the machines uses sublinear memory [21]. The motivation here is that since MapReduce is used to process large data sets, the memory on the machines should be much smaller than the input size to the problem. It is additionally assumed that there is enough memory to store the entire dataset across all machines. Our algorithms fall into this category and the memory required on each machine scales inversely with m.\nk-center (with outliers) problem. In the problems considered, there is a universe U of n points. Between each pair of points u, v \u2208 U there is a distance d(u, v) specifying their dissimilarity. 
The points are assumed to lie in a metric space, which implies that for all u, v, u\u2032 \u2208 U we have that (1) d(u, u) = 0, (2) d(u, v) = d(v, u), and (3) d(u, v) \u2264 d(u, u\u2032) + d(u\u2032, v) (triangle inequality). For a set X of points, we let d_X(u) := min_{v \u2208 X} d(u, v) denote the minimum distance of a point u \u2208 U to any point in X. In the k-center problem, the goal is to choose a set of centers X of k points such that max_{v \u2208 U} d_X(v) is minimized (i.e., d_X(v) is the distance between v and its cluster center and we would like to minimize the largest such distance across all points). In the k-center with outliers problem, the goal is to choose a set X of k points and a set Z of z points such that max_{v \u2208 U \\ Z} d_X(v) is minimized. Note that in this problem the algorithm simply needs to choose the set X, as the optimal set Z is then well defined: it is the set of z points in U farthest from the centers X.\nSequential algorithms. The most widely used k-center (without outliers) algorithm is the following simple greedy procedure, summarized in pseudo-code in Algorithm 1. The algorithm sets X = \u2205 and then iteratively adds points from U to X until |X| = k. At each step, the algorithm greedily selects the farthest point in U from X, and then adds this point to the updated set X. This algorithm is natural and efficient and is known to give a 2-approximation for the k-center problem [20]. However, it is also inherently sequential and does not lend itself to the distributed setting (except for very small k). A naive MapReduce implementation can be obtained by finding the element v \u2208 U that maximizes d_X(v) in a distributed fashion (line 4 in Algorithm 1). This, however, requires k rounds of MapReduce that must distribute the entire dataset each round. Therefore it is too inefficient for many real world problems. 
The sequential algorithm for k-center with outliers is more complicated due to the increased difficulty of the problem (for reference see [10]). This algorithm is even more fundamentally sequential than Algorithm 1.\n\nAlgorithm 1 Sequential k-center\nGREEDY(U, k)\n1: X = \u2205\n2: Add any point u \u2208 U to X\n3: while |X| < k do\n4:   u = argmax_{v \u2208 U} d_X(v)\n5:   X = X \u222a {u}\n6: end while\n\n3 k-Center in MapReduce\n\nIn this section we consider the k-center problem where no outliers are allowed. As mentioned before, a similar variant of this problem has been previously studied in Mirzasoleiman et al. [25] in the distributed setting. The work of Mirzasoleiman et al. considers submodular maximization and showed a min{1/k, 1/m}-approximation where m is the number of machines. Their algorithm was shown to perform extremely well in practice (in a slightly modified clustering setup). The k-center problem can be mapped to submodular maximization, but the standard reduction is not approximation preserving and their result does not imply a non-trivial approximation for k-center. In this section, we give a natural interpretation of their algorithm without submodular maximization.\n\nAlgorithm 2 Distributed k-center\nGREEDY-MR(U, k)\n1: Partition U into m equal sized sets U_1, ..., U_m where machine i receives U_i.\n2: Machine i assigns C_i = GREEDY(U_i, k)\n3: All sets C_i are assigned to machine 1\n4: Machine 1 sets X = GREEDY(\u222a_{i=1}^{m} C_i, k)\n5: Output X\n\nAlgorithm 2 summarizes a distributed approach for solving the k-center problem. First the data points of U are partitioned across all m machines. Then each machine i runs the GREEDY algorithm on the partition it is given to compute a set C_i of k points. These points are assigned to a single machine, which runs GREEDY again to compute the final solution. 
The algorithm runs in two MapReduce rounds and the only information communicated is C_i for each i if the data is already assigned to machines. Thus, we have the following proposition.\nProposition 3.1. The algorithm GREEDY-MR runs in two MapReduce rounds and communicates O(km) data, assuming the data is originally partitioned across the machines. The algorithm uses O(max{n/m, mk}) memory on each machine.\nWe aim to bound the approximation ratio of GREEDY-MR. Let OPT denote the optimal solution value for the k-center problem. The previous proposition and following lemma give Theorem 1.1.\nLemma 3.2. The algorithm GREEDY-MR is a 4-approximation algorithm.\nProof. We first show for any i that d_{C_i}(u) \u2264 2OPT for any u \u2208 U_i. Indeed, say for the sake of contradiction that this is not the case for some i. Then for some u \u2208 U_i, d_{C_i}(u) > 2OPT, which implies u is at distance greater than 2OPT from all points in C_i. By definition of GREEDY, for any pair of points v, v\u2032 \u2208 C_i it must be the case that d(v, v\u2032) \u2265 d_{C_i}(u) > 2OPT (otherwise u would have been included in C_i). Thus, in the set {u} \u222a C_i there are k + 1 points all at distance greater than 2OPT from each other. However, then two of these points v, v\u2032 \u2208 ({u} \u222a C_i) must be assigned to the same center v* in the optimal solution. Using the triangle inequality and the definition of OPT it must be the case that d(v, v\u2032) \u2264 d(v*, v) + d(v*, v\u2032) \u2264 2OPT, a contradiction. Thus, for all points u \u2208 U_i, it must be that d_{C_i}(u) \u2264 2OPT.\nLet X denote the output solution by GREEDY-MR. We can show a similar result for points in \u222a_{i=1}^{m} C_i when compared to X. That is, we show that d_X(u) \u2264 2OPT for any u \u2208 \u222a_{i=1}^{m} C_i. Indeed, say for the sake of contradiction that this is not the case. Then for some u \u2208 \u222a_{i=1}^{m} C_i, d_X(u) > 2OPT, which implies u is at distance greater than 2OPT from all points in X. By definition of GREEDY, for any pair of points v, v\u2032 \u2208 X it must be that d(v, v\u2032) \u2265 d_X(u) > 2OPT. Thus, in the set {u} \u222a X there are k + 1 points all at distance greater than 2OPT from each other. However, then two of these points v, v\u2032 \u2208 ({u} \u222a X) must be assigned to the same center v* in the optimal solution. Using the triangle inequality and the definition of OPT it must be the case that d(v, v\u2032) \u2264 d(v*, v) + d(v*, v\u2032) \u2264 2OPT, a contradiction. Thus, for all points u \u2208 \u222a_{i=1}^{m} C_i, it must be that d_X(u) \u2264 2OPT.\nNow we put these together to get a 4-approximation. Consider any point u \u2208 U. If u is in C_i for some i, it must be the case that d_X(u) \u2264 2OPT by the above argument. Otherwise, u is not in C_i for any i. Let U_j be the partition which u belongs to. We know that u is within distance 2OPT of some point v \u2208 C_j and further we know that v is within distance 2OPT of X from the above arguments. Thus, using the triangle inequality, d_X(u) \u2264 d(u, v) + d_X(v) \u2264 2OPT + 2OPT = 4OPT.\n\n4 k-center with Outliers\n\nIn this section, we consider the k-center with outliers problem and give the first MapReduce algorithm for the problem. The problem is more challenging than the version without outliers because one has to also determine which points to discard, which can drastically change which centers should be chosen. Intuitively, the right algorithmic strategy is to choose centers such that there are many points around them. Given that they are surrounded by many points, this is a strong indicator that these points are not outliers. This idea was formalized in the algorithm of Charikar et al. [10], a well-known and influential algorithm for this problem in the single machine setting.\nAlgorithm 3 summarizes the approach of Charikar et al. [10]. It takes as input the set of points U, the desired number of centers k and a parameter G. The parameter G is a \u2018guess\u2019 of the optimal solution\u2019s value. The algorithm\u2019s performance is best when G = OPT, where OPT denotes the optimal k-center objective after discarding z points. The number of outliers to be discarded, z, is not a parameter of the algorithm and is communicated implicitly through G. The value of G can be determined by doing a binary search on possible values of G, between the minimum and maximum distances of any two points.\n\nAlgorithm 3 Sequential k-center with outliers [10]\nOUTLIERS(U, k, G)\n1: U\u2032 = U, X = \u2205\n2: while |X| < k do\n3:   \u2200u \u2208 U\u2032 let B_u = {v : v \u2208 U\u2032, d(u, v) \u2264 G}\n4:   Let v\u2032 = argmax_{u \u2208 U\u2032} |B_u|\n5:   Set X = X \u222a {v\u2032}\n6:   Compute B\u2032_{v\u2032} = {v : v \u2208 U\u2032, d(v\u2032, v) \u2264 3G}\n7:   U\u2032 = U\u2032 \\ B\u2032_{v\u2032}\n8: end while\n\nFor each point u \u2208 U, the set B_u contains points within distance G of u. The algorithm adds to the solution set the point v\u2032 which covers the largest number of points with B_{v\u2032}. The idea here is to add points which have many points nearby (and thus large B_{v\u2032}). Then the algorithm removes all points from the universe which are within distance 3G from v\u2032 and continues until k points are chosen to be in the set X. Recall that in the outliers problem, choosing the centers gives a well defined solution and the outliers are simply the farthest z points from the centers. Further, it can be shown that when G = OPT, after selecting the k centers, there are at most z outliers remaining in U\u2032. 
It is known that this algorithm gives a 3-approximation [10]; however, it is not efficient on large or even medium sized datasets due to the computation of the sets B_u within each iteration. For instance, it can take approximately 100 hours on a data set with 45,000 points.\nWe now give a distributed approach (Algorithm 4) for clustering with outliers. This algorithm is naturally parallel, yet it is significantly faster even if run sequentially on a single machine. It uses a sub-procedure (Algorithm 5) which is a generalization of OUTLIERS.\nThe algorithm first partitions the points across the m machines (e.g., set U_i goes to machine i). Each machine i runs the GREEDY algorithm on U_i, but selects k + z points rather than k. This results in a set C_i. For each c \u2208 C_i, we assign a weight w_c that is the number of points in U_i that have c as their closest point in C_i (i.e., if C_i defines an intermediate clustering of U_i then w_c is the number of points in the c-cluster). The algorithm then runs a variation of OUTLIERS called CLUSTER, described in Algorithm 5, on only the points in \u222a_{i=1}^{m} C_i. The main differences are that CLUSTER represents each point c by the number of points w_c closest to it, and that it uses 5G and 11G for the radii in B_u and B\u2032_u.\nThe total machine-wise communication required for OUTLIERS-MR is that needed to send each of the sets C_i to Machine 1 along with their weights. Each weight can have size at most n, so it only requires O(log n) space to encode the weight. This gives the following proposition.\nProposition 4.1. OUTLIERS-MR runs in two MapReduce rounds and communicates O((k + z)m log n) data, assuming the data is originally partitioned across the machines. The algorithm uses O(max{n/m, m(k + z) log n}) memory on each machine.\nOur goal is to show that OUTLIERS-MR is an O(1)-approximation algorithm (Theorem 1.2). 
We first present intermediate lemmas and give proof sketches, leaving intermediate proofs to the supplementary material.\n\nAlgorithm 4 Distributed k-center with outliers\nOUTLIERS-MR(U, k, z, G, \u03b1, \u03b2)\n1: Partition U into m equal sized sets U_1, ..., U_m where machine i receives U_i.\n2: Machine i sets C_i = GREEDY(U_i, k + z)\n3: For each point c \u2208 C_i, machine i sets w_c = |{v : v \u2208 U_i, d(v, c) = d_{C_i}(v)}| + 1\n4: All sets C_i are assigned to machine 1 with the weights of the points in C_i\n5: Machine 1 sets X = CLUSTER(\u222a_{i=1}^{m} C_i, k, G)\n6: Output X\n\nAlgorithm 5 Clustering subroutine\nCLUSTER(U, k, G)\n1: U\u2032 = U, X = \u2205\n2: while |X| < k do\n3:   \u2200u \u2208 U compute B_u = {v : v \u2208 U\u2032, d(u, v) \u2264 5G}\n4:   Let v\u2032 = argmax_{u \u2208 U} \u2211_{u\u2032 \u2208 B_u} w_{u\u2032}\n5:   Set X = X \u222a {v\u2032}\n6:   Compute B\u2032_{v\u2032} = {v : v \u2208 U\u2032, d(v\u2032, v) \u2264 11G}\n7:   U\u2032 = U\u2032 \\ B\u2032_{v\u2032}\n8: end while\n9: Output X\n\nWe overload notation and let OPT denote a fixed optimal solution as well as the optimal objective to the problem. We will assume throughout the proof that G = OPT, as we can perform a binary search to find G\u0302 = OPT(1 + \u03b5) for arbitrarily small \u03b5 > 0 when running CLUSTER on a single machine. We first claim that any point in U_i is not too far from some point in C_i.\nLemma 4.2. For every point u \u2208 U_i it is the case that d_{C_i}(u) \u2264 2OPT for all 1 \u2264 i \u2264 m.\nGiven the above lemma, let O_1, ..., O_k denote the clusters in the optimal solution. A cluster in OPT is defined as a subset of the points in U, not including outliers identified by OPT, that are closest to some fixed center chosen by OPT. 
The high level idea of our proof is similar to that used in [10]. Our goal is to show that when our algorithm chooses each center, the set of points discarded from U\u2032 in CLUSTER can be mapped to some cluster in the optimal solution. At the end of CLUSTER there should be at most z points in U\u2032, which are the outliers in the optimal solution. Knowing that we only discard points from U\u2032 close to centers we choose, this will imply the approximation bound.\nFor every point u \u2208 U, which must fall into some U_i, we let c(u) denote the closest point in C_i to u (i.e., c(u) is the closest intermediate cluster center found by GREEDY to u). Consider the output of CLUSTER, X = {x_1, x_2, ..., x_k}, ordered by how elements were added to X. We will say that an optimal cluster O_i is marked at CLUSTER iteration j if there is a point u \u2208 O_i such that c(u) \u2209 U\u2032 just before x_j is added to X. Essentially, if a cluster is marked, we can make no guarantee about covering it within some radius of x_j (which will then be discarded). Figure 1 shows examples where O_i is (and is not) marked. We begin by noting that when x_j is added to X, the weight of the points removed from U\u2032 is at least as large as the maximum number of points in an unmarked cluster in the optimal solution.\nLemma 4.3. When x_j is added, \u2211_{u\u2032 \u2208 B_{x_j}} w_{u\u2032} \u2265 |O_i| for any unmarked cluster O_i.\nGiven this result, the following lemma considers a point v that is in some cluster O_i. If c(v) is within the ball B_{x_j} for x_j added to X, then intuitively, this means that we cover all of the points in O_i with B\u2032_{x_j}. Another way to say this is that after we remove the ball B\u2032_{x_j}, no points in O_i contribute weight to any point in U\u2032.\nLemma 4.4. Consider that x_j is to be added to X. Say that c(v) \u2208 B_{x_j} for some point v \u2208 O_i for some i. 
Then, for every point u \u2208 O_i either c(u) \u2208 B\u2032_{x_j} or c(u) has already been removed from U\u2032.\nSee the supplementary material for the proof. The final lemma below states that the weight of the points in \u222a_{1 \u2264 i \u2264 k} B\u2032_{x_i} is at least as large as the number of points in \u222a_{1 \u2264 i \u2264 k} O_i. Further, we know that |\u222a_{1 \u2264 i \u2264 k} O_i| = n \u2212 z since OPT has z outliers. Viewing the points in B\u2032_{x_i} as being assigned to x_i in the algorithm\u2019s solution, this shows that the number of points covered is at least as large as the number of points that the optimal solution covers. Hence, there cannot be more than z points uncovered by our algorithm.\nLemma 4.5. \u2211_{i=1}^{k} \u2211_{u \u2208 B\u2032_{x_i}} w_u \u2265 n \u2212 z.\nFinally, we are ready to complete the proof of Theorem 1.2.\n\nFigure 1: Examples in which O_i is/is not marked.\n\nProof of Theorem 1.2. Lemma 4.5 implies that the sum of the weights of the points which are in \u222a_{1 \u2264 i \u2264 k} B\u2032_{x_i} is at least n \u2212 z. We know that every point u contributes to the weight of some point c(u) which is in C_i for some 1 \u2264 i \u2264 m, and by Lemma 4.2 d(u, c(u)) \u2264 2OPT. We map every point u \u2208 U to x_i such that c(u) \u2208 B\u2032_{x_i}. By definition of B\u2032_{x_i} we have d(c(u), x_i) \u2264 11OPT, so together with Lemma 4.2 the triangle inequality gives d(u, x_i) \u2264 2OPT + 11OPT = 13OPT. Thus, we have mapped n \u2212 z points to some point in X within distance 13OPT. Hence, our algorithm discards at most z points and achieves a 13-approximation. With Proposition 4.1 we have shown Theorem 1.2. \u25a1\n\n5 Experiments\n\nWe evaluate the real-world performance of the above clustering algorithms on seven clustering datasets, described in Table 1. We compare all methods using the k-center with outliers objective, in which z outliers may be discarded. 
We begin with a brief description of the clustering methods we compare. We then show how the distributed algorithms compare with their sequential counterparts on datasets small enough to run the sequential methods, for a variety of settings. Finally, in the large-scale setting, we compare all distributed methods for different settings of k.\n\nTable 1: The clustering datasets (and their descriptions) used for evaluation.\nname | n | dim. | description\nParkinsons [28] | 5,875 | 22 | patients with early-stage Parkinson\u2019s disease\nCensus\u00b9 | 45,222 | 12 | census household information\nSkin\u00b9 | 245,057 | 3 | RGB-pixel samples from face images\nYahoo [11] | 473,134 | 500 | web-search ranking dataset (features are GBRT outputs [29])\nCovertype\u00b9 | 522,911 | 13 | a forest cover dataset with cartographic features\nPower\u00b9 | 2,049,280 | 7 | household electric power readings\nHiggs\u00b9 | 11,000,000 | 7 | particle detector measurements (the seven \u2018high-level\u2019 features)\n\nMethods. We implemented the sequential GREEDY and OUTLIERS and distributed GREEDY-MR [25] and OUTLIERS-MR. We also implemented two baseline methods. RANDOM|RANDOM: m machines randomly select k + z points, then a single machine randomly selects k points out of the previously selected m(k + z) points. RANDOM|OUTLIERS: m machines randomly select k + z points, then OUTLIERS (Algorithm 4) is run over the m(k + z) points previously selected. All methods were implemented in MATLAB\u2122 and run on an 8-core Intel Xeon 2 GHz machine.\n\nFigure 2: The performance of sequential and distributed methods. We plot the objective value of four small datasets for varying k, z, and m.\n\nSequential vs. Distributed. Our first set of experiments evaluates how close the proposed distributed methods are to their sequential counterparts. 
To this end, we vary all parameters: number of centers k, number of outliers z, and the number of machines m. We consider datasets for which computing the sequential methods is practical: Parkinsons, Census and two random subsamples (10,000 inputs each) of Covertype and Power. We show the results in Figure 2. Each column contains the results for a single dataset and each row for a single varying parameter (k, z, or m), along with standard errors over 5 runs. When a parameter is not varied we fix k = 50, z = 256, and m = 10. As expected, the objective value for all methods generally decreases as k increases (as the distance of any point to its cluster center must shrink with more clusters). RANDOM|RANDOM and RANDOM|OUTLIERS usually perform worse than GREEDY-MR for small k (save 10k Covertype) and RANDOM|OUTLIERS sometimes matches it for large k.\n\n\u00b9 https://archive.ics.uci.edu/ml/datasets/\n\n[Figure 2 plot panels: Parkinson, 10k Covertype, 10k Power, Census; varied parameters: number of machines m, number of outliers log2(z), number of clusters k; y-axis: objective value; methods: Random | Random, Random | Outliers, Outliers, Greedy, Greedy-MR, Outliers-MR.]\n\nFigure 3: The objective value of five large-scale datasets, for varying k.\n\nFor all values of k tested, OUTLIERS-MR outperforms all other distributed methods. Furthermore, it matches or slightly outperforms (which we attribute to randomness) the sequential OUTLIERS method in all settings. As z increases the two random methods improve, beyond GREEDY-MR in some cases. 
Similar to the first plot, OUTLIERS-MR outperforms all other distributed methods while matching the sequential clustering method. For very small settings of m (i.e., m = 2, 6), OUTLIERS-MR and GREEDY-MR perform slightly worse than sequential OUTLIERS and GREEDY. However, for practical settings of m (i.e., m \u2265 10), OUTLIERS-MR matches OUTLIERS and GREEDY-MR matches GREEDY. In terms of speed, on the largest of these datasets (Census) OUTLIERS-MR run sequentially is more than 677\u00d7 faster than OUTLIERS (see Table 2). This large speedup is due to the fact that we cannot store the full distance matrix for Census, thus all distances need to be computed on demand.\n\nTable 2: The speedup of the distributed algorithms, run sequentially, over their sequential counterparts on the small datasets.\ndataset | k-center | outliers\n10k Covertype | 3.6 | 6.2\n10k Power | 4.8 | 9.4\nParkinson | 4.9 | 4.4\nCensus | 12.4 | 677.7\n\nLarge-scale. Our second set of experiments focuses on the performance of the distributed methods on five large-scale datasets, shown in Figure 3. We vary k between 0 and 100, and fix m = 10 and z = 256. Note that for certain datasets, clustering while taking into account outliers produces a noticeable reduction in objective value. On Yahoo, the GREEDY-MR method is even outperformed by RANDOM|OUTLIERS, which considers outliers. Similar to the small dataset results, OUTLIERS-MR outperforms nearly all distributed methods (save for small k on Covertype). Even on datasets where there appear to be few outliers, OUTLIERS-MR has excellent performance. Finally, OUTLIERS-MR is extremely fast: clustering on Higgs took less than 15 minutes.\n\n6 Conclusion\n\nIn this work we described algorithms for the k-center and k-center with outliers problems in the distributed setting. 
For both problems we studied two-round MapReduce algorithms which achieve an O(1)-approximation and demonstrated that they perform almost identically to their sequential counterparts on real data. Further, a number of our experiments validate that using k-center clustering on noisy data degrades the quality of the solution. We hope these techniques lead to the discovery of fast and efficient distributed algorithms for other clustering problems. In particular, what can be shown for the k-median or k-means with outliers problems is an exciting open question.\n\nAcknowledgments. GM was supported by CAPES/BR; MJK and KQW were supported by the NSF grants IIA-1355406, IIS-1149882, EFRI-1137211; and BM was supported by the Google and Yahoo Research Awards.\n\n[Figure 3 panels: Covertype, Power, Skin, Yahoo, Higgs, each with m=10, z=256. The x-axis is the number of clusters k and the y-axis is the objective value. Legend: Random | Random, Random | Outliers, Greedy-MR, Outliers-MR.]\n\nReferences\n[1] P. K. Agarwal and J. M. Phillips. An efficient algorithm for 2d euclidean 2-center with outliers. In ESA, pages 64\u201375, 2008.\n[2] C. C. Aggarwal, J. L. Wolf, and P. S. Yu. Method for targeted advertising on the web based on accumulated self-learning data, clustering users and semantic node graph techniques, March 30 2004. US Patent 6,714,975.\n[3] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In NIPS, pages 10\u201318, 2009.\n[4] A. Andoni, A. Nikolov, K. Onak, and G. Yaroslavtsev. Parallel algorithms for geometric graph problems. In STOC, pages 574\u2013583, 2014.\n[5] B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and mapreduce. PVLDB, 5(5):454\u2013465, 2012.\n[6] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. 
PVLDB, 5(7):622\u2013633, 2012.\n[7] M. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995\u20132003, 2013.\n[8] R. Barbosa, A. Ene, H. Nguyen, and J. Ward. The power of randomization: Distributed submodular maximization on massive datasets. In ICML, pages 1236\u20131244, 2015.\n[9] A. Z. Broder, L. G. Pueyo, V. Josifovski, S. Vassilvitskii, and S. Venkatesan. Scalable k-means by ranked retrieval. In WSDM, pages 233\u2013242, 2014.\n[10] M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642\u2013651, 2001.\n[11] M. Chen, K. Q. Weinberger, O. Chapelle, D. Kedem, and Z. Xu. Classifier cascade for minimizing feature evaluation cost. In AIStats, pages 218\u2013226, 2012.\n[12] F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, pages 231\u2013240, 2010.\n[13] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137\u2013150, 2004.\n[14] A. Ene, S. Im, and B. Moseley. Fast clustering using MapReduce. In KDD, pages 681\u2013689, 2011.\n[15] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On distributing symmetric streaming computations. In SODA, pages 710\u2013719, 2008.\n[16] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38(0):293\u2013306, 1985. ISSN 0304-3975.\n[17] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O\u2019Callaghan. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng., 15(3):515\u2013528, 2003.\n[18] S. Guha, R. Rastogi, and K. Shim. Techniques for clustering massive data sets. In Clustering and Information Retrieval, volume 11 of Network Theory and Applications, pages 35\u201382. Springer US, 2004. 
ISBN 978-1-4613-7949-2.\n[19] M. Hassani, E. M\u00fcller, and T. Seidl. EDISKCO: energy efficient distributed in-sensor-network k-center clustering with outliers. In SensorKDD-Workshop, pages 39\u201348, 2009.\n[20] D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180\u2013184, 1985.\n[21] H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In SODA, pages 938\u2013948, 2010.\n[22] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 9th edition, March 1990. ISBN 0471878766.\n[23] R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in mapreduce and streaming. In SPAA, pages 1\u201310, 2013.\n[24] R. M. McCutchen and S. Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX-RANDOM, pages 165\u2013178, 2008.\n[25] B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, pages 2049\u20132057, 2013.\n[26] M. Shindler, A. Wong, and A. W. Meyerson. Fast and accurate k-means for large datasets. In NIPS, pages 2375\u20132383, 2011.\n[27] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607\u2013614, 2011.\n[28] A. Tsanas, M. A. Little, P. E. McSharry, and L. O. Ramig. Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson\u2019s disease progression. In ICASSP, pages 594\u2013597. IEEE, 2010.\n[29] S. Tyree, K. Q. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In WWW, pages 387\u2013396. ACM, 2011.\n[30] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of web documents. 
In KDD, volume 97, pages 287\u2013290, 1997.\n[31] Z. Zhao, G. Wang, A. R. Butt, M. Khan, V. S. A. Kumar, and M. V. Marathe. Sahad: Subgraph analysis in massive networks using hadoop. In IPDPS, pages 390\u2013401, May 2012.", "award": [], "sourceid": 675, "authors": [{"given_name": "Gustavo", "family_name": "Malkomes", "institution": "Washington University in St. Louis"}, {"given_name": "Matt", "family_name": "Kusner", "institution": "Washington University in St. Louis"}, {"given_name": "Wenlin", "family_name": "Chen", "institution": "Washington University in St. Louis"}, {"given_name": "Kilian", "family_name": "Weinberger", "institution": "Washington University in St. Louis"}, {"given_name": "Benjamin", "family_name": "Moseley", "institution": "Washington University in St Lo"}]}