{"title": "Distributed $k$-Clustering for Data with Heavy Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 7838, "page_last": 7846, "abstract": "In this paper, we consider the $k$-center/median/means clustering with outliers problems (or the $(k, z)$-center/median/means problems) in the distributed setting.  Most previous distributed algorithms have their communication costs linearly depending on $z$, the number of outliers.  Recently Guha et al.[10] overcame this dependence issue by considering bi-criteria approximation algorithms that output solutions with $2z$ outliers.  For the case where $z$ is large, the extra $z$ outliers discarded by the algorithms might be too large, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible $(1+\\epsilon)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$.  The problems we consider include the $(k, z)$-center problem, and $(k, z)$-median/means problems in Euclidean metrics. Implementation of the our algorithm for $(k, z)$-center shows that it outperforms many previous algorithms, both in terms of the communication cost and quality of the output solution.", "full_text": "Distributed k-Clustering for Data with Heavy Noise\n\nXiangyu Guo\n\nUniversity at Buffalo\nBuffalo, NY 14260\n\nxiangyug@buffalo.edu\n\nShi Li\n\nUniversity at Buffalo\nBuffalo, NY 14260\nshil@buffalo.edu\n\nAbstract\n\nIn this paper, we consider the k-center/median/means clustering with outliers\nproblems (or the (k, z)-center/median/means problems) in the distributed setting.\nMost previous distributed algorithms have their communication costs linearly\ndepending on z, the number of outliers. Recently Guha et al. [10] overcame this\ndependence issue by considering bi-criteria approximation algorithms that output\nsolutions with 2z outliers. 
For the case where z is large, the extra z outliers discarded by the algorithms might be too many, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible (1 + ε)z, while maintaining the O(1)-approximation ratio and independence of communication cost on z. The problems we consider include the (k, z)-center problem, and the (k, z)-median/means problems in Euclidean metrics. Implementation of our algorithm for (k, z)-center shows that it outperforms many previous algorithms, both in terms of the communication cost and the quality of the output solution.

1 Introduction

Clustering is a fundamental problem in unsupervised learning and data analytics. In many real-life datasets, noise and errors unavoidably exist. It is known that even a few noisy data points can significantly influence the quality of the clustering results. To address this issue, previous work has considered the clustering with outliers problem, where we are given a bound z on the number of outliers, and need to find the optimum clustering in which we are allowed to discard z points, under some popular clustering objective such as k-center, k-median and k-means.

Due to the increase in volumes of real-life datasets, and the emergence of modern parallel computation frameworks such as MapReduce and Hadoop, computing a clustering (with or without outliers) in the distributed setting has attracted a lot of attention in recent years. The set of points is partitioned into m parts that are stored on m different machines, which collectively need to compute a good clustering by sending messages to each other. Often, the time to compute a good solution is dominated by the communication among machines. 
Many recent papers on distributed clustering have focused on designing O(1)-approximation algorithms with small communication cost [2, 13, 10].

Most previous algorithms for clustering with outliers have communication costs linearly depending on z, the number of outliers. Such an algorithm performs poorly when data is very noisy. Consider the scenario where distributed sensory data are collected by a crowd of people equipped with portable sensory devices. Due to the different skill levels of individuals and the varying quality of devices, it is reasonable to assume that a small constant fraction of the data points are unreliable.

Recently, Guha et al. [10] overcame the linear dependence issue, by giving distributed O(1)-approximation algorithms for the k-center/median/means with outliers problems with communication cost independent of z. However, the solutions produced by their algorithms have 2z outliers. Such a solution discards z more points compared to the (unknown) optimum one, which may greatly decrease the efficiency of data usage. Consider an example where a study needs to be conducted using the inliers of a dataset containing 10% noisy points; a filtering process is needed to remove the outliers. A solution with 2z outliers will only preserve 80% of the data points, as opposed to the promised 90%. As a result, the quality of the study result may be reduced.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Unfortunately, a simple example (described in the supplementary material) shows that if we need to produce any multiplicatively approximate solution with only z outliers, then the linear dependence on z cannot be avoided. We show that even deciding whether the optimum clustering with z outliers has cost 0 or not, for a dataset distributed on 2 machines, requires a communication cost of Ω(z) bits.

Given such a negative result and the positive results of Guha et al. 
[10], the following question is interesting from both the practical and the theoretical points of view:

Can we obtain distributed O(1)-approximation algorithms for k-center/median/means with outliers that have communication cost independent of z and output solutions with (1 + ε)z outliers, for any ε > 0?

On the practical side, an algorithm discarding εz additional outliers is acceptable, as this number can be made arbitrarily small, compared to both the promised number z of outliers and the number n − z of inliers. On the theoretical side, the (1 + ε)-factor for the number of outliers is the best we can hope for if we are aiming at an O(1)-approximation algorithm with communication complexity independent of z; thus answering the question in the affirmative would give the tight tradeoff between the number of outliers and the communication cost in terms of z.

In this paper, we make progress in answering the above question for many cases. For the k-center objective, we solve the problem completely by giving a (24(1 + ε), 1 + ε)-bicriteria approximation algorithm with communication cost O((km/ε) · log(Δ/ε)), where Δ is the aspect ratio of the metric. (24(1 + ε) is the approximation ratio and 1 + ε is the multiplicative factor for the number of outliers our algorithm produces; the formal definition appears later.) For the k-median/means objectives, we give a distributed (1 + ε, 1 + ε)-bicriteria approximation algorithm for the case of Euclidean metrics. The communication complexity of the algorithm is poly(1/ε, k, D, m, log Δ), where D is the dimension of the underlying Euclidean metric. (The exact communication complexity is given in Theorem 1.2.) Using dimension reduction techniques, we can assume D = O(log n/ε²), by incurring a (1 + ε)-distortion in pairwise distances. 
So, the setting indeed covers a broad range of applications, given that the term "k-means clustering" is defined and studied exclusively in the context of Euclidean metrics. The (1 + ε, 1 + ε)-bicriteria approximation ratio comes with a caveat: our algorithm has running time exponential in many parameters such as 1/ε, k, D and m (though it has no exponential dependence on n or z).

1.1 Formulation of Problems

We call the k-center (resp. k-median and k-means) problem with z outliers the (k, z)-center (resp. (k, z)-median and (k, z)-means) problem. Formally, we are given a set P of n points that reside in a metric space d, and two integers k ≥ 1 and z ∈ [0, n]. The goal of the problem is to find a set C of k centers and a set P′ ⊆ P of n − z points so as to minimize max_{p∈P′} d(p, C) (resp. Σ_{p∈P′} d(p, C) and Σ_{p∈P′} d²(p, C)), where d(p, C) = min_{c∈C} d(p, c) is the minimum distance from p to a center in C. For all 3 objectives, given a set C ⊆ P of k centers, the best set P′ can be derived from P by removing the z points p ∈ P with the largest d(p, C). Thus, we shall only use a set C of k centers to denote a solution to a (k, z)-center/median/means instance. The cost of a solution C is defined as max_{p∈P′} d(p, C), Σ_{p∈P′} d(p, C) and Σ_{p∈P′} d²(p, C) respectively for a (k, z)-center, median and means instance, where P′ is obtained by applying the optimum strategy. The n − z points in P′ and the z points in P \ P′ are called inliers and outliers respectively in the solution.

As is typical in the machine learning literature, we consider general metrics for (k, z)-center, and Euclidean metrics for (k, z)-median/means. 
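For concreteness, the "discard the z farthest points" rule above can be sketched in Python (a small illustrative helper, not part of the paper; the function name and tuple-based points are ours):

```python
import math

def kz_costs(points, centers, z):
    """Costs of a center set C on point set P under the three (k, z)
    objectives, after discarding the z points farthest from C (a sketch)."""
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    inliers = dists[:len(dists) - z] if z > 0 else dists
    return {
        "center": max(inliers, default=0.0),   # max_{p in P'} d(p, C)
        "median": sum(inliers),                # sum_{p in P'} d(p, C)
        "means": sum(d * d for d in inliers),  # sum_{p in P'} d^2(p, C)
    }
```

Sorting the center-distances once makes the optimal P′ explicit: it is always the n − z closest points, as stated above.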
In the (k, z)-center problem, we assume that each point p in the metric space d can be described using O(1) words, and that given the descriptions of two points p and q, one can compute d(p, q) in O(1) time. In this case, the set C of centers must be from P, since these are all the points we have. For the (k, z)-median/means problems, points in P and centers C are from the Euclidean space R^D, and it is not required that C ⊆ P. One should treat D as a small number, since dimension reduction techniques can be applied to project points to a lower-dimensional space.

Bi-Criteria Approximation  We say an algorithm for the (k, z)-center/median/means problem achieves a bi-criteria approximation ratio (or simply approximation ratio) of (α, β), for some α, β ≥ 1, if it outputs a solution with at most βz outliers whose cost is at most α times the cost of the optimum solution with z outliers.

Distributed Clustering  In the distributed setting, the dataset P is split among m machines, where P_i is the set of data points stored on machine i. We use n_i to denote |P_i|. Following the communication model of [8] and [10], we assume there is a central coordinator, and communications can only happen between the coordinator and the m machines. The communication cost is measured in the total number of words sent. Communications happen in rounds, where in each round, messages are sent between the coordinator and the m machines. A message sent by a party (either the coordinator or some machine) in a round can only depend on the input data given to the party and the messages received by the party in previous rounds. As is common in most of the previous results, we require the number of rounds used to be small, preferably a small constant.

Our distributed algorithm needs to output a set C of k centers, as well as an upper bound L on the maximum radius of the generated clusters. 
For simplicity, only the coordinator needs to know C and L. We do not require the coordinator to output the set of outliers, since otherwise the communication cost is forced to be at least z. In a typical clustering task, each machine i can figure out the set of outliers in its own dataset P_i based on C and L (1 extra round may be needed for the coordinator to send C and L to all machines).

1.2 Prior Work

In the centralized setting, we know the best possible approximation ratios of 2 and 3 [4] for the k-center and (k, z)-center problems respectively, and thus our understanding in this setting is complete. There has been a long stream of research on approximation algorithms for k-median and k-means, leading to the current best approximation ratios of 2.675 [3] for k-median, 9 [1] for k-means, and 6.357 [1] for Euclidean k-means. The first O(1)-approximation algorithm for (k, z)-median is given by Chen [7]. Recently, Krishnaswamy et al. [12] developed a general framework that gives O(1)-approximations for both (k, z)-median and (k, z)-means.

Much of the recent work has focused on solving the k-center/median/means and (k, z)-center/median/means problems in the distributed setting [9, 2, 11, 13, 8, 6, 10, 5]. Many distributed O(1)-approximation algorithms with small communication complexity are known for these problems. However, for the (k, z)-center/median/means problems, most known algorithms have communication complexity linearly depending on z, the number of outliers. Guha et al. [10] overcame the dependence issue by giving (O(1), 2 + ε)-bicriteria approximation algorithms for all three objectives. The communication costs of their algorithms are Õ(m/ε + mk), where Õ hides a logarithmic factor.

1.3 Our Contributions

Our main contributions are in designing (O(1), 1 + ε)-bicriteria approximation algorithms for the (k, z)-center/median/means problems. 
The algorithm for (k, z)-center works for general metrics:

Theorem 1.1. There is a 4-round, distributed algorithm for the (k, z)-center problem that achieves a (24(1 + ε), 1 + ε)-bicriteria approximation and O((km/ε) · log(Δ/ε)) communication cost, where Δ is the aspect ratio of the metric.

We give a high-level picture of the algorithm. By guessing, we assume that we know the optimum cost L* (since we do not know it, we need to lose the log(Δ/ε)-factor in the communication complexity). In the first round of the algorithm, each machine i calls a procedure called aggregating on its set P_i. This procedure performs two operations. First, it discards some points from P_i; second, it moves each of the surviving points by a distance of at most O(1)L*. After the two operations, the points will be aggregated at a few locations. Thus, machine i can send a compact representation of these points to the coordinator: a list of (p, w′_p) pairs, where p is a location and w′_p is the number of points aggregated at p. The coordinator will collect all the data points from all the machines, and run the algorithm of [4] for a (k, z′)-center instance on the collected points, for some suitable z′.

To analyze the algorithm, we show that the set P′ of points collected by the coordinator well-approximates the original set P. The main lemma is that the total number of non-outliers removed by the aggregation procedure on all machines is at most εz. This incurs the additive factor of εz in the number of outliers. We prove this by showing that inside any ball of radius L*, and for every machine i ∈ [m], we remove at most εz/(km) points of P_i. Since the non-outliers are contained in the union of k balls of radius L*, and there are m machines, the total number of removed non-outliers is at most εz. 
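The per-machine aggregation step just described can be rendered as a short sketch (our own Python rendering; the function name `aggregating` and the (location, weight) return format are assumptions matching the description, not the paper's exact code):

```python
import math

def aggregating(Q, L, y, dist=math.dist):
    """Sketch of the aggregation step: repeatedly pick a point of Q whose
    2L-ball in the surviving set U holds more than y points, record the
    size of its 4L-ball as its weight, and delete that 4L-ball from U."""
    U = list(Q)
    pairs = []  # compact representation: (location, weight) pairs
    while True:
        p = next((q for q in Q
                  if sum(dist(q, u) <= 2 * L for u in U) > y), None)
        if p is None:
            break
        ball4 = [u for u in U if dist(p, u) <= 4 * L]
        pairs.append((p, len(ball4)))
        U = [u for u in U if dist(p, u) > 4 * L]
    return pairs, U  # aggregated representation and the discarded points
```

On a tight cluster plus a few stragglers, the cluster collapses to a single weighted location while sparse points are left in U to be discarded, which is exactly what makes the representation compact.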
For each remaining point, we shift it by a distance of O(1)L*, leading to an O(1)-loss in the approximation ratio of our algorithm.

We perform experiments comparing our main algorithm stated in Theorem 1.1 with many previous ones on real-world datasets. The results show that it matches the state-of-the-art methods in both solution quality (objective value) and communication cost. We remark that the qualities of solutions are measured w.r.t. removing only z outliers. Theoretically, we need (1 + ε)z outliers in order to achieve an O(1)-approximation ratio, and our constant 24 is big. In spite of this, empirical evaluations suggest that on real-world datasets the algorithm performs much better than what can be proved theoretically in the worst case.

For the (k, z)-median/means problems, our algorithm works for the Euclidean metric case and has communication cost depending on the dimension D of the Euclidean space. One can w.l.o.g. assume D = O(log n/ε²) by using the dimension reduction technique. Our algorithm is given in the following theorem:

Theorem 1.2. There is a 2-round, distributed algorithm for the (k, z)-median/means problems in D-dimensional Euclidean space, that achieves a (1 + ε, 1 + ε)-bicriteria approximation ratio with probability 1 − δ. The algorithm has communication cost O(ΦD · log(nΔ/ε)/ε), where Δ is the aspect ratio of the input points, Φ = O((1/ε²)(kD + log(1/δ)) + mk) for (k, z)-median, and Φ = O((1/ε⁴)(kD + log(1/δ)) + mk log(mk/δ)) for (k, z)-means.

We now give an overview of our algorithm for (k, z)-median/means. First, it is not hard to reformulate the objective of the (k, z)-median problem as minimizing sup_{L≥0} (Σ_{p∈P} d_L(p, C) − zL), where d_L is obtained from d by truncating all distances at L. By discretization, we can construct a set 𝓛 of O(log(Δn/ε)/ε) interesting values that the L under the supremum operator can take. Thus, our goal becomes to find a set C that is simultaneously good for every k-median instance defined by d_L, L ∈ 𝓛. Since now we are handling k-median instances (without outliers), we can use the communication-efficient algorithm of [2] to construct an ε-coreset Q_L with weights w_L for every L ∈ 𝓛. Roughly speaking, the coreset Q_L is similar to the set P for the task of solving the k-median problem under the metric d_L. The size of each ε-coreset Q_L is at most Φ, implying the communication cost stated in the theorem. After collecting all the coresets, the coordinator can approximately solve the optimization problem on them. This leads to a (1 + O(ε), 1 + O(ε))-bicriteria approximate solution. The running time of the algorithm, however, is exponential in the total size of the coresets. The argument can be easily adapted to the (k, z)-means setting.

Organization  In Section 2, we prove Theorem 1.1, by giving the (24(1 + ε), 1 + ε)-approximation algorithm. The empirical evaluations of our algorithm for (k, z)-center and the proof of Theorem 1.2 are provided in the supplementary material.

Notations  Throughout the paper, point sets are multi-sets, where each element has its own identity. By a copy of some point p, we mean a point with the same description as p but a different identity. For a set Q of points, a point p, and a radius r ≥ 0, we define ball_Q(p, r) = {q ∈ Q : d(p, q) ≤ r} to be the set of points in Q that have distance at most r to p. 
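Returning to the reformulation of the (k, z)-median objective above: for a fixed center set, the identity "z-outlier cost = sup_L (Σ_p d_L(p, C) − zL)" can be sanity-checked numerically (a toy verification in Python; the helper names are ours):

```python
def kz_median_cost(dists, z):
    """(k, z)-median cost for fixed centers: drop the z largest
    center-distances and sum the rest."""
    d = sorted(dists)
    return sum(d[:len(d) - z]) if z > 0 else sum(d)

def truncated_objective(dists, z, L):
    """sum_p min(d_p, L) - z*L: the truncated objective at level L."""
    return sum(min(d, L) for d in dists) - z * L
```

The supremum is attained at L equal to the (n − z)-th smallest center-distance: below it, raising L adds more than z truncated terms; above it, the −zL term dominates.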
For a weight vector w ∈ Z^Q_{≥0} on some set Q of points, and a set S ⊆ Q, we use w(S) = Σ_{p∈S} w_p to denote the total weight of the points in S.

Throughout the paper, P is always the set of input points. We shall use d_min = min_{p,q∈P: d(p,q)>0} d(p, q) and d_max = max_{p,q∈P} d(p, q) to denote the minimum non-zero and the maximum pairwise distances between points in P. Let Δ = d_max/d_min denote the aspect ratio of the metric.

2 Distributed (k, z)-Center Algorithm with (1 + ε)z Outliers

In this section, we prove Theorem 1.1, by giving the (24(1 + ε), 1 + ε)-approximation algorithm for (k, z)-center, with communication cost O((km/ε) · log(Δ/ε)). Let L* be the cost of the optimum (k, z)-center solution (which is not given to us). We assume we are given a parameter L ≥ 0, and our goal is to design a main algorithm with communication cost O(km/ε) that either returns a (k, (1 + ε)z)-center solution of cost at most 24L, or certifies that L* > L. Notice that L* ∈ {0} ∪ [d_min/2, d_max]. We can obtain our (24(1 + ε), 1 + ε)-approximation by running the main algorithm for O(log(Δ/ε)) different values of L in parallel, and among all generated solutions, returning the one corresponding to the smallest L. A naive implementation requires all the parties to know d_min and d_max in advance; we show in the supplementary material that this requirement can be removed.

In intermediate steps, we may deal with (k, z)-center instances where points have integer weights. In this case, the instance is defined as (Q, w), where Q is a set of points, w ∈ Z^Q_{>0}, and z is an integer between 0 and w(Q) = Σ_{q∈Q} w_q. The instance is equivalent to the instance Q̂, the multi-set where we have w_q copies of each q ∈ Q.

[4] gave a 3-approximation algorithm for the (k, z)-center problem. 
However, our setting is slightly more general, so we cannot apply the result directly. We are given a weighted set Q of points that defines the (k, z)-center instance. The optimum set C* of centers, however, can be from the superset P ⊇ Q, which is hidden from us. Thus, our algorithm needs to output a set C of k centers from Q and compare it against the optimum set C* of centers from P. Notice that by losing a factor of 2, we can assume centers are in Q; this would lead to a 6-approximation. Indeed, by applying the framework of [4] more carefully, we can obtain a 4-approximation for this general setting. We state the result in the following theorem:

Theorem 2.1 ([4]). Let d be a metric over the set P of points, Q ⊆ P and w′ ∈ Z^Q_{>0}. There is an algorithm kzc (Algorithm 1) that takes as input k, z′ ≥ 1, (Q, w′) with |Q| = n′, the metric d restricted to Q, and a real number L′ ≥ 0. In time O(n′²), the algorithm either outputs a (k, z′)-center solution C′ ⊆ Q to the instance (Q, w′) of cost at most 4L′, or certifies that there is no (k, z′)-center solution C* ⊆ P of cost at most L′ and outputs "No".

The main algorithm is dist-kzc (Algorithm 3), which calls an important procedure called aggregating (Algorithm 2). We describe aggregating and dist-kzc in Sections 2.1 and 2.2 respectively.

2.1 Aggregating Points

The procedure aggregating, as described in Algorithm 2, takes as input the set Q ⊆ P of points to be aggregated (which will be some P_i when we actually call the procedure), the guessed optimum cost L, and y ≥ 0, which controls how many points can be removed from Q. 
It returns a set Q′ of points obtained from the aggregation, along with their weights w′.

Algorithm 1 kzc(k, z′, (Q, w′), L′)
1: U ← Q, C′ ← ∅
2: for i ← 1 to k do
3:   p_i ← the p ∈ Q with the largest w′(ball_U(p, 2L′))
4:   C′ ← C′ ∪ {p_i}
5:   U ← U \ ball_U(p_i, 4L′)
6: if w′(U) > z′ then return "No" else return C′

In aggregating, we start from U = Q and Q′ = ∅ and keep removing points from U. In each iteration, we check if there is a p ∈ Q with |ball_U(p, 2L)| > y. If yes, we add p to Q′, remove ball_U(p, 4L) from U, and let w′_p be the number of points removed. We repeat this procedure until such a p cannot be found. We remark that the procedure is very similar to the algorithm kzc (Algorithm 1) of [4]. We start with some simple observations about the algorithm.

Algorithm 2 aggregating(Q, L, y)
1: U ← Q, Q′ ← ∅
2: while ∃ p ∈ Q with |ball_U(p, 2L)| > y do
3:   Q′ ← Q′ ∪ {p}, w′_p ← |ball_U(p, 4L)|, U ← U \ ball_U(p, 4L)
4: return (Q′, w′)

Figure 1: Two cases in the proof of Lemma 2.3. In (a), the balls {ball_U(c, L) : c ∈ C*, d(p, c) ≤ 3L} (red circles) are all empty; so ball_U(p, 2L) ⊆ O. In (b), there is a non-empty ball_U(c, L) for some c ∈ C* with d(p, c) ≤ 3L (the red circle); this ball is contained in ball_U(p, 4L).

Figure 2: Illustration for the proof of Lemma 2.7. f_i : V_i → P′_i is indicated by the dashed lines, each of which has length at most 4L. The number of crosses in a circle is at most εz/(km).

Claim 2.2. Define V = ⋃_{p∈Q′} ball_Q(p, 4L) to be the set of points in Q with distance at most 4L to some point in Q′ at the end of Algorithm 2. Then, the following statements hold at the end of the algorithm:

1. U = Q \ V.

2. |ball_U(p, 2L)| ≤ y for every p ∈ Q.

3. There is a function f : V → Q′ such that d(p, f(p)) ≤ 4L for all p ∈ V, and w′_q = |f⁻¹(q)| for all q ∈ Q′.

Proof. U is exactly the set of points in Q with distance more than 4L to every point in Q′, and thus U = Q \ V. Property 2 follows from the termination condition of the algorithm. Property 3 holds by the way we add points to Q′ and remove points from U: if in some iteration we added q to Q′, we can define f(p) = q for every point p ∈ ball_U(q, 4L), i.e., every point removed from U in that iteration.

We think of U as the set of points we discard from Q and V as the set of surviving points. We then move each p ∈ V to f(p) ∈ Q′, and thus V will be aggregated at the set Q′ of locations. The following crucial lemma upper-bounds |Q′|:

Lemma 2.3. Let ẑ ≥ 0 and assume there is a (k, ẑ)-center solution C* ⊆ P to the instance Q with cost at most L. Then, at the end of Algorithm 2 we have |Q′| ≤ k + ẑ/y.

Proof. Let O = Q \ ⋃_{c∈C*} ball_Q(c, L) be the set of outliers according to the solution C*. Thus |O| ≤ ẑ. Focus on the moment before we run Step 3 in some iteration of aggregating. See Figure 1 for the two cases we are going to consider. 
In case (a), every center c ∈ ball_{C*}(p, 3L) has ball_U(c, L) = ∅. In this case, every point q ∈ ball_U(p, 2L) has d(q, C*) > L: if d(p, c) > 3L for some c ∈ C*, then d(q, c) ≥ d(p, c) − d(p, q) > 3L − 2L = L by the triangle inequality; and for every c ∈ C* with d(p, c) ≤ 3L, we have ball_U(c, L) = ∅, implying that d(q, c) > L as q ∈ U. Thus, ball_U(p, 2L) ⊆ O. So, Step 3 in this iteration will decrease |O ∩ U| by at least |ball_U(p, 4L)| ≥ |ball_U(p, 2L)| > y.

Consider case (b), where some c ∈ ball_{C*}(p, 3L) has ball_U(c, L) ≠ ∅. Then ball_U(p, 4L) ⊇ ball_U(c, L) will be removed from U by Step 3 in this iteration. Thus,

1. if case (a) happens, then |U ∩ O| is decreased by more than y in this iteration;

2. otherwise case (b) happens; then for some c ∈ C*, ball_U(c, L) changes from non-empty to empty.

The first event can happen for at most |O|/y ≤ ẑ/y iterations and the second event can happen for at most |C*| ≤ k iterations. So, |Q′| ≤ k + ẑ/y.

2.2 The Main Algorithm

We are now ready to describe the main algorithm for the (k, z)-center problem, given in Algorithm 3. In the first round, each machine i calls aggregating(P_i, L, εz/(km)) to obtain (P′_i, w′_i). All the machines first send their corresponding |P′_i| to the coordinator. In Round 2, the coordinator checks whether Σ_{i∈[m]} |P′_i| is small or not. If yes, it sends a "Yes" message to all machines; otherwise it returns "No" and terminates the algorithm. In Round 3, if a machine i received a "Yes" message from the coordinator, it sends the dataset P′_i with the weight vector w′_i to the coordinator. 
Finally, in Round 4, the coordinator collects all the weighted points P′ = ⋃_{i∈[m]} P′_i and runs kzc on these points.

Algorithm 3 dist-kzc
input on all parties: n, k, z, m, L, ε
input on machine i: dataset P_i with |P_i| = n_i
output: a set C′ ⊆ P or "No" (which certifies L* > L)

Round 1 on machine i ∈ [m]
1: (P′_i, w′_i) ← aggregating(P_i, L, εz/(km))
2: send |P′_i| to the coordinator

Round 2 on the coordinator
1: if Σ_{i∈[m]} |P′_i| > km(1 + 1/ε) then return "No" else send "Yes" to each machine i ∈ [m]

Round 3 on machine i ∈ [m]
1: upon receiving a "Yes" message from the coordinator, respond by sending (P′_i, w′_i)

Round 4 on the coordinator
1: let P′ ← ⋃_{i=1}^{m} P′_i
2: let w′ be the function from P′ to Z_{>0} obtained by merging w′_1, w′_2, …, w′_m
3: let z′ ← (1 + ε)z + w′(P′) − n
4: if z′ < 0 then return "No" else return kzc(k, z′, (P′, w′), L′ = 5L)

An immediate observation about the algorithm is that its communication cost is small:

Claim 2.4. The communication cost of dist-kzc is O(km/ε).

Proof. The total communication cost of Rounds 1 and 2 is O(m). We run Round 3 only when the coordinator sent the "Yes" message, in which case the communication cost is at most Σ_{i=1}^{m} |P′_i| ≤ km(1 + 1/ε) = O(km/ε). 
Let Vi =(cid:83)\ni\u2208[m] Vi, P (cid:48) = (cid:83)\n\nIt is convenient to de\ufb01ne some notations before we make further analysis. For every machine i \u2208 [m],\nlet P (cid:48)\nballPi(p, 4L) be the set of\npoints in Pi that are within distance at most 4L to some point in P (cid:48)\ni . Notice that this is the de\ufb01nition\nof V in Claim 2.2 for the execution of aggregating on machine i. Let Ui = Pi \\ Vi; this is the set U\nat the end of this execution. Let fi be the mapping from Vi to P (cid:48)\ni satisfying Property 3 of Claim 2.2.\ni and f be the function from V to P (cid:48), obtained by merging\nf1, f2,\u00b7\u00b7\u00b7 , fm. Thus (p, f (p)) \u2264 4L,\u2200p \u2208 V and w(cid:48)(q) = |f\u22121(q)|,\u2200q \u2208 P (cid:48).\nClaim 2.5. If dist-kzc returns a set C(cid:48), then C(cid:48) is a (k, (1 + \u0001)z)-center solution to the instance P\nwith cost at most 24L.\nProof. C(cid:48) must be returned in Step 4 in Round 4. By Theorem 2.1 for kzc, C(cid:48) is a (k, z(cid:48))-center\n\nc\u2208C(cid:48) ballP (cid:48)(c, 20L)(cid:1) \u2264 z(cid:48).\nc\u2208C(cid:48) ballP (cid:48)(c, 20L)(cid:1) \u2265 w(cid:48)(P (cid:48)) \u2212 z(cid:48) = n \u2212 (1 + \u0001)z. Notice that for each q \u2208 P (cid:48),\nc\u2208C(cid:48) ballP (c, 24L)(cid:12)(cid:12) \u2264 (1 + \u0001)z.\n\nsolution to (P (cid:48), w(cid:48)) of cost at most 4 \u00b7 5L = 20L. That is, w(cid:48)(cid:0)P (cid:48) \\(cid:83)\nThis implies w(cid:48)(cid:0)(cid:83)\nc\u2208C(cid:48) ballP (c, 24L)(cid:12)(cid:12) \u2265 n\u2212 (1 + \u0001)z, which is exactly(cid:12)(cid:12)P \\(cid:83)\n(cid:12)(cid:12)(cid:83)\nreturn a set C(cid:48). We de\ufb01ne C\u2217 \u2286 P to be a set of size k such that |P \\(cid:83)\nI =(cid:83)\nLemma 2.6. After Round 1, we have(cid:80)\n\nWe can now assume L \u2265 L\u2217 and we need to prove that we must reach Step 4 in Round 4 and\nc\u2208C\u2217 ball(c, L)| \u2264 z. 
Let\nc\u2208C\u2217 ballP (c, L) be the set of \u201cinliers\u201d according to C\u2217 and O = P \\ I be the set of outliers.\n\nthe set f\u22121(q) \u2286 V \u2286 P of points are within distance 4L from q and w(cid:48)(q) = |f\u22121(q)|. So,\n\nThus, |I| \u2265 n \u2212 z and |O| \u2264 z.\n\ni| \u2264 km(1 + 1/\u0001).\n\ni\u2208[m] |P (cid:48)\n7\n\n\f(cid:12)(cid:12)(cid:12)Pi \\(cid:83)\n\nProof. Let zi = |Pi \u2229 O| =\na (k, zi)-center solution to the instance Pi with cost at most L. By Lemma 2.3, we have that\n|P (cid:48)\ni| \u2264 k + zi\n\nc\u2208C\u2217 ballPi(c, L)\n\n(cid:12)(cid:12)(cid:12) be the set of outliers in Pi. Then, C\u2217 is\ni\u2208[m] zi \u2264 km(cid:0)1 + 1\n\n(cid:1) .\n\n\u0001\n\n(cid:80)\n\n(cid:80)\n\u0001z/(km). So, we have\ni\u2208[m] |P (cid:48)\n\ni| \u2264 km + km\n\n\u0001z\n\nTherefore, the coordinator will not return \u201cNo\u201d in Round 2. It remains to prove the following Lemma.\nLemma 2.7. Algorithm 3 will reach Step 4 in Round 4 and return a set C(cid:48).\nProof. See Figure 2 for the illustration of the proof. By Property 2 of Claim 2.2, we have\n|ballUi (p, 2L)| \u2264 \u0001z\nkm for every p \u2208 Ui since Ui \u2286 Pi. This implies that for every c \u2208 C\u2217,\nwe have |ballUi(c, L)| \u2264 \u0001z\nkm. 
(Otherwise, take an arbitrary p ∈ ball_{U_i}(c, L): every point of ball_{U_i}(c, L) lies within distance 2L of p, so |ball_{U_i}(p, 2L)| > εz/(km), a contradiction.) Therefore, for every i ∈ [m],

|U_i ∩ I| = |∪_{c∈C*} ball_{U_i}(c, L)| ≤ Σ_{c∈C*} |ball_{U_i}(c, L)| ≤ Σ_{c∈C*} εz/(km) = εz/m.

Consequently,

Σ_{i∈[m]} |I ∩ V_i| = Σ_{i∈[m]} (|I ∩ P_i| − |I ∩ U_i|) ≥ Σ_{i∈[m]} (|I ∩ P_i| − εz/m) = |I| − εz ≥ n − (1 + ε)z.

For every p ∈ V ∩ I, f(p) has distance at most L + 4L = 5L to some center in C*. Also, noticing that w'(q) = |f^{-1}(q)| for every q ∈ P', we have

w'(∪_{c∈C*} ball_{P'}(c, 5L)) ≥ |V ∩ I| ≥ n − (1 + ε)z.

So w'(P' \ ∪_{c∈C*} ball_{P'}(c, 5L)) ≤ w'(P') − n + (1 + ε)z = z'. This implies that z' ≥ 0, and that there is a (k, z')-center solution C* ⊆ P to the instance (P', w') of cost at most 5L. Thus dist-kzc will reach Step 4 of Round 4 and return a set C'. This finishes the proof of the lemma.

We now briefly analyze the running times of the algorithms on all parties. The running time of computing P'_i on machine i in Round 1 is O(n_i²), and this is the bottleneck for machine i. Considering all possible values of L, the running time on machine i is O(n_i² · log Δ). The running time of the Round-4 algorithm of the central coordinator for one value of L is O((km/ε)²). We sort all the interesting L values in increasing order.
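Since a run of dist-kzc with a larger threshold L can only be more likely to succeed, the coordinator can binary-search the sorted candidate values for the smallest one on which a solution is returned. A minimal sketch, assuming a hypothetical oracle run_for(L) that performs one full execution of dist-kzc and returns a center set, or None for a "No" answer:

```python
from typing import Callable, List, Optional, Tuple

def smallest_feasible_L(
    candidates: List[float],
    run_for: Callable[[float], Optional[list]],
) -> Optional[Tuple[float, list]]:
    """Binary search over candidate thresholds sorted in increasing order.

    Assumes feasibility is monotone: if run_for(L) succeeds, it also
    succeeds for every larger candidate.  Returns (L', C') for the
    smallest successful candidate, or None if every candidate fails.
    """
    lo, hi = 0, len(candidates) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        sol = run_for(candidates[mid])
        if sol is not None:
            best = (candidates[mid], sol)
            hi = mid - 1  # a smaller feasible candidate may exist
        else:
            lo = mid + 1
    return best
```

Each oracle call is a full run of the distributed algorithm and dominates the cost; the number of calls is only logarithmic in the number of candidate values.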
The central coordinator can use binary search to find some L' such that the main algorithm outputs a set C' for L = L', but outputs "No" when L is the value preceding L' in the ordering. So the running time of the central coordinator can be made O((km/ε)² · log log Δ).

The quadratic dependence of the running time of machine i on n_i might be an issue when n_i is big; we discuss how to alleviate this issue in the supplementary material.

3 Conclusion

In this paper, we give a distributed (24(1 + ε), 1 + ε)-bicriteria approximation for the (k, z)-center problem, with communication cost O((km/ε) · log Δ). The running times of the algorithms for all parties are polynomial. We evaluate the algorithm on real-world data sets; it outperforms most previous algorithms, matching the performance of the state-of-the-art method [10].

For the (k, z)-median/means problem, we give a distributed (1 + ε, 1 + ε)-bicriteria approximation algorithm with communication cost O(Φ_D · log Δ), where Φ_D is the upper bound on the size of the coreset constructed using the algorithm of [2]. The central coordinator needs to solve the optimization problem of finding a solution that is simultaneously good for O(log(Δn/ε)/ε) k-median/means instances. Since the approximation ratio for this problem goes into both factors of the bicriteria ratio, we really need a (1 + ε)-approximation for the optimization problem. Unfortunately, solving k-median/means alone is already APX-hard, and we do not know a heuristic algorithm that works well in practice (e.g., a counterpart to Lloyd's algorithm for k-means).
It is interesting to study whether a different approach can lead to a polynomial-time distributed algorithm with an O(1)-approximation guarantee.

Acknowledgments

This research was supported by NSF grants CCF-1566356 and CCF-1717134.

References

[1] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 61-72, 2017.

[2] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In Advances in Neural Information Processing Systems 26, NIPS 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 1995-2003, 2013.

[3] Jaroslaw Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median and positive correlation in budgeted optimization. ACM Trans. Algorithms, 13(2):23:1-23:31, 2017.

[4] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the 12th Annual Symposium on Discrete Algorithms, January 7-9, 2001, Washington, DC, USA, pages 642-651, 2001.

[5] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practical algorithm for distributed clustering and outlier detection. CoRR, abs/1805.09495, 2018.

[6] Jiecao Chen, He Sun, David P. Woodruff, and Qin Zhang. Communication-optimal distributed clustering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3720-3728, 2016.

[7] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008, pages 826-835, 2008.

[8] Hu Ding, Yu Liu, Lingxiao Huang, and Jian Li. k-means clustering with distributed dimensions. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1339-1348, 2016.

[9] Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 681-689, 2011.

[10] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 143-152, 2017.

[11] Sungjin Im and Benjamin Moseley. Brief announcement: Fast and better distributed MapReduce algorithms for k-center clustering. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA 2015, Portland, OR, USA, June 13-15, 2015, pages 65-67, 2015.

[12] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 646-659, 2018.

[13] Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems 28, NIPS 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1063-1071, 2015.