{"title": "Greedy Sampling for Approximate Clustering in the Presence of Outliers", "book": "Advances in Neural Information Processing Systems", "page_first": 11148, "page_last": 11157, "abstract": "Greedy algorithms such as adaptive sampling (k-means++) and furthest point traversal are popular choices for clustering problems. On the one hand, they possess good theoretical approximation guarantees, and on the other, they are fast and easy to implement. However, one main issue with these algorithms is the sensitivity to noise/outliers in the data. In this work we show that for k-means and k-center clustering, simple modifications to the well-studied greedy algorithms result in nearly identical guarantees, while additionally being robust to outliers. For instance, in the case of k-means++, we show that a simple thresholding operation on the distances suffices to obtain an O(\\log k) approximation to the objective. We obtain similar results for the simpler k-center problem. Finally, we show experimentally that our algorithms are easy to implement and scale well. We also measure their ability to identify noisy points added to a dataset.", "full_text": "Greedy Sampling for Approximate Clustering in the Presence of Outliers\n\nAditya Bhaskara\nUniversity of Utah\n\nbhaskaraaditya@gmail.com\n\nSharvaree Vadgama\nUniversity of Utah\n\nsharvaree.vadgama@gmail.com\n\nHong Xu\n\nUniversity of Utah\n\nhxu.hongxu@gmail.com\n\nAbstract\n\nGreedy algorithms such as adaptive sampling (k-means++) and furthest point\ntraversal are popular choices for clustering problems. On the one hand, they\npossess good theoretical approximation guarantees, and on the other, they are fast\nand easy to implement. However, one main issue with these algorithms is the\nsensitivity to noise/outliers in the data. 
In this work we show that for k-means and\nk-center clustering, simple modi\ufb01cations to the well-studied greedy algorithms\nresult in nearly identical guarantees, while additionally being robust to outliers. For\ninstance, in the case of k-means++, we show that a simple thresholding operation\non the distances suf\ufb01ces to obtain an O(log k) approximation to the objective.\nWe obtain similar results for the simpler k-center problem. Finally, we show\nexperimentally that our algorithms are easy to implement and scale well. We also\nmeasure their ability to identify noisy points added to a dataset.\n\n1\n\nIntroduction\n\nClustering is one of the fundamental problems in data analysis. There are several formulations that\nhave been very successful in applications, including k-means, k-median, k-center, and various notions\nof hierarchical clustering (see [19, 12] and references there-in).\nIn this paper we will consider k-means and k-center clustering. These are both extremely well-studied.\nThe classic algorithm of Gonzalez [16] for k-center clustering achieves a factor 2 approximation,\nand it is NP-hard to improve upon this for general metrics, unless P equals NP. For k-means, the\nclassic algorithm is due to Lloyd [23], proposed over 35 years ago. Somewhat recently, [4] (see\nalso [25]) proposed a popular variant, known as \u201ck-means++\u201d. This algorithm remedies one of\nthe main drawbacks of Lloyd\u2019s algorithm, which is the lack of theoretical guarantees. [4] proved\nthat the k-means++ algorithm yields an O(log k) approximation to the k-means objective (and also\nimproves performance in practice). By way of more complex algorithms, [21] gave a local search\nbased algorithm that achieves a constant factor approximation. Recently, this has been improved\nby [2], which is the best known approximation algorithm for the problem. 
The best known hardness\nresults rule out polynomial time approximation schemes [3, 11].\nThe algorithms of Gonzalez (also known as furthest point traversal) and [4] are appealing also\ndue to their simplicity and ef\ufb01ciency. However, one main drawback in these algorithms is their\nsensitivity to corruptions/outliers in the data. Imagine 10k of the points of a dataset are corrupted and\nthe coordinates take large values. Then both furthest point traversal as well as k-means++ end up\nchoosing only the outliers. The goal of our work is to remedy this problem, and achieve the simplicity\nand scalability of these algorithms, while also being robust in a provable sense.\nSpeci\ufb01cally, our motivation will be to study clustering problems when some of the input points\nare (possibly adversarially) corrupted, or are outliers. Corruption of inputs is known to make even\nsimple learning problems extremely dif\ufb01cult to deal with. For instance, learning linear classi\ufb01ers\nin the presence of even a small fraction of noisy labels is a notoriously hard problem (see [18, 5]\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fand references therein). The \ufb01eld of high dimensional robust statistics has recently seen a lot of\nprogress on various problems in both supervised and unsupervised learning (see [20, 14]). The main\ndifference between our work and the works in robust statistics is that our focus is not to estimate a\nparameter related to a distribution, but to instead produce clusterings that are near-optimal in terms of\nan objective that is de\ufb01ned solely on inliers.\nFormulating clustering with outliers. Let OPTfull(X) denote the k-center or k-means objective\non a set of points X. 
Now, given a set of points that also includes outliers, the goal in clustering with\noutliers (see [7, 17, 22]) is to partition the points X into X_in and X_out so as to minimize OPT_full(X_in).\nTo avoid the trivial case of setting X_in = \u2205, we require |X_out| \u2264 z, for some parameter z that is also\ngiven. Thus, we define the optimum OPT of the k-clustering with outliers problem as\n\nOPT := min_{|X_out| \u2264 z} OPT_full(X \\ X_out).\n\nThis way of defining the objective has also found use for other problems such as PCA with outliers\n(also known as robust PCA, see [6] and references therein). For the problems we consider, namely\nk-center and k-means, there are many existing works that provide approximation algorithms for OPT\nas defined above. The early work of [7] studied the problem of k-median and facility location in this\nsetup. The algorithms provided were based on linear programming relaxations, and were primarily\nmotivated by the theoretical question of the power of such relaxations. More recently, [17] gave a more\npractical local search based algorithm, with running time quadratic in the number of points (which\ncan also be reduced to a quadratic dependence on z, in the case z \u226a n). Both of these algorithms\nare bi-criteria approximations (defined formally below). In other words, they allow the algorithm to\ndiscard > z outliers, while obtaining a good approximation to the objective value OPT. In practice,\nthis corresponds to declaring a small number of the inliers as outliers. In applications where the true\nclusters are robust to small perturbations, such algorithms are acceptable.\nThe recent result of [22] (and the earlier result of [10] for k-median) go beyond bi-criteria approximation.\nThey prove that for k-means clustering, one can obtain a factor 50 approximation to the value\nof OPT, while declaring at most z points as outliers, as desired. 
While this effectively settles the\ncomplexity of the problem, there are several key drawbacks. First, the algorithm is based on an iterative\nprocedure that solves a linear programming relaxation in each step, which can be very inefficient in\npractice (and hard to implement). Further, in many applications, it may be necessary to improve on\nthe (factor 50) approximation guarantee, potentially at the cost of choosing more clusters or slightly\nweakening the bound on the number of outliers.\nOur main results aim to address these drawbacks. We prove that very simple variants of the classic\nGonzalez algorithm for k-center, and the k-means++ algorithm for k-means result in approximation\nguarantees. The catch is that we only obtain bi-criteria results. To state our results, we will define the\nfollowing notion.\nDefinition 1. Consider an algorithm for the k-clustering (means/center) problem that on input X, k, z,\noutputs k' centers (allowed to be slightly more than k), along with a partition X = X'_in \u222a X'_out that\nsatisfies (a) |X'_out| \u2264 \u03b2z, and (b) the objective value of assigning the points X'_in to the output centers\nis at most \u03b1 \u00b7 OPT.\nThen we say that the algorithm obtains an (\u03b1, \u03b2) approximation using k' centers, for the k-clustering\nproblem with outliers.\n\nNote that while our main results only output k centers, clustering algorithms are also well-studied\nwhen the number of clusters is not strictly specified. This is common in practice, where the application\nonly demands a rough bound on the number of clusters. Indeed, the k-means++ algorithm is known\nto achieve much better approximations (constant as opposed to O(log k)) for the problem without\noutliers, when the number of centers output is O(k) instead of k [1, 26].\n\n1.1 Our results.\nK-center clustering in metric spaces. 
For k-center, our algorithm is a variant of furthest point\ntraversal, in which instead of selecting the furthest point from the current set of centers, we choose a\nrandom point that is not too far from the current set. Our results are the following.\nTheorem 1.1. Let z, k, \u03b5 > 0 be given parameters, and X = X_in \u222a X_out be a set of points in a metric\nspace with |X_out| \u2264 z. There is an efficient randomized algorithm that with probability 3/4 outputs\na (2 + \u03b5, 4 log k)-approximation using precisely k centers to the k-center with outliers problem.\n\nRemark \u2013 guessing the optimum. The additional \u03b5 in the approximation is because we require\nguessing the value of the optimum. This is quite standard in clustering problems, and can be done\nby a binary search. If OPT is assumed to lie in the range (c, c\u0394) for some c > 0, then it can be\nestimated up to an error of c\u03b5 in time O(log(\u0394/\u03b5)), which gets added as a factor in the running time\nof the algorithm. In practice, this is often easy to achieve with \u0394 = poly(n). We will thus assume\nknowledge of the optimum value in both our algorithms.\nAlso, note that the algorithm outputs exactly k centers, and obtains the same (factor 2, up to \u03b5)\napproximation to the objective as the Gonzalez algorithm, but after discarding O(z log k) points as\noutliers. Next, we will show that if we allow the algorithm to output > k centers, one can achieve a\nbetter dependence on the number of points discarded.\nTheorem 1.2. Let z, k, c, \u03b5 > 0 be given parameters, and X = X_in \u222a X_out be a set of points in a\nmetric space with |X_out| \u2264 z. There is an efficient randomized algorithm that with probability 3/4\noutputs a (2 + \u03b5, (1 + c)/c)-approximation using (1 + c)k centers to the k-center with outliers problem.\n\nAs c increases, note that the algorithm outputs very close to z outliers. 
In other words, the number of\npoints it falsely discards as outliers is small (at the expense of a larger number of centers).\n\nK-means clustering. Here, our main contribution is to study an algorithm called T-kmeans++, a\nvariant of D\u00b2 sampling (i.e. k-means++), in which the distances are thresholded appropriately before\nprobabilities are computed. For this simple variant, we will establish robust guarantees that nearly\nmatch the guarantees known for k-means++ without any outliers.\nTheorem 1.3. Let z, k, \u03b2 be given parameters, and X = X_in \u222a X_out be a set of points in Euclidean\nspace with |X_out| \u2264 z. There is an efficient randomized algorithm that with probability 3/4 gives\nan (O(log k), O(log k))-approximation using k centers to the k-means with outliers problem on X.\n\nThe algorithm outputs an O(log k) approximation to the objective value (similar to k-means++).\nHowever, the algorithm may discard up to O(z log k) points as outliers. Note also that when z = 0,\nwe recover the usual k-means++ guarantee. As in the case of k-center, we ask if allowing a bi-criteria\napproximation improves the dependence on the number of outliers. Here, an additional dimension\nalso comes into play. For k-means++, it is known that choosing O(k) centers lets us approximate the\nk-means objective up to an O(1) factor (see, for instance, [1, 4, 25]). We can thus ask if a similar\nresult is possible in the presence of outliers. We show that the answer to both the questions is yes.\nTheorem 1.4. Let z, k, \u03b2, c be given parameters, and X = X_in \u222a X_out be a set of points in a metric\nspace with |X_out| \u2264 z. Let \u03b4 > 0 be an arbitrary constant. 
There is an efficient randomized algorithm\nthat with probability 3/4 gives a ((\u03b2 + 64), (1 + c)(1 + \u03b2)/c(1 \u2212 \u03b4))-approximation using (1 + c)k\ncenters to the k-means with outliers problem on X.\n\nGiven the simplicity of our procedure, it is essentially as fast as k-means++ (modulo the step of\nguessing the optimum value, which adds a logarithmic overhead). Assuming that this is O(log n),\nour running times are all \u00d5(kn). In particular, the procedure is significantly faster than local search\napproaches [17], as well as linear programming based algorithms [22, 10]. Our run times also\ncompare well with those of recent, coreset based approaches to clustering with outliers, such as those\nof [9, 24] (see also references therein).\n\n1.2 Overview of techniques\nTo show all our results, we consider simple randomized modifications of classic algorithms, specifically\nGonzalez\u2019s algorithm and the k-means++ algorithm. Our modifications, in effect, place a\nthreshold on the probability of any single point being chosen. The choice of the threshold ensures\nthat during the entire course of the algorithm, only a small number of outlier points will be chosen.\nOur analysis thus needs to keep track of (a) the number of points being chosen, (b) the number of\ninlier clusters from which we have chosen points (and in the case of k-means, points that are \u201cclose\nto the center\u201d), (c) the number of \u201cwasted\u201d iterations, due to choosing outliers. We use different potential\nfunctions to keep track of these quantities and measure progress. These potentials are directly inspired\nby the elegant analysis of the k-means++ algorithm provided in [13] (which is conceptually simpler\nthan the original one in [4]).\n\n2 Warm-up: Metric k-center in the presence of outliers\n\nLet (X, d) be a metric space. 
Recall that the classic Gonzalez algorithm [16] for k-center works\nby maintaining a set of centers S, and at each step finding the point x \u2208 X that is furthest from S\nand adding it to S. After k iterations, a simple argument shows that the S obtained gives a factor 2\napproximation to the best k centers in terms of the k-center objective.\nAs we described earlier, this furthest point traversal algorithm is very susceptible to the presence\nof outliers. In particular, if the input X includes z > k points that are far away from the rest of the\npoints, all the points selected (except possibly the first) will be outliers. Our main idea to overcome\nthis problem is to ensure that no single point is too likely to be picked in each step. Consider the\nsimple strategy of choosing one of the 2z points furthest away from S (uniformly at random; we are\nassuming n \u2265 2z + k). This ensures that in every step, there is at least a 1/2 probability of picking\nan inlier (as there are only z outliers). In what follows, we will improve upon this basic idea and\nshow that it leads to a good approximation to the objective restricted to the inliers.\nThe algorithm for proving Theorems 1.1 and 1.2 is very simple: in every step, a center is added to\nthe current solution by choosing a uniformly random point in the dataset that is at a distance > 2r\nfrom the current centers. As discussed in Section 1.2, our proofs of both the theorems employ an\nappropriately designed potential function, adapted from [13].\n\nAlgorithm 1 k-center with outliers\nInput: points X \u2286 R^d, parameters k, z, r; r is a guess for OPT\nOutput: a set S_\u2113 \u2286 X of size \u2113\n1: Initialize S_0 = \u2205\n2: for t = 1 to \u2113 do\n3:   Let F_t be the set of all points that are at a distance > 2r from S_{t\u22121}, i.e., F_t := {x \u2208 X : d(x, S_{t\u22121}) > 2r}\n4:   Let x be a point sampled u.a.r. from F_t\n5:   S_t = S_{t\u22121} \u222a {x}\n6: return S_\u2113\n\nNotation. Let C_1, C_2, . . . , C_k be the optimal clusters. So by definition, \u222a_i C_i = X_in. 
Let F_t be the\nset of far away points at time t, as defined in the algorithm. Thus F_t includes both inliers and outliers.\nA simple observation about the algorithm is the following.\nObservation 1. Suppose that the guess of r is OPT, and consider any iteration t of the algorithm.\nLet u \u2208 C_i be one of the chosen centers (i.e., u \u2208 S_t). Then C_i \u2229 F_t = \u2205, and thus no other point in\nC_i can be subsequently added as a center.\n\nFinally, we denote by E_i^(t) the set of points in cluster C_i that are at a distance > 2r from S_t. I.e., we\ndefine E_i^(t) := C_i \u2229 F_t. The observation above implies that E_i^(t) = \u2205 whenever S_t contains some\nu \u2208 C_i. But the converse is not necessarily true (since all the points in C_i could be at a distance < 2r\nfrom points in other clusters, which happened to be picked in S_t).\nNext, let n_t denote the number of clusters i such that C_i \u2229 S_t = \u2205, i.e., the number of clusters none\nof whose points were selected so far. We are now ready to analyze the algorithm.\n\n2.1 Algorithm choosing k-centers\n\nWe will now analyze the execution of Algorithm 1 for k iterations, thereby establishing Theorem 1.1.\nThe key step is to define the appropriate potential function. To this end, let w_t denote the number of\ntimes that one of the outliers was added to the set S in the first t iterations. I.e., w_t = |X_out \u2229 S_t|.\nThe potential we consider is now:\n\n\u03a8_t := w_t |F_t \u2229 X_in| / n_t.   (1)\n\nOur main lemma bounds the expected increase in \u03a8_t, conditioned on any choice of S_t (recall that S_t\ndetermines n_t).\nLemma 1. Let S_t be any set of centers chosen in the first t iterations, for some t \u2265 0. We have\n\nE_{t+1}[\u03a8_{t+1} \u2212 \u03a8_t | S_t] \u2264 z / n_t.\n\nAs usual, E_{t+1} denotes an expectation only over the (t + 1)th step. Let us first see how the lemma\nimplies Theorem 1.1.\n\nProof of Theorem 1.1. The idea is to repeatedly apply Lemma 1. 
Since we do not know the values of\nn_t, we use the simple lower bound n_t \u2265 k \u2212 t, for any t < k.\nAlong with the observation that \u03a8_0 = 0 (since w_0 = 0), we have\n\nE[\u03a8_k] = \u2211_{t=0}^{k\u22121} E[\u03a8_{t+1} \u2212 \u03a8_t] \u2264 \u2211_{t=0}^{k\u22121} z/(k \u2212 t) \u2264 z H_k,\n\nwhere H_k is the kth Harmonic number. Thus by Markov\u2019s inequality, Pr[\u03a8_k \u2264 4zH_k] \u2265 3/4. By\nthe definition of \u03a8_k, this means that with probability at least 3/4,\n\nw_k |F_t \u2229 X_in| / n_k \u2264 4z ln k.\n\nThe key observation is that we always have w_k = n_k. This is because if the set S_k did not intersect n_k\nof the optimal clusters, then since S_k cannot include two points from the same cluster (as we observed\nearlier), precisely n_k of the iterations must have chosen outliers. This means that with probability at\nleast 3/4, we have |F_t \u2229 X_in| \u2264 4z ln k. This means that after k iterations, with probability at least\n3/4, at most 4z ln k of the inliers are at a distance > 2r away from the chosen set S_k. Thus the total\nnumber of points at a distance > 2r away from S_k is at most z(4 ln k + 1). This completes the proof\nof the theorem.\n\nWe thus only need to show Lemma 1.\n\nProof of Lemma 1. For simplicity, let us write e_i := |E_i^(t)| = |C_i \u2229 F_t|. In other words e_i is the\nnumber of points in the ith optimal cluster that are at distance > 2r from S_t. Let us also write\nF = \u2211_i e_i. By definition, we have that F = |F_t \u2229 X_in|.\nThen, the sampling in the (t + 1)th iteration samples an inlier with probability F/|F_t|, and an outlier\nwith probability 1 \u2212 F/|F_t|. If an inlier is sampled, the value n_t reduces by 1, but w_t stays the same. If\nan outlier is sampled, the value n_t stays the same, while w_t increases by 1. The value of |F_t \u2229 X_in| is\nnon-increasing. If a point in C_i is chosen (which happens with probability e_i/|F_t|), it reduces by at\nleast e_i. 
Thus, we have\n\nE[\u03a8_{t+1}] \u2264 \u2211_{i=1}^{k} (e_i/|F_t|) \u00b7 w_t (F \u2212 e_i)/(n_t \u2212 1) + (1 \u2212 F/|F_t|) \u00b7 (w_t + 1) F/n_t.   (2)\n\nThe first term on the RHS can be simplified as\n\n(w_t / (|F_t| (n_t \u2212 1))) \u2211_i e_i (F \u2212 e_i) = (w_t / (|F_t| (n_t \u2212 1))) (F\u00b2 \u2212 \u2211_i e_i\u00b2).\n\nThe number of non-zero e_i is at most n_t, by definition. Thus we have \u2211_i e_i\u00b2 \u2265 F\u00b2/n_t. Plugging this\ninto (2) and simplifying, we have\n\nE[\u03a8_{t+1}] \u2264 w_t F\u00b2/(|F_t| n_t) + (1 \u2212 F/|F_t|) \u00b7 (w_t + 1) F/n_t = \u03a8_t + (1 \u2212 F/|F_t|) \u00b7 F/n_t.\n\nThe proof now follows by using the simple facts: (1 \u2212 F/|F_t|) \u2264 z/|F_t| (which is true because there are\nat most z outliers) and F \u2264 |F_t| (which is true by definition, because F = |X_in \u2229 F_t|).\n\nThis completes the analysis of Algorithm 1 when the number of centers \u2113 is exactly k.\n\n2.2 Bi-criteria approximation\nNext, we see that running Algorithm 1 for \u2113 = (1 + c)k iterations results in covering more clusters\n(thus resulting in fewer outliers). Thus we end up with a tradeoff between the number of centers chosen\nand the number of points the algorithm declares as outliers (while obtaining the same approximation\n(factor 2) for the objective OPT \u2013 Theorem 1.2). The potential function now needs modification. The\ndetails are deferred to Section A.1.\n\n3 k-means via thresholded adaptive sampling\n\nNext we consider the k-means problem when some of the points are outliers. Here we propose\na variant of the k-means++ procedure (see [4]), which we call T-kmeans++. Our algorithm, like\nk-means++, is an iterative algorithm that samples a point to be a centroid at each iteration according\nto a probability that depends on the distance to the current set of centers. However, we avoid the\nproblem of picking too many outliers by simply thresholding the distances.\nNotation. Let us start with some notation that we use for the remainder of the paper. 
The points\nX are now in a Euclidean space (as opposed to an arbitrary metric space in Section 2). We assume\nas before that |X| = n, and X = X_in \u222a X_out, where |X_out| = z, which is a known parameter.\nAdditionally, \u03b2 will be a parameter that we will control. For the purposes of defining the algorithm,\nwe assume that we have a guess for the optimum objective value, denoted OPT.\nNow, for any set of centers C, we define\n\n\u03c4(x, C) = min( d(x, C)\u00b2, \u03b2 \u00b7 OPT / z ).   (3)\n\nWe follow the standard practice of defining the distance to an empty set to be \u221e. Next, for any set of\npoints U, define \u03c4(U, C) = \u2211_{x \u2208 U} \u03c4(x, C). Note that the parameter \u03b2 lets us interpolate between\nuniform sampling (\u03b2 \u2192 0), and classic D\u00b2 sampling (\u03b2 \u2192 \u221e). In our results, choosing a higher \u03b2\nhas the effect of reducing the number of points we declare as outliers, at the expense of having a\nworse guarantee on the approximation ratio for the objective.\nWe can now state our algorithm (denoted Algorithm 2).\n\nAlgorithm 2 Thresholded Adaptive Sampling \u2013 T-kmeans++\nInput: a set of points X \u2286 R^d, parameters k, z, and a guess for the optimum OPT.\nOutput: a set S \u2286 X of size \u2113.\n1: Initialize S_0 = \u2205.\n2: for t = 1 . . . \u2113 do\n3:   sample a point x from the distribution p(x) = \u03c4(x, S_{t\u22121}) / \u2211_{x' \u2208 X} \u03c4(x', S_{t\u22121})   (with \u03c4 as defined in (3))\n4:   set S_t = S_{t\u22121} \u222a {x}.\n5: return S_\u2113\n\nThe key to the analysis is the following observation, that instead of the k-means objective, it\nsuffices to bound the quantity \u2211_{x \u2208 X} \u03c4(x, S_\u2113).\nLemma 2. Let C be a set of centers, and suppose that \u03c4(X, C) \u2264 \u03b1 \u00b7 OPT. Then we can partition\nX into X'_in and X'_out such that\n1. \u2211_{x \u2208 X'_in} d(x, C)\u00b2 \u2264 \u03b1 \u00b7 OPT, and\n2. |X'_out| \u2264 \u03b1z/\u03b2.\n\nProof. The proof follows easily from the definition of \u03c4 (Eq. (3)). 
Let X'_out be the set of points for\nwhich d(x, C)\u00b2 > \u03b2 \u00b7 OPT/z, and let X'_in be X \\ X'_out. Then by definition (and the bound on \u03c4(X, C)),\nwe have\n\n\u2211_{x \u2208 X'_in} d(x, C)\u00b2 + |X'_out| \u00b7 \u03b2 \u00b7 OPT / z \u2264 \u03b1 \u00b7 OPT.\n\nBoth the terms on the LHS are non-negative, so each of them is at most \u03b1 \u00b7 OPT. The bound on the\nfirst term is the first part of the lemma, and the bound on the second term rearranges to the second part.\n\n3.1 k-means with outliers: an O(log k) approximation\n\nOur first result is an analog of the theorem of [4], for the setting in which we have outliers in the data.\nAs in the case of k-center clustering, we use a potential based analysis (inspired from [13]).\nTheorem 3.1. Running Algorithm 2 for k iterations outputs a set S_k that satisfies\n\nE[\u03c4(X, S_k)] \u2264 (\u03b2 + O(1)) log k \u00b7 OPT.\n\nWe note that Theorem 3.1 together with Lemma 2 directly implies Theorem 1.3. Thus the main step\nis to prove Theorem 3.1. This is done using a potential function as before, but requires a more careful\nargument than the one for k-center (specifically, the goal is not to include some point from a cluster,\nbut to include a \u201ccentral\u201d one). Please see the supplement, Section A.2 for details.\n\n3.2 Bi-criteria approximation\n\nTheorem 3.2. Consider running Algorithm 2 for \u2113 = (1 + c)k iterations, where c > 0 is a constant.\nThen for any \u03b4 > 0, with probability \u03b4, the set S_\u2113 satisfies\n\n\u03c4(X, S_\u2113) \u2264 (\u03b2 + 64)(1 + c) OPT / ((1 \u2212 \u03b4) c).\n\nNote that this theorem directly implies Theorem 1.4 by repeating the algorithm O(1/\u03b4) times. Once\nagain, we use a slightly different potential function from the one for the O(log k) approximation. We\ndefer the details of the proof to Section A.3 of the supplementary material.\n\n4 Experiments\n\nIn this section, we demonstrate the empirical performance of our algorithm on multiple real and\nsynthetic datasets, and compare it to existing heuristics. 
We observe that the algorithm generally\nbehaves better than known heuristics, both in accuracy and (especially) in running time. Our real\nand synthetic datasets are designed in a manner similar to [17]. All real datasets we use are available\nfrom the UCI repository [15].\nk-center with outliers. We will evaluate Algorithm 1 on\nsynthetic data sets, where points are generated according to\na mixture of d-dimensional Gaussians. The outliers in this\ncase are chosen randomly in an appropriate bounding box.\nMetrics. For k-center, we choose synthetic datasets because\nwe wish to measure the cluster recall, i.e., the fraction\nof true clusters from which points are chosen by the\nalgorithm. (Ideally, if we choose k centers, we wish to\nhave precisely one point chosen from each cluster, so the\ncluster recall is 1). We compute this quantity for three\nalgorithms: the first is the trivial baseline of choosing k'\nrandom points from the dataset (denoted Random). The\nsecond and third are KC-Outlier and Gonzalez respectively.\nFigure 1 shows the recall as we vary the number\nof centers chosen. Note that when k = 20, even when\nroughly k' = 23 centers are chosen, we have a perfect recall (i.e., all the clusters are chosen) for our\nalgorithm. Meanwhile Random and Gonzalez take considerably longer to find all the clusters.\n\nFigure 1: Figure showing cluster recall for\nthe three algorithms, when d = 15, k = 20,\nz = 100 and n = 10120. The x axis shows\nthe number of clusters we pick.\n\n(Plot for Figure 1, titled Mean Recall: center recall against k for k-center-adaptive-sampling, Gonzalez, and random-sampling.)\n\nFigure 2: Figure showing the empirical cluster recall for the T-kmeans++ algorithm compared to prior\nheuristics. Here k = 20, z = 2000, n = 12020. The x axis shows the number of clusters we pick.\nk-means with outliers. Once again, we demonstrate the cluster recall on a synthetic dataset. 
In\nthis case, we compare our algorithm with a heuristic proposed in [17]: running k-means++ followed\nby an iteration of \u201coutlier-sensitive Lloyd\u2019s iteration\u201d, proposed in [8]. We also ran the latter procedure\nas a post-processing step for our algorithm. Figure 2 reports the cluster recall and the value of\nthe k-means objective for the algorithms. Unlike the case of k-center, the T-kmeans++ algorithm\ncan potentially choose points in one cluster multiple times. However, we consistently observe that\nT-kmeans++ outperforms the other heuristics.\nFinally, we perform experiments on three datasets:\n\n1. NIPS (a dataset from the conference NIPS over 1987-2015): clustering was done on the\nrows of a 11463 \u00d7 50 matrix (the number of columns was reduced via SVD).\n\n2. The MNIST digit-recognition dataset: clustering was performed on the rows of a 60000 \u00d7 40\nmatrix (again, SVD was used to reduce the number of columns).\n\n3. Skin Dataset (available via the UCI database): clustering was performed on the rows of a\n245,057 \u00d7 3 matrix (original dataset).\n\nIn order to simulate corruptions, we randomly choose 2.5% of the points in the datasets and corrupt\nall the coordinates by adding independent noise in a pre-defined range. The following table outlines\nthe results. We report the outlier recall, i.e., the fraction of true outliers designated as outliers by\nthe algorithm. For fair comparison, we make all the algorithms output precisely z outliers. Our\nresults indicate slightly better recall values for T-kmeans++. For some data sets (e.g. Skin), the\nk-means objective value is worse for T-kmeans++. 
Thus in this case, the outliers are not \u201csufficiently\ncorrupting\u201d the original clustering.1\n\nKM recall TKM recall KM objective TKM objective\n\nTable showing outlier recall for KM (k-means++) and TKM (T-kmeans++) along with the k-means cost.\n\n5 Conclusion\nWe proposed simple variants of known greedy heuristics for two popular clustering settings (k-center\nand k-means clustering) in order to deal with outliers/noise in the data. We proved approximation\nguarantees, comparing to the corresponding objectives on only the inliers. The algorithms are also\neasy to implement, run in \u00d5(kn) time, and perform well on both real and synthetic datasets.\n\n1 An anonymous reviewer suggested experiments on the kddcup-1999 dataset (as in [9]). However, we\nobserved that treating certain labels as outliers as done in the prior work is not meaningful: the outliers turn out\nto be closer to one of the cluster centers than many points in that cluster.\n\nSkin\n\nDataset\nNIPS\n\nk\n10\n20\n30\n10\n20\n30\nMNIST 10\n20\n30\n\n0.960\n0.939\n0.924\n0.619\n0.642\n0.630\n0.985\n0.982\n0.977\n\n0.977\n0.973\n0.978\n0.667\n0.690\n0.690\n0.988\n0.989\n0.986\n\n4173211\n4046443\n3956768\n7726552\n5936156\n5164635\n1.546 \u00d7 10^8\n1.475 \u00d7 10^8\n1.429 \u00d7 10^8\n\n4167724\n4112852\n4115889\n7439527\n5637427\n4853001\n1.513 \u00d7 10^8\n1.449 \u00d7 10^8\n1.412 \u00d7 10^8\n\n(Plots for Figure 2, titled Mean Center Recall and Mean Cost: center recall and cost against k for T-kmeans++, T-kmeans++ Lloyd, k-means++, and k-means++ Lloyd.)\n\nReferences\n[1] Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering.\nIn Proceedings of the 12th International Workshop and 13th International Workshop on\nApproximation, Randomization, and Combinatorial Optimization. 
Algorithms and Techniques,\nAPPROX \u201909 / RANDOM \u201909, pages 15\u201328, Berlin, Heidelberg, 2009. Springer-Verlag.\n\n[2] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-\nmeans and euclidean k-median by primal-dual algorithms. 2017 IEEE 58th Annual Symposium\non Foundations of Computer Science (FOCS), pages 61\u201372, 2017.\n\n[3] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. Np-hardness of euclidean\n\nsum-of-squares clustering. Machine Learning, 75(2):245\u2013248, May 2009.\n\n[4] David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In\nProceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA\n\u201907, pages 1027\u20131035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathe-\nmatics.\n\n[5] Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for\n\nef\ufb01ciently learning linear separators with noise. J. ACM, 63(6):50:1\u201350:27, January 2017.\n\n[6] Aditya Bhaskara and Srivatsan Kumar. Low rank approximation in the presence of outliers.\n\nCoRR, abs/1804.10696, 2018.\n\n[7] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility\nlocation problems with outliers. In Proceedings of the Twelfth Annual ACM-SIAM Symposium\non Discrete Algorithms, SODA \u201901, pages 642\u2013651, Philadelphia, PA, USA, 2001. Society for\nIndustrial and Applied Mathematics.\n\n[8] Sanjay Chawla and Aristides Gionis. k-means-: A uni\ufb01ed approach to clustering and outlier\n\ndetection. In SDM, 2013.\n\n[9] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practical algorithm for distributed clustering\nand outlier detection. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,\nand R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2248\u2013\n2256. Curran Associates, Inc., 2018.\n\n[10] Ke Chen. 
A constant factor approximation algorithm for k-median clustering with outliers.\nIn Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms,\nSODA \u201908, pages 826\u2013835, Philadelphia, PA, USA, 2008. Society for Industrial and Applied\nMathematics.\n\n[11] Sanjoy Dasgupta. The hardness of k-means clustering. In The hardness of k-means clustering,\n2008.\n\n[12] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of\nthe Forty-eighth Annual ACM Symposium on Theory of Computing, STOC \u201916, pages 118\u2013127,\nNew York, NY, USA, 2016. ACM.\n\n[13] Sanjoy Dasgupta and Mohan Paturi. Lecture notes in geometric algorithms, 2013.\n\n[14] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Zheng Li, Ankur Moitra, and\nAlistair Stewart. Robust estimators in high dimensions without the computational intractability.\nCoRR, abs/1604.06443, 2016.\n\n[15] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.\n\n[16] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical\nComputer Science, 38:293\u2013306, 1985.\n\n[17] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local\nsearch methods for k-means with outliers. Proc. VLDB Endow., 10(7):757\u2013768, March 2017.\n\n[18] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise.\nIn Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science,\nFOCS \u201906, pages 543\u2013552, Washington, DC, USA, 2006. IEEE Computer Society.\n\n[19] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning:\ndata mining, inference and prediction. Springer, 2nd edition, 2009.\n\n[20] Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics, 2nd Edition. Wiley, 2009.\n\n[21] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman,\nand Angela Y. Wu. 
A local search approximation algorithm for k-means clustering. Computational\nGeometry, 28(2):89\u2013112, 2004. Special Issue on the 18th Annual Symposium on\nComputational Geometry - SoCG2002.\n\n[22] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and\nk-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT\nSymposium on Theory of Computing, STOC 2018, pages 646\u2013659, New York, NY, USA, 2018.\nACM.\n\n[23] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, 28:129\u2013136,\n1982.\n\n[24] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering.\nMach. Learn., 56(1-3):35\u201360, June 2004.\n\n[25] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness\nof Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1\u201328:22, January 2013.\n\n[26] Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. CoRR,\nabs/1605.04986, 2016.\n", "award": [], "sourceid": 5971, "authors": [{"given_name": "Aditya", "family_name": "Bhaskara", "institution": "University of Utah"}, {"given_name": "Sharvaree", "family_name": "Vadgama", "institution": "University of Utah"}, {"given_name": "Hong", "family_name": "Xu", "institution": "University of Utah"}]}