Paper ID: | 5971 |
---|---|

Title: | Greedy Sampling for Approximate Clustering in the Presence of Outliers |

This paper proposes simple algorithmic modifications to popular approximation algorithms for the k-centers and k-means clustering problems. These modifications limit the selection of outliers, and thereby allow the authors to translate existing theoretical properties of the standard algorithms to the corresponding problems where outliers are present. The modifications in essence either constrain the selection of initial centroids based on their distance from the current set, or reduce the probabilities associated with selecting outliers which arise in the kmeans++ algorithm which has d^2 proportional probabilities. Empirical results show that the methods work well on some popular benchmarks. The paper is clear and well written, and the methods, although simple modifications of existing algorithms, are intuitive. The theoretical analysis is also well presented and persuasive. My main concern here surrounds the availability of a good estimate of OPT (the minimum objective value), especially for the k-means problem. The authors claim that this is a common assumption in clustering literature, but don't provide a reference. To conclude, a few minor comments/issue/typos: 1. Theorem 1.2 appears to dominate Theorem 1.1 when c = 1. If my understanding is correct, what then is the use of Theorem 1.1? 2. In Theorem 1.4 there is a typo in the statement of the approximation (extra right brace after "\beta + 64"). 3. In line 235 I presume you mean "... considerably more technical than our analysis for K-CENTERS..."

I have a mixed feeling about this paper. On one hand, it contains some nice and simple ideas such as threshold sampling (for k-means) and successive sampling (for k-center), which can potentially be useful in practice. On the other hand, I have the following concerns. First, it seems that the authors were not aware of recent works on clustering with outliers, including: [*] Distributed partial clustering, SPAA'17 [**] A practical algorithm for distributed clustering and outlier detection, NIPS'18 Though these algorithms are (mainly) designed for the distributed models, they can be used in the centralized setting as well. In fact, [**] used a successive sampling procedure (originally from "Optimal time bounds for approximate clustering", UAI'02) that is similar to Algorithm 1 in this paper. Certainly, the problems targeted in [**] are k-median/means, while Algorithm 1 is designed for k-center, but the underlying ideas (i.e., iterative sampling from uncovered points) look to be very similar. I hope the authors can make a careful comparison between these algorithms. Moreover, in [*] a centralized (O(1), 2)-bicriteria algorithm is designed for k-means with outliers. The algorithm uses k centers and has running time close to linear in terms of the dataset size. It looks like this result is strictly better than Thm 1.3 in terms of approximation guarantee? Second, the author mentioned in Section 3 that "Our first result is an analog of the theorem of [4], for the setting in which we have outliers in the data. As in the case of k-center clustering, we use a potential based analysis (inspired from [12])." I hope that some discussion on the novelty of the algorithm and analysis can be included in the main text. Otherwise it appears that Alg. 2 and its analysis are very incremental. The experiments part looks very brief. -- Please give the details about how the synthetic dataset is generated, at the level that others can repeat the experiments. -- The proposed algorithm should be compared with the one in [**], for k-means. -- There are real world data sets with ground truth (i.e., which are the outliers) available, such as KddCup99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). It will be good to see the performance of the proposed algorithm on real world datasets. -- Why for k-center one reports "cluster recall", while for k-means one reports "outlier recall"? It would be good to see both measurements on both cases. -- Better to give some intuition before the mathematical lemmas and proofs on why the extra threshold in the definition of \tau(x, C) helps in the outlier setting. Other comments: -- Intro, second paragraph, the last two sentences look to contradictory to each other? -- Alg. 1, why not directly use k instead of \ell?

- Experimental section does not compare the results of the suggested algorithms with the other known algorithms using other techniques such as local search. A more elaborate experimental section will help. - Even though the theoretical results are nice, the main selling point of the paper from a usability viewpoint is the simple algorithms that are fast and easy to implement (and hence debug). However, the running time analysis and comparison with other known methods are missing. Such an analysis will help see the results of this paper in the right perspective. - On the theoretical front, are there lower bounds arguments of the form: for fixed c=1 and a=2 is there an (a,b,c) algorithm with b = O(1)? There are many such combinations possible. Do the authors know about such lower bounds? It would be nice to include this in the discussion to be able to evaluate the nice upper bounds given in this work.