__ Summary and Contributions__: I should mention that I have reviewed this paper before (submitted at another venue).
The contribution is clearly stated in the title. This is theoretical paper that improves the theoretical approximation guarantees of the k-means++ and k-means|| algorithms. Even though the improvements may be small but these improvement are made using some interesting collection of techniques that itself may be interesting for many readers.
Here are some comments about the writeup:
1. It may help the readers if you mention that the algorithms being discussed are for the discrete setting (where centers are chosen from among the points) and also work for metric setting. Note that some of the previous work mentioned in the "related work" section are for the Euclidean case.
2. Lines 47-60: Does it make sense to also experiment with k-means|| where $\ell * T = k$? This may also help ruling out explanation (2).
3. You may want to check:
https://proceedings.icml.cc/static/paper_files/icml/2020/3965-Paper.pdf
In the sense that the quality of the centers may also be measured in terms how how much improvement local search gives.
3. The Martingale based analysis is also found in the previous work such as Aggarwal, Deshpande, and Kannan. You should add this citation in relevant places.
I have read the author response.

__ Strengths__: - k-means++ and k-means|| are simple algorithms that are used in practice. Understanding approximation guarantees for these algorithms help us understand why they work so well in practice. The paper gives a nice compilation of results and techniques used for the analysis of these simple algorithms and is well-written. This could be of value to many attendees who are interested in these algorithms.

__ Weaknesses__: I do not see any significant weaknesses in the paper.

__ Correctness__: Yes.

__ Clarity__: Yes. The paper is well-presented and it is not too difficult to go over the high-level structure of the arguments presented.

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: In this submission new bounds on the approximation factor of k-means++ and k-means|| are presented. k-means++ is a well-known seeding method of Lloyd's method with approximation factor O(log(k)) and k-means|| is by now also a well-known scalable variant of k-means++. The first theoretical contribution is an upper bound of 5(ln(k)+2) on the approximation factor of k-means++, which improved upon the previously known upper bound of 8(ln(k)+2) by a constant factor. Then bicriteria approximations are considered, i.e., one uses k-means++ and k-means|| to compute clusterings with k+Delta clusters for some Delta > 0 and compares the costs of these clusterings with those of an optimal k-clustering. Both for k-means++ and k-means|| such bicriteria results are already known but in this submission improved bounds are shown. These improved bounds exhibit almost the same asymptotic behavior with respect to Delta as the known bounds but the constants are significantly smaller.
The improvement of the approximation factor of k-means++ from 8(ln(k)+2) to 5(ln(k)+2) is based on proving a better bound on the expected cost of covered clusters (i.e., those clusters from which at least one center is chosen). The rest of the proof follows more or less along the same lines as various analyses of k-means++ in the literature. I find it interesting that apparently it has not been observed before that the analysis of Arthur and Vassilvitskii can be improved in this respect. In the submission also an example is presented showing that 5*ln(k) is essentially a tight bound.
The proofs of the bicriteria results are more involved. They also use some known ingredients from analyses in the literature but improve these analyses in many aspects. In a nutshell both analyses boil down to proving that as long as the uncovered clusters have high costs it is very likely that they will get covered.
In the appendix also experiments are shown with k-means++, k-means|| and a variant of k-means++ with pruning (in this variant more than k centers are sampled by k-means++ and then only k of these are selected). k-means|| and k-means++ with pruning yield clusterings with essentially the same quality and perform slightly better than k-means++. The conclusion one can draw from this is that oversampling seems to be very helpful in obtaining good clusterings.

__ Strengths__: I think the line of research is very interesting and up-to-date. It is very relevant to the NeurIPS community in my opinion. The proposed analyses are clever and introduce many new ideas though some arguments are along the lines of known analyses in the literature. The experiments show that oversampling is probably the reason why k-means|| performs better in experiments than k-means++, which is also interesting (but not very surprising). Since k-means++ and k-means|| are well-known and successful algorithms both in theory and in practice, enhancing the understanding of their behavior is very important.

__ Weaknesses__: One might consider as a weakness that nothing very novel is shown. The bounds on the approximation guarantees are improved but in the end only the constants get better and the asymptotic behavior is not changed (due to some lower bounds, one cannot expect better asymptotic bounds). There is also no essentially new analysis technique presented but the known techniques are applied in a more clever way than before.

__ Correctness__: I did not have time to check all details in the appendix but all results seem very plausible and everything I checked is correct.

__ Clarity__: The paper is generally well written and easy to follow.

__ Relation to Prior Work__: Yes, the relevant literature is cited and discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This is an interesting paper which provides a large list of improved approximation bounds for the k-means++ and k-means|| (k-means parallel) algorithms, and some variants thereof. After providing a thorough introduction into the algorithms and related results, the author(s) state their contributions: (1) an improved bound on the multiplicative factor by which the k-means++ output differs from the optimal k-means clustering solution, (2) improved bounds on an oversampled form of k-means++, and (3) improved bounds on a form of k-means|| without pruning.
Inside the main text, the authors provide the crucial parts of proofs of the three results stated above. These results are substantiated in the supplementary material, which is quite extensive. The proof techniques are interesting and enjoyable to read. Some numerics are also presented to provide explanation of why the k-means|| algorithm works better than k-means++.

__ Strengths__: - k-means++ and its variants are standard and useful algorithms, and this paper contains many nice results which advance the current knowledge of the approximation guarantees of these methods. The
- The proofs are interesting to read and utilize a fairly broad set of techniques. The writing is clear and consistent throughout.

__ Weaknesses__: - My main concern for the paper is that the organization is a bit awkward. The results in Sections 4-6 each rely on additional results provided in the supplementary material, but the statements of these deferred proofs often appear abruptly (e.g. Lemma 4.2 and Lemma C.1 both appear in the middle of the proof of Lemma 4.1). A few lines/sections appear without motivation and could likely be moved to the appendix (more below in the "Clarity" section). This could make space to have some discussions in the main text that guide the reader through the list of proofs in the supplementary materials. Alternatively, the author(s) could add a short discussion at the end summarizing some of the insights that made this work possible, e.g. in contrast to the proof techniques of (Arthur and Vassilvitskii 2007).

__ Correctness__: I have gone through all the details in the main text and verified that they are correct. The proofs in the appendix appear correct, but I have not verified these details except for a few of the lemmas. The short empirical section seems rigorous, although code is not provided to reproduce these figures.

__ Clarity__: Overall, the paper is very well-written. Here are some nitpicks:
- something seems to be off with the latex formatting, as many lines are missing line numbers
- p5.L129: why mention Markov chain here? It can be mentioned in the appendix, where it is actually used. Otherwise the reader may expect to see more details about the Markov chain comment in the main text.
- Does Section 3.1 belong closer to the current Section 6, where it is actually used? It could also be moved to the appendix and referenced from within Section 6.

__ Relation to Prior Work__: Yes - comparisons of approximation bounds are clearly provided and explained.

__ Reproducibility__: Yes

__ Additional Feedback__: --------
--------
EDIT after author feedback and discussion:
Overall this seemed like a nice paper with interesting results, and I am happy to continue supporting it after seeing the author feedback and discussion. I do encourage the authors to think some more about the suggestions re: organization.
--------

__ Summary and Contributions__: This paper presents tighter analyses for the approximation bounds of the classical k-means++ and k-means|| algorithms. In particular, the original k-means++ analysis yielding an approximation of 8(ln(k) + 2) from the optimal solution is here improved to 5(ln(k) + 2). Similarly, the k-means|| bound is improved by a constant factor.
Both k-means++ and k-means|| are ordinarily used by practitioners thanks to their simplicity, so in that respects these improvements are nice to have. On the other hand, the improvements on the analyses are only marginal and not substantive enough in my opinion to pass the bar for acceptance.
What would make the paper a lot stronger is a consolidation of the analyses for k-means++ and k-means||. To elaborate, k-means++ runs for k iterations picking a single center per iteration (proportionally to the squared distance to the closest already chosen center); whereas k-means|| can be seen as a natural generalization in that it runs for T iterations and picks \ell centers per iteration. What would be nice is to study k-means|| for general T and \ell, so to obtain a single bound that recovers both the k-means++ performance guarantee (for T=k, \ell=1), as well as the k-means|| bound (for \ell=\Theta(k)). Note that the bound presented for k-means|| is immaterial in the case when T=O(k) and \ell=O(1). Such an analysis would also underline other sweet spots for these parameters (A complementary empirical study of it could also be useful.)
Minor:
- To help the reader, please add a reference to the NP-hardness of k-means.

__ Strengths__: - Applicability: k-means++ and k-means|| are both widely used.
- Claims are sound

__ Weaknesses__: Improvements are marginal and individual to each algorithm.

__ Correctness__: Claims are sound to me. I couldn't spot any obvious issues in the proofs.

__ Clarity__: The paper is well written and results are presented properly.

__ Relation to Prior Work__: It would have been good to expand on the work by Rozhon (2020) since it's directly related to this one.

__ Reproducibility__: Yes

__ Additional Feedback__: