Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The paper presents a simple and clear divide-and-conquer algorithm for distributed classification in the nearest-neighbour framework. The general methods presented are not entirely new, but they are accompanied by a crisp statistical analysis that proves tight convergence rates to the Bayes-optimal classifier. The novelty of this algorithm lies in its fully distributed nature: very little information must be transferred between computing units, in contrast to previous algorithms, which required more data transfer between them. The claims are clear and understandable, the theoretical part is very satisfactory, and a nice complementary empirical section shows improved speed (in the denoised case) as well as improved regret. On the other hand, I note a few points to explain the overall score I have given this submission:

1) The algorithm is rather similar to the denoising algorithm referenced in the paper. The denoising algorithm could also be performed in a distributed manner, since it operates on different subsamples of the data. In addition, an analysis similar to that of the submitted paper has already been done for the denoising case.

2) In the extreme case in which the subsamples are very small (the size of s is bounded from above but may still be significantly large), it seems possible that k > 1/|S|. I may be mistaken about this, but if it is true, then a tighter upper bound is necessary.

3) The idea of distributing the data to speed up classification seems somewhat inferior to compressing the data: the speedup comes from multiprocessing rather than from a reduction in computation (each subsample must still be searched in full). The denoising variant of the algorithm yields an actual reduction in computation time, which the original BigNN does not.
Therefore, it seems to me that a fair amount of the work presented here is due to the referenced papers. Finally, I would like to state that the paper is well written, clean, and understandable. The only downside, in my eyes, is that much of its originality comes from previous papers. With further analysis answering the questions raised in the conclusions, this could be a more novel and unique paper. Best of luck.
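For concreteness, the divide-and-conquer scheme under discussion — split the training data into disjoint subsamples, run k-NN locally on each, and combine the local one-bit predictions by majority vote — can be sketched as follows. This is an illustrative sketch only, not the authors' exact BigNN procedure; the function names and the default choice of 1-NN per subsample are my own.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Plain k-NN on one subsample: majority label among the k nearest points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

def distributed_knn_predict(X, y, x, n_machines=4, k=1, seed=0):
    """Divide-and-conquer k-NN: partition the data into disjoint subsamples
    (one per machine), run k-NN locally, then take a majority vote over the
    machines' predictions.  Only one label per machine crosses the network."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    votes = [knn_predict(X[part], y[part], x, k)
             for part in np.array_split(idx, n_machines)]
    return np.bincount(votes).argmax()
```

Note that, as point 3) above observes, each machine still searches its entire subsample; the gain is parallelism, not fewer distance computations.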
After reading the author feedback, I am impressed by the thorough comments and I am increasing my score from 5 to 7. Please incorporate your discussion points into the paper, and thanks for the well-thought-out response!

* * * * * Original review below * * * * *

From a methodological standpoint, this paper combines an ensembling approach that is essentially pasting (Breiman 1999) with nearest neighbor classification, and then with the denoising method of Xue and Kpotufe (2018). We already know that k-NN classification can achieve the minimax-optimal rate in fairly general settings (Chaudhuri and Dasgupta 2014). That pasting used in conjunction with k-NN classification also achieves this rate is unsurprising (and the analysis is straightforward), and stitching the result together with the theory developed by Xue and Kpotufe is also straightforward. Overall, the theoretical contributions are reassuring, albeit unsurprising and incremental.

Given the aim of the paper, I think it is extremely important to experimentally compare against recently proposed quantization schemes that are largely about how to scale up nearest neighbors to large datasets: see Kpotufe and Verma 2017, and Kontorovich, Weiss, and Sabato 2017.

Perhaps more discussion/theory on when and why one should use pasting rather than bagging to ensemble k-NN estimators would be helpful. I realize that the authors have mentioned a little bit about known asymptotic results, but I think numerical simulations would really be helpful here.

Separately, depending on the dataset (i.e., the feature space and distribution), I suspect that even taking a single subsample without ensembling could yield a good classifier. Some discussion on how much training data one could get away with (and whether one could simply ignore a lot of the data to save on computation) would be helpful. This comment is somewhat related to the quantization strategies mentioned above.
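To make the pasting-versus-bagging distinction concrete: pasting (Breiman 1999) draws subsamples without replacement, so the subsets are disjoint, while bagging resamples with replacement, so points can repeat within and across subsets. A minimal sketch (the function names are my own, for illustration):

```python
import numpy as np

def pasting_subsamples(n, n_subsets, rng):
    """Pasting: partition indices WITHOUT replacement, so every training
    point lands in exactly one subset -- this is what enables the fully
    distributed setting, since machines hold disjoint data."""
    return np.array_split(rng.permutation(n), n_subsets)

def bagging_subsamples(n, n_subsets, size, rng):
    """Bagging: each subset is drawn WITH replacement from the full data,
    so points may repeat and subsets overlap."""
    return [rng.integers(0, n, size) for _ in range(n_subsets)]
```

Each subset would then feed one k-NN estimator, with predictions combined by majority vote as in the paper's ensemble.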
This paper presents a theoretical analysis of k-nearest-neighbor (k-NN) classification when the k-NN classification is performed separately on different machines holding different data, with the information combined later for the final decision. The paper is clearly written, and the experiments support the derived theory well. The derived result is important because simple methods such as k-NN classification are becoming important for handling very large data, yet such data can no longer be processed on a single machine. The results achieve a similar order of convergence rate to a single-machine algorithm without losing the simplicity of k-NN methods. This paper is a nice contribution to the nearest-neighbor community at NeurIPS.
The paper is well-written, the results appear to be novel and correct.