Review for NeurIPS paper: Extrapolation Towards Imaginary 0-Nearest Neighbour and Its Improved Convergence Rate

NeurIPS 2020

Extrapolation Towards Imaginary 0-Nearest Neighbour and Its Improved Convergence Rate

Review 1

Summary and Contributions: This paper proposes a computationally tractable way of computing weights (possibly negative) for weighted k-NN classification that achieves the same rate of convergence as the optimal weights approach of Samworth (2012). The proposed method first computes unweighted k-NN estimators for different values of k, and then solves a regression problem that can interestingly be interpreted as extrapolating to the imaginary k=0 nearest neighbor.

Strengths: The proposed multiscale k-NN approach is interesting and, I suspect, would mainly be of interest to ML researchers working on theory for nonparametric methods. The impact is a bit limited since it really is focused on weighted k-NN classification where weights can be negative. The work is very heavily based on Samworth's 2012 optimal k-NN weights paper. The proposed method seems to do about as well as one of Samworth's methods.

Weaknesses: In terms of hyperparameter tuning, in your experimental results, you specify hyperparameters values for V, c_1, c_2, c_3, ... c_V (in a way to match Theorem 2's conditions). Is there some general strategy for how to tune these hyperparameters including how to set V? Understanding how sensitive multiscale k-NN is to how these are tuned would be very helpful to understand. Overall, the proposed multiscale k-NN approach is quite clever although it seems to only barely do better than Samworth's baseline with nonnegative weights (see Table 2; where multiscale k-NN does better than Samworth's baseline with nonnegative weights, the improvement is generally very small).

Correctness: The claims and method appear correct.

Clarity: The paper is decently written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Update after reading author feedback and other reviews: the author feedback sufficiently addressed my concerns, and I am increasing my score to a 6. Please revise the paper according to what was in the author response to improve the paper draft.

Review 2

Summary and Contributions: The paper considers binary classification using weighted k-nearest neighbours. The main research problem of interest has to do with convergence rates for the excess risk (in expectation over the sample) achieved by different implementations of weighted k-NN. As a key background work, they cite work by Samworth (2012) which shows that by using real-valued weights instead of being constrained to non-negative weights, it is possible, at least in principle, to achieve a faster rate which is essentially optimal among all classifiers. The problem is that actual computation of weights which satisfy the conditions put forward by Samworth is a challenge. With this basic context in place, the authors propose a new algorithmic approach doing real-valued weighted k-NN which is easy to implement, and their main claim is that this practical alternative achieves the same rate as the Samworth-type weight setting. They also provide empirical results comparing existing k-NN weighting schemes with two representative implementations of their proposed procedure on a number of benchmark datasets, for fixed values of sample size n.

Strengths: I would say that the main strength of this paper is that the proposed algorithm (multi-scale weighted k-NN) is easy to understand, straightforward to implement, and that the technical material presented in this paper is for the most part very easy to follow. The procedure is rather intuitive as well, and so as a contribution of a new general-purpose method for doing weighted k-NN with a theoretical guarantee tacked on, it may be appealing to practitioners.

Weaknesses: I would say that the chief weakness of this work is that the novel results presented here seem rather limited in terms of new insights. The vast majority of the paper is dedicated to background review; there is about 1.5 pages of new technical material here, and a half-page of empirical analysis which does not consider changing values of n (i.e., no actual comparison of rates). The proof of their main result may be of value from a technical standpoint, but the authors themselves state that it follows very closely from existing results (Chaudhury and Dasgupta, 2014), and all details are relegated to the supplement.

Correctness: The formal result appears to be correct, though I have not looked in detail at the proof given in supplementary materials. The empirical tests conducted here are relevant to their general problem of interest (classification using weighted k-NN), although direct comparison of learning curves (performance over n) might have been nice to see, given that the main focus of the formal side of the paper is a method that achieves near-optimal rates.

Clarity: The paper is written clearly.

Relation to Prior Work: The paper gives a detailed background on related work, and it is clear where the present work stands in the existing literature.

Reproducibility: Yes

Additional Feedback: - As an additional comment, I thought that most of the paper was pleasant to read for a wide audience, save for lines 271-276, which happen to be the key technical conditions used in Theorem 2, the main theoretical result of this paper. I think it would benefit the paper greatly to provide some more discussion with these conditions, to make the level of clarity on par with the rest of the paper. - One small point: in 2.2 (notation), line 120, there is a typo (bold x instead of X). After reading the feedback and the other reviews, I maintain my current assessment.

Review 3

Summary and Contributions: The paper deals with the problem of achieving improvement over convergence rate of “vanilla” k-NN. This is previously done by leveraging assumptions over the distribution (its smoothness (\beta-Holder condition, \gamma-neighbour average smoothness) and “margin” (\alpha-margin condition, namely, upper bounding the probability of instances whose expected label expectation is ~1/2)) along with using weighted k-NN, with several methods for defining the weights and their resulting convergence rate. The paper proposes a new method, called MS-k-NN, of estimating several unweighted k-NN estimators, for different values of k (\nu(k) for simplicity of notation in the review). For each k, a radius r is associated by taking the distance from the query to its k-closest neighbour. Then, pairs (r(k), \nu(k)) are obtained, and parameters b are obtained by linear regression in order to estimate \nu(k) by a polynomial in r(k). It is proven that this method obtains the optimal convergence rates obtained by more cumbersome methods of weighted k-NN. It is shown that experimentally, over a few datasets, the performance of MS-k-NN is similar to that of the weighted k-NN.

Strengths: The method proposed in the paper is intuitive and very simple. It also provides interesting insight as to what lies beneath the optimal weights in weighted k-NN methods by showing correspondence between the weights obtained in MS-k-NN and in those methods.

Weaknesses: The method seems to provide little benefit over LP estimators. Although, as mentioned in the paper, the number of weights in the regression is reduced by a factor of 2, to 0.5\beta, it is not clear whether this is significant in real world problems, where large value of \beta seems to be an unrealistic assumption. Additionally, they’re also very easy to implement.

Correctness: seems correct.

Clarity: clear.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: