Paper ID: | 8838 |
---|---|

Title: | Kernel Truncated Randomized Ridge Regression: Optimal Rates and Low Noise Acceleration |

1) Summary of the paper The paper introduces a randomized version of the truncated kernel ridge regression algorithm. The algorithm essentially selects a random subset of training points and learns a (truncated) kernel ridge regression function on the selected subset. Under certain characteristic assumptions on the complexity of the function class in which the optimal function lies and on the complexity of the RKHS, the paper shows that the algorithm achieves optimal generalization guarantees. This is an improvement over the existing results in this setting in one of the regimes of the problem space. Additionally, the authors show that under a zero Bayes risk condition, the algorithm achieves a faster convergence rate to the Bayes risk. The main contribution of the paper lies in adapting the proof techniques used in the online kernel regression literature to the standard kernel regression setting. 2) Main comments The structure and writing of the paper are fair and easy to follow, and the results are put into perspective with respect to the current literature. The paper also provides clear mathematical intuition behind the various assumptions made for the analysis. It is also fair to argue that the assumptions are standard in the kernel regression literature. However, the usability of the results requires that the assumptions are relatable to some practical setting. This is not provided/discussed in the paper. For instance, it is not clear how the assumption on the eigenvalue decay of the integral operator translates to conditions on the underlying distribution and the kernel function thereby limiting the usability of the results. On a similar note, the regime under which the algorithm improves over existing results needs further discussion to emphasize the relevance of the obtained results. One of the motivations of the paper are to address the discrepancy of the role of regularization in theory vs practice. The authors address this issue under the setting of zero / near zero Bayes risk condition. The authors briefly justify the usability of this setting through a visual classification example. However, unlike in the classification setting, this seems to be a very strong assumption in the regression setting (a distinction the authors clearly make in the discussion of their theoretical findings). Furthermore, the main contribution of the paper stems from the adaptation of the techniques used in the online kernel regression literature to the kernel regression setting. A further discussion and emphasis on ways to adapt the techniques to more general settings in kernel regression could have further improved the usability of the paper. 3) Minor comments and typos l. 36: missing ' after "easiness" Algorithm 1 is often referred to as Algorithm 3 in the paper. Algorithm 1: if I understand correctly, the sampling is done after solving n minimization problems. Why not first sample k, and only compute f_k? l. 227: missing space after "Figure 5" 4) After rebuttal and discussion The rebuttal adequately addressed my concerns, and I am increasing my score. I think the paper should be accepted.

The derived results for KTR3 with non-zero regularization parameters is a strict improvement in the regime $2\beta + b<1$ upon the best-known bound of KRR and stochastic gradient methods. The derived results for KTR3 also indicate that in the regime of easy problems the best among of regularization is exactly zero. -------------After rebuttal----------------- The comparisons with related results have been made clearly by Fischer/Steinwart in Table 1 of the nice paper (https://arxiv.org/pdf/1702.07254.pdf). (For ease of comparisons, I use the same notations from that Table in the following). The focus of the paper is on the case $\alpha =1$. In my opinion, this paper gave optimal rates for the case $\alpha =1$, and the results indeed improve previous results, closing a gap (but only for the case $\alpha =1$). Note that for the general case $0<\alpha\leq 1$, it is still an unresolved problem. And in my opinion, the technique of the paper under review may demonstrate some further new results for the general case $\alpha\leq 1$. Thus, I keep my scores unchanged. A further comment on the nice paper (https://arxiv.org/pdf/1702.07254.pdf) is needed in the final version. The derived results are very interesting, closing a long-standing gap between upper and lower bounds. The proof technique developed may be also interesting to the community. The paper is well written. I did a high-level check of the proof. I believe the proof should be correct. Minor comments:\Line 70 on Page 2, $X$ is compact..\Line 71 on Page 2, a citation is needed here for $f_{\rho} \in \bar{H_K}$; ... which is the closer of $H_K$ in $L_{\rho}^2$ \Line 80 on page 3, $L_K$ is compact ?\

*this paper is nicely written and the results are presented in a rigorous and clean way, I like that paper! *also, the results are important since it was not clear to the community if these improved bounds also hold in that regime *however I would like to ask the authors for being more clear: in l. 38/39 they write that these bounds are mini-max optimal; in fact, in that regime there are no known lower bounds! so claiming optimality is a bit too much here! please be more careful, also on p.4, l.127-131 (the regime of Caponnetto and DeVito is different) *I do not fully get the effect of randomization. Can you prove your results without randomization (and usingthe full dataset)? Your randomization-strategy allows to build an estimator with less data then the full dataset. My intuition is that this should have an empirical effect. Surely, these rates cannot be established when using only a subset of the data (without additional assumptions). How much do you "loose" when implementing your algorithm? In your experiments you choose the full sample. This deserves a more detailed discussion.