NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1324
Error Analysis of Generalized Nyström Kernel Regression

### Reviewer 1

#### Summary

This paper studies Nyström kernel regression based on coefficient regularization. The main contribution is to show that the Nyström algorithm also works, with guaranteed error bounds, when the kernel is indefinite. The results seem to be novel and interesting.

#### Qualitative Assessment

It seems that the definition of the C^s kernel is not precise. The authors should check the equation in Definition 1 for the case s > 1, which is considered in Theorem 1. Also, since the dimension may be larger than 1, the absolute value in Definition 1 should be replaced by a concrete norm. The Epanechnikov kernel is not used on the real data. As the main contribution of the paper is on the side of indefinite kernels, it would be helpful to also report results for the Epanechnikov kernel on the real-world data.

Minor comments:
- "real-word data" at the end of the introduction should be "real-world data".
- It seems that LY in the paragraph after Eq. (9) is not defined before.

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 2

#### Summary

The authors focus on the kernel ridge regression task with l_2-coefficient regularization [Eq.~(3)] on a compact domain of R^d. The primary goal is to analyze the performance of the Nyström technique in this task with 'Mercer kernels' (continuous bounded functions) that are not necessarily symmetric or positive definite. The main contribution is a generalization error bound on the Nyström-based estimator (Theorem 1), which can give rise to fast rates (Theorem 2) if the true regression function satisfies an exponential approximation requirement (Definition 2) with a C^\infty kernel. The authors numerically illustrate that, in the studied task, the Nyström method with column-norm sampling typically performs better than the Nyström technique relying on uniform sampling.

The paper is:
1) novel: it bridges the gap between two popular directions, l_2-coefficient regularized objective functions and approximation with the Nyström scheme;
2) well-organized and clearly written.

There are a few statements in the paper that should be proved and some bugs to be corrected. I detail these comments, and my minor points, in the 'Qualitative Assessment' section.
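For concreteness, the l_2-coefficient-regularized Nyström estimator described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: the function names, the Gaussian test kernel, and all parameter values are our own choices, and only the subsampled kernel columns are used, so nothing requires symmetry or positive definiteness.

```python
import numpy as np

def gnkr_fit(X, y, kernel, landmarks, lam):
    """l2-coefficient-regularized regression on subsampled (Nystrom) kernel columns.

    We solve an ordinary ridge problem in the coefficient vector alpha,
        min_alpha ||K_nm @ alpha - y||^2 + lam * ||alpha||^2,
    whose normal equations are well posed for any lam > 0, with no
    symmetry or positive-definiteness assumption on the kernel.
    """
    K_nm = kernel(X, landmarks)          # n x m matrix of subsampled kernel columns
    m = K_nm.shape[1]
    alpha = np.linalg.solve(K_nm.T @ K_nm + lam * np.eye(m), K_nm.T @ y)
    return lambda X_new: kernel(X_new, landmarks) @ alpha

def gaussian(A, B, sigma=0.5):
    """Gaussian kernel (positive definite, though the solver above never uses that)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(200)
landmarks = X[rng.choice(200, size=20, replace=False)]   # uniform subsampling
predict = gnkr_fit(X, y, gaussian, landmarks, lam=1e-3)
```

Since only the n-by-m block of the kernel matrix is formed, the fit costs O(nm^2 + m^3) rather than the O(n^3) of full kernel ridge regression.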

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 3

#### Summary

This paper studies the Nyström approximation for kernel ridge regression. The main contributions of the paper, as emphasized by the authors with the word "generalized" in the title, are results for kernels that are not symmetric or positive semi-definite.

#### Qualitative Assessment

As presented by the authors, a novelty of this paper is the use of coefficient regularization, as opposed to norm regularization. However, coefficient regularization has been widely used in the literature. Indeed, the connection between a given function and its coefficients is straightforward thanks to the connections between the primal and the dual optimization problems.

The experiments are disappointing. While the main motivation of this work is indefinite kernels, most of the experiments are done with the conventional Gaussian kernel on real and simulated data. The only indefinite kernel used is the Epanechnikov kernel, and only on simulated data. We think that more extensive experiments should be done in order to establish the relevance of this work.

There are some typos, such as "bouunds".

UPDATE: The authors have addressed our major concerns. We have adjusted the scores accordingly.
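To make the contrast between the two regularizations explicit (our own summary in standard notation, not the paper's formulas): norm regularization penalizes the RKHS norm of the function, which presupposes a positive semi-definite kernel, while coefficient regularization penalizes the expansion coefficients directly, so only the kernel matrix K is needed:

```latex
% Norm regularization (classical KRR): requires a PSD kernel so that
% ||f||_H is defined; the representer theorem gives f = \sum_i \alpha_i k(x_i, \cdot).
\min_{f \in \mathcal{H}}\ \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - y_i\big)^2
  + \lambda \|f\|_{\mathcal{H}}^2
\quad\Longrightarrow\quad
\alpha = (K + \lambda n I)^{-1} y .

% Coefficient (l_2) regularization: only the matrix K enters, with no
% symmetry or positive-definiteness assumption.
\min_{\alpha \in \mathbb{R}^n}\ \frac{1}{n}\|K\alpha - y\|_2^2
  + \lambda \|\alpha\|_2^2
\quad\Longrightarrow\quad
\alpha = (K^\top K + \lambda n I)^{-1} K^\top y .
```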

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

This paper studies the convergence of the kernel-based regression estimator under the Nystr\"{o}m approximation. The Nystr\"{o}m approximation is a classic method that aims at scaling up kernel methods by reducing their computational cost. In the kernel ridge regression setup, the generalization ability of the kernel ridge regression estimator with the Nystr\"{o}m approximation has been well understood; there, it is frequently assumed that the kernel of interest is a Mercer kernel. In this paper, instead of assuming positive definiteness of the kernel, the authors consider a more general setup where the kernels are not necessarily positive definite. Learning schemes within this setup are usually termed coefficient-based schemes and were originally introduced in Scholkopf and Smola, MIT Press, 2001. Without the positive definiteness restriction on the kernels, more flexibility of the considered learning scheme can be pursued. Concerning the theoretical results, generalization bounds for the $\ell^2$-norm regularized regression scheme under the Nystr\"{o}m approximation are derived; this is the main theoretical result of the paper. Besides the generalization bounds, the paper also discusses two different subsampling schemes for the Nystr\"{o}m approximation, namely uniform subsampling and column-norm subsampling.

#### Qualitative Assessment

This paper is presented clearly and is easy to follow. It is interesting to see the Nystr\"{o}m approximation studied in a different setup, which should be able to draw some interest from researchers in this field. The analysis conducted in this paper seems sound. My only suggestion is that the motivation for the present study should be explained more clearly.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Summary

The authors consider the problem of generalized Nystrom kernel regression (GNKR) with $\ell_2$-norm regularization: the idea is to consider Nystrom regression with general kernels (not necessarily symmetric or positive semi-definite) and combine it with $\ell_2$-norm regularization. They provide a generalization error bound (Theorem 1) and a consequent oracle inequality for the regression function (Eq. (8)), and they provide the convergence rate of GNKR, underlining its dependence on the subsampling size (Theorem 2). The subsampling size is chosen as a trade-off between approximation performance and computational complexity. Experiments on synthetic and real data are provided.

#### Qualitative Assessment

The paper is well structured: the problem is clearly and exhaustively explained, and the main contributions are well described with respect to previous works.

Additional comments:
- line 22: "simplicity" should be replaced by "simple"
- line 56: "integrating" should be replaced by "combining"
- line 131: confusing notation: $\| \cdot\|_{\mathcal F}$ should be replaced by $\| \cdot\|_{L^2_{\rho_{\mathcal X}}}$

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 6

#### Summary

This work proposes a novel theoretical analysis of Nystrom-based kernel ridge regression (KRR) with general kernel functions (not necessarily positive-definite), called generalized Nystrom kernel regression (GNKR). The analysis includes generalization bounds and convergence rates with polynomial decay. The experimental section first evaluates the performance of GNKR on synthetic data, both with uniform and non-uniform sampling schemes, showing that for some datasets a general kernel (in this case, the Epanechnikov kernel) performs better than a positive-definite one (the Gaussian kernel). Secondly, the effect of different sampling schemes on GNKR with Gaussian kernel (which is positive-definite) is considered for real-world datasets, showing that column-norm subsampling performs better than uniform subsampling.
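The two ingredients this summary highlights can be sketched briefly. Below is our own illustrative Python sketch (names, bandwidth, and sizes are assumptions, not the paper's): an Epanechnikov-type kernel, which is generally not positive semi-definite, and the two landmark-selection schemes compared in the experiments, uniform subsampling and subsampling proportional to squared kernel-column norms.

```python
import numpy as np

def epanechnikov(A, B, h=0.5):
    """Epanechnikov-type similarity k(u) = 0.75 * (1 - u^2)_+ with bandwidth h.

    As a two-argument kernel this is generally NOT positive semi-definite,
    which is exactly the setting the paper's analysis allows.
    """
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / h ** 2
    return 0.75 * np.clip(1.0 - d2, 0.0, None)

def sample_landmarks(X, kernel, m, scheme, rng):
    """Draw m landmark points, either uniformly or by squared column norms."""
    n = X.shape[0]
    if scheme == "uniform":
        p = np.full(n, 1.0 / n)
    elif scheme == "column-norm":
        K = kernel(X, X)              # full n x n kernel matrix (O(n^2) to form)
        p = (K ** 2).sum(axis=0)      # squared norm of each kernel column
        p = p / p.sum()
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    idx = rng.choice(n, size=m, replace=False, p=p)
    return X[idx]

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
uni = sample_landmarks(X, epanechnikov, 10, "uniform", rng)
cn = sample_landmarks(X, epanechnikov, 10, "column-norm", rng)
```

Note the practical trade-off the review touches on: column-norm sampling needs the full kernel matrix (or an approximation of its column norms), which partly offsets the computational savings that motivate subsampling in the first place.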

#### Qualitative Assessment

The work is an interesting extension of previous ones to general kernels (e.g., kernel functions which are not positive definite). The structure of the paper presents quite clearly the reasoning behind the work, which seems sound. Unfortunately, the exposition is not as clear in terms of grammar, and this makes the paper somewhat difficult to read smoothly. For this reason, please have your manuscript proof-read for English grammar and style issues. A list of some of the most frequent typos (not all of them) can be found below.

Theory
------

The theoretical results look correct. However, a deeper discussion of Theorem 2 would be required. For instance, it seems that the presented bound is not a generalization of the one in [14]: in the case of positive-definite kernel functions, the optimal bound in [14] is not recovered by Theorem 2. This open question should be addressed in depth in the discussion of the theoretical results, by including a precise comparison of the bounds.

Experiments
-----------

The experiments on toy synthetic data presented in Subsection 5.1 are interesting, as they show that for a non-continuous and a non-flat regression function a general kernel function (the Epanechnikov kernel) yields higher test accuracy than the positive-definite Gaussian kernel. This provides an empirical motivation for the proposed analysis. On the other hand, the experiments on real data in Subsection 5.2 appear to be out of scope, since they compare the predictive performance of GNKR under different sampling schemes, but with a positive-definite Gaussian kernel instead of a general kernel. Therefore, the actual algorithm used here seems to be Nystrom KRR. Consequently, the experimental results do not provide insight into what the influence of sampling schemes would be in the general setting treated in the core of the paper.
Some examples of frequent typos (not an exhaustive list)
--------------------------------------------------------

- necessary to be --> necessarily
- line 18: subsampling matrix --> subsampled matrix
- line 22: most simplicity strategy --> most simple strategy
- "the" is often misused
- line 87: Different from the previous works --> Differently from previous works
- line 93: independently --> independently and identically
- line 94: square --> squares
- line 98: good --> well
- line 104: semi-definite --> semi-definiteness
- line 105: kernel --> the kernel
- line 107: implement --> implementation
- lines 107, 108 (and beyond): computation burden, computation complexity --> computational burden, computational complexity
- line 108: computation requirement --> computational requirements
- line 112: $O(mn+m^3)$ --> it might be $O(mn^2+m^3)$
- lines 117, 119: subset are drawn --> subset is drawn
- line 121: similar --> a similar
- line 122: [3, 4] --> [3, 4],
- line 122: Gram matrix --> the Gram matrix
- line 185: Different --> Differently
- line 195: simply --> simple
- Table 3: #Feature --> # Features
- Table 3: #Instance --> # Instances
- Figure 1: Subsapling --> Subsampling
- Figure 1: Number of sampling samples --> Subsampling level

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)