
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper studies lowrank kernel matrix approximation in the context of kernel ridge regression. The main contribution is to show that the rank of the approximation can be chosen to be linear in the effective dimensionality for statistical guarantees, improving the result in [1] where the rank is required to be linear in the degrees of freedom. The key point is using Nystrom approximation with nonuniform sampling strategy.
Some comments & suggestions: The paper is well written and structured. The reduction in sampling complexity is novel to my knowledge, but as the authors indicated, the proofs on the statistical guarantees largely follow from the work in [1], which somewhat affects the novelty of the paper. It seems that there is an error in the proof of theorem 2 (line 074 in the supplementary material). Specifically, the random variable on the lefthand side corresponds to column sampling with replacement (exactly p columns are picked out), while, for the random variable on the righthand side, the number of picked columns takes values over the interval {0,1,...,mp}. So these two random variables are not equal in my opinion. As this equality is the foundation of the subsequent deduction, I suggest the authors to check their proof carefully (perhaps, you wanted to use z_{ik} for some different meaning: the indicator of the kth sampling, i.e., z_{ik}=1 iff the kth sampling selects the ith column. However, in this case, the identity in the line 096 would not hold).
Some minor comments: (1) line 085, "which if" > "whether if" (2) line 101, "l:(Y,X)"
> "l:(Y,Y)" (3) Eq. (4), you miss some "sigma^2" in front of \xi (4) line 216, "in conjunction will" > "in conjunction with" (5) line 148 (supplementary material), "\gamma/\lambda"
> "n\gamma"
Reference.
[1] Bach, Francis. "Sharp analysis of lowrank kernel matrix approximations." arXiv preprint arXiv:1208.2015 (2012).
Q2: Please summarize your review in 12 sentences
A moderately novel paper studying lowrank kernel matrix approximations in the context of kernel ridge regression. There is an error in the proofs.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Quality: Paper is well written.
Clarity: The statement and the results are clear and well motivated. The proofs seem sound.
Originality: The paper is an evolution of [2]. While [2] focuses on uniform subsampling methods, this paper focuses on subsampling methods based on leverage scores. The proof techniques in this paper are a variation of the ones in [2].
Significance: With respect to [2], this paper addresses the more interesting and useful case of subsampling with leverage scores. This paper, as [2], proposes an analysis in the fixed design setting (that is remarkably different fromt the classical statistical machine learning setting that is a nontrivial extension of the random design setting), it is definitely a first step towards a theoretical analysis of such efficient approximation scheme for KRR.
Q2: Please summarize your review in 12 sentences
The paper studies some theoretical properties of the Nystrom approximation scheme applied to Kernel Ridge Regression (KRR) in the fixed design setting. It proves that when the points in Nystrom are selected by means of leverage scores, then the approximated KRR algorithm has an accuracy that is of the same order of the standard KRR.
The algorithm has been already studied from a computational point of view for example in [1] and the proof techniques are a variation on the ones in [2]. Indeed [2] proves that in the fixed design setting Nystrom KRR approximation with uniform selection method has an accuracy that is of the same order than standard KRR.
The interesting aspect of KRR Nystrom with lev. scores is that it needs far less points than the ones required by using a uniform selection method. Indeed, as pointed out in the introduction, the proposed approximation scheme is among the fastest in literature for which there exist statistical guarantees.
Despite the fact that the statistical properties of the algorithm are studied in the fixed design setting, that is simpler and remarkably different than the classical statistical machine learning setting (random design), the proposed result can be considered a first step in analyzing the properties of a very useful approximation scheme for KRR from a theoretical viewpoint.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
 While the paper appears to be
sound and well presented on the theory side, it seems that the authors forgot to put the MSE results in the table of accuracy and to compare it to other methods. Timing of the experiments is also not reported which makes it hard to grasp the importance of this method, and how it compares to other methods with classical leverage scores.
 More convincing experiments are needed to see how those new scores interact with the nystrom method in term of accuracy and how they compare with other scores.
 One would expect that Nystrom sampling itself is doing a regularization and hence no need for lambda, can those scores be computed without the need of lambda?
Q2: Please summarize your review in 12 sentences
This paper presents an approach to improve the running time of kernel methods. New notion of leverage score is introduced in the paper, that reduces the complexity of the nystrom method to the effective dimension of the problem.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends on previous work relating to the Nystrom method of low rank approximation of SPSD (e.g. kernel) matrices, showing that tighter bounds can be derived that the sharp analysis of Bach using a nonuniform sampling strategy that attempts to control the bias. The idea of nonuniform sampling for the Nystrom approximation is not itself new, but the analysis goes further than this, and holds for any sketching matrix that meets certain conditions. This, combined with the resultant bounds, is a significant contribution.
Comments/criticisms:  the title is "Fast Randomized Kernel Methods ..." whereas currently the analysis only holds for Kernel Ridge Regression. **Note that the title was changed for final submission**  it seems that this method will not work for any normalised kernel matrix (i.e. where f(x) = \phi(x)/ \left \phi(x) \right), since the p_i's will be uniform. For example, the RBF kernel will always have 1's on the diagonal, so K_{ii} / TR(K) will just be 1/n?  in the first experiment, what is the MSE when using full KRR on this dataset? does the lambda ridge method converge to this error?  in the second experiment (Table 1), why haven't MSE's been reported? A comparison with existing methods should also be done here.
Specific comments:  for ref [1] it would be better to refer to the published ICML paper than the unpublished arxiv paper unless there's a good reason not to (http://machinelearning.wustl.edu/mlpapers/bibtex/icml2013_gittens13)  for ref [5] same as above (http://jmlr.org/proceedings/papers/v32/ma14.html)  L67, L256 the authors refer to Bach's d_{mof} as "maximal degrees of freedom", but I believe Bach called this "marginal degrees of freedom"  eq 4 (and later) the first term is given as bias(K)^2 whereas Bach denoted this same term as simply bias(K)  is there a reason for changing this?  L158
The operator is used for the first time. I presume you are referring to the linear matrix inequality X \succeq Y \Longleftrightarrow \lambda_{min}(XY)\geq 0, but I'm not sure all readers will be aware of this  L162 derived of => derived from  Theorem 2. State what t is (i.e. t \in (0, 1)?)  L302305 Section 3.5. The last sentence in the first paragraph is unclear and too long, please reword this  L373 experiments > experiment  Figure 1. These figures are too small to be readable. I realise space is an issue but please make these more readable if possible.  L398 haven't we known there existence => had we not known of their existence  L399 determine \lambda => determine \lambda and \sigma  Table 1. \sigma was used earlier in the paper to refer to eigenvalues, now it refers to the RBF parameter  could a different symbol be used?  throughout the paper (and references inverted commas are the wrong way round, caused by using the double quote " character instead of two single quotes ''  In the references the capitalisation is wrong in many cases: conference/journal papers should have title case (except for proper nouns and abbreviations), whereas books should have every word capitalized. Nystrom should appear as Nystr\"om. Journal papers should have issue numbers and page numbers.
Q2: Please summarize your review in 12 sentences
A strong paper from a theoretical perspective, that builds on previous work on the statistical properties of low rank matrix approximation, although the experiments could be stronger.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
On the whole, the reviewers acknowledged the high
quality of our theoretical results and the importance of the subject. Four
reviewers gave the paper high marks. Their comments included that our
method "needs far less points than the ones required by using a uniform
selection method," that our method is "among the fastest in the literature
for which there exist statistical guarantees," that ours is "a strong
paper from a theoretical perspective," and that it is "an extremely well
written paper presenting useful and very thorough improvements approximate
KRR." The other two reviews were by a reviewer who pointed out a minor
easilyfixable imprecision in the statement of our theoretical result and
one who provided a lowquality undetailed review. Several reviewers
pointed out correctly that this is more of a theoretical paper.
Regarding comments about the empirical section: We decided to
compare our method only to preexisting method in the Nystrom family and
not to other methods more generally. The reason was due to space and since
previous literature already extensively compared Nystrom methods with
other kernel approximation schemes [1,2,5]. In experiment 1, the MSE
when the full kernel is used is asymptotically approached as p gets large
in all cases, albeit at different speeds, the lambdaridge scores method
being the fastest. The MSE was not reported in Table 1, as that
compares d_{mof} and d_{eff}. The MSE is irrelevant there since only one
method (sampling with lev. scores) is used. Our algorithm was not
optimized to yield the best run time, but we exhibited the main
bottlenecks that such an algorithm has to overcome, and we showed that
they indeed can be overcome. Namely: (1) we bring down the number of
columns sampled in order to provide a statistically accurate prediction to
the effective dimension d_{eff}, and (2) we minimize the memory
requirement of the computation, i.e., no need to form and store the entire
kernel matrix explicitly. Both results are new for this class of sampling
methods.
Reviewer 1: The "error" the reviewer is referring to
is an imprecision in our statement, and it is easy to correct by being
more precise about the sampling process (i.e. definition of the variables
z_{ik}). We will do so, and we are grateful for that being pointed out to
us. Thanks for spotting the typos.
Reviewer 2: The random
design setting is treated in the very recent paper [Rudi et al., arXiv
1507.04717] that builds on the methods we introduce in this submission.
They show that leveragebased Nystrom has the same desirable risk
properties, w.h.p. on the randomness of the data.
Reviewer 3:
The title is intended to emphasize "fast" and "statistical
guarantees" in the context of kernel methods, not that we claim full
generality on all kernel methods. This being said, and while we prove our
results in the context of the \ell_2 loss, there is evidence that
\ell_2like properties in statistical estimation extend to losses that
verify a selfconcordance property. E.g., the logistic loss, where a
biasvariance decomposition holds asymptotically (See [Bach.
Selfconcordant analysis for logistic regression. EJS. 2010]). On the
other hand, this problem it is still completely open for non smooth
objectives (e.g., SVM). For the RBF case, the fact that the kernel
matrix contains 1's on the diagonal does not mean the leverage scores are
uniform. But in this case the approximation scheme becomes particularly
simple: Theorem 4 tells you to sample uniformly O(1/\lambda) columns to
get an approximation of the leverage scores. Thanks for the minor
comments.
Reviewer 5: Nystrom is doing regularization, and this
is studied in detail in a very recent paper [Rudi et al., arXiv
1507.04717] which builds on our work and others. But lambda plays other
crucial roles than "just" statistical regularization. The problem stops
being interesting for lambda = 0, since the effective dimension reduces to
the rank of K (which is infinite at the population level) and the
lambdascores become the usual scores. Even Rudi et al. consider lambda
>0.
Reviewer 6: The experiments show: (1) fewer columns are
needed when sampling using lev. scores, as measured by the *MSE* (not by
some irrelevant matrix norm), and (2) the effective dimension is often
much smaller than previously defined notions of dimension. Line 398
is a typo. Thanks for spotting that. 
