Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
For the important goal of test error guarantees, the combination of ideas from screening together with approximate regularization paths as proposed here is completely novel, and nicely surprising. While the result here is only for binary linear classification, it might be a useful first step towards more general ML models. The paper is clearly written, and the experimental results are encouraging.
For the reader it might be nice to more clearly distinguish explicit definitions of epsapproximate solutions in terms of training error (1) and validation error (2), to avoid confusion. (and also 'epsapproximate regularization parameter'. maybe better call it 'epsapproximate best regularization parameter'?)
The authors should discuss the theoretical complexity of the proposed approach as a function of eps more in detail. More concretely in this sense, assume the C values would be chosen naively and equallyspaced, what do your results say about the interval size? This would also be a good contact point to discuss the connection to the path methods for training error (vs text as done here) more in detail.
minor: l061: the > this l264 soluitons
(this is a light review)
Q2: Please summarize your review in 12 sentences
The paper contributes a novel approach to compute guarantees on the crossvalidation error along the entire regularization path. This is a highly relevant first step towards much stronger methods for hyperparameter search, such that those methods could in the future come with guarantees for test error.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a novel framework for computing a lower bound of the CV errors as a function of the regularization parameter. The main motivating question of the manuscript ("it is hard to tell how many gridpoints would be needed in crossvalidation") has been already extensively studied both from theoretical and empirical perspective. For example, see (Rifkin et al, MITCSAILTR2007025) or (Rasmussen and Williams, Gaussian Processes for Machine Learning, Chapter 5.)
The proposed score and validation error bounds are technically sound. These results are direct application of the ideas from safe screening (bounding Lagrangian multipliers at the optimal solution). While this was clearly stated in the supplemental material a similar remark in the main body (e.g. section 3) of the manuscript could be helpful for the reader.
The experimental setup to evaluate a theoretical approximation of the regularization parameter is not clearly presented. How exactly the grid search is performed: 10 fold crossvalidation on the training set to obtain optimal parameters with final evaluation on the validation set? In that case, I assume that E_{v} on the training set should have been reported in Fig(3) instead of validation.
Q2: Please summarize your review in 12 sentences
The proposed score and validation error bounds are technically sound. Novelty is incremental and experimental setup/result is dissapointing.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Interesting paper, with a potentially useful result.
It is not clear that the bound would remain valid if the method were kernelized and the kernel parameters also tuned using crossvalidation (as the kernel and regularization parameters tend not to be independent).
The experimental evaluation is a little weak as there are only a few of small datasets.
The size of the datasets is especially an issue as the paper discusses the computational practicality of the bound.
It is not clear that the bound enables better models to be selected, than would be selected simply by optimizing the crossvalidation error directly (within some computational budget).
The method may be useful in avoiding overtuning the regularization parameter (i.e. overfitting the crossvalidation estimate).
Q2: Please summarize your review in 12 sentences
The paper presents a bound on the crossvalidation error for linear models and a procedure for finding a value with guaranteed CV error.
The paper is interesting, and well written, but the experimental evaluation is a little weak.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
A common task in machine learning is to learn the regularization parameter of a model. Usually this is done using crossvalidation. The main result of the paper is that is presents a new way to analytically compute bounds on validation error. First, a lower and an upper bound for wx_i is computed for each validation instance x_i (w is the parameter of the model) as a function of the regularization parameter, given a solution for a different regularization parameter. Then, the scores for validation instances are combined to yield the bounds on the validation error. A central result in the paper is Lemma 1, that shows how compute the bounds for wx_i. The proof is not given in the paper, but the reader is referred to the Appendix. Giving intuition of this core result in the main text would be very important for the reader to be able to understand the idea without going to the Appendix, even if it's entirely reasonably to omit the full proof from the main text. The proof (in the appendix) seems to rely heavily on the convexity of the regularized loss function, limiting the usability (although being more general than presented before). Also, L2 prior on the parameter is used. Furthermore, the paper only deals with tuning a univariate hyperparameter. The results appear correct, though I did not check the proofs in detail. The text is clearly written (though see the remark above). The experiments show that the regularization parameter can be optimized efficiently by assuming a reasonably tight error bound and, importantly, that the error bound is available in the first place. The topic seems relevant for the NIPS community, and the results interesting, though potentially too specific to raise wide interest.
Q2: Please summarize your review in 12 sentences
The paper presents analytical bounds for crossvalidation error and demonstrates how these can be used to optimize the parameters efficiently.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Thank you for fruitful comments.
[Comment on
the scope of the paper]
Let us first make a comment about the scope
of our paper because multiple reviewers pointed out that the class of
problems we considered here is narrow. We would like to rebut this
criticism from the following two points.
a) Some reviewers might
feel that L2penalty term is too restrictive. We would like to emphasize
that this restriction is identical with assuming that the loss function is
strongly convex. A huge body of theoretical works on convex empirical risk
minimization problems have been devoted to the problems with strongly
convex loss functions. If the reviewers claim that the scope of our work
is narrow, the same criticism should be applied to those past works
targeted to strongly convex loss functions. In addition, note that our
approach can be used with any other convex regularization terms such as
L1regularization as long as the loss function is strongly convex.
b) The class of problems we consider here contains several
important classification algorithms such as SVM and logistic regression.
We would also like to emphasize that all the methodologies presented in
the paper can be kernelized if the loss function can be kernelized (it
means that our methodologies can be applied to kernel SVM and kernel
logistic regression etc.). Furthermore, it is easy to confirm that most of
the methodologies presented here can be adapted to other types of problems
such as regression.
We would highly appreciate if the reviewers
reconsider whether they still feel the scope of our problems is narrow
when the above two points are taken into account.
[Answers to Rev
1]
> The main motivating example has been already studied
...
We agree with the reviewer in that, CV error for special
problems (such as LOOCV error in leastsquare regression) can be
represented as a function of the regularization parameter, and it is
possible to provide approximation guarantee of CV error in these special
cases. Our contribution can be interpreted as an extension of these
special cases to wider class of problems (namely, problems with strongly
convex loss functions). We believe that this extension is nontrivial. We
will discuss and clarify this point in the final version.
> The
experimental setup is not clearly presented:
Fig 2 is devoted for
illustrating Problem 1 in sec 4.1. In Fig 2, we compared three grid search
methods: 1) simple grid search at evenly allocated regularization
parameters in logarithmic scale (red), Bayesian optimization (blue), and
Algorithm 2 (green). On the other hand, Fig 3 and Tab 1 are devoted for
illustrating Problem 2 in sec 4.2. In Fig 3 and Tab 1, we used Algorithm
2. Since Fig 3 is only for illustration, we just plotted the error on
independent validation set (not CV error). We apologize for the confusion,
and make it clear in the final version.
[Answers to Rev
2]
> practically utility seems limited ...
I am afraid
there is a misunderstanding here. Algorithm 2 automatically selects a
sequence of regularization parameter values, and it is guaranteed that
approximate solutions trained at these regularization parameters are shown
to satisfy the desired approximation quality. For example, it could find
eps=0.01 approximationguaranteed solution for D10 data set (n > 64000)
in 36.81 seconds (see Tab 1). We believe we can call it practical.
> why greencurve is good in Fig2?
Greencurve
(Algorithm 2) is better since it cleverly selects the sequence of
regularization parameter values. In other words, greencurve is better
because it can skip ranges where solutions are shown to be no better than
the curren best by \eps.
[Answers to Rev 3]
> Intuition
of the proof of Lemma 1 should be provided in main text.
Thank you
for the suggestion. We would do that in the final version.
>
Convexity and L2penalty seems to be restrictive assumption.
See
"Comment on the scope of the paper" above.
[Answers to Rev
4]
> Clarification between epsapproximate solutions in terms of
training error and validation error would be needed
Thank you for
suggestion. We will clarify this point in the final version.
>
theoretical complexity of the path
We understand this is important.
We conjecture that we can derive theoretical complexity by using Lipschitz
constant of CV error curve.
[Answers to Rev 5]
See "Comment
on the scope of the paper" above.
[Answers to Rev 6]
> Is
it valid when it is kernelized?
All the methodologies presented
here can be kernelized if the loss function can be kernelized. See
"Comment on the scope of the paper b)" above.
> There are only a
few of small datasets...
In Tab 2, we summarized 10 data sets whose
sample size ranges from 270 (D1) to 64700 (D10). We do not think it is
"only a few of small datasets", although we can apply the method to larger
datasets if it is necessary.
