NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, Vancouver Convention Center
Paper ID:1909
Title:First order expansion of convex regularized estimators

Reviewer 1

Simple case - squared loss and identity covariance. The key ingredient in the proof is establishing, for the problem settings of interest, that the vector \hat{\beta} - \eta lies in a well-behaved set T with benign Rademacher complexity.

I am only giving this paper a weak accept because:
(a) The tools and techniques used in the proofs are fairly standard and do not contain novel ideas.
(b) Even though the authors provide some motivating examples and problem settings, I am unable to imagine more general applications of their results in machine learning. It seems that for any new problem of interest, we will need to think from scratch to quantify the set T and control its Rademacher complexity. Further, the estimator in (4) may not be computationally easier. I am quite flexible on this front and am willing to change my opinion based on the author feedback.
(c) Writing: the paper is not very well written, and I strongly recommend that the authors refine it further. Some typos I observed are listed below:
1. Line 47: definition of Euclidean norms.
2. Line 74: a K is missing in the expression on the LHS.
3. Line 101: the Sigma should be K, as per definition (9).
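The quantity the review leans on, the (empirical) Rademacher complexity of a set T, can be sketched with a small Monte Carlo estimate. This is a generic illustration for a finite set T of points; the paper's actual set T, and the choice of points below, are assumptions for the sketch, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(T_points, n_draws=5000):
    # Monte Carlo estimate of R(T) = E_eps [ sup_{t in T} (1/n) <eps, t> ],
    # where eps is a vector of i.i.d. Rademacher (+/-1) signs and T is the
    # finite set given by the rows of T_points (shape (k, n)).
    k, n = T_points.shape
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))
    sups = (eps @ T_points.T).max(axis=1)  # sup over T for each sign draw
    return sups.mean() / n

# Illustrative T: k = 20 random points in R^50 (an assumption, for demo only).
n, k = 50, 20
T = rng.standard_normal((k, n))
est = empirical_rademacher(T)

# Sanity check via Massart's finite-class lemma:
# R(T) <= max_{t in T} ||t||_2 * sqrt(2 log k) / n.
ub = np.linalg.norm(T, axis=1).max() * np.sqrt(2 * np.log(k)) / n
assert 0.0 < est < ub
```

For a convex set such as the one the paper needs, the supremum is over infinitely many points, so controlling it requires structural arguments rather than enumeration, which is the reviewer's point about having to redo this work for each new problem.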

Reviewer 2

The main results of the paper are novel. They are stated in a very general setting and can be applied to a broad class of convex estimators. However, both applications, to the Lasso and the group Lasso, have already been studied in the literature. It is unknown whether there are applications for which the theory leads to new results beyond those in the literature.

Reviewer 3

The article introduces a so-called first order expansion for regression problems. The idea is quite interesting and the results at first intriguing, but they lead to a few major concerns:
* Application-wise, it seems that the results are only usable under the strong assumption of the existence of a function psi, which is possibly hard to determine (at least, only simple applications are provided by the authors, which seems to suggest that this is indeed the case). It would be appropriate to better discuss the applicability of the results.
* Technically speaking, I am a bit concerned with the third item of (A1), which seems to constrain \ell to be quadratic (see my comments below). Nor is it clear how this assumption is really "called" in the main results. Please clarify this point.
* Generally speaking, it is difficult to assess the limitations and breadth of application of the work: which problems can it handle, and which not? What would be the next hurdles to break? Are there definitely problematic issues for follow-ups? I would like clear feedback on these aspects, both technical and applied.

Detailed comments:

Abstract:
- Repetition of "Such first order expansion" (3 times).
- "the Lasso", then "the lasso".

Introduction:
- "one observes observations"...
- What does it precisely mean for "the literature of the past decade [to have] demonstrated great success"? In which context? To what avail? Which literature, anyway? Please point to references.
- "in terms estimation"
- I do not really understand the relevance of Q1 and Q2. You are essentially trying to convince the reader here that the proposed work is useful by asking questions about the relevance of the method. Q1 is clearly not a fundamental question. Q2 is better, but the real question is rather whether one can improve the tractability of the asymptotics of hat{beta}, for instance; and there your work proposes an answer via the first-order expansion argument. Not the other way around.
- "certain smoothness assumptions" --> such as?
- In passing: "certain assumptions ... implieS".
- "allows transfer of results for averages to study of" --> please correct the English grammar/syntax.
- Please specify the definition of the derivatives of \ell(.,.). This in passing imposes that \ell be properly differentiable, which is not the case for the l1 norm. How is that dealt with?
- The overall setting also assumes a concentration effect of the argument of \ell as n --> infty. This is known not to hold when n, p --> infty together, e.g., for Xi ~ N(0, I), in which case Taylor expansions are of no avail. Since it is proposed here to let p grow possibly large, it would be worth discussing what scaling for p is allowed and how it relates to the data statistics.

Main results:
- "There exists constants" --> exist.
- What is T in the third display of (A1)?
- The first two assumptions in (A1) feel natural, but the last one is possibly stringent. First of all, isn't the denominator simply ||u||_K^2? What this then roughly implies is that no eigenvalue of K vanishes, thus essentially that l'' is bounded below. Am I right? So l'' is both bounded from above and bounded from below... Isn't that equivalent to saying that only quadratic costs are allowed? This would of course be a major issue...
- It would be appropriate to comment on Th2.1. What does it really say? How does it differ from prior work, and what is new and interesting in it?
- I do not understand why (8) in Prop2.2 holds for some B3 > 0. Doesn't l''(y,u) tend to 0 as u --> -infty, for instance? This likely poses a problem, but I may be wrong. Or is (8) considered probabilistically somehow, assuming (A2)? In which case, it must be restated properly. What is the low...
- I would restrict the title to just "Application to ...", which is more formal.

Application to exact risk identities:
- Generally speaking, to be applicable, the overall work assumes that a function psi exists and can be evaluated. It would make sense to discuss this aspect in the concluding remarks, say (and move the proof outline to the appendix). At this point of the article, the applications considered are rather elementary, which gives a (hopefully wrong) impression that the results of the article are only applicable to trivial scenarios.
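The concern about (8) holding with some B3 > 0 can be made concrete with the logistic loss, a standard smooth convex loss; whether the paper's setting is meant to cover it is an assumption of this illustration:

```latex
\ell(y,u) = \log\!\left(1 + e^{-yu}\right), \quad y \in \{-1,+1\},
\qquad
\ell''(y,u) = \frac{e^{-yu}}{\left(1 + e^{-yu}\right)^2} \in \left(0, \tfrac{1}{4}\right].
```

Here \ell''(y,u) \to 0 as yu \to +\infty and also as yu \to -\infty (the denominator grows like e^{-2yu}), so \ell'' is bounded above but no constant B_3 > 0 can bound it below uniformly in u, exactly the failure mode the reviewer raises for non-quadratic losses.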