Review for NeurIPS paper: Towards Problem-dependent Optimal Learning Rates

NeurIPS 2020

Towards Problem-dependent Optimal Learning Rates

Review 1

Summary and Contributions: The paper suggests a more subtle analysis of uniform localize convergence that is problem-dependent in the following sense: each problem is defined by its best hypothesis (not necessarily from the hypothesis set), and the problem-dependency bounds are either by the loss or the variance of this hypothesis. The paper shows that this approach yields a lower bound to the problem-independent analysis (like local Rademacher complexity), with equality only for a subset of the problems. Finally, the paper shows examples of implementing those bounds.

Strengths: The paper confronts the known problem-independent techniques which are known to be far from optimal in some problems, and suggest a new strategy that fits each problem differently. Each of the three examples in section 6 can be considered a specific contribution by itself, as each emphasis on how different distributions can lead to different bounds; thus, distribution-independent bounds are assumed to be far from optimal. The paper is well written and provides good intuition of the work and the proofs of the claims. After each statement or theorem, the authors give a brief explanation, that put their claims in context and actively helped for the readability.

Weaknesses: The dependency on L* and V* is theoretically acceptable; however it is unclear how one can estimate them in practice. The authors suggested to add a new corolarry for this case. There are some notations that were explained by the author during the rebuttle phase and should be added to the paper itself: Explain that psi is dependent of n, and the arithmetic definition of [\vee].

Correctness: While I did not thoroughly review the proofs, the proofs' intuitions do not raise any red flag.

Clarity: The paper is well written and organized.

Relation to Prior Work: The authors provide great emphasis to explain their contribution over previous works.

Reproducibility: Yes

Additional Feedback: Minor issues: - Line 86: you defined h in H, but used f in equation 2.5. - The paragraph after line 89: it is missing r_0 after upper bounding T(f).

Review 2

Summary and Contributions: This paper examines generalization errors that depend on the loss and variance of the best hypotheses. The analysis is based on refinements of local Rademacher complexity arguments. Additionally, the paper proposes a new moment-penalized estimator and examines the rates obtained for a few nonparametric estimators. The rates improve on recent work in various parameter regimes.

Strengths: The theoretical results are certainly of interest, given the community push to explain generalization of sophisticated estimators without relying on basic uniform convergence bounds. Although I didn't check the proofs carefully, it appears that an above-average amount of technical effort went into the results.

Weaknesses: 1) The push to better understand generalization is mostly driven by neural networks, so if the bounds could be described for neural networks, that would be helpful. Of course, this presumably requires a good bound on \psi. (2) It also may be helpful to include a more concrete discussion on why the resulting technique is unhelpful for parametric classes, where local Rademacher complexity gives optimal results. In general, it would help to make the comparisons clearer.

Correctness: The paper seems correct, although I did not check the proofs carefully.

Clarity: (3) The paper could be written better. I laughed a bit when I saw that Section 3 was the preliminaries. Usually section 2 is the preliminaries so that it doesn't break the flow of the main arguments of the paper. (4) In Section 6 (and elsewhere, where appropriate), I would have liked a table giving some explicit rate comparisons. Sure, there are discussions of the rate comparisons in the text, but they're simply not as easy to follow, especially when many of the aligned expressions aren't really equations or inequalities in the sense that they describe a relation but merely an expression. I find that this makes it harder to identify key information quickly. L134: characterizes L160: the left and right quotes are not in the same style L232: hypothesis L246: variance-dependent L264: invoking

Relation to Prior Work: Likely enough for someone in this area.

Reproducibility: Yes

Additional Feedback: I have read the other reviews and the response and have updated my score.

Review 3

Summary and Contributions: This paper proposes a new analysis framework, which is termed “uniform localized convergence”, that allows for a finer characterization of problem-dependent rates in learning. In particular, Theorem 1 characterizes the loss-dependent rate of ERM and Section 6.2 illustrates how this can lead to tighter rates than the traditional ones known. In addition, Theorem 2 proposes a new estimator with a better variance-dependent rate than what is known in prior work. Finally, Section 6 illustrates some examples where we get improved rates based on Theorems 1 and 2.

Strengths: I think this is a novel contribution with very clean and crisp results. Beyond proving improved problem-dependent rates, the machinery and techniques involved may be useful in other applications as well.

Weaknesses: For variance dependent rates, perhaps the authors could discuss a bit more on the tradeoffs between practicality of the proposed estimator and how that compares with prior work. This is lightly discussed in the broader impact section, but maybe this could be expanded on a bit further. Also, do we know whether the rates in Theorems 1 and 2 are optimal or not? Perhaps the authors could discuss cases where this is or (is not) the case.

Correctness: I haven’t spotted any issues with the claims/proofs.

Clarity: Very well-written and clear.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: *** Update after author response *** Thank you for addressing my questions in the response.

Review 4

Summary and Contributions: The authors derive new, sharper bounds for the generalization error in the standard statistical learning framework. Their general results hold under weaker assumptions (not requiring the Bernstein condition and sub-root property of the \psi surrogate function). They derive both a loss and a variance dependent version of the upper bound and apply the derived results on nonparametric function classes demonstrating the advantages of their results over standard approaches in the literature.

Strengths: The paper is well written, I liked that the authors provided proof-sketches in the paper, not deferring everything technical to the supplement. They apply standard, but involved empirical process techniques, hence the level of the paper is higher than standard NeurIPS submissions. The derived results are clean and elegant and the structure of the paper is also good.

Weaknesses: Beside the many typos (see below for a non-complete list) in several places the derived results are not rigorous/sharp in the factor B and some of the cases need a bit more care. More specifically: - l 98-100: in (2.6) it should be \mathbb{P}(f), not \mathbb{P}_n(l(h,z)), right? - l 144: What happens if BL^*<<r^*<<B^2/n? - l 189: In Theorem 2 the authors have r^* in the upper bound without B. When B is n dependent or very large this could be a problem. The lack of the term 1/B should be also noted when the results are compared to corresponding results in the literature. - 209: where does the B come from in the first bound in (6.1)? - l 269: I guess the authors meant \log V* (or even better |\log V^*|), else it would be negative. Furthermore, (especially in the noisy setting) if V^*\gtrsim n^{\delta}, for some \delta>0 the log n gap is still present...

Correctness: The proofs seem to be correct and there is no methodological contribution.

Clarity: The paper is easy to read, despite being a theoretical work. The authors introduce all of the key concepts and make the manuscript (relatively) self-contained (given the format they do a good job making the paper accessible). However, there are a lot of grammar mistakes/typos, so the whole manuscript has to be very carefully checked. A non-complete list of typos/grammar mistakes are: -l 24: achieves -> achieve - l 50: prelimaries -> prelimaries - l 89: $2^{k+1}r_0$ - l 145: The -> the - l 168: make -> makes - l 169: penalizes-> penalize - l 180: use larger brackets in (5.2) - l 187: are numerical constants - l 193: want -> wants - l 229: strength is not in a correct form - l 232: give -> gives - l 242: both of the - l 249: 1+1\rho->1+\rho - l 263: propose -> proposed - l 264: involking -> invoking - l 267: give -> gives - l 270: inconsistent notation for the variance

Relation to Prior Work: Yes, in my opinion the literature review and connection to other relevant articles are well discussed.

Reproducibility: Yes

Additional Feedback: ******** Update after authors reply ************* The authors have successfully addressed my concerns and I also appreciated that they provided estimators with theoretical guarantees for the parameters V* and L*, making their method more practical. Incorporating the changes suggested by the referees have substantially improved the quality of the manuscript, hence I have decided to raise my assessment grade to 7.