NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1494 Stagewise Training Accelerates Convergence of Testing Error Over SGD

### Reviewer 1

Some detailed comments are provided below: 1. It is hard to understand what Remark 6 conveys. The error bound condition (assuming that this refers to Lemma 1) is derived from the PL condition, so what it means that “this error bound condition instead of the PL condition is enough to derive the results”. 2. The analytic strategy in this paper is to first establish the convergence of SGD for each stage and then extend it to K stages for the considered START algorithm. Then, derives the test error bound of the stagewise SGD. A further step is to show how the bound can affect or guide a specific choice of K in the practical stagewise SGD. 3, It might be better to use “START” in the paper title and Figure 1, instead of “SGD”, because the regularized terms have been added to the updating formula so it is not the vanilla SGD. It would be better to discuss the link between this regularized version and vanilla SGD to further shed light on the vanilla SGD. 4, This paper mentions “Corollary 2” several times, such as in Line 183 and Line 237. However, the paper does not have Corollary 2. This paper seems to be prepared in a hurry. 5, In Line 49, definition of \mu should be given before its use. In Line 123, the assumption $| f( w, z )|\leq 1$ is strong, which only works on the function value range [-1, 1] so the authors need to give more explanation why it does not lose the generality.

### Reviewer 2

This paper first performs an error decomposition of the ERM paradigm into classical statistical learning measures. These errors are bounded for the START algorithm introduced in this paper. The key concept used from statistical learning is uniform stability. While, I think this exercises is important and interesting, it was not quite clear to me which conclusion I should draw from the analysis. We know that simple algorithms, like SGD (even in non-convex cases), generalize well, so what is the main take-away message for me after reading this paper? That SGD has slow convergence is folk wisdom, so can't be the important take-away. • The terminology „testing error“ for the population risk is not very fortunate in my opinion. • Eq. 2: It should be written with respect to which variable derivatives are taken here. • If A is a randomized algorithm, what is then the meaning of f(A(S),Z)? Is A(S) a random variable, or a random probability measure (the latter point is for instance standard in game theory and also used in Shalev-Schwartz, Shamir, Srebro and Sridharan JMLR2010 „Learning, Stability and Uniform Convergence)? If you go for the second option, then there is no need to carry the heavy notation in the expectation operator. • Why is there a <= in the error decomposition after line 96? To me this looks as equality should hold. • line 119: Z_{i} can be replaced by a generic random variable, and the subscript in the expectation operator is not needed. • PL is defined here given a data sample. It is required to hold uniformly over all samples of the same length. In particular, the constant mu is deterministic. The same remark applies to the uniform bound \epsilon_{0} in line 121. • line 128: „whole“ • line 143: Expectation can be taken with respect to a generic random variable Z. There is no need for the subscript i. • I don’t understand Algorithm 2. It seems that F^{gamma}_{w} is never used in this routine? What does the function O represent at stage 5 of Algorithm 2? This becomes only clear after reading Theorem Section 4. • Line 256: “behaved”. • Line 257: I am confused by the way you want to choose theta. I assume you want theta to be very large? • Check the sentences in line 330 and 348.

### Reviewer 3

The paper analyzes algorithm 1, which is very similar to the SGD with piecewise constant learning rates that people use for deep learning in practice. The theorems result from separately bounding optimization and generalization error. The optimization bound depends on one-point weak quasi-convexity (or alternatively, weak convexity), and the generalization bound is a uniform stability result. Empirical results provide evidence that these assumptions approximately hold when training ResNet models on CIFAR-10. One question I have here is why \mu (a curvature variable) is plotted as only an average \mu, unlike \theta. In cases that the model converges to approximately zero loss (most cases), the assumptions appear to be holding along the optimization trajectory. I think that at times, the paper overstates its conclusions, starting with the title. As a theory paper, it should be more careful to state the limitations in practice. Eq. 5 is an especially strong assumption for deep learning. Comparing generalization performance using uniform stability upper bounds, which are very loose, can potentially result in misleading explanations as well. Overall, I find this to be a useful step forward for theoretical understanding of deep learning optimization. Update after author response period: the authors have agreed to add more discussion of the limitations of their results. I have read through the other reviews and responses, and I will keep my review the same.