NeurIPS 2020

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate


Review 1

Summary and Contributions: UPDATE: The authors adequately addressed my questions. My score remains the same. This paper studies the dynamics of a scale-invariant function optimized under gradient descent with weight decay (e.g. neural networks with batchnorm). It begins with a nice survey of recent results as well as apparent contradictions between these theories and empirical phenomena. Then it proposes a new SDE model of such optimization dynamics, from which it derives a few implications, such as the intrinsic learning rate and its relation to effective weight decay. Based on empirical evidence, the paper raises two conjectures that such dynamics converge rapidly to an equilibrium distribution. In the context of the SDE framework and these conjectures, the authors revisit empirical observations and re-interpret them. Finally, an array of experiments is presented to verify the theoretical insights.
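For context, the objects this summary refers to can be written out schematically (a generic sketch of the standard scale-invariant setting, not the paper's exact equations). For a scale-invariant loss L, i.e. L(cX) = L(X) for all c > 0, the gradient satisfies \langle \nabla L(X), X \rangle = 0 and \nabla L(cX) = \nabla L(X)/c. SGD with learning rate \eta and weight decay \lambda is then modeled by an SDE of the schematic form

dX_t = -(\nabla L(X_t) + \lambda X_t) dt + \sqrt{\eta} \, \Sigma(X_t)^{1/2} dW_t,

where \Sigma is the minibatch gradient-noise covariance and conventions for the timestep dt vary; the intrinsic learning rate is the product \lambda_e = \eta \lambda.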

Strengths: The paper is very nicely written, and I learned a lot from reading it. The survey of prior results is coherent and adequately motivates the questions investigated in this paper. The SDE analysis, though nothing sophisticated, is clean, and this simultaneous simplicity and clarity should be applauded. One of the main claims in this paper is that, when training a normalized neural network for a long time, the end result only depends on the intrinsic learning rate, and not the initial conditions. This is well supported by the experiments in the paper, and I’m relatively convinced (though I have questions about the time needed to reach this equilibrium, below). I have personally wondered why dropping the learning rate suddenly increases the test error, which then settles down again (though I am not sure whether this is a particularly important question in the grand scheme). This paper gives a very nice explanation of this, which I find satisfying. I also find interesting the interpretation of normalized SGD training as a combination of an SDE phase and a gradient flow phase.

Weaknesses: Among several others, your paper makes two concrete predictions:
1. When dropping the learning rate by 10, the intrinsic learning rate drops by 10 immediately (this is obvious), but it eventually converges to sqrt(10).
2. Reaching equilibrium takes O(1/\lambda_e) steps.
I’d like to see experiments measuring and verifying them, or, if such results are already in the paper, have them be more prominent and linked to where these predictions are discussed. For example, I’d like to see a plot of 1/\lambda_e vs. “steps to convergence”, which should be linear if your prediction is correct (a toy sketch of such a measurement follows this list).

Other questions I have:
1. Recent works suggest that gradient noise may be heavy-tailed [1]. How would that change your theoretical insights?
2. Recent works indicate that batchnorm causes severe gradient explosion in deep networks at initialization [2]. Would your experiments still hold in such deep networks (say, a depth-100 BN MLP)?
3. What happens if you drop the learning rate before equilibrium (I assume this is common in practice)? Is the performance better or worse? Do networks in practice reach equilibrium in the typical training time frame?
4. You show that the performance of the small-LR equilibrium is better than the large-LR equilibrium. If I just want to capture the best performance right after the learning rate drop, is a small or a large LR preferable?
5. Another increasingly popular LR schedule is linear warmup followed by linear decay. What does this theory say about it? In my experience, with such a schedule the performance is much more sensitive to the learning rate than to the weight decay, so some nontrivial correction seems to be needed to the theory here, at least to the prediction that “\lambda_e determines everything”.
6. What is the “number of trials” in fig 2(b)?
7. Does the theoretical prediction that “\lambda_e determines everything” hold when there’s momentum?
8. In fig 4(b), why do the solid curves (log weight norm) diverge from each other?

Typos:
1. Line 194: effective LR is gamma^-1/2, not gamma^1/2.
2. Line 237: “equilibrium” typo (missing “i”).
3. Line 300: should it be norm^-1/2 instead of norm^-2?

[1] Simsekli et al. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. https://arxiv.org/abs/1901.06053
[2] Yang et al. A Mean Field Theory of Batch Normalization. https://arxiv.org/abs/1902.08129
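To make the requested measurement concrete, the following is a minimal toy sketch (in Python) of how one could estimate "steps to equilibrium" as a function of \lambda_e and check whether it grows roughly linearly in 1/\lambda_e. It is not the paper's experiment: the synthetic scale-invariant loss, the isotropic noise model, and the convergence criterion are all illustrative assumptions.

# Toy sketch of the requested measurement: "steps to equilibrium" vs 1/lambda_e.
# NOT the paper's experiment. It runs SGD with weight decay on a synthetic
# scale-invariant loss L(w) = -<w, v>/||w|| with added gradient noise, and declares
# "equilibrium" once the weight norm stops drifting between consecutive windows.
import numpy as np

rng = np.random.default_rng(0)
d = 50
v = rng.normal(size=d)
v /= np.linalg.norm(v)

def noisy_grad(w, noise_scale=1.0):
    # Gradient of the scale-invariant loss -<w, v>/||w||, plus isotropic noise
    # standing in for minibatch noise.
    n = np.linalg.norm(w)
    g = -(v - (w @ v) * w / n**2) / n
    return g + noise_scale * rng.normal(size=d)

def steps_to_equilibrium(lr, wd, max_steps=200_000, window=2000, tol=0.02):
    w = rng.normal(size=d)
    norms = []
    for t in range(max_steps):
        w = (1.0 - lr * wd) * w - lr * noisy_grad(w)   # SGD step with weight decay
        norms.append(np.linalg.norm(w))
        if (t + 1) % window == 0 and t + 1 >= 2 * window:
            recent = np.mean(norms[-window:])
            previous = np.mean(norms[-2 * window:-window])
            if abs(recent - previous) / previous < tol:  # norm has stopped drifting
                return t
    return max_steps

lr = 0.1
for wd in [4e-3, 2e-3, 1e-3, 5e-4]:
    lam_e = lr * wd
    steps = steps_to_equilibrium(lr, wd)
    print(f"lambda_e = {lam_e:.1e}   1/lambda_e = {1/lam_e:8.0f}   steps ~ {steps}")
# If the O(1/lambda_e) prediction holds in this toy, the reported step counts
# should grow roughly in proportion to 1/lambda_e.

In a real experiment one would replace the toy loss with the actual network and track a statistic such as the effective learning rate or the test error instead of the raw weight norm.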

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper extends the SDE perspective of SGD to incorporate the combined influence of weight decay and scale invariance (which arises when normalization methods are used). This leads to a number of interesting insights, most strikingly that the equilibrium distribution of constant-learning-rate SGD in the long-time limit is governed by an effective learning rate equal to the learning rate times the weight decay coefficient. Empirically, the authors observe evidence that the mixing time into this equilibrium distribution is surprisingly short.

Strengths: The authors tackle an important problem and identify some surprising and potentially practically useful conclusions. I believe this could prove to be an important contribution to the field which others may build on.

Weaknesses: In some places the authors over-claim, and a number of the authors' comments felt misleading (see comments below). However, I think many of these issues are easily resolved.

Correctness: I have not checked the derivation in detail.

Clarity: The paper reads well and is mostly easy to follow despite the technical content. However the figures are poorly presented and difficult to interpret.

Relation to Prior Work: The authors give a good discussion of prior work.

Reproducibility: Yes

Additional Feedback: Edit: I thank the authors for their response and am happy with their comments.
1) The presentation of the original SGD=SDE analysis is missing the key step, which is to identify that the minibatch noise is inversely proportional to the batch size. This allows us to identify a temperature T = learning rate / B, and consequently we are able to approach the SDE limit by reducing the learning rate while simultaneously reducing the batch size (of course, a key criticism of the SDE limit is that the batch size is bounded below by 1). The role of the batch size is also not clarified in the scale-invariant SDE? (A short sketch of this standard argument is included after this list.)
2) The authors argue that their results indicate that large learning rates do not generalize well, but a better presentation would be to say that they show that large effective learning rates generalize well. While one can make the naive learning rate small by changing the weight decay, this is no different from making the learning rate small by changing the batch size. It does not contradict the claim that finite learning rates aid generalization.
3) The authors claim that the fast equilibrium conjecture explains the benefits of batchNorm. This statement is too strong. Note that the scale-invariant SDE also applies to layerNorm/instanceNorm, yet these methods generalize significantly worse than BN.
4) Additionally, the primary benefits of BN arise in resNets, and previous work has shown that this occurs because BN preserves signal propagation at initialization in resNets. This property is not captured by the analysis here.
5) Usually the SDE is defined by identifying the learning rate with the timestep dt. Here the authors introduce dt explicitly. Does this alter the analysis?
6) The authors suggest that their work will criticise the Gaussian noise assumption, yet their SDE appears to assume Gaussian noise?
7) "Gradient descent not equal to gradient flow": this section appears to simply note that gradient descent with finite learning rates is not gradient flow and can be chaotic in some landscapes. Am I missing something?
8) I did not follow how the SWA experiment indicates that SGD has not equilibrated? It appears to simply indicate that SGD is fluctuating in a local minimum (consistent with equilibrium). Furthermore, the authors' main argument later is that SGD does equilibrate quickly (with BN)?
9) The authors make a striking claim that changing the initial learning rate is equivalent to changing the initialization scale. Are there any experiments to verify this?
10) Note that it is already recognized that the original step-wise schedules generalize poorly. Popular modern schedules (e.g. cosine decay) combine a large initial learning rate with rapid decay at late times. Intuitively, it is believed that this rapid decay is beneficial precisely because it prevents equilibration, thus preserving the generalization benefit of the initial learning rate. I therefore did not find the MNIST results very surprising, since the step-wise decay schedule used allows training to equilibrate after each drop. I would also encourage the authors to extend their learning rate sweep to smaller learning rates to identify the optimum.
11) The authors are correct to note that, in the step-wise schedule, the large initial learning rate primarily enables fast convergence. However, their proposed schedule still comprises two stages: an initial finite (i.e. large) learning rate stage, followed by very small learning rates to simulate gradient flow. Could they clarify whether they are arguing that gradient flow generalizes as well as finite learning rates or not? (assuming infinite compute budgets)
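For reference, the standard noise-scale argument raised in point 1 can be sketched as follows (this is the usual heuristic from the SDE literature, with generic notation, not the paper's derivation). A minibatch SGD step is

\theta_{t+1} = \theta_t - \eta \hat{g}_B(\theta_t), with \hat{g}_B = \nabla L(\theta_t) + \xi_t and Cov(\xi_t) \approx \Sigma(\theta_t)/B,

which, identifying one step with time dt = \eta, is approximated by

d\theta_t = -\nabla L(\theta_t) dt + \sqrt{\eta/B} \, \Sigma(\theta_t)^{1/2} dW_t.

The noise magnitude is therefore controlled by the temperature T = \eta/B, so reducing the learning rate and the batch size in proportion leaves the limiting SDE unchanged.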


Review 3

Summary and Contributions: The paper analyzes modern models with BN layers that are trained with an SGD optimizer using an LR step schedule and weight decay (WD). The paper formulates SGD as an SDE and defines the intrinsic LR, which is the product of the LR and the WD. They show that the number of steps needed to reach equilibrium scales inversely with the intrinsic LR, which thus controls the model's convergence. Furthermore, they show that a small LR can perform equally well, and that if the model is forced to reach equilibrium (which requires more steps with a small LR) we can get even better results.

Strengths: The paper has a strong theoretical analysis and jointly examines the connection between LR, WD, and BN. The paper challenges common DNN training practice and suggests a new method to investigate model convergence.

Weaknesses: Although the paper offers many different observations, proofs, and interpretations, I found it not cohesive and thus hard to follow. For instance, although large-batch training is mentioned in the related work, the authors do not explain how the intrinsic LR affects it. Also, despite the extensive amount of experiments, I found the figures hard to understand and unintuitive.

Correctness: I did not find any errors in the derivation, and the experimental setting seems fair. I would like to emphasize that even one experiment on a larger dataset demonstrating the importance of the iLR would have been beneficial.

Clarity: There are some typos and grammar issues, but I think the main caveat is the lack of one cohesive storyline. Although mentioned in the introduction, I am not sure how their method translates to other normalization schemes (besides BN), how the iLR is affected by large batch sizes, or how it explains the SWA boost. Similarly, their code is neither clean nor documented, so it is hard to understand how to run it and reproduce/extend their results.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Sadly, despite the great potential this paper has, I recommend that the authors rewrite it, as it is currently very hard to follow and I believe many important observations and interpretations do not get enough attention.
--------- After Rebuttal --------
I would like to thank the authors for their answers. The response addressed my concerns. I will raise my score to 7.


Review 4

Summary and Contributions: This work applies stochastic differential equation analysis to the learning process of networks using batch normalization. It concludes that the final equilibrium does not rely on the initial state but only on one parameter, \lambda_e, the product of the learning rate and the weight decay. The authors also conjecture that the learning time to reach equilibrium is inversely proportional to \lambda_e, rather than an exponential time bound. This paper is generally interesting as it delves into the mechanisms of learning with BN, and the conclusions are also of interest to the community for better designing learning strategies and algorithms. My concerns are mainly about the assumptions and the small-scale experimental verification. Instead of analyzing BN specifically, this study is actually limited to a general normalization + independent noise framework. To experimentally verify its assumptions and conclusions, the authors should also provide results on larger datasets.
----------Post rebuttal comments-----------------
The rebuttal has addressed my concerns and explains the work well within its framework, therefore I would like to raise my score to 7.

Strengths: This study applies stochastic differential equation analysis to the learning dynamics of networks with BN, which is important for the theoretical development of learning dynamics. The derivation and writing are also neat. The conclusion that the equilibrium state only depends on the "intrinsic learning rate" is also interesting to the community for the understanding of BN.

Weaknesses:
1. This study is conducted under the assumption of a Wiener process and a continuity limit, on which the conclusion that the equilibrium state does not rely on the initial learning rate and initialization strongly depends. Actually, the performance of weight normalization + gradient noise, which should be a better candidate for the current analysis, does not reach as high as BN. Therefore, the mechanism of the learning dynamics of BN is only partially addressed in this study.
2. As for the experimental verification, only MNIST and CIFAR-10 examples are shown here. The authors should show its effectiveness on larger datasets such as ImageNet to gain more credit.
3. As also pointed out by the authors, mixing in parameter space does not occur for optimization without weight decay, since the weight norm monotonically increases (a one-line identity illustrating this is sketched after this list). The authors should give a short discussion of the threshold of weight decay below which learning falls into the "no mixing in parameter space" zone.
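Regarding point 3, the monotone growth of the weight norm without weight decay follows from a standard one-line identity for scale-invariant losses (sketched here for context; it is well known and not specific to this paper). Assuming each minibatch loss is itself scale-invariant, so that \langle g_t, w_t \rangle = 0 for the stochastic gradient g_t, one SGD step w_{t+1} = (1 - \eta\lambda) w_t - \eta g_t gives

||w_{t+1}||^2 = (1 - \eta\lambda)^2 ||w_t||^2 + \eta^2 ||g_t||^2.

With \lambda = 0 this reduces to ||w_{t+1}||^2 = ||w_t||^2 + \eta^2 ||g_t||^2 \ge ||w_t||^2, so the norm never decreases and no mixing over weight scales can occur; any \lambda > 0 restores a contraction term, and the requested discussion of a threshold would compare this (1 - \eta\lambda)^2 contraction against the \eta^2 ||g_t||^2 expansion.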

Correctness: The derivation in this study is generally correct to my knowledge, except for the assumptions adopted, as I stated above.

Clarity: This paper is clearly written.

Relation to Prior Work: This study has addressed its difference with the prior work.

Reproducibility: Yes

Additional Feedback: