NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6229
Title:Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Reviewer 1

1. The paper is mostly well written; the problem is well stated and the ideas and solution are very clear. I really liked that the authors present the intuitive understanding before stating the technical results; this helps the reader get the essence of the result without getting lost in subtle technicalities. In addition, the technical content is solid, rigorous and well presented. At some points I found the paper heavy on notation, but that is perhaps unavoidable here.

2. The authors introduce the notion of "learning order" - the order in which the learning algorithm learns "memorizable" (perhaps not the most apt word? It confused me for a moment) vs "generalizable" features. This is an interesting concept in its own right and has perhaps not been elaborated or used so precisely before, even on a toy distribution? Another good thing is that, instead of just stating the learning problem, they distinguish essential nuances with terminology like memorizable and generalizable, which is fairly intuitive. Moreover, this intuition about the difference between learning memorizable and generalizable features could reflect the folklore wisdom that finer features are learned later? As an aside, this not only provides conceptual understanding but could potentially help in exploiting domain knowledge in a principled way in real-world applications.

3. A basic question is how general the phenomenon is. The related work hardly discusses this; it instead focuses on large-batch vs small-batch papers, or surveys (perhaps not-so-related) topics like implicit regularization and adaptive gradient methods. I don't exactly see whether small batch vs large batch captures this phenomenon; if it does (because it's just a scaling?), the paper should say so explicitly. A short discussion of whether the phenomenon has been observed for different datasets/tasks with different optimizers would provide solid motivation and highlight the importance of understanding it. Even if it is observed only for a small subset of datasets (like vision tasks), this still helps to isolate that perhaps image-like distributions exhibit this memorizable vs generalizable behaviour.

4. At times, the writing is hand-wavy and confusing; for example, the concept of "memorizable and generalizable", though intuitive, is sketchy and not formally explained. I assume that the authors wanted to give the informal essence; however, since these notions are such an important part of the narrative, the authors should attempt to formalize them - perhaps by identifying them based on sample complexity and/or the complexity of the classifiers. Such a discussion is indeed attempted in lines 38-43, but could be fleshed out better. In particular, phrases like "that is learnable by a low-complexity classifier, but are inherently noisy" are ambiguous - what is "inherently noisy"? Another instance of hand-wavy language is line 72: "while adding noise before the activations which eventually gets annealed." What do you mean by "getting annealed"? Is "annealed" a technical term in optimization/learning? (Perhaps this is my ignorance.)

5. Although the learning problem is explained well, the presentation could be helped by a figure of the data distribution in 2D. Perhaps memorizable and generalizable features could also be illustrated with a figure?

6. How important for the analysis is the Gaussian noise injected at every step? Also, in the experiments section, line 282: "We test this empirically by adding small Gaussian noise during training before every activation layer in a WideResNet [37] architecture." I am just wondering why it's important to add the noise before the activation. Would it be fine to add the noise after the activation, or maybe just to the SGD iterates? A short comment on this specification would help.

7. There is hardly any discussion of the contribution in proof techniques. The authors remark that even though the analysis is inspired by "kernel" regimes, it is unlike other works since "In our analysis, the underlying kernel is changing over time" (line 100). In that case, what tools are used, and moreover, what analysis tools do the authors contribute that could perhaps be used more generally?

------------------------

I thank the authors for providing clarification to the questions. In light of this, I have increased my score by 1 point.
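To make the question in point 6 concrete, the three placements of the noise can be sketched as follows. This is only an illustrative sketch for a two-layer ReLU network in plain Python, not the authors' implementation; the function names (`forward`, `noisy_sgd_step`), the `where` argument and the `sigma` parameter are hypothetical names introduced here for illustration.

```python
import random


def relu(z):
    return max(0.0, z)


def forward(x, W, v, sigma=0.0, where="none", rng=None):
    """Forward pass of a two-layer ReLU network with optional Gaussian noise.

    where="pre":  noise is added to each pre-activation, then ReLU is applied.
    where="post": noise is added to each hidden unit after the ReLU.
    where="none": no noise is injected.
    Returns the scalar output and the list of hidden activations.
    """
    rng = rng if rng is not None else random
    hidden = []
    for w_row in W:
        z = sum(wi * xi for wi, xi in zip(w_row, x))
        if where == "pre":
            z += rng.gauss(0.0, sigma)  # noise can switch a ReLU unit on or off
        h = relu(z)
        if where == "post":
            h += rng.gauss(0.0, sigma)  # noise only shifts the unit's output
        hidden.append(h)
    output = sum(vj * hj for vj, hj in zip(v, hidden))
    return output, hidden


def noisy_sgd_step(w, grad, lr, sigma=0.0, rng=None):
    """One SGD step with Gaussian noise added directly to the iterate."""
    rng = rng if rng is not None else random
    return [wi - lr * gi + rng.gauss(0.0, sigma) for wi, gi in zip(w, grad)]
```

One visible difference in this sketch is that pre-activation noise can change which ReLU units are active, whereas post-activation noise and iterate noise do not act through the activation pattern of the current forward pass. Whether this difference matters for the analysis is exactly what the question asks.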

Reviewer 2

This is a very interesting theory paper showing that a neural network trained with a large learning rate and annealing generalizes better than the same network trained with a small learning rate. The authors construct a data distribution which contains two types of features (low-noise, hard-to-fit features and high-noise, easy-to-fit features). Under such a data distribution, the authors show that for a two-layer ReLU network trained with a large learning rate and the same network trained with a small learning rate, the order in which the two types of patterns are learned is different, which eventually results in the gap in generalization. In the experiments, the authors confirm on modified CIFAR-10 data that different learning rate schedules can indeed influence the learning order and generalization performance. The authors propose a fix for the small learning rate (injecting noise before the activations), which works both theoretically and empirically.

In the proof, the authors carefully design a data distribution which contains a low-noise, hard-to-fit feature (Q-feature) and a high-noise, easy-to-fit feature (P-feature). In the data distribution, a very small fraction of the data has only the P-feature, a large fraction of the data has only the Q-feature, and the remaining data has both the P-feature and the Q-feature. With the large learning rate and annealing, the network first learns the P-feature and learns the Q-feature after the annealing. In contrast, the network with the small learning rate quickly memorizes the Q-feature and can only learn the P-feature from the samples that have only the P-feature. Since the number of samples with only the P-feature is small, the network can only learn a small margin, which results in bad generalization performance on samples with only the P-feature.

Here are my major comments:

1. In the paper, the authors consider logistic loss with l_2 regularization. I was wondering whether this analysis can be extended to other losses (for example, mean square loss). It would also be good to explain why regularization is needed here.

2. I feel this result requires the fraction of samples with only P-features to be small; otherwise the network with the small learning rate can learn the P-feature well just from these samples. So I was wondering whether it is possible to identify a data distribution with only one type of feature in which the large learning rate schedule still generalizes better than the small learning rate. Of course, in this case the order of learning features is the same (there is only one feature) and the generalization gap must be due to some other reason. Although this is beyond the scope of this paper, it would still be good to discuss such possible directions.

-----------------------------------------------------------------------

I have read the authors' response, which resolves most of my concerns. I think this is a very interesting theory paper. I will keep my score as it is.
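The concern in comment 2 about the size of the P-only group can be illustrated with a small sampler. This is a rough sketch of the three-group structure summarized above, not the paper's actual construction; the group names and the fractions `frac_p_only` and `frac_q_only` are hypothetical values chosen only for illustration.

```python
import random
from collections import Counter


def sample_group(rng, frac_p_only=0.05, frac_q_only=0.60):
    """Draw which features an example carries.

    The fractions are illustrative assumptions: a very small share of
    examples has only the P-feature, a large share has only the Q-feature,
    and the remaining examples have both.
    """
    u = rng.random()
    if u < frac_p_only:
        return "P_only"
    if u < frac_p_only + frac_q_only:
        return "Q_only"
    return "both"


def sample_groups(n, seed=0, **fracs):
    """Count how many of n sampled examples fall into each group."""
    rng = random.Random(seed)
    return Counter(sample_group(rng, **fracs) for _ in range(n))
```

Raising `frac_p_only` in this sketch corresponds to the regime raised in comment 2, in which the P-only samples alone might already suffice to learn the P-feature.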

Reviewer 3

Originality: This is the first paper to study the implicit regularization of large learning rate training theoretically and rigorously. Constructing a simple task that can rigorously show that large learning rate training outperforms small learning rate training is highly non-trivial.

Quality and clarity: I did not check the math of this paper. The paper is well written and the authors make an effort to help readers gain the intuition behind the paper. In my opinion, this is a very novel theory paper. I believe many researchers will explore these ideas in depth in the future.

Update: I thank the authors for their response and will keep my score.