Review for NeurIPS paper: Quantitative Propagation of Chaos for SGD in Wide Neural Networks

NeurIPS 2020

Review 1

Summary and Contributions: This article considers the infinite-width limit of SGD for one-hidden-layer networks in which only the first layer is trained and the learning rate is allowed to depend on the number of hidden units. The main result is the identification of two different limiting behaviors for the weight dynamics. In the first regime, when the learning rate is below a critical threshold, the limiting dynamics of the particles (weights) is a deterministic ODE in which the weights of different neurons are independent. In the second regime, in contrast, the limiting dynamics of the weights is an SDE in which the weights of different neurons are again independent. Numerical results from training on MNIST and CIFAR10 show excellent agreement between the statistics of the weights learned by a wide network and the corresponding theoretical predictions.
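As a rough illustration of the two regimes described in this summary, the toy sketch below runs noisy gradient descent on the scalar loss 0.5 * w**2. It is not the paper's model or its experiments: the quadratic loss, the Gaussian gradient noise, the function name `simulate_particles`, and the effective per-particle step h = gamma * N**(beta - 1) are all assumptions made for illustration. Over the same continuous-time horizon, the spread across particles is much smaller for beta = 0.5 than for beta = 1 at the same N, which mirrors the qualitative ODE-versus-SDE distinction.

```python
import random
import statistics


def simulate_particles(N, beta, gamma=0.1, horizon=5.0,
                       particles=200, noise_std=1.0, seed=0):
    """Variance across particles after noisy gradient descent on 0.5 * w**2.

    Toy model only (an assumption, not the paper's setting): every particle
    starts at w = 1 and sees an independent Gaussian perturbation of its
    gradient at each step.
    """
    rng = random.Random(seed)
    h = gamma * N ** (beta - 1.0)   # assumed effective per-particle step
    steps = round(horizon / h)      # same continuous-time horizon for any beta
    finals = []
    for _ in range(particles):
        w = 1.0
        for _ in range(steps):
            noisy_grad = w + noise_std * rng.gauss(0.0, 1.0)
            w -= h * noisy_grad
        finals.append(w)
    return statistics.pvariance(finals)


# Sub-critical scaling (beta < 1): the noise averages out as h shrinks with N.
var_small = simulate_particles(N=400, beta=0.5)
# Critical scaling (beta = 1): h does not depend on N, so the noise persists.
var_critical = simulate_particles(N=400, beta=1.0)
```

In this toy the long-run variance is approximately h * noise_std**2 / (2 - h), so it scales with the effective step; whether the same mechanism carries over to the paper's setting is exactly what the theorems under review address.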

Strengths: The question of the effect of a large learning rate is interesting both practically and theoretically. This paper obtains a strong result on this subject. Not only is it shown that there is a qualitative (and quantitative) difference between large and small learning rates, but the article in fact obtains a formula for the limiting dynamics and interprets it as a discretization of McKean-Vlasov on empirical measures of weights.

Weaknesses: The main weakness of this article is the restricted setting in which it applies. Namely, the authors consider training just the first layer weights (with output weights set to 1/#neurons) in a one layer network. This is a reasonable starting case but already having networks with more than one layer is very interesting due to the fact that neurons are no longer independent.

Correctness: I did not check every detail of the derivations, but the method seems correct.

Clarity: For the most part, yes, but there are a couple of specific points that I think should be clarified (see below).

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I have a few minor comments.

(1) The overall discussion of the learning rate is somewhat confusing and could be clarified. Specifically:

(1a) Depending on how one thinks about it, the learning rate in previous papers on infinite width SGD depends on the number of hidden units. What I mean is that you explicitly put in the 1/N as the size of the weights into the last layer. This has the effect of putting a 1/N into the derivative d Loss / d W, where W is a weight in the first layer, which is akin to putting an extra 1/N into the learning rate. (I understand you put the 1/N into the definition since you wanted a well-defined limit.) In previous papers (NTK-type analyses in deeper networks), this scale of weights is sometimes N^{-1/2}.

(1b) The form of the learning rate \gamma N^\beta ( n + \gamma_{\alpha\beta}(N)^{-1} )^{-\alpha} is hard to understand intuitively. Indeed, while I understand that this is what you need to get a mean field limit, the factor of N^\beta out in front seems confusing and is only properly understood, I think, when you realize that the 1/N from the weights in the 2nd layer actually turns it into an N^{\beta-1} (this also clarifies why \beta=1 is special). Also, what is the role of the n? This tells me that, unless \alpha=0, the step size decays as a function of time at a very specific rate. Is that important/interesting?

(2) After Corollary 3, you say that the result is valid for the whole trajectory and not just for a fixed time horizon. This is confusing since the constants C_{m,T} in Theorems 1 and 2 depend on T. It would be nice if you explained a bit more here what is meant.

POST REBUTTAL UPDATE: I have read the authors' feedback. I am glad that the authors will put in an extended discussion of how to interpret the learning rate in terms of alpha, beta, N, n. I think this will make the paper more readable. My overall assessment of the article remains positive (7/10).
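The observation in point (1b) can be made concrete with a short calculation. The sketch below assumes a network with output weights fixed to 1/N; the symbols F (per-neuron feature map), \partial_1 \ell (derivative of the loss in its first argument), and \eta_n (step size at iteration n) are notation introduced here for illustration and may differ from the paper's.

```latex
% Assumed notation (not taken from the paper): F is the per-neuron feature map,
% \ell is the loss, and f_N is evaluated at the current weights.
f_N(x) = \frac{1}{N} \sum_{k=1}^{N} F(w_k, x),
\qquad
\nabla_{w_k} \ell\bigl(f_N(x), y\bigr)
  = \frac{1}{N}\, \partial_1 \ell\bigl(f_N(x), y\bigr)\, \nabla_w F(w_k, x).

% One SGD step with \eta_n = \gamma N^{\beta} (n + \gamma_{\alpha\beta}(N)^{-1})^{-\alpha}:
w_k^{n+1} = w_k^{n}
  - \gamma N^{\beta - 1} \bigl(n + \gamma_{\alpha\beta}(N)^{-1}\bigr)^{-\alpha}\,
    \partial_1 \ell\bigl(f_N(x), y\bigr)\, \nabla_w F(w_k^{n}, x).
```

Under these assumptions the per-neuron step scales as N^{\beta-1}, which vanishes as N grows when \beta < 1 and stays of order one when \beta = 1.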

Review 2

Summary and Contributions: This paper performs a mean-field analysis of the behavior of SGD on a 1-hidden layer neural network with number of neurons N. One question is how the step-size-dependence-on-N affects the behavior as N grows, both in the case of a step size that is fixed over iterations, and in the case where the step size decays as a power law over time. Roughly speaking, there is a step-size-dependence-on-N like $N^\beta$ that emerges from the analysis, for which the $\beta \in [0, 1)$ regime behaves differently from the $\beta = 1$ regime. The paper runs some experiments on MNIST and CIFAR to show 1-layer networks converging in distribution as N grows.

Strengths: The main novelty of this work is in studying the $\beta > 0$ case; the $\beta = 0$ case has been studied before. I think the regime change result is interesting: it suggests that one can choose a larger step size for wide networks as long as the step size is sublinear in the width.

Weaknesses: The theoretical work requires pretty strict assumptions in the form of A1-(a-d), in particular A1-b rules out non-smooth activation functions, as well as functions that are unbounded for fixed x. Perhaps one way in which the results could be more thorough is to cover what happens in cases where the step size is still proportional to $N^{\beta}$ at high iteration number, but not with the exact $\gamma_{\alpha, \beta}(N)$ additive factor.

Correctness: The experiments use the ReLU, yet the assumptions require the non-linearity to be thrice differentiable. By and large though, the approach seems correct.

Clarity: I find the paper clearly written.

Relation to Prior Work: Yes, the work describes several prior works that deal with subcases of the more general framework it puts forward.

Reproducibility: Yes

Additional Feedback: Some things that seemed like typos to me: Line 91: $\ell$ is a function of $\mathbb{R} \times \mathbb{R}$, I think it should be $\mathbb{R} \times \mathsf{Y}$? Line 110: missing period. Line 207: Should be "(7) if $\beta \in [0,1)$"? EDIT 8/22/2020 The author response says the learning rate will be generalized from $(n + \gamma_{\alpha, \beta} (N) N^{-1})^\alpha$ to $(n + c)^\alpha$. I think this really helps the usability of the result of this paper in further work, and it pushes the paper up from marginal accept to an accept for me.
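To make the generalized schedule mentioned in this edit easier to picture, here is a minimal sketch of a step size of the form gamma * N**beta * (n + c)**(-alpha). The function name `step_size`, the default values, and the sign convention for the decay (a negative power, following the form quoted in Review 1) are assumptions for illustration; `c` stands for the generic positive offset and is not a value taken from the paper.

```python
def step_size(n, N, gamma=0.1, alpha=0.0, beta=0.0, c=1.0):
    """Hypothetical schedule gamma * N**beta * (n + c)**(-alpha).

    n     : iteration index (0, 1, 2, ...)
    N     : number of hidden units
    alpha : decay exponent over iterations (alpha = 0 gives a constant step)
    beta  : exponent of the width dependence
    c     : assumed positive offset so the step is finite at n = 0
    """
    if c <= 0:
        raise ValueError("c must be positive")
    return gamma * N ** beta * (n + c) ** (-alpha)


# Constant step (alpha = 0) with linear width scaling (beta = 1).
constant_step = step_size(n=0, N=100, gamma=0.1, alpha=0.0, beta=1.0)
# Decaying step (alpha = 0.5) with no width scaling (beta = 0).
decayed_step = step_size(n=3, N=100, gamma=0.1, alpha=0.5, beta=0.0, c=1.0)
```

With these inputs the first call evaluates 0.1 * 100 and the second evaluates 0.1 * 4**(-0.5), so the width and iteration dependence can be inspected separately.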

Review 3

Summary and Contributions: This paper studies SGD dynamics for two-layer neural networks, approximating them by a mean field diffusion in the large width limit. In the mean field diffusion, each neuron evolves independently according to a diffusion whose coefficients are functions of the overall density. The main results are bounds on the distance between the original SGD dynamics and the mean field model. The authors study a family of scalings of the stepsize and show that, depending on the scaling, the mean field diffusion is deterministic or not.
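For readers less familiar with the terminology, the kind of limiting object described in this summary can be written in the generic McKean-Vlasov form below. The coefficients b and \Sigma are placeholders introduced here; the paper's specific drift and diffusion terms are not reproduced.

```latex
% Generic McKean-Vlasov form; b and \Sigma are placeholder coefficients.
\mathrm{d}W_t = b(W_t, \mu_t)\,\mathrm{d}t + \Sigma(W_t, \mu_t)^{1/2}\,\mathrm{d}B_t,
\qquad \mu_t = \mathrm{Law}(W_t).
% Deterministic regime: \Sigma \equiv 0, so the equation reduces to the ODE
% \mathrm{d}W_t = b(W_t, \mu_t)\,\mathrm{d}t, still coupled through the law \mu_t.
```

The coupling through \mu_t is what makes each neuron's dynamics depend on the overall density while different neurons remain independent in the limit.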

Strengths: 1) This work identifies a new regime in which the mean field dynamics of each neuron is non-deterministic. This is an interesting phenomenon. 2) The mathematical analysis is quite sophisticated.

Weaknesses: 1) In the simpler deterministic regime of earlier work (corresponding to small stepsize), this paper does not seem to provide stronger results or new insights. 2) It is unclear what insight emerges from this new analysis that was not already in previous work. 3) It is unclear what the difference between the two regimes is from a machine learning point of view. Simulations point at a difference in generalization error, but it is unclear whether the theory sheds any light on this difference. Also, it is unclear whether this difference is generic.

Correctness: Both the claims and approach seem reasonable. I did not check the proof details.

Clarity: The paper is quite clear.

Relation to Prior Work: The comparison with earlier work seems to imply that the novelty of the new results is that they are quantitative, while previous work only established asymptotics. This is wrong. Several earlier papers obtained quantitative bounds very similar to the current one (for instance [23,25]). The real novelty is in identifying a new non-deterministic limit.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The paper considers letting the gradient scaling of two-layer over-parametrized models, with the second layer having fixed weights, vary with the number of neurons N. The paper shows that the choice of gradient scaling with N leads to two different mean-field behaviors: a deterministic regime that has already been studied and a new stochastic one that takes the form of a McKean-Vlasov diffusion.

Strengths: The paper’s claims are particularly elegant in that they derive from a “simple” modification of the gradient scaling as a function of the number of neurons N. The claims relating to the limiting behavior as N->\infty are supported by extensive theoretical results as well as empirical studies using real datasets.

Weaknesses: None worth discussing. The applicability of the results may be limited at this point, but the paper is likely an important stepping stone towards a better characterization of the generalization properties of neural networks.

Correctness: I am unable to fully ascertain the correctness of the results given the extensive supplementary materials needed to support them. However, I have found no obvious flaws, and the empirical studies do support the theoretical results.

Clarity: This is a very theoretical paper that is dense in notation, with results left to the supplementary materials; this is not the easiest material to present concisely without losing the reader. To the extent that it can be done, I would argue this paper hits the mark. That being said, I do wonder about the assumptions in A1 and whether they belong in the paper rather than the supplementary materials. They are presented without much explanation, and from then on, they mostly serve as references in the theorems and propositions. An exposition of their significance and why they are needed, with the details left to the supplementary materials, might have been a more effective use of space.

Relation to Prior Work: The paper fits in a much larger context of understanding gradient descent for overparametrized models and the authors take the time to provide a clear exposition of where this particular contribution fits and in what sense it is an extension of previous work.

Reproducibility: Yes

Additional Feedback: Very interesting results. The use of only one layer with trainable weights makes sense from a theoretical perspective, but it might still be interesting to find out what happens in practice if the parameters of both layers can be learned. Is that something you’ve explored? Do the one-layer results represent a decent approximation, or is a completely different dynamic at play in that case? I do realize this would amount to pure numerical studies, which is not the point of this paper, but I can’t help but wonder how different the two cases are in the over-parametrized limit.