Summary and Contributions: This paper studies the optimization and generalization of a two-layer, infinite-width neural network with a three-times differentiable activation function and weight decay, trained by noisy gradient descent. The considered scaling factor covers both the mean-field and neural tangent kernel regimes.
Strengths: - extends the results of Mei et al. (2019) to noisy gradients and a weight decay regularizer - provides generalization guarantees (estimation error bounds) for such two-layer neural networks
Weaknesses: After reading the rebuttal, I find that the authors have addressed my concerns on alpha = poly(n, d, lambda). I have increased my score to 6.
==============================================================
The quality of this paper is okay, but I have doubts about the assumptions/conditions used.
- One key issue concerns the minimal eigenvalue of the NTK Gram matrix, which largely affects the trainability and generalization properties. As indicated by Thm 2 in [S1], the minimal eigenvalue of the NTK converges at an O(n^{-1/2}) rate, so in this case \lambda_0 \propto n^{-3/4}; alternatively, it scales at a constant order O(1), which implies \lambda_0 \propto n^{-1/2}. This deserves a detailed discussion, since it directly affects the presented theorems and would help rule out unattainable cases. [S1] On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective. NeurIPS, 2019.
For example,
1) Thm 4.4: if \lambda_0 := n^{-3/4}, the condition (4.1) in Thm 4.4 actually becomes \alpha > \lambda^{1/2} n^{2/3}, which appears difficult to reconcile with the mean-field case \alpha = 1. This is because, in learning theory, the regularization parameter is typically taken as \lambda := n^{-a} with 0 < a \leq 1. The same issue exists in Thm 4.5 with \alpha > \sqrt{n \lambda}.
2) In Thm 4.4 (also Lemma 5.3), the bound D_{KL}(p_t || p_0) \leq O(\alpha^{-2} \lambda_0^{-4}) appears to increase with n. This divergence looks strange to me; how can it be explained? (A quick numeric sketch of these scalings is given below.)
3) The above bound on D_{KL}(p_t || p_0) \leq O(\alpha^{-2} \lambda_0^{-4}) also makes me scrutinize the assumption D_{KL}(p_true || p_0) < +\infty in Thm 4.5. I am not sure this assumption is attainable.
Overall, it is important to discuss whether the conditions and assumptions used in the presented results are attainable/mild. I am ready to read the authors' arguments about them.
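To make the scaling concern in points 1) and 2) concrete, here is a quick numeric sketch in my own notation; the rates \lambda_0 \propto n^{-3/4}, \lambda = n^{-a} with a = 1/2, and \alpha = 1 are assumptions taken from the discussion above, and all constants in the paper's theorems are ignored:

```python
import math

# Quick numeric sketch of the scalings discussed in points 1)-2) above (constants ignored).
# Assumptions: lambda_0 ~ n^{-3/4} (from [S1]), lambda = n^{-a} with a = 1/2, alpha = 1 (mean-field).
for n in [1e3, 1e5, 1e7]:
    lam0 = n ** (-3 / 4)                     # assumed minimal NTK eigenvalue rate
    lam = n ** (-1 / 2)                      # regularization parameter lambda = n^{-a}, a = 1/2
    alpha = 1.0                              # mean-field scaling
    alpha_lower_bound = math.sqrt(n * lam)   # condition alpha > sqrt(n * lambda) as I read Thm 4.5
    kl_bound = alpha ** (-2) * lam0 ** (-4)  # O(alpha^{-2} lambda_0^{-4}) bound from Thm 4.4 / Lemma 5.3
    print(f"n = {n:.0e}: need alpha > {alpha_lower_bound:.1f}, but alpha = 1; "
          f"KL bound ~ {kl_bound:.1e} (grows with n)")
```

Under these assumed rates the required lower bound on \alpha grows as n^{(1-a)/2} while the KL bound grows as n^3, which is exactly the divergence I am asking about in point 2).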
Correctness: The presented results appear technically sound to me, although I did not check them very carefully. Nevertheless, I have doubts about the conditions/assumptions used.
Clarity: The paper is well written and easy to follow.
Relation to Prior Work: This paper clarified the difference between their work and previous work.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: Prior works have established convergence of gradient descent on sufficiently wide two-layer neural networks. This paper aims to extend this result to the setting with noisy gradients and weight decay regularization.
Strengths: Convergence of gradient descent with Gaussian noise and regularization is proved, under certain assumptions. Moreover, the convergence rate is shown to be linear.
Weaknesses: The authors claim without justification that the standard tools used in the NTK regime CANNOT handle noisy gradients and regularizers (Lines 28 and 168). It is not clear whether prior works simply did not analyze these two scenarios (e.g., due to limited space), or whether these two scenarios cannot be handled in principle. In the first case, the results in this paper would not be very significant and could be considered a natural extension of previous works. So, I suggest the authors spend a section or so providing a detailed analysis of how and why noisy gradients and regularizers cannot be handled by the standard NTK analysis.
Note that the model f contains a scaling factor alpha, see Eq. (3.1) and (3.2). According to the definition of the tangent kernel, the scaling factor alpha should appear in the expression (line 127), but it is missing. This is important because a small scaling factor also scales the kernel matrix and can break Assumption 4.3, which is the basis of the main theorem. (A toy numeric illustration of this scaling effect is given below.)
In Line 130, it is not quite exact to claim "parameters stay close to initialization" without mentioning in what norm. When considering the NTK regime (as in Jacot et al.), i.e., sufficiently wide neural networks, it is important to distinguish between the infinity norm and the Euclidean norm when the number of parameters is large. As pointed out by Liu et al. (reference [1] below), parameters do NOT stay close to initialization in the sense of the Euclidean norm (see Remark 6.2 therein), although the infinity norm is small. This is because a large number of small componentwise changes adds up to a non-small total change.
The gradient noise and regularizer are controlled by a common coefficient \lambda. In principle, the gradient noise and the regularizer are independent and should have different coefficients. My question is: do the results still hold if the gradient noise and regularizer are controlled by different coefficients?
Reference: [1] Liu, C., Zhu, L., and Belkin, M. Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning. arXiv:2003.00307, 2020.
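To illustrate the point about the missing scaling factor, here is a toy numeric check. The parametrization f(x) = (alpha/m) * sum_j a_j tanh(w_j . x) is my own stand-in for Eq. (3.1)-(3.2), not necessarily the paper's exact model; the point is only that the minimal eigenvalue of the tangent-kernel Gram matrix scales with alpha^2, which is why the missing alpha matters for Assumption 4.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 200                        # samples, input dimension, hidden width
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))             # first-layer weights
a = rng.standard_normal(m)                  # second-layer weights

def tangent_gram(alpha):
    """Empirical tangent-kernel Gram matrix of f(x) = (alpha/m) * sum_j a_j * tanh(w_j . x)."""
    pre = X @ W.T                                        # pre-activations, shape (n, m)
    grad_a = (alpha / m) * np.tanh(pre)                  # df/da_j, shape (n, m)
    grad_w = (alpha / m) * (a * (1 - np.tanh(pre) ** 2))[:, :, None] * X[:, None, :]  # df/dw_j
    J = np.concatenate([grad_a, grad_w.reshape(n, -1)], axis=1)  # full Jacobian, shape (n, m + m*d)
    return J @ J.T

for alpha in [1.0, 0.1]:
    lam_min = np.linalg.eigvalsh(tangent_gram(alpha)).min()
    print(f"alpha = {alpha}: lambda_min of the tangent Gram matrix = {lam_min:.3e}")
# The minimal eigenvalue scales by alpha^2, which is why a lower-bound assumption
# such as Assumption 4.3 is sensitive to how alpha enters the kernel expression.
```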
Correctness: I have a little doubt about the correctness of the expression in line 127. As I mentioned in the weakness section, a scaling factor may be missing.
Clarity: As for Theorem 4.4, I suggest the authors provide some intuition on the following: >>How does the scaling factor alpha affect the results? Why does the scaling factor matter? From Eq. (4.1), it seems that for smaller alpha the theorem does not hold and convergence is not guaranteed in the noisy gradient and regularization scenarios. >>A discussion of the order of the scaling factor alpha is preferred.
Relation to Prior Work: Related works are clearly discussed in a separate section.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: This paper addresses one of the limitations of the NTK analysis in handling regularization terms (e.g., ell_2). Instead of studying gradient flow in the parameter space, the authors build upon mean-field analysis and recast parameter learning as distribution learning in the space of probability measures. An important ingredient is additive Gaussian noise in the gradient updates. This allows the authors to achieve linear convergence of the squared loss. They also establish a generalization bound.
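For concreteness, here is a minimal sketch of the kind of noisy, regularized particle update that this mean-field analysis covers; the parametrization, the weight-decay form, and the noise scale sqrt(2*eta*lambda) are my own assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 3, 500                         # samples, input dim, number of neurons (particles)
X = rng.standard_normal((n, d))
y = np.tanh(X @ rng.standard_normal(d))      # synthetic targets
theta = rng.standard_normal((m, d))          # neuron parameters, initialized from p_0
alpha, lam, eta = 1.0, 1e-3, 0.2             # output scaling, regularization strength, step size

def predict(theta):
    # mean-field parametrization: f(x) = (alpha / m) * sum_j tanh(theta_j . x)
    return (alpha / m) * np.tanh(X @ theta.T).sum(axis=1)

for t in range(1000):
    resid = predict(theta) - y                                    # residuals, shape (n,)
    act_grad = 1 - np.tanh(X @ theta.T) ** 2                      # activation derivatives, shape (n, m)
    # gradient of the empirical squared loss w.r.t. theta_j, rescaled by m (mean-field convention)
    grad_loss = (alpha / n) * (act_grad * resid[:, None]).T @ X   # shape (m, d)
    grad = grad_loss + lam * theta                                # add weight-decay term
    noise = np.sqrt(2 * eta * lam) * rng.standard_normal(theta.shape)  # Gaussian (Langevin) noise
    theta = theta - eta * grad + noise                            # noisy gradient step

print(f"final training MSE: {np.mean((predict(theta) - y) ** 2):.4f}")
```

The noise and the weight decay are tied to the same lambda here, which makes the stationary distribution a KL-regularized object; this is the mechanism that couples the regularization term with the loss dynamics, as noted in the Strengths.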
Strengths: The results are a non-trivial extension of the standard NTK analysis and are new. The proof of convergence (Theorem 4.5) is very similar in spirit to gradient flow on the loss: instead of controlling the movement of the weights, this paper controls the Wasserstein distance between p_t and p_0. One interesting aspect is that the dynamics of the regularization term KL(p || p_0) are coupled with the dynamics of the loss with respect to the measure p_t, which is not possible in the standard NTK analysis.
Weaknesses: One limitation of this mean-field analysis is that there is no finite bound on the number of neurons. Moreover, it seems difficult to extend this analysis to non-smooth activations, for example ReLU.
Correctness: I didn't check all the proofs in the Appendix, but the statements appear sensible and correct.
Clarity: Yes
Relation to Prior Work: Yes
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: The paper proposes a generalized analysis of the neural tangent kernel that can address the empirical practices of regularization and gradient noise during training.
Strengths: The paper proves generalization bounds for the noisy gradient descent algorithm with regularization. It addresses the concern that the weight distribution is no longer close to the initialization when regularization is used during optimization.
Weaknesses: One of the important points of NTK was over-parameterization. A lot of recent research shows that regularization is not critical for over-parameterized models. Even though the authors briefly discuss the necessity of weight decay in Sec 3, I am still not quite convinced by the paper's motivation for using regularization in the NTK regime. This might weaken the possible implications of the theoretical results in the paper.
Correctness: The theoretical results should be correct. Assumption 4.2 seems a little strong, as some popular activation functions are not three-times differentiable, e.g., ReLU-type activations. This limits the applicability of the theoretical results. The assumption in Theorem 4.5 on the existence of p_true also seems non-trivial, but it might make sense in practice since the training error usually converges to 0. Maybe the authors could discuss this assumption a little more as well.
Clarity: The paper makes its main point clearly overall. Some of the arguments in Section 3 are a little too sketchy, and the authors could provide more details. E.g.: * The paper could make the derivation of Eqn (3.4) more explicit. * The connection between Eqn (3.4) and (3.5) is also non-trivial. The authors do provide some references, but it would be good to give a brief recap to make the paper more readable.
Relation to Prior Work: The paper discusses a good collection of related works on the neural tangent kernel and related theoretical analyses. There are a couple of points I would also like the authors to discuss: * The noisy gradient seems to be a rough approximation of the stochastic gradient descent algorithm commonly used in practice. There is some existing work analyzing generalization bounds for SGD directly. Could the authors discuss this a little more? * The authors claim that weight decay regularization is still meaningful in the neural tangent kernel regime. But in the literature there is some discussion that regularization might not be necessary for over-parameterized models. I would like the authors to elaborate on this point, as it is critical to the main point of this paper.
Reproducibility: Yes
Additional Feedback: ********edit************** I updated my score since the paper is well-written and the point is made clearly.