NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 4416 Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks

### Reviewer 1

After rebuttal: I have carefully read the comments from the other reviewers and the authors' feedback. My main concern was the generalization ability of NGD, but the experiments in the feedback are a bit confusing to me, because GD does not seem to achieve zero training loss while NGD converges to 0 very quickly in the MNIST regression. I would suggest the authors provide more details about that experimental setting, e.g., how the hyperparameters were selected. Thus, I would like to keep my score unchanged.

----------------------------------------------------

The authors analyze the dynamics of Natural Gradient Descent (NGD) for a two-layer over-parameterized ReLU neural network with quadratic loss. The proof framework follows the recent line of work on over-parameterization, e.g., the papers by Du et al., Li and Liang, and Allen-Zhu et al., the core of which is the Gram matrix. The main intuition of this framework is that if the Gram matrix is positive definite, then the gradient is lower bounded by the minimum eigenvalue of the Gram matrix times the value of the objective (loss) function, i.e., the gradient cannot be small unless the objective is small, which yields a linear convergence rate. Moreover, because of this linear convergence rate, the movement of the weights is small, which keeps the Gram matrix positive definite throughout training. These two facts correspond, in some sense, to the two conditions stated in this paper, and the authors prove that NGD on a two-layer over-parameterized ReLU network satisfies these conditions with high probability. The interesting aspect of this result is that the second-order optimization algorithm achieves a better convergence rate and tolerates a larger step size. The authors also provide an analysis of K-FAC in a very similar setting, still using a similar proof framework.
The convergence rate and the requirement on the network width are a bit worse than for NGD, but acceptable. The authors analyze the generalization performance of NGD as well. In several places the authors simplify the setting, e.g., by fixing all the weights in the second layer. This is acceptable because, even without the simplification, the movement of the second layer would still be small and would not significantly influence the result. The paper is generally well written and very easy to read. The authors also bound \lambda_0 in a very simple setting, which is not done in the paper of Du et al., but it would be better to analyze it more carefully. Despite the advantages mentioned above, I have some concerns about this paper, listed below:

1. For the generalization result, obtaining a similar generalization bound does not necessarily mean that the two algorithms have a similar ability to generalize. Thus, to make the claim about generalization more convincing, I would suggest the authors run experiments on real-world datasets and compare the generalization performance.
2. The authors give a lower bound for \lambda_0 but do not lower bound \lambda_S, so we do not know the difference between the width requirements of NGD and K-FAC. The authors could say something about a lower bound on \lambda_S, or run experiments to convince the readers that the requirement for K-FAC is not much worse.
3. The authors simplify some settings, such as fixing the second-layer weights, but I think they should provide some intuition about why the argument works in the general setting instead of simply saying "it's easy to ...".
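The Gram-matrix intuition described above can be checked numerically. A minimal sketch (names and sizes are illustrative, not the paper's): for a two-layer ReLU network at random initialization, the empirical Gram matrix G_ij = (1/m) \sum_r <x_i, x_j> 1{w_r^T x_i >= 0} 1{w_r^T x_j >= 0} is positive semidefinite by construction, and for distinct inputs its minimum eigenvalue is typically strictly positive, which is what drives the linear convergence rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 10_000  # samples, input dim, hidden width (illustrative)

# Unit-norm inputs, as is standard in this line of work
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Random first-layer weights w_r ~ N(0, I)
W = rng.standard_normal((m, d))

# Empirical Gram matrix:
# G_ij = (1/m) * sum_r <x_i, x_j> * 1{w_r.x_i >= 0} * 1{w_r.x_j >= 0}
A = (W @ X.T >= 0).astype(float)   # (m, n) ReLU activation patterns
G = (X @ X.T) * (A.T @ A) / m      # Hadamard product of two PSD matrices

lam_min = np.linalg.eigvalsh(G).min()
print("min eigenvalue of G:", lam_min)
```

For random (hence pairwise non-parallel) inputs, `lam_min` comes out positive, matching the role \lambda_0 plays in the analysis.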

### Reviewer 2

This paper studies the convergence of natural gradient descent for training over-parameterized neural networks. Specifically, the authors prove that NGD achieves a faster convergence rate and similar generalization performance compared with ordinary gradient descent. Moreover, the paper also proves that K-FAC converges to a global minimum at a linear rate under certain assumptions. The paper is well written and theoretically sound. Its limitations mainly lie in the significance and the proof novelty given the existing literature. More specifically, I have the following comments:

1. The studied optimization algorithm is of limited interest, especially for training deep neural networks. Indeed, computing the natural gradient is extremely expensive for an over-parameterized neural network, since it involves operations on high-dimensional matrices (matrix inverses and matrix products) in each iteration. Besides, the authors should clarify why K-FAC reduces the computation and memory costs compared with the exact calculation of the natural gradient, since it also involves inverting high-dimensional matrices.
2. The proof technique is quite similar to Du et al., 2018b for the convergence analysis and to Arora et al., 2019b for the generalization analysis. Therefore the proof novelty is not significant, and the theoretical results are somewhat incremental.
3. In Remark 3, the authors should clearly state the number of iterations required to achieve \epsilon accuracy, and compare it with those in Du et al., 2018b and Oymak and Soltanolkotabi, 2019. In addition, the authors should rigorously explain why X^\top X typically has a smaller condition number than the Gram matrix G.
4. In Theorem 6, the initialization of the model weights depends on the target accuracy \epsilon; in practice, however, the initialization scheme is fixed regardless of the target accuracy.
5. The statement of the third contribution summarized in the introduction is not accurate: using a larger step size is not directly related to a faster convergence rate. In Wu et al., 2019, the authors showed that a variant of adaptive gradient descent can also use an O(1) step size yet attain the same convergence rate as gradient descent.
6. The authors should run some simple experiments to verify the theoretical results, including a comparison with GD in terms of both iteration count and running time, as done in Bernacchia et al., 2018.
7. In the proof of Theorem 4, condition 2 is verified in (55), which requires that (54) holds. However, to prove (54), condition 2 is necessary. This implies that when verifying condition 2, the authors implicitly assume that it holds. The authors should clarify this.
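Regarding the cost question in comment 1, the standard argument for K-FAC's savings can be made concrete: K-FAC approximates a layer's Fisher block by a Kronecker product F ≈ A ⊗ S of two small factors, and (A ⊗ S)^{-1} vec(g) = vec(A^{-1} g S^{-1}), so one only ever inverts the small factors rather than the full p × p block. A toy sketch with synthetic matrices (all names and sizes are illustrative; A and S here stand in for the input-activation and output-gradient covariances):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 40, 25
p = d_in * d_out  # weights in one layer; the exact Fisher block is p x p

# Synthetic, well-conditioned Kronecker factors (damped sample covariances)
Z = rng.standard_normal((200, d_in))
Gd = rng.standard_normal((200, d_out))
A = Z.T @ Z / 200 + 0.1 * np.eye(d_in)   # "input covariance" factor
S = Gd.T @ Gd / 200 + 0.1 * np.eye(d_out)  # "output-gradient" factor

grad = rng.standard_normal((d_in, d_out))

# Exact natural gradient on F = A kron S: invert/solve a p x p system, O(p^3)
F = np.kron(A, S)
ng_exact = np.linalg.solve(F, grad.reshape(-1)).reshape(d_in, d_out)

# K-FAC shortcut: (A kron S)^{-1} vec(g) = vec(A^{-1} g S^{-1}),
# which needs only d_in x d_in and d_out x d_out solves
ng_kfac = np.linalg.solve(A, grad) @ np.linalg.inv(S)

print("max difference:", np.abs(ng_exact - ng_kfac).max())
```

The two results agree, while the per-step cost drops from O((d_in · d_out)^3) to O(d_in^3 + d_out^3); whether this fully answers the reviewer's objection for very wide layers is a separate question, since the factors themselves grow with the layer dimensions.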

### Reviewer 3

Originality: The contribution is original, with related work clearly referenced.

Quality: The techniques used in the proofs seem sound. It would be interesting to have values of several quantities for actual networks (see the "improvement" part).

Clarity: The setup and theorems are clearly presented. Everything is well explained, and the proofs are in the appendix.

Significance: The natural gradient is a promising approach to optimizing and analyzing neural networks. This theoretical contribution for 2-layer ReLU networks is an interesting step toward a more general theory.