NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
Recent theoretical work on analyzing Deep Neural Network training has focused on the evolutions of the over-parametrized neural network prediction errors under Gradient Descent. These evolutions can be neatly described in a matrix form. This paper found that the matrices can approximate an integral operator this is determined by the feature vector distribution only. Consequently, Gradient Method can be viewed as approximately apply the powers of this integral operator on the underlying function that generates the labels. Moreover, this paper also derives a new linear convergence rate on the new assumption of the underlying function. This analysis is derived from the integral operator approximation analysis above. This paper is well-organized and well-written. The novelty, clarity, and originality of this paper are quite great. This paper provides a new tool to understand why the basic Gradient Descent methods work very well on Deep Learing models and showed new convergence rates on the new assumptions of the underlying function and feature vector generation, which might be fit to situations in the real world.
Reviewer 2
One of the biggest reasons that I am not too thrilled about the submission is that the two-layer fully connected neural network model under consideration is off standard; the weights of the second layer is fixed to binary (i.e. +1, -1 with uniform scaling) and are not being updated via the gradient descent procedures. Correct me if I am wrong; this assumption is neither commonly adopted in experiments nor identical to the comparable theoretical works such as [DLL+18] or [ADH+19]. If the authors are convinced of the significance or general applicability of the suggested framework, they should have taken more care communicating those to the audience. A relatively minor issue is about the significance of the c_1 term in Theorem 3. The authors nicely demonstrate that the constant c_1 satisfying (13) can be controlled via a sum whose dominating term is inversely proportional to (lambda_{m_l} - lambda_{m_l + 1}), which is later formalized to Theorem 4. (By the way, I had trouble locating a formal definition of eps(f^*,l). ) The question is, can we guarantee that the value is large enough to ensure the ignorability of the term c_1? I am particularly worried about this, as the authors have already mentioned that the spectrum of the random matrix concentrates as n grows, in line 145. Beside these issues, technical clarity and validity of the submission seems satisfactory, and the proof ideas would be of at least mild interest to the theoreticians concerned with deep learning. Plus, the authors do a nice job of criticizing the previous results. +) please do the grammar check, to remove the typos including: line 7, 'approximately apply'; line 32, 'prediction errors evolution'; line 32, 'neuronscience'; line 48, 'approximately apply' (stopped counting around here...).
Reviewer 3
This paper considers gradient descent as an operator on functions, and thereby obtains convergence guarantees in terms of the eigenvalues of that operator. A similar perspective, in a somewhat simpler setting, was studied in Vempala--Wilmes (2017). Convergence guarantees for empirical risk minimization where also obtained in the overparameterized setting in Arora et. al. (2019), among others. The authors observe that some previous convergence guarantees give rates that tend to zero as the number of examples increases. To avoid this problem, the authors introduce the assumption that the target function is well approximated by its projection to a small number of eigenspaces of the gradient operator. For example, the function could be a harmonic polynomial over the sphere with respect to the uniform distribution. In this case, the authors obtain linear convergence rate as long as the number of hidden units is at least quadratic in the number of examples. Overall, this is a pleasant paper, mathematically. However, its improvements over the existing literature are not dramatic. --- I thank the authors for their very careful response. I continue to believe this paper is worth accepting to NeurIPS and have slightly improved my score.