Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
After rebuttal: I have carefully read the comments of other reviewers and the feedback from the authors. The authors provide insights about generalization to high dimensional cases promise to include the experimental results for high dimensions, which address my main concern. However, the experiment results are not provided in the rebuttal. Thus, I would like to keep my score unchanged. -------------------------------------------- Summary: This paper analyzed the dynamics of gradient descent algorithm training an over-parametrized one-hidden-layer ReLU neural network on an interpolation task in three settings and compare them. In this paper, the authors are using a neural network with one hidden layer and one output layer. The activation function is ReLU, and each hidden unit also has a bias term that can be trained. The input is 1-dimensional, i.e., a scalar, so the task is equivalent to interpolation. The loss function is square loss, and the authors use gradient descent to learn the network weights. The authors use an over-parametrization scheme where the number of nodes in the hidden layer tends to infinity, which means the network is infinitely wide. There are two main learning schemes mentioned in this paper: Kernel and adaptive. The Kernel scheme is that the weights of the neural network stay very close to the initialization and the optimization process works in a linear approximation in this neighborhood. The adaptive scheme is using mean-field setting and the evolving of the weights follows a PDE. In the adaptive scheme, since ReLU function is positively homogeneous, the weights have some redundant terms. Thus, the authors divide the learning scheme into the reduced parameters version and full version. The authors identified a condition \delta, which only depends on initialization, that decides the style of the learning process. If \delta is very large, we are in adaptive scheme, resulting in adaptive linear splines, while when \delta is very small we are in the Kernel scheme, resulting in cubic splines. Otherwise, we are somewhere in the middle of these two schemes. The authors also characterize the dynamics in these two schemes, and their experimental results validate their theoretical analysis. My comments for this paper: 1. The experiments in this paper focused on the exact setting as the theoretical analysis, but the theoretical analysis doesn't provide much intuition about whether these kinds of phenomena can generalize to more complicated cases like high-dimension case. Thus, it would be better if the authors could do some experiments about general cases and whether some similar dynamics preserves. 2. For the Kernel methods, some recent results showed that even if the hidden layer is only polynomially wide, the weights will still stay in a small neighborhood with high probability. I wonder whether in this case the dynamics are the same as the result in this paper.
The authors study the inductive bias of relu networks in dimension 1. They distinguish between the active and kernel regime, which is a very timely question in deep learning. This paper gives a very satisfactory and precise answer for the kernel regime. Theorem 4 shows that in 1 dimension the kernel method interpolates according to a cubic spline. Unfortunately , the result for the adaptive regime are not clear. Theorem 3 simply states some dynamics and then remarks below that if c^2>a^2+b^2 then the dynamics follow the reduced dynamics. However I do not see that they formally proved it follows an adaptive linear splines. I have read the responses and the authors seem to address some of my comments. I'll keep my score at "marginally above"
Although the result in this paper only applies to one-hidden-layer networks with one dimension input, it is an extremely important direction of understanding the difference between neural network learning and kernel learning. Especially under the current hype about neural tangent kernel (NTK). The main concern for me about this work is the paper is written in a non-rigorous way, which makes it hard to understand the technical contribution of the work. For example, the main claim "in adaptive learning regime, the learning favors linear splines" (as in abstract) is never formally proved anywhere. The only Lemma in this adaptive regime is Lemma 1, however, the Lemma is more an "intuitive statement" than a formal theorem: the \rho in the Lemma is the \rho at some iteration, and it is not clear what is the picture about the full dynamic when all the iterations are taking into consideration. (for example, a point can be attractor in one iteration and repulsor for another). It is possible that in the long ran all these attractor/repulsors get canceled and the network does not adapt to the data eventually. The smaller concern is that the paper is missing some proper citations about previous works in learning under over-parameterization, mostly the kernel regime: Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data A Convergence Theory for Deep Learning via Over-Parameterization Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers Especially the second work actually proves that the neural network learns the solution of the best kernel in the "kernel learning regime", even for deep networks. The authors cited some follow-up papers of these works but the original ones are not properly addressed. After Rebuttal: The authors have promised to improve their paper in the next revision. Their promises make sense and addressed my concerns. I am willing to higher my score.