Authors

Stefano Sarao Mannelli, Eric Vanden-Eijnden, Lenka Zdeborová

Abstract

<p>We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks with a quadratic activation function in the overparametrized regime where the layer width m is larger than the input dimension d. </p> <p>We consider a teacher-student scenario where the teacher has the same structure as the student, with a hidden layer of smaller width m* &le; m. </p> <p>We describe how the empirical loss landscape is affected by the number n of data samples and the width m* of the teacher network. In particular, we determine how the probability that the empirical loss has no spurious minima depends on n, d, and m*, thereby establishing conditions under which the neural network can in principle recover the teacher. </p> <p>We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e., it enables recovery in practice.</p> <p>Finally, we characterize the time-convergence rate of gradient descent in the limit of a large number of samples.</p> <p>These results are confirmed by numerical experiments.</p>
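<p>To make the teacher-student setup concrete, the following is a minimal NumPy sketch, not the authors' code. Several choices in it are assumptions that the abstract does not specify: Gaussian inputs and weights, the 1/(m d) normalization of the network output, the squared-error empirical loss, and the step size, sizes, and function names (<code>quadratic_net</code>, <code>empirical_loss</code>, <code>loss_gradient</code>, <code>train</code>), which are hypothetical.</p>

```python
import numpy as np


def quadratic_net(W, X):
    """One-hidden-layer network with quadratic activation.

    W: (m, d) hidden-layer weights, X: (n, d) inputs.
    Returns (n,) outputs: the mean over hidden units of (w_i . x)^2 / d.
    The 1/(m d) normalization is an assumption of this sketch.
    """
    d = X.shape[1]
    pre = X @ W.T / np.sqrt(d)          # pre-activations, shape (n, m)
    return np.mean(pre ** 2, axis=1)


def empirical_loss(W, X, y):
    """Half the mean squared error over the n samples (assumed loss)."""
    residual = quadratic_net(W, X) - y
    return 0.5 * np.mean(residual ** 2)


def loss_gradient(W, X, y):
    """Analytic gradient of empirical_loss with respect to W, shape (m, d)."""
    n, d = X.shape
    m = W.shape[0]
    pre = X @ W.T / np.sqrt(d)                  # (n, m)
    residual = np.mean(pre ** 2, axis=1) - y    # (n,)
    # d f(x_k) / d w_i = (2 / m) * pre[k, i] * x_k / sqrt(d)
    return (2.0 / (m * n * np.sqrt(d))) * (pre * residual[:, None]).T @ X


def train(d=8, m=10, m_star=2, n=400, lr=0.5, steps=2000, seed=0):
    """Full-batch gradient descent on a student of width m >= d fitted to
    labels from a teacher of width m_star <= m. All defaults are
    illustrative assumptions, not values taken from the paper."""
    rng = np.random.default_rng(seed)
    W_star = rng.standard_normal((m_star, d))   # teacher weights
    X = rng.standard_normal((n, d))             # training inputs
    y = quadratic_net(W_star, X)                # noiseless teacher labels
    W = rng.standard_normal((m, d))             # student initialization

    losses = []
    for _ in range(steps):
        losses.append(empirical_loss(W, X, y))
        W -= lr * loss_gradient(W, X, y)
    losses.append(empirical_loss(W, X, y))

    # Estimate the generalization error on fresh samples from the same law.
    X_test = rng.standard_normal((5000, d))
    gen_error = empirical_loss(W, X_test, quadratic_net(W_star, X_test))
    return losses, gen_error


if __name__ == "__main__":
    losses, gen_error = train()
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
    print("generalization error:", gen_error)
```

<p>Calling <code>train()</code> with different values of <code>n</code>, <code>d</code>, and <code>m_star</code> gives a rough way to probe how the sample size and the teacher width affect whether gradient descent recovers the teacher in this toy setting.</p>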