#### Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)

#### Authors

*Sebastian Goldt, Madhu Advani, Andrew M. Saxe, Florent Krzakala, Lenka Zdeborová*

#### Abstract

Deep neural networks achieve stellar generalisation even when they have enough
parameters to easily fit all their training data. We study this phenomenon by
analysing the dynamics and the performance of over-parameterised two-layer
neural networks in the teacher-student setup, where one network, the student,
is trained on data generated by another network, called the teacher. We show
how the dynamics of stochastic gradient descent (SGD) is captured by a set of
differential equations and prove that this description is asymptotically exact
in the limit of large input dimension. Using this framework, we calculate the final
generalisation error of student networks that have more parameters than their
teachers. We find that the final generalisation error of the student increases
with network size when training only the first layer, but stays constant or
even decreases with size when training both layers. We show that these
different behaviours have their root in the different solutions SGD finds for
different activation functions. Our results indicate that achieving good
generalisation in neural networks goes beyond the properties of SGD alone and
depends on the interplay of at least the algorithm, the model architecture,
and the data set.