Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper theoretically justifies the phenomenon that deep learning generalizes better if a large learning rate is used in the early stage of training. To do so, it considers a rather simple problem setting and shows that a 2-layer neural network generalizes better when it is trained with a large learning rate first, followed by an annealed learning rate, than when it is trained with a small learning rate. This claim is supported by numerical experiments on CIFAR-10. This is an interesting paper that gives rigorous insight into a well-known phenomenon. It would open up a new line of research on this topic that many researchers would follow.
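For readers unfamiliar with the training regime under discussion, the following is a minimal sketch of a "large learning rate first, then annealed" schedule. It is an illustration only, not the schedule used in the paper: the function name `step_lr` and the values of `base_lr`, `anneal_epoch`, and `factor` are hypothetical choices made for this example.

```python
def step_lr(epoch, base_lr=0.1, anneal_epoch=30, factor=0.1):
    """Piecewise-constant learning rate schedule.

    All default values are hypothetical and chosen only for illustration.
    Returns base_lr before anneal_epoch, and base_lr * factor afterwards.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < anneal_epoch:
        return base_lr           # early stage: large learning rate
    return base_lr * factor      # later stage: annealed (smaller) learning rate


# Learning rate for each of 60 epochs under the assumed defaults.
schedule = [step_lr(epoch) for epoch in range(60)]
```

The small-learning-rate baseline that the review contrasts this with would correspond to using the annealed value from the first epoch onward, for example by setting `anneal_epoch=0`.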