Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, Volkan Cevher
In this paper, we analyze the trajectories of stochastic gradient descent (SGD) with the aim of understanding their convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, we prove that the algorithm's rate of convergence to local minimizers with a positive-definite Hessian is $O(1/n^p)$ if the method is run with a $\Theta(1/n^p)$ step-size. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to significant performance gains; we demonstrate this heuristic using ResNet architectures on CIFAR. Finally, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered.
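To make the step-size policy concrete, the following is a minimal toy sketch (not the authors' experimental setup) of SGD run with a $\Theta(1/n^p)$ step-size, $\gamma_n = \gamma_0 / n^p$, on a simple non-convex objective $f(x, y) = (x^2 - 1)^2 + y^2$, which has a strict saddle point at $(0, 0)$ and local minimizers with positive-definite Hessian at $(\pm 1, 0)$. The objective, noise level, and hyperparameters are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only (not the paper's code): SGD with a Theta(1/n^p)
# step-size on the toy non-convex objective f(x, y) = (x^2 - 1)^2 + y^2,
# which has a strict saddle at (0, 0) and local minimizers at (+/-1, 0).
# The noise level and hyperparameters below are assumptions chosen for illustration.
import numpy as np


def grad_f(z):
    """Exact gradient of f(x, y) = (x^2 - 1)^2 + y^2."""
    x, y = z
    return np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])


def sgd(z0, p=1.0, gamma0=0.1, noise_std=0.5, n_iters=100_000, seed=0):
    """Run SGD with step-size gamma_n = gamma0 / n**p and additive gradient noise."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for n in range(1, n_iters + 1):
        noisy_grad = grad_f(z) + noise_std * rng.standard_normal(2)
        z -= (gamma0 / n**p) * noisy_grad
    return z


if __name__ == "__main__":
    # Even when started at the strict saddle (0, 0), the gradient noise pushes
    # the iterates away, and the vanishing step-size lets them settle near one
    # of the local minimizers (+/-1, 0).
    z_final = sgd(z0=(0.0, 0.0), p=1.0)
    print("final iterate:", z_final)
```

In practice, the same idea corresponds to the cool-down heuristic mentioned in the abstract: after an initial phase with a larger (e.g., constant) step-size, switching to a decaying $\gamma_n = \gamma_0 / n^p$ schedule for the final epochs.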