NIPS Proceedingsβ

Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization

Part of: Advances in Neural Information Processing Systems 31 (NIPS 2018) pre-proceedings

[PDF] [BibTeX] [Supplemental]

Authors

Conference Event Type: Poster

Abstract

We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with conventional stochastic variance reduced gradient (SVRG) algorithm that uses two reference points to construct a semi-stochastic gradient with diminishing variance in each epoch, our algorithm uses $K+1$ nested reference points to build an semi-stochastic gradient to further reduce its variance in each epoch. For smooth functions, the proposed algorithm converges to an approximate first order stationary point (i.e., $\|\nabla F(\xb)\|_2\leq \epsilon$) within $\tO(n\land \epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$\footnote{$\tO(\cdot)$ hides the logarithmic factors} number of stochastic gradient evaluations, where $n$ is the number of component functions, and $\epsilon$ is the optimization error. This improves the best known gradient complexity of SVRG $O(n+n^{2/3}\epsilon^{-2})$ and the best gradient complexity of SCSG $O(\epsilon^{-5/3}\land n^{2/3}\epsilon^{-2})$. For gradient dominated functions, our algorithm achieves $\tO(n\land \tau\epsilon^{-1}+\tau\cdot (n^{1/2}\land (\tau\epsilon^{-1})^{1/2})$ gradient complexity, which again beats the existing best gradient complexity $\tO(n\land \tau\epsilon^{-1}+\tau\cdot (n^{1/2}\land (\tau\epsilon^{-1})^{2/3})$ achieved by SCSG. Thorough experimental results on different nonconvex optimization problems back up our theory.