Part of Advances in Neural Information Processing Systems 31 (NeurIPS 2018)
Zeyuan Allen-Zhu
Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives f(x). However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when f(x) is convex. If f(x) is convex, to find a point with gradient norm at most ε, we design an algorithm SGD3 with a near-optimal rate Õ(ε^{-2}), improving upon the best known rate O(ε^{-8/3}). If f(x) is nonconvex, to find an ε-approximate local minimum, we design an algorithm SGD5 with rate Õ(ε^{-3.5}), whereas previous SGD variants only achieve Õ(ε^{-4}). This is no slower than the best known stochastic version of Newton's method in all parameter regimes.
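For concreteness, the two convergence criteria contrasted in the abstract (small objective error versus small gradient norm) can be written out as below. This is a minimal restatement under the standard stochastic first-order oracle model, not taken verbatim from the paper; the symbols x_out, x*, g, and the bounded-variance assumption are ours.

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Two convergence criteria for stochastic first-order methods on f(x):
% (i) small objective error, (ii) small gradient norm at the output point.
\[
  \underbrace{\mathbb{E}\,[f(x_{\mathrm{out}})] - f(x^{*}) \le \varepsilon}_{\text{objective-value criterion}}
  \qquad\text{vs.}\qquad
  \underbrace{\mathbb{E}\,\bigl\|\nabla f(x_{\mathrm{out}})\bigr\| \le \varepsilon}_{\text{gradient-norm criterion}}
\]
% Rates such as $\tilde{O}(\varepsilon^{-2})$ count the number of stochastic
% gradient queries $g(x)$ satisfying $\mathbb{E}[g(x)] = \nabla f(x)$ (with,
% as is standard, bounded variance) needed to meet the chosen criterion.
\end{document}

Under this reading, the abstract's claim is that plain SGD is rate-optimal for the first criterion but not for the second, and SGD3 closes that gap up to logarithmic factors.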