NeurIPS 2020

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Review 1

Summary and Contributions: This paper proves that with probability 1, SGD will converge to a point that is not strict saddle points. It also provides a local non-asymptotic convergence if initial point is located in a local neighborhood of a strongly convex local minimizer.

Strengths: The probability 1 result is technically new.

Weaknesses: There are a lot similar results in slightly different regime, which makes this work looks incremental. (a) [17, 18] already showed that with probability, GD will converge to a point that is not strict saddle point, the only difference of this work is that this work adds the noise in gradients. (b) [27] already proves morally the same result only under a bit restrictive Morse assumption. In the case of GD, this Morse assumption can be resolved by using a stronger stable manifold theorem in "Michael Shub. Global stability of dynamical systems.", I suspect a similar combination might go through here? (c) there are already several non-asymptotic results with sharp rate for SGD in avoiding saddle point [14 and two additional works in "Relation to prior work" section]. Usually one view asymptotic results (this paper) weaker than non-asymptotic results (earlier papers), it is also not clear from this paper if one can obtain probability 1 result by modifying the existing high probability result with Borel Cantelli lemma and a bit extra work.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: There are two recent SGD analyses that are relevant to this paper: "On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points"; "Sharp analysis for nonconvex SGD escaping from saddle points".

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper analyzes the orbit of stochastic gradient descent and the main result is that SGD avoids saddle point and the interpolated continuous trajectory converges almost surely. The authors give a detailed literature review and comparison to previous work. The technique used in this paper is the combination of classic results from dynamical system and probability theory. This approach is rare in machine learning literature. Even considering the amount of research in this area (convergence and saddle avoidance for GD), the result of this paper is of some interest.

Strengths: Saddle point avoidance results are important in non-convex optimizations and significant amount of research(asymptotic avoidance and non-asymptotic rate of escaping) is focusing on this area, the results is relevant to machine learning community.

Weaknesses: 1. The selling point of this paper. The authors emphasize on the convergence of SGD which is presented in theorem 2 and corollary 2. But I think this is somehow captured by the literature like “Stochastic gradient on Riemannian manifold, Bonnabel 2013”, where the almost surely convergence is established even for Riemannian manifold? Convergence of SGD in Euclidean space might be derived from this paper. From this perspective, convergence result lacks the novelty as it is stated. 2. The novelty of techniques. It seems that the techniques and results of this paper are very incremental and almost the same with that of Pemantle's. 3. The rate of convergence. In theorem 4, I cannot see a practical way to choose parameters to have a convergence guarantee as Jin et al., so this is still an asymptotic result.

Correctness: Correct

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The authors have reviewed most well known works in the literature.

Reproducibility: Yes

Additional Feedback: Question to be addressed: Please compare the assumptions listed in this paper to the assumptions listed by Pemantle [27] (iii and iv, theorem 1) to give a clearer view for the avoidance results.

Review 3

Summary and Contributions: This paper considers the convergence of stochastic gradient descent (SGD) for nonconvex problems. The main contribution includes proving that (1) SGD converges to a critical point (subsequence convergence) almost surely (2) SGD also escapes strict saddles almost surely, (3) local convergence to a strict minimizer once the iterates get close to it.

Strengths: 1. It is a theoretical paper, providing new convergence results for SGD when used for solving nonconvex problem. 2. Escaping strict saddle points of SGD without the help of additional noise seems novel and can provide theoretical guarantee for practical use of SGD. 3. The convergence rate to local minimum is provided, but suffers few weaknesses which are described below.

Weaknesses: 1. Compared to Ge et al. [10] which provides the rate of escaping saddle points of SGD, there is no explicit rate of escaping saddle points provided in this work. 2. Theorem 4 provides the rate of convergence to local minimum, but (1) it only works for strict local minimizers whose Hessian are positive definite, (2) it only provides the rate once the iterate is close to the strict local minimizer, but there is no guarantee if this happens, or how fast it happens. 3. The following work is very related and is missing: Daneshmand, Hadi, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. "Escaping Saddles with Stochastic Gradients." In International Conference on Machine Learning, pp. 1155-1164. 2018.

Correctness: They are correct, but I didn't carefully check the proofs.

Clarity: The paper is well written

Relation to Prior Work: The following work is very related and is missing: Daneshmand, Hadi, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. "Escaping Saddles with Stochastic Gradients." In International Conference on Machine Learning, pp. 1155-1164. 2018.

Reproducibility: Yes

Additional Feedback: _______________________________________After rebuttal_______________________ I appreciate the effort the authors have made to address my comment. In Theorem 4, the dependence of m (or other parameters) on $\delta$ should be highlighted since in the current version, one can simply set $\delta = 0$, which gives probability of 1.

Review 4

Summary and Contributions: This paper shows that under mild conditions, SGD converges to a critical point of the general non-convex functions and avoids all strict saddle points, with probability 1. It also presents a convergence analysis of the SGD once it enters the neighborhood of a local minimum.

Strengths: The paper generalizes the known convergence results for the deterministic GD and the SGD. The theoretical analysis is novel and elegant. It combines a series of results in the theory of dynamical systems and shows that they can bridge the continuous GD trajectory to the SGD trajectory. I like the APT analysis. By showing a relative weak property that the SGD is always bounded, it automatically gives the strong conclusion that SGD converges to the same set as the continuous GD.

Weaknesses: - The rate of convergence analysis would be more interesting if it bounds the iteration complexity before hitting the neighborhood U1. - What's the convergence rate's dependence on the dimension d?

Correctness: To the best of my knowledge, they are correct.

Clarity: The paper is very well written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Overall, this is an interesting paper. The paper presents a set of useful tools and may provide new insights to the study of non-convex optimization theory.