NeurIPS 2020

Escaping Saddle-Point Faster under Interpolation-like Conditions


Review 1

Summary and Contributions: This work studies the convergence of two popular methods, SGD and cubic regularization, under a special condition called the strong growth condition (SGC). The authors show that, for nonconvex optimization, the SGC helps SGD and cubic regularization find an approximate local minimizer faster. They also show that this acceleration carries over to the setting where only a zeroth-order oracle is accessible.
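For concreteness, the strong growth condition referred to here is typically stated as follows (this is the standard phrasing from the interpolation literature, e.g., [VBS18], and is not quoted from the submission): for an objective F(x) = E_i[f_i(x)], the SGC with constant \rho requires E_i[\|\nabla f_i(x)\|^2] \le \rho \|\nabla F(x)\|^2 for all x.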

Strengths: - The organization of this work is clear. - The theoretical results in this paper are sound.

Weaknesses:
- The importance of the SGC remains unclear. In Line 129, the authors claim that the SGC is satisfied in some practical settings such as the training of deep neural networks, and that the SGC should therefore be regarded as an interesting special setting for nonconvex optimization. However, recent work [1,2] showed that the training of deep neural networks can be regarded as a special case of convex optimization in the neural tangent kernel (NTK) regime, which is a stronger condition than the SGC. The authors may therefore want to clarify the importance of the SGC by providing more examples from machine learning.
- The technical contribution of this work seems incremental. As the authors note, [VBS18] was the first to study the SGC in the nonconvex setting and showed that SGD needs O(1/\epsilon^2) gradient complexity to find first-order stationary points. Meanwhile, [AZL18] proposed a generic framework that turns any algorithm for finding first-order stationary points into an algorithm for finding approximate local minimizers without hurting the convergence rate. Is it true, then, that the O(1/\epsilon^2) rate for SGD to find an approximate local minimizer established in this paper can be deduced directly by combining the existing results of [VBS18] and [AZL18]? The authors may want to discuss these issues to highlight their technical contribution.
- The authors may want to add some discussion of how the SGC affects variance-reduction-based algorithms such as [3,4,5,6].

[1] Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. "A convergence theory for deep learning via over-parameterization." International Conference on Machine Learning, 2019.
[2] Zou, Difan, et al. "Gradient descent optimizes over-parameterized deep ReLU networks." Machine Learning 109.3 (2020): 467-492.
[3] Johnson, Rie, and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction." Advances in Neural Information Processing Systems, 2013.
[4] Zhou, Dongruo, Pan Xu, and Quanquan Gu. "Stochastic nested variance reduction for nonconvex optimization." Advances in Neural Information Processing Systems, 2018.
[5] Wang, Zhe, et al. "SpiderBoost and momentum: Faster variance reduction algorithms." Advances in Neural Information Processing Systems, 2019.
[6] Nguyen, Lam M., et al. "Finite-sum smooth optimization with SARAH." arXiv preprint arXiv:1901.07648 (2019).

Correctness: The claims and methods proposed by the authors are correct.

Clarity: This paper is written well and easy to follow.

Relation to Prior Work: This work should include more discussion of previous work to highlight the differences.

Reproducibility: No

Additional Feedback: ########################### I am OK with the authors' response.


Review 2

Summary and Contributions: This work analyzes the oracle complexity of the perturbed stochastic gradient descent (PSGD) algorithm and the stochastic cubic-regularized Newton's method (SCRN) for escaping saddle points in nonconvex stochastic optimization. The authors show that, under interpolation-like conditions, these two algorithms achieve improved rates for escaping saddle points.
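For readers less familiar with the first of these methods, below is a minimal sketch of a generic perturbed-SGD loop of the kind analyzed in this line of work; the parameter names and the perturbation schedule are hypothetical illustrations, not the submission's Algorithm.

```python
import numpy as np

def perturbed_sgd(grad_est, x0, eta=1e-2, radius=1e-3, g_thresh=1e-3,
                  perturb_interval=100, n_iters=10_000, rng=None):
    """Generic perturbed-SGD sketch (hypothetical parameters, not the
    submission's Algorithm): run SGD and, whenever the estimated gradient
    is small and no recent perturbation occurred, add noise drawn
    uniformly from a small ball to help escape strict saddle points."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -perturb_interval
    for t in range(n_iters):
        g = grad_est(x)  # stochastic gradient oracle supplied by the user
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= perturb_interval:
            # sample a perturbation uniformly from the ball of radius `radius`
            xi = rng.normal(size=x.shape)
            xi *= radius * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(xi)
            x = x + xi
            last_perturb = t
        x = x - eta * g  # plain SGD step
    return x
```

Here grad_est stands for whatever stochastic gradient oracle the paper's assumptions constrain; the interpolation-like condition is what ties the noise of this oracle to the size of the true gradient.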

Strengths: The authors prove improved convergence rates for PSGD and SCRN under new assumptions.

Weaknesses: The extra assumption is not explained well and lacks intuition. Even though it is the core assumption of the paper, the authors say very little about it.

Correctness: I think the claims are correct.

Clarity: Yes.

Relation to Prior Work: Clear.

Reproducibility: Yes

Additional Feedback: The proofs are based on Assumption 3.1.
1. More details on Assumption 3.1 need to be presented. Previous SGD analyses only used unbiasedness and bounded variance of the stochastic gradient, whereas this paper assumes more detailed structure of the stochastic gradient distribution and thus proves better results. Can this assumption be satisfied in neural network training or finite-sum optimization?
2. The results and proofs are quite complicated. The authors may want to present a proof sketch to make the paper more readable.
3. As a theoretical paper, the authors should present the core techniques of the proofs.
4. Zeroth- and first-order methods are meaningful and common in ML. The high-order method needs the Hessian matrix and cannot be applied to the high-dimensional case. I think it would be better to move the results for the high-order method to the supplementary material.
-------------------------------------------------------------------------------------------------
comments after rebuttal
-------------------------------------------------------------------------------------------------
I have read your feedback. Thank you!


Review 3

Summary and Contributions: This work improves the convergence rates of PSGD and SCRN by using the strong growth condition. Further, the theoretical analysis is extended to the zeroth-order variants of PSGD and SCRN.
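For context, the step that SCRN-type methods repeatedly solve has the generic cubic-regularized Newton form (the standard Nesterov-Polyak subproblem, not quoted from the submission)
s_t = \arg\min_s \; \langle g_t, s \rangle + \tfrac{1}{2} \langle s, H_t s \rangle + \tfrac{M}{6} \|s\|^3,
where g_t and H_t are stochastic estimates of the gradient and Hessian and M is the cubic-regularization parameter.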

Strengths: (1) It is great to see that the convergence rates of PSGD and SCRN are improved by using the strong growth condition. (2) The theoretical analysis is established for both zeroth-order and high-order methods.

Weaknesses:
(1) The theoretical contributions of this paper are not significantly strong. Most of the theoretical analysis follows directly from previous works, e.g., [JNG+19] and [VBS18], once the strong growth condition is assumed. Taking that into account, the technical contributions of this work are not significant.
(2) The strong growth condition forces all stochastic samples of the total loss function to share the same critical points, which makes the stochastic analysis much simpler due to the automatic variance reduction (see the short derivation after this list). Therefore, the strong growth condition is actually a very strong assumption. Although the authors mention that the per-sample loss in supervised learning seems to satisfy the strong growth condition, a closer look shows that this requires a ground-truth model/classifier that exactly fits the training data. What if there is label noise in the training data? In other words, if the strong growth condition is only approximately satisfied, do we still achieve such an improved convergence rate? More discussion of the reasonability and generality of the strong growth condition is encouraged.
(3) The numerical experiments are weak. No experiments are provided to verify the proposed upper bounds on the sample complexity / number of oracle calls. I care about experiments because many assumptions used in this work are hard to verify in practice; more discussion/experiments are needed to clarify which practical scenarios fall into the framework of this paper.
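To make the "shared critical points" observation in (2) precise, using the standard statement of the strong growth condition, E_i[\|\nabla f_i(x)\|^2] \le \rho \|\nabla F(x)\|^2 (my phrasing, not the submission's): at any stationary point x^* of the total loss F we have \nabla F(x^*) = 0, so the condition forces E_i[\|\nabla f_i(x^*)\|^2] \le \rho \cdot 0 = 0 and hence \nabla f_i(x^*) = 0 for almost every sample i. This is exactly the automatic variance reduction at critical points and is why the assumption is strong.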

Correctness: Seems correct. No empirical results are provided.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: =================After response======================= I have read the author response and agree with the authors' responses to my comments.