Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
While the result might be considered incremental, extending recent work by considering a more sophisticated class of algorithms, the paper still seems quite relevant. It appears to be a good next step in an overarching research project of bringing important results from dynamical systems theory to non-convex optimization techniques applied to machine learning problems, which speaks to the originality of the work. Its significance comes from the considered extension being relevant and non-trivial. The paper depends on knowledge that many readers may not be familiar with, and for that reason it can be a little difficult to follow. Bringing this knowledge to the field is, however, precisely one of the best aspects of the paper. In many places it is quite instructional and easy to follow; in others, not quite so. It is hard to suggest improvements, though: the authors cannot be asked to provide a complete introductory text on the required subjects. That said, the text seems intended for different audiences in different parts, and could perhaps be made more homogeneous in that regard, improving its quality and clarity.
This paper is well written and well argued. It is likely to be of wide interest to the community. The paper is quite technical, but this is necessary for the results the authors are trying to establish.
In this paper, the authors prove that first-order methods with vanishing step sizes can also almost always avoid saddle points. The step-size decay is time-dependent, and the resulting analysis for the main theorem is a non-trivial extension of the proof for constant step sizes.

Post rebuttal: Classical stochastic approximation proofs require the sum of step sizes to diverge while the sum of squares of step sizes remains finite. Since the latter condition is no longer required, the gap between the step sizes allowed by stochastic approximation and those proposed in this paper is unclear. Do the additional conditions (Lines 167-168) lead to the elimination of this condition?

1) The motivation for first-order methods with decaying step sizes over constant step sizes is not clear. If methods like stochastic gradient descent are used as motivation in the introduction, then it is surprising that the results do not hold for any first-order method with noisy gradient estimates, such as SGD.

2) The rebuttal is not convincing enough in explaining how exactly Line 54 implies the elimination of the condition. Moreover, it is obvious why block coordinate descent fails, since it specifically takes gradient estimates with non-zero noise and the sum of squares of step sizes is infinite. The authors will run into the same issue when proving results for something like SGD or any other stochastic-approximation-based method. Nevertheless, the paper still presents a non-trivial breakthrough for vanishing step sizes using dynamical systems theory, which can be viewed as a good first step.
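For reference, the classical stochastic approximation (Robbins-Monro) step-size conditions alluded to above take the following standard form (this is the textbook formulation, not a statement from the paper under review; the symbol \(\alpha_t\) for the step size at iteration \(t\) is our choice of notation):

```latex
% Classical Robbins–Monro step-size conditions:
% the step sizes must diverge in sum (to reach any point)
% but have a finite sum of squares (to damp the noise).
% A typical choice satisfying both is \alpha_t = c/t.
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

The review's question is about the second (finite sum of squares) condition: the paper appears to drop it, so the exact relationship between the paper's admissible step-size schedules and the classical ones is what the reviewer finds unclear.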