Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper presents several "shadowing" results for gradient descent (GD) and the heavy ball (HB) method, under several conditions on the objective. In short, the authors provide conditions under which a discrete approximation of an ODE defines a trajectory that "stays close" to the actual trajectory of the ODE. This research is motivated by a recent paper by Su, Jordan, and Candes that models Nesterov's method via an ODE; this leads the authors to ask when an ODE solution indeed well approximates a discrete algorithm, which is what would be implemented in practice. Although the interest and motivation are mostly on HB, the bulk of the results presented in the paper are for GD.

The paper is well-written overall, and the results are interesting, if somewhat shallow. There is also a question of practicality, as it is not clear how to use these results other than in the negative (e.g., no convergence rates seem inferable from these bounds).

On a more fundamental level, my main concern has to do with novelty. The authors hint at this, but discrete approximations of ODEs are a very old subject. The authors refer to papers from the fifties, but in fact discrete approximations of ODEs go back centuries (indeed, the authors analyze the so-called Euler method), and the ideas presented here seem to revisit central notions of real analysis, numerical analysis, and control. Even on the stochastic side, bounds on the corresponding approximations are also classic (see, e.g., the chapter called "The ODE Method" in Benveniste and Priouret, or the corresponding sections of Kushner and Yin cited by the authors). The classic nature of the topic gives me pause in accepting the paper in its current form. I feel that it does not treat this context appropriately: erring towards clarity and readability perhaps, the paper presents related work somewhat informally, and does not sufficiently place its results within the broader real-analysis/numerical-analysis/control literature.
————— Review edited after rebuttal ————— The authors are quite right that they give a comprehensive, formal review of shadowing in Section 2; my comment was unfair and poorly worded, to say the least. Section 2 does indeed present "context", and recent results, on what is known about shadowing properties, at least from a control perspective. I still believe it would be worth giving a point of view from real analysis (convergence results of finite operators to trajectories) as well as from numerical analysis, where correct computation seems to be a central theme. I am not an expert in either discipline, so I cannot provide more concrete feedback; my apologies for this.
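To make the Euler-method point above concrete, here is a minimal numerical sketch of my own (not code from the paper; the quadratic, its eigenvalues, and the step sizes are arbitrary illustrative choices). Gradient descent is the explicit Euler discretization of the gradient flow x' = -∇f(x), and on a strongly convex quadratic one can compare the two trajectories directly:

```python
import numpy as np

# Illustrative strongly convex quadratic f(x) = 0.5 * x^T A x,
# with mu = min eig(A) = 1 and L = max eig(A) = 10 (arbitrary choices).
A = np.diag([1.0, 10.0])

def grad(x):
    return A @ x

def gd(x0, h, n):
    """Gradient descent: explicit Euler discretization of x' = -grad f(x)."""
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] - h * grad(xs[-1]))
    return np.array(xs)

def flow(x0, t):
    """Exact gradient-flow trajectory for the diagonal quadratic:
    each coordinate decays as exp(-lambda_i * t)."""
    return np.exp(-np.diag(A)[None, :] * t[:, None]) * x0[None, :]

x0 = np.array([1.0, 1.0])
for h in [0.05, 0.01]:
    n = int(5.0 / h)                       # integrate up to time T = 5
    xs = gd(x0, h, n)
    ys = flow(x0, h * np.arange(n + 1))
    dev = np.max(np.linalg.norm(xs - ys, axis=1))
    print(f"h={h}: max deviation {dev:.4f}")
```

As expected from classical numerical analysis, the maximum deviation shrinks roughly linearly in h, which is exactly the kind of statement I would want the paper to situate against the standard Euler-method error bounds.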
This paper studies the deviation of the (discrete) trajectory of optimization algorithms from the (continuous) trajectory of their continuous-time ODE counterparts. The optimization algorithms considered are gradient descent and the heavy ball method (also known as the momentum method). The paper shows that, under some conditions on the objective function (strong convexity, smoothness, and a Lipschitz condition), the trajectory of gradient descent with a small enough step size does not deviate too much from the trajectory of its ODE counterpart. The paper also extends the discussion to cases like strong concavity, saddle points, and stochastic GD, but obtains conclusions that are somewhat weaker than the one mentioned above.

To some degree, these results verify the conjecture that analysis of the corresponding ODE would be helpful for analyzing the optimization algorithms, especially for gradient descent on strongly convex objective functions. However, the result in Theorem 3, which is the main result, is not surprising to me at all, since the step size is required to be very tiny and the objective function is strongly convex (only one global minimum exists, which makes the trajectory stable). Note that in Theorem 3, it is almost always true that 2\mu\epsilon/(L\ell) << 1/L. The step size h would then have to be very tiny compared to the common practical choice 1/L.

The result in the nonconvex case (Theorem 4) is also limited: it only holds for quadratic functions. For a general nonconvex function, this means Theorem 4 only holds in a small region/neighborhood that can be well approximated by a quadratic function. Outside of this region, nothing is known.

More comments/concerns:

> Theorem 3 requires that the objective function is strongly convex as well as Lipschitz. These two conditions are contradictory, since the objective function has to grow no slower than a quadratic, as required by strong convexity, and obviously a quadratic function is not Lipschitz.
Although this contradiction can be resolved by limiting the radius of the feasible parameter space, the Lipschitz constant would then depend on this radius. In that case, according to Theorem 3, a larger feasible parameter space results in an even smaller step size h.
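The radius dependence above is easy to check numerically; here is a trivial sketch of my own (the quadratic and radii are illustrative, not values from the paper):

```python
import numpy as np

# A strongly convex quadratic f(x) = x**2 is not globally Lipschitz:
# on a ball of radius R, sup |f'(x)| = 2R, so the Lipschitz constant
# grows linearly with the radius of the feasible region.
def lipschitz_on_ball(R, n=10001):
    xs = np.linspace(-R, R, n)
    return np.max(np.abs(2 * xs))   # empirical sup of |f'| on [-R, R]

for R in [1.0, 10.0, 100.0]:
    print(f"R={R}: Lipschitz constant {lipschitz_on_ball(R)}")
```

So if the admissible step size in Theorem 3 scales inversely with this Lipschitz constant, enlarging the feasible region forces h toward zero, which is the concern raised above.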
I have read the authors' rebuttal and decided to keep my score. I believe that if the heuristic gluing approach can be made precise, this would greatly improve the paper.

========================

This is an interesting paper and addresses an important problem: namely, the conditions under which discrete-time optimization algorithms can be approximated by continuous-time dynamics uniformly in time. If this holds, large-time properties of the latter can be used to infer those of the former. Previous work in this direction relied mostly on constructing Lyapunov functions, but in this work the authors propose an alternative: allowing the initial condition of the continuous-time dynamics to change a little, thereby establishing uniform error estimates for a larger class of loss landscapes. Although most theoretical results are based on well-established results from dynamical systems theory, I feel it is nevertheless important to introduce these ideas to the greater machine learning community.

One perceived drawback of this work is the limited setting: most results are on quadratic functions (or perturbations of them). I have the following questions:

1. Can the results for quadratic settings, especially the hyperbolic case, be generalized to more general loss functions? E.g., Thm C.1, C.2.

2. On gluing: from the discussion starting at line 237, it would appear that if you perform a local quadratic approximation of the loss function and then glue all the pieces together (assuming they are uniformly hyperbolic), you would naively get a global result. Since this result is not proved, I expect there is some difficulty in this program. I would appreciate an explanation of this difficulty.
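For the quadratic setting that most of the results concern, the discrete/continuous correspondence for heavy ball can be sketched directly; the following is my own illustration (the curvature, friction coefficient, and step sizes are arbitrary choices, not taken from the paper). Heavy ball arises as a finite-difference discretization of the damped oscillator x'' + a x' + ∇f(x) = 0, and on a 1-D quadratic the ODE solution is available in closed form:

```python
import numpy as np

# Illustrative 1-D quadratic f(x) = 0.5 * lam * x**2 (underdamped: lam > a**2/4).
lam, a = 1.0, 0.5

def heavy_ball(x0, h, n):
    """Heavy ball as a finite-difference discretization of x'' + a x' + lam x = 0:
    momentum coefficient beta = 1 - a*h, gradient step h**2 * lam.
    Initializing x(-h) = x(0) encodes zero initial velocity."""
    xs = [x0, x0]
    for _ in range(n):
        xs.append(xs[-1] + (1 - a * h) * (xs[-1] - xs[-2]) - h**2 * lam * xs[-1])
    return np.array(xs[1:])            # xs[1:][k] approximates x(k*h)

def ode(x0, t):
    """Exact underdamped solution of x'' + a x' + lam x = 0 with x'(0) = 0."""
    w = np.sqrt(lam - a**2 / 4)
    return x0 * np.exp(-a * t / 2) * (np.cos(w * t) + a / (2 * w) * np.sin(w * t))

for h in [0.1, 0.02]:
    n = int(20.0 / h)                  # integrate up to time T = 20
    xs = heavy_ball(1.0, h, n)
    ys = ode(1.0, h * np.arange(n + 1))
    print(f"h={h}: max deviation {np.max(np.abs(xs - ys)):.4f}")
```

The deviation again shrinks with h on a single quadratic piece; the open question for me is precisely what happens when such pieces are glued, as asked in question 2 above.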