NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, Vancouver Convention Center

### Reviewer 1

This paper studies how to discretize continuous-time ODEs into discrete-time accelerated optimization methods in a rate-maintaining manner. The authors propose a solution by borrowing the concept of opportunistic state-triggered control from control theory. The proposed approach makes use of the Lyapunov functions for the so-called high-resolution differential equations to design variable-stepsize forward-Euler discretizations.

Originality: The idea of this paper is definitely original. Although the paper uses some results from a recent paper (Ref. [22]), which introduced the high-resolution ODEs, the connection between state-triggered control and optimization design seems new and novel to me. This is the part I really like about the paper.

Quality: The technical quality of this paper is reasonably good (especially on the theory side), and I have not noticed any technical flaws. However, I do have the following concerns.

1) The rate obtained by the proposed approach seems very strange to me. Looking at Theorem 3.3: how does this result recover the standard rate $1 - \sqrt{\mu/L}$, where $L$ is the smoothness constant? At first glance, the bound in Theorem 3.3 does not even depend on $L$, and I cannot see how a factor of $\sqrt{1/L}$ would eventually come out of it. As a sanity check, Theorem 3.3 should at least recover the standard discrete-time accelerated rate $1 - \sqrt{\mu/L}$. In continuous time, the constant $L$ does not show up, as expected; once discretization is involved, $L$ should play a role. I can see that the terms in Proposition 3.2 depend on $L$, but at this point I cannot see how the final convergence bound in Theorem 3.3 does.

2) The numerical examples need some improvement. A 2-dimensional quadratic example is not going to be very convincing. In addition, both Figures 2 and 3 are confusing. The right plot in Figure 2 seems to show divergent behavior, right? I am very confused here.
Figure 3 plots the case where the optimal point is already known; I do not think this is a good plot for demonstrating the potential of the proposed approach. Please explain what exactly happens in Figure 2. Using the same number of iterations, does the proposed method outperform existing discretization methods? If so, please make a plot that shows this precisely.

Clarity: Overall the paper is well written. As mentioned above, Figure 2 confuses me a lot; other than that, I think the paper is quite clear.

Significance: Bridging state-triggered control with optimization design is a novel idea that may inspire further research. I think the conceptual novelty of this paper is high, although I am a little unsure whether the proposed discretization approach itself really outperforms the existing approaches.

A minor comment: The authors claim that their approach is general for any dynamical system. Is this true? It seems the construction of the function $g$ will be case-dependent, so applying the proposed approach to general dynamical systems seems to require a significant amount of extra work.

====================================================

After the rebuttal: The analytical comparison between the proposed discretization and existing approaches has still not been made. It would be much more convincing if the proposed approach could at least recover the standard rate of Nesterov's method. I also agree with Reviewer 2 that there is a concern about whether preserving the continuous-time rate is itself the most important thing to do. Overall, this paper has something I really like (the original idea of bridging optimization and state/event-triggered control), but at the same time the true benefits of the proposed approach are not convincing. I tend to keep my score at weak acceptance.
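The sanity check the reviewer asks for (recovering the rate $1 - \sqrt{\mu/L}$) can be made concrete on a toy problem. The sketch below runs the classical constant-momentum form of Nesterov's method on a strongly convex quadratic, where the stepsize $1/L$ and the momentum $(\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})$ are exactly the places where $L$ enters the discrete-time rate. All names are illustrative, and this is not the discretization proposed in the paper under review.

```python
import numpy as np

# Illustrative sanity check (not the paper's method): Nesterov's accelerated
# gradient on a strongly convex quadratic f(x) = 0.5 * x^T A x, where mu and L
# are the smallest and largest eigenvalues of A. The classical guarantee is
# f(x_k) - f* <= C * (1 - sqrt(mu/L))^k, so L must appear in any discrete rate.

def nesterov_quadratic(A, x0, num_iters):
    eigs = np.linalg.eigvalsh(A)             # ascending eigenvalues
    mu, L = eigs[0], eigs[-1]
    step = 1.0 / L                           # stepsize set by smoothness L
    momentum = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x, x_prev = x0.copy(), x0.copy()
    for _ in range(num_iters):
        y = x + momentum * (x - x_prev)      # extrapolation step
        x_prev, x = x, y - step * (A @ y)    # gradient step at y
    return x

A = np.diag([1.0, 100.0])                    # mu = 1, L = 100
x = nesterov_quadratic(A, np.array([1.0, 1.0]), 200)
f_val = 0.5 * x @ (A @ x)                    # f* = 0 at the origin
```

After 200 iterations on this condition-number-100 problem, `f_val` is driven far below the initial value of 50.5, consistent with a per-iteration contraction of roughly $1 - \sqrt{\mu/L} = 0.9$.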

### Reviewer 2

Please see the comments in item 1. A couple of other comments are below.

The numerical implementation of the proposed method requires computing all the quantities stated in Proposition 3.2 (in my opinion this is quite cumbersome). Moreover, one might converge in fewer iterations, but one also needs to compute more gradients and several other quantities; thus, the number of iterations is not fully representative of the computational complexity.

The numerical experiments illustrate a trivial case, quite far from real problems encountered in machine learning or statistics.

Finally, this paper does not do justice to several other recent papers on the topic, which are not even mentioned. For instance, I suggest the authors check recent papers by Jordan, Attouch, Franca, Vidal, Cabot, etc. (to cite just a few).

==============

Post author response: "We respectfully disagree with this comment, as we include the work by the authors suggested by R#2, see [3], [22], [23], [28] and [29]." Obviously, I was not referring to these papers, which are already cited; an exhaustive literature review is not what I meant. I just wanted to point out that some consideration of the existing literature may be appropriate.

Replacing the Hessian by the mentioned approximation, $\nabla f(x_{k+1}) - \nabla f(x_k)$, introduces another source of error in the discretization. I cannot judge whether this is significant in this case without seeing the proper derivation.

I agree that the idea of the paper is interesting and deserves exploration. However, in my opinion the current results of the paper are insufficient. Moreover, I do not see any benefits compared to much simpler, more elegant, and so far more efficient approaches.

Let me stress another point of concern. The ODE literature contains many discretization methods designed to preserve some given quantity, such as the energy of the system. One can simply use any discretization and project the solution.
It turns out that such approaches can be much inferior, since they are considerably more unstable than simpler methods that do not preserve the same quantity. Thus, I am not convinced that forcing the discretization to preserve the decay of the Lyapunov function will automatically lead to a more efficient method. This has to be justified through a stability analysis.
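The discretize-then-project idea mentioned above can be sketched in a few lines. The toy below applies forward Euler to the harmonic oscillator $\dot{x} = v$, $\dot{v} = -x$, whose energy $E = (x^2 + v^2)/2$ is inflated by a factor $(1 + h^2)$ at every Euler step, and then restores the energy by rescaling the state after each step. The function name and the rescaling choice are illustrative, not a method from the paper or this review.

```python
import math

# Forward Euler on the harmonic oscillator x' = v, v' = -x.
# Plain Euler multiplies the energy E = (x^2 + v^2) / 2 by (1 + h^2) per step;
# the "projected" variant rescales (x, v) back onto the initial energy shell,
# illustrating the preserve-a-quantity-by-projection idea discussed above.

def euler_oscillator(steps, h, project):
    x, v = 1.0, 0.0
    e0 = 0.5 * (x * x + v * v)               # initial energy
    for _ in range(steps):
        x, v = x + h * v, v - h * x          # one forward-Euler step
        if project:
            scale = math.sqrt(2.0 * e0 / (x * x + v * v))
            x, v = scale * x, scale * v      # project back to the energy shell
    return 0.5 * (x * x + v * v)             # final energy

e_plain = euler_oscillator(100, 0.1, project=False)  # energy drifts upward
e_proj = euler_oscillator(100, 0.1, project=True)    # energy held at 0.5
```

The projection does keep the conserved quantity exactly, but, as the reviewer argues, preserving a quantity is not the same as being stable or efficient; that still requires its own analysis.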