NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:8472
Title:Online Optimal Control with Linear Dynamics and Predictions: Algorithms and Regret Analysis

Reviewer 1

SETTING: In continuous control, model predictive control is often used as a means to negotiate model mismatch or misspecification. Here, the setting assumes the availability of cost functions for some time steps in the future. This seems somewhat stylized. In terms of predictability of cost functions (representing prices etc.), a guarantee akin to that of doubly robust estimators would seem more apt. Furthermore, while regret is a perfectly well-defined quantity, a competitive ration like guarantee might be a better measure since the regret is non-vanishing. TECHNIQUES: The key observation here is that with the foreknowledge of cost functions up to a certain horizon enables one to run gradient-based offline optimization algorithms (like GD, NAG etc.) on the best-in-hindsight problem for a few iterations in the online setting. This was observed in previous work (Li 2018). This paper proposes a somewhat classical, yet very well placed reparameterization of the problem that fits into the aforementioned framework. To the reviewer, it seems the obvious parameterization in terms of the controls does not seem to, at least immediately, yield the result. **In this light, this reparameterization + offline-algorithm-on-hindsight-problem observation is quite neat.** The lower bound seems conceivable given the one in (Li 2018). 1. Can the authors comment on the transformation to the canonical form affect the strong convexity and smoothness of the transformed costs on Line 479? 2. Line 155/6 -- Was x_t+1 = A x_t + B u_t intended instead of x_t+1 = x_t + u_t? POST RESPONSE Adding the response under "Parameterization in terms of control inputs." to the main paper would be nice. The reviewer also thinks R#3's suggestion of putting GD first would add to readability. The reviewer is retaining his/her score.

Reviewer 2

The paper and the contributions are cleanly written. Matching upper/lower bounds on regret and proposed algorithm would be significant to community. One issue is upper/lower bounds still differ by condition number^2 as discussed below. The fact that setup applies to possibly nonlinear f_t,g_t is definitely good although it is not clear how much technicality is introduced by doing this compared to LQT setup. -------------------------------------------------------- After reading other reviews and author feedback, I am more convinced about the significance of the paper and adjusted my score accordingly.

Reviewer 3

The authors study online control problems where the dynamics are known, there is no process noise, but the cost functions are changing over time. A main motivating application is LQ tracking. The authors study the problem by looking at dynamic regret, where the online algorithm has to compete with the optimal solution (not just best static controller in hindsight). To compensate for this more difficult objective, the algorithm is assumed to have access to a small prediction lookahead window. The main result of the work is a new momentum based online control algorithm. The authors prove that the regret of their proposed algorithm decreases exponentially in the prediction lookahead window size W. The authors also prove a lower bound that shows that for LQ tracking their regret upper bound is nearly optimal. Experimentally, the authors show their algorithm outperforms MPC from a regret point of view while using similar amount of computation. This is a nice paper. The focus on dynamic regret is new, and the reduction from LQ controllable form to unconstrained convex optimization is of interest to the online learning community. The authors should cite the following related paper: Online Control with Adversarial Disturbances. Agarwal et al. ICML 2019. There are some similarities between Agarwal et al. and this work. For instance, Agarwal et al. also considers time varying costs while allowing for an adversarial disturbance sequence to affect the dynamics. However, they show regret bounds with respect to the best (almost) static policy. Presentation wise, in Algorithm 1 I would have preferred to first see the algorithm with vanilla gradient descent first before moving to the triple momentum algorithm. The regret bound can be presented for just RHTM, but for simply understanding the algorithm a simpler update rule would be easier to follow. Minor typo: in lines 155 and 156, it should be x_{t+1} = A x_t + B u_t.