Paper ID: 5938

Title: Online Forecasting of Total-Variation-bounded Sequences

=== Update after reading the authors' response:

The authors' response answered my questions well. One limitation that I missed in my initial review is that the coordinates of the parameter theta (the true signal sequence) are not assumed to be bounded; the only assumption is that theta lies in a total variation ball. This means that the only bound on these coordinates comes through the bound C on the total variation of the sequence. Hence, the dependence on a bound B on the sequence (the L-infinity norm of theta) is implicit and replaced by the worst-case upper bound C, which leads to a dependence on C^2 instead of B*C on the intersection of those balls. I think this limitation should be addressed, given that in the online learning literature it is more customary to make the dependence on B explicit. Specifically, this would entail assuming that theta lies in the intersection of a TV ball of radius C and an L-infinity ball of radius B (where both B and C can be assumed to be known, though adaptivity to them can be considered), and providing matching upper and lower bounds over this class. This appears to be feasible; otherwise, the authors should at least explicitly mention this point. I chose to keep my initial score, given the merits of this paper. However, I encourage the authors to address, or at least mention, the above issue in the final version of the paper.

==

This paper considers the problem of sequential prediction of a sequence of real values. The sequence is assumed to be obtained by adding independent noise to a true sequence; the quality of predictions is measured by the cumulative squared error between the true and predicted values. This quantity essentially coincides with the dynamic regret from the online learning literature. Since one cannot hope for sublinear regret without any assumption on the sequence, the authors consider sequences that are bounded in total variation.
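For concreteness, the setting described above can be written as follows (my own notation, reconstructed from the review, not necessarily the paper's):

```latex
% Observation model: noisy total-variation-bounded signal,
%   y_t = \theta_t + \epsilon_t, \quad \epsilon_t \text{ independent noise with level } \sigma.
% The true sequence lies in a TV ball of radius C:
\mathrm{TV}(\theta_{1:n}) \;=\; \sum_{t=2}^{n} \lvert \theta_t - \theta_{t-1} \rvert \;\le\; C .
% Performance: cumulative squared error against the true signal
% (essentially the dynamic regret):
R_n \;=\; \sum_{t=1}^{n} \bigl( \hat{y}_t - \theta_t \bigr)^2 .
```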
Total variation balls are often considered as rich nonparametric classes in the nonparametric estimation literature, which, unlike the smaller Hölder and Sobolev balls, can account for abrupt changes and inhomogeneous smoothness. These classes motivated successful developments such as wavelet shrinkage in the 1990s. Achieving the optimal rate over these classes is not trivial and requires spatially adaptive algorithms; in particular, linear smoothers, a large class of algorithms that includes kernel smoothers, are known to fare suboptimally over these classes. The problem considered by the authors is actually different from the denoising problem considered in the nonparametric literature (where the goal is to estimate the true past signal in a batch fashion); here, the goal is to forecast the next observation online.

The main contributions are:
- Given knowledge of the time horizon n, the total variation C, and the noise level sigma, the authors propose an efficient algorithm called ARROWS with optimal n^{1/3} (C \sigma^2)^{2/3} regret. The fact that O(n^{1/3}) regret can be achieved can also be seen from the general results on online nonparametric regression of (among others) Rakhlin and Sridharan, although that approach is non-constructive.
- Through a connection with the problem of denoising, the authors note that linear algorithms such as online gradient descent (OGD) can only achieve suboptimal O(n^{1/2}) regret over such classes.
- Finally, the authors show that for some range of values of C, the problem of forecasting is actually harder than that of denoising.

I found this paper to be very interesting and well-written. The main points were articulated clearly, and I appreciated the connection between the two bodies of work. (I did not read the proofs in the supplementary material.)
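To make the setting concrete, here is a toy simulation (my own illustration, not the authors' code, and the forecaster below is not ARROWS): a piecewise-constant signal inside a TV ball of radius C, noisy observations, and the cumulative squared error of a running-mean forecaster, a simple linear smoother of the kind argued to be suboptimal over these classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, sigma = 1000, 2.0, 0.5  # horizon, TV radius, noise level (arbitrary toy values)

# Piecewise-constant signal with two jumps whose sizes sum to C,
# so the sequence lies in the TV ball of radius C.
theta = np.zeros(n)
theta[n // 4:] += C / 2
theta[3 * n // 4:] -= C / 2
assert np.abs(np.diff(theta)).sum() <= C + 1e-12

# Noisy observations of the true sequence.
y = theta + sigma * rng.standard_normal(n)

# Running-mean forecaster: predict the average of all past observations
# (prediction at t=0 is arbitrarily 0, since nothing has been seen yet).
preds = np.zeros(n)
for t in range(1, n):
    preds[t] = y[:t].mean()

# Cumulative squared error against the true (unobserved) signal.
regret = np.sum((preds - theta) ** 2)
print(regret)
```

Because the running mean averages across the jumps, its bias after each change point keeps contributing to the error, which is the intuition behind the O(n^{1/2}) limitation of linear smoothers on TV classes.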
Remarks:
* Theorem 8, and the remark (lines 257-258) "despite the fact that ARROWS is designed for total variation class, it adapts to the optimal rates of forecasting sequences that are spatially regular": doesn't this apply to any minimax strategy for the TV class (by the same argument)? In this case, it would be more precise to formulate this (in this section and the introduction) as a consequence of the optimality on TV classes, rather than as another property specific to the ARROWS algorithm.
* Often in online learning, the considered notion of regret is obtained by constraining the comparator class, but not the sequence of values (which is seen as deterministic and arbitrary, although bounded). Do your results carry over to this formulation?

Typos: l. 102 "then" -> "than"; l. 225 "bound the number".

This paper is generally well-written and quite clear. The contributions seem significant, though the setting may be a bit narrow: squared loss only, no exogenous variables, and known variance, horizon, and bound on the total variation. I would be interested in more details on real applications (or experiments with real datasets rather than simulations), and in how this setting might be extended to the use of covariates.

Other comments:
- In the classical notion of dynamic regret (see e.g., Zinkevich 03), the complexity is measured in terms of the total variation of the comparison sequence (x_t), not that of the true sequence (theta_t). This has the advantage of providing some kind of oracle inequality. It would be nice to discuss the optimal bound in that case and whether the current analysis might be extended to it.
- If I understood correctly, the improvement over Besbes et al. 2015 is due to the squared loss (instead of general strongly convex losses), which allows a bias-variance separation in the analysis. Is this only possible for the squared loss, or might it hold for a wider class of loss functions (such as self-concordant losses)?
- I did not understand very well how this setting falls into the framework of online nonparametric regression; I would enjoy more explanations. Furthermore, it is mentioned in the introduction (l. 148) that the algorithm provided here is the first polynomial-time algorithm with O(n^{1/3}) regret, though in the Appendix it is said that the one of Gaillard and Gerchinovitz 2015 has complexity O(n^{7/3}). Besides, could your lower bound (with the C^2 term) be re-obtained from their minimax rate?
- Would the C^2 term in the lower bound be necessary if all the theta were assumed to be bounded?
- How could the algorithm be adapted to unknown meta-parameters: the horizon (n), the noise variance (sigma), and the bound on the total variation? I wonder whether a doubling trick or a meta-aggregation using an expert-advice algorithm would be sufficient.
By the way, sigma should be added as a meta-parameter of the algorithm.
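The doubling trick asked about above could be wrapped around any horizon-dependent forecaster as sketched below (a generic restart scheme of my own; `forecaster_factory`, `predict`, and `update` are a hypothetical interface, and whether such a wrapper preserves the n^{1/3} rate for ARROWS is exactly the open question):

```python
def doubling_trick(forecaster_factory, stream):
    """Handle an unknown horizon by restarting a horizon-dependent
    forecaster on epochs of doubling length 1, 2, 4, ...
    `forecaster_factory(horizon)` returns an object with
    .predict() and .update(y) methods (hypothetical interface)."""
    preds = []
    it = iter(stream)
    epoch, done = 0, False
    while not done:
        horizon = 2 ** epoch
        f = forecaster_factory(horizon)  # fresh instance tuned to this epoch
        for _ in range(horizon):
            try:
                y = next(it)
            except StopIteration:
                done = True
                break
            preds.append(f.predict())  # predict before seeing y
            f.update(y)
        epoch += 1
    return preds

# Toy forecaster: predicts the mean of observations seen within its epoch.
class MeanForecaster:
    def __init__(self, horizon):
        self.s, self.k = 0.0, 0
    def predict(self):
        return self.s / self.k if self.k else 0.0
    def update(self, y):
        self.s += y
        self.k += 1

preds = doubling_trick(MeanForecaster, [1.0] * 10)
print(len(preds))  # one prediction per observation
```

For a forecaster with regret O(n^a) on a known horizon n, summing over epochs of length 2^k gives the same order up to a constant; the question raised in the review is whether this carries over to the unknown-sigma and unknown-C parameters as well, where a restart alone does not obviously suffice.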

The algorithm is shown to have regret of the optimal order n^{1/3} with probability 1-\delta. The algorithm runs in O(n\log n) time. The infimum in the definition of regret (the displayed formula on page 2) seems misplaced.