NeurIPS 2020

Better Full-Matrix Regret via Parameter-Free Online Learning

Review 1

Summary and Contributions: This paper deals with the setting of parameter-free online convex optimization. In particular, "full-matrix" bounds are considered. These bounds take into account the fact that the rank of the subspace spanned by the subgradients received along the game can be smaller than the true dimension d. Importantly, the authors provide 3 different characterisations of matrix bounds which capture different aspects of the problem at hand. At the same time, they describe algorithms achieving the bounds given in the introduction.

Strengths: This work follows the recent developments in parameter-free online learning. Given the various applications of these algorithms in practice, it is fundamental to understand how to better design them. This work provides a useful approach, both to design new algorithms and shed light on already existing ones (such as Adagrad). Compared to prior work, the algorithms provided by the authors do not need prior knowledge of the norm of the (optimal) competitor and their regret bounds hold for any arbitrary convex domain. The paper is well written: the problem is nicely introduced and clearly explained. Also, the technical level is satisfactory and all the theorems are carefully explained.

Weaknesses: In my opinion, the techniques used in this work are all somehow previously known, as stated by the authors themselves. For example, Algorithm 1 is an application of the scheme proposed in [17] by Cutkosky & Orabona on how to combine online learning algorithms and the proof of its regret bound seems straightforward (both the regret of FTRL and 1d parameter-free optimization algorithms are well known). Also, Algorithm 2 requires a straightforward extension of Thm 3 from [17] (as honestly stated also by the authors). The authors also provide a brief discussion about the result of Theorem 6 compared to similar results contained in [17], however it is not totally clear to me why the algorithm in [17] cannot be applied to constrained settings. Also, I'm not an expert on this but I think from a practical point of view the diagonal version of Adagrad is used and not the full matrix version, which requires dealing with a dxd matrix in each round, which makes the running time much worse. Regarding this, I think an empirical evaluation of the algorithms would be beneficial. For example, it would be nice to see if the theoretical gains of the algorithm proposed in Theorem 7 are justified despite a worse running time. More in general, a it would be nice to show an empirical evaluation against the original Adagrad algorithm but also to other parameter-free methods such as the ones given in [23] (Cutkosky and Sarlos). **After rebuttal**: the authors clearly explained why the reduction in [17] does not work in this case in their response. However, this is not clearly stated in the main paper and I think it should. Nevertheless, the paper is relevant and well written.

Correctness: All the theorems are sound.

Clarity: The paper is very well written, easy to follow and enjoyable to read. I think there are some typos in the proof of Theorem 6. In particular, the authors say (line 155) that " $ \| x \|_t^2 = \| x \|^2 + x^\top ( I + G_T ) x $". However, at the beginning of the proofs (line 157) we have that: \| x \|^2 = \| \mathring{x} \|^2 x^\top ( I + G_T ) x = \| x \| + ... ", which is different from the definition given above (while after the equality a square is missing from \| x \| ). Also, in the proof of the same theorem, lemma 8 is used but its statement can't be found in the main paper (I think it should be mentioned it is in the Appendix).

Relation to Prior Work: In my opinion this work only looks at one side of the problem, which is the norm of the competitor. On the other hand, from a practical point of view it is of equal importance (if not more important) to consider also the norms of the gradients. In this work it is assumed they’re upper bounded by 1. However, a major reason for adagrad success can be the fact that it is a scale-free algorithm (see for example “Scale-Free Online Learning” [Orabona & Pal, 2018]), meaning that if the gradients are multiplied by different constants, then the learning rates will scale them. This is crucial with neural-networks, where the first layers have smaller magnitude of the gradients compared to the last layers. I am aware of the fact that it's not possible to achieve an optimal dependence on both the norm of the competitor and an upper bound on the norm of the gradients without prior knowledge, however I would like to see a discussion on the aforementioned behaviour.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper considers online convex optimization by automatically tuning the learning rate to an optimal value. They devise an algorithm, based on FTRL framework, to attain a regret bound (4) depend on the sequence of the given norms. As the application of this framework, they recover various full-matrix bounds up to a factor log T.

Strengths: They provide a vehicle for designing parameter-free algorithms that is adaptive to the series of given norms. This framework can recover several prior results up to a logarithmic factor by the proper substitutions of norms. I believe it is insightful for designing parameter-free algorithms for more complicated tasks. Also, their regret bound is specific to the sequence of gradients (or subgradients) and the optimal solution.

Weaknesses: (1) It is a mystery why they need to pay an extra log T multiplicative factor for the regret bounds. As the author mentioned, there might be something missing in the analysis or algorithm. (2) For section 3.2, the author applies the technique in [17] to derive regret bound when the domain is constrained. However, this regret loses the information of the domain. It is better if the regret can also be specific to the domain.

Correctness: The paper is based on the sharp observations from the previous works. The analysis is simple and solid.

Clarity: One thing unclear is about the assumptions of Theorem 5. It follows from Lemma 4 and 3, the author need to make clear if the sequence of norms should be increasing or not. Also, Line 136 claims the algorithm 2 can proceeds with arbitrary norms. I hope it can be clear that if the increase of the sequence of norms is necessary.

Relation to Prior Work: Clear and concise.

Reproducibility: Yes

Additional Feedback: Since the algorithm 2 wraps algorithm 1 and Diagonal betting algorithm [23] as the ingredients. I am a little worried about the efficient issue. Could the author compare the computational complexity of algorithm 2 with other paprameter-free algorithms? __________________________________________After rebuttal______________ I am happy with the author's response. I would like to increase one more score to encourage this work.

Review 3

Summary and Contributions: The paper studies the setting on convex optimization where the decision set can be either constrained and unconstrained and presents algorithms with improved regret bounds compared to AdaGrad.

Strengths: What I like about this paper is the idea of delegating the parameter tuning task that is usually required for FTRL to a parameter-free algorithm using the dimension-free-to-one-dimensional reduction by Cutkosky and Orabona 2018. This idea showcases nicely how FTRL and parameter-free algorithms can be combined to yield an *adaptive* parameter-free regret bound. To achieve the required adaptivity, the authors introduce an intermediate algorithm that uses varying norms, with an interesting analysis.

Weaknesses: The paper's main (advertised) contributions are algorithms that achieve the regret bounds in Eq. (2) and (3). For the former bound, the authors only claim to be the first to achieve it for the constrained setting. Unfortunately, an algorithm with a tighter regret bound than (2) already exists (in both the constrained and unconstrained settings). In fact, this is achieved by the Matrix-FreeGrad algorithm due to Mhammedi and Koolen 2020 'Lipschitz and Comparator-Norm Adaptivity in Online Learning'; Inspecting the display immediately after their Theorem 10, one can see that the regret bound of Matrix-FreeGrad is essentially (in the notation of the current manuscript) |w|_{G_T} \sqrt{\log \det (I + G_T)} which is always smaller than Eq. (2) (this follows from the steps between lines 163 and 164 of the current manuscript). The regret bound of Matrix-Freegrad can be seamlessly transferred to the constrained setting via the constrained-to-unconstrained reduction due to Cutkosky and Orabona 2018 or Cutkosky 2020 'Parameter-free, Dynamic, and Strongly-Adaptive Online Learning'. The authors claim that an algorithm with regret bound (2) in the unconstrained setting already exists and refer to the paper by Cutkosky and Orabona 2018. I do not think that the latter paper achieves this bound with r equal to G_T's rank (r is equal to the full dimension in their case). That being said, if there were an algorithm that achieves such a bound in the unconstrained setting (such as Matrix-FreeGrad), then the constrained-to-unconstrained reduction due to Cutkosky and Orabona 2018 (or Cutkosky 2020 'Parameter-free, Dynamic, and Strongly-Adaptive Online Learning') will immediately lead to the same regret-bound in the constrained setting (up to lower-order terms). I actually do not see why Algorithm 2 (Varying Norm Adaptivity) is needed to transfer a regret bound to the constrained setting. The black box reduction due to Cutkosky and Orabona 2018 does not care about the inner workings of the unconstrained algorithm to transfer the regret bound, as long as the norm of the gradients is bounded by 1 (for some fixed choice of norm). Perhaps Algorithm 2 relaxes the requirement on the norm of the gradients; only requiring |g_t|_{t-1,*}\leq 1 instead of the more constraining |g_t|_* \leq 1. However, this is never discussed in the paper. Also, in lines 53 and 54, the authors claim that prior work has only considered fixed norms. This is not true; see the work of Cutkosky 2019 'Combining Online Learning Guarantees.' On minor aspects: - line 63, a square-root is missing in the definition of the |x|_M. - Line 67, the subgradient should be defined. - Line 114, there is a missing subscript 't' in the norm. - Lemma 4 does not mention where the comparator w lives---it should be unconstrained. =========== Post rebuttal ============ It seems that Eq (2) does not trivially follow from the reduction by Cutkosky and Orabona. The discussion provided in the authors' response is highly relevant to the paper, as it motivates the problem and shows where previous methods fail. I do not understand why this was not included in the manuscript in the first place. For this reason, I only increase my score to 6, and I would urge the authors to include the explanation provided in the rebuttal in the final version of the paper.

Correctness: Some false claims were made about the novelty of the regret bound in Eq. (2) and the difficulty to transfer unconstrained regret bounds to the constrained setting (see weaknesses section above for more detail).

Clarity: The paper is clear enough.

Relation to Prior Work: Relevant related work missing such as: - Cutkosky 2019 'Combining Online Learning Guarantees'. - Cutkosky 2019 'Artificial Constraints and Hints for Unbounded Online Learning'. - Cutkosky 2020 'Parameter-free, Dynamic, and Strongly-Adaptive Online Learning'. - Mhammedi and Koolen 'Lipschitz and Comparator-Norm Adaptivity in Online Learning'.

Reproducibility: Yes

Additional Feedback: I think the paper would be stronger if it was more focused on the idea of delegating the tuning of the learning rate of FTRL to a parameter-free algorithm. It seems that the algorithm of Thm 6 admits a better regret bound which matches that of Matrix-FreeGrad, but the techniques used to get it in the current paper are rather different. This could be investigated more and discussed in the paper. Finally, I think that the authors should not claim Eq. (2) is novel in the constrained setting.