NeurIPS 2020

Regret Bounds without Lipschitz Continuity: Online Learning with Relative-Lipschitz Losses

Review 1

Summary and Contributions: The traditional online convex optimization relies on the assumption that the loss function is Lipchitz continuous. Recently, relative-Lipschitz conditions were proposed in convex optimization literature, which extends the scope of the problems one can solve using first-order methods. The submission studies a variety of online learning algorithms under the recently proposed relative-Lipschitz conditions, including follow the regularized leader, online mirror descent, regularized dual averaging, and extend the standard regret bounds to the relative setting, including both O(sqrt{T}) regret and O(log T) regret. I see some merits of the submission, but there are some major concerns that need to be addressed (see the weakness 1 below).

Strengths: 1. This is the first work showing that online mirror descent can achieve O(log T) regret for non-Euclidean distance, but of course with a stronger assumption, the relative strong convexity. 2. The regret bounds presented in Section 3 and Section 5 are simple and elegant. They look like the ``right’’ regret bound for these settings. 3. The contribution to O(sort{T}) regret under relative continuity is interesting, but it may not be the first result under similar condition (see weakness 1).

Weaknesses: 1. There is ICLR paper "Online and stochastic optimization beyond Lipschitz continuity: A Riemannian approach". Since it is closely related to the submission (both talk about online learning without Lipschitz continuity), it should be cited and carefully compared with. Although the presumed conditions are different (relative continuity and Riemannian continuity), it is unclear to me what are the examples that satisfies relative continuity but not Riemannian continuity. It 2. The work does not present interesting examples where the loss is relative continuous but not Lipschitz continuous. An example with some preliminary numerical experiments can greatly help the readers better understand the usage of these algorithms. 3. The work directly extends the similar results in convex optimization, which seems to be straight-forward, and may lack theoretical depth.

Correctness: I quickly checked the proof, and they seem to be correct.

Clarity: The paper is well written.

Relation to Prior Work: See the weakness 1.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper focuses on online convex optimization (OCO) problems, where the respective losses are not Lipschitz continuous, as it is assumed typically in the literature. More precisely, the standard Lipschitz continuity assumption is generalized by a new class of losses that of relative Lipschitz continuity, which is a direct extension of Lu's (2018) relative continuity. Building on this blanket assumption for the loss functions, the optimal regret bounds are recovered for two online iterative schemes; that of FTRL (and a particular case of it namely the Dual Averaging algorithm) and Dual-Stabilized OMD. In particular, the authors show the O(\sqrt{T}) regret bound for two scenarios: for the standard online setting and the so-called composite loss function framework. Additionally, the authors show a logarithmic regret bound for the case where the loss functions are in addition relatively strongly convex, which is a Bregman based generalization of the traditional euclidean notion of strong convexity.

Strengths: The main novelty of this paper is that it tries to establish logarithmic regret for loss functions that go beyond the traditional euclidean notions of Lipschitz continuity and strong convexity. However, this result may be questionable since the existence of loss functions that simultaneously satisfy both of these generalized notions is unclear as it becomes apparent in what follows below.

Weaknesses: My concerns about this paper concentrate on two main points. First, the main class of losses that the paper introduces, that of relative Lipschitz continuity (Def. 2.1) seems very closely related to that of Riemann Lipschitz Continuity (RLC) in Antonakopoulos et. al (2020). In particular, given that the losses are (RLC) then one can recover relative Lipschitz continuity via a direct combination of convexity and Cauchy-Schwartz inequality. Moreover, conversely every relative Lipschitz continuous loss can be seen as (RLC) if one chooses the respective Riemannian metric accordingly; this becomes even more evident for the example that the paper presents, if f(x)=x^{2} for x\in R, then one can straightforwardly choose the Riemannian metric in such a manner that the respective dual norm would be \|v\|_{x,\ast}=|v|/x and (RLC) follows. That said, this weakens significantly the contributions concerning FTRL and the like, since in Antonakopoulos et. al (2019) these results are already established. On the other hand, concerning the most intriguing part that of establishing logarithmic regret for the case where the loss functions are in addition relatively strongly convex, there is no obvious way to establish any relevant examples that satisfy simultaneously relative Lipschitz continuity and relative strong convexity, besides of course the euclidean ones. More precisely, assume that we have the static standard convex minimization problem and x^* is a solution. Then, relative strong convexity (Def 2.2) implies that for some constant M>0: <\nabla f(x),x-x^*>\geq M D(x^\ast,x) whereas due to Lu's (2019) relative continuity, we have for some L>0: <\nabla f(x),x-x^*>\leq L\sqrt{2D(x^*,x)} Hence, the above inequalities demand that the Bregman "distance" between a solution and the base point x remains bounded, which fails to be true for various practical examples like Poisson Inverse Problems and/or Support Vector Machines due to the singular behaviour of the regularizer (or its gradient) near the boundary of the feasible domain. That said, in order to justify the significance of the logarithmic regret result, the authors have to provide relevant examples of interest for this particular class of loss functions. In conclusion, in the blanket assumption part 2.1, the losses domain X is assumed to be closed. This assumption, despite the fact that it is typical for the Euclidean framework, a priori excludes various interesting problems (e.g. f(x)= -logx, x>0, the KL-divergence etc). Post rebuttal: The authors provided an example of a relatively continuous/relatively strongly convex function; therefore I increased my score. However, I think that the paper needs a revision according the following points: 1.Clarify the connection with existing work, in particular the paper of Antonakopoulos 2020. 2. Highlight in a more clear way the significance of logarithmic regret obtained for relatively continuous/relative strongly convex functions by providing a more extensive presentation and applications of the said class of fucntions.

Correctness: The proofs seem correct and sound. However, the issue mentioned above concerning the relation of relative strong convexity and relative Lipschitz continuity seems unjustified.

Clarity: The paper is well-written and easy to follow.

Relation to Prior Work: To the best of my knowledge, the most related work that deals with (OCO) problems where the respective losses are not Lipschitz continuous is that of Antonakopoulos et al (2020); in the said paper, the notion that extends the standard Lipschitz is that of Riemann Lipschitz continuity (RLC). On the other hand, in this paper. the key notion is that of relative Lipschitz continuity. In their introduction, the author(s) claim that the relation between these two notions is unclear. However, as I mentioned above, this is not true. More precisely, given (RLC) one can recover relative Lipschitz continuity via a direct combination of convexity and Cauchy-Schwartz inequality. Additionally, every relatively Lipschitz continuous loss function can be seen also as (RLC) under an. appropriately chosen Riemannian metric.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper extends known regret bounds for OCO ("Online Convex Optimization") to the case where adversary functions are "relatively Lipshitz continuous" or "relatively strongly convex." Informally, these two notions are generalizations of usual of Lipshitz continuity and strong convexity, where one replaces a norm with a Bregman divergence relative to a convex function R (the traditional setting corresponds to R=1/2||.||^2). Classical results establish that if adversary functions are Lipshitz continuous or strongly convex then certain algorithms can achieve regret of O(sqrt(T)) and O(log T) respectively. This paper shows similar results hold in the relative setting (Theorems 3.2 and 3.3). The algorithms that achieve these bounds are "follow the (regularized) leader". The authors also show that the O(sqrt(T)) bound still applies in case of more tractable (online) "dual averaging" algorithms. Finally, the authors prove that "Dual-Stabilized Online Mirror Descent" (DS-OMD) also achieves O(sqrt(T)) for relatively Lipshitz functions, and that (non-stabilized) OMD achieves O(log(T)) for relatively strongly convex functions.

Strengths: The paper is rigorous and clearly written and discusses important topics for the optimization community.

Weaknesses: My only concern is whether the results presented in the paper are sufficiently novel. In particular, the authors mention in the related work section that Antonakopoulos et al. also prove O(sqrt(T)) bounds for FTRL and OMD beyond traditional Lipschitz Contintuity (in the context of "Riemann Lipschitz Contintiity"). I would have liked a more careful comparison with this work. Is there any advantage of the "relative" framework compared to the "Riemannian" one?

Correctness: The paper is technically sound.

Clarity: The paper is clear and well written.

Relation to Prior Work: As mentioned above, I think the paper is missing a careful comparison with the work of Antonakopoulos et al.

Reproducibility: Yes

Additional Feedback: I enjoyed reading the paper even though I was not very familiar with the topic. Perhaps some motivational examples could be useful. In particular, the authors mention in the Introduction that not all loss functions that appear in applications satisfy the traditional strong convexity assumptions. Maybe the authors could expand on this point and provide a concrete setting where the more general theory is necessary.