{"title": "Adaptive Online Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 65, "page_last": 72, "abstract": null, "full_text": "Adaptive Online Gradient Descent\n\nPeter L. Bartlett\n\nDivision of Computer Science\n\nDepartment of Statistics\n\nUC Berkeley\n\nBerkeley, CA 94709\n\nbartlett@cs.berkeley.edu\n\nElad Hazan\n\nIBM Almaden Research Center\n\n650 Harry Road\n\nSan Jose, CA 95120\n\nhazan@us.ibm.com\n\nAlexander Rakhlin \u2217\n\nDivision of Computer Science\n\nUC Berkeley\n\nBerkeley, CA 94709\n\nrakhlin@cs.berkeley.edu\n\nAbstract\n\nWe study the rates of growth of the regret in online convex optimization. First,\nwe show that a simple extension of the algorithm of Hazan et al eliminates the\nneed for a priori knowledge of the lower bound on the second derivatives of the\nobserved functions. We then provide an algorithm, Adaptive Online Gradient\nDescent, which interpolates between the results of Zinkevich for linear functions\nand of Hazan et al for strongly convex functions, achieving intermediate rates\nT and log T . Furthermore, we show strong optimality of the algorithm.\nbetween\nFinally, we provide an extension of our results to general norms.\n\n\u221a\n\n1 Introduction\n\nloss,PT\n\nt=1 ft(xt), is not much larger than the smallest total lossPT\n\nThe problem of online convex optimization can be formulated as a repeated game between a player\nand an adversary. At round t, the player chooses an action xt from some convex subset K of Rn,\nand then the adversary chooses a convex loss function ft. 
The player aims to ensure that the total loss, $\sum_{t=1}^T f_t(x_t)$, is not much larger than the smallest total loss, $\sum_{t=1}^T f_t(x)$, of any fixed action $x$. The difference between the total loss and its optimal value for a fixed action is known as the regret, which we denote

$$R_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x).$$

Many problems of online prediction of individual sequences can be viewed as special cases of online convex optimization, including prediction with expert advice, sequential probability assignment, and sequential investment [1]. A central question in all these cases is how the regret grows with the number of rounds of the game.

Zinkevich [2] considered the following gradient descent algorithm, with step size $\eta_t = \Theta(1/\sqrt{t})$. (Here, $\Pi_K(v)$ denotes the Euclidean projection of $v$ onto the convex set $K$.)

* Corresponding author.

Algorithm 1 Online Gradient Descent (OGD)
1: Initialize $x_1$ arbitrarily.
2: for $t = 1$ to $T$ do
3:   Predict $x_t$, observe $f_t$.
4:   Update $x_{t+1} = \Pi_K(x_t - \eta_{t+1} \nabla f_t(x_t))$.
5: end for

Zinkevich showed that the regret of this algorithm grows as $\sqrt{T}$, where $T$ is the number of rounds of the game. This rate cannot be improved in general for arbitrary convex loss functions. However, this is not the case if the loss functions are uniformly convex, for instance, if all $f_t$ have second derivative at least $H > 0$. Recently, Hazan et al. [3] showed that in this case it is possible for the regret to grow only logarithmically with $T$, using the same algorithm but with the smaller step size $\eta_t = 1/(Ht)$. Increasing convexity makes online convex optimization easier.

The algorithm that achieves logarithmic regret must know in advance a lower bound on the convexity of the loss functions, since this bound is used to determine the step size.
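For concreteness, Algorithm 1 with Zinkevich's step size can be sketched in a few lines of Python. This is an illustrative sketch only: the unit-ball constraint set and the linear losses are our assumptions, not part of the paper.

```python
import numpy as np

def project_ball(v, radius=1.0):
    # Euclidean projection onto {x : ||x|| <= radius}; K is a ball here for simplicity
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def ogd(grad_fns, dim, radius=1.0):
    # Algorithm 1 with step size eta_t = 1/sqrt(t), Zinkevich's choice
    x = np.zeros(dim)
    iterates = []
    for t, grad in enumerate(grad_fns, start=1):
        iterates.append(x.copy())          # predict x_t
        eta = 1.0 / np.sqrt(t)
        x = project_ball(x - eta * grad(x), radius)
    return iterates

# Illustrative adversary: linear losses f_t(x) = <g_t, x>, whose gradient is g_t
rng = np.random.default_rng(0)
gs = [rng.normal(size=2) for _ in range(200)]
xs = ogd([(lambda x, g=g: g) for g in gs], dim=2)

# Regret against the best fixed action in hindsight (minimizer of <sum g_t, x> on the ball)
G = sum(gs)
best = project_ball(-G / np.linalg.norm(G))
regret = sum(g @ x for g, x in zip(gs, xs)) - sum(g @ best for g in gs)
```

On such a run the average regret `regret / T` shrinks roughly like $1/\sqrt{T}$, matching the $\sqrt{T}$ total-regret rate discussed above.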
It is natural to ask if this is essential: is there an algorithm that can adapt to the convexity of the loss functions and achieve the same regret rates in both cases, $O(\log T)$ for uniformly convex functions and $O(\sqrt{T})$ for arbitrary convex functions? In this paper, we present an adaptive algorithm of this kind.

The key technique is regularization: we consider the online gradient descent (OGD) algorithm, but we add a uniformly convex function, the quadratic $\lambda_t \|x\|^2$, to each loss function $f_t(x)$. This corresponds to shrinking the algorithm's actions $x_t$ towards the origin. It leads to a regret bound of the form

$$R_T \le c \sum_{t=1}^T \lambda_t + p(\lambda_1, \ldots, \lambda_T).$$

The first term on the right-hand side can be viewed as a bias term; it increases with $\lambda_t$ because the presence of the regularization might lead the algorithm away from the optimum. The second term is a penalty for the flatness of the loss functions that becomes smaller as the regularization increases. We show that choosing the regularization coefficient $\lambda_t$ so as to balance these two terms in the bound on the regret up to round $t$ is nearly optimal in a strong sense. Not only does this choice give the $\sqrt{T}$ and $\log T$ regret rates in the linear and uniformly convex cases, it leads to a kind of oracle inequality: the regret is no more than a constant factor times the bound on regret that would have been suffered if an oracle had provided in advance the sequence of regularization coefficients $\lambda_1, \ldots, \lambda_T$ that minimizes the final regret bound.

To state this result precisely, we introduce the following definitions. Let $K$ be a convex subset of $\mathbb{R}^n$ and suppose that $\sup_{x \in K} \|x\| \le D$. For simplicity, throughout the paper we assume that $K$ is centered around $0$, and, hence, $2D$ is the diameter of $K$. Define a shorthand $\nabla_t = \nabla f_t(x_t)$.
Let $H_t$ be the largest value such that for any $x^* \in K$,

$$f_t(x^*) \ge f_t(x_t) + \nabla_t^\top (x^* - x_t) + \frac{H_t}{2} \|x^* - x_t\|^2. \qquad (1)$$

In particular, if $\nabla^2 f_t - H_t \cdot I \succeq 0$, then the above inequality is satisfied. Furthermore, suppose $\|\nabla_t\| \le G_t$. Define $\lambda_{1:t} := \sum_{s=1}^t \lambda_s$ and $H_{1:t} := \sum_{s=1}^t H_s$. Let $H_{1:0} = 0$. Let us now state the Adaptive Online Gradient Descent algorithm as well as the theoretical guarantee for its performance.

Algorithm 2 Adaptive Online Gradient Descent
1: Initialize $x_1$ arbitrarily.
2: for $t = 1$ to $T$ do
3:   Predict $x_t$, observe $f_t$.
4:   Compute $\lambda_t = \frac{1}{2}\left(\sqrt{(H_{1:t} + \lambda_{1:t-1})^2 + 8G_t^2/(3D^2)} - (H_{1:t} + \lambda_{1:t-1})\right)$.
5:   Compute $\eta_{t+1} = (H_{1:t} + \lambda_{1:t})^{-1}$.
6:   Update $x_{t+1} = \Pi_K(x_t - \eta_{t+1}(\nabla f_t(x_t) + \lambda_t x_t))$.
7: end for

Theorem 1.1. The regret of Algorithm 2 is bounded by

$$R_T \le 3 \inf_{\lambda_1^*, \ldots, \lambda_T^*} \left( D^2 \lambda_{1:T}^* + \sum_{t=1}^T \frac{(G_t + \lambda_t^* D)^2}{H_{1:t} + \lambda_{1:t}^*} \right).$$

While Algorithm 2 is stated with the squared Euclidean norm as a regularizer, we show that it is straightforward to generalize our technique to other regularization functions that are uniformly convex with respect to other norms. This leads to adaptive versions of the mirror descent algorithm analyzed recently in [4, 5].

2 Preliminary results

The following theorem gives a regret bound for the OGD algorithm with a particular choice of step size. The virtue of the theorem is that the step size can be set without knowledge of the uniform lower bound on $H_t$, which is required in the original algorithm of [3]. The proof is provided in Section 4 (Theorem 4.1), where the result is extended to arbitrary norms.

Theorem 2.1. Suppose we set $\eta_{t+1} = \frac{1}{H_{1:t}}$.
Then the regret of OGD is bounded as

$$R_T \le \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{H_{1:t}}.$$

In particular, loosening the bound,

$$2 R_T \le \frac{\max_t G_t^2}{\min_t \frac{1}{t} \sum_{s=1}^t H_s} \log T.$$

Note that nothing prevents $H_t$ from being negative or zero, implying that the same algorithm gives logarithmic regret even when some of the functions are linear or concave, as long as the partial averages $\frac{1}{t}\sum_{s=1}^t H_s$ are positive and not too small. The above result already provides an important extension to the log-regret algorithm of [3]: no prior knowledge on the uniform convexity of the functions is needed, and the bound is in terms of the observed sequence $\{H_t\}$. Yet, there is still a problem with the algorithm. If $H_1 > 0$ and $H_t = 0$ for all $t > 1$, then $\sum_{s=1}^t H_s = H_1$, resulting in a linear regret bound. However, we know from [2] that an $O(\sqrt{T})$ bound can be obtained. In the next section we provide an algorithm which interpolates between an $O(\log T)$ and an $O(\sqrt{T})$ bound on the regret depending on the curvature of the observed functions.

3 Adaptive Regularization

Suppose the environment plays a sequence of $f_t$'s with curvature $H_t \ge 0$. Instead of performing gradient descent on these functions, we step in the direction of the gradient of $\tilde{f}_t(x) = f_t(x) + \frac{1}{2}\lambda_t \|x\|^2$, where the regularization parameter $\lambda_t \ge 0$ is chosen appropriately at each step as a function of the curvature of the previous functions. We remind the reader that $K$ is assumed to be centered around the origin, for otherwise we would instead use $\|x - x_0\|^2$ to shrink the actions $x_t$ towards the origin $x_0$. Applying Theorem 2.1, we obtain the following result.

Theorem 3.1. If the Online Gradient Descent algorithm is performed on the functions $\tilde{f}_t(x) = f_t(x) + \frac{1}{2}\lambda_t\|x\|^2$ with

$$\eta_{t+1} = \frac{1}{H_{1:t} + \lambda_{1:t}}$$

for any sequence of non-negative $\lambda_1, \dots$
, $\lambda_T$, then

$$R_T \le \frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}.$$

Proof. By Theorem 2.1 applied to the functions $\tilde{f}_t$,

$$\sum_{t=1}^T \tilde{f}_t(x_t) - \min_{x \in K} \sum_{t=1}^T \tilde{f}_t(x) \le \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}.$$

Indeed, it is easy to verify that condition (1) for $f_t$ implies the corresponding statement with $\tilde{H}_t = H_t + \lambda_t$ for $\tilde{f}_t$. Furthermore, by linearity, the bound on the gradient of $\tilde{f}_t$ is $G_t + \lambda_t \|x_t\| \le G_t + \lambda_t D$. Define $x^* = \arg\min_{x \in K} \sum_{t=1}^T f_t(x)$. Then, dropping the $\|x_t\|^2$ terms and bounding $\|x^*\|^2 \le D^2$,

$$\sum_{t=1}^T f_t(x_t) \le \sum_{t=1}^T f_t(x^*) + \frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}},$$

which proves the theorem.

The following inequality is important in the rest of the analysis, as it allows us to remove the dependence on $\lambda_t$ from the numerator of the second sum at the expense of increased constants. We have

$$\frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}} \le \frac{1}{2} D^2 \lambda_{1:T} + \sum_{t=1}^T \frac{2 G_t^2 + 2 \lambda_t^2 D^2}{2 (H_{1:t} + \lambda_{1:t-1} + \lambda_t)} \le \frac{3}{2} D^2 \lambda_{1:T} + \sum_{t=1}^T \frac{G_t^2}{H_{1:t} + \lambda_{1:t}}, \qquad (2)$$

where the first inequality holds because $(a+b)^2 \le 2a^2 + 2b^2$ for any $a, b \in \mathbb{R}$.

It turns out that for appropriate choices of $\{\lambda_t\}$, the above theorem recovers the $O(\sqrt{T})$ bound on the regret for linear functions [2] and the $O(\log T)$ bound for strongly convex functions [3]. Moreover, under specific assumptions on the sequence $\{H_t\}$, we can define a sequence $\{\lambda_t\}$ which produces intermediate rates between $\log T$ and $\sqrt{T}$.
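To make the regularized update concrete, the full loop of Algorithm 2 can be sketched as below. This is an illustrative sketch, not the authors' code: the ball constraint set and the per-round curvature and gradient-bound inputs are our assumptions.

```python
import math
import numpy as np

def project_ball(v, D=1.0):
    # Euclidean projection onto {x : ||x|| <= D}
    n = np.linalg.norm(v)
    return v if n <= D else v * (D / n)

def adaptive_ogd(grad_fns, curvatures, grad_bounds, dim, D=1.0):
    # Sketch of Algorithm 2: OGD on f_t(x) + (lambda_t / 2) ||x||^2, where lambda_t
    # is the non-negative root of (3/2) D^2 lambda_t = G_t^2 / (H_{1:t} + lambda_{1:t})
    x = np.zeros(dim)
    H_sum = 0.0     # H_{1:t}
    lam_sum = 0.0   # lambda_{1:t-1}
    iterates = []
    for grad, H_t, G_t in zip(grad_fns, curvatures, grad_bounds):
        iterates.append(x.copy())                       # predict x_t
        H_sum += H_t
        a = H_sum + lam_sum                             # H_{1:t} + lambda_{1:t-1}
        lam = 0.5 * (math.sqrt(a * a + 8.0 * G_t ** 2 / (3.0 * D ** 2)) - a)
        lam_sum += lam                                  # now lambda_{1:t}
        eta = 1.0 / (H_sum + lam_sum)
        x = project_ball(x - eta * (grad(x) + lam * x), D)
    return iterates

# Flat (linear) losses: H_t = 0 and G_t = 1, the regime where regularization matters
gs = [np.array([1.0, 0.0]) if t % 2 == 0 else np.array([-1.0, 0.0]) for t in range(50)]
xs = adaptive_ogd([(lambda x, g=g: g) for g in gs], [0.0] * 50, [1.0] * 50, dim=2)
```

Note that even with $H_t = 0$ throughout, the first balancing step makes $\lambda_1 > 0$, so the step size $\eta_{t+1}$ is always finite.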
These results are exhibited in corollaries at the end of this section.

Of course, it would be nice to be able to choose $\{\lambda_t\}$ adaptively without any restrictive assumptions on $\{H_t\}$. Somewhat surprisingly, such a choice can be made near-optimally by simple local balancing. Observe that the upper bound of Eq. (2) consists of two sums: $D^2 \sum_{t=1}^T \lambda_t$ and $\sum_{t=1}^T \frac{G_t^2}{H_{1:t} + \lambda_{1:t}}$. The first sum increases in any particular $\lambda_t$ and the other decreases. While the influence of the regularization parameters $\lambda_t$ on the first sum is trivial, the influence on the second sum is more involved, as all terms for $t \ge t_0$ depend on $\lambda_{t_0}$. Nevertheless, it turns out that a simple choice of $\lambda_t$ is optimal to within a multiplicative factor of 2. This is exhibited by the next lemma.

Lemma 3.1. Define

$$\mathcal{H}_T(\{\lambda_t\}) = \mathcal{H}_T(\lambda_1, \ldots, \lambda_T) = \lambda_{1:T} + \sum_{t=1}^T \frac{C_t}{H_{1:t} + \lambda_{1:t}},$$

where $C_t \ge 0$ does not depend on the $\lambda_t$'s. If $\lambda_t$ satisfies $\lambda_t = \frac{C_t}{H_{1:t} + \lambda_{1:t}}$ for $t = 1, \ldots, T$, then

$$\mathcal{H}_T(\{\lambda_t\}) \le 2 \inf_{\{\lambda_t^*\} \ge 0} \mathcal{H}_T(\{\lambda_t^*\}).$$

Proof. We prove this by induction. Let $\{\lambda_t^*\}$ be the optimal sequence of non-negative regularization coefficients. The base of the induction is proved by considering two possibilities: either $\lambda_1 < \lambda_1^*$ or not. In the first case, $\lambda_1 + C_1/(H_1 + \lambda_1) = 2\lambda_1 \le 2\lambda_1^* \le 2(\lambda_1^* + C_1/(H_1 + \lambda_1^*))$. The other case is proved similarly. Now, suppose

$$\mathcal{H}_{T-1}(\{\lambda_t\}) \le 2\,\mathcal{H}_{T-1}(\{\lambda_t^*\}).$$

Consider two possibilities.
If $\lambda_{1:T} < \lambda^*_{1:T}$, then

$$\mathcal{H}_T(\{\lambda_t\}) = \lambda_{1:T} + \sum_{t=1}^T \frac{C_t}{H_{1:t} + \lambda_{1:t}} = 2\lambda_{1:T} \le 2\lambda^*_{1:T} \le 2\,\mathcal{H}_T(\{\lambda_t^*\}).$$

If, on the other hand, $\lambda_{1:T} \ge \lambda^*_{1:T}$, then

$$\lambda_T + \frac{C_T}{H_{1:T} + \lambda_{1:T}} = \frac{2 C_T}{H_{1:T} + \lambda_{1:T}} \le \frac{2 C_T}{H_{1:T} + \lambda^*_{1:T}} \le 2\left( \lambda_T^* + \frac{C_T}{H_{1:T} + \lambda^*_{1:T}} \right).$$

Using the inductive assumption, we obtain $\mathcal{H}_T(\{\lambda_t\}) \le 2\,\mathcal{H}_T(\{\lambda_t^*\})$.

The lemma above is the key to the proof of the near-optimal bounds for Algorithm 2.¹

Proof (of Theorem 1.1). By Eq. (2) and Lemma 3.1,

$$R_T \le \frac{3}{2} D^2 \lambda_{1:T} + \sum_{t=1}^T \frac{G_t^2}{H_{1:t} + \lambda_{1:t}} \le \inf_{\lambda_1^*, \ldots, \lambda_T^*} \left( 3 D^2 \lambda^*_{1:T} + 2 \sum_{t=1}^T \frac{G_t^2}{H_{1:t} + \lambda^*_{1:t}} \right) \le 6 \inf_{\lambda_1^*, \ldots, \lambda_T^*} \left( \frac{1}{2} D^2 \lambda^*_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t^* D)^2}{H_{1:t} + \lambda^*_{1:t}} \right),$$

provided the $\lambda_t$ are chosen as solutions to

$$\frac{3}{2} D^2 \lambda_t = \frac{G_t^2}{H_{1:t} + \lambda_{1:t-1} + \lambda_t}. \qquad (3)$$

It is easy to verify that

$$\lambda_t = \frac{1}{2}\left( \sqrt{(H_{1:t} + \lambda_{1:t-1})^2 + 8 G_t^2/(3 D^2)} - (H_{1:t} + \lambda_{1:t-1}) \right)$$

is the non-negative root of the above quadratic equation. We note that division by zero in Algorithm 2 occurs only if $\lambda_1 = H_1 = G_1 = 0$. Without loss of generality, $G_1 \neq 0$, for otherwise $x_1$ is minimizing $f_1(x)$ and regret is negative on that round.

Hence, the algorithm has a bound on the performance which is 6 times the bound obtained by the best offline adaptive choice of regularization coefficients.
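As a quick numerical sanity check (illustrative, not part of the paper), the closed-form $\lambda_t$ above can be verified to be the non-negative root of the balance equation (3):

```python
import math

def balanced_lambda(a, G, D):
    # Non-negative root of (3/2) D^2 * lam * (a + lam) = G^2,
    # where a stands for H_{1:t} + lambda_{1:t-1}
    return 0.5 * (math.sqrt(a * a + 8.0 * G * G / (3.0 * D * D)) - a)

# Check the balance (3/2) D^2 lam = G^2 / (a + lam) for a few illustrative values
for a, G, D in [(0.0, 1.0, 1.0), (2.5, 0.7, 3.0), (10.0, 5.0, 0.5)]:
    lam = balanced_lambda(a, G, D)
    assert lam >= 0.0
    assert abs(1.5 * D * D * lam - G * G / (a + lam)) < 1e-9
```

With $a = G = 0$ the root degenerates to $\lambda_t = 0$, which is exactly the division-by-zero corner case ruled out in the proof.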
While the constant 6 might not be optimal, it can be shown that a constant strictly larger than one is unavoidable (see the footnote below). We also remark that if the diameter $D$ is unknown, the regularization coefficients $\lambda_t$ can still be chosen by balancing as in Eq. (3), except without the $D^2$ term. This choice of $\lambda_t$, however, increases the bound on the regret suffered by Algorithm 2 by a factor of $O(D^2)$.

Let us now consider some special cases and show that Theorem 1.1 not only recovers the rates of increase of regret of [3] and [2], but also provides intermediate rates. For each of these special cases, we provide a sequence of $\{\lambda_t\}$ which achieves the desired rates. Since Theorem 1.1 guarantees that Algorithm 2 is competitive with the best choice of the parameters, we conclude that Algorithm 2 achieves the same rates.

Corollary 3.1. Suppose $G_t \le G$ for all $1 \le t \le T$. Then for any sequence of convex functions $\{f_t\}$, the bound on regret of Algorithm 2 is $O(\sqrt{T})$.

Proof. Let $\lambda_1 = \sqrt{T}$ and $\lambda_t = 0$ for $1 < t \le T$. By Eq. (2),

$$\frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}} \le \frac{3}{2} D^2 \lambda_{1:T} + \sum_{t=1}^T \frac{G_t^2}{H_{1:t} + \lambda_{1:t}} \le \frac{3}{2} D^2 \sqrt{T} + \sum_{t=1}^T \frac{G^2}{\sqrt{T}} = \left( \frac{3}{2} D^2 + G^2 \right) \sqrt{T}.$$

Hence, the regret of Algorithm 2 can never increase faster than $\sqrt{T}$. We now consider the assumptions of [3].

¹Lemma 3.1 effectively describes an algorithm for an online problem with competitive ratio of 2. In the full version of this paper we give a lower bound strictly larger than one on the competitive ratio achievable by any online algorithm for this problem.

Corollary 3.2. Suppose $H_t \ge H > 0$ and $G_t^2 < G$ for all $1 \le t \le T$.
Then the bound on regret of Algorithm 2 is $O(\log T)$.

Proof. Set $\lambda_t = 0$ for all $t$. It holds that

$$R_T \le \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{H_{1:t}} \le \frac{1}{2} \sum_{t=1}^T \frac{G}{tH} \le \frac{G}{2H} (\log T + 1).$$

The above proof also recovers the result of Theorem 2.1. The following corollary shows a spectrum of rates under assumptions on the curvature of the functions.

Corollary 3.3. Suppose $H_t = t^{-\alpha}$ and $G_t \le G$ for all $1 \le t \le T$.

1. If $\alpha = 0$, then $R_T = O(\log T)$.
2. If $\alpha > 1/2$, then $R_T = O(\sqrt{T})$.
3. If $0 < \alpha \le 1/2$, then $R_T = O(T^{\alpha})$.

Proof. The first two cases follow immediately from Corollaries 3.2 and 3.1, respectively. For the third case, let $\lambda_1 = T^{\alpha}$ and $\lambda_t = 0$ for $1 < t \le T$. Note that $\sum_{s=1}^t H_s \ge \int_{x=0}^{t-1} (x+1)^{-\alpha} \, dx = (1-\alpha)^{-1} t^{1-\alpha} - (1-\alpha)^{-1}$. Hence,

$$\frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}} \le \frac{3}{2} D^2 \lambda_{1:T} + \sum_{t=1}^T \frac{G_t^2}{H_{1:t} + \lambda_{1:t}} \le 2 D^2 T^{\alpha} + G^2 (1-\alpha) \sum_{t=1}^T \frac{1}{t^{1-\alpha} - 1} \le 2 D^2 T^{\alpha} + \frac{2 G^2}{\alpha} T^{\alpha} + O(1) = O(T^{\alpha}).$$

4 Generalization to different norms

The original online gradient descent (OGD) algorithm as analyzed by Zinkevich [2] used the Euclidean distance of the current point from the optimum as a potential function. The logarithmic regret bounds of [3] for strongly convex functions were also stated for the Euclidean norm, and such was the presentation above. However, as observed by Shalev-Shwartz and Singer in [5], the proof technique of [3] extends to arbitrary norms. As such, our results above for adaptive regularization carry over to the general setting, as we state below. Our notation follows that of Gentile and Warmuth [6].

Definition 4.1.
A function $g$ over a convex set $K$ is called $H$-strongly convex with respect to a convex function $h$ if

$$\forall x, y \in K: \quad g(x) \ge g(y) + \nabla g(y)^\top (x - y) + \frac{H}{2} B_h(x, y).$$

Here $B_h(x, y)$ is the Bregman divergence with respect to the function $h$, defined as

$$B_h(x, y) = h(x) - h(y) - \nabla h(y)^\top (x - y).$$

This notion of strong convexity generalizes the Euclidean notion: the function $g(x) = \|x\|_2^2$ is strongly convex with respect to $h(x) = \|x\|_2^2$ (in this case $B_h(x, y) = \|x - y\|_2^2$). More generally, the Bregman divergence can be thought of as a squared norm, not necessarily Euclidean, i.e., $B_h(x, y) = \|x - y\|^2$. Henceforth we also refer to the dual norm of a given norm, defined by $\|y\|_* = \sup_{\|x\| \le 1} \{y^\top x\}$. For the case of $\ell_p$ norms, we have $\|y\|_* = \|y\|_q$ where $q$ satisfies $\frac{1}{p} + \frac{1}{q} = 1$, and by Hölder's inequality $y^\top x \le \|y\|_* \|x\| \le \frac{1}{2}\|y\|_*^2 + \frac{1}{2}\|x\|^2$ (this holds for norms other than $\ell_p$ as well).

For simplicity, the reader may think of the functions $g, h$ as convex and differentiable². The following algorithm is a generalization of the OGD algorithm to general strongly convex functions (see the derivation in [6]). In this extended abstract we state the update rule implicitly, leaving the issues of efficient computation for the full version (these issues are orthogonal to our discussion, and were addressed in [6] for a variety of functions $h$).

Algorithm 3 General-Norm Online Gradient Descent
1: Input: convex function $h$
2: Initialize $x_1$ arbitrarily.
3: for $t = 1$ to $T$ do
4:   Predict $x_t$, observe $f_t$.
5:   Compute $\eta_{t+1}$ and let $y_{t+1}$ be such that $\nabla h(y_{t+1}) = \nabla h(x_t) - 2\eta_{t+1} \nabla f_t(x_t)$.
6:   Let $x_{t+1} = \arg\min_{x \in K} B_h(x, y_{t+1})$ be the projection of $y_{t+1}$ onto $K$.
7: end for

The methods of the previous sections can now be used to derive similar, dynamically optimal, bounds on the regret.
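As one concrete instance of the implicit update in Algorithm 3 (an illustrative choice of $h$, not taken from the paper): with $h(x) = \sum_i x_i \log x_i$ over the probability simplex, the rule $\nabla h(y_{t+1}) = \nabla h(x_t) - 2\eta_{t+1} \nabla f_t(x_t)$ solves to a multiplicative update, and the Bregman (KL) projection onto the simplex is simply renormalization:

```python
import numpy as np

def general_norm_ogd_entropy(grad_fns, etas, dim):
    # Algorithm 3 with h(x) = sum_i x_i log x_i on the probability simplex:
    # grad h(x)_i = 1 + log x_i, so the implicit step gives y_i = x_i * exp(-2 eta g_i),
    # and arg-min of B_h(x, y) over the simplex is y / sum(y)
    x = np.full(dim, 1.0 / dim)
    iterates = []
    for grad, eta in zip(grad_fns, etas):
        iterates.append(x.copy())            # predict x_t
        y = x * np.exp(-2.0 * eta * grad(x))
        x = y / y.sum()                      # KL projection onto the simplex
    return iterates

# Linear losses f_t(x) = <g, x> that penalize coordinates 0 and 1
g = np.array([1.0, 1.0, 0.0])
xs = general_norm_ogd_entropy([(lambda x: g)] * 30, [0.1] * 30, dim=3)
```

After a few rounds the iterate concentrates on the unpenalized coordinate, the familiar exponentiated-gradient behavior; the factor 2 in the exponent is kept as stated in the algorithm.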
As a first step, let us generalize the bound of [3], as well as Theorem 2.1, to general norms:

Theorem 4.1. Suppose that, for each $t$, $f_t$ is an $H_t$-strongly convex function with respect to $h$, and let $h$ be such that $B_h(x, y) \ge \|x - y\|^2$ for some norm $\|\cdot\|$. Let $\|\nabla f_t(x_t)\|_* \le G_t$ for all $t$. Applying the General-Norm Online Gradient Descent algorithm with $\eta_{t+1} = \frac{1}{H_{1:t}}$, we have

$$R_T \le \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{H_{1:t}}.$$

Proof. The proof follows [3], with the Bregman divergence replacing the Euclidean distance as a potential function. By assumption on the functions $f_t$, for any $x^* \in K$,

$$f_t(x_t) - f_t(x^*) \le \nabla f_t(x_t)^\top (x_t - x^*) - \frac{H_t}{2} B_h(x^*, x_t).$$

By a well-known property of Bregman divergences (see [6]), it holds that for any vectors $x, y, z$,

$$(x - y)^\top (\nabla h(z) - \nabla h(y)) = B_h(x, y) - B_h(x, z) + B_h(y, z).$$

Combining both observations,

$$2(f_t(x_t) - f_t(x^*)) \le 2 \nabla f_t(x_t)^\top (x_t - x^*) - H_t B_h(x^*, x_t)$$
$$= \frac{1}{\eta_{t+1}} (\nabla h(y_{t+1}) - \nabla h(x_t))^\top (x^* - x_t) - H_t B_h(x^*, x_t)$$
$$= \frac{1}{\eta_{t+1}} \left[ B_h(x^*, x_t) - B_h(x^*, y_{t+1}) + B_h(x_t, y_{t+1}) \right] - H_t B_h(x^*, x_t)$$
$$\le \frac{1}{\eta_{t+1}} \left[ B_h(x^*, x_t) - B_h(x^*, x_{t+1}) + B_h(x_t, y_{t+1}) \right] - H_t B_h(x^*, x_t),$$

where the last inequality follows from the Pythagorean theorem for Bregman divergences [6], as $x_{t+1}$ is the projection w.r.t. the Bregman divergence of $y_{t+1}$ and $x^* \in K$ is in the convex set. Summing over all iterations and recalling that $\eta_{t+1} = \frac{1}{H_{1:t}}$,

$$2 R_T \le \sum_{t=2}^T B_h(x^*, x_t) \left( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} - H_t \right) + B_h(x^*, x_1) \left( \frac{1}{\eta_2} - H_1 \right) + \sum_{t=1}^T \frac{1}{\eta_{t+1}} B_h(x_t, y_{t+1}) \qquad (4)$$
$$= \sum_{t=1}^T \frac{1}{\eta_{t+1}} B_h(x_t, y_{t+1}).$$

²Since the set of points of
nondifferentiability of convex functions has measure zero, convexity is the only property that we require. Indeed, for nondifferentiable functions, the algorithm would choose a point $\tilde{x}_t$, which is $x_t$ with the addition of a small random perturbation. With probability one, the functions would be smooth at the perturbed point, and the perturbation could be made arbitrarily small so that the regret rate would not be affected.

We proceed to bound $B_h(x_t, y_{t+1})$. By definition of the Bregman divergence, and the dual-norm inequality stated before,

$$B_h(x_t, y_{t+1}) + B_h(y_{t+1}, x_t) = (\nabla h(x_t) - \nabla h(y_{t+1}))^\top (x_t - y_{t+1}) = 2 \eta_{t+1} \nabla f_t(x_t)^\top (x_t - y_{t+1}) \le \eta_{t+1}^2 \|\nabla_t\|_*^2 + \|x_t - y_{t+1}\|^2.$$

Thus, by our assumption $B_h(x, y) \ge \|x - y\|^2$, we have

$$B_h(x_t, y_{t+1}) \le \eta_{t+1}^2 \|\nabla_t\|_*^2 + \|x_t - y_{t+1}\|^2 - B_h(y_{t+1}, x_t) \le \eta_{t+1}^2 \|\nabla_t\|_*^2.$$

Plugging back into Eq. (4), we get

$$R_T \le \frac{1}{2} \sum_{t=1}^T \eta_{t+1} G_t^2 = \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{H_{1:t}}.$$

The generalization of our technique is now straightforward. Let $A^2 = \sup_{x \in K} g(x)$ and $2B = \sup_{x \in K} \|\nabla g(x)\|_*$. The following algorithm is an analogue of Algorithm 2, and Theorem 4.2 is the analogue of Theorem 1.1 for general norms.

Algorithm 4 Adaptive General-Norm Online Gradient Descent
1: Initialize $x_1$ arbitrarily. Let $g(x)$ be 1-strongly convex with respect to the convex function $h$.
2: for $t = 1$ to $T$ do
3:   Predict $x_t$, observe $f_t$.
4:   Compute $\lambda_t = \frac{1}{2}\left( \sqrt{(H_{1:t} + \lambda_{1:t-1})^2 + 8 G_t^2/(A^2 + 2 B^2)} - (H_{1:t} + \lambda_{1:t-1}) \right)$.
5:   Compute $\eta_{t+1} = (H_{1:t} + \lambda_{1:t})^{-1}$.
6:   Let $y_{t+1}$ be such that $\nabla h(y_{t+1}) = \nabla h(x_t) - 2\eta_{t+1}(\nabla f_t(x_t) + \frac{\lambda_t}{2} \nabla g(x_t))$.
7:   Let $x_{t+1} = \arg\min_{x \in K} B_h(x, y_{t+1})$ be the projection of $y_{t+1}$ onto $K$.
8: end for

Theorem 4.2.
Suppose that each $f_t$ is an $H_t$-strongly convex function with respect to $h$, and let $g$ be 1-strongly convex with respect to $h$. Let $h$ be such that $B_h(x, y) \ge \|x - y\|^2$ for some norm $\|\cdot\|$. Let $\|\nabla f_t(x_t)\|_* \le G_t$. The regret of Algorithm 4 is bounded by

$$R_T \le \inf_{\lambda_1^*, \ldots, \lambda_T^*} \left( (A^2 + 2 B^2) \lambda^*_{1:T} + \sum_{t=1}^T \frac{(G_t + \lambda_t^* B)^2}{H_{1:t} + \lambda^*_{1:t}} \right).$$

If the norm in the above theorem is the Euclidean norm and $g(x) = \|x\|^2$, we find that $D = \sup_{x \in K} \|x\| = A = B$ and recover the results of Theorem 1.1.

References

[1] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[2] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.

[3] Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal. Logarithmic regret algorithms for online convex optimization. In COLT, pages 499-513, 2006.

[4] Shai Shalev-Shwartz and Yoram Singer. Convex repeated games and Fenchel duality. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.

[5] Shai Shalev-Shwartz and Yoram Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical Report 2007-42, The Hebrew University, 2007.

[6] C. Gentile and M. K. Warmuth. Proving relative loss bounds for on-line learning algorithms using Bregman divergences. In COLT, 2000. Tutorial.
", "award": [], "sourceid": 699, "authors": [{"given_name": "Elad", "family_name": "Hazan", "institution": null}, {"given_name": "Alexander", "family_name": "Rakhlin", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}]}