{"title": "On the Universality of Online Mirror Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 2645, "page_last": 2653, "abstract": "We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee."}

On the Universality of Online Mirror Descent

Nathan Srebro (TTIC, nati@ttic.edu)
Karthik Sridharan (TTIC, karthik@ttic.edu)
Ambuj Tewari (University of Texas at Austin, ambuj@cs.utexas.edu)

Abstract

We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee.

1 Introduction

Mirror Descent is a first-order optimization procedure which generalizes the classic Gradient Descent procedure to non-Euclidean geometries by relying on a "distance generating function" specific to the geometry (the squared ℓ2-norm in the case of standard Gradient Descent) [14, 4]. Mirror Descent is also applicable, and has been analyzed, in a stochastic optimization setting [9] and in an online setting, where it can ensure bounded online regret [20]. In fact, many classical online learning algorithms can be viewed as instantiations or variants of Online Mirror Descent, generally either in the Euclidean geometry (e.g. the Perceptron algorithm [5] and Online Gradient Descent [27]) or in the simplex (ℓ1 geometry) using an entropic distance generating function (Winnow [13] and the Multiplicative Weights / Online Exponentiated Gradient algorithm [11]). More recently, the Online Mirror Descent framework has been applied, with appropriately derived distance generating functions, to a variety of new learning problems such as multi-task learning and other matrix learning problems [10], online PCA [26], etc.

In this paper, we show that Online Mirror Descent is, in a sense, universal.
That is, for any convex online learning problem of a general form (specified in Section 2), if the problem is online learnable at all, then it is online learnable, with a nearly optimal regret rate, using Online Mirror Descent with an appropriate distance generating function. Since Mirror Descent is a first-order method and often has simple and computationally efficient update rules, this makes the result especially attractive. Viewing online learning as a repeated game, this means that Online Mirror Descent is a near-optimal strategy, guaranteeing an outcome very close to the value of the game.

In order to show such universality, we first generalize and refine the standard Mirror Descent analysis to situations where the constraint set is not the dual of the data domain, obtaining a general upper bound on the regret of Online Mirror Descent in terms of the existence of an appropriate uniformly convex distance generating function (Section 3). We then extend the notion of martingale type of a Banach space to be sensitive to both the constraint set and the data domain, and, building on results of [24], we relate the value of the online learning repeated game to this generalized notion of martingale type (Section 4). Finally, again building on and generalizing the work of [16], we show how having the appropriate martingale type guarantees the existence of a good uniformly convex function (Section 5), which in turn establishes the desired nearly optimal guarantee on Online Mirror Descent (Section 6). We mainly build on the analysis of [24], who related the value of the online game to the notion of martingale type of a Banach space and to uniform convexity when the constraint set and data domain are dual to each other.
The main technical advance here is a non-trivial generalization of their analysis (as well as of the Mirror Descent analysis) to the more general situation where the constraint set and data domain are chosen independently of each other. In Section 7, several examples are provided that demonstrate the use of our analysis.

Mirror Descent was initially introduced as a first-order deterministic optimization procedure with an ℓp constraint and a matching ℓq Lipschitz assumption (1 ≤ p ≤ 2, 1/q + 1/p = 1), and was shown to be optimal in terms of the number of exact gradient evaluations [15]. Shalev-Shwartz and Singer later observed that the online version of Mirror Descent, again with an ℓp bound and matching ℓq Lipschitz assumption, is also optimal in terms of the worst-case (adversarial) online regret. In fact, in such scenarios stochastic Mirror Descent is also optimal in terms of the number of samples used. We emphasize that although in most, if not all, settings known to us these three notions of optimality coincide, here we focus only on the worst-case online regret.

Sridharan and Tewari [24] generalized the optimality of Online Mirror Descent (w.r.t. regret) to scenarios where the learner is constrained to the unit ball of an arbitrary Banach space (not necessarily an ℓp space) and the objective functions have sub-gradients that lie in the dual ball of the space; for reasons that will become clear shortly, we refer to this as the data domain. However, we often encounter problems where the constraint set and data domain are not dual balls, but rather are arbitrary convex subsets.
In this paper, we explore this more general, "non-dual" variant, and show that in such scenarios too, Online Mirror Descent is (nearly) optimal in terms of the (asymptotic) worst-case online regret.

2 Online Convex Learning Problem

An online convex learning problem can be viewed as a multi-round repeated game where on round t the learner first picks a vector (predictor) wt from some fixed set W, a closed convex subset of a vector space B. Next, the adversary picks a convex cost function ft : W → R from a class of convex functions F. At the end of the round, the learner pays the instantaneous cost ft(wt). We refer to the strategy used by the learner to pick the wt's as an online learning algorithm. More formally, an online learning algorithm A for the problem is specified by a mapping A : ∪_{n∈N} F^{n−1} → W. The regret of the algorithm A on a given sequence of cost functions f1, . . . , fn is

  Rn(A, f1, . . . , fn) = (1/n) Σ_{t=1}^n ft(A(f_{1:t−1})) − inf_{w∈W} (1/n) Σ_{t=1}^n ft(w).

The goal of the learner (or the online learning algorithm) is to minimize the regret for any n.

In this paper, we consider cost function classes F specified by a convex subset X ⊂ B⋆ of the dual space B⋆. We consider various types of classes, where for all of them the subgradients¹ of the functions in F lie inside X (we use the notation ⟨x, w⟩ to mean applying the linear functional x ∈ B⋆ to w ∈ B):

  F_Lip(X) = {f : f is convex, ∀w ∈ W, ∇f(w) ∈ X},
  F_lin(X) = {w ↦ ⟨x, w⟩ : x ∈ X},
  F_sup(X) = {w ↦ |⟨x, w⟩ − y| : x ∈ X, y ∈ [−b, b]}.

The value of the game is then the best possible worst-case regret guarantee an algorithm can enjoy.
Formally:

  Vn(F, X, W) = inf_A sup_{f_{1:n} ∈ F(X)} Rn(A, f_{1:n}).   (1)

It is well known that the value of the game is the same for all the above sets F. More generally:

Proposition 1. If for a convex function class F we have that ∀f ∈ F, w ∈ W, ∇f(w) ∈ X, then

  Vn(F, X, W) ≤ Vn(F_lin, X, W).

Furthermore,

  Vn(F_Lip, X, W) = Vn(F_sup, X, W) = Vn(F_lin, X, W).

That is, the value of any class F whose subgradients lie in X is upper bounded by the value of the class of linear functionals on W; see e.g. [1]. In particular, this includes the class F_Lip, which is the class of all convex functions with subgradients in X, and since F_lin(X) ⊂ F_Lip(X) we get the first equality. The second equality is shown in [18]. The class F_sup(X) corresponds to linear prediction with the absolute-difference loss, and thus its value is the best possible guarantee for online supervised learning with this loss. More generally, we can define a class F_ℓ = {ℓ(⟨x, w⟩, y) : x ∈ X, y ∈ [−b, b]} for any 1-Lipschitz loss ℓ, and this class is also of the desired type, with its value upper bounded by Vn(F_lin, X, W). In fact, this setting includes supervised learning fairly generally, including problems such as multitask learning and matrix completion, where in all cases X specifies the data domain².
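As a concrete illustration of the regret being compared in the linear game, the following sketch (our own illustrative code; the choice of W as the unit ℓ2 ball and all names are ours, not from the paper) computes Rn(A, f1, . . . , fn) for linear costs ft(w) = ⟨xt, w⟩:

```python
import numpy as np

def regret(plays, xs):
    """Average regret R_n(A, f_1, ..., f_n) of plays w_1, ..., w_n against
    linear costs f_t(w) = <x_t, w>, with the comparator class W taken to be
    the l2 unit ball (an illustrative choice; not fixed by the paper)."""
    n = len(xs)
    learner = sum(float(np.dot(x, w)) for w, x in zip(plays, xs)) / n
    # For W = {w : ||w||_2 <= 1}: inf_{w in W} (1/n) sum_t <x_t, w> = -||mean(x_t)||_2
    comparator = -float(np.linalg.norm(np.mean(np.asarray(xs, dtype=float), axis=0)))
    return learner - comparator
```

For instance, a learner that always plays w = 0 against the repeated cost x = (1, 0) incurs average regret 1, the distance to the best fixed predictor w = (−1, 0).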
The equality in the above proposition can also be extended to other commonly occurring convex loss function classes, such as the hinge loss class, at the cost of some extra constant factors.

¹Throughout we commit a slight abuse of notation, with ∇f(w) indicating some sub-gradient of f at w, and ∇f(w) ∈ X meaning that at least one of the sub-gradients is in X.
²Any convex supervised learning problem can be viewed as linear classification with some convex constraint W on predictors.

Owing to Proposition 1, we can focus our attention on the class F_lin (as the other two behave similarly), and use the shorthand

  Vn(W, X) := Vn(F_lin, X, W).   (2)

Henceforth the term "value" without any qualification refers to the value of the linear game. Further, for any p ∈ [1, 2] let

  Vp := inf{V : ∀n ∈ N, Vn(W, X) ≤ V n^{−(1−1/p)}}.   (3)

Most prior work on online learning and optimization considers the case when W is the unit ball of some Banach space and X is the unit ball of the dual space, i.e. W and X are related to each other through duality. In this work, however, we analyze the general problem where X ⊂ B⋆ is not necessarily the dual ball of W. It will be convenient for us to relate the notions of a convex set and a corresponding norm. The Minkowski functional of a subset K of a vector space V is defined as ∥v∥_K := inf{α > 0 : v ∈ αK}. Throughout this paper, we will require that W and X are convex and centrally symmetric. If K is convex and centrally symmetric (i.e. K = −K), then ∥·∥_K is a semi-norm; if, further, the set K is bounded, then ∥·∥_K is a norm.
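The Minkowski functional is easy to evaluate numerically from a membership oracle; a minimal sketch (our own illustrative code, assuming K is convex, bounded, and contains a neighborhood of the origin):

```python
import numpy as np

def minkowski_norm(v, in_K, hi=1e9, tol=1e-9):
    """||v||_K = inf{a > 0 : v in a*K}, computed by bisection on a, using the
    fact that for convex K containing 0, membership of v/a in K is monotone in a."""
    v = np.asarray(v, dtype=float)
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if in_K(v / mid):   # v in mid*K  <=>  v/mid in K
            hi = mid
        else:
            lo = mid
    return hi

# With K the l1 unit ball, ||.||_K recovers the l1 norm:
in_l1_ball = lambda v: float(np.sum(np.abs(v))) <= 1.0
```

For example, minkowski_norm((3, 4), in_l1_ball) converges to the ℓ1 norm, 7.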
Although not strictly required for our results, for simplicity we will assume that W and X are such that ∥·∥_W and ∥·∥_X (the Minkowski functionals of the sets W and X) are norms. Even though we do this for simplicity, we remark that all the results go through for semi-norms. We use X⋆ and W⋆ to represent the dual balls of X and W respectively, i.e. the unit balls of the dual norms ∥·∥*_X and ∥·∥*_W.

3 Mirror Descent and Uniform Convexity

A key tool in the analysis of Mirror Descent is the notion of strong convexity or, more generally, uniform convexity:

Definition 1. Ψ : B → R is q-uniformly convex w.r.t. ∥·∥ if for any w, w′ ∈ B:

  ∀α ∈ [0, 1],  Ψ(αw + (1 − α)w′) ≤ αΨ(w) + (1 − α)Ψ(w′) − (α(1 − α)/q) ∥w − w′∥^q.

We emphasize that in the definition above, the norm ∥·∥ and the subset W need not be related, and we only require uniform convexity inside W. This allows us to relate a norm with a non-matching "ball". To this end define

  Dp := inf{ (sup_{w∈W} Ψ(w))^{(p−1)/p} : Ψ : W → R+ is p/(p−1)-uniformly convex w.r.t. ∥·∥_{X⋆}, Ψ(0) = 0 }.

Given a function Ψ, the Mirror Descent algorithm A_MD is given by

  w_{t+1} = argmin_{w∈W} ΔΨ(w | wt) + η ⟨∇ft(wt), w − wt⟩,   (4)

or equivalently

  w′_{t+1} = ∇Ψ*(∇Ψ(wt) − η ∇ft(wt)),   w_{t+1} = argmin_{w∈W} ΔΨ(w | w′_{t+1}),   (5)

where ΔΨ(w | w′) := Ψ(w) − Ψ(w′) − ⟨∇Ψ(w′), w − w′⟩ is the Bregman divergence and Ψ* is the convex conjugate of Ψ. As an example, notice that when Ψ(w) = (1/2)∥w∥²_2 we get the Gradient Descent algorithm, and when W is the d-dimensional simplex and Ψ(w) = Σ_{i=1}^d wi log(wi) we get the Multiplicative Weights update algorithm.

Lemma 2. Let Ψ : B → R be non-negative and q-uniformly convex w.r.t. the norm ∥·∥_{X⋆}. For the Mirror Descent algorithm with this Ψ, using w1 = argmin_{w∈W} Ψ(w) and η = (sup_{w∈W} Ψ(w)/(nB))^{1/q}, we can guarantee that for any f1, . . . , fn s.t. (1/n) Σ_{t=1}^n ∥∇ft∥^p_X ≤ B,

  R(A_MD, f1, . . . , fn) ≤ 2 (B sup_{w∈W} Ψ(w) / n)^{1/q}   (where p = q/(q−1)).

Note that in our case we have ∇f ∈ X, i.e. ∥∇f∥_X ≤ 1, and so certainly (1/n) Σ_{t=1}^n ∥∇ft∥^p_X ≤ 1. Similarly to the value of the game, for any p ∈ [1, 2] we define

  MDp := inf{D : ∃Ψ, η s.t. ∀n ∈ N, sup_{f_{1:n} ∈ F(X)} Rn(A_MD, f_{1:n}) ≤ D n^{−(1−1/p)}},   (6)

where the Mirror Descent algorithm in the above definition is run with the corresponding Ψ and η.
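As a concrete instance of update (5), the following sketch (our own illustrative code) runs Online Mirror Descent with the entropic distance generating function on the simplex, with linear losses ft(w) = ⟨xt, w⟩, for which the mirror step has the familiar Multiplicative Weights closed form:

```python
import numpy as np

def omd_entropy(xs, eta):
    """Online Mirror Descent on the d-simplex with the entropic distance
    generating function Psi(w) = sum_i w_i log w_i, run on linear losses
    f_t(w) = <x_t, w> (so grad f_t = x_t). For this Psi the mirror step
    reduces to the Multiplicative Weights update w_{t+1} prop. to
    w_t * exp(-eta * x_t). Returns the sequence of plays w_1, ..., w_n."""
    d = len(xs[0])
    w = np.full(d, 1.0 / d)     # w_1 = argmin of Psi over the simplex
    plays = []
    for x in xs:
        plays.append(w.copy())
        w = w * np.exp(-eta * np.asarray(x, dtype=float))  # dual-space step
        w = w / w.sum()         # Bregman (KL) projection back onto the simplex
    return plays
```

Running it on a loss that repeatedly penalizes the second coordinate drives the play toward the first vertex of the simplex, as the regret bound of Lemma 2 (with the entropic Ψ, q = 2) predicts.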
The constant MDp is a characterization of the best guarantee the Mirror Descent algorithm can provide. Lemma 2 therefore implies:

Corollary 3. Vp ≤ MDp ≤ 2Dp.

Proof. The first inequality is by the definitions of Vp and MDp. The second inequality follows from the previous lemma.

The Mirror Descent bound suggests that as long as we can find an appropriate function Ψ that is uniformly convex w.r.t. ∥·∥_{X⋆}, we can get a diminishing regret guarantee. This suggests constructing the following function:

  Ψ̃_q := argmin_{ψ : ψ is q-uniformly convex w.r.t. ∥·∥_{X⋆} on W and ψ ≥ 0} sup_{w∈W} ψ(w).   (7)

If no q-uniformly convex function exists, then Ψ̃_q = ∞ by default. The above function is in a sense the best choice for the Mirror Descent bound of Lemma 2. The question then is: when can we find such appropriate functions, and what is the best rate we can guarantee using Mirror Descent?

4 Martingale Type and Value

In [24], it was shown that the concept of the martingale type (also sometimes called the Haar type) of a Banach space and optimal rates for the online convex optimization problem where X and W are duals of each other are closely related. In this section we extend the classic notion of martingale type of a Banach space (see for instance [16]) to one that accounts for the pair (W⋆, X). Before we proceed with the definitions, we introduce some necessary notation. First, throughout we use ε ∈ {±1}^N to represent an infinite sequence of signs drawn uniformly at random (i.e. each εi has equal probability of being +1 or −1). Also, throughout, (xn)_{n∈N} represents a sequence of mappings where each xn : {±1}^{n−1} → B⋆.
We commit to the abuse of notation of writing xn(ε) for xn(ε1, . . . , ε_{n−1}) (i.e. although we use the entire ε as the argument, xn only depends on the first n − 1 signs). We are now ready to give the extended definition of martingale type (or M-type) of a pair (W⋆, X).

Definition 2. A pair (W⋆, X) of subsets of a vector space B⋆ is said to be of M-type p if there exists a constant C ≥ 1 such that for all sequences of mappings (xn)_{n≥1}, where each xn : {±1}^{n−1} → B⋆, and any x0 ∈ B⋆:

  sup_n E[ ∥x0 + Σ_{i=1}^n εi xi(ε)∥^p_{W⋆} ] ≤ C^p ( ∥x0∥^p_X + Σ_{n≥1} E[∥xn(ε)∥^p_X] ).

The concept is called martingale type because (εn xn(ε))_{n∈N} is a martingale difference sequence, and it can be shown that the rate of convergence of martingales in Banach spaces is governed by the rate of convergence of martingales of the form Zn = x0 + Σ_{i=1}^n εi xi(ε) (which are incidentally called Walsh-Paley martingales). We point the reader to [16, 17] for more details. Further, for any p ∈ [1, 2] we also define

  Cp := inf{ C : ∀x0 ∈ B⋆, ∀(xn)_{n∈N}, sup_n E[∥x0 + Σ_{i=1}^n εi xi(ε)∥^p_{W⋆}] ≤ C^p (∥x0∥^p_X + Σ_{n≥1} E[∥xn(ε)∥^p_X]) }.

Cp is useful in determining whether the pair (W⋆, X) has M-type p. The results of [24, 18] showing that martingale type implies low regret actually apply also to "non-matching" W and X and, in our notation, imply that Vp ≤ 2Cp.
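To make Definition 2 concrete, both sides of its inequality can be estimated by simulation for a fixed choice of the mappings. The sketch below (purely illustrative; the pair, with W the ℓ1 ball so that ∥·∥_{W⋆} is the ℓ∞ norm and X the ℓ2 ball, and all names are our own) uses constant mappings xn(ε) = vn:

```python
import numpy as np

def mtype_sides(x0, vs, p=2, n_trials=2000, seed=0):
    """Monte-Carlo estimate of the two sides of the M-type inequality in
    Definition 2, for the illustrative choice of *constant* mappings
    x_n(eps) = v_n, with W the l1 ball (||.||_{W*} = l-infinity norm)
    and X the l2 ball (||.||_X = l2 norm)."""
    rng = np.random.default_rng(seed)
    vs = np.asarray(vs, dtype=float)
    lhs = 0.0
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=len(vs))
        z = x0 + eps @ vs                 # x_0 + sum_i eps_i x_i(eps)
        lhs += np.max(np.abs(z)) ** p     # ||.||_{W*}^p  (l_inf here)
    lhs /= n_trials                       # estimates E ||...||^p
    rhs = np.linalg.norm(x0) ** p + sum(np.linalg.norm(v) ** p for v in vs)
    return lhs, rhs
```

For this pair the inequality holds with C = 1 (since ∥·∥_∞ ≤ ∥·∥_2 and the εi are independent), so the estimated left side stays below the right side.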
Specifically, we have the following theorem from [24, 18]:

Theorem 4. [24, 18] For any W ⊂ B, any X ⊂ B⋆, and any n ≥ 1,

  sup_x (1/n) E[ ∥Σ_{i=1}^n εi xi(ε)∥_{W⋆} ] ≤ Vn(W, X) ≤ 2 sup_x (1/n) E[ ∥Σ_{i=1}^n εi xi(ε)∥_{W⋆} ],

where the suprema above are over sequences of mappings (xn)_{n≥1} with each xn : {±1}^{n−1} → X.

Our main interest here is in establishing that low regret implies martingale type. To do so, we start with the above theorem to relate the value of the online convex optimization game to the rate of convergence of martingales in the Banach space. We then extend the result of Pisier [16] to the "non-matching" setting and combine it with the above theorem to finally get:

Lemma 5. If for some r ∈ (1, 2] there exists a constant D > 0 such that for any n,

  Vn(W, X) ≤ D n^{−(1−1/r)},

then for all p < r, any x0 ∈ B⋆ and any sequence of mappings (xn)_{n≥1}, where each xn : {±1}^{n−1} → B⋆, will satisfy

  sup_n E[ ∥x0 + Σ_{i=1}^n εi xi(ε)∥^p_{W⋆} ] ≤ (1104 D / (r − p)²)^p ( ∥x0∥^p_X + Σ_{i≥1} E[∥xi(ε)∥^p_X] ).

That is, the pair (W⋆, X) is of martingale type p.

The following corollary is an easy consequence of the above lemma.

Corollary 6. For any p ∈ [1, 2] and any p′ < p: C_{p′} ≤ 1104 Vp / (p − p′)².

5 Uniform Convexity and Martingale Type

The classical notion of martingale type plays a central role in the study of the geometry of Banach spaces.
In [16], it was shown that a Banach space has martingale type p (in the classical sense) if and only if uniformly convex functions with certain properties exist on that space (w.r.t. the norm of that Banach space). In this section, we extend this result and show how the martingale type of a pair (W⋆, X) is related to the existence of certain uniformly convex functions. Specifically, the following lemma shows that the notion of martingale type of the pair (W⋆, X) is equivalent to the existence of a non-negative function that is uniformly convex w.r.t. the norm ∥·∥_{X⋆}.

Lemma 7. If, for some p ∈ (1, 2], there exists a constant C > 0 such that for all sequences of mappings (xn)_{n≥1}, where each xn : {±1}^{n−1} → B⋆, and any x0 ∈ B⋆,

  sup_n E[ ∥x0 + Σ_{i=1}^n εi xi(ε)∥^p_{W⋆} ] ≤ C^p ( ∥x0∥^p_X + Σ_{n≥1} E[∥xn(ε)∥^p_X] )

(i.e. (W⋆, X) has M-type p), then there exists a convex function Ψ : B → R+ with Ψ(0) = 0 that is q-uniformly convex w.r.t. the norm ∥·∥_{X⋆} (where q = p/(p−1)) and satisfies, for all w ∈ B,

  (1/q) ∥w∥^q_{X⋆} ≤ Ψ(w) ≤ (C^q/q) ∥w∥^q_W.

The following corollary follows directly from the above lemma.

Corollary 8. For any p ∈ [1, 2], Dp ≤ Cp.

The proof of Lemma 7 goes further and gives a specific uniformly convex function Ψ satisfying the desired requirement (i.e. 
establishing Dp ≤ Cp) under the assumptions of the previous lemma:

  Ψ*_q(x) := sup { (1/C^p) sup_n E[ ∥x + Σ_{i=1}^n εi xi(ε)∥^p_{W⋆} ] − Σ_{i≥1} E[∥xi(ε)∥^p_X] },   Ψ_q := (Ψ*_q)*,   (9)

where the supremum above is over sequences (xn)_{n∈N} and p = q/(q−1).

6 Optimality of Mirror Descent

In Section 3, we saw that if we can find an appropriate uniformly convex function to use in the Mirror Descent algorithm, we can guarantee diminishing regret. However, the pending question there was when we can find such a function and what rate we can guarantee. In Section 4, we introduced the extended notion of martingale type of a pair (W⋆, X) and related it to the value of the game. Then, in Section 5, we saw how the concept of M-type relates to the existence of certain uniformly convex functions. We can now combine these results to show that the Mirror Descent algorithm is a universal online learning algorithm for convex learning problems. Specifically, we show that whenever a problem is online learnable, the Mirror Descent algorithm can guarantee near-optimal rates:

Theorem 9. If for some constant V > 0 and some q ∈ [2, ∞), Vn(W, X) ≤ V n^{−1/q} for all n, then for any n > e^{q−1} there exists a regularizer function Ψ and a step size η such that the regret of the Mirror Descent algorithm using Ψ against any f1, . . . , fn chosen by the adversary is bounded as:

  Rn(A_MD, f_{1:n}) ≤ 6002 V log²(n) n^{−1/q}.   (10)

Proof. 
Combining the Mirror Descent guarantee in Lemma 2 and Lemma 7 with the lower bound in Lemma 5, applied with p = q/(q−1) − 1/log(n), we get the above statement.

The above theorem tells us that, with appropriate Ψ and learning rate η, Mirror Descent will obtain regret at most a factor of 6002 log²(n) from the best possible worst-case upper bound. We would like to point out that the constant V in the value of the game appears linearly, and there are no other problem- or space-related hidden constants in the bound. The following figure summarizes the relationship between the various constants: for any p′ < p,

  C_{p′} ≲ Vp ≤ MDp ≤ 2Dp ≤ 2Cp,

where C_{p′} ≲ Vp is Lemma 5 (extending Pisier's result [16]), Vp ≤ MDp is by the definition of Vp, MDp ≤ 2Dp is the generalized Mirror Descent guarantee (Lemma 2), and Dp ≤ Cp is via the construction of Ψ (Lemma 7, extending Pisier's result [16]). The step from C_{p′} back to Cp indicates that, for any n, all of these quantities are within a log² n factor of each other.

Figure 1: Relationship between the various constants

We now provide some general guidelines that help in picking an appropriate function Ψ for Mirror Descent. First, we note that although the function Ψq in the construction (9) need not be such that (qΨq(w))^{1/q} is a norm, a simple modification, as noted in [17], makes it one. This tells us that the pair (W, X) is online learnable if and only if we can sandwich a q-uniformly convex norm between ∥·∥_{X⋆} and a scaled version of ∥·∥_W (for some q < ∞). Also note that, by the definition of uniform convexity, if a function Ψ is q-uniformly convex w.r.t. some norm ∥·∥ and we have ∥·∥ ≥ c ∥·∥_{X⋆}, then Ψ(·)/c^q is q-uniformly convex w.r.t. ∥·∥_{X⋆}. These two observations together suggest that, given a pair (W, X), what we need to do is find a norm ∥·∥ lying between ∥·∥_{X⋆} and C ∥·∥_W (with C < ∞; the smaller the C, the better the bound) such that ∥·∥^q is q-uniformly convex w.r.t. ∥·∥.

7 Examples

We demonstrate our results on several online learning problems, specified by W and X.

ℓp non-dual pairs. It is usual in the literature to consider the case when W is the unit ball of the ℓp norm in some finite dimension d while X is taken to be the unit ball of the dual norm ℓq, where p, q are Hölder conjugate exponents. Using the machinery developed in this paper, it becomes effortless to consider the non-dual case where W is the unit ball B_{p1} of some ℓ_{p1} norm while X is the unit ball B_{p2}, for arbitrary p1, p2 ∈ [1, ∞]. We use q1 and q2 to denote the Hölder conjugates of p1 and p2. Before we proceed, we first note that for any r ∈ (1, 2], ψr(w) := (1/(2(r−1))) ∥w∥²_r is 2-uniformly convex w.r.t. the norm ∥·∥_r (see for instance [25]). On the other hand, by Clarkson's inequality, for r ∈ (2, ∞), ψr(w) := (2^r/r) ∥w∥^r_r is r-uniformly convex w.r.t. ∥·∥_r. Putting these together, for any r ∈ (1, ∞) the function ψr defined above is Q-uniformly convex w.r.t. ∥·∥_r for Q = max{r, 2}. The basic idea is to select ψr following the guidelines at the end of the previous section. We then show that using ψ̃r := d^{Q max{1/q2 − 1/r, 0}} ψr in the Mirror Descent guarantee of Lemma 2 yields the bound that for any f1, . . . , fn ∈ F:

  Rn(A_MD, f_{1:n}) ≤ 2 max{2, 1/√(2(r−1))} · d^{max{1/q2 − 1/r, 0} + max{1/r − 1/p1, 0}} · n^{−1/max{r,2}}.

The following table summarizes the scenarios where a value of r = 2, i.e. a rate of D2/√n, is possible, and lists the corresponding values of D2 (up to a numeric constant of at most 16); here q2 = p2/(p2−1):

  p1 range       q2 range        D2
  1 ≤ p1 ≤ 2     q2 > 2          1
  1 ≤ p1 ≤ 2     p1 ≤ q2 ≤ 2     √(p2 − 1)
  1 ≤ p1 ≤ 2     1 ≤ q2 < p1     d^{1/q2 − 1/p1}
  p1 > 2         q2 > 2          d^{1/2 − 1/p1}
  p1 > 2         1 ≤ q2 ≤ 2      d^{1/q2 − 1/p1} √(p2 − 1)
  1 ≤ p1 ≤ 2     q2 = ∞          √log(d)

Note that the first two rows are dimension-free, and so apply also in infinite-dimensional settings, whereas in the other scenarios D2 is finite only when the dimension is finite. An interesting phenomenon occurs when d = ∞, p1 > 2, and q2 ≥ p1. In this case D2 = ∞, and so one cannot expect a rate of O(1/√n); however, we have D_{p2} < 16, and so we can still get a rate of n^{−1/q2}.

Ball et al. [3] tightly calculate the constants of strong convexity of squared ℓp norms, establishing the tightness of D2 when p1 = p2. By extending their constructions, it is also possible to show tightness (up to a factor of 16) for all other values in the table. Also, Agarwal et al. [2] recently showed lower bounds on the sample complexity of stochastic optimization when p1 = ∞ and p2 is arbitrary; their lower bounds match the last two rows of the table.

Non-dual Schatten norm pairs in finite dimensions. Exactly the same analysis as above can be carried out for Schatten p-norms, i.e. when W = B_{S(p1)} and X = B_{S(p2)} are the unit balls of Schatten p-norms (the p-norm of the singular values) for matrices of dimensions d1 × d2.
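The 2-uniform convexity of ψr that underpins both the ℓp and Schatten constructions can be sanity-checked numerically against Definition 1. A small illustrative sketch for the vector case, r ∈ (1, 2] (function names are ours):

```python
import numpy as np

def unif_convex_gap(r, w, v, alpha):
    """Slack in Definition 1 (with q = 2) for psi_r(u) = ||u||_r^2 / (2(r-1)),
    r in (1, 2]. A non-negative return value at (w, v, alpha) is consistent
    with psi_r being 2-uniformly convex w.r.t. ||.||_r."""
    psi = lambda u: np.linalg.norm(u, ord=r) ** 2 / (2.0 * (r - 1.0))
    # alpha*psi(w) + (1-alpha)*psi(v) - psi(alpha*w + (1-alpha)*v)
    mix = alpha * psi(w) + (1 - alpha) * psi(v) - psi(alpha * w + (1 - alpha) * v)
    # ... minus the modulus (alpha*(1-alpha)/2) * ||w - v||_r^2
    return mix - (alpha * (1 - alpha) / 2.0) * np.linalg.norm(w - v, ord=r) ** 2
```

Sampling random points and mixing weights, the gap stays non-negative (up to floating-point error), as the Ball-Carlen-Lieb-type inequality cited above guarantees.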
We get the same results as in the table above (as upper bounds on D2), with d = min{d1, d2}. These results again follow by arguments similar to the ℓp case, using the tight constants for the strong convexity parameters of the Schatten norms from [3].

Non-dual group norm pairs in finite dimensions. In applications such as multitask learning, group norms such as ∥w∥_{q,1} are often used on matrices w ∈ R^{k×d}, where the (q, 1)-norm means taking the ℓ1-norm of the ℓq-norms of the columns of w. Popular choices include q = 2 and q = ∞. Here, it may be quite unnatural to use the dual norm (p, ∞) to define the space X where the data lives. For instance, we might want to consider W = B_{(q,1)} and X = B_{(∞,∞)} = B_∞. In such a case we can calculate that D2(W, X) = Θ(k^{1−1/q} √log(d)), using Ψ(w) = (1/(q + r − 2)) ∥w∥²_{q,r} with r = log(d)/(log(d) − 1).

Max norm. The max-norm has been proposed as a convex matrix regularizer for applications such as matrix completion [21]. In the online version of the matrix completion problem, at each time step one element of the matrix is revealed, corresponding to X being the set of all matrices with a single entry equal to 1 and the rest 0. Since we need X to be convex, we take the absolute convex hull of this set, and use as X the unit element-wise ℓ1 ball. Its dual norm is ∥W∥_{X⋆} = max_{i,j} |W_{i,j}|. On the other hand, given a matrix W, its max-norm is given by

  ∥W∥_max = min_{U,V : W = UV^⊤} (max_i ∥Ui∥_2)(max_j ∥Vj∥_2).

The set W is the unit ball under the max norm. As noted in [22], the max-norm ball is equivalent, up to a factor of two, to the convex hull of all rank-one sign matrices. Let us now make a more general observation. Consider any set W = abscvx({w1, . . . , wK}), the absolute convex hull of K points w1, . . . , wK ∈ B. In this case, the Minkowski norm for this W is given by ∥w∥_W := inf_{α1,...,αK : w = Σ_{i=1}^K αi wi} Σ_{i=1}^K |αi|. 
In this case, for any q ∈ (1, 2], if we define the norm ∥w∥_{W,q} := inf_{α1,...,αK : w = Σ_{i=1}^K αi wi} (Σ_{i=1}^K |αi|^q)^{1/q}, then the function Ψ(w) = (1/(2(q−1))) ∥w∥²_{W,q} is 2-uniformly convex w.r.t. ∥·∥_{W,q} (similarly to the ℓ1-ℓq case). Further, if we use q = log(K)/(log(K) − 1), then sup_{w∈W} √Ψ(w) = O(√log(K)), and so D2 = √log(K). For the max norm, ∥·∥_W is equivalent to the norm obtained by taking the absolute convex hull of the set of all rank-one sign matrices; the cardinality of this set is 2^{N+M} (for N × M matrices). Hence, using the above observation, and noting that X⋆ is the unit ball of |·|_∞, we see that Ψ is 2-uniformly convex w.r.t. ∥·∥_{X⋆}, and so we get a regret bound of O(√((M + N)/n)). This matches the stochastic (PAC) learning guarantee [22], and is the first guarantee we are aware of for the max-norm matrix completion problem in the online setting.

8 Conclusion and Discussion

In this paper we showed that for a general class of convex online learning problems, there always exists a distance generating function Ψ such that Mirror Descent using this function achieves a near-optimal regret guarantee. This shows that a fairly simple first-order method, in which each iteration requires a gradient computation and a prox-map computation, is sufficient for online learning in a very general sense. Of course, the main challenge is deriving distance generating functions appropriate for specific problems; although we give two mathematical expressions for such functions, in equations (7) and (9), neither is particularly tractable in general. At the end of Section 6 we do give some general guidelines for choosing the right distance generating function.
However, obtaining a more explicit and simple procedure, at least for reasonable Banach spaces, is a very interesting question.
Furthermore, for the Mirror Descent procedure to be efficient, the prox-map of the distance generating function must be efficiently computable, which means that even though a Mirror Descent procedure is always theoretically possible, in practice we might choose to use a non-optimal distance generating function, or even a non-MD procedure. We might also find other properties of w desirable, such as sparsity, which would bias us toward alternative methods [12, 7]. Nevertheless, in most instances that we are aware of, Mirror Descent, or a slight variation of it, is truly an optimal procedure, and this is formalized and rigorously established here.
In terms of the generality of the problems we handle, we required that the constraint set W be convex, but this seems unavoidable if we wish to obtain efficient algorithms (at least in general). Furthermore, we know that in terms of worst-case behavior, both in the stochastic and in the online setting, for convex cost functions, the value is unchanged when a non-convex constraint set is replaced by its convex hull [18]. The requirement that the data domain X be convex is perhaps more restrictive, since even with a non-convex data domain, the objective is still convex. Such non-convex X are certainly relevant in many applications, e.g. when the data is sparse, or when x ∈ X is an indicator, as in matrix completion problems and total variation regularization. In the total variation regularization problem, W is the set of all functions on the interval [0, 1] with total variation bounded by 1, which is in fact a Banach space. However, the set X we consider here is not the entire dual ball, and in fact is neither convex nor symmetric.
It consists only of evaluations of the functions in W at points of the interval [0, 1], and one can consider a supervised learning problem where the goal is to use the set of all functions with bounded variation to predict targets taking values in [−1, 1]. Although the total-variation problem is not learnable, the matrix completion problem certainly is of much interest. In the matrix completion case, taking the convex hull of X does not seem to change the value, but we are aware neither of a guarantee that the value of the game is unchanged when a non-convex X is replaced by its convex hull, nor of an example where the value does change; it would certainly be useful to understand this issue. We view the requirement that W and X be symmetric around the origin as less restrictive and mostly a matter of convenience.
We also focused on a specific form of the cost class F, which, beyond the almost unavoidable assumption of convexity, is taken to be constrained through the cost sub-gradients. This is general enough for considering supervised learning with an arbitrary convex loss in a worst-case setting, as the sub-gradients in this case exactly correspond to the data points, and so restricting F through its sub-gradients corresponds to restricting the data domain. Following Proposition 1, any optimality result for F_Lip also applies to F_sup, and this statement can also be easily extended to any other reasonable loss function, including the hinge loss, smooth loss functions such as the logistic loss, and even strongly-convex loss functions such as the squared loss (in this context, note that a strongly convex scalar loss for supervised learning does not translate to a strongly convex optimization problem in the worst case).
Going beyond a worst-case formulation of supervised learning, one might consider online repeated games with other constraints on F, such as strong convexity, or even constraints on {ft} as a sequence, such as requiring low average error or conditions on the covariance of the data; these are beyond the scope of the current paper.
Even in the statistical learning setting, online methods, combined with online-to-batch conversion, are often preferred due to their efficiency, especially in high-dimensional problems. In fact, for ℓp spaces in the dual case, using lower bounds on the sample complexity of statistical learning for these problems, one can show that for high-dimensional problems Mirror Descent is an optimal procedure even for the statistical learning problem. We would like to consider the question of whether Mirror Descent is optimal for the stochastic convex optimization (convex statistical learning) setting [9, 19, 23] in general. Establishing such universality would have significant implications, as it would indicate that any learnable (convex) problem is learnable using a one-pass first-order online method (i.e. a Stochastic Approximation approach).

References

[1] J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.
[2] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization.
[3] Keith Ball, Eric A. Carlen, and Elliott H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Invent. Math., 115:463–482, 1994.
[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[5] H. D. Block.
The perceptron: A model for brain functioning. Reviews of Modern Physics, 34:123–135, 1962. Reprinted in "Neurocomputing" by Anderson and Rosenfeld.
[6] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Sparse and low-rank matrix decompositions. In IFAC Symposium on System Identification, 2009.
[7] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[8] Ali Jalali, Pradeep Ravikumar, Sujay Sanghavi, and Chao Ruan. A dirty model for multi-task learning. In NIPS, December 2010.
[9] A. Juditsky, G. Lan, A. Nemirovski, and A. Shapiro. Stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.
[10] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization, 2010.
[11] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.
[12] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems 21, pages 905–912, 2009.
[13] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
[14] A. Nemirovski and D. Yudin. On Cesaro's convergence of the gradient descent method for finding saddle points of convex-concave functions. Doklady Akademii Nauk SSSR, 239(4), 1978.
[15] A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. Nauka Publishers, Moscow, 1978.
[16] G. Pisier. Martingales with values in uniformly convex spaces.
Israel Journal of Mathematics, 20(3–4):326–350, 1975.
[17] G. Pisier. Martingales in Banach spaces (in connection with type and cotype). Winter School/IHP Graduate Course, 2011.
[18] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. NIPS, 2010.
[19] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In COLT, 2009.
[20] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Advances in Neural Information Processing Systems, 19:1265, 2007.
[21] Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329–1336. MIT Press, 2005.
[22] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag, 2005.
[23] Nathan Srebro and Ambuj Tewari. Stochastic optimization for machine learning. In ICML 2010, tutorial, 2010.
[24] K. Sridharan and A. Tewari. Convex games in Banach spaces. In Proceedings of the 23rd Annual Conference on Learning Theory, 2010.
[25] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, Hebrew University of Jerusalem, 2007.
[26] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension, 2007.
[27] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
[28] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.