{"title": "Minimax Optimal Algorithms for Unconstrained Linear Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2724, "page_last": 2732, "abstract": "We design and analyze minimax-optimal algorithms for online linear optimization games where the player's choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. The standard benchmark is the loss of the best strategy chosen from a bounded comparator set, whereas we consider a broad range of benchmark functions. We consider the problem as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player's and the adversary's optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game.", "full_text": "Minimax Optimal Algorithms for Unconstrained Linear Optimization\n\nH. Brendan McMahan\nGoogle Research\nSeattle, WA\nmcmahan@google.com\n\nJacob Abernethy*\nComputer Science and Engineering\nUniversity of Michigan\njabernet@umich.edu\n\nAbstract\n\nWe design and analyze minimax-optimal algorithms for online linear optimization games where the player\u2019s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. 
While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player\u2019s and the adversary\u2019s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game.\n\n1 Introduction\n\nMinimax analysis has recently been shown to be a powerful tool for the construction of online learning algorithms [Rakhlin et al., 2012]. Generally, these results use bounds on the value of the game (often based on the sequential Rademacher complexity) in order to construct efficient algorithms. In this work, we show that when the learner is unconstrained, it is often possible to efficiently compute an exact minimax strategy for both the player and nature. Moreover, with our tools we can analyze a much broader range of problems than have been previously considered.\n\nWe consider a game where on each round t = 1, . . . , T, first the learner selects x_t \u2208 R^n, and then an adversary chooses g_t \u2208 \ud835\udca2 \u2282 R^n, and the learner suffers loss g_t \u00b7 x_t. The goal of the learner is to minimize regret, that is, loss in excess of that achieved by a post-hoc benchmark strategy. We define\n\nRegret = Loss \u2212 (Benchmark Loss) = \u2211_{t=1}^{T} g_t \u00b7 x_t \u2212 L(g_1, . . . , g_T)    (1)\n\nas the regret with respect to benchmark performance L (the L intended will be clear from context). The standard definition of regret arises from the choice\n\nL(g_1, . . . , g_T) = inf_{x \u2208 X} g_{1:T} \u00b7 x = inf_{x \u2208 R^n} g_{1:T} \u00b7 x + I(x \u2208 X),    (2)\n\nwhere I(condition) is the indicator function: it returns 0 when condition holds, and returns \u221e otherwise. The above choice of L represents the loss of the best fixed point x in the bounded convex set X. Throughout we shall write g_{1:t} = \u2211_{s=1}^{t} g_s for a sum of scalars or vectors. When L depends only on the sum G \u2261 g_{1:T} we write L(G).\n\n*Work performed while the author was in the CIS Department at the University of Pennsylvania and funded by a Simons Postdoctoral Fellowship.\n\nIn the present work we shall consider a broad notion of regret in which, for example, L is defined not in terms of a \u201cbest in hindsight\u201d comparator but instead in terms of a \u201cpenalized best in hindsight\u201d objective. Let \u03a8 be some penalty function, and consider\n\nL(G) = min_x G \u00b7 x + \u03a8(x).    (3)\n\nThis is a direct generalization of the usual comparator notion which takes \u03a8(x) = I(x \u2208 X).\n\nWe view this interaction as a sequential zero-sum game played over T rounds, where the player strives to minimize Eq. (1), and the adversary attempts to maximize it. We study the value of this game, defined as\n\nV^T \u2261 inf_{x_1 \u2208 R^n} sup_{g_1 \u2208 \ud835\udca2} . . . inf_{x_T \u2208 R^n} sup_{g_T \u2208 \ud835\udca2} ( \u2211_{t=1}^{T} g_t \u00b7 x_t \u2212 L(g_1, . . . , g_T) ).    (4)\n\nWith this in mind, we can describe the primary contributions of the present paper:\n\n1. We provide a characterization of the value of the game Eq. (4) in terms of the supremum over the expected value of a function of a martingale difference sequence. This will be made more explicit in Section 2.\n\n2. We provide a method for computing the player\u2019s minimax optimal (deterministic) strategy in terms of a \u201cdiscrete derivative.\u201d Similarly, we show how to describe the adversary\u2019s optimal randomized strategy in terms of martingale differences.\n\n3. 
For \u201ccoordinate-decomposable\u201d games we give a natural and efficiently computable description of the value of the game and the player\u2019s optimal strategy.\n\n4. In Section 3, we consider several benchmark functions L, defined in Eq. (3) via a penalty function \u03a8, which lead to interesting and surprising optimal algorithms; we also exactly compute the values of these games. Figure 1 summarizes these applications. In particular, we show that constant-step-size gradient descent is minimax optimal for a quadratic \u03a8, and an exponential L leads to a bounded-loss hedging algorithm that can still yield exponential reward on \u201ceasy\u201d sequences.\n\nApplications The primary contributions of this paper are to the theory. Nevertheless, it is worth pausing to emphasize that the framework of \u201cunconstrained online optimization\u201d is a fundamental template for (and strongly motivated by) several online learning settings, and the results we develop are applicable to a wide range of commonly studied algorithmic problems. The classic algorithm for linear pattern recognition, the Perceptron, can be seen as an algorithm for unconstrained linear optimization. Methods for training a linear SVM or a logistic regression model, such as stochastic gradient descent or the Pegasos algorithm [Shalev-Shwartz et al., 2011], are unconstrained optimization algorithms. Finally, there has been recent work in the pricing of options and other financial derivatives [DeMarzo et al., 2006, Abernethy et al., 2012] that can be described exactly in terms of a repeated game which fits nicely into our framework.\n\nWe also wish to emphasize that the algorithm of Section 3.2 is both practical and easily implementable: for a multi-dimensional problem one needs to only track the sum of gradients for each coordinate (similar to Dual Averaging), and compute Eq. (12) for each coordinate to derive the appropriate strategy. 
The algorithm provides us with a tool for making potentially unconstrained bets/investments, but as we discuss it also leads to interesting regret bounds.\n\nRelated Work Regret-based analysis has received extensive attention in recent years; see Shalev-Shwartz [2012] and Cesa-Bianchi and Lugosi [2006] for an introduction. The analysis of alternative notions of regret is also not new. Vovk [2001] gives bounds relative to benchmarks similar to Eq. (3), though for different problems and not in the minimax setting. In the expert setting, there has been much work on tracking a shifting sequence of experts rather than the single best expert; see Koolen et al. [2012] and references therein. Zinkevich [2003] considers drifting comparators in an online convex optimization framework. This notion can be expressed by an appropriate L(g_1, . . . , g_T), but now the order of the gradients matters. Merhav et al. [2006] and Dekel et al. [2012] consider the stronger notion of policy regret in the online experts and bandit settings, respectively. Stoltz [2011] also considers some alternative notions of regret. For investing scenarios, Agarwal et al. [2006] and Hazan and Kale [2009] consider regret with respect to the best constant-rebalanced portfolio. Our algorithm in Section 3.2 applies to similar problems, but does not require a \u201cno junk bonds\u201d assumption, and is in fact minimax optimal for a natural benchmark.\n\nsetting | L(G) | \u03a8(x) | minimax value | update\nsoft feasible set | \u2212G^2/(2\u03c3) | (\u03c3/2) x^2 | T/(2\u03c3) | x_{t+1} = \u2212(1/\u03c3) g_{1:t}\nstandard regret | \u2212|G| | I(|x| \u2264 1) | \u2192 \u221a(2T/\u03c0) | Eq. (14)\nbounded-loss betting | \u2212exp(G/\u221aT) | \u2212\u221aT x log(\u2212\u221aT x) + \u221aT x | \u2192 \u221ae | Eq. (12)\n\nFigure 1: Summary of specific online linear games considered in Section 3. Results are stated for the one-dimensional problem where g_t \u2208 [\u22121, 1]; Corollary 5 gives an extension to n dimensions. The benchmark L is given as a function of G = g_{1:T}. The standard notion of regret corresponds to L(G) = min_{x \u2208 [\u22121, 1]} g_{1:T} \u00b7 x = \u2212|G|. The benchmark functions can alternatively be derived from a suitable penalty \u03a8 on comparator points x, so L(G) = min_x Gx + \u03a8(x).\n\nExisting algorithms do offer bounds for unconstrained problems, generally of the form \u2016x*\u2016/\u03b7 + \u03b7 \u2211_t g_t x_t. However, such bounds can only guarantee no-regret when an upper bound R on \u2016x*\u2016 is known in advance and used to tune the parameter \u03b7. If one knows such an R, however, the problem is no longer truly unconstrained. The only algorithms we know that avoid this problem are those of Streeter and McMahan [2012], and the minimax-optimal algorithm we introduce in Section 3.2; these algorithms guarantee Regret \u2264 O(R \u221aT log((1 + R)T)) for any R > 0.\n\nThe field has seen a number of minimax approaches to online learning. Abernethy and Warmuth [2010] and Abernethy et al. [2008b] give the optimal behavior for several zero-sum games against a budgeted adversary. Section 3.3 studies the online linear game of Abernethy et al. [2008a] under different assumptions, and we adapt some techniques from Abernethy et al. [2009, 2012]; the latter work also involves analyzing an unconstrained player. Rakhlin et al. [2012] utilize powerful tools for non-constructive analysis of online learning as a technique to design algorithms; our work differs in that we focus on cases where the exact minimax strategy can be computed.\n\nNotions of Regret The standard notion of regret corresponds to a hard penalty \u03a8(x) = I(x \u2208 X). Such a definition makes sense when the player by definition must select a strategy from some bounded set, for example a probability from the n-dimensional simplex, or a distribution on paths in a graph. 
However, in contexts such as machine learning where any x \u2208 R^n corresponds to a valid model, such a hard constraint is difficult to justify; while any x \u2208 R^n is technically feasible, in order to prove regret bounds we compare to a much more restrictive set. As an alternative, in Sections 3.1 and 3.2 we propose soft penalty functions that encode the belief that points near the origin are more likely to be optimal (we can always re-center the problem to match our beliefs in this regard), but do not rule out any x \u2208 R^n a priori.\n\nThus, one of our contributions is showing that interesting results can be obtained by choosing L differently than in Eq. (2). The player cannot do well in terms of the absolute loss \u2211_t g_t \u00b7 x_t for all sequences g_1, . . . , g_T, but she can do better on some sequences at the expense of doing worse on others. The benchmark L makes this notion precise: sequences for which L(g_1, . . . , g_T) is large and negative are those on which the player desires good performance, at the expense of allowing more loss (in absolute terms) on sequences where L(g_1, . . . , g_T) is large and positive. The value of the game V^T tells us to what extent any online algorithm can hope to match the benchmark L.\n\n2 General Unconstrained Linear Optimization\n\nIn this section we develop general results on the unconstrained linear optimization problem. We start by analyzing (4) in greater detail, and give tools for computing the regret value V^T in such games. We show that in certain cases the computation of the minimax value can be greatly simplified.\n\nThroughout we will assume that the function L is concave in each of its arguments (though not necessarily jointly concave) and bounded on \ud835\udca2^T. We also include the following assumptions on the set \ud835\udca2. First, we assume that either \ud835\udca2 is a polytope or, more generally, that ConvexHull(\ud835\udca2) is a full-rank polytope in R^n. 
This is not strictly necessary but is convenient for the analysis; any bounded convex set in R^n can be approximated to arbitrary precision with a polytope. We also make the necessary assumption that ConvexHull(\ud835\udca2) contains the origin in its interior. We let \ud835\udca2\u2032 be the set of \u201ccorners\u201d of \ud835\udca2, that is \ud835\udca2\u2032 = {g_1, . . . , g_m} and hence ConvexHull(\ud835\udca2) = ConvexHull(\ud835\udca2\u2032).\n\nWe are also concerned with the conditional value of the game, V_t, given that x_1, . . . , x_t and g_1, . . . , g_t have already been played. That is, the Regret when we fix the plays on the first t rounds, and then assume minimax optimal play for rounds t + 1 through T. However, following the approach of Rakhlin et al. [2012], we omit the terms \u2211_{s=1}^{t} x_s \u00b7 g_s from Eq. (4). We can view this as cost that the learner has already paid, and neither that cost nor the specific previous plays of the learner impact the value of the remaining terms in Eq. (1). Thus, we define\n\nV_t(g_1, . . . , g_t) = inf_{x_{t+1} \u2208 R^n} sup_{g_{t+1} \u2208 \ud835\udca2} . . . inf_{x_T \u2208 R^n} sup_{g_T \u2208 \ud835\udca2} ( \u2211_{s=t+1}^{T} g_s \u00b7 x_s \u2212 L(g_1, . . . , g_T) ).    (5)\n\nNote the conditional value of the game before anything has been played, V_0(), is exactly V^T.\n\nThe martingale characterization of the game The fundamental tool used in the rest of the paper is the following characterization of the conditional value of the game:\n\nTheorem 1. For every t and every sequence g_1, . . . , g_t \u2208 \ud835\udca2, we can write the conditional value of the game as\n\nV_t(g_1, . . . , g_t) = max_{G \u2208 \u0394(\ud835\udca2\u2032), E[G] = 0} E[V_{t+1}(g_1, . . . , g_t, G)],\n\nwhere \u0394(\ud835\udca2\u2032) is the set of random variables on \ud835\udca2\u2032. Moreover, for all t the function V_t is convex in each of its coordinates and bounded.\n\nAll proofs omitted from the body of the paper can be found in the appendix or the extended version of this paper.\n\nLet M_T(\ud835\udca2) be the set of T-length martingale difference sequences on \ud835\udca2\u2032, that is the set of all sequences of random variables (G_1, . . . , G_T), with G_t taking values in \ud835\udca2\u2032, which satisfy E[G_t | G_1, . . . , G_{t\u22121}] = 0 for all t = 1, . . . , T. Then, we immediately have the following:\n\nCorollary 2. We can write\n\nV^T = max_{(G_1, . . . , G_T) \u2208 M_T(\ud835\udca2\u2032)} E[\u2212L(G_1, . . . , G_T)],\n\nwith the analogous expression holding for the conditional value of the game.\n\nCharacterization of optimal strategies The result above gives a nice expression for the value of the game V^T but unfortunately it does not lead directly to a strategy for the player. We now dig a bit deeper and produce a characterization of the optimal player behavior. This is achieved by analyzing a simple one-round zero-sum game. As before, we assume \ud835\udca2 is a bounded subset of R^n whose convex hull is a polytope whose interior contains the origin 0. Assume we are given some convex function f defined and bounded on all of ConvexHull(\ud835\udca2). We consider the following:\n\nV = inf_{x \u2208 R^n} sup_{g \u2208 \ud835\udca2} x \u00b7 g + f(g).    (6)\n\nTheorem 3. There exists a set of n + 1 distinct points {g_1, . . . , g_{n+1}} \u2282 \ud835\udca2 whose convex hull is of full rank, and a distribution \u03b1 \u2208 \u0394_{n+1} satisfying \u2211_{i=1}^{n+1} \u03b1_i g_i = 0, such that V = \u2211_{i=1}^{n+1} \u03b1_i f(g_i). Moreover, an optimal choice for the infimum in (6) is the gradient of the unique linear interpolation of the pairs {(g_1, \u2212f(g_1)), . . . , (g_{n+1}, \u2212f(g_{n+1}))}.\n\nThe theorem makes a useful point about determining the player\u2019s optimal strategy for games of this form. If the player can determine a full-rank set of \u201cbest responses\u201d {g_1, . . . , g_{n+1}} to his optimal x*, each of which should be a corner of the polytope \ud835\udca2, then we know that x* must be a \u201cdiscrete gradient\u201d of the function f around 0. 
That is, if the size of \ud835\udca2 is small relative to the curvature of f, then an approximation to \u2207f(0) is the linear interpolation of f at a set of points around 0. An optimal x* will be exactly this interpolation.\n\nThis result also tells us how to analyze the general T-round game. We can express (5), the conditional value of the game V_{t\u22121}, in recursive form as\n\nV_{t\u22121}(g_1, . . . , g_{t\u22121}) = inf_{x_t \u2208 R^n} sup_{g_t \u2208 \ud835\udca2} g_t \u00b7 x_t + V_t(g_1, . . . , g_{t\u22121}, g_t).    (7)\n\nHence by setting f(g_t) = V_t(g_1, . . . , g_{t\u22121}, g_t), noting that the latter is convex in g_t by Theorem 1, we see we have an immediate use of Theorem 3.\n\n3 Minimax Optimal Algorithms for Coordinate-Decomposable Games\n\nIn this section, we consider games where \ud835\udca2 consists of axis-aligned constraints, and L decomposes so L(g) = \u2211_{i=1}^{n} L_i(g_i). In order to solve such games, it is generally sufficient to consider n independent one-dimensional problems. We study such games first:\n\nTheorem 4. Consider the one-dimensional unconstrained game where the player selects x_t \u2208 R and the adversary chooses g_t \u2208 \ud835\udca2 = [\u22121, 1], and L is concave in each of its arguments and bounded on \ud835\udca2^T. Then, V^T = E_{g_t \u223c {\u22121, 1}}[\u2212L(g_1, . . . , g_T)] where the expectation is over each g_t chosen independently and uniformly from {\u22121, 1} (that is, the g_t are Rademacher random variables). Further, the conditional value of the game is\n\nV_t(g_1, . . . , g_t) = E_{g_{t+1}, . . . , g_T \u223c {\u22121, 1}}[\u2212L(g_1, . . . , g_t, g_{t+1}, . . . , g_T)].    (8)\n\nThe proof is immediate from Corollary 2, since the only possible martingale that both plays from the corners of \ud835\udca2 and has expectation 0 on each round is the sequence of independent Rademacher random variables.\u00b9 Given Theorem 4, and the fact that the functions L of interest will generally depend only on g_{1:T}, it will be useful to define \u212c_T to be the distribution of g_{1:T} when each g_t is drawn independently and uniformly from {\u22121, 1}.\n\n\u00b9However, it is easy to extend this to the case where \ud835\udca2 = [a, b], which leads to different random variables.\n\nTheorem 4 can immediately be extended to coordinate-decomposable games as follows:\n\nCorollary 5. Consider the game where the player chooses x_t \u2208 R^n, the adversary chooses g_t \u2208 [\u22121, 1]^n, and the payoff is \u2211_{t=1}^{T} g_t \u00b7 x_t \u2212 \u2211_{i=1}^{n} L(g_{1:T,i}) for concave L. Then the value V^T and the conditional value V_t(\u00b7) can be written as\n\nV^T = n E_{G \u223c \u212c_T}[\u2212L(G)] and V_t(g_1, . . . , g_t) = \u2211_{i=1}^{n} E_{G_i \u223c \u212c_{T\u2212t}}[\u2212L(g_{1:t,i} + G_i)].\n\nThe proof follows by noting the constraints on both players\u2019 strategies and the value of the game fully decompose on a per-coordinate basis.\n\nA recipe for minimax optimal algorithms in one dimension Since Eq. (5) gives the minimax value of the game if both players play optimally from round t + 1 forward, a minimax strategy for the learner on round t + 1 must be x_{t+1} = arg min_{x \u2208 R} max_{g \u2208 {\u22121, 1}} g \u00b7 x + V_{t+1}(g_1, . . . , g_t, g). Now, we can apply Theorem 3, and note that the unique strategy for the adversary is to play g = 1 or g = \u22121 with equal probability. Thus, the player strategy is just the interpolation of the points (\u22121, \u2212f(\u22121)) and (1, \u2212f(1)), where we take f = V_{t+1}, giving us\n\nx_{t+1} = (1/2) ( V_{t+1}(g_1, . . . , g_t, \u22121) \u2212 V_{t+1}(g_1, . . . , g_t, +1) ).    (9)\n\nThus, if we can derive a closed form for V_t(g_1, . . . , g_t), we will have an efficient minimax-optimal algorithm. Note that for any function L,\n\nE_{G \u223c \u212c_T}[L(G)] = (1/2^T) \u2211_{i=0}^{T} (T choose i) L(2i \u2212 T),    (10)\n\nsince 2^{\u2212T} (T choose i) is the binomial probability of getting exactly i gradients of +1 over T rounds, which implies T \u2212 i gradients of \u22121, so G = i \u2212 (T \u2212 i) = 2i \u2212 T. Using Theorem 4, and Eqs. (9) and (10), in the following sections we exactly compute the game values and unique minimax optimal strategies for a variety of interesting coordinate-decomposable games. Even when such exact computations are not possible, any coordinate-decomposable game where L depends only on G = g_{1:T} can be solved numerically in polynomial time. If \u03c4 = T \u2212 t, the number of rounds remaining, then we can compute V_t exactly by using the appropriate binomial probabilities (following Eq. (8) and Eq. (10)), requiring only a sum over O(\u03c4) values. If \u03c4 is large enough, then using an approximation to the binomial (e.g., the Gaussian approximation) may be sufficient.\n\nWe can also immediately provide a characterization of the potentially optimal player strategies in terms of the subgradients of L. For simplicity, we write \u2202L(g) instead of \u2212\u2202(\u2212L(g)).\n\nTheorem 6. Let \ud835\udca2 = [a, b], with a < 0 < b, and let L : R \u2192 R be bounded and concave. Then, on every round, the unique minimax optimal x*_t satisfies x*_t \u2208 \u2112 where \u2112 = \u222a_{w \u2208 R} \u2202L(w).\n\nProof. Following Theorem 3, we know the minimax x_{t+1} interpolates (a, \u2212f(a)) and (b, \u2212f(b)), where we take f(g) = V_{t+1}(g_1, . . . , g_t, g). In one dimension, this implies x_{t+1} \u2208 \u2212\u2202f(g) for some g \u2208 \ud835\udca2. It remains to show \u2212\u2202f(g) \u2286 \u2112. From Theorem 1 we have f(g) = E[\u2212L(g_{1:t} + g + B)], where the E is with respect to mean-zero random variable B \u223c \u212c_\u03c4, \u03c4 = T \u2212 t. 
For each possible value b_i that B can take on, \u2202_g L(g_{1:t} + g + b_i) \u2286 \u2112 by definition, so \u2212\u2202f(g) is a convex combination of these sets (e.g., Rockafellar [1997, Thm. 23.8]). The result follows as \u2112 is convex.\n\nNote that for standard regret, L(g) = inf_{x \u2208 X} gx, we have \u2202L(g) \u2286 X, indicating that (in 1 dimension at least), the player never needs to play outside the comparator set X. We will see additional consequences of this theorem in the following sections.\n\n3.1 Constant step-size gradient descent can be minimax optimal\n\nSuppose we use a \u201csoft\u201d feasible set for the benchmark via a quadratic penalty,\n\nL(G) = min_x Gx + (\u03c3/2) x^2 = \u2212(1/(2\u03c3)) G^2,    (11)\n\nfor a constant \u03c3 > 0. Does a no-regret algorithm against this comparison class exist? Unfortunately, the general answer is no, as shown in the next theorem. Recalling g_t \u2208 [\u22121, 1],\n\nTheorem 7. The value of this game is V^T = E_{G \u223c \u212c_T}[ (1/(2\u03c3)) G^2 ] = T/(2\u03c3).\n\nThus, for a fixed \u03c3, we cannot have a no regret algorithm with respect to this L. But this does not mean the minimax algorithm will be uninteresting. To derive the minimax optimal algorithm, we compute conditional values (using similar techniques to Theorem 7),\n\nV_t(g_1, . . . , g_t) = E_{G \u223c \u212c_{T\u2212t}}[ (1/(2\u03c3)) (g_{1:t} + G)^2 ] = (1/(2\u03c3)) ( (g_{1:t})^2 + (T \u2212 t) ),\n\nand so following Eq. (9) the minimax-optimal algorithm must use\n\nx_{t+1} = (1/(4\u03c3)) ( (g_{1:t} \u2212 1)^2 + (T \u2212 t \u2212 1) \u2212 ((g_{1:t} + 1)^2 + (T \u2212 t \u2212 1)) ) = (1/(4\u03c3)) (\u22124 g_{1:t}) = \u2212(1/\u03c3) g_{1:t}.\n\nThus, a minimax-optimal algorithm is simply constant-learning-rate gradient descent with learning rate 1/\u03c3. Note that for a fixed \u03c3, this is the optimal algorithm independent of T; this is atypical, as usually the minimax optimal algorithm depends on the horizon (as we will see in the next two cases). Note that the set \u2112 = R (from Theorem 6), and indeed the player could eventually play an arbitrary point in R (given large enough T).\n\n3.2 Non-stochastic betting with exponential upside and bounded worst-case loss\n\nA major advantage of the regret minimization framework is that the guarantees we can achieve are typically robust to arbitrary input sequences. But on the downside the model is very pessimistic: we measure performance in the worst case. One might aim to perform not too badly in the worst case yet extremely well under certain conditions.\n\nWe now show how the results in the present paper can lead to a very optimistic guarantee, particularly in the case of a sequential betting game. On each round t, the world offers the player a betting opportunity on a coin toss, i.e., a binary outcome g_t \u2208 {\u22121, 1}. The player may take either side of the bet, and selects a wager amount x_t, where x_t > 0 implies a bet on tails (g_t = \u22121) and x_t < 0 a bet on heads (g_t = 1). The world then announces whether the bet was won or lost, revealing g_t. The player\u2019s wealth changes (additively) by \u2212g_t x_t (that is, the player strives to minimize loss g_t x_t). We assume that the player begins with some initial capital \u03b1 > 0, and at any time period the wager |x_t| must not exceed \u03b1 \u2212 \u2211_{s=1}^{t\u22121} g_s x_s, the initial capital plus the money earned thus far.\n\nWith the benefit of hindsight, the gambler can see G = \u2211_{t=1}^{T} g_t, the total number of heads minus the total number of tails. Let us imagine that the number of heads significantly exceeded the number of tails, or vice versa; that is, |G| was much larger than 0. Without loss of generality let us assume that G is positive. Let us imagine that the gambler, with the benefit of hindsight, considers what could have happened had he always bet a constant fraction \u03b3 of his wealth on heads. A simple exercise shows that his wealth would become\n\n\u220f_{t=1}^{T} (1 + \u03b3 g_t) = (1 + \u03b3)^{(T+G)/2} (1 \u2212 \u03b3)^{(T\u2212G)/2}.\n\nThis is optimized at \u03b3 = G/T, which gives a simple expression in terms of KL-divergence for the maximum wealth in hindsight, exp( T \u00b7 KL( (1 + G/T)/2 || 1/2 ) ), and the former is well-approximated by exp(O(G^2/T)) when G is not too large relative to T. In other words, with knowledge of the final G, a na\u00efve betting strategy could have earned the gambler exponentially large winnings starting with constant capital. Note that this is essentially a Kelly betting scheme [Kelly Jr, 1956], expressed in terms of G. We ask: does there exist an adaptive betting strategy that can compete with this hindsight benchmark, even if the g_t are chosen fully adversarially?\n\nIndeed we show we can get reasonably close. Our aim will be to compete with a slightly weaker benchmark L(G) = \u2212exp(|G|/\u221aT). We present a solution for the one-sided game, without the absolute value, so the player only aims for exponential wealth growth for large positive G. It is not hard to develop a two-sided algorithm as a result, which we soon discuss.\n\nTheorem 8. Consider the game where \ud835\udca2 = [\u22121, 1] with benchmark L(G) = \u2212exp(G/\u221aT). Then\n\nV^T = (cosh(1/\u221aT))^T \u2264 \u221ae,\n\nwith the bound tight as T \u2192 \u221e. Let \u03c4 = T \u2212 t and G_t = g_{1:t}, then the conditional value of the game is V_t(G_t) = (cosh(1/\u221aT))^\u03c4 exp(G_t/\u221aT) and the player\u2019s minimax optimal strategy is:\n\nx_{t+1} = \u2212exp(G_t/\u221aT) sinh(1/\u221aT) (cosh(1/\u221aT))^{\u03c4\u22121}.    (12)\n\nRecall that the value of the game can be thought of as the largest possible difference between the payoff of the benchmark function exp(G/\u221aT) and the winnings of the player \u2212\u2211 g_t x_t, when the player uses an optimal betting strategy. That the value of the game here is of constant order is critical, since it says that we can always achieve a payoff that is exponential in G/\u221aT at a cost of no more than \u221ae = O(1). Notice we have said nothing thus far regarding the nature of our betting strategy; in particular we have not proved that the strategy satisfies the required condition that the gambler cannot bet more than \u03b1 plus the earnings thus far. We now give a general result showing that this condition can be satisfied:\n\nTheorem 9. Consider a one dimensional game with \ud835\udca2 = [\u22121, 1] with benchmark function L non-positive on \ud835\udca2^T. Then for the optimal betting strategy we have that |x_t| \u2264 \u2212\u2211_{s=1}^{t\u22121} g_s x_s + V^T, and further V^T \u2265 \u2211_{s=1}^{t} g_s x_s for any t and any sequence g_1, . . . , g_t.\n\nIn other words, the player\u2019s cumulative loss at any time is always bounded from above by V^T. This implies that the starting capital \u03b1 required to \u201creplicate\u201d the payoff function is exactly the value\u00b2 of the game V^T. Indeed, to replicate exp(G/\u221aT) we would require no more than \u03b1 = $1.65.\n\n\u00b2This idea has a long history in finance and was a key tool in Abernethy et al. [2012], DeMarzo et al. [2006], and other works.\n\nIt is worth noting an alternative characterization of the benchmark function L used here. For a \u2265 0, min_{x \u2208 R} (Gx \u2212 ax log(\u2212ax) + ax) = \u2212exp(G/a). 
Thus, if we take \u03a8(x) = \u2212ax log(\u2212ax) + ax + I(x \u2264 0), we have min_{x \u2208 R} g_{1:T} x + \u03a8(x) = \u2212exp(G/a). Since this algorithm needs large reward when G is large and positive, we might expect that the minimax optimal algorithm only plays x_t \u2264 0. Another intuition for this is that the algorithm should not need to play any point x to which \u03a8 assigns an infinite penalty. This intuition can be confirmed immediately via Theorem 6.\n\nWe now sketch how to derive an algorithm for the \u201ctwo-sided\u201d game. To do this, we let L_C(G) \u2261 L(G) + L(\u2212G) \u2264 \u2212exp(|G|/\u221aT). We can construct a minimax optimal algorithm for L_C(G) by running two copies of the one-sided minimax algorithm simultaneously, switching the signs of the gradients and plays of the second copy. We formalize this in Appendix B.\n\nThis same benchmark and algorithm can be used in the setting introduced by Streeter and McMahan [2012]. In that work, the goal was to prove bounds on standard regret like Regret \u2264 O(R \u221aT log((1 + R)T)) simultaneously for any comparator x* with |x*| = R. Stating their Theorem 1 in terms of losses, this traditional regret bound is achieved by any algorithm that guarantees\n\nLoss = \u2211_{t=1}^{T} g_t x_t \u2264 \u2212exp(|G|/\u221aT) + O(1).    (13)\n\nThe symmetric algorithm (Appendix B) satisfies\n\nLoss \u2264 \u2212exp(G/\u221aT) \u2212 exp(\u2212G/\u221aT) + 2\u221ae \u2264 \u2212exp(|G|/\u221aT) + 2\u221ae,\n\nand so we also achieve a standard regret bound of the form given above.\n\n3.3 Optimal regret against hypercube adversaries\n\nPerhaps the simplest and best studied learning games are those that restrict both the player and adversary to a norm ball, and use the standard notion of regret. We can derive results for the game where the adversary has an L\u221e constraint, the comparator set is also the L\u221e ball, and the player is unconstrained. Corollary 5 implies it is sufficient to study the one-dimensional case.\n\nTheorem 10. 
Consider the game between an adversary who chooses losses g_t \u2208 [\u22121, 1], and a player who chooses x_t \u2208 R. For a given sequence of plays, x_1, g_1, x_2, g_2, . . . , x_T, g_T, the value to the adversary is \u2211_{t=1}^{T} g_t x_t + |g_{1:T}|. Then, when T is even with T = 2M, the minimax value of this game is given by\n\nV^T = 2^{\u2212T} \u00b7 2M T! / ((T \u2212 M)! M!) \u2264 \u221a(2T/\u03c0).\n\nFurther, as T \u2192 \u221e, V^T \u2192 \u221a(2T/\u03c0). Let B be a random variable drawn from \u212c_{T\u2212t}. Then the minimax optimal strategy for the player given the adversary has played G_t = g_{1:t} is given by\n\nx_{t+1} = Pr(B < \u2212G_t) \u2212 Pr(B > \u2212G_t) = 1 \u2212 2 Pr(B > \u2212G_t) \u2208 [\u22121, 1].    (14)\n\nThe fact that the limiting value of this game is \u221a(2T/\u03c0) was previously known, e.g., see a mention in Abernethy et al. [2009]; however, we believe this explicit form for the optimal player strategy is new. This strategy can be efficiently computed numerically, e.g., by using the regularized incomplete beta function for the CDF of the binomial distribution. It also follows from this expression that even though we allow the player to select x_{t+1} \u2208 R, the minimax optimal algorithm always selects points from [\u22121, 1], so our result applies to the case where the player is constrained to play from X.\n\nAbernethy et al. [2008a] show that for the linear game with n \u2265 3 where both the learner and adversary select vectors from the unit sphere, the minimax value is exactly \u221aT. Interestingly, in the n = 1 case (where L2 and L\u221e coincide), the value of the game is lower, about 0.8\u221aT rather than \u221aT. This indicates a fundamental difference in the geometry of the n = 1 space and n \u2265 3. We conjecture the minimax value for the L2 game with n = 2 lies somewhere in between.\n\nReferences\n\nJacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In NIPS, 2010.\n\nJacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. 
In COLT, 2008a.\n\nJacob Abernethy, Manfred K. Warmuth, and Joel Yellin. Optimal strategies from random walks. In Proceedings of the 21st Annual Conference on Learning Theory, pages 437\u2013446. Citeseer, 2008b.\n\nJacob Abernethy, Alekh Agarwal, Peter Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.\n\nJacob Abernethy, Rafael M. Frongillo, and Andre Wibisono. Minimax option pricing meets Black-Scholes in the limit. In STOC, 2012.\n\nAmit Agarwal, Elad Hazan, Satyen Kale, and Robert E. Schapire. Algorithms for portfolio management based on the Newton method. In ICML, 2006.\n\nNicol\u00f2 Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.\n\nA. de Moivre. The Doctrine of Chances: or, A Method of Calculating the Probabilities of Events in Play. 1718.\n\nOfer Dekel, Ambuj Tewari, and Raman Arora. Online bandit learning against an adaptive adversary: from regret to policy regret. In ICML, 2012.\n\nPeter DeMarzo, Ilan Kremer, and Yishay Mansour. Online trading algorithms and robust option pricing. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 477\u2013486. ACM, 2006.\n\nPersi Diaconis and Sandy Zabell. Closed form summation for classical distributions: Variations on a theme of de Moivre. Statistical Science, 6(3), 1991.\n\nElad Hazan and Satyen Kale. On stochastic and worst-case models for investing. In NIPS, 2009.\n\nJ. L. Kelly Jr. A new interpretation of information rate. Bell System Technical Journal, 1956.\n\nWouter Koolen, Dmitry Adamskiy, and Manfred Warmuth. Putting Bayes to sleep. In NIPS, 2012.\n\nN. Merhav, E. Ordentlich, G. Seroussi, and M. J. Weinberger. On sequential strategies for loss functions with memory. IEEE Trans. Inf. Theor., 48(7), September 2006.\n\nAlexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. 
In NIPS, 2012.\n\nRalph T. Rockafellar. Convex Analysis (Princeton Landmarks in Mathematics and Physics). Princeton University Press, 1997.\n\nShai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.\n\nShai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3\u201330, 2011.\n\nGilles Stoltz. Contributions to the sequential prediction of arbitrary sequences: applications to the theory of repeated games and empirical studies of the performance of the aggregation of experts. Habilitation \u00e0 diriger des recherches, Universit\u00e9 Paris-Sud, 2011.\n\nMatthew Streeter and H. Brendan McMahan. No-regret algorithms for unconstrained online convex optimization. In NIPS, 2012.\n\nVolodya Vovk. Competitive on-line statistics. International Statistical Review, 69, 2001.\n\nMartin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.", "award": [], "sourceid": 1269, "authors": [{"given_name": "Brendan", "family_name": "McMahan", "institution": "Google Research"}, {"given_name": "Jacob", "family_name": "Abernethy", "institution": "University of Pennsylvania"}]}