{"title": "Fast and Furious Learning in Zero-Sum Games: Vanishing Regret with Non-Vanishing Step Sizes", "book": "Advances in Neural Information Processing Systems", "page_first": 12977, "page_last": 12987, "abstract": "We show for the first time that it is possible to reconcile in online learning in zero-sum games two seemingly contradictory objectives: vanishing time-average regret and non-vanishing step sizes. This phenomenon, that we coin ``fast and furious\" learning in games, sets a new benchmark about what is possible both in max-min optimization as well as in multi-agent systems. Our analysis does not depend on introducing a carefully tailored dynamic. Instead we focus on the most well studied online dynamic, gradient descent. Similarly, we focus on the simplest textbook class of games, two-agent two-strategy zero-sum games, such as Matching Pennies. Even for this simplest of benchmarks the best known bound for total regret, prior to our work, was the trivial one of $O(T)$, which is immediately applicable even to a non-learning agent. Based on a tight understanding of the geometry of the non-equilibrating trajectories in the dual space we prove a regret bound of $\\Theta(\\sqrt{T})$ matching the well known optimal bound for adaptive step sizes in the online setting. This guarantee holds for all fixed step-sizes without having to know the time horizon in advance and adapt the fixed step-size accordingly.As a corollary, we establish that even with fixed learning rates the time-average of mixed strategies, utilities converge to their exact Nash equilibrium values. We also provide experimental evidence suggesting the stronger regret bound holds for all zero-sum games.", "full_text": "Fast and Furious Learning in Zero-Sum Games:\nVanishing Regret with Non-Vanishing Step Sizes\n\nJames P. 
Bailey

Texas A&M University

jamespbailey@tamu.edu

Georgios Piliouras

Singapore University of Technology and Design

georgios@sutd.edu.sg

Abstract

We show for the first time that it is possible to reconcile in online learning in zero-sum games two seemingly contradictory objectives: vanishing time-average regret and non-vanishing step sizes. This phenomenon, which we coin "fast and furious" learning in games, sets a new benchmark about what is possible both in max-min optimization as well as in multi-agent systems. Our analysis does not depend on introducing a carefully tailored dynamic. Instead we focus on the most well studied online dynamic, gradient descent. Similarly, we focus on the simplest textbook class of games, two-agent two-strategy zero-sum games, such as Matching Pennies. Even for this simplest of benchmarks the best known bound for total regret, prior to our work, was the trivial one of O(T), which is immediately applicable even to a non-learning agent. Based on a tight understanding of the geometry of the non-equilibrating trajectories in the dual space we prove a regret bound of Θ(√T) matching the well known optimal bound for adaptive step sizes in the online setting. This guarantee holds for all fixed step-sizes without having to know the time horizon in advance and adapt the fixed step-size accordingly. As a corollary, we establish that even with fixed learning rates the time-averages of the mixed strategies and utilities converge to their exact Nash equilibrium values. 
We also provide experimental evidence suggesting the stronger regret bound holds for all zero-sum games.

[Figure 1: 5000 Iterations of Gradient Descent on Matching Pennies with η = .15. (a) Player Strategies; (b) Player 1 Regret; (c) Player 1 Regret Squared.]

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

The performance of online learning algorithms such as online gradient descent in adversarial, adaptive settings is a classic staple of optimization and game theory, e.g., Cesa-Bianchi and Lugosi [2006], Fudenberg and Levine [1998], Young [2004]. Arguably, the most well known results in this space are the following:

i) Sublinear regret of O(√T) is achievable in adversarial settings, but only after employing a carefully chosen sequence of shrinking step-sizes, or if the time horizon is finite and known in advance and the fixed learning rate is selected accordingly.

ii) Sublinear regret algorithms "converge" to Nash equilibria in zero-sum games.

Despite the well established nature of these results, recent work has revealed some surprising insights that challenge the traditional ways of thinking in this area. Specifically, in the case of zero-sum games, what is referred to as "convergence" to equilibrium is the fact that when both agents apply regret-minimizing algorithms, both the time-average of the mixed strategy profiles as well as the utilities of the agents converge approximately to their Nash equilibrium values, where the approximation error can become arbitrarily close to zero by choosing a sufficiently small step-size. Naturally, this statement does not imply that the day-to-day behavior converges to equilibria. 
In fact, the actual realized behavior is antithetical to convergence to equilibrium. Bailey and Piliouras [2018] showed that Nash equilibria are repelling in zero-sum games for all follow-the-regularized-leader dynamics. As seen in Figure 1, the dynamics spiral outwards away from the equilibrium.

These novel insights about the geometry of learning dynamics in zero-sum games suggest a much richer and not well understood landscape of coupled strategic behaviors. They also raise the tantalizing possibility that we may be able to leverage this knowledge to prove tighter regret bounds in games. In fact, a series of recent papers has focused on beating the "black-box" regret bounds using a combination of tailored dynamics and adaptive step-sizes, e.g., Daskalakis et al. [2011], Rakhlin and Sridharan [2013], Syrgkanis et al. [2015], Foster et al. [2016], but so far no new bounds have been proven for the classic setting of fixed learning rates. Interestingly, Foster et al. [2016] explicitly examine the case of fixed learning rates η to show that learning achieves sublinear "approximate regret", where the algorithm compares itself against (1 − η) times the performance of the best action with hindsight. In contrast, our aim is to show sublinear regret for fixed η using the standard notion of regret.

Intuitively, non-equilibration, and more generally this emergent behavioral complexity, seem like harbingers of bad news in terms of system performance as well as of significant analytical obstacles. This pessimism seems especially justified given recent results about the behavior of online dynamics with fixed step-sizes in other small games (e.g., two-by-two coordination/congestion games), where their behavior can be shown to become provably chaotic (Palaiopanos et al. [2017], Chotibut et al. [2018]). 
Nevertheless, we show that we can leverage this geometric information to provide the first, to our knowledge, sublinear regret guarantees for online gradient descent with fixed step-size in games. Instability of Nash equilibria is not an obstacle, but in fact may be leveraged as a tool, for proving low regret.

Our theoretical results. We study the dynamics of gradient descent with fixed step size in two-strategy, two-player games. We leverage a deep understanding of the geometry of its orbits to prove the first sublinear regret bounds despite the constant learning rate. We show that the player strategies are repelled away from the Nash equilibrium. More specifically, regardless of the choice of the initial condition, there are only a finite number of iterations where both players select mixed strategies (Theorem 1). We prove a worst-case regret bound of O(√T) for arbitrary fixed learning rates, without prior knowledge of T (Theorem 3), matching the well known optimal bound for adaptive learning rates. An immediate corollary of our results is that the time-average of the mixed strategy profiles as well as the utilities of the agents converge to their exact Nash equilibrium values (and not to approximations thereof) (Corollary 4). Finally, we present a matching lower bound of Ω(√T) (Theorem 5), establishing that our regret analysis is tight.

To obtain the upper bound, we establish a tight understanding of the geometry of the trajectories in the dual space, i.e., the trajectories of the payoff vectors. We show there exists a linear transformation of the payoff vectors under which they rotate around the Nash equilibrium. Moreover, the distance between the Nash equilibrium and these transformed utility vectors increases by a constant in each rotation (Lemma 8). In addition, the time to complete a rotation is proportional to the distance between the Nash equilibrium and the transformed payoff vectors (Lemma 9). 
Together, these results imply a quadratic relationship between the number of iterations and the number of rotations completed, establishing the O(√T) regret bound. We establish the lower bound by exactly tracking the strategies and regret for a single game.

Our experimental results. Many of the proof techniques we develop extend to higher dimensions, suggesting sublinear regret in general zero-sum games. To test this, we conducted experiments to measure regret in higher dimensions. Our simulations for 5x5, 10x10, and 50x50 games suggest that regret is sublinear and close to Θ(√T) even for larger games. A summary of our simulations is given in Table 1 and the full details appear in Appendix I.

Table 1: Regression Summary for 10,000 Iterations of Gradient Descent in 30 Random Games

strategies | Regret1(T) ≈ b · T^a  | p-value   | % of variability explained | |support of x*|
2          | a ∈ [0.4492, 0.5248]  | < .000001 | 93.53403 – 99.83818        | 2
5          | a ∈ [0.3662, 0.5504]  | < .000001 | 97.04427 – 99.91377        | 2-5
10         | a ∈ [0.4653, 0.5563]  | < .000001 | 98.79963 – 99.87485        | 3-7
50         | a ∈ [0.5260, 0.5776]  | < .000001 | 99.40158 – 99.86970        | 21-30

2 Preliminaries

A two-player game consists of two players {1, 2} where each player has n_i strategies to select from. Player i can either select a pure strategy j ∈ [n_i] or a mixed strategy x_i ∈ X_i = {x_i ∈ R^{n_i}_{≥0} : Σ_{j∈[n_i]} x_{ij} = 1}. A strategy is fully mixed if x_i ∈ R^{n_i}_{>0}.

The most commonly studied class of games is zero-sum games. In a zero-sum game, there is a payoff matrix A ∈ R^{n1×n2} where player 1 receives utility x_1 · A x_2 and player 2 receives utility −x_1 · A x_2, resulting in the following optimization problem:

    max_{x1∈X1} min_{x2∈X2} x_1 · A x_2    (Two-Player Zero-Sum Game)

The solution to this saddle-point problem is the Nash equilibrium x^{NE}. 
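As a concrete illustration (our addition, not part of the paper's text), the saddle-point problem can be solved by hand for Matching Pennies, whose payoff matrix is A = [[1, −1], [−1, 1]]:

```latex
% Worked example: Nash equilibrium and value of Matching Pennies.
% Write x_1 = (p, 1-p) and x_2 = (q, 1-q). Then
x_1 \cdot A x_2 = pq - p(1-q) - (1-p)q + (1-p)(1-q) = (2p-1)(2q-1).
% For fixed p, player 2 minimizes: \min_{q \in [0,1]} (2p-1)(2q-1) = -|2p-1|,
% which player 1 maximizes at p = 1/2; symmetrically q = 1/2. Hence
x^{NE} = \left( \left(\tfrac12, \tfrac12\right), \left(\tfrac12, \tfrac12\right) \right),
\qquad x_1^{NE} \cdot A\, x_2^{NE} = 0 \quad \text{(the value of the game)}.
```

This equilibrium is unique and fully mixed, so Matching Pennies satisfies the hypotheses of the theorems proved later in the paper.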
If player 1 selects her Nash equilibrium strategy x^{NE}_1, then she guarantees her utility is x^{NE}_1 · A x_2 ≥ x^{NE}_1 · A x^{NE}_2 independent of what strategy player 2 selects. x^{NE}_1 · A x^{NE}_2 is referred to as the value of the game.

2.1 Online Learning in Continuous Time

In many applications of game theory, players know neither the payoff matrix nor the Nash equilibria. In such settings, players select their strategies adaptively. The most common way to do this in continuous time is by using a follow-the-regularized-leader (FTRL) algorithm. Given a strongly convex regularizer, a learning rate η, and an initial payoff vector y_i(0), players select their strategies at time T according to

    y_1(T) = y_1(0) + ∫_0^T A x_2(t) dt    (Player 1 Payoff Vector)

    y_2(T) = y_2(0) − ∫_0^T A^T x_1(t) dt    (Player 2 Payoff Vector)

    x_i(T) = argmax_{x_i≥0: Σ_{j∈[n_i]} x_{ij}=1} { y_i(T) · x_i − h_i(x_i)/η }    (Continuous FTRL)

In this paper, we are primarily interested in the regularizer h_i(x_i) = ||x_i||²₂/2, resulting in the Gradient Descent algorithm:

    x_i(t) = argmax_{x_i≥0: Σ_{j∈[n_i]} x_{ij}=1} { y_i(t) · x_i − ||x_i||²₂/(2η) }    (Continuous Gradient Descent)

Continuous-time FTRL learning in games has a number of interesting properties, including time-average convergence to the set of coarse correlated equilibria at a rate of O(1/T) in general games (Mertikopoulos et al. [2018]), and thus to Nash equilibria in zero-sum games. These systems can also exhibit interesting recurrent behavior, e.g., periodicity (Piliouras and Schulman [2018], Nagarajan et al. [2018]), Poincaré recurrence (Mertikopoulos et al. [2018], Piliouras and Shamma [2014], Piliouras et al. [2014]) and limit cycles (Kleinberg et al. [2011]). 
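Since y_i · x_i − ||x_i||²₂/(2η) differs from −||x_i − η y_i||²₂/(2η) only by a term independent of x_i, the argmax in (Continuous Gradient Descent) is exactly the Euclidean projection of η y_i onto the simplex; Lemma 2 and Algorithm 1 in Section 3 give its closed form. The sketch below is our illustration (function names are ours, not the paper's), using the standard sort-based projection:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    via the standard sort-and-threshold algorithm."""
    u = np.sort(v)[::-1]                 # coordinates in descending order
    css = np.cumsum(u)
    # largest k such that u_k + (1 - sum_{i<=k} u_i)/k > 0 (0-indexed as rho)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def gd_strategy(y, eta):
    """Strategy chosen by (Gradient Descent):
    argmax_x { y.x - ||x||^2/(2*eta) } over the simplex,
    i.e. the Euclidean projection of eta*y onto the simplex."""
    return project_simplex(eta * np.asarray(y, dtype=float))
```

For a payoff vector y = (0.2, −0.3) and η = 0.15 this returns the fully mixed strategy (0.5375, 0.4625), matching the closed form of Lemma 2; for large payoff gaps the projection lands on a pure strategy, which is exactly the boundary behavior the paper analyzes.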
These systems have formal connections to Hamiltonian dynamics (i.e., energy preserving systems) (Bailey and Piliouras [2019]). All of these types of recurrent behavior are special cases of chain recurrence (Papadimitriou and Piliouras [2018], Omidshafiei et al. [2019]).

2.2 Online Learning in Discrete Time

In most settings, players update their strategies iteratively in discrete time steps. The most common class of online learning algorithms is again the family of follow-the-regularized-leader algorithms.

    y^T_1 = y^0_1 + Σ_{t=1}^{T−1} A x^t_2    (Player 1 Payoff Vector)

    y^T_2 = y^0_2 − Σ_{t=1}^{T−1} A^T x^t_1    (Player 2 Payoff Vector)

    x^t_i = argmax_{x_i≥0: Σ_{j∈[n_i]} x_{ij}=1} { y^t_i · x_i − h_i(x_i)/η }    (FTRL)

    x^t_i = argmax_{x_i≥0: Σ_{j∈[n_i]} x_{ij}=1} { y^t_i · x_i − ||x_i||²₂/(2η) }    (Gradient Descent)

where η corresponds to the learning rate. In Lemma 6 of Appendix B, we show (FTRL) is the first order approximation of (Continuous FTRL).

These algorithms again have interesting properties in zero-sum games. The time-average strategy converges to an O(η)-approximate Nash equilibrium (Cesa-Bianchi and Lugosi [2006]). On the contrary, Bailey and Piliouras [2018] show that the day-to-day behavior diverges away from interior Nash equilibria. For notational simplicity we do not introduce different learning rates η_1, η_2, but all of our proofs immediately carry over to this setting.

2.3 Regret in Online Learning

The most common way of analyzing an online learning algorithm is by examining its regret. The regret at time/iteration T is the difference between the accumulated utility gained by the algorithm and the total utility of the best fixed action with hindsight. 
Formally for player 1,

    Regret_1(T) = max_{x_1∈X_1} { ∫_0^T x_1 · A x_2(t) dt − ∫_0^T x_1(t) · A x_2(t) dt }    (1)

    Regret_1(T) = max_{x_1∈X_1} { Σ_{t=0}^T x_1 · A x^t_2 − Σ_{t=0}^T x^t_1 · A x^t_2 }    (2)

for continuous and discrete time respectively.

In the case of (Continuous FTRL) it is possible to show rather strong regret guarantees. Specifically, Mertikopoulos et al. [2018] establish that Regret_1(T) ∈ O(1) even for non-zero-sum games. In contrast, (FTRL) only guarantees Regret_1(T) ∈ O(η · T) for a fixed learning rate. In this paper, we utilize the geometry of (Gradient Descent) to show Regret_1(T) ∈ O(√T) in 2x2 zero-sum games (n_1 = n_2 = 2).

3 The Geometry of Gradient Descent

Theorem 1. Let A be a 2x2 game that has a unique fully mixed Nash equilibrium where strategies are updated according to (Gradient Descent). For any non-equilibrium initial strategies and any fixed learning rate η, there exists a B such that x^t is on the boundary for all t ≥ B.

Theorem 1 strengthens the result for (Gradient Descent) in 2x2 games from Bailey and Piliouras [2018]. Specifically, Bailey and Piliouras [2018] show that strategies come arbitrarily close to the boundary infinitely often when updated with any version of (FTRL). This is accomplished by closely studying the geometry of the player strategies. We strengthen this result for (Gradient Descent) in 2x2 games by focusing on the geometry of the payoff vectors. The proof of Theorem 1 relies on many of the tools developed in Section 4 for Theorem 3 and is deferred to Appendix G. The first step to understanding the trajectories of the dynamics of (Gradient Descent) is characterizing the solution to (Gradient Descent). The exact solution of (Gradient Descent) is described by Lemma 2 below.

Lemma 2. 
The solution to (Gradient Descent) is given by

    x^t_{ij} = 0                                              for j ∉ S_i
             = η(y^t_{ij} − Σ_{k∈S_i} y^t_{ik}/|S_i|) + 1/|S_i|   for j ∈ S_i    (3)

where S_i is found using Algorithm 1.

Algorithm 1 Finding Optimal Set S_i
1: procedure FIND S_i
2:     S_i ← [n_i]
3: Search:
4:     Select j ∈ argmin_{k∈S_i} {y^t_{ik}}
5:     if η(y^t_{ij} − Σ_{k∈S_i} y^t_{ik}/|S_i|) + 1/|S_i| < 0 then
6:         S_i ← S_i \ {j}
7:         goto Search
8:     else
9:         return S_i

We defer the proof of Lemma 2 to Appendix C.

3.1 Convex Conjugate of the Regularizer

Our analysis primarily takes place in the space of payoff vectors. The payoff vector y^t_i is a formal dual of the strategy x^t_i obtained via

    h*_i(y^t_i) = max_{x_i≥0: Σ_{j∈[n_i]} x_{ij}=1} { y^t_i · x_i − h_i(x_i)/η }    (4)

which is known as the convex conjugate or Fenchel coupling of h_i and is closely related to the Bregman divergence. Mertikopoulos et al. [2018] and Bailey and Piliouras [2019] show that the "energy" r = Σ_{i=1}^2 h*_i(y_i) is conserved in (Continuous FTRL). By Lemma 6, (FTRL) is the first order approximation of (Continuous FTRL). The energy sublevel set {y : Σ_{i=1}^2 h*_i(y_i) ≤ r} is convex, and therefore the energy will be non-decreasing in (FTRL). Bailey and Piliouras [2018] capitalized on this non-decreasing energy to show that strategies come arbitrarily close to the boundary infinitely often in (FTRL).

In a similar fashion, we precisely compute h*_i(y^t_i) to better understand the dynamics of (Gradient Descent). We deviate slightly from traditional analyses of (FTRL) and embed the learning rate η into the regularizer h_i(x^t_i). Formally, define h_i(x^t_i) = ||x^t_i||²₂/(2η). Through the maximizing argument (Kakade et al. [2009]), we have

    h*_i(y^t_i) = y^t_i · x^t_i − ||x^t_i||²₂/(2η).    (5)

From Lemma 2,

    h*_i(y^t_i) = y^t_i · x^t_i − ||x^t_i||²₂/(2η)
                = (η/2) Σ_{j∈S_i} (y^t_{ij})² + (1/|S_i|) Σ_{j∈S_i} y^t_{ij} − (η/2)(Σ_{j∈S_i} y^t_{ij})²/|S_i| − 1/(2η|S_i|).    (6)

3.2 Selecting the Right Dual Space in 2x2 Games

Since h_i(x_i) = ||x_i||²₂/(2η) is a strongly smooth function in the simplex, we expect h*_i(y_i) to be strongly convex (Kakade et al. [2009]) – at least when its corresponding dual variable x_i is positive. However, (6) is not strongly convex for all y^t_i ∈ R^{n_i}. This is because y^{t+1}_i cannot appear anywhere in R^{n_i}; rather, y^{t+1}_i is contained in a space X*_i dual to the domain {x_i ∈ R^{n_i}_{≥0} : Σ_{j=1}^{n_i} x_{ij} = 1}. There are many non-intersecting dual spaces for the payoff vectors that yield the strategies {x^t_i}_{t=1}^∞. Mertikopoulos et al. [2018] informally define a dual space when they focus the analysis on the vector y_i(t) − y_{i n_i}(t)·1. Similarly, we define a dual space that will be convenient for showing our results in 2x2 zero-sum games. Consider the payoff matrix

    A = [ a  b
          c  d ]    (7)

Without loss of generality, we may assume a > min{0, b, c}, d > min{0, b, c}, and A is singular, i.e., ad − bc = 0 (see Appendix D for details). Denote Δy^t_1 as

    Δy^t_1 = y^{t+1}_1 − y^t_1    (8)
           = A x^t_2    (9)
           = [ (a − b)x^t_{21} + b
               (c − d)x^t_{21} + d ]    (10)

Therefore

    [d − c, a − b] · Δy^t_1 = ad − bc = 0    (11)

since A is singular. When y^t_{11} increases by a − b, y^t_{12} increases by c − d. Thus, the vector [a − b, c − d] describes the span of the dual space X*_1. 
Moreover, (FTRL) is invariant to constant shifts in the payoff vector y^t_1, and therefore we may assume [d − c, a − b] · y^0_1 = 0. By induction,

    [d − c, a − b] · y^t_1 = [d − c, a − b] · (y^{t−1}_1 + Δy^{t−1}_1)    (12)
                           = [d − c, a − b] · y^{t−1}_1 = 0    (13)

This conveniently allows us to express y^t_{12} in terms of y^t_{11},

    y^t_{12} = ((c − d)/(a − b)) y^t_{11}.    (14)

Symmetrically,

    y^t_{22} = ((b − d)/(a − c)) y^t_{21}.    (15)

Combining these relationships with Lemma 2 yields

    x^t_{11} = 0                                          if η(1 − (c−d)/(a−b)) y^t_{11}/2 + 1/2 ≤ 0
             = 1                                          if η(1 − (c−d)/(a−b)) y^t_{11}/2 + 1/2 ≥ 1
             = η(1 − (c−d)/(a−b)) y^t_{11}/2 + 1/2        otherwise    (16)

    x^t_{21} = 0                                          if η(1 − (b−d)/(a−c)) y^t_{21}/2 + 1/2 ≤ 0
             = 1                                          if η(1 − (b−d)/(a−c)) y^t_{21}/2 + 1/2 ≥ 1
             = η(1 − (b−d)/(a−c)) y^t_{21}/2 + 1/2        otherwise    (17)

The selection of this dual space also allows us to employ a convenient variable substitution to plot x^t and y^t on the same graph:

    z^t_1 = η(1 − (c − d)/(a − b)) y^t_{11}/2 + 1/2    (18)

    z^t_2 = η(1 − (b − d)/(a − c)) y^t_{21}/2 + 1/2    (19)

The strategy x^t can now be expressed as

    x^t_{i1} = 0        if z^t_i ≤ 0
             = 1        if z^t_i ≥ 1
             = z^t_i    otherwise    (20)

Moreover, (6) can be rewritten as

    h*_1(y^t_1) = h̄*_1(z^t_1) = α_10 z^t_1 − β_10                      if z^t_1 ≤ 0
                              = α_11 z^t_1 − β_11                      if z^t_1 ≥ 1
                              = γ_1 (z^t_1)² + α_1 z^t_1 − β_1         otherwise    (21)

    h*_2(y^t_2) = h̄*_2(z^t_2) = α_20 z^t_2 − β_20                      if z^t_2 ≤ 0
                              = α_21 z^t_2 − β_21                      if z^t_2 ≥ 1
                              = γ_2 (z^t_2)² + α_2 z^t_2 − β_2         otherwise    (22)

where α_i0 < 0, α_i1 > 0, and γ_i > 0. Both of these expressions are obviously strongly convex when the corresponding player strategy is in (0, 1). The full details of these reductions can be found in Appendix E. With this notation, (x^t_{11}, x^t_{21}) is simply the projection of z^t onto the unit square as shown in Figure 2.

[Figure 2: Strategies and Transformed Payoff Vectors Rotating Clockwise and Outwards in Matching Pennies with η = .15 and (y^0_{11}, y^0_{21}) = (.2, −.3). (a) Iterations 1-95; (b) Iterations 95-140.]

4 Θ(√T) Regret in 2x2 Zero-Sum Games

Theorem 3. Let A be a 2x2 game that has a unique fully mixed Nash equilibrium. When x^t is updated according to (Gradient Descent) with any fixed learning rate η, Regret_1(T) ∈ O(√T).

It is well known that if an algorithm admits sublinear regret in zero-sum games, then the time-average play converges to a Nash equilibrium. Thus, Theorem 3 immediately results in the following corollary.

Corollary 4. Let A be a 2x2 game that has a unique fully mixed Nash equilibrium. When x^t is updated according to (Gradient Descent) with any fixed learning rate η, the average strategy x̄^T = Σ_{t=1}^T x^t / T converges to x^{NE} as T → ∞.

Proof of Theorem 3. The result is simple if x^1 = x^{NE}. 
Neither player strategy will ever change. Since player 1's opponent is playing the fully mixed x^{NE}_2, player 1's utility is constant independent of what strategy is selected, and therefore the regret is always 0. Now consider x^1 ≠ x^{NE}.

[Figure 3: Partitioning of Payoff Vectors for the Proof of Theorem 3. The legend marks the strategies x^t and the transformed payoff vectors z^t; in the sections where the energy r_j increases by Θ(1) per iteration there are Θ(1) iterations per rotation, and in the sections where r_j does not change per iteration there are Θ(r_j) iterations per rotation.]

The main details of the proof are captured in Figure 3. Specifically, in Appendix F.1 we establish break points t_0 < t_1 < ... < t_k = T + 1 and analyze the impact the strategies x^{t_j}, x^{t_j+1}, ..., x^{t_{j+1}−1} have on the regret. The strategies x^{t_j}, x^{t_j+1}, ..., x^{t_{j+1}−1} are contained in adjacent red and green sections as shown in Figure 3.

In Appendix F.2, we show that there exist Θ(1) iterations where x^t ≠ x^{t+1} in each partition {t_j, t_j + 1, ..., t_{j+1} − 1}. Specifically, we show that Θ(1) consecutive payoff vectors appear in a red section of Figure 3. The remaining points all appear in a green section and the corresponding player strategies are equivalent. This implies

    Σ_{t=t_j}^{t_{j+1}−1} (x^{t+1}_1 − x^t_1) · A x^t_2 = Σ_{t∈[t_j, t_{j+1}−1]: x^{t+1}_1 ≠ x^t_1} (x^{t+1}_1 − x^t_1) · A x^t_2    (23)
                                                        ∈ Σ_{t∈[t_j, t_{j+1}−1]: x^{t+1}_1 ≠ x^t_1} O(1) ⊆ O(1)    (24)

Denote r_j = Σ_{i=1}^2 h̄*_i(z^{t_j}_i) as the total energy of the system in iteration t_j. In Appendix F.3, we show this energy increases linearly in each partition, i.e., r_{j+1} − r_j ∈ Θ(1). 
In Appendix F.4, we also show that the size of each partition is proportional to the energy in the system at the beginning of that partition, i.e., t_{j+1} − t_j ∈ Θ(r_j). Combining these two facts, t_j ∈ Θ(j²). Therefore T ∈ Θ(k²) and k ∈ Θ(√T), where k is the total number of partitions. Finally, it is well known (Cesa-Bianchi and Lugosi [2006]) that the regret of player 1 in zero-sum games through T iterations is bounded by

    Regret_1(T) ≤ O(1) + Σ_{t=0}^T (x^{t+1}_1 − x^t_1) · A x^t_2    (25)
                ≤ O(1) + Σ_{t=0}^{t_0−1} (x^{t+1}_1 − x^t_1) · A x^t_2 + Σ_{i=1}^k Σ_{t=t_{i−1}}^{t_i−1} (x^{t+1}_1 − x^t_1) · A x^t_2    (26)
                ∈ O(1) + Σ_{i=1}^k O(1) ⊆ O(√T)    (27)

completing the proof of the theorem.

Next, we provide a game and initial conditions that have regret Θ(√T), establishing that the bound in Theorem 3 is tight.

Theorem 5. Consider the game Matching Pennies with fixed learning rate η = 1 and initial conditions y^0_1 = y^0_2 = (1, 0). Then player 1's regret is Θ(√T) when strategies are updated with (Gradient Descent).

The proof follows similarly to the proof of Theorem 3 by exactly computing the regret in every iteration of (Gradient Descent). The full details appear in Appendix H.

5 Higher Dimensions and Other Regularizers

Many of the techniques introduced in this paper extend both to higher dimensions for Gradient Descent and to other variants of FTRL. Our proof consists mainly of three parts:

1. the "step-size" in the dual space is bounded, i.e., ||y^t_i − y^{t−1}_i|| ≤ b for some constant b;

2. a proof of divergence in the dual space, where the divergence grows linearly when at least one agent is not playing a pure strategy and negligibly when both agents are playing a pure strategy;

3. a proof of recurrence, where the "cycle" length (in the primal/strategy space) is bounded.

The first two components immediately extend to higher dimensions using the current analysis. In regards to the last step, recent advancements in understanding the geometry of learning dynamics in larger games (e.g., Mertikopoulos et al. [2018], Bailey and Piliouras [2019]) suggest that, although non-trivial, this last step can also eventually be rigorously established. However, new ideas are most likely needed for the last step. In Appendix I, we provide more evidence for sublinear regret in higher dimensions, including experiments suggesting that regret grows at approximately O(√T) even when the number of strategies is large.

It is also likely that sublinear regret extends to other variants of FTRL using a similar analysis. In two-by-two zero-sum games, both steps (1) and (3) trivially extend for other variants of FTRL. As we discuss further in Appendix I, the proof for (2) relies primarily on the strict convexity of the regularizer h – a property shared by all variants of FTRL. For Gradient Descent, we make use of this property by showing divergence increases as the strategies move from one pure strategy to another. However, strategies will never reach the boundary for some variants of FTRL. For example, the multiplicative weights update algorithm always selects fully-mixed strategies and (2) does not hold exactly as written. Instead, for any ε, after a finite number of iterations all strategies will appear within ε of the boundary. This proof follows identically to the proof for Gradient Descent in Appendix G. 
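As a concrete check on the sublinear-regret behavior discussed above, the following self-contained simulation (our sketch, not the authors' code) runs (Gradient Descent) on Matching Pennies and reports player 1's regret; the initial payoff vectors are chosen to satisfy the dual-space normalization of Section 3.2 with (y^0_11, y^0_21) = (.2, −.3) as in Figure 2:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def run_matching_pennies(T, eta=0.15, y1=(0.2, -0.2), y2=(-0.3, 0.3)):
    """Run (Gradient Descent) for T iterations on Matching Pennies and return
    player 1's regret: best fixed row in hindsight minus realized utility."""
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    y1, y2 = np.array(y1), np.array(y2)
    row_totals = np.zeros(2)   # sum_t A x2^t: cumulative payoff of each fixed row
    realized = 0.0             # sum_t x1^t . A x2^t: cumulative realized utility
    for _ in range(T):
        x1 = project_simplex(eta * y1)   # strategies are projections of eta * y
        x2 = project_simplex(eta * y2)
        row_totals += A @ x2
        realized += x1 @ A @ x2
        y1 = y1 + A @ x2                 # payoff-vector updates of Section 2.2
        y2 = y2 - A.T @ x1
    return row_totals.max() - realized
```

Over 5000 iterations the regret stays far below the trivial O(T) guarantee (compare Figure 1, which plots player 1's regret for the same step size), consistent with the Θ(√T) bound; the exact constant depends on η and the initial conditions.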
Moreover, the first part of (2) extends to this setting; when one agent is more than ε away from the boundary, the divergence grows linearly. However, to prove that the divergence grows negligibly when both players are within ε of the boundary, we will have to carefully evolve ε over time. This is because for an algorithm like multiplicative weights, the convex conjugate h* is never linear; rather, it becomes arbitrarily close to a linear function as both agents come closer to playing a pure strategy. Alternatively, (2) will more readily follow upon establishing a tighter understanding of the geometry of learning dynamics.

For both higher dimensions and other variants of FTRL, this work provides evidence that regret grows sublinearly when both agents use a fixed step-size. More importantly, it establishes an outline of a proof that relies on further developments in understanding the trajectories of online learning dynamics.

6 Conclusion

We present the first proof of sublinear regret for the most classic FTRL dynamic, online gradient descent, in two-by-two zero-sum games. Our proof techniques leverage geometric information and hinge upon the fact that FTRL dynamics, although typically referred to as "converging" to Nash equilibria in zero-sum games, diverge away from them. Our simulations further suggest that sublinear regret bounds carry over to larger zero-sum games.

7 Acknowledgements

James P. Bailey and Georgios Piliouras acknowledge MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01 and NRF 2018 Fellowship NRF-NRFF2018-07.

References

James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ACM Conference on Economics and Computation, 2018.

James P. Bailey and Georgios Piliouras. Multi-Agent Learning in Network Zero-Sum Games is a Hamiltonian System. In AAMAS, 2019.

James P. 
Bailey, Gauthier Gidel, and Georgios Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent, 2019.

D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The Mechanics of n-Player Differentiable Games. In ICML, 2018.

David Balduzzi, Karl Tuyls, Julien Pérolat, and Thore Graepel. Re-evaluating evaluation. In NIPS, 2018.

David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M. Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. arXiv preprint arXiv:1901.08106, 2019.

Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In COLT, 2019.

Thiparat Chotibut, Fryderyk Falniowski, Michal Misiurewicz, and Georgios Piliouras. Family of chaotic maps from game theory. arXiv preprint arXiv:1807.06831, 2018.

Thiparat Chotibut, Fryderyk Falniowski, Michał Misiurewicz, and Georgios Piliouras. The route to chaos in routing games: Population increase drives period-doubling instability, chaos & inefficiency with Price of Anarchy equal to one. arXiv e-prints, art. arXiv:1906.02486, Jun 2019.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, pages 27:1-27:18, 2019. doi: 10.4230/LIPIcs.ITCS.2019.27. URL https://doi.org/10.4230/LIPIcs.ITCS.2019.27.

Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games.
In Proceedings of the Twenty-second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '11, pages 235-254, Philadelphia, PA, USA, 2011. Society for Industrial and Applied Mathematics. URL http://dl.acm.org/citation.cfm?id=2133036.2133057.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In ICLR, 2018.

Dylan J. Foster, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4727-4735, 2016.

Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press Books. The MIT Press, 1998.

Gauthier Gidel, Hugo Berard, Gaetan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In ICLR, 2019.

Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization, 2009.

R. Kleinberg, K. Ligett, G. Piliouras, and É. Tardos. Beyond the Nash equilibrium barrier. In Symposium on Innovations in Computer Science (ICS), 2011.

P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR, 2019.

Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In ACM-SIAM Symposium on Discrete Algorithms, 2018.

Sai Ganesh Nagarajan, Sameh Mohamed, and Georgios Piliouras.
Three body problems in evolutionary game dynamics: Convergence, periodicity and limit cycles. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 685-693. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M. Czarnecki, Marc Lanctot, Julien Perolat, and Remi Munos. α-rank: Multi-agent evaluation by evolution. arXiv preprint arXiv:1903.01373, 2019.

Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems, pages 5872-5882, 2017.

Marco Pangallo, James Sanders, Tobias Galla, and Doyne Farmer. A taxonomy of learning dynamics in 2 x 2 games, 2017.

Christos Papadimitriou and Georgios Piliouras. From Nash equilibria to chain recurrent sets: An algorithmic solution concept for game theory. Entropy, 20(10), 2018. ISSN 1099-4300. doi: 10.3390/e20100782. URL http://www.mdpi.com/1099-4300/20/10/782.

Georgios Piliouras and Leonard J. Schulman. Learning dynamics and the co-evolution of competing sexual species. In ITCS, 2018.

Georgios Piliouras and Jeff S. Shamma. Optimization despite chaos: Convex relaxations to complex limit sets via Poincaré recurrence. In Proceedings of the Twenty-fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 861-873. SIAM, 2014.

Georgios Piliouras, Carlos Nieto-Granda, Henrik I. Christensen, and Jeff S. Shamma. Persistent patterns: Multi-agent learning beyond equilibrium and utility. In AAMAS, pages 181-188, 2014.

Sasha Rakhlin and Karthik Sridharan.
Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066-3074, 2013.

Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pages 2989-2997, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969442.2969573.

Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The Unusual Effectiveness of Averaging in GAN Training. In ICLR, 2019.

H. Peyton Young. Strategic learning and its limits. Oxford University Press, 2004.