{"title": "Fast Convergence of Regularized Learning in Games", "book": "Advances in Neural Information Processing Systems", "page_first": 2989, "page_last": 2997, "abstract": "We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games. When each player in a game uses an algorithm from our class, their individual regret decays at $O(T^{-3/4})$, while the sum of utilities converges to an approximate optimum at $O(T^{-1})$--an improvement upon the worst case $O(T^{-1/2})$ rates. We show a black-box reduction for any algorithm in the class to achieve $\\tilde{O}(T^{-1/2})$ rates against an adversary, while maintaining the faster rates against algorithms in the class. Our results extend those of Rakhlin and Shridharan~\\cite{Rakhlin2013} and Daskalakis et al.~\\cite{Daskalakis2014}, who only analyzed two-player zero-sum games for specific algorithms.", "full_text": "Fast Convergence of Regularized Learning in Games\n\nVasilis Syrgkanis\nMicrosoft Research\n\nNew York, NY\n\nvasy@microsoft.com\n\nAlekh Agarwal\n\nMicrosoft Research\n\nNew York, NY\n\nalekha@microsoft.com\n\nHaipeng Luo\n\nPrinceton University\n\nPrinceton, NJ\n\nhaipengl@cs.princeton.edu\n\nRobert E. Schapire\nMicrosoft Research\n\nNew York, NY\n\nschapire@microsoft.com\n\nAbstract\n\nWe show that natural classes of regularized learning algorithms with a form of\nrecency bias achieve faster convergence rates to approximate ef\ufb01ciency and to\ncoarse correlated equilibria in multiplayer normal form games. When each player\nin a game uses an algorithm from our class, their individual regret decays at\nO(T 3/4), while the sum of utilities converges to an approximate optimum at\nO(T 1)\u2013an improvement upon the worst case O(T 1/2) rates. 
We show a black-box reduction for any algorithm in the class to achieve Õ(T^{-1/2}) rates against an adversary, while maintaining the faster rates against algorithms in the class. Our results extend those of Rakhlin and Sridharan [17] and Daskalakis et al. [4], who only analyzed two-player zero-sum games for specific algorithms.\n\n1 Introduction\n\nWhat happens when players in a game interact with one another, all of them acting independently and selfishly to maximize their own utilities? If they are smart, we intuitively expect their utilities – both individually and as a group – to grow, perhaps even to approach the best possible. We also expect the dynamics of their behavior to eventually reach some kind of equilibrium. Understanding these dynamics is central to game theory as well as its various application areas, including economics, network routing, auction design, and evolutionary biology.\nIt is natural in this setting for the players to each make use of a no-regret learning algorithm for making their decisions, an approach known as decentralized no-regret dynamics. No-regret algorithms are a strong match for playing games because their regret bounds hold even in adversarial environments. As a benefit, these bounds ensure that each player's utility approaches optimality. When played against one another, it can also be shown that the sum of utilities approaches an approximate optimum [2, 18], and the player strategies converge to an equilibrium under appropriate conditions [6, 1, 8], at rates governed by the regret bounds. Well-known families of no-regret algorithms include multiplicative-weights [13, 7], Mirror Descent [14], and Follow the Regularized/Perturbed Leader [12]. (See [3, 19] for excellent overviews.) 
For all of these, the average regret vanishes at the worst-case rate of O(1/√T), which is unimprovable in fully adversarial scenarios.\nHowever, the players in our setting are facing other similar, predictable no-regret learning algorithms, a chink that hints at the possibility of improved convergence rates for such dynamics. This was first observed and exploited by Daskalakis et al. [4]. For two-player zero-sum games, they developed a decentralized variant of Nesterov's accelerated saddle point algorithm [15] and showed that each player's average regret converges at the remarkable rate of O(1/T). Although the resulting dynamics are somewhat unnatural, in later work, Rakhlin and Sridharan [17] showed surprisingly that the same convergence rate holds for a simple variant of Mirror Descent with the seemingly minor modification that the last utility observation is counted twice.\nAlthough major steps forward, both these works are limited to two-player zero-sum games, the very simplest case. As such, they do not cover many practically important settings, such as auctions or routing games, which are decidedly not zero-sum, and which involve many independent actors.\nIn this paper, we vastly generalize these techniques to the practically important but far more challenging case of arbitrary multi-player normal-form games, giving natural no-regret dynamics whose convergence rates are much faster than previously possible for this general setting.\n\nContributions. We show that the average welfare of the game, that is, the sum of player utilities, converges to approximately optimal welfare at the rate O(1/T), rather than the previously known rate of O(1/√T). Concretely, we show a natural class of regularized no-regret algorithms with recency bias that achieve welfare at least (λ/(1 + μ))OPT − O(1/T), where λ and μ are parameters in a smoothness condition on the game introduced by Roughgarden [18]. 
For the same class of algorithms, we show that each individual player's average regret converges to zero at the rate O(T^{-3/4}). Thus, our results entail an algorithm for computing coarse correlated equilibria in a decentralized manner with significantly faster convergence than existing methods.\nWe additionally give a black-box reduction that preserves the fast rates in favorable environments, while robustly maintaining Õ(1/√T) regret against any opponent in the worst case.\nEven for two-person zero-sum games, our results for general games expose a hidden generality and modularity underlying the previous results [4, 17]. First, our analysis identifies stability and recency bias as key structural ingredients of an algorithm with fast rates. This covers the Optimistic Mirror Descent of Rakhlin and Sridharan [17] as an example, but also applies to optimistic variants of Follow the Regularized Leader (FTRL), including dependence on arbitrary weighted windows in the history as opposed to just the utility from the last round. Recency bias is a behavioral pattern commonly observed in game-theoretic environments [9]; as such, our results can be viewed as a partial theoretical justification. Second, previous approaches in [4, 17] on achieving both faster convergence against similar algorithms while at the same time Õ(1/√T) regret rates against adversaries were shown via ad-hoc modifications of specific algorithms. We give a black-box modification which is not algorithm specific and works for all these optimistic algorithms.\nFinally, we simulate a 4-bidder simultaneous auction game, and compare our optimistic algorithms against Hedge [7] in terms of utilities, regrets and convergence to equilibria.\n\n2 Repeated Game Model and Dynamics\n\nConsider a static game G among a set N of n players. Each player i has a strategy space S_i and a utility function u_i : S_1 × ... × S_n → 
[0, 1] that maps a strategy profile s = (s_1, ..., s_n) to a utility u_i(s). We assume that the strategy space of each player is finite and has cardinality d, i.e. |S_i| = d. We denote with w = (w_1, ..., w_n) a profile of mixed strategies, where w_i ∈ Δ(S_i) and w_{i,x} is the probability of strategy x ∈ S_i. Finally let U_i(w) = E_{s∼w}[u_i(s)], the expected utility of player i.\nWe consider the setting where the game G is played repeatedly for T time steps. At each time step t each player i picks a mixed strategy w^t_i ∈ Δ(S_i). At the end of the iteration each player i observes the expected utility he would have received had he played any possible strategy x ∈ S_i. More formally, let u^t_{i,x} = E_{s_{-i}∼w^t_{-i}}[u_i(x, s_{-i})], where s_{-i} is the set of strategies of all but the i-th player, and let u^t_i = (u^t_{i,x})_{x∈S_i}. At the end of each iteration each player i observes u^t_i. Observe that the expected utility of a player at iteration t is simply the inner product ⟨w^t_i, u^t_i⟩.\nNo-regret dynamics. We assume that the players each decide their strategy w^t_i based on a vanishing regret algorithm. Formally, for each player i, the regret after T time steps is equal to the maximum gain he could have achieved by switching to any other fixed strategy:\n\nr_i(T) = sup_{w*_i ∈ Δ(S_i)} Σ_{t=1}^T ⟨w*_i − w^t_i, u^t_i⟩.\n\nThe algorithm has vanishing regret if r_i(T) = o(T).\n\nApproximate Efficiency of No-Regret Dynamics. We are interested in analyzing the average welfare of such vanishing regret sequences. For a given strategy profile s the social welfare is defined as the sum of the player utilities: W(s) = Σ_{i∈N} u_i(s). We overload notation to denote W(w) = E_{s∼w}[W(s)]. 
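In code, the regret above is easy to compute from the played mixed strategies and the observed utility vectors, since the supremum over the simplex is attained at a pure strategy. A minimal sketch (the toy sequences below are illustrative, not from the paper):

```python
import numpy as np

def regret(ws, us):
    """r_i(T) = sup_{w* in simplex} sum_t <w* - w^t, u^t>.
    The supremum is attained at a pure strategy, so it equals the best
    fixed strategy's cumulative utility minus the realized utility."""
    ws, us = np.asarray(ws), np.asarray(us)
    best_fixed = us.sum(axis=0).max()      # max_x sum_t u^t_{i,x}
    realized = float(np.sum(ws * us))      # sum_t <w^t_i, u^t_i>
    return best_fixed - realized

# always playing the eventual best strategy gives zero regret
us = [[1.0, 0.0], [1.0, 0.5], [1.0, 0.2]]
print(regret([[1.0, 0.0]] * 3, us))   # 0.0
print(regret([[0.5, 0.5]] * 3, us))   # 3.0 - 1.85 = 1.15
```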
We want to lower bound how far the average welfare of the sequence is, with respect to the optimal welfare of the static game:\n\nOPT = max_{s ∈ S_1 × ... × S_n} W(s).\n\nThis is the optimal welfare achievable in the absence of player incentives and if a central coordinator could dictate each player's strategy. We next define a class of games first identified by Roughgarden [18] on which we can approximate the optimal welfare using decoupled no-regret dynamics.\nDefinition 1 (Smooth game [18]). A game is (λ, μ)-smooth if there exists a strategy profile s* such that for any strategy profile s: Σ_{i∈N} u_i(s*_i, s_{-i}) ≥ λ OPT − μ W(s).\nIn words, any player using his optimal strategy continues to do well irrespective of other players' strategies. This condition directly implies near-optimality of no-regret dynamics as we show below.\nProposition 2. In a (λ, μ)-smooth game, if each player i suffers regret at most r_i(T), then:\n\n(1/T) Σ_{t=1}^T W(w^t) ≥ (λ/(1 + μ)) OPT − (1/(1 + μ)) (1/T) Σ_{i∈N} r_i(T) = (1/ρ) OPT − (1/(1 + μ)) (1/T) Σ_{i∈N} r_i(T),\n\nwhere the factor ρ = (1 + μ)/λ is called the price of anarchy (POA).\n\nThis proposition is essentially a more explicit version of Roughgarden's result [18]; we provide a proof in the appendix for completeness. The result shows that the convergence to POA is driven by the quantity (1/(1 + μ)) (1/T) Σ_{i∈N} r_i(T). There are many algorithms which achieve a regret rate of r_i(T) = O(√(log(d) T)), in which case the latter theorem would imply that the average welfare converges to POA at a rate of O(n √(log(d)/T)). 
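Definition 1 can be checked by brute force on small games. The sketch below (a hypothetical 2x2 coordination game of our choosing, not an example from the paper) searches for a witness profile s* for candidate (λ, μ) values:

```python
import itertools

def is_smooth(utils, dims, lam, mu, tol=1e-9):
    """Brute-force Definition 1: is there an s* such that
    sum_i u_i(s*_i, s_{-i}) >= lam*OPT - mu*W(s) for every profile s?
    utils[i] maps a strategy-profile tuple to player i's utility."""
    profiles = list(itertools.product(*[range(d) for d in dims]))
    welfare = {s: sum(u[s] for u in utils) for s in profiles}
    opt = max(welfare.values())
    for s_star in profiles:
        if all(sum(utils[i][s[:i] + (s_star[i],) + s[i + 1:]]
                   for i in range(len(dims))) >= lam * opt - mu * welfare[s] - tol
               for s in profiles):
            return True
    return False

# toy 2x2 coordination game: each player gets 1 iff both pick the same action
u = {(a, b): 1.0 if a == b else 0.0 for a in range(2) for b in range(2)}
print(is_smooth([u, u], (2, 2), lam=0.5, mu=0.5))   # True (POA bound (1+mu)/lam = 3)
print(is_smooth([u, u], (2, 2), lam=1.0, mu=0.0))   # False
```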
As we will show, for some natural classes of no-regret algorithms the average welfare converges at the much faster rate of O(n^2 log(d)/T).\n\n3 Fast Convergence to Approximate Efficiency\n\nIn this section, we present our main theoretical results characterizing a class of no-regret dynamics which lead to faster convergence in smooth games. We begin by describing this class.\nDefinition 3 (RVU property). We say that a vanishing regret algorithm satisfies the Regret bounded by Variation in Utilities (RVU) property with parameters α > 0 and 0 < β ≤ γ and a pair of dual norms (‖·‖, ‖·‖_*)¹ if its regret on any sequence of utilities u^1, u^2, ..., u^T is bounded as\n\nΣ_{t=1}^T ⟨w* − w^t, u^t⟩ ≤ α + β Σ_{t=1}^T ‖u^t − u^{t−1}‖²_* − γ Σ_{t=1}^T ‖w^t − w^{t−1}‖².   (1)\n\nTypical online learning algorithms such as Mirror Descent and FTRL do not satisfy the RVU property in their vanilla form, as the middle term grows as Σ_{t=1}^T ‖u^t‖²_* for these methods. However, Rakhlin and Sridharan [16] give a modification of Mirror Descent with this property, and we will present a similar variant of FTRL in the sequel.\nWe now present two sets of results when each player uses an algorithm with this property. The first discusses the convergence of social welfare, while the second governs the convergence of the individual players' utilities at a fast rate.\n\n¹The dual to a norm ‖·‖ is defined as ‖v‖_* = sup_{‖u‖≤1} ⟨u, v⟩.\n\n3.1 Fast Convergence of Social Welfare\n\nGiven Proposition 2, we only need to understand the evolution of the sum of players' regrets Σ_{i∈N} r_i(T) in order to obtain convergence rates of the social welfare. Our main result in this section bounds this sum when each player uses dynamics with the RVU property.\nTheorem 4. 
Suppose that the algorithm of each player i satisfies the RVU property with parameters α, β and γ such that β ≤ γ/(n − 1)² and ‖·‖ = ‖·‖_1. Then Σ_{i∈N} r_i(T) ≤ α n.\nProof. Since u_i(s) ≤ 1, the definitions imply: ‖u^t_i − u^{t−1}_i‖_* ≤ Σ_{s_{-i}} |Π_{j≠i} w^t_{j,s_j} − Π_{j≠i} w^{t−1}_{j,s_j}|. The latter is the total variation distance of two product distributions. By known properties of total variation (see e.g. [11]), this is bounded by the sum of the total variations of each marginal distribution: ≤ Σ_{j≠i} ‖w^t_j − w^{t−1}_j‖. By Jensen's inequality, (Σ_{j≠i} ‖w^t_j − w^{t−1}_j‖)² ≤ (n − 1) Σ_{j≠i} ‖w^t_j − w^{t−1}_j‖², so that\n\nΣ_{i∈N} ‖u^t_i − u^{t−1}_i‖²_* ≤ (n − 1) Σ_{i∈N} Σ_{j≠i} ‖w^t_j − w^{t−1}_j‖² = (n − 1)² Σ_{i∈N} ‖w^t_i − w^{t−1}_i‖².   (2)\n\nThe theorem follows by summing up the RVU property (1) for each player i and observing that the summation of the second terms is smaller than that of the third terms and thereby can be dropped.\n\nRemark: The rates from the theorem depend on α, which will be O(1) in the sequel. The above theorem extends to the case where ‖·‖ is any norm equivalent to the ℓ_1 norm. The resulting requirement on β in terms of γ can however be more stringent. Also, the theorem does not require that all players use the same no-regret algorithm unlike previous results [4, 17], as long as each player's algorithm satisfies the RVU property with a common bound on the constants.\nWe now instantiate the result with examples that satisfy the RVU property with different constants.\n\n3.1.1 Optimistic Mirror Descent\nThe optimistic mirror descent (OMD) algorithm of Rakhlin and Sridharan [16] is parameterized by an adaptive predictor sequence M^t_i and a regularizer² R which is 1-strongly convex³ with respect to a norm ‖·‖. Let D_R denote the Bregman divergence associated with R. 
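The key step in this proof, that the ℓ1 distance between two product distributions is at most the sum of the ℓ1 distances of their marginals, can be sanity-checked numerically. A small sketch with random marginals (our illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def product_dist(marginals):
    """Flatten the product distribution of independent marginals."""
    p = marginals[0]
    for m in marginals[1:]:
        p = np.outer(p, m).ravel()
    return p

# two profiles of n-1 = 3 opponent mixed strategies over 3 strategies each
ps = [rng.dirichlet(np.ones(3)) for _ in range(3)]
qs = [rng.dirichlet(np.ones(3)) for _ in range(3)]

lhs = np.abs(product_dist(ps) - product_dist(qs)).sum()   # l1 gap of products
rhs = sum(np.abs(p - q).sum() for p, q in zip(ps, qs))    # sum of marginal l1 gaps
print(lhs <= rhs)   # the inequality behind Equation (2)
```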
Then the update rule is defined as follows: let g^0_i = argmin_{g ∈ Δ(S_i)} R(g) and\n\nΠ(u, g) = argmax_{w ∈ Δ(S_i)} η·⟨w, u⟩ − D_R(w, g),\n\nthen: w^t_i = Π(M^t_i, g^{t−1}_i), and g^t_i = Π(u^t_i, g^{t−1}_i).\nThen the following proposition can be obtained for this method.\nProposition 5. The OMD algorithm using stepsize η and M^t_i = u^{t−1}_i satisfies the RVU property with constants α = R/η, β = η, γ = 1/(8η), where R = max_i sup_f D_R(f, g^0_i).\nThe proposition follows by further crystallizing the arguments of Rakhlin and Sridharan [17], and we provide a proof in the appendix for completeness. The above proposition, along with Theorem 4, immediately yields the following corollary, which had been proved by Rakhlin and Sridharan [17] for two-person zero-sum games, and which we here extend to general games.\nCorollary 6. If each player runs OMD with M^t_i = u^{t−1}_i and stepsize η = 1/(√8 (n − 1)), then we have Σ_{i∈N} r_i(T) ≤ nR/η ≤ n(n − 1)√8 R = O(1).\nThe corollary follows by noting that the condition β ≤ γ/(n − 1)² is met with our choice of η.\n²Here and in the sequel, we can use a different regularizer R_i for each player i, without qualitatively affecting any of the results.\n³R is 1-strongly convex if R((u + v)/2) ≤ (R(u) + R(v))/2 − ‖u − v‖²/8, for all u, v.\n\n3.1.2 Optimistic Follow the Regularized Leader\nWe next consider a different class of algorithms denoted as optimistic follow the regularized leader (OFTRL). This algorithm is similar but not equivalent to OMD, and is an analogous extension of standard FTRL [12]. 
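For the entropy regularizer, D_R is the KL divergence and both OMD updates have a closed multiplicative form. The sketch below is our instantiation of the paper's general rule (the entropy choice and the convention M^1 = 0 are our assumptions):

```python
import numpy as np

def md_step(g, u, eta):
    """Closed form of argmax_w eta*<w,u> - KL(w||g) on the simplex: w ∝ g*exp(eta*u)."""
    w = g * np.exp(eta * u)
    return w / w.sum()

def omd_play(utilities, eta):
    """Optimistic Mirror Descent with one-step recency bias M^t = u^{t-1}."""
    d = len(utilities[0])
    g = np.full(d, 1.0 / d)        # g^0 = argmin R (uniform for entropy)
    M = np.zeros(d)                # M^1 = 0: no prediction before round 1 (our convention)
    plays = []
    for u in utilities:
        w = md_step(g, M, eta)     # w^t = Pi(M^t, g^{t-1})
        plays.append(w)
        g = md_step(g, u, eta)     # g^t = Pi(u^t, g^{t-1})
        M = u                      # predictor for the next round
    return plays

plays = omd_play([np.array([1.0, 0.0])] * 20, eta=0.5)
print(plays[-1])   # weight concentrates on the better strategy
```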
This algorithm takes the same parameters as for OMD and is defined as follows: let w^0_i = argmin_{w ∈ Δ(S_i)} R(w) and:\n\nw^t_i = argmax_{w ∈ Δ(S_i)} ⟨w, Σ_{τ=1}^{t−1} u^τ_i + M^t_i⟩ − R(w)/η.\n\nWe consider three variants of OFTRL with different choices of the sequence M^t_i, incorporating the recency bias in different forms.\n\nOne-step recency bias: The simplest form of OFTRL uses M^t_i = u^{t−1}_i and obtains the following result, where R = max_i (sup_{f ∈ Δ(S_i)} R(f) − inf_{f ∈ Δ(S_i)} R(f)).\nProposition 7. The OFTRL algorithm using stepsize η and M^t_i = u^{t−1}_i satisfies the RVU property with constants α = R/η, β = η and γ = 1/(4η).\nCombined with Theorem 4, this yields the following constant bound on the total regret of all players:\nCorollary 8. If each player runs OFTRL with M^t_i = u^{t−1}_i and η = 1/(2(n − 1)), then we have Σ_{i∈N} r_i(T) ≤ nR/η ≤ 2n(n − 1)R = O(1).\nRakhlin and Sridharan [16] also analyze an FTRL variant, but require a self-concordant barrier for the constraint set as opposed to an arbitrary strongly convex regularizer, and their bound is missing the crucial negative terms of the RVU property which are essential for obtaining Theorem 4.\n\nH-step recency bias: More generally, given a window size H, one can define M^t_i = Σ_{τ=t−H}^{t−1} u^τ_i / H. We have the following proposition.\nProposition 9. The OFTRL algorithm using stepsize η and M^t_i = Σ_{τ=t−H}^{t−1} u^τ_i / H satisfies the RVU property with constants α = R/η, β = η H² and γ = 1/(4η).\nSetting η = 1/(2H(n − 1)), we obtain the analogue of Corollary 8, with an extra factor of H.\n\nGeometrically discounted recency bias: The next proposition considers an alternative form of recency bias which includes all the previous utilities, but with a geometric discounting.\nProposition 10. 
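With the entropy regularizer, OFTRL also has a closed form: weights proportional to exp(η(Σ_{τ<t} u^τ + M^t)), which for M^t_i = u^{t−1}_i is exactly the Optimistic Hedge rule of Section 5. A minimal sketch of the three predictor choices (the entropy instantiation is our assumption; shown for rounds t ≥ 2, where the history is non-empty):

```python
import numpy as np

def oftrl_weights(past_us, M, eta):
    """w^t ∝ exp(eta * (sum_{tau<t} u^tau + M^t)) for the entropy regularizer."""
    score = eta * (np.sum(past_us, axis=0) + M)
    w = np.exp(score - score.max())   # subtract the max for numerical stability
    return w / w.sum()

def one_step_M(past_us):
    return past_us[-1]                        # M^t = u^{t-1}

def h_step_M(past_us, H):
    return np.mean(past_us[-H:], axis=0)      # M^t = average of last H utilities

def geometric_M(past_us, delta):
    # weight delta^{t-1-tau} on u^tau, normalized: newest utility gets weight 1
    weights = delta ** np.arange(len(past_us))[::-1]
    return weights @ np.array(past_us) / weights.sum()
```

With delta = 1 the geometric predictor reduces to the H-step average over the full history, matching the intuition that smaller delta concentrates the bias on recent rounds.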
The OFTRL algorithm using stepsize η and M^t_i = Σ_{τ=0}^{t−1} δ^τ u^{t−1−τ}_i / Σ_{τ=0}^{t−1} δ^τ satisfies the RVU property with constants α = R/η, β = η/(1 − δ)³ and γ = 1/(8η).\nNote that these choices for M^t_i can also be used in OMD with qualitatively similar results.\n\n3.2 Fast Convergence of Individual Utilities\n\nThe previous section shows implications of the RVU property on the social welfare. This section complements these with a similar result for each player's individual utility.\nTheorem 11. Suppose that the players use algorithms satisfying the RVU property with parameters α > 0, β > 0, γ ≥ 0. If we further have the stability property ‖w^t_i − w^{t+1}_i‖ ≤ κ, then for any player: Σ_{t=1}^T ⟨w*_i − w^t_i, u^t_i⟩ ≤ α + β κ² (n − 1)² T.\nSimilar reasoning as in Theorem 4 yields: ‖u^t_i − u^{t−1}_i‖²_* ≤ (n − 1) Σ_{j≠i} ‖w^t_j − w^{t−1}_j‖² ≤ (n − 1)² κ², and summing the terms gives the theorem.\nNoting that OFTRL satisfies the RVU property with constants given in Proposition 7 and the stability property with κ = 2η (see Lemma 20 in the appendix), we have the following corollary.\nCorollary 12. If all players use the OFTRL algorithm with M^t_i = u^{t−1}_i and η = (n − 1)^{−1/2} T^{−1/4}, then we have Σ_{t=1}^T ⟨w*_i − w^t_i, u^t_i⟩ ≤ (R + 4)√(n − 1) · T^{1/4}.\nSimilar results hold for the other forms of recency bias, as well as for OMD. Corollary 12 gives a fast convergence rate of the players' strategies to the set of coarse correlated equilibria (CCE) of the game. This improves the previously known convergence rate √T (e.g. 
[10]) to CCE using natural, decoupled no-regret dynamics defined in [4].\n\n4 Robustness to Adversarial Opponent\n\nSo far we have shown simple dynamics with rapid convergence properties in favorable environments when each player in the game uses an algorithm with the RVU property. It is natural to wonder if this comes at the cost of worst-case guarantees when some players do not use algorithms with this property. Rakhlin and Sridharan [17] address this concern by modifying the OMD algorithm with additional smoothing and adaptive step-sizes so as to preserve the fast rates in the favorable case while still guaranteeing O(1/√T) regret for each player, no matter how the opponents play. It is not so obvious how this modification might extend to other procedures, and it seems undesirable to abandon the black-box regret transformations we used to obtain Theorem 4. In this section, we present a generic way of transforming an algorithm which satisfies the RVU property so that it retains the fast convergence in favorable settings, but always guarantees a worst-case regret of Õ(1/√T).\nIn order to present our modification, we need a parametric form of the RVU property which will also involve a tunable parameter of the algorithm. For most online learning algorithms, this will correspond to the step-size parameter used by the algorithm.\nDefinition 13 (RVU(ρ) property). We say that a parametric algorithm A(ρ) satisfies the Regret bounded by Variation in Utilities(ρ) (RVU(ρ)) property with parameters α, β, γ > 0 and a pair of dual norms (‖·‖, ‖·‖_*) if its regret on any sequence of utilities u^1, u^2, ...
, u^T is bounded as\n\nΣ_{t=1}^T ⟨w* − w^t, u^t⟩ ≤ α/ρ + ρβ Σ_{t=1}^T ‖u^t − u^{t−1}‖²_* − (γ/ρ) Σ_{t=1}^T ‖w^t − w^{t−1}‖².   (3)\n\nIn both OMD and OFTRL algorithms from Section 3, the parameter ρ is precisely the stepsize η. We now show an adaptive choice of ρ according to an epoch-based doubling schedule.\n\nBlack-box reduction. Given a parametric algorithm A(ρ) as a black-box we construct a wrapper A' based on the doubling trick: The algorithm of each player proceeds in epochs. At each epoch r the player i has an upper bound of B_r on the quantity Σ_{t=1}^T ‖u^t_i − u^{t−1}_i‖²_*. We start with a parameter η* and B_1 = 1, and for τ = 1, 2, ..., T repeat:\n1. Play according to A(η_r) and receive u^τ_i.\n2. If Σ_{t=1}^τ ‖u^t_i − u^{t−1}_i‖²_* ≥ B_r:\n(a) Update r ← r + 1, B_r ← 2B_r, η_r = min{√(α/B_r), η*}, with α as in Equation (3).\n(b) Start a new run of A with parameter η_r.\nTheorem 14. Algorithm A' achieves regret at most the minimum of the following two terms:\n\nΣ_{t=1}^T ⟨w*_i − w^t_i, u^t_i⟩ ≤ log(T) (2 + α/η* + (2 + η*·β) Σ_{t=1}^T ‖u^t_i − u^{t−1}_i‖²_*) − (γ/η*) Σ_{t=1}^T ‖w^t_i − w^{t−1}_i‖²;   (4)\n\nΣ_{t=1}^T ⟨w*_i − w^t_i, u^t_i⟩ ≤ log(T) (1 + α/η*) + (1 + α·β) · √(2 Σ_{t=1}^T ‖u^t_i − u^{t−1}_i‖²_*).   (5)\n\nThat is, the algorithm satisfies the RVU property, and also has regret that can never exceed Õ(√T). The theorem thus yields the following corollary, which illustrates the stated robustness of A'.\nCorollary 15. Algorithm A', with η* = γ/((2 + β)(n − 1)² log(T)), achieves regret Õ(√T) against any adversarial sequence, while at the same time satisfying the conditions of Theorem 4. Thereby, if all players use such an algorithm, then: Σ_{i∈N} r_i(T) ≤ n log(T)(α/η* + 2) = Õ(1).\nProof. Observe that for such η*, we have that: (2 + η*·β) log(T) ≤ (2 + β) log(T) ≤ γ/(η*(n − 1)²). Therefore, algorithm A' satisfies the sufficient conditions of Theorem 4.\nIf A(ρ) is the OFTRL algorithm, then we know by Proposition 7 that the above result applies with α = R = max_w R(w), β = 1, γ = 1/4 and ρ = η. Setting η* = γ/((2 + β)(n − 1)² log(T)) = 1/(12(n − 1)² log(T)), the resulting algorithm A' will have regret at most Õ(n²√T) against an arbitrary adversary, while if all players use algorithm A' then Σ_{i∈N} r_i(T) = O(n³ log(T)).\nAn analogue of Theorem 11 can also be established for this algorithm:\nCorollary 16. If A satisfies the RVU(ρ) property, and also ‖w^t_i − w^{t−1}_i‖ ≤ κρ, then A' with η* = T^{−1/4} achieves regret Õ(T^{1/4}) if played against itself, and Õ(√T) against any opponent.\nOnce again, OFTRL satisfies the above conditions with κ = 2, implying robust convergence.\n\nFigure 1: Maximum and sum of individual regrets over time under the Hedge (blue) and Optimistic Hedge (red) dynamics.\n\n5 Experimental Evaluation\n\nWe analyzed the performance of optimistic follow the regularized leader with the entropy regularizer, which corresponds to the Hedge algorithm [7] modified so that the last iteration's utility for each strategy is double counted; we refer to it as Optimistic Hedge. More formally, the probability of player i playing strategy j at iteration T is proportional to exp(η · (Σ_{t=1}^{T−2} u^t_{ij} + 2u^{T−1}_{ij})), rather than exp(η · Σ_{t=1}^{T−1} u^t_{ij}) as is standard for Hedge.\nWe studied a simple auction where n players are bidding for m items. Each player has a value v for getting at least one item and no extra value for more items. The utility of a player is the value for the allocation he derived minus the payment he has to make. The game is defined as follows: simultaneously each player picks one of the m items and submits a bid on that item (we assume bids to be discretized). For each item, the highest bidder wins and pays his bid. We let players play this game repeatedly with each player invoking either Hedge or Optimistic Hedge. This game, and generalizations of it, are known to be (1 − 1/e, 0)-smooth [20], if we also view the auctioneer as a player whose utility is the revenue. The welfare of the game is the value of the resulting allocation, hence not a constant-sum game. The welfare maximization problem corresponds to the unweighted bipartite matching problem. The POA captures how far from the optimal matching is the average allocation of the dynamics. By smoothness we know it converges to at least 1 − 1/e of the optimal.\n\nFast convergence of individual and average regret. We run the game for n = 4 bidders and m = 4 items and valuation v = 20. The bids are discretized to be any integer in [1, 20]. 
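The simulation just described can be reconstructed in a few lines. The sketch below is our reconstruction, not the authors' code: the strict-highest-bid win rule (ties lose) and the use of raw, unscaled utilities are our assumptions.

```python
import numpy as np

N, M, V = 4, 4, 20.0                 # 4 bidders, 4 items, value 20
BIDS = np.arange(1, 21)              # integer bids in [1, 20]
D = M * len(BIDS)                    # strategies are (item, bid) pairs

def expected_utils(ws, i):
    """Expected utility of each (item, bid) strategy for player i given the
    others' mixed strategies ws (win = strictly highest bid; our tie rule)."""
    u = np.zeros((M, len(BIDS)))
    for k in range(M):
        for bi, b in enumerate(BIDS):
            p_win = 1.0
            for j in range(N):
                if j != i:
                    wj = ws[j].reshape(M, len(BIDS))
                    p_win *= 1.0 - wj[k, bi:].sum()  # j neither ties nor outbids on item k
            u[k, bi] = (V - b) * p_win
    return u.ravel()

def run(T, eta, optimistic):
    """Run Hedge (optimistic=False) or Optimistic Hedge (optimistic=True)."""
    ws = [np.full(D, 1.0 / D) for _ in range(N)]
    S = np.zeros((N, D))                          # cumulative utility vectors
    cum = np.zeros((N, D)); realized = np.zeros(N)
    for _ in range(T):
        us = np.array([expected_utils(ws, i) for i in range(N)])
        cum += us
        realized += np.einsum('id,id->i', np.array(ws), us)
        S += us
        for i in range(N):
            # Optimistic Hedge double-counts the last round's utility
            score = eta * ((S[i] + us[i]) if optimistic else S[i])
            w = np.exp(score - score.max())       # stabilized exponentiation
            ws[i] = w / w.sum()
    return cum.max(axis=1) - realized             # per-player regret r_i(T)

hedge = run(T=50, eta=0.1, optimistic=False)
opt_hedge = run(T=50, eta=0.1, optimistic=True)
print(hedge.sum(), opt_hedge.sum())
```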
We find that the sum of the regrets and the maximum individual regret of each player are remarkably lower under Optimistic Hedge as opposed to Hedge. In Figure 1 we plot the maximum individual regret as well as the sum of the regrets under the two algorithms, using η = 0.1 for both methods. Thus convergence to the set of coarse correlated equilibria is substantially faster under Optimistic Hedge, confirming our results in Section 3.2. We also observe similar behavior when each player only has value on a randomly picked player-specific subset of items, or uses other step sizes.\n\nFigure 2: Expected bid and per-iteration utility of a player on one of the four items over time, under Hedge (blue) and Optimistic Hedge (red) dynamics.\n\nMore stable dynamics. We observe that the behavior under Optimistic Hedge is more stable than under Hedge. In Figure 2, we plot the expected bid of a player on one of the items and his expected utility under the two dynamics. Hedge exhibits the sawtooth behavior that was observed in the generalized first price auction run by Overture (see [5, p. 21]). In stunning contrast, Optimistic Hedge leads to more stable expected bids over time. This stability property of Optimistic Hedge is one of the main intuitive reasons for the fast convergence of its regret.\n\nWelfare. In this class of games, we did not observe any significant difference between the average welfare of the methods. 
The key reason is the following: the proof that no-regret dynamics are approximately efficient (Proposition 2) only relies on the fact that each player does not have regret against the strategy s*_i used in the definition of a smooth game. In this game, regret against these strategies is experimentally comparable under both algorithms, even though regret against the best fixed strategy is remarkably different. This indicates a possibility for faster rates for Hedge in terms of welfare. In Appendix H, we show fast convergence of the efficiency of Hedge for cost-minimization games, though with a worse POA.\n\n6 Discussion\n\nThis work extends and generalizes a growing body of work on decentralized no-regret dynamics in many ways. We demonstrate a class of no-regret algorithms which enjoy rapid convergence when played against each other, while being robust to adversarial opponents. This has implications in computation of correlated equilibria, as well as understanding the behavior of agents in complex multi-player games. There are a number of interesting questions and directions for future research which are suggested by our results, including the following:\nConvergence rates for vanilla Hedge: The fast rates of our paper do not apply to algorithms such as Hedge without modification. Is this modification to satisfy RVU only sufficient or also necessary? If not, are there counterexamples? In the supplement, we include a sketch hinting at such a counterexample, but also showing fast rates to a worse equilibrium than our optimistic algorithms.\nConvergence of players' strategies: The OFTRL algorithm often produces much more stable trajectories empirically, as the players converge to an equilibrium, as opposed to say Hedge. 
A precise quantification of this desirable behavior would be of great interest.\nBetter rates with partial information: If the players do not observe the expected utility function, but only the moves of the other players at each round, can we still obtain faster rates?\n\nReferences\n[1] A. Blum and Y. Mansour. Learning, regret minimization, and equilibria. In Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay Vazirani, editors, Algorithmic Game Theory, chapter 4, pages 4–30. Cambridge University Press, 2007.\n[2] Avrim Blum, MohammadTaghi Hajiaghayi, Katrina Ligett, and Aaron Roth. Regret minimization and the price of total anarchy. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC '08, pages 373–382, New York, NY, USA, 2008. ACM.\n[3] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.\n[4] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327–348, 2014.\n[5] Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. Working Paper 11765, National Bureau of Economic Research, November 2005.\n[6] Dean P. Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40–55, 1997.\n[7] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.\n[8] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79–103, 1999.\n[9] Drew Fudenberg and Alexander Peysakhovich. 
Recency, records and recaps: Learning and non-equilibrium behavior in a simple decision problem. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC '14, pages 971–986, New York, NY, USA, 2014. ACM.\n[10] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.\n[11] Wassily Hoeffding and J. Wolfowitz. Distinguishability of sets of distributions. Annals of Mathematical Statistics, 29(3):700–718, 1958.\n[12] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.\n[13] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.\n[14] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. 1983.\n[15] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.\n[16] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT 2013, pages 993–1019, 2013.\n[17] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.\n[18] T. Roughgarden. Intrinsic robustness of the price of anarchy. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pages 513–522, New York, NY, USA, 2009. ACM.\n[19] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, February 2012.\n[20] Vasilis Syrgkanis and Éva Tardos. Composable and efficient mechanisms. 
In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 211–220, New York, NY, USA, 2013. ACM.\n", "award": [], "sourceid": 1687, "authors": [{"given_name": "Vasilis", "family_name": "Syrgkanis", "institution": "Microsoft Research"}, {"given_name": "Alekh", "family_name": "Agarwal", "institution": "Microsoft Research"}, {"given_name": "Haipeng", "family_name": "Luo", "institution": "Princeton University"}, {"given_name": "Robert", "family_name": "Schapire", "institution": "Microsoft Research"}]}