{"title": "Learning with Bandit Feedback in Potential Games", "book": "Advances in Neural Information Processing Systems", "page_first": 6369, "page_last": 6378, "abstract": "This paper examines the equilibrium convergence properties of no-regret learning with exponential weights in potential games. To establish convergence with minimal information requirements on the players' side, we focus on two frameworks: the semi-bandit case (where players have access to a noisy estimate of their payoff vectors, including strategies they did not play), and the bandit case (where players are only able to observe their in-game, realized payoffs). In the semi-bandit case, we show that the induced sequence of play converges almost surely to a Nash equilibrium at a quasi-exponential rate. In the bandit case, the same result holds for approximate Nash equilibria if we introduce a constant exploration factor that guarantees that action choice probabilities never become arbitrarily small. In particular, if the algorithm is run with a suitably decreasing exploration factor, the sequence of play converges to a bona fide Nash equilibrium with probability 1.", "full_text": "Learning with Bandit Feedback in Potential Games\n\nLRI-CNRS, Universit\u00e9 Paris-Sud,Universit\u00e9 Paris-Saclay, France\n\nJohanne Cohen\n\njohanne.cohen@lri.fr\n\nLIX, Ecole Polytechnique, CNRS, AMIBio, Inria, Universit\u00e9 Paris-Saclay\n\namelie.heliou@polytechnique.edu\n\nAm\u00e9lie H\u00e9liou\n\nPanayotis Mertikopoulos\n\nUniv. Grenoble Alpes, CNRS, Inria, LIG, F-38000, Grenoble, France\n\npanayotis.mertikopoulos@imag.fr\n\nAbstract\n\nThis paper examines the equilibrium convergence properties of no-regret learning\nwith exponential weights in potential games. 
To establish convergence with mini-\nmal information requirements on the players\u2019 side, we focus on two frameworks:\nthe semi-bandit case (where players have access to a noisy estimate of their payoff\nvectors, including strategies they did not play), and the bandit case (where players\nare only able to observe their in-game, realized payoffs). In the semi-bandit case,\nwe show that the induced sequence of play converges almost surely to a Nash\nequilibrium at a quasi-exponential rate. In the bandit case, the same result holds for\n\u03b5-approximations of Nash equilibria if we introduce an exploration factor \u03b5 > 0\nthat guarantees that action choice probabilities never fall below \u03b5. In particular, if\nthe algorithm is run with a suitably decreasing exploration factor, the sequence of\nplay converges to a bona \ufb01de Nash equilibrium with probability 1.\n\n1\n\nIntroduction\n\nGiven the manifest complexity of computing Nash equilibria, a central question that arises is whether\nsuch outcomes could result from a dynamic process in which players act on empirical information\non their strategies\u2019 performance over time. This question becomes particularly important when the\nplayers\u2019 view of the game is obstructed by situational uncertainty and the \u201cfog of war\u201d: for instance,\nwhen deciding which route to take to work each morning, a commuter is typically unaware of how\nmany other commuters there are at any given moment, what their possible strategies are, how to\nbest respond to their choices, etc. In fact, in situations of this kind, players may not even know that\nthey are involved in a game; as such, it does not seem reasonable to assume full rationality, common\nknowledge of rationality, \ufb02awless execution, etc. 
to justify the Nash equilibrium prediction.\nA compelling alternative to this \u201crationalistic\u201d viewpoint is provided by the framework of online\nlearning, where players are treated as oblivious entities facing a repeated decision process with a\npriori unknown rules and outcomes. In this context, when the players have no Bayesian prior on their\nenvironment, the most widely used performance criterion is that of regret minimization, a worst-case\nguarantee that was \ufb01rst introduced by Hannan [1], and which has given rise to a vigorous literature at\nthe interface of optimization, statistics and theoretical computer science \u2013 for a survey, see [2, 3]. By\nthis token, our starting point in this paper is the following question:\n\nIf all players of a repeated game follow a no-regret algorithm,\n\ndoes the induced sequence of play converge to Nash equilibrium?\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFor concreteness, we focus on the exponential weights (EW) scheme [4\u20137], one of the most popular\nand widely studied algorithms for no-regret learning. In a nutshell, the main idea of the method\nis that the optimizing agent tallies the cumulative payoffs of each action and then employs a pure\nstrategy with probability proportional to the exponential of these cumulative \u201cscores\u201d. Under this\nscheme, players are guaranteed a universal, min-max O(T 1/2) regret bound (with T denoting the\nhorizon of play), and their empirical frequency of play is known to converge to the game\u2019s set of\ncoarse correlated equilibria (CCE) [8].\nIn this way, no-regret learning would seem to provide a positive partial answer to our original\nquestion: coarse correlated equilibria are indeed learnable if all players follow an exponential weights\nlearning scheme. 
On the \ufb02ip side however, the set of coarse correlated equilibria may contain highly\nnon-rationalizable strategies, so the end prediction of empirical convergence to such equilibria is\nfairly lax. For instance, in a recent paper, Viossat and Zapechelnyuk constructed a 4 \u00d7 4 variant\nof Rock-Paper-Scissors with a coarse correlated equilibrium that assigns positive weight only on\nstrictly dominated strategies [9]. Even more recently, [10] showed that the mean dynamics of the\nexponential weights method (and, more generally, any method \u201cfollowing the regularized leader\u201d)\nmay cycle in perpetuity in zero-sum games, precluding any possibility of convergence to equilibrium\nin this case. Thus, in view of these negative results, a more calibrated answer to the above question is\n\u201cnot always\u201d: especially when the issue at hand is convergence to a Nash equilibrium (as opposed to\ncoarser notions), \u201cno regret\u201d is a rather loose guarantee.\n\nPaper outline and summary of results. To address the above limitations, we focus on two issues:\n\na) Convergence to Nash equilibrium (as opposed to correlated equilibria, coarse or otherwise).\n\nb) The convergence of the actual sequence of play (as opposed to empirical frequencies).\n\nThe reason for focusing on the actual sequence of play is that time-averages provide a fairly weak\nconvergence mode: a priori, a player could oscillate between non-equilibrium strategies with subopti-\nmal payoffs, but time-averages might still converge to equilibrium. 
On the other hand, convergence of the actual sequence of play both implies empirical convergence and also guarantees that players will be playing a Nash equilibrium in the long run, so it is a much stronger notion.\nTo establish convergence, we focus throughout on the class of potential games [11] that has found widespread applications in theoretical computer science [12], transportation networks [13], wireless communications [14], biology [15], and many other fields. We then focus on two different feedback models: in the semi-bandit framework (Section 3), players are assumed to have some (possibly imperfect) estimate of their payoff vectors at each stage, including strategies that they did not play; in the full bandit framework (Section 4), this assumption is relaxed and players are only assumed to observe their realized, in-game payoff at each stage.\nStarting with the semi-bandit case, our main result is that under fairly mild conditions for the errors affecting the players\u2019 observations (zero-mean martingale noise with tame second-moment tails), learning with exponential weights converges to a Nash equilibrium of the game with probability 1 (or to an \u03b5-equilibrium if the algorithm is implemented with a uniform exploration factor \u03b5 > 0).1 We also show that this convergence occurs at a quasi-exponential rate, i.e. much faster than the algorithm\u2019s O(\u221aT) regret minimization rate would suggest.\nThese conclusions also apply to the bandit framework when the algorithm is run with a positive exploration factor \u03b5 > 0. Thus, by choosing a sufficiently small exploration factor, the end state of the EW algorithm in potential games with bandit feedback is arbitrarily close to a Nash equilibrium. On the other hand, extending the stochastic approximation and martingale limit arguments that underlie the bandit analysis to the \u03b5 = 0 case is not straightforward. However, by letting the exploration factor go to zero at a suitable rate (similar to the temperature parameter in simulated annealing schemes), we are able to recover convergence to the game\u2019s exact Nash set (and not an approximation thereof). We find this property particularly appealing for practical applications because it shows that equilibrium can be achieved in a wide class of games with minimal information requirements.\n\n1Having an exploration factor \u03b5 > 0 simply means here that action selection probabilities never fall below \u03b5.\n\n\fRelated work. No-regret learning has given rise to a vast corpus of literature in theoretical computer science and machine learning, and several well-known families of algorithms have been proposed for that purpose. The most popular of these methods is based on exponential/multiplicative weight update rules, and several variants of this general scheme have been studied under different names in the literature (Hedge, EXP3, etc.) [4\u20137].\nWhen applied to games, the time-average of the resulting trajectory of play converges to equilibrium in two-player zero-sum games [6, 16, 17] and the players\u2019 social welfare approaches an approximate optimum [18]. In a similar vein, focusing on the so-called \u201cHedge\u201d variant of the multiplicative weights (MW) algorithm, Kleinberg et al. [19] proved that the dynamics\u2019 long-term limit in load balancing games is exponentially better than the worst correlated equilibrium. The convergence rate to approximate efficiency and to coarse correlated equilibria was further improved by Syrgkanis et al. [20] for a wide class of N-player normal form games using a natural class of regularized learning algorithms. 
This result was then extended to a class of games known as smooth games [21] with good properties in terms of the game\u2019s price of anarchy [22].\nIn the context of potential games, learning algorithms and dynamics have received significant attention, and considerable efforts have been devoted to studying the long-term properties of the players\u2019 actual sequence of play. To that end, Kleinberg et al. [23] showed that, after a polynomially small transient stage, players end up playing a pure equilibrium for a fraction of time that is arbitrarily close to 1 with probability also arbitrarily close to 1. Mehta et al. [24] obtained a stronger result for (generic) 2-player coordination games, showing that the multiplicative weights algorithm (a linearized variant of the EW algorithm) converges to a pure Nash equilibrium for all but a measure 0 of initial conditions. More recently, Palaiopanos et al. [25] showed that the MW update rule converges to equilibrium in potential games; however, if the EW algorithm is run with a constant step-size that is not small enough, the induced sequence of play may exhibit chaotic behavior, even in simple 2 \u00d7 2 games. On the other hand, if the same algorithm is run with a decreasing step-size, Krichene et al. [26] showed that play converges to Nash equilibrium in all nonatomic potential games with a convex potential (and hence, in all nonatomic congestion games).\nIn the above works, players are assumed to have full (though possibly imperfect) knowledge of their payoff vectors, including actions that were not chosen. Going beyond this semi-bandit framework, Coucheney et al. [27] showed that a \u201cpenalty-regulated\u201d variant of the EW algorithm converges to \u03b5-logit equilibria (and hence \u03b5-approximate Nash equilibria) in congestion games with bandit feedback. As in [26], the results of Coucheney et al. 
[27] employ the powerful ordinary differential equation (ODE) method of Bena\u00efm [28] which leverages the convergence of an underlying, continuous-time dynamical system to obtain convergence of the algorithm at hand. We also employ this method to compare the actual sequence of play to the replicator dynamics of evolutionary game theory [29]; however, fine-tuning the bias-variance trade-off that arises when estimating the payoff of actions that were not employed is a crucial difficulty in this case. Overcoming this hurdle is necessary when seeking convergence to actual Nash equilibria (as opposed to \u03b5-approximations thereof), so a key contribution of our paper is an extension of Bena\u00efm\u2019s theory to account for estimators with (possibly) unbounded variance.\n\n2 The setup\n\n2.1 Game-theoretic preliminaries\n\nAn N-player game in normal form consists of a (finite) set of players N = {1, . . . , N}, each with a finite set of actions (or pure strategies) Ai. The preferences of the i-th player for one action over another are determined by an associated payoff function ui : A \u2261 \u220f_i Ai \u2192 R that maps the profile (\u03b1i; \u03b1\u2212i) of all players\u2019 actions to the player\u2019s reward ui(\u03b1i; \u03b1\u2212i).2 Putting all this together, a game will be denoted by the tuple \u0393 \u2261 \u0393(N, A, u).\nPlayers can also mix their strategies by playing probability distributions xi = (xi\u03b1i)\u03b1i\u2208Ai \u2208 \u0394(Ai) over their action sets Ai. The resulting probability vector xi is called a mixed strategy and we write Xi = \u0394(Ai) for the mixed strategy space of player i. Aggregating over players, we also write X = \u220f_i Xi for the game\u2019s strategy space, i.e. the space of all mixed strategy profiles x = (xi)i\u2208N.\n\n2In the above, (\u03b1i; \u03b1\u2212i) is shorthand for (\u03b11, . . . , \u03b1i, . . . , \u03b1N), used here to highlight the action of player i against that of all other players.\n\n\fIn this context (and in a slight abuse of notation), the expected payoff of the i-th player in the profile x = (x1, . . . , xN) is\n\nui(x) = \u2211_{\u03b11\u2208A1} \u00b7\u00b7\u00b7 \u2211_{\u03b1N\u2208AN} ui(\u03b11, . . . , \u03b1N) x1\u03b11 \u00b7\u00b7\u00b7 xN\u03b1N.\n\n(2.1)\n\nTo keep track of the payoff of each pure strategy, we also write vi\u03b1i(x) = ui(\u03b1i; x\u2212i) for the payoff of strategy \u03b1i \u2208 Ai under the profile x \u2208 X and\n\nvi(x) = (vi\u03b1i(x))\u03b1i\u2208Ai\n\n(2.2)\n\nfor the resulting payoff vector of player i. We thus have\n\nui(x) = \u27e8vi(x), xi\u27e9 = \u2211_{\u03b1i\u2208Ai} xi\u03b1i vi\u03b1i(x),\n\n(2.3)\n\nwhere \u27e8v, x\u27e9 \u2261 v\u22a4x denotes the ordinary pairing between v and x.\nThe most widely used solution concept in game theory is that of a Nash equilibrium (NE), i.e. a state x\u2217 \u2208 X such that\n\nui(x\u2217i; x\u2217\u2212i) \u2265 ui(xi; x\u2217\u2212i) for every deviation xi \u2208 Xi of player i and all i \u2208 N.\n\n(NE)\n\nEquivalently, writing supp(xi) = {\u03b1i \u2208 Ai : xi\u03b1i > 0} for the support of xi \u2208 Xi, we have the characterization\n\nvi\u03b1i(x\u2217) \u2265 vi\u03b2i(x\u2217) for all \u03b1i \u2208 supp(x\u2217i) and all \u03b2i \u2208 Ai, i \u2208 N.\n\n(2.4)\n\nA Nash equilibrium x\u2217 \u2208 X is further said to be pure if supp(x\u2217i) = {\u02c6\u03b1i} for some \u02c6\u03b1i \u2208 Ai and all i \u2208 N. 
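To make these definitions concrete, the following minimal Python sketch encodes the payoff vectors (2.2), the expected payoff (2.3), and the support characterization (2.4) for a 2-player coordination game; the game and the tolerance are illustrative choices, not part of the paper:

```python
import numpy as np

# Payoff matrices of a 2-player coordination game: u_i(a1, a2) = 1 if a1 == a2, else 0.
u1 = np.eye(2)
u2 = np.eye(2)

def payoff_vectors(x1, x2):
    """v_1(x) and v_2(x) as in (2.2): expected payoff of each pure strategy."""
    return u1 @ x2, u2.T @ x1

def expected_payoff_p1(x1, x2):
    """u_1(x) = <v_1(x), x_1>, as in (2.3)."""
    v1, _ = payoff_vectors(x1, x2)
    return float(x1 @ v1)

def is_nash(x1, x2, tol=1e-9):
    """Characterization (2.4): every supported action earns the maximal payoff."""
    v1, v2 = payoff_vectors(x1, x2)
    for x, v in ((x1, v1), (x2, v2)):
        support = np.flatnonzero(x > tol)
        if np.any(v[support] < v.max() - tol):
            return False
    return True
```

In this game, both pure matching profiles and the uniform mixed profile pass the test in (2.4), while a mismatched pure profile does not.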
In generic games (that is, games where small changes to any payoff do not introduce new Nash equilibria or destroy existing ones), every pure Nash equilibrium is also strict in the sense that (2.4) holds as a strict inequality for all \u03b1i \u2260 \u02c6\u03b1i.\nIn our analysis, it will be important to consider the following relaxations of the notion of a Nash equilibrium: First, weakening the inequality (NE) leads to the notion of a \u03b4-equilibrium, defined here as any mixed strategy profile x\u2217 \u2208 X such that\n\nui(x\u2217i; x\u2217\u2212i) + \u03b4 \u2265 ui(xi; x\u2217\u2212i) for every deviation xi \u2208 Xi and all i \u2208 N.\n\n(NE\u03b4)\n\nFinally, we say that x\u2217 is a restricted equilibrium (RE) of \u0393 if\n\nvi\u03b1i(x\u2217) \u2265 vi\u03b2i(x\u2217) for all \u03b1i \u2208 supp(x\u2217i) and all \u03b2i \u2208 A\u2032i, i \u2208 N,\n\n(RE)\n\nwhere A\u2032i is some restricted subset of Ai containing supp(x\u2217i). In words, restricted equilibria are Nash equilibria of \u0393 restricted to subgames where only a subset of the players\u2019 pure strategies are available at any given moment. Clearly, Nash equilibria are restricted equilibria but the converse does not hold: for instance, every pure strategy profile is a restricted equilibrium, but not necessarily a Nash equilibrium.\nThroughout this paper, we will focus almost exclusively on the class of potential games, which have been studied extensively in the context of congestion, traffic networks, oligopolies, etc. Following Monderer and Shapley [11], \u0393 is a potential game if it admits a potential function f : \u220f_i Ai \u2192 R such that\n\nui(xi; x\u2212i) \u2212 ui(x\u2032i; x\u2212i) = f(xi; x\u2212i) \u2212 f(x\u2032i; x\u2212i) for all xi, x\u2032i \u2208 Xi, x\u2212i \u2208 X\u2212i \u2261 \u220f_{j\u2260i} Xj, and all i \u2208 N.\n\n(2.5)\n\nA simple differentiation of (2.1) then yields\n\nvi(x) = \u2207xi ui(x) = \u2207xi f(x) for all i \u2208 N.\n\n(2.6)\n\nObviously, every local maximizer of f is a Nash equilibrium, so potential games always admit Nash equilibria in pure strategies (which are also strict if the game is generic).\n\n2.2 The exponential weights algorithm\n\nOur basic learning framework is as follows: At each stage n = 1, 2, . . . , all players i \u2208 N select an action \u03b1i(n) \u2208 Ai based on their mixed strategies; subsequently, they receive some feedback on their chosen actions, they update their mixed strategies, and the process repeats.\n\n\fA popular (and very widely studied) class of algorithms for no-regret learning in this setting is the exponential weights (EW) scheme introduced by Vovk [4] and studied further by Auer et al. [5], Freund and Schapire [6], Arora et al. [7], and many others. Somewhat informally, the main idea is that each player tallies the cumulative payoffs of each of their actions, and then employs a pure strategy \u03b1i \u2208 Ai with probability roughly proportional to these cumulative payoff \u201cscores\u201d. Focusing on the so-called \u201c\u03b5-HEDGE\u201d variant of the EW algorithm [6], this process can be described in pseudocode form as follows:\n\nAlgorithm 1 \u03b5-HEDGE with generic feedback\nRequire: step-size sequence \u03b3n > 0, exploration factor \u03b5 \u2208 [0, 1], initial scores Yi \u2208 R^Ai.\n1: for n = 1, 2, . . . 
do\n2:     for every player i \u2208 N do\n3:         set mixed strategy: Xi \u2190 \u03b5 unif_i + (1 \u2212 \u03b5) \u039bi(Yi);\n4:         choose action \u03b1i \u223c Xi;\n5:         acquire estimate \u02c6vi of realized payoff vector vi(\u03b1i; \u03b1\u2212i);\n6:         update scores: Yi \u2190 Yi + \u03b3n \u02c6vi;\n7:     end for\n8: end for\n\nMathematically, Algorithm 1 represents the recursion\n\nXi(n) = \u03b5 unif_i + (1 \u2212 \u03b5) \u039bi(Yi(n)),\nYi(n + 1) = Yi(n) + \u03b3n+1 \u02c6vi(n + 1),\n\n(\u03b5-Hedge)\n\nwhere\n\nunif_i = (1/|Ai|) (1, . . . , 1)\n\n(2.7)\n\nstands for the uniform distribution over Ai and \u039bi : R^Ai \u2192 Xi denotes the logit choice map\n\n\u039bi(yi) = (exp(yi\u03b1i))\u03b1i\u2208Ai / \u2211_{\u03b1i\u2208Ai} exp(yi\u03b1i),\n\n(2.8)\n\nwhich assigns exponentially higher probability to pure strategies with higher scores. Thus, action selection probabilities under (\u03b5-Hedge) are a convex combination of uniform exploration (with total weight \u03b5) and exponential weights (with total weight 1 \u2212 \u03b5).3 As a result, for \u03b5 \u2248 1, action selection is essentially uniform; at the other extreme, when \u03b5 = 0, we obtain the original Hedge algorithm of Freund and Schapire [6] with feedback sequence \u02c6v(n) and no explicit exploration.\nThe no-regret properties of (\u03b5-Hedge) have been extensively studied in the literature as a function of the algorithm\u2019s step-size sequence \u03b3n, exploration factor \u03b5, and the statistical properties of the payoff estimates \u02c6v(n) \u2013 for a survey, we refer the reader to [2, 3]. 
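A single round of this update can be sketched in a few lines of Python (the payoff estimate and step-size below are placeholders, and the max-shift inside the logit map is a standard numerical-stability device rather than part of the formal description):

```python
import numpy as np

def logit_choice(y):
    """Logit choice map (2.8), shifted by max(y) for numerical stability."""
    z = np.exp(y - y.max())
    return z / z.sum()

def eps_hedge_strategy(y, eps):
    """Mixed strategy of (eps-Hedge): eps * uniform + (1 - eps) * logit weights."""
    k = len(y)
    return eps * np.full(k, 1.0 / k) + (1.0 - eps) * logit_choice(y)

# One illustrative round for a player with three actions.
rng = np.random.default_rng(0)
y = np.zeros(3)                        # initial scores Y_i
x = eps_hedge_strategy(y, eps=0.1)     # mixed strategy X_i
action = rng.choice(3, p=x)            # alpha_i ~ X_i
v_hat = np.array([1.0, 0.5, 0.0])      # hypothetical payoff estimate
y = y + 0.5 * v_hat                    # score update with gamma_n = 0.5
```

Note that the uniform mixing guarantees every action retains probability at least eps/|Ai|, which is exactly the exploration floor invoked in the bandit analysis below.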
In our convergence analysis, we examine the role of each of these factors in detail, focusing in particular on the distinction between \u201csemi-bandit feedback\u201d (when it is possible to estimate the payoff of pure strategies that were not played) and \u201cbandit feedback\u201d (when players only observe the payoff of their chosen action).\n\n3 Learning with semi-bandit feedback\n\n3.1 The model\n\nWe begin with the semi-bandit framework, i.e. the case where each player has access to a possibly imperfect estimate of their entire payoff vector at stage n. More precisely, we assume here that the feedback sequence \u02c6vi(n) to Algorithm 1 is of the general form\n\n\u02c6vi(n) = vi(\u03b1i(n); \u03b1\u2212i(n)) + \u03bei(n),\n\n(3.1)\n\nwhere (\u03bei(n))i\u2208N is a martingale noise process representing the players\u2019 estimation error and satisfying the following statistical hypotheses:\n\n3Of course, the exploration factor \u03b5 could also be player-dependent. For simplicity, we state all our results here with the same \u03b5 for all players.\n\n\f1. Zero-mean:\n\nE[\u03bei(n) | Fn\u22121] = 0 for all n = 1, 2, . . . (a.s.).\n\n(H1)\n\n2. Tame tails:\n\nP(\u2016\u03bei(n)\u2016_\u221e^2 \u2265 z | Fn\u22121) \u2264 A/z^q for some q > 2, A > 0, and all n = 1, 2, . . . (a.s.).\n\n(H2)\n\nIn the above, the expectation E[\u00b7] is taken with respect to some underlying filtered probability space (\u03a9, F, (Fn)n\u2208N, P) which serves as a stochastic basis for the process (\u03b1(n), \u02c6v(n), Y(n), X(n))n\u22651.4 In words, Hypothesis (H1) simply means that the players\u2019 feedback sequence \u02c6v(n) is conditionally unbiased with respect to the history of play, i.e.\n\nE[\u02c6vi(n) | Fn\u22121] = vi(X(n \u2212 1)) for all n = 1, 2, . . . (a.s.).\n\n(3.2a)\n\nHypothesis (H2) further implies that the variance of the estimator \u02c6v is conditionally bounded, i.e.\n\nVar[\u02c6v(n) | Fn\u22121] \u2264 \u03c3^2 for all n = 1, 2, . . . (a.s.).\n\n(3.2b)\n\nBy Chebyshev\u2019s inequality, an estimator with finite variance enjoys the tail bound P(\u2016\u03bei(n)\u2016_\u221e \u2265 z | Fn\u22121) = O(1/z^2). At the expense of working with slightly more conservative step-size policies (see below), much of our analysis goes through with this weaker requirement for the tails of \u03be. However, the extra control provided by the O(1/z^q) tail bound simplifies the presentation considerably, so we do not consider this relaxation here. In any event, Hypothesis (H2) is satisfied by a broad range of error noise distributions (including all compactly supported, sub-Gaussian and sub-exponential distributions), so the loss in generality is small compared to the gain in clarity and concision.\n\n3.2 Convergence analysis\n\nWith all this at hand, our main result for the convergence of (\u03b5-Hedge) with semi-bandit feedback of the form (3.1) is as follows:\n\nTheorem 1. Let \u0393 be a generic potential game and suppose that Algorithm 1 is run with i) semi-bandit feedback satisfying (H1) and (H2); ii) a nonnegative exploration factor \u03b5 \u2265 0; and iii) a step-size sequence of the form \u03b3n \u221d 1/n^\u03b2 for some \u03b2 \u2208 (1/q, 1]. Then:\n\n1. X(n) converges (a.s.) to a \u03b4-equilibrium of \u0393 with \u03b4 \u2261 \u03b4(\u03b5) \u2192 0 as \u03b5 \u2192 0.\n\n2. If lim_{n\u2192\u221e} X(n) is an \u03b5-pure state of the form x\u2217i = \u03b5 unif_i + (1 \u2212 \u03b5)e_{\u02c6\u03b1i} for some \u02c6\u03b1 \u2208 A, then \u02c6\u03b1 is a.s. a strict equilibrium of \u0393 and convergence occurs at a quasi-exponential rate:\n\nXi\u02c6\u03b1i(n) \u2265 1 \u2212 \u03b5 \u2212 b e^{\u2212c \u2211_{k=1}^{n} \u03b3k} for some positive b, c > 0.\n\n(3.3)\n\nCorollary 2. If Algorithm 1 is run with assumptions as above and no exploration (\u03b5 = 0), X(n) converges to a Nash equilibrium with probability 1. Moreover, if the limit of X(n) is pure and \u03b2 < 1, we have\n\nXi\u02c6\u03b1i(n) \u2265 1 \u2212 b e^{\u2212c n^{1\u2212\u03b2}} for some positive b, c > 0.\n\n(3.4)\n\nSketch of the proof. The proof of Theorem 1 is fairly convoluted, so we relegate the details to the paper\u2019s technical appendix and only present here a short sketch thereof.\nOur main tool is the so-called ordinary differential equation (ODE) method, a powerful stochastic approximation scheme due to Bena\u00efm and Hirsch [28, 30]. The key observation is that the mixed strategy sequence X(n) generated by Algorithm 1 can be viewed as a \u201cRobbins\u2013Monro approximation\u201d (an asymptotic pseudotrajectory, to be precise) of the \u03b5-perturbed exponential learning dynamics\n\n\u02d9yi = vi(x),\nxi = \u03b5 unif_i + (1 \u2212 \u03b5) \u039bi(yi).\n\n(XL\u03b5)\n\nBy differentiating, it follows that xi(t) evolves according to the \u03b5-perturbed replicator dynamics\n\n\u02d9xi\u03b1 = (xi\u03b1 \u2212 |Ai|^{\u22121}\u03b5) [vi\u03b1(x) \u2212 (1 \u2212 \u03b5)^{\u22121} \u2211_{\u03b2\u2208Ai} (xi\u03b2 \u2212 |Ai|^{\u22121}\u03b5) vi\u03b2(x)],\n\n(RD\u03b5)\n\n4Notation-wise, this means that the players\u2019 actions at stage n are drawn based on their mixed strategies at stage n \u2212 1. 
This slight discrepancy with the pseudocode representation of Algorithm 1 is only done to simplify notation later on.\n\n\fwhich, for \u03b5 = 0, boil down to the ordinary replicator dynamics of Taylor and Jonker [29]:\n\n\u02d9xi\u03b1 = xi\u03b1 [vi\u03b1(x) \u2212 \u27e8vi(x), xi\u27e9].\n\n(RD)\n\nA key property of the replicator dynamics that readily extends to the \u03b5-perturbed variant (RD\u03b5) is that the game\u2019s potential f is a strict Lyapunov function \u2013 i.e. f(x(t)) is increasing under (RD\u03b5) unless x(t) is stationary. By a standard result of Bena\u00efm [28], this implies that the discrete-time process X(n) converges (a.s.) to a connected set of rest points of (RD\u03b5), which are themselves approximate restricted equilibria of \u0393.\nOf course, since every \u03b5-pure point of the form (\u03b5 unif_i + (1 \u2212 \u03b5)e_\u03b1i)i\u2208N is also stationary under (RD\u03b5), the above does not imply that the limit of X(n) is an approximate equilibrium of \u0393. To rule out non-equilibrium outcomes, we first note that the set of rest points of (RD\u03b5) is finite (by genericity), so X(n) must converge to a point. Then, the final step of our convergence proof is provided by a martingale recurrence argument which shows that when X(n) converges to a point, this limit must be an approximate equilibrium of \u0393. Finally, the rate of convergence (3.3) is obtained by comparing the payoff of a player\u2019s equilibrium strategy to that of the player\u2019s other strategies, and then \u201cinverting\u201d the logit choice map to translate this into an exponential decay rate for \u2016Xi\u02c6\u03b1i(n) \u2212 x\u2217\u2016.\n\nWe close this section with two remarks on Theorem 1. First, we note that there is an inverse relationship between the tail exponent q in (H2) and the decay rate \u03b2 of the algorithm\u2019s step-size sequence \u03b3n \u221d n^{\u2212\u03b2}. 
Specifically, higher values of q imply that the noise in the players\u2019 observations is smaller (on average and with high probability), so players can be more aggressive in their choice of step-size. This is reflected in the lower bound 1/q for \u03b2 and the fact that the players\u2019 rate of convergence to Nash equilibrium increases for smaller \u03b2; in particular, (3.3) shows that Algorithm 1 enjoys a convergence bound which is just shy of O(exp(\u2212n^{1\u22121/q})). Thus, if the noise process \u03be is sub-Gaussian/sub-exponential (so q can be taken arbitrarily large), a near-constant step-size sequence (small \u03b2) yields an almost linear convergence rate.\nSecond, if the noise process \u03be is \u201cisotropic\u201d in the sense of Bena\u00efm [28, Thm. 9.1], the instability of non-pure Nash equilibria under the replicator dynamics can be used to show that the limit of X(n) is pure with probability 1.5 When this is the case, the quasi-exponential convergence rate (3.3) becomes universal in that it holds with probability 1 (as opposed to conditioning on lim_{n\u2192\u221e} X(n) being pure). We find this property particularly appealing for practical applications because it shows that equilibrium is reached exponentially faster than the O(1/\u221an) worst-case regret bound of (\u03b5-Hedge) would suggest.\n\n4 Payoff-based learning: the bandit case\n\nWe now turn to the bandit framework, a minimal-information setting where, at each stage of the process, players only observe their realized payoffs\n\n\u02c6ui(n) = ui(\u03b1i(n); \u03b1\u2212i(n)).\n\n(4.1)\n\nIn this case, players have no clue about the payoffs of strategies that were not chosen, so they must construct an estimator for their payoff vector, including its missing components. 
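The standard construction for this purpose, importance weighting the single realized payoff, can be sanity-checked for unbiasedness in simulation; the payoff vector and sampling strategy below are hypothetical values chosen only for illustration:

```python
import numpy as np

def bandit_estimate(payoff, action, x):
    """Importance-weighted estimate of a full payoff vector from one realized
    payoff: the played coordinate is scaled by 1/x[action], the rest are 0."""
    v_hat = np.zeros(len(x))
    v_hat[action] = payoff / x[action]
    return v_hat

# Hypothetical payoff vector and sampling strategy (illustrative values only).
v = np.array([0.2, 0.7, 0.5])
x = np.array([0.3, 0.4, 0.3])

one_step = bandit_estimate(v[2], 2, x)   # single-round estimate: one nonzero entry

# Empirical mean over many independent rounds (vectorized equivalent of the loop).
rng = np.random.default_rng(1)
n = 200_000
actions = rng.choice(3, size=n, p=x)
acc = np.zeros(3)
np.add.at(acc, actions, v[actions] / x[actions])
mean = acc / n                           # approaches v by unbiasedness
```

Averaging many such estimates recovers the true payoff vector, but each estimate has variance of order 1/x[action], which is precisely the bias-variance tension discussed in this section.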
A standard way to\ndo this is via the bandit estimator\n1(\u03b1i(n) = \u03b1i)\n\nif \u03b1i = \u03b1i(n),\notherwise.\n\n(4.2)\n\n(cid:26)\u02c6ui(n)/Xi\u03b1i(n \u2212 1)\n(cid:88)\n\nXi\u03b2i(n \u2212 1)\n\n0\n\n\u03b2i\u2208Ai\n\n1(\u03b1i = \u03b2i)\nXi\u03b1i(n \u2212 1)\n\nui(\u03b2i; \u03b1\u2212i)\n\n\u02c6vi\u03b1i(n) =\n\nP(\u03b1i(n) = \u03b1i |Fn\u22121)\n\n\u00b7 \u02c6ui(n) =\n\n(cid:88)\n\nE[\u02c6vi\u03b1i(n)|Fn\u22121] =\n\nIndeed, a straightforward calculation shows that\nX\u2212i,\u03b1\u2212i(n \u2212 1)\n\u03b1\u2212i\u2208A\u2212i\n= ui(\u03b1i; X\u2212i(n \u2212 1))\n= vi\u03b1i(X(n \u2212 1)),\n\n(4.3)\n5Speci\ufb01cally, we refer here to the so-called \u201cfolk theorem\u201d of evolutionary game theory which states that x\u2217\nis asymptotically stable under (RD) if and only if it is a strict Nash equilibrium of \u0393 [15]. The extension of this\nresult to the \u03b5-replicator system (RD\u03b5) is immediate.\n\n7\n\n\fso the estimator (4.2) is unbiased in the sense of (H1)/(3.2a). On the other hand, a similar calculation\nshows that the variance of \u02c6vi\u03b1i(n) grows as O(1/Xi\u03b1i(n \u2212 1)), implying that (H2)/(3.2b) may fail\nto hold if the players\u2019 action selection probabilities become arbitrarily small.\nImportantly, this can never happen if (\u03b5-Hedge) is run with a strictly positive exploration factor \u03b5 > 0.\nIn that case, we can show that the bandit estimator (4.2) satis\ufb01es both (H1) and (H2), leading to the\nfollowing result:\nTheorem 3. Let \u0393 be a generic potential game and suppose that Algorithm 1 is run with i) the bandit\nestimator (4.2); ii) a strictly positive exploration factor \u03b5 > 0; and iii) a step-size sequence of the\nform \u03b3n \u221d 1/n\u03b2 for some \u03b2 \u2208 (0, 1]. Then:\n\n1. X(n) converges (a.s.) to a \u03b4-equilibrium of \u0393 with \u03b4 \u2261 \u03b4(\u03b5) \u2192 0 as \u03b5 \u2192 0.\n2. 
If lim_{n→∞} X(n) is an ε-pure state of the form x*_i = ε unif_i + (1 − ε) e_{α̂_i} for some α̂ ∈ A, then α̂ is a.s. a strict equilibrium of Γ and convergence occurs at a quasi-exponential rate:

\[
X_{i\hat\alpha_i}(n) \ge 1 - \varepsilon - b\,e^{-c \sum_{k=1}^{n} \gamma_k}
\quad\text{for some constants } b, c > 0.
\tag{4.4}
\]

Proof. Under Algorithm 1, the estimator (4.2) gives

\[
\lVert \hat v_i(n) \rVert
= \frac{\lvert \hat u_i(n) \rvert}{X_{i\alpha_i(n)}(n-1)}
\le \frac{\lvert u_i(\alpha_i(n); \alpha_{-i}(n)) \rvert}{\varepsilon}
\le \frac{u_{\max}}{\varepsilon},
\tag{4.5}
\]

where u_max = max_{i∈N} max_{α1∈A1} ··· max_{αN∈AN} |u_i(α1, . . . , αN)| denotes the absolute maximum payoff in Γ. This implies that (H2) holds true for all q > 2, so our claim follows from Theorem 1.

Theorem 3 shows that the limit of Algorithm 1 is closer to the Nash set of the game if the exploration factor ε is taken as small as possible. On the other hand, the crucial limitation of this result is that it does not apply to the case ε = 0, which corresponds to the game's bona fide Nash equilibria. As we discussed above, the reason for this is that the variance of v̂(n) may grow without bound if action choice probabilities become arbitrarily small, in which case the main components of our proof break down.

With this "bias-variance" trade-off in mind, we introduce below a modified version of Algorithm 1 with an "annealing" schedule for the method's exploration factor:

Algorithm 2 Exponential weights with annealing
Require: step-size sequence γn > 0, vanishing exploration factor εn > 0, initial scores Yi ∈ R^{Ai}
1: for n = 1, 2, . . . do
2:   for every player i ∈ N do
3:     set mixed strategy: Xi ← εn unif_i + (1 − εn) Λi(Yi);
4:     choose action αi ∼ Xi and receive payoff ûi ← ui(αi; α−i);
5:     set v̂iαi ← ûi/Xiαi and v̂iβi ← 0 for βi ≠ αi;
6:     update scores: Yi ← Yi + γn v̂i;
7:   end for
8: end for

Of course, the convergence of Algorithm 2 depends heavily on the rate at which εn decays to 0 relative to the algorithm's step-size sequence γn. This can be seen clearly in our next result:

Theorem 4. Let Γ be a generic potential game and suppose that Algorithm 2 is run with i) the bandit estimator (4.2); ii) a step-size sequence of the form γn ∝ 1/n^β for some β ∈ (1/2, 1]; and iii) a decreasing exploration factor εn ↓ 0 such that

\[
\lim_{n\to\infty} \frac{\gamma_n}{\varepsilon_n^2} = 0,
\qquad
\sum_{n=1}^{\infty} \frac{\gamma_n^2}{\varepsilon_n} < \infty,
\qquad\text{and}\qquad
\lim_{n\to\infty} \frac{\varepsilon_n - \varepsilon_{n+1}}{\gamma_n} = 0.
\tag{4.6}
\]

Then, X(n) converges (a.s.) to a Nash equilibrium of Γ.

The main challenge in proving Theorem 4 is that, unless the "innovation term" U_i(n) = v̂_i(n) − v_i(X(n−1)) has bounded variance, Benaïm's general theory does not imply that X(n) forms an asymptotic pseudotrajectory of the underlying mean dynamics – here, the unperturbed replicator system (RD). Nevertheless, under the summability condition (4.6), it is possible to show that this is the case by using a martingale limit argument based on Burkholder's inequality. Furthermore, under the stated conditions, it is also possible to show that, if X(n) converges, its limit is necessarily a Nash equilibrium of Γ.
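To make the annealing mechanism concrete, here is a minimal single-player sketch of the update loop of Algorithm 2 in Python. The payoff vector, the schedules γn = n^{−3/4} and εn = n^{−1/4}, and all variable names are illustrative assumptions of ours (the logit choice map Λ is written out as a softmax of the scores):

```python
import math
import random

random.seed(1)

# Hypothetical single-player instance: payoff[a] stands in for u_i(a; alpha_-i)
# with the opponents' behavior held fixed; action 0 is the best reply.
payoff = [1.0, 0.4, 0.1]
A = len(payoff)
Y = [0.0] * A  # score vector Y_i

def logit(Y):
    """Logit choice map Lambda(Y): softmax of the scores."""
    m = max(Y)  # subtract max for numerical stability
    w = [math.exp(y - m) for y in Y]
    s = sum(w)
    return [wi / s for wi in w]

for n in range(1, 5001):
    gamma = n ** -0.75  # step-size gamma_n (illustrative schedule)
    eps = n ** -0.25    # vanishing exploration factor eps_n (illustrative)
    lam = logit(Y)
    X = [eps / A + (1 - eps) * l for l in lam]  # X_i = eps*unif + (1-eps)*Lambda(Y_i)
    a = random.choices(range(A), weights=X)[0]  # play a ~ X
    u = payoff[a]                               # observe the realized payoff only
    vhat = [0.0] * A
    vhat[a] = u / X[a]                          # bandit estimator of the payoff vector
    Y = [Y[b] + gamma * vhat[b] for b in range(A)]  # score update
```

With these (illustrative) schedules, the mixed strategy X concentrates on the payoff-maximizing action while the residual exploration mass εn/A on every action vanishes, mirroring the annealing behavior described above.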
Our proof then follows in roughly the same way as in the case of Theorem 1; for the details, we refer the reader to the appendix.

We close this section by noting that the summability condition (4.6) imposes a lower bound on the step-size exponent β that is different from the lower bound in Theorem 3. In particular, if β = 1/2, (4.6) cannot hold for any vanishing sequence of exploration factors εn ↓ 0. Given that the innovation term Ui is bounded, we conjecture that this sufficient condition is not tight and can be relaxed further. We intend to address this issue in future work.

5 Conclusion and perspectives

The results of the previous sections show that no-regret learning via exponential weights enjoys appealing convergence properties in generic potential games. Specifically, in the semi-bandit case, the sequence of play converges to a Nash equilibrium with probability 1, and convergence to pure equilibria occurs at a quasi-exponential rate. In the bandit case, the same holds true for O(ε)-equilibria if the algorithm is run with a positive exploration factor ε > 0; and if the algorithm is run with a decreasing exploration schedule, the sequence of play converges to an actual Nash equilibrium (again, with probability 1). In future work, we intend to examine the algorithm's convergence properties in other classes of games (such as smooth games), extend our analysis to the general "follow the regularized leader" (FTRL) class of policies (of which EW is a special case), and examine the impact of asynchronicities and delays in the players' feedback/update cycles.

Acknowledgments

Johanne Cohen was partially supported by the grant CNRS PEPS MASTODONS project ADOC 2017. Amélie Héliou and Panayotis Mertikopoulos gratefully acknowledge financial support from the Huawei Innovation Research Program ULTRON and the ANR JCJC project ORACLESS (grant no.
ANR–16–CE33–0004–01).

References

[1] James Hannan. Approximation to Bayes risk in repeated play. In Melvin Dresher, Albert William Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, Volume III, volume 39 of Annals of Mathematics Studies, pages 97–139. Princeton University Press, Princeton, NJ, 1957.

[2] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

[3] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[4] Volodimir G. Vovk. Aggregating strategies. In COLT '90: Proceedings of the 3rd Workshop on Computational Learning Theory, pages 371–383, 1990.

[5] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.

[6] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

[7] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[8] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, September 2000.

[9] Yannick Viossat and Andriy Zapechelnyuk. No-regret dynamics and fictitious play. Journal of Economic Theory, 148(2):825–842, March 2013.

[10] Panayotis Mertikopoulos, Christos H. Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning.
In SODA '18: Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, to appear.

[11] Dov Monderer and Lloyd S. Shapley. Potential games. Games and Economic Behavior, 14(1):124–143, 1996.

[12] Noam Nisan, Tim Roughgarden, Eva Tardos, and V. V. Vazirani, editors. Algorithmic Game Theory. Cambridge University Press, 2007.

[13] William H. Sandholm. Population Games and Evolutionary Dynamics. Economic learning and social evolution. MIT Press, Cambridge, MA, 2010.

[14] Samson Lasaulce and Hamidou Tembine. Game Theory and Learning for Wireless Networks: Fundamentals and Applications. Academic Press, Elsevier, 2010.

[15] Josef Hofbauer and Karl Sigmund. Evolutionary game dynamics. Bulletin of the American Mathematical Society, 40(4):479–519, July 2003.

[16] Dean Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1):40–55, October 1997.

[17] Avrim Blum and Yishay Mansour. Learning, regret minimization, and equilibria. In Noam Nisan, Tim Roughgarden, Eva Tardos, and V. V. Vazirani, editors, Algorithmic Game Theory, chapter 4. Cambridge University Press, 2007.

[18] Avrim Blum, Mohammad Taghi Hajiaghayi, Katrina Ligett, and Aaron Roth. Regret minimization and the price of total anarchy. In STOC '08: Proceedings of the 40th Annual ACM Symposium on the Theory of Computing, pages 373–382. ACM, 2008.

[19] Robert Kleinberg, Georgios Piliouras, and Éva Tardos. Load balancing without regret in the bulletin board model. Distributed Computing, 24(1):21–29, 2011.

[20] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, pages 2989–2997, 2015.

[21] Tim Roughgarden. Intrinsic robustness of the price of anarchy.
Journal of the ACM (JACM), 62(5):32, 2015.

[22] Dylan J. Foster, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4727–4735, 2016.

[23] Robert Kleinberg, Georgios Piliouras, and Eva Tardos. Multiplicative updates outperform generic no-regret learning in congestion games. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pages 533–542. ACM, 2009.

[24] Ruta Mehta, Ioannis Panageas, and Georgios Piliouras. Natural selection as an inhibitor of genetic diversity: Multiplicative weights updates algorithm and a conjecture of haploid genetics. In ITCS '15: Proceedings of the 6th Conference on Innovations in Theoretical Computer Science, 2015.

[25] Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

[26] Walid Krichene, Benjamin Drighès, and Alexandre M. Bayen. Online learning of Nash equilibria in congestion games. SIAM Journal on Control and Optimization, 53(2):1056–1081, 2015.

[27] Pierre Coucheney, Bruno Gaujal, and Panayotis Mertikopoulos. Penalty-regulated dynamics and robust learning procedures in games. Mathematics of Operations Research, 40(3):611–633, August 2015.

[28] Michel Benaïm. Dynamics of stochastic approximation algorithms. Séminaire de probabilités de Strasbourg, 33, 1999.

[29] Peter D. Taylor and Leo B. Jonker. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40(1-2):145–156, 1978.

[30] Michel Benaïm and Morris W. Hirsch. Asymptotic pseudotrajectories and chain recurrent flows, with applications.
Journal of Dynamics and Differential Equations, 8(1):141–176, 1996.