{"title": "Learning Mean-Field Games", "book": "Advances in Neural Information Processing Systems", "page_first": 4966, "page_last": 4976, "abstract": "This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash Equilibrium to this GMFG, and explains that naively combining Q-learning with the fixed-point approach in classical MFGs yields unstable algorithms. It then proposes a Q-learning algorithm with Boltzmann policy (GMF-Q), with analysis of convergence property and computational complexity. The experiments on repeated Ad auction problems demonstrate that this GMF-Q algorithm is efficient and robust in terms of convergence and learning accuracy. Moreover, its performance is superior in convergence, stability, and learning ability, when compared with existing algorithms for multi-agent reinforcement learning.", "full_text": "Learning Mean-Field Games

Xin Guo, University of California, Berkeley, xinguo@berkeley.edu
Anran Hu, University of California, Berkeley, anran_hu@berkeley.edu
Renyuan Xu, University of California, Berkeley, renyuanxu@berkeley.edu
Junzi Zhang, Stanford University, junziz@stanford.edu

Abstract

This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash Equilibrium to this GMFG, and explains that naively combining Q-learning with the fixed-point approach in classical MFGs yields unstable algorithms. It then proposes a Q-learning algorithm with Boltzmann policy (GMF-Q), with analysis of convergence property and computational complexity. The experiments on repeated Ad auction problems demonstrate that this GMF-Q algorithm is efficient and robust in terms of convergence and learning accuracy.
Moreover, its performance is superior in convergence, stability, and learning ability, when compared with existing algorithms for multi-agent reinforcement learning.

1 Introduction

Motivating example. This paper is motivated by the following Ad auction problem for an advertiser. An Ad auction is a stochastic game on an Ad exchange platform among a large number of players, the advertisers. In between the time a web user requests a page and the time the page is displayed, usually within a millisecond, a Vickrey-type second-price auction is run to incentivize interested advertisers to bid for an Ad slot to display advertisement. Each advertiser has limited information before each bid: first, her own valuation for a slot depends on an unknown conversion of clicks for the item; secondly, should she win the bid, she only knows the reward after the user's activities on the website are finished. In addition, she has a budget constraint in this repeated auction.
The question is: how should she bid in this online sequential repeated game when there is a large population of bidders competing on the Ad platform, with unknown distributions of the conversion of clicks and rewards?
Besides the Ad auction, there are many real-world problems involving a large number of players and unknown systems. Examples include massive multi-player online role-playing games [19], high-frequency trading [24], and the sharing economy [13].

Our work. Motivated by these problems, we consider a general framework of simultaneous learning and decision-making in stochastic games with a large population. We formulate a general mean-field game (GMFG) with incorporation of action distributions, (randomized) relaxed policies, and with unknown rewards and dynamics. This general framework can also be viewed as a generalized version of MFGs of McKean-Vlasov type [1], which is a different paradigm from the classical MFG.
It is also beyond the scope of the existing Q-learning framework for Markov decision problems (MDPs) with unknown distributions, as an MDP is technically equivalent to a single-player stochastic game.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

On the theory front, this general framework differs from all existing MFGs. We establish, under appropriate technical conditions, the existence and uniqueness of the Nash equilibrium (NE) for this GMFG. On the computational front, we show that naively combining Q-learning with the three-step fixed-point approach in classical MFGs yields unstable algorithms. We then propose a Q-learning algorithm with Boltzmann policy (GMF-Q), establish its convergence property, and analyze its computational complexity. Finally, we apply this GMF-Q algorithm to the Ad auction problem, where it demonstrates its efficiency and robustness in terms of convergence and learning. Moreover, its performance is superior to existing multi-agent reinforcement learning algorithms in terms of convergence, stability, and learning accuracy.

Related works. On learning large population games with mean-field approximations, [39] focuses on inverse reinforcement learning for MFGs without decision making, [40] studies an MARL problem with a first-order mean-field approximation term modeling the interaction between one player and all the other finite players, and [22] and [41] consider model-based adaptive learning for MFGs in specific models (e.g., linear-quadratic and oscillator games). More recently, [26] studies the local convergence of actor-critic algorithms on finite time horizon MFGs, and [34] proposes a policy-gradient based algorithm and analyzes the so-called local NE for reinforcement learning in infinite time horizon MFGs.
For learning large population games without mean-field approximation, see [14, 21] and the references therein. In the specific topic of learning auctions with a large number of advertisers, [6] and [20] explore reinforcement learning techniques to search for social optimal solutions with real-world data, and [18] uses MFGs to model the auction system with unknown conversion of clicks within a Bayesian framework.
However, none of these works consider the problem of simultaneous learning and decision-making in a general MFG framework. Neither do they establish the existence and uniqueness of the (global) NE, nor do they present model-free learning algorithms with complexity analysis and convergence to the NE. Note that in principle, global results are harder to obtain than local results.

2 Framework of General MFG (GMFG)

2.1 Background: classical N-player Markovian game and MFG

Let us first recall the classical N-player game. There are N players in a game. At each step t, the state of player i (= 1, 2, ..., N) is $s_t^i \in \mathcal{S} \subseteq \mathbb{R}^d$ and she takes an action $a_t^i \in \mathcal{A} \subseteq \mathbb{R}^p$. Here d, p are positive integers, and $\mathcal{S}$ and $\mathcal{A}$ are compact (for example, finite) state space and action space, respectively. Given the current state profile of the N players $s_t = (s_t^1, \dots, s_t^N) \in \mathcal{S}^N$ and the action $a_t^i$, player i will receive a reward $r^i(s_t, a_t^i)$ and her state will change to $s_{t+1}^i$ according to a transition probability function $P^i(s_t, a_t^i)$.
A Markovian game further restricts the admissible policy/control for player i to be of the form $a_t^i \sim \pi_t^i(s_t)$. That is, $\pi_t^i : \mathcal{S}^N \to \mathcal{P}(\mathcal{A})$ maps each state profile $s \in \mathcal{S}^N$ to a randomized action, with $\mathcal{P}(\mathcal{X})$ the space of probability measures on space $\mathcal{X}$. The accumulated reward (a.k.a.
the value function) for player i, given the initial state profile s and the policy profile sequence $\boldsymbol{\pi} := \{\boldsymbol{\pi}_t\}_{t=0}^{\infty}$ with $\boldsymbol{\pi}_t = (\pi_t^1, \dots, \pi_t^N)$, is then defined as

$$V^i(s, \boldsymbol{\pi}) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r^i(s_t, a_t^i) \,\Big|\, s_0 = s \right], \qquad (1)$$

where $\gamma \in (0, 1)$ is the discount factor, $a_t^i \sim \pi_t^i(s_t)$, and $s_{t+1}^i \sim P^i(s_t, a_t^i)$. The goal of each player is to maximize her value function over all admissible policy sequences.
In general, this type of stochastic N-player game is notoriously hard to analyze, especially when N is large [28]. Mean field game (MFG), pioneered by [17] and [23] in the continuous settings and later developed in [4, 10, 16, 25, 33] for discrete settings, provides an ingenious and tractable aggregation approach to approximate the otherwise challenging N-player stochastic games. The basic idea for an MFG goes as follows. Assume all players are identical, indistinguishable and interchangeable; when $N \to \infty$, one can view the limit of the other players' states $s_t^{-i} = (s_t^1, \dots, s_t^{i-1}, s_t^{i+1}, \dots, s_t^N)$ as a population state distribution $\mu_t$ with $\mu_t(s) := \lim_{N\to\infty} \frac{1}{N}\sum_{j=1, j\neq i}^{N} \mathbb{I}_{s_t^j = s}$ (footnote 1).

Footnote 1: Here the indicator function $\mathbb{I}_{s_t^j = s} = 1$ if $s_t^j = s$, and 0 otherwise.

Due to the homogeneity of the players, one can then focus on a single (representative) player.
That is, in an MFG, one may consider instead the following optimization problem:

$$\begin{aligned} \text{maximize}_{\boldsymbol{\pi}} \quad & V(s, \boldsymbol{\pi}, \boldsymbol{\mu}) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mu_t) \,\Big|\, s_0 = s \right] \\ \text{subject to} \quad & s_{t+1} \sim P(s_t, a_t, \mu_t), \qquad a_t \sim \pi_t(s_t, \mu_t), \end{aligned}$$

where $\boldsymbol{\pi} := \{\pi_t\}_{t=0}^{\infty}$ denotes the policy sequence and $\boldsymbol{\mu} := \{\mu_t\}_{t=0}^{\infty}$ the distribution flow. In this MFG setting, at time t, after the representative player chooses her action $a_t$ according to some policy $\pi_t$, she will receive reward $r(s_t, a_t, \mu_t)$ and her state will evolve under a controlled stochastic dynamics of a mean-field type $P(\cdot|s_t, a_t, \mu_t)$. Here the policy $\pi_t$ depends on both the current state $s_t$ and the current population state distribution $\mu_t$, such that $\pi : \mathcal{S} \times \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{A})$.

2.2 General MFG (GMFG)

In the classical MFG setting, the reward and the dynamic for each player are known. They depend only on $s_t$ the state of the player, $a_t$ the action of this particular player, and $\mu_t$ the population state distribution. In contrast, in the motivating auction example, the reward and the dynamic are unknown; they rely on the actions of all players, as well as on $s_t$ and $\mu_t$.
We therefore define the following general MFG (GMFG) framework. At time t, after the representative player chooses her action $a_t$ according to some policy $\pi : \mathcal{S} \times \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{A})$, she will receive a reward $r(s_t, a_t, \mathcal{L}_t)$ and her state will evolve according to $P(\cdot|s_t, a_t, \mathcal{L}_t)$, where r and P are possibly unknown.
The objective of the player is to solve the following control problem:

$$\begin{aligned} \text{maximize}_{\boldsymbol{\pi}} \quad & V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}}) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mathcal{L}_t) \,\Big|\, s_0 = s \right] \\ \text{subject to} \quad & s_{t+1} \sim P(s_t, a_t, \mathcal{L}_t), \qquad a_t \sim \pi_t(s_t, \mu_t). \end{aligned} \qquad \text{(GMFG)}$$

Here, $\boldsymbol{\mathcal{L}} := \{\mathcal{L}_t\}_{t=0}^{\infty}$, with $\mathcal{L}_t = \mathbb{P}_{s_t, a_t} \in \mathcal{P}(\mathcal{S}\times\mathcal{A})$ the joint distribution of the state and the action (i.e., the population state-action pair). $\mathcal{L}_t$ has marginal distributions $\alpha_t$ for the population action and $\mu_t$ for the population state. Notice that $\{\mathcal{L}_t\}_{t=0}^{\infty}$ could depend on time: an infinite time horizon MFG could still have a time-dependent NE solution, due to the mean information process (game interaction) in the MFG. This is fundamentally different from the theory of single-agent MDPs, where the optimal control, if it exists uniquely, would be time independent in an infinite time horizon setting.
In this framework, we adopt the well-known Nash Equilibrium (NE) for analyzing stochastic games.
Definition 2.1 (NE for GMFGs). In (GMFG), a player-population profile $(\boldsymbol{\pi}^\star, \boldsymbol{\mathcal{L}}^\star) := (\{\pi_t^\star\}_{t=0}^{\infty}, \{\mathcal{L}_t^\star\}_{t=0}^{\infty})$ is called an NE if

1. (Single player side) Fix $\boldsymbol{\mathcal{L}}^\star$; then for any policy sequence $\boldsymbol{\pi} := \{\pi_t\}_{t=0}^{\infty}$ and any initial state $s \in \mathcal{S}$,
$$V(s, \boldsymbol{\pi}^\star, \boldsymbol{\mathcal{L}}^\star) \ge V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^\star). \qquad (2)$$

2. (Population side) $\mathbb{P}_{s_t, a_t} = \mathcal{L}_t^\star$ for all $t \ge 0$, where $\{s_t, a_t\}_{t=0}^{\infty}$ is the dynamics under the policy sequence $\boldsymbol{\pi}^\star$ starting from $s_0 \sim \mu_0^\star$, with $a_t \sim \pi_t^\star(s_t, \mu_t^\star)$, $s_{t+1} \sim P(\cdot|s_t, a_t, \mathcal{L}_t^\star)$, and $\mu_t^\star$ being the population state marginal of $\mathcal{L}_t^\star$.

The single player side condition captures the optimality of $\boldsymbol{\pi}^\star$ when the population side is fixed. The population side condition ensures the "consistency" of the solution: it guarantees that the state and action distribution flow of the single player does match the population state and action sequence $\boldsymbol{\mathcal{L}}^\star$.

2.3 Example: GMFG for the repeated auction

Now, consider the repeated Vickrey auction with a budget constraint in Section 1. Take a representative advertiser in the auction. Denote $s_t \in \{0, 1, 2, \dots, s^{\max}\}$ as the budget of this player at time t, where $s^{\max} \in \mathbb{N}^+$ is the maximum budget allowed on the Ad exchange with a unit bidding price. Denote $a_t$ as the bid price submitted by this player and $\alpha_t$ as the bidding (action) distribution of the population. The reward for this advertiser with bid $a_t$ and budget $s_t$ is

$$r_t = \mathbb{I}_{w_t^M = 1}\left[ (v_t - a_t^M) - (1+\rho)\,\mathbb{I}_{s_t < a_t^M}\,(a_t^M - s_t) \right]. \qquad (3)$$

That is, if this player does not win the bid, the budget will remain the same. If she wins and has enough money to pay, her budget will decrease from $s_t$ to $s_t - a_t^M$. However, if she wins but does not have enough money, her budget will be 0 after the payment and there will be a penalty in the reward function.
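To make the one-step reward concrete, here is a minimal Python sketch in the spirit of (3). The function name, the list-of-competing-bids interface, and the rule that the clearing price is the highest competing bid are illustrative assumptions standing in for the paper's M-bidder mechanism, not the authors' code.

```python
def auction_reward(s_t, a_t, other_bids, v_t, rho=0.2):
    """Sketch of the auction reward (3) for one bidder.

    s_t: current budget; a_t: submitted bid; v_t: conversion value;
    rho: overbidding penalty.  The clearing price a_M is taken to be the
    highest competing bid (illustrative stand-in for the paper's a_t^M).
    """
    a_M = max(other_bids)          # price paid upon winning
    if a_t <= a_M:                 # w_t^M = 0: lost the auction, zero reward
        return 0.0
    over = max(a_M - s_t, 0.0)     # part of the payment exceeding the budget
    return (v_t - a_M) - (1 + rho) * over
```

With a sufficient budget the reward is simply the value minus the clearing price; an insufficient budget triggers the $(1+\rho)$ penalty on the shortfall.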
Note that in this game, both the rewards $r_t$ and the dynamics of $s_t$ are unknown a priori.
In practice, one often modifies the dynamics of $s_{t+1}$ with a non-negative random budget fulfillment $\Delta(s_{t+1})$ after the auction clearing [11], such that $\hat{s}_{t+1} = s_{t+1} + \Delta(s_{t+1})$. One may see some particular choices of $\Delta(s_{t+1})$ in the experiment section (Section 5).

3 Solution for GMFGs

We now establish the existence and uniqueness of the NE to (GMFG), by generalizing the classical fixed-point approach for MFGs to this GMFG setting (see [17] and [23] for the classical case). It consists of three steps.
Step A. Fix $\boldsymbol{\mathcal{L}} := \{\mathcal{L}_t\}_{t=0}^{\infty}$; then (GMFG) becomes a classical optimization problem. Indeed, with $\boldsymbol{\mathcal{L}}$ fixed, the population state distribution sequence $\boldsymbol{\mu} := \{\mu_t\}_{t=0}^{\infty}$ is also fixed, hence the space of admissible policies is reduced to the single-player case. Solving (GMFG) is now reduced to finding a policy sequence $\pi_{t,\boldsymbol{\mathcal{L}}}^\star \in \Pi := \{\pi \mid \pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})\}$ over all admissible $\boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}} = \{\pi_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$, to maximize

$$V(s, \boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}, \boldsymbol{\mathcal{L}}) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mathcal{L}_t) \,\Big|\, s_0 = s \right], \quad \text{subject to } s_{t+1} \sim P(s_t, a_t, \mathcal{L}_t), \; a_t \sim \pi_{t,\boldsymbol{\mathcal{L}}}(s_t).$$

Notice that with $\boldsymbol{\mathcal{L}}$ fixed, one can safely suppress the dependency on $\mu_t$ in the admissible policies. Moreover, given this fixed $\boldsymbol{\mathcal{L}}$ sequence and the solution $\boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}^\star := \{\pi_{t,\boldsymbol{\mathcal{L}}}^\star\}_{t=0}^{\infty}$, one can define a mapping from the fixed population distribution sequence $\boldsymbol{\mathcal{L}}$ to an arbitrarily chosen optimal randomized policy sequence. That is,

$$\Gamma_1 : \{\mathcal{P}(\mathcal{S}\times\mathcal{A})\}_{t=0}^{\infty} \to \{\Pi\}_{t=0}^{\infty},$$

such that $\boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}^\star = \Gamma_1(\boldsymbol{\mathcal{L}})$.
Note that this $\boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}^\star$ sequence satisfies the single player side condition in Definition 2.1 for the population state-action pair sequence $\boldsymbol{\mathcal{L}}$. That is, $V(s, \boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}^\star, \boldsymbol{\mathcal{L}}) \ge V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}})$ for any policy sequence $\boldsymbol{\pi} = \{\pi_t\}_{t=0}^{\infty}$ and any initial state $s \in \mathcal{S}$.
As in the MFG literature [17], a feedback regularity condition is needed for analyzing Step A.
Assumption 1. There exists a constant $d_1 \ge 0$ such that for any $\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}' \in \{\mathcal{P}(\mathcal{S}\times\mathcal{A})\}_{t=0}^{\infty}$,

$$D(\Gamma_1(\boldsymbol{\mathcal{L}}), \Gamma_1(\boldsymbol{\mathcal{L}}')) \le d_1 \mathcal{W}_1(\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}'), \qquad (4)$$

where

$$D(\boldsymbol{\pi}, \boldsymbol{\pi}') := \sup_{s\in\mathcal{S}} \sup_{t\in\mathbb{N}} W_1(\pi_t(s), \pi_t'(s)), \qquad \mathcal{W}_1(\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}') := \sup_{t\in\mathbb{N}} W_1(\mathcal{L}_t, \mathcal{L}_t'), \qquad (5)$$

and $W_1$ is the $\ell_1$-Wasserstein distance between probability measures [9, 31, 37].

Step B. Based on the analysis in Step A and $\boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}^\star = \{\pi_{t,\boldsymbol{\mathcal{L}}}^\star\}_{t=0}^{\infty}$, update the initial sequence $\boldsymbol{\mathcal{L}}$ to $\boldsymbol{\mathcal{L}}'$ following the controlled dynamics $P(\cdot|s_t, a_t, \mathcal{L}_t)$.
Accordingly, for any admissible policy sequence $\boldsymbol{\pi} \in \{\Pi\}_{t=0}^{\infty}$ and a joint population state-action pair sequence $\boldsymbol{\mathcal{L}} \in \{\mathcal{P}(\mathcal{S}\times\mathcal{A})\}_{t=0}^{\infty}$, define a mapping $\Gamma_2 : \{\Pi\}_{t=0}^{\infty} \times \{\mathcal{P}(\mathcal{S}\times\mathcal{A})\}_{t=0}^{\infty} \to \{\mathcal{P}(\mathcal{S}\times\mathcal{A})\}_{t=0}^{\infty}$ as follows:

$$\Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}) := \hat{\boldsymbol{\mathcal{L}}} = \{\mathbb{P}_{s_t, a_t}\}_{t=0}^{\infty}, \qquad (6)$$

where $s_{t+1} \sim \mu_t P(\cdot|\cdot, a_t, \mathcal{L}_t)$, $a_t \sim \pi_t(s_t)$, $s_0 \sim \mu_0$, and $\mu_t$ is the population state marginal of $\mathcal{L}_t$.
One needs a standard assumption in this step.
Assumption 2.
There exist constants $d_2, d_3 \ge 0$ such that for any admissible policy sequences $\boldsymbol{\pi}, \boldsymbol{\pi}^1, \boldsymbol{\pi}^2$ and joint distribution sequences $\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}^1, \boldsymbol{\mathcal{L}}^2$,

$$\mathcal{W}_1(\Gamma_2(\boldsymbol{\pi}^1, \boldsymbol{\mathcal{L}}), \Gamma_2(\boldsymbol{\pi}^2, \boldsymbol{\mathcal{L}})) \le d_2 D(\boldsymbol{\pi}^1, \boldsymbol{\pi}^2), \qquad (7)$$
$$\mathcal{W}_1(\Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^1), \Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^2)) \le d_3 \mathcal{W}_1(\boldsymbol{\mathcal{L}}^1, \boldsymbol{\mathcal{L}}^2). \qquad (8)$$

Assumption 2 can be reduced to Lipschitz continuity and boundedness of the transition dynamics P (see the Appendix for more details).
Step C. Repeat Step A and Step B until $\boldsymbol{\mathcal{L}}'$ matches $\boldsymbol{\mathcal{L}}$.
This step takes care of the population side condition. To ensure the convergence of the combined Step A and Step B, it suffices that $\Gamma : \{\mathcal{P}(\mathcal{S}\times\mathcal{A})\}_{t=0}^{\infty} \to \{\mathcal{P}(\mathcal{S}\times\mathcal{A})\}_{t=0}^{\infty}$, with $\Gamma(\boldsymbol{\mathcal{L}}) := \Gamma_2(\Gamma_1(\boldsymbol{\mathcal{L}}), \boldsymbol{\mathcal{L}})$, is a contractive mapping under the $\mathcal{W}_1$ distance. Then by the Banach fixed point theorem and the completeness of the related metric spaces, there exists a unique NE to the GMFG.
In summary, we have
Theorem 1 (Existence and Uniqueness of GMFG solution). Given Assumptions 1 and 2, and assuming that $d_1 d_2 + d_3 < 1$, there exists a unique NE to (GMFG).

4 RL Algorithms for (stationary) GMFGs

In this section, we design the computational algorithm for the GMFG. Since the reward and transition distributions are unknown, this is simultaneously learning the system and finding the NE of the game. We will focus on the case with finite state and action spaces, i.e., $|\mathcal{S}|, |\mathcal{A}| < \infty$, and will look for stationary (time-independent) NEs. Accordingly, we abbreviate $\boldsymbol{\pi} := \{\pi\}_{t=0}^{\infty}$ and $\boldsymbol{\mathcal{L}} := \{\mathcal{L}\}_{t=0}^{\infty}$ as $\pi$ and $\mathcal{L}$, respectively. This stationarity property enables developing an appropriate time-independent Q-learning algorithm, suitable for an infinite time horizon game.
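Steps A-C amount to a Picard (fixed-point) iteration on the composed map $\Gamma(\mathcal{L}) = \Gamma_2(\Gamma_1(\mathcal{L}), \mathcal{L})$, which Theorem 1's condition $d_1 d_2 + d_3 < 1$ makes a contraction. A minimal numerical sketch of such an iteration follows; the toy linear map on a 3-point probability simplex is only a stand-in for the true $\Gamma$ (which requires solving an MDP in Step A), chosen because a row-stochastic matrix with positive entries is contractive on the simplex.

```python
import numpy as np

def fixed_point_iterate(gamma_map, L0, tol=1e-10, max_iter=1000):
    """Picard iteration L_{k+1} = Gamma(L_k); converges to the unique
    fixed point when Gamma is a contraction (Banach fixed-point theorem)."""
    L = L0
    for _ in range(max_iter):
        L_next = gamma_map(L)
        if np.abs(L_next - L).sum() < tol:   # l1 distance as a proxy for W1
            return L_next
        L = L_next
    return L

# Toy stand-in for Gamma: a positive row-stochastic matrix acting on
# distributions over 3 states (contractive on the probability simplex).
A = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])
L_star = fixed_point_iterate(lambda L: L @ A, np.array([1.0, 0.0, 0.0]))
```

Here `L_star` approximates the unique fixed point of the toy map, the analogue of the NE distribution $\mathcal{L}^\star$ in Step C.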
Modi\ufb01cation from the GMFG\nframework to this special stationary setting is straightforward, and is left to Appendix B. Note that\nthe assumptions to guarantee the existence and uniqueness of GMFG solutions are slightly different\nbetween the stationary and non-stationary cases. For instance, one can compare (7)-(8) with (21)-(22).\nThe algorithm consists of two steps, parallel to Step A and Step B in Section 3.\nStep 1: Q-learning with stability for \ufb01xed L. With L \ufb01xed, it becomes a standard learning\nproblem for an in\ufb01nite horizon MDP. We will focus on the Q-learning algorithm [35, 32].\nThe Q-learning algorithm approximates the value iteration by stochastic approximation. At each step\nwith the state s and an action a, the system reaches state s(cid:48) according to the controlled dynamics and\nthe Q-function is updated according to\n\nQL(s, a) \u2190 (1 \u2212 \u03b2t(s, a))QL(s, a) + \u03b2t(s, a) [r(s, a,L) + \u03b3 max\u02dca QL(s(cid:48), \u02dca)] ,\n\n(9)\n\nwhere the step size \u03b2t(s, a) can be chosen as (cf. [7])\n\n(cid:26)|#(s, a, t) + 1|\u2212h,\n\n\u03b2t(s, a) =\n\n0,\n\n(s, a) = (st, at),\notherwise.\n\nwith h \u2208 (1/2, 1). Here #(s, a, t) is the number of times up to time t that one visits the pair (s, a).\nThe algorithm then proceeds to choose action a(cid:48) based on QL with appropriate exploration strategies,\nincluding the \u0001-greedy strategy.\n\n5\n\n\fAfter obtaining the approximate \u02c6Q(cid:63)L, in order to retrieve an approximately optimal policy, it would be\nnatural to de\ufb01ne an argmax-e operator so that actions with equal maximum Q-values would have\nequal probabilities to be selected. Unfortunately, the discontinuity and sensitivity of argmax-e could\nlead to an unstable algorithm (see Figure 4 for the corresponding naive Algorithm 2 in Appendix). 
Instead, we consider a Boltzmann policy based on the operator $\mathrm{softmax}_c : \mathbb{R}^n \to \mathbb{R}^n$, defined as

$$\mathrm{softmax}_c(x)_i = \frac{\exp(c x_i)}{\sum_{j=1}^{n} \exp(c x_j)}. \qquad (10)$$

This operator is smooth and close to argmax-e (see Lemma 7 in the Appendix). Moreover, even though Boltzmann policies are not optimal, the difference between the Boltzmann policy and the optimal one can always be controlled by choosing the hyper-parameter c in the softmax operator appropriately. Note that other smoothing operators (e.g., Mellowmax [2]) may also be considered in the future.
Step 2: error control in updating $\mathcal{L}$. Given the sub-optimality of the Boltzmann policy, one needs to characterize the difference between the optimal policy and the non-optimal ones. In particular, one can define the action gap between the best action and the second best action in terms of the Q-value as $\delta_s(\mathcal{L}) := \max_{a'\in\mathcal{A}} Q_{\mathcal{L}}^\star(s, a') - \max_{a \notin \mathrm{argmax}_{a\in\mathcal{A}} Q_{\mathcal{L}}^\star(s, a)} Q_{\mathcal{L}}^\star(s, a) > 0$. The action gap is important for approximation algorithms [3], and is closely related to problem-dependent bounds for regret analysis in reinforcement learning and multi-armed bandits, and to advantage learning algorithms including A2C [27].
The problem is: in order for the learning algorithm to converge in terms of $\mathcal{L}$ (Theorem 2), one needs to ensure a definite differentiation between the optimal policy and the sub-optimal ones. This is problematic, as the infimum of $\delta_s(\mathcal{L})$ over an infinite number of $\mathcal{L}$ can be 0. To address this, the population distribution at step k, say $\mathcal{L}_k$, needs to be projected to a finite grid, called an $\epsilon$-net. The relation between the $\epsilon$-net and action gaps is as follows:
For any $\epsilon > 0$, there exist a positive function $\phi(\epsilon)$ and an $\epsilon$-net $S_\epsilon := \{\mathcal{L}^{(1)}, \dots, \mathcal{L}^{(N_\epsilon)}\} \subseteq \mathcal{P}(\mathcal{S}\times\mathcal{A})$, with the properties that $\min_{i=1,\dots,N_\epsilon} d_{TV}(\mathcal{L}, \mathcal{L}^{(i)}) \le \epsilon$ for any $\mathcal{L} \in \mathcal{P}(\mathcal{S}\times\mathcal{A})$, and that $\max_{a'\in\mathcal{A}} Q_{\mathcal{L}^{(i)}}^\star(s, a') - Q_{\mathcal{L}^{(i)}}^\star(s, a) \ge \phi(\epsilon)$ for any $i = 1, \dots, N_\epsilon$, $s \in \mathcal{S}$, and any $a \notin \mathrm{argmax}_{a\in\mathcal{A}} Q_{\mathcal{L}^{(i)}}^\star(s, a)$.
Here the existence of $\epsilon$-nets is trivial due to the compactness of the probability simplex $\mathcal{P}(\mathcal{S}\times\mathcal{A})$, and the existence of $\phi(\epsilon)$ comes from the finiteness of the action set $\mathcal{A}$. In practice, $\phi(\epsilon)$ often takes the form $D\epsilon^\alpha$, with $D > 0$ and the exponent $\alpha > 0$ characterizing the decay rate of the action gaps.
Finally, to enable Q-learning, it is assumed that one has access to a population simulator (see [30, 38]). That is, for any policy $\pi \in \Pi$, given the current state $s \in \mathcal{S}$ and any population distribution $\mathcal{L}$, one can obtain the next state $s' \sim P(\cdot|s, \pi(s, \mu), \mathcal{L})$, a reward $r = r(s, \pi(s, \mu), \mathcal{L})$, and the next population distribution $\mathcal{L}' = \mathbb{P}_{s', \pi(s', \mu)}$. For brevity, we denote the simulator as $(s', r, \mathcal{L}') = G(s, \pi, \mathcal{L})$.
Here $\mu$ is the state marginal distribution of $\mathcal{L}$.
In summary, we propose the following Algorithm 1.

Algorithm 1 Q-learning for GMFGs (GMF-Q)
1: Input: Initial $\mathcal{L}_0$, tolerance $\epsilon > 0$.
2: for k = 0, 1, ... do
3:   Perform Q-learning for $T_k$ iterations to find the approximate Q-function $\hat{Q}_k^\star(s, a) = \hat{Q}_{\mathcal{L}_k}^\star(s, a)$ of an MDP with dynamics $P_{\mathcal{L}_k}(s'|s, a)$ and rewards $r_{\mathcal{L}_k}(s, a)$.
4:   Compute $\pi_k \in \Pi$ with $\pi_k(s) = \mathrm{softmax}_c(\hat{Q}_k^\star(s, \cdot))$.
5:   Sample $s \sim \mu_k$ ($\mu_k$ is the population state marginal of $\mathcal{L}_k$), and obtain $\tilde{\mathcal{L}}_{k+1}$ from $G(s, \pi_k, \mathcal{L}_k)$.
6:   Find $\mathcal{L}_{k+1} = \mathrm{Proj}_{S_\epsilon}(\tilde{\mathcal{L}}_{k+1})$.
7: end for

Note that softmax is applied only at the end of each outer iteration, when a good approximation of the Q-function has been obtained. Within each outer iteration, for the MDP problem with fixed mean-field information, the standard Q-learning method is applied.

Footnote 2: argmax-e is not continuous. Let $x = (1, 1)$; then argmax-e$(x) = (1/2, 1/2)$. For any $\epsilon > 0$, let $y = (1, 1-\epsilon)$; then argmax-e$(y) = (1, 0)$.

Here $\mathrm{Proj}_{S_\epsilon}(\mathcal{L}) = \mathrm{argmin}_{\mathcal{L}^{(1)}, \dots, \mathcal{L}^{(N_\epsilon)}} d_{TV}(\mathcal{L}^{(i)}, \mathcal{L})$. For computational tractability, it is sufficient to choose $S_\epsilon$ as a truncation grid, so that the projection of $\tilde{\mathcal{L}}_k$ onto the $\epsilon$-net reduces to truncating $\tilde{\mathcal{L}}_k$ to a certain number of digits. For instance, in our experiment the number of digits is chosen to be 4. The choices of the hyper-parameters c and $T_k$ can be found in Lemma 8 and Theorem 2.
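The outer loop of Algorithm 1 can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: the inner Q-learning pass `q_learn` and the population simulator `simulator` (the paper's $G$) are supplied as callbacks, the joint distribution is flattened into a plain array, and the $\epsilon$-net projection is realized as digit truncation plus renormalization, as the surrounding text suggests.

```python
import numpy as np

def softmax_c(x, c=4.0):
    """Boltzmann weights (10): softmax_c(x)_i = exp(c x_i) / sum_j exp(c x_j),
    computed stably by subtracting the max before exponentiating."""
    z = np.exp(c * (x - x.max()))
    return z / z.sum()

def project_to_net(L, digits=4):
    """Projection onto a truncation epsilon-net: round each probability
    to a fixed number of digits, then renormalize."""
    L = np.round(L, digits)
    return L / L.sum()

def gmf_q(L0, q_learn, simulator, n_outer=20, c=4.0):
    """Outer loop of GMF-Q (Algorithm 1).  `q_learn(L)` returns an
    |S| x |A| table approximating Q*_L (inner Q-learning, lines 3);
    `simulator(pi, L)` returns the next population distribution
    (lines 5).  Both are assumed interfaces, not specified by the paper
    at this level of detail."""
    L = L0
    for _ in range(n_outer):
        Q = q_learn(L)                                        # line 3
        pi = np.array([softmax_c(Q[s], c)                     # line 4
                       for s in range(Q.shape[0])])
        L = project_to_net(simulator(pi, L))                  # lines 5-6
    return pi, L
```

With stub callbacks the loop runs end-to-end, which is enough to see the control flow: learn, smooth with softmax, push the population forward, project.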
In practice, the algorithm is rather robust with respect to these hyper-parameters.
In the special case when the rewards $r_{\mathcal{L}}$ and transition dynamics $P(\cdot|s, a, \mathcal{L})$ are known, one can replace the Q-learning step in Algorithm 1 by a value iteration, resulting in the GMF-V Algorithm 3 in the Appendix.
We next show the convergence of this GMF-Q algorithm (Algorithm 1) to an $\epsilon$-Nash equilibrium of (GMFG), with complexity analysis.
Theorem 2 (Convergence and complexity of GMF-Q). Assume the same conditions as in Theorem 4 and Lemma 8 in the Appendix. For any tolerances $\epsilon, \delta > 0$, set $\delta_k = \delta/K_{\epsilon,\eta}$, $\epsilon_k = (k+1)^{-(1+\eta)}$ for some $\eta \in (0, 1]$ ($k = 0, \dots, K_{\epsilon,\eta}-1$), $T_k = T_{\mathcal{L}_k}^M(\delta_k, \epsilon_k)$ (defined in Lemma 8 in the Appendix), and $c = \log(1/\epsilon)/\phi(\epsilon)$. Then with probability at least $1 - 2\delta$, $\mathcal{W}_1(\mathcal{L}_{K_{\epsilon,\eta}}, \mathcal{L}^\star) \le C\epsilon$. Moreover, the total number of iterations $T = \sum_{k=0}^{K_{\epsilon,\eta}-1} T_{\mathcal{L}_k}^M(\delta_k, \epsilon_k)$ is bounded by (footnote 3)

$$T = O\left( K_{\epsilon,\eta}^{1 + \frac{4}{h}} \left( \log(K_{\epsilon,\eta}/\delta) \right)^{\frac{2}{1-h} + \frac{2}{h} + 3} \right). \qquad (11)$$

Here $K_{\epsilon,\eta} := \left\lceil 2 \max\left\{ (\eta\epsilon/c)^{-1/\eta},\; \log_d\!\big(\epsilon / \max\{\mathrm{diam}(\mathcal{S})\,\mathrm{diam}(\mathcal{A}), c\}\big) + 1 \right\} \right\rceil$ is the number of outer iterations, h is the step-size exponent in Q-learning (defined in Lemma 8 in the Appendix), and the constant C is independent of $\delta$, $\epsilon$ and $\eta$.
The proof of Theorem 2 in the Appendix depends on the Lipschitz continuity of the softmax operator [8], the closeness between softmax and argmax-e (Lemma 7 in the Appendix), and the complexity of Q-learning for the MDP (Lemma 8 in the Appendix).

5 Experiment: repeated auction game

In this section, we report the performance of the proposed GMF-Q Algorithm.
The objectives of the experiments include 1) testing the convergence, stability, and learning ability of GMF-Q in the GMFG setting, and 2) comparing GMF-Q with existing multi-agent reinforcement learning algorithms, including the IL algorithm and the MF-Q algorithm.
We take the GMFG framework for the repeated auction game from Section 2.3. Here each advertiser learns to bid in the auction with a budget constraint.
Parameters. The model parameters are set as: $|\mathcal{S}| = |\mathcal{A}| = 10$, the overbidding penalty $\rho = 0.2$, the distribution of the conversion rate $v \sim \mathrm{uniform}(\{1, 2, 3, 4\})$, and the competition intensity index $M = 5$. The random fulfillment is chosen as: if $s < s^{\max}$, $\Delta(s) = 1$ with probability 1/2 and $\Delta(s) = 0$ with probability 1/2; if $s = s^{\max}$, $\Delta(s) = 0$.
The algorithm parameters are (unless otherwise specified): the temperature parameter $c = 4.0$, the discount factor $\gamma = 0.8$, the parameter h from Lemma 8 in the Appendix being $h = 0.87$, and the baseline inner iteration being 2000. Recall that for GMF-Q, both v and the dynamics P for s are unknown a priori. The 90%-confidence intervals are calculated with 20 sample paths.

Performance evaluation in the GMFG setting. Our experiment shows that the GMF-Q Algorithm is efficient and robust, and learns well.

Convergence and stability of GMF-Q. GMF-Q is efficient and robust. First, GMF-Q converges after about 10 outer iterations; secondly, as the number of inner iterations increases, the error decreases (Figure 2); and finally, the convergence is robust with respect to both the change of the number of states and the initial population distribution (Figure 3).
In contrast, the Naive algorithm does not converge even with 10000 inner iterations, and the joint distribution $\mathcal{L}_t$ keeps fluctuating (Figure 4).

Footnote 3: Let $h = 3/4$, $\eta = 1$; the bound reduces to $T = O\big(K_{\epsilon}^{19/3} (\log(K_{\epsilon}/\delta))^{41/3}\big)$.
Note that this bound may not be tight.

Table 1: Relative Q-table error $\Delta Q$ against GMF-V with $T_k^{\text{GMF-V}} = 5000$.
$T_k^{\text{GMF-Q}}$: 1000 | 3000 | 5000 | 10000
$\Delta Q$: 0.21263 | 0.1294 | 0.10258 | 0.0989

Figure 1: Q-tables: (a) GMF-Q vs. (b) GMF-V.

Learning accuracy of GMF-Q. GMF-Q learns well. Its learning accuracy is tested against its special form GMF-V (Appendix G), with the latter assuming a known distribution of the conversion rate v and known dynamics P for the budget s. The relative $L_2$ distance between the Q-tables of these two algorithms is $\Delta Q := \frac{\|Q^{\text{GMF-V}} - Q^{\text{GMF-Q}}\|_2}{\|Q^{\text{GMF-V}}\|_2} = 0.098879$. This implies that GMF-Q learns the true GMFG solution with 90-percent accuracy with 10000 inner iterations.
The heatmap in Figure 1(a) is the Q-table for the GMF-Q Algorithm after 20 outer iterations, with $T_k^{\text{GMF-Q}} = 10000$ inner iterations within each outer iteration. The heatmap in Figure 1(b) is the Q-table for the GMF-V Algorithm after 20 outer iterations, with $T_k^{\text{GMF-V}} = 5000$ inner iterations within each outer iteration.

Comparison with existing algorithms for N-player games. To test the effectiveness of GMF-Q for approximating N-player games, we next compare GMF-Q with the IL algorithm and the MF-Q algorithm. The IL algorithm [36] considers N independent players, where each player solves a decentralized reinforcement learning problem ignoring the other players in the system.
The MF-Q algorithm [40] extends the Nash-Q learning algorithm for the N-player game introduced in [15], adds the aggregate actions $\bar{a}^{-i} = \frac{\sum_{j\neq i} a^j}{N-1}$ from the opponents, and works for the class of games where the interactions are only through the average actions of the N players.

Figure 2: Convergence with different numbers of inner iterations.
Figure 3: Convergence with different numbers of states.
Figure 4: Fluctuations of the Naive Algorithm (30 sample paths): (a) fluctuation in $l_\infty$; (b) fluctuation in $l_1$.
Figure 5: Learning accuracy based on $C(\boldsymbol{\pi})$: (a) $|\mathcal{S}| = |\mathcal{A}| = 10$, N = 20; (b) $|\mathcal{S}| = |\mathcal{A}| = 20$, N = 20; (c) $|\mathcal{S}| = |\mathcal{A}| = 10$, N = 40.

Performance metric. We adopt the following metric to measure the difference between a given policy $\boldsymbol{\pi}$ and an NE (here $\epsilon_0 > 0$ is a safeguard, and is taken as 0.1 in the experiments):

$$C(\boldsymbol{\pi}) = \frac{1}{N|\mathcal{S}|^N} \sum_{i=1}^{N} \sum_{\boldsymbol{s}\in\mathcal{S}^N} \frac{\max_{\pi^i} V_i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i)) - V_i(\boldsymbol{s}, \boldsymbol{\pi})}{\left| \max_{\pi^i} V_i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i)) \right| + \epsilon_0}.$$

Clearly $C(\boldsymbol{\pi}) \ge 0$, and $C(\boldsymbol{\pi}^*) = 0$ if and only if $\boldsymbol{\pi}^*$ is an NE. The policy $\arg\max_{\pi^i} V_i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i))$ is called the best response to $\boldsymbol{\pi}^{-i}$.
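Given precomputed value tables, the metric $C(\boldsymbol{\pi})$ reduces to an average of normalized best-response gaps. The array-based interface below (values indexed by player and state profile) is a hypothetical convenience of ours, not the paper's evaluation code; computing the best-response values themselves requires solving each player's MDP and is out of scope here.

```python
import numpy as np

def exploitability(V_pi, V_best, eps0=0.1):
    """Normalized exploitability C(pi).

    V_pi[i, s]:   player i's value under the joint policy pi from state
                  profile s (profiles enumerated along the second axis).
    V_best[i, s]: player i's best-response value against pi^{-i}.
    eps0:         safeguard in the denominator (0.1 in the experiments).
    """
    gap = V_best - V_pi                    # unilateral improvement per (i, s)
    return float(np.mean(gap / (np.abs(V_best) + eps0)))
```

At an NE no player can improve unilaterally, so every gap (and hence the metric) is zero; larger values mean the policy profile is further from equilibrium.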
A similar metric without normalization has been adopted in [29].
Our experiment (Figure 5) shows that GMF-Q is superior in terms of convergence rate, accuracy, and stability for approximating an N-player game: GMF-Q converges faster than IL and MF-Q, with the smallest error and the lowest variance, as the ε-net improves the stability.
For instance, when N = 20, the IL algorithm converges with the largest error, 0.220. The error from MF-Q is 0.101, smaller than that of IL but still bigger than the error from GMF-Q, which converges with the lowest error, 0.065. Moreover, as N increases, the error of GMF-Q decreases while the errors of both MF-Q and IL increase significantly. As |S| and |A| increase, GMF-Q is robust with respect to this increase of dimensionality, while both MF-Q and IL clearly suffer from it, with decreased convergence rate and accuracy. Therefore, GMF-Q is more scalable than IL and MF-Q when the system is complex and the number of players N is large.

6 Conclusion

This paper builds a GMFG framework for simultaneous learning and decision-making, establishes the existence and uniqueness of the NE, and proposes a Q-learning algorithm GMF-Q with convergence and complexity analysis. Experiments demonstrate the superior performance of GMF-Q.

Acknowledgment

We thank Haoran Tang for the insightful early discussion on stabilizing the Q-learning algorithm and for sharing the ideas of his work on soft Q-learning [12], which motivated our adoption of the soft-max operators.
We also thank the anonymous NeurIPS 2019 reviewers for the valuable suggestions.

References

[1] B. Acciaio, J. Backhoff, and R. Carmona. Extended mean field control problems: stochastic maximum principle and transport perspective. arXiv preprint arXiv:1802.05754, 2018.

[2] K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 243–252, 2017.

[3] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: new operators for reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 1476–1483, 2016.

[4] M. Benaim and J. Y. Le Boudec. A class of mean field interaction models for computer and communication systems. Performance Evaluation, 65(11-12):823–838, 2008.

[5] F. Bolley. Separability and completeness for the Wasserstein distance. Séminaire de Probabilités XLI, pages 371–377, 2008.

[6] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 661–670. ACM, 2017.

[7] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.

[8] B. Gao and L. Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

[9] A. L. Gibbs and F. E. Su.
On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.

[10] D. A. Gomes, J. Mohr, and R. R. Souza. Discrete time, finite state space mean field games. Journal de mathématiques pures et appliquées, 93(3):308–328, 2010.

[11] R. Gummadi, P. Key, and A. Proutiere. Repeated auctions under budget constraints: Optimal bidding strategies and equilibria. In the Eighth Ad Auction Workshop, 2012.

[12] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

[13] J. Hamari, M. Sjöklint, and A. Ukkonen. The sharing economy: Why people participate in collaborative consumption. Journal of the Association for Information Science and Technology, 67(9):2047–2059, 2016.

[14] P. Hernandez-Leal, B. Kartal, and M. E. Taylor. Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv preprint arXiv:1810.05587, 2018.

[15] J. Hu and M. P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.

[16] M. Huang and Y. Ma. Mean field stochastic games with binary action spaces and monotone costs. arXiv preprint arXiv:1701.06661, 2017.

[17] M. Huang, R. P. Malhamé, and P. E. Caines. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.

[18] K. Iyer, R. Johari, and M. Sundararajan. Mean field equilibria of dynamic auctions with learning. ACM SIGecom Exchanges, 10(3):10–14, 2011.

[19] S. H. Jeong, A. R. Kang, and H. K. Kim. Analysis of game bot's behavioral characteristics in social interaction networks of MMORPG. ACM SIGCOMM Computer Communication Review, 45(4):99–100, 2015.

[20] J. Jin, C. Song, H. Li, K. Gai, J.
Wang, and W. Zhang. Real-time bidding with multi-agent reinforcement learning in display advertising. arXiv preprint arXiv:1802.09756, 2018.

[21] S. Kapoor. Multi-agent reinforcement learning: A report on challenges and approaches. arXiv preprint arXiv:1807.09427, 2018.

[22] A. C. Kizilkale and P. E. Caines. Mean field stochastic adaptive control. IEEE Transactions on Automatic Control, 58(4):905–920, 2013.

[23] J.-M. Lasry and P.-L. Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.

[24] C.-A. Lehalle and C. Mouzouni. A mean field game of portfolio trading and its consequences on perceived correlations. arXiv preprint arXiv:1902.09606, 2019.

[25] J. P. M. López. Discrete time mean field games: The short-stage limit. Journal of Dynamics & Games, 2(1):89–101, 2015.

[26] D. Mguni, J. Jennings, and E. M. de Cote. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[27] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

[28] C. H. Papadimitriou and T. Roughgarden. Computing equilibria in multi-player games. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 82–91, 2005.

[29] J. Pérolat, B. Piot, and O. Pietquin. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, 2018.

[30] J. Pérolat, F. Strub, B. Piot, and O. Pietquin. Learning Nash equilibrium for general-sum Markov games from batch data. arXiv preprint arXiv:1606.08718, 2016.

[31] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[32] B.
Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2018.

[33] N. Saldi, T. Basar, and M. Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018.

[34] J. Subramanian and A. Mahajan. Reinforcement learning in stationary mean-field games. In 18th International Conference on Autonomous Agents and Multiagent Systems, pages 251–259, 2019.

[35] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[36] M. Tan. Multi-agent reinforcement learning: independent vs. cooperative agents. In International Conference on Machine Learning, pages 330–337, 1993.

[37] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[38] H. T. Wai, Z. Yang, Z. Wang, and M. Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems, pages 9672–9683, 2018.

[39] J. Yang, X. Ye, R. Trivedi, H. Xu, and H. Zha. Deep mean field games for learning optimal behavior policy of large populations. arXiv preprint arXiv:1711.03156, 2017.

[40] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.

[41] H. Yin, P. G. Mehta, S. P. Meyn, and U. V. Shanbhag. Learning in mean-field games.
IEEE Transactions on Automatic Control, 59(3):629–644, 2013.