{"title": "Modelling the Dynamics of Multiagent Q-Learning in Repeated Symmetric Games: a Mean Field Theoretic Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 12125, "page_last": 12135, "abstract": "Modelling the dynamics of multi-agent learning has long been an important research topic, but all of the previous works focus on 2-agent settings and mostly use evolutionary game theoretic approaches. In this paper, we study an n-agent setting with n tends to infinity, such that agents learn their policies concurrently over repeated symmetric bimatrix games with some other agents. Using mean field theory, we approximate the effects of other agents on a single agent by an averaged effect. A Fokker-Planck equation that describes the evolution of the probability distribution of Q-values in the agent population is derived. To the best of our knowledge, this is the first time to show the Q-learning dynamics under an n-agent setting can be described by a system of only three equations. We validate our model through comparisons with agent-based simulations on typical symmetric bimatrix games and different initial settings of Q-values.", "full_text": "Modelling the Dynamics of Multiagent Q-Learning in\nRepeated Symmetric Games: a Mean Field Theoretic\n\nApproach\n\nShuyue Hu, Chin-Wing Leung, Ho-fung Leung\n\nThe Chinese University of Hong Kong, Hong Kong, China\n\n{syhu,cwleung,lhf}@cse.cuhk.edu.hk\n\nAbstract\n\nModelling the dynamics of multi-agent learning has long been an important re-\nsearch topic, but all of the previous works focus on 2-agent settings and mostly\nuse evolutionary game theoretic approaches. In this paper, we study an n-agent\nsetting with n tends to in\ufb01nity, such that agents learn their policies concurrently\nover repeated symmetric bimatrix games with some other agents. Using the mean\n\ufb01eld theory, we approximate the effects of other agents on a single agent by an\naveraged effect. 
A Fokker-Planck equation that describes the evolution of the probability distribution of Q-values in the agent population is derived. To the best of our knowledge, this is the first work to show that the Q-learning dynamics under an n-agent setting can be described by a system of only three equations. We validate our model through comparisons with agent-based simulations on typical symmetric bimatrix games and different initial settings of Q-values.

1 Introduction

A multi-agent system concerns a set of autonomous agents interacting in a shared environment. Learning in multi-agent systems has recently attracted much attention [3, 13, 15], since multi-agent systems find application in a wide variety of domains, such as traffic control [1], energy management [20], robotic coordination [19], and distributed sensing [16]. While single-agent reinforcement learning has acquired a strong theoretical foundation [26], there is a lack of a thorough understanding of reinforcement learning under multi-agent settings [2]. Shoham [24] calls for more grounded research in this area rather than designing arbitrary learning strategies that result in convergence to a certain solution concept. Bloembergen et al. [2] point out that modelling multi-agent learning dynamics may facilitate parameter tuning and systematic comparison of different learning algorithms, and shed light on the design of new learning algorithms.

Tuyls et al. [27, 29] model the dynamics of Q-learning with Boltzmann exploration in repeated 2-player bimatrix games using an evolutionary game theoretic approach. 
They derive a differential equation for each of the row and column players, and show that the learning process of each player can be understood as the replicator dynamics of a strategy change in an infinitely large agent population. Extensions have been made to study the dynamics of other learning algorithms, such as FAQ-learning [10], lenient FAQ-learning [18], gradient ascent [9] and regret minimization [12], in a similar manner. Gomes and Kowalczyk [22] construct a continuous time model for Q-learning, but focus on how another exploration strategy, ε-greedy, affects the expected behaviours of agents. Wunder et al. [32] use dynamical system methods to study an idealization of Q-learning with ε-greedy in repeated 2-player general-sum games. They show that the use of this learning method in certain subclasses of general-sum games induces chaotic behaviour for some initial conditions.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In general, all of the aforementioned works focus on the dynamics of reinforcement learning under 2-agent settings. Many real-life multi-agent systems, however, involve a much greater number of agents by nature. In this paper, we consider an n-agent setting with n tending to infinity, such that, concurrently, agents learn their policies over repeated 2-player symmetric bimatrix games with some other agents in the population. The opponents that an agent interacts with will change from time to time. Thus, instead of learning against some fixed opponents, an agent learns to play with a wide range of socially changeable opponents. 
We note that this scenario, which has not been considered in the literature before, is a typical setting in norm emergence research [23].

One major difficulty of modelling multi-agent learning dynamics is to cope with non-stationarity, i.e., the fact that the interactions of agents lead to a highly dynamic shared environment [2, 28, 22]. One can expect that this non-stationarity will drastically increase as the total number of agents increases. This makes directly applying the previous models in the aforementioned works to n-agent settings inappropriate, because, in principle, the number of equations required to model the entire population dynamics is proportional to the number of agents in the population. As n tends to infinity, analyzing or solving this system of equations becomes practically infeasible. We find that the mean field theory [31] in statistical mechanics sheds light on this kind of problem. According to this theory, all of the effects that neighboring particles impose on a single particle can be approximated by an averaged effect (the mean field) on that particle. This consequently reduces the degrees of freedom of the problem, and may make the problem analytically solvable.

Here, we assume agents use Q-learning with Boltzmann exploration. Using the mean field theory, we approximate the effects of other agents on a single agent with an averaged effect, such that one can conceive that each agent in effect learns its policy over repeated interactions with a fictitious agent using the mean policy of the population. The Q-learning processes of individual agents will change the environment shared by all the agents. To capture this effect, we derive a Fokker-Planck equation that describes how the distribution of Q-values of the entire population evolves as time goes forward. We show that under the n-agent setting we consider, the population dynamics can be modelled by a system of only three equations. 
For validation, we compare the behaviours obtained by our mean field theoretic model with the behaviours found in agent-based simulations. The comparison indicates that our model well describes the qualitatively different patterns of evolution resulting from different types of symmetric bimatrix games and different initial settings of Q-values.

It is interesting to note that there is another line of research [17, 25, 33] on reinforcement learning in mean-field games [8, 14]. In this line of research, novel learning algorithms that converge to certain solution concepts (e.g., Nash equilibria) in mean-field games are proposed; however, the actual process of convergence is not formally described. This paper, to the best of our knowledge, is the first to formally describe the reinforcement learning dynamics in an infinitely large agent population. In particular, the Fokker-Planck equation describing the evolution of the probability distribution of Q-values in an agent population has not been reported elsewhere.

2 Preliminaries

In this paper, we focus on an infinitely large Q-learning agent population, in which each agent learns its policy concurrently over repeated symmetric bimatrix games with some other agents. Here we present the n-agent learning framework we consider in Section 2.1. The necessary backgrounds on symmetric bimatrix games and Q-learning are provided in Sections 2.2 and 2.3 respectively.

2.1 An n-Agent Concurrent Learning Framework

Consider a large population $N = \{1, \ldots, n\}$ of n agents, where n tends to infinity. Each agent has the same set $A = \{a_1, \ldots, a_k\}$ of k available actions for a symmetric bimatrix game G. The learning framework of these n agents is presented in Algorithm 1. Specifically, at each time step, an individual agent first independently selects an action to use according to its own policy (lines 3-5). 
Then, each agent plays the game G with each of the m opponents that are randomly selected from the population (lines 6-12). Note that the m opponents with whom an agent plays games may change for different time steps. A larger value of m suggests agents can learn their policies from the interactions with a wider range of other agents.¹ For normalization, we assume the received immediate payoff for an individual agent is averaged over all of the m games it plays at each time step. At the end of each time step, each agent learns its policy independently and concurrently, so as to maximize its own future payoff (lines 13-16).

¹ When m equals 1, our framework is in effect equivalent to social learning [23], which is a commonly adopted framework in norm emergence research.

Algorithm 1 An n-Agent Concurrent Learning Framework
Require: a set N of agents, a set A of available actions, a symmetric game G, the number m of opponents per time step, the maximum time step T
1: while t < T do
2:   t ← t + 1
3:   for each agent i ∈ N do
4:     Agent i selects an action a ∈ A according to its policy
5:   end for
6:   for each agent i ∈ N do
7:     while θi < m do                ▷ θi is the number of games that agent i has played
8:       Randomly select agent j from the population N if θj < m
9:       Agents i and j play the game G using their selected actions respectively
10:      θi ← θi + 1, θj ← θj + 1
11:    end while
12:  end for
13:  for each agent i ∈ N do
14:    Receive an immediate payoff, and update its policy using a learning method
15:    θi ← 0
16:  end for
17: end while

2.2 Symmetric Bimatrix Games

Bimatrix games are typical mathematical models of strategic interactions between rational decision-makers (agents). Conventionally, in such a game, there are two players: the row player and the column player. 
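The per-time-step loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch: the helper names `select_action` and `game_payoff` are illustrative placeholders rather than names from the paper, and an agent may end up in slightly more than m games because opponents also pick partners.

```python
import random
from collections import defaultdict

def play_round(agents, select_action, game_payoff, m):
    """One time step of the concurrent learning framework (Algorithm 1).

    `game_payoff(a, b)` is the row player's payoff for joint actions (a, b).
    Returns each agent's immediate payoff averaged over the games it played."""
    actions = {i: select_action(i) for i in agents}        # lines 3-5
    total, games = defaultdict(float), defaultdict(int)
    for i in agents:                                       # lines 6-12
        while games[i] < m:
            eligible = [z for z in agents if z != i and games[z] < m]
            j = random.choice(eligible or [z for z in agents if z != i])
            total[i] += game_payoff(actions[i], actions[j])
            total[j] += game_payoff(actions[j], actions[i])
            games[i] += 1
            games[j] += 1
    return {i: total[i] / games[i] for i in agents}        # averaged payoff
```

The returned averaged payoffs would then feed the per-agent policy update (lines 13-16).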
The players each play an action at the same time, and receive a payoff immediately. A bimatrix game is symmetric if both players have the same set of available actions, and the resulting payoff of each player depends not on the role of the player, but only on their joint actions [4]. For reasons of exposition, we focus on an action set size of 2. In Table 1, we present a general form of 2-player-2-action symmetric bimatrix games. The first number of each entry is the payoff of the row player and the second number is the payoff of the column player. Clearly, the payoff matrix of the row player is the transpose of the payoff matrix of the column player.

            Action a1   Action a2
Action a1   α, α        β, γ
Action a2   γ, β        δ, δ

Table 1: A general form of 2-player-2-action symmetric bimatrix games.

2.3 Q-Learning for Bimatrix Games

Q-learning [30] is one of the most important algorithms in reinforcement learning research, and is the basis of a number of multi-agent reinforcement learning algorithms [3, 5, 7]. Suppose that there is a set S of states and a set A of available actions, such that an agent may transit to a new state $s' \in S$ as a result of using an action $a \in A$ under the current state $s \in S$. Q-learning maintains a Q-value for each state-action pair $(s, a)$ to estimate the cumulative payoff over the successive time steps after performing action a at state s. Consider an arbitrary agent i in the agent population N. Suppose that at time t, it plays the jth action $a_j$ under state s, and receives an immediate payoff $r^i_t(s, a_j)$ accordingly. 
This agent will update its Q-value $Q^i_{t+1}(s, a_j)$ for the state-action pair $(s, a_j)$ as follows:

$Q^i_{t+1}(s, a_j) = (1 - \eta) Q^i_t(s, a_j) + \eta [r^i_t(s, a_j) + \gamma \max_{a' \in A} Q^i_t(s', a')],$ (1)

where η is the learning rate, γ is the discounting factor, and s' is the resulting state of using action a under state s, so that the term $\gamma \max_{a' \in A} Q^i_t(s', a')$ estimates the optimal discounted future payoff after the state transition. For any bimatrix game, there is only one episode in the entire course (or round) of the game: at a given time step t, the players each take one action simultaneously, and receive an immediate payoff; then, the game ends. At the next time step t + 1, agents play another round of the game. This means that from time t to t + 1, there is no state transition for an agent in bimatrix games, and the resulting state s' does not exist at all. Therefore, for bimatrix games, it is a common practice to maintain a vector of Q-values for each action, i.e., $Q^i_t = [Q^i_t(a_1) \ldots Q^i_t(a_k)]^\top$, and remove the term $\gamma \max_{a' \in A} Q^i_t(s', a')$ from the Q-value update function [11, 22, 32]:

$Q^i_{t+1}(a_j) = (1 - \eta) Q^i_t(a_j) + \eta r^i_t(a_j).$ (2)

We consider that each agent adopts a mixed-strategy policy, such that its Q-values are interpreted as Boltzmann probabilities for action selection. Let $x^i_t = [x^i_t(a_1) \ldots x^i_t(a_k)]^\top$ represent the mixed-strategy policy of agent i at time t, in which each component $x^i_t(a_j), \forall a_j \in A$ is its probability of playing action $a_j$ at time t. The value of $x^i_t(a_j)$ is given as follows:

$x^i_t(a_j) = \frac{e^{\tau Q^i_t(a_j)}}{\sum_{a \in A} e^{\tau Q^i_t(a)}},$ (3)

where $\tau \in [0, \infty)$ is the Boltzmann exploration temperature. A larger value of τ indicates less exploration for individual agents. When τ is 0, the probability of taking each action is uniform, which means that agents take actions randomly. 
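Equations 2 and 3 can be sketched directly in Python; this is a minimal illustration (the function names are ours), following the paper's convention that τ multiplies the Q-values, so τ = 0 gives a uniform policy and large τ approaches greedy selection.

```python
import math

def boltzmann_policy(q, tau):
    """Boltzmann (softmax) action probabilities from a Q-value vector (Equation 3)."""
    weights = [math.exp(tau * v) for v in q]
    z = sum(weights)
    return [w / z for w in weights]

def q_update(q, action, reward, eta):
    """Stateless Q-learning update for repeated bimatrix games (Equation 2)."""
    q = list(q)
    q[action] = (1 - eta) * q[action] + eta * reward
    return q
```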
When $\tau \to \infty$, agents take the action with the highest Q-value with probability 1.

3 A Mean Field Theoretic Model

In this section, we model the Q-learning dynamics under the n-agent setting presented in the last section. In Section 3.1, taking the view of an individual agent, we model the dynamics of its Q-values with mean field approximation, such that, fictitiously, an agent updates its Q-values in response to the mean policy of the population. In Section 3.2, taking a bird's-eye view, we model how the probability distribution of Q-values in the population evolves as time goes forward, and show that the population dynamics can be characterized by a system of three equations.

3.1 Dynamics of Q-values for Individual Agents

Consider an arbitrary agent i in the population N. By Equation 2, we can derive the difference equation of its Q-values in terms of expected change. For any action $a_j$, at time t, the expected change of the corresponding Q-value is given as follows:

$E[Q^i_{t+1}(a_j) - Q^i_t(a_j)] = x^i_t(a_j)[Q^i_{t+1}(a_j) - Q^i_t(a_j)] + [1 - x^i_t(a_j)] \times 0 = \eta x^i_t(a_j)[E[r^i_t(a_j)] - Q^i_t(a_j)].$ (4)

On the right hand side of the first equality, the first term represents the change in the Q-value if action $a_j$ is used at time t, and the second term indicates that there should be no change in the Q-value if action $a_j$ is not used at time t. In the continuous time limit, this difference equation corresponds to the following differential equation:

$E[\frac{dQ^i_t(a_j)}{dt}] = \eta x^i_t(a_j)[E[r^i_t(a_j)] - Q^i_t(a_j)]$ (5)
$= \eta \frac{e^{\tau Q^i_t(a_j)}}{\sum_{a \in A} e^{\tau Q^i_t(a)}}[E[r^i_t(a_j)] - Q^i_t(a_j)].$ (6)

This differential equation governs the dynamics of the expected change in Q-values for individual agents. 
By this equation, at a certain time step t, how fast an agent increases or decreases its Q-value for a particular action is susceptible to the learning rate η, the Boltzmann exploration temperature τ, the current Q-values and the received payoff at this time step.

Remember that at each time step, an agent plays games with m other agents that are randomly selected from the population. Let us first focus on one particular round of the game G and assume the opponent in this round to be agent z. We denote the payoff matrix of the row player in game G by U. For agent i, the expected payoff of taking action $a_j$ against agent z using the policy $x^z_t$ is determined as:

$E[r^i_t(a_j, x^z_t)] = e_j^\top U x^z_t,$ (7)

where $e_j$ is the unit vector, in which the jth component equals 1 and the other components equal 0. Let $\bar{x}_t = [\bar{x}_t(a_1) \ldots \bar{x}_t(a_k)]^\top$ be the mean policy of the population N at time t, such that $\bar{x}_t = \frac{1}{n}\sum_{i \in N} x^i_t$, where n is the total number of agents. The policy $x^z_t$ of agent z can be represented by a deviation $\Delta x^z_t$ from the mean policy $\bar{x}_t$, such that $x^z_t = \bar{x}_t + \Delta x^z_t$. With the first-order Taylor series expansion, the expected payoff $E[r^i_t(a_j, x^z_t)]$ is approximated as $E[r^i_t(a_j, \bar{x}_t + \Delta x^z_t)] \approx E[r^i_t(a_j, \bar{x}_t)] + E[\nabla r^i_t(a_j, \bar{x}_t)^\top \Delta x^z_t]$. Let $M^i_t \subset N$ be the set of m opponents that agent i plays games with at time t. For agent i, its expected received payoff of taking action $a_j$ at time t, that is, $E[r^i_t(a_j)]$, which is averaged over the m rounds of games it plays with m opponents, is approximated as:

$E[r^i_t(a_j)] = \frac{1}{m}\sum_{z \in M^i_t} E[r^i_t(a_j, x^z_t)] \approx \frac{1}{m}\sum_{z \in M^i_t} \{E[r^i_t(a_j, \bar{x}_t)] + E[\nabla r^i_t(a_j, \bar{x}_t)^\top \Delta x^z_t]\} = E[r^i_t(a_j, \bar{x}_t)] + \nabla r^i_t(a_j, \bar{x}_t)^\top E[\frac{1}{m}\sum_{z \in M^i_t} \Delta x^z_t] \approx E[r^i_t(a_j, \bar{x}_t)].$ (8)

As the value of m increases, the term $E[\frac{1}{m}\sum_{z \in M^i_t} \Delta x^z_t]$ will become closer to 0, and hence the approximation will become more accurate.² By this approximation, for an individual agent, its received payoff of playing with its opponents is approximately the payoff of playing against the mean policy $\bar{x}_t$ averaged over all of the agents in the population. That is to say, although different agents actually interact with different opponents, intuitively, one can conceive that different agents face one same fictitious agent that uses the mean policy.

Substituting the term $E[r^i_t(a_j)]$ with the approximation shown in Equation 8, Equation 5, the equation that fundamentally governs the dynamics of the expected change in Q-values for individual agents, is rewritten as follows:

$E[\frac{dQ^i_t(a_j)}{dt}] = \eta \frac{e^{\tau Q^i_t(a_j)}}{\sum_{a \in A} e^{\tau Q^i_t(a)}}[E[r^i_t(a_j, \bar{x}_t)] - Q^i_t(a_j)].$ (9)

On the right hand side, the learning rate η and Boltzmann exploration temperature τ are a priori given and the same for the entire agent population. Moreover, for symmetric bimatrix games, the expected payoff of using action $a_j$ against the mean policy $\bar{x}_t$ is independent of the roles of individual agents. Therefore, at time t, for any agent i, how fast it changes its Q-values should be attributed to its current Q-values $Q^i_t$ and the mean policy $\bar{x}_t$ of the whole population. 
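Concretely, the mean-field approximation of Equations 7-8 reduces an agent's expected payoff to a dot product with the population's mean policy. A minimal sketch (function names are ours, not the paper's):

```python
def mean_policy(policies):
    """Mean policy of the population: the componentwise average of the
    agents' mixed strategies (each policy is a probability vector)."""
    n, k = len(policies), len(policies[0])
    return [sum(p[a] for p in policies) / n for a in range(k)]

def mean_field_payoff(U, j, x_bar):
    """Expected payoff of action a_j against the mean policy: e_j^T U x_bar.
    U is the row player's payoff matrix as a nested list."""
    return sum(U[j][a] * x_bar[a] for a in range(len(x_bar)))
```

For example, in the prisoner's dilemma of Section 4 (row payoffs [[3, 0], [5, 1]]), an agent facing a half-cooperating population expects 1.5 for cooperating and 3.0 for defecting.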
Dropping the agent index, for any individual agent in the population, Equation 9 can be expressed as a function $v_j$ of its current Q-values and the mean policy:

$v_j(Q_t, \bar{x}_t) \triangleq E[\frac{dQ_t(a_j)}{dt}] = \eta \frac{e^{\tau Q_t(a_j)}}{\sum_{a \in A} e^{\tau Q_t(a)}}[E[r_t(a_j, \bar{x}_t)] - Q_t(a_j)].$ (10)

Note that the mean policy $\bar{x}_t$ is indeed given by the Q-values of all agents in the population, i.e., $\forall a_j \in A$, $\bar{x}_t(a_j) = \frac{1}{n}\sum_{i \in N} x^i_t(a_j) = \frac{1}{n}\sum_{i \in N} e^{\tau Q^i_t(a_j)} / \sum_{a \in A} e^{\tau Q^i_t(a)}$. Therefore, the expected change in Q-values for any individual agent is determined by the joint Q-values of all the agents, which include the Q-values of itself. This suggests that in the long term, the trajectories of Q-values for individual agents are uniquely determined by their joint initial Q-values.

3.2 Evolution of the Distribution of Q-values in a Population

Consider a Q-value space $R^k$ with k axes $Y_1, \ldots, Y_k$, where k is the number of available actions. At time t, each agent i occupies a point $Q_t = Q^i_t$ in this space according to its current Q-values $Q^i_t$.

² The series in Equation 8 should be convergent, since the function $u(a_j, x^z_t)$ is an analytic function. Given that each element of the vector $\Delta x^z_t$ is between 0 and 1, we consider the second order and the higher order terms negligible. When $m, n \to \infty$, Equation 8 holds.

Figure 1: A 3-dimensional illustration of the entry and departure of individual agents through facets causing the change in the number of agents in the box B.

Let $p(Q_t, t)$ be the function of agent density in the space at time t, such that the density $p(Q_t, t)$ at any point $Q_t$ is the proportion of agents in the population having their Q-values equal to $Q_t$ and hence occupying the point $Q_t$ in the space at time t. Intuitively, $p(Q_t, t)$ can also be considered as the probability distribution of Q-values in the agent population. 
Note that agents will update their Q-values during interactions. As a result, as time t moves forward, agents will change their positions in the space, which will lead to the change in the density function $p(Q_t, t)$. In what follows, we shall derive the differential equation that describes the time evolution of $p(Q_t, t)$.

Let us focus on an arbitrary point $Q_t$ in this space, and an infinitesimal box (or hyperrectangle) B around this point, such that $B \triangleq \{q_t : Q_t(a_j) \le q_t(a_j) \le Q_t(a_j) + dQ_t(a_j), \forall a_j \in A\}$. Basically, the number of agents in this box at time t is $n p(Q_t, t) dV$, where $dV = \prod_{a_j \in A} dQ_t(a_j)$ is the volume of the box. Given that there is no birth or death of individual agents over time, there is only one cause for the change in the density $p(Q_t, t)$ of agents in that box: some agents enter or leave the box through its surface. Note that there are 2k facets for a k-dimensional box. Let $F(Q_t(a_j))$ denote a facet of this box, in which the jth component of each vector in this facet is set to $Q_t(a_j)$, such that the $Y_j$-axis is the normal of this facet. That is, $F(Q_t(a_j)) \triangleq \{q_t : q_t(a_j) = Q_t(a_j), Q_t(a_i) \le q_t(a_i) \le Q_t(a_i) + dQ_t(a_i), \forall i \in \{1, \ldots, k\} \setminus \{j\}\}$. We define $\psi^+(Q_t(a_j), t)$ and $\psi^-(Q_t(a_j), t)$, respectively, to be the number of agents that travel through the facet $F(Q_t(a_j))$ in the positive and negative direction of the $Y_j$-axis from time t to t + dt. A graphical demonstration with the number of available actions k = 3 is shown in Figure 1. 
By the conservation law of the number of agents in the population, we shall have:

$n p(Q_t, t + dt) dV - n p(Q_t, t) dV = \sum_{j=1}^{k} \psi^+(Q_t(a_j), t) + \psi^-(Q_t(a_j) + dQ_t(a_j), t) - \psi^-(Q_t(a_j), t) - \psi^+(Q_t(a_j) + dQ_t(a_j), t).$ (11)

This equation expresses that the number of agents entering (or leaving) the box should be the sum of the number of agents entering (or leaving) through every facet. The first and the second term on the left hand side represent the numbers of agents in this box at time t + dt and at time t, respectively. Thus, the left hand side corresponds to the change in the number of agents in the box from time t to t + dt. On the right hand side, since agents that travel through the facet $F(Q_t(a_j) + dQ_t(a_j))$ in the negative direction of the $Y_j$-axis will in effect enter the box B (as shown in Figure 1), the first two terms are the number of agents entering the box B along the $Y_j$-axis. Symmetrically, the last two terms are the number of agents leaving that box along the $Y_j$-axis. Hence, the right hand side corresponds to the sum of the net number of agents entering the box along every axis. Let $\psi(Q_t(a_j), t) \triangleq \psi^+(Q_t(a_j), t) - \psi^-(Q_t(a_j), t)$, which denotes the flow of agents travelling through the facet $F(Q_t(a_j))$. Equation 11 can be rewritten as:

$n p(Q_t, t + dt) dV - n p(Q_t, t) dV = \sum_{j=1}^{k} \psi(Q_t(a_j), t) - \psi(Q_t(a_j) + dQ_t(a_j), t).$ (12)

We now derive the form of $\psi(Q_t(a_j), t)$. Remember that how fast an agent increases or decreases its Q-value, i.e., the velocity of this agent in the Q-value space, is given by the function $v_j(Q_t, \bar{x}_t)$ shown in Equation 10. From time t to t + dt, the displacement that an agent around the point $Q_t$ travels should be approximately $v_j(Q_t, \bar{x}_t) dt$. 
That is, roughly speaking, agents that travel through the facet $F(Q_t(a_j))$ along the $Y_j$-axis from time t to t + dt should be located in the adjacent box $B' \triangleq \{q_t : Q_t(a_j) - v_j(Q_t, \bar{x}_t) dt \le q_t(a_j) \le Q_t(a_j), Q_t(a_i) \le q_t(a_i) \le Q_t(a_i) + dQ_t(a_i), \forall i \in \{1, \ldots, k\} \setminus \{j\}\}$. Therefore, the value of $\psi(Q_t(a_j), t)$ should be:

$\psi(Q_t(a_j), t) = n p(Q_t, t) v_j(Q_t, \bar{x}_t) dt dS_j,$ (13)

where $dS_j = \prod_{a_i \in A \setminus \{a_j\}} dQ_t(a_i)$ is the area of the facet $F(Q_t(a_j))$, so that $v_j(Q_t, \bar{x}_t) dt dS_j$ is the volume of the box B'. Substituting $\psi(Q_t(a_j), t)$ in Equation 12 with Equation 13, and dividing both sides by $dV dt$, we have:

$\frac{n p(Q_t, t + dt) - n p(Q_t, t)}{dt} = \sum_{j=1}^{k} \frac{1}{dQ_t(a_j)}[n p(Q_t, t) v_j(Q_t, \bar{x}_t) - n p(Q_t + dQ_t, t) v_j(Q_t + dQ_t, \bar{x}_t)].$ (14)

This equation in the continuous limit corresponds to:

$\frac{\partial p(Q_t, t)}{\partial t} = -\sum_{j=1}^{k} \frac{\partial}{\partial Q_t(a_j)}[p(Q_t, t) v_j(Q_t, \bar{x}_t)] = -\nabla \cdot (p(Q_t, t) v(Q_t, \bar{x}_t)),$ (15)

where $\nabla \cdot$ is the divergence operator, and $v(Q_t, \bar{x}_t)$ is a vector field (or the flux) in which the jth component is $v_j(Q_t, \bar{x}_t)$. This equation is the Fokker-Planck equation [6, 21] with zero diffusion. By this equation, the change in the density of agents occupying a certain point $Q_t$ in the space, which is also the probability density of agents having certain Q-values $Q_t$ in the population, is jointly determined by the current density $p(Q_t, t)$ and the velocity $v(Q_t, \bar{x}_t)$. Note that the velocity in Q-values depends on the mean policy $\bar{x}_t$. 
By the law of large numbers, each component $\bar{x}_t(a_j), \forall a_j \in A$ of the mean policy $\bar{x}_t$ should be close to its expectation, which is given by:

$\bar{x}_t(a_j) = \int \cdots \int \frac{e^{\tau Q_t(a_j)}}{\sum_{a \in A} e^{\tau Q_t(a)}} p(Q_t, t) dQ_t(a_1) \ldots dQ_t(a_k).$ (16)

Therefore, the Q-learning dynamics of an infinitely large agent population can be modelled by the following system of three equations:

$\frac{\partial p(Q_t, t)}{\partial t} = -\sum_{j=1}^{k} \frac{\partial}{\partial Q_t(a_j)}[p(Q_t, t) v_j(Q_t, \bar{x}_t)],$
$v_j(Q_t, \bar{x}_t) = \eta \frac{e^{\tau Q_t(a_j)}}{\sum_{a \in A} e^{\tau Q_t(a)}}[E[r_t(a_j, \bar{x}_t)] - Q_t(a_j)],$
$\bar{x}_t(a_j) = \int \cdots \int \frac{e^{\tau Q_t(a_j)}}{\sum_{a \in A} e^{\tau Q_t(a)}} p(Q_t, t) dQ_t(a_1) \ldots dQ_t(a_k).$ (17)

This system of equations by nature involves a backward-forward structure. For an individual agent, at a certain time instant, it reasons backward and updates its Q-values towards a better estimation of the best response action facing the current expected policy. Collectively, the current updates of Q-values for individual agents may result in a future Q-value distribution that is different from the current one. This will in turn cause a change in the expected policy, which will make agents' current best responses to the expected policy invalid in the future. Therefore, under the n-agent setting we consider, the Q-learning agents are usually myopic.

4 Experimental Validation

In this section, we compare the behaviours obtained by our mean field theoretic model with the behaviours obtained from agent-based simulations. For the model, we employ finite difference methods to solve the system of equations shown in Equation 17. 
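A finite-difference treatment of the three-equation system for a 2-action game can be sketched as follows. This is a minimal sketch only: the uniform grid, first-order upwind differencing, periodic boundary handling via `np.roll`, and the illustrative initial density are our assumptions, not the authors' exact numerical setup.

```python
import numpy as np

def upwind_div(p, v, axis):
    """First-order upwind (donor-cell) estimate of the flux difference for
    d(p*v)/dQ_j along `axis`; np.roll makes boundaries periodic (a simplification)."""
    f = p * v
    f_plus = np.where(v > 0, f, np.roll(f, -1, axis=axis))  # flux at right face
    f_minus = np.roll(f_plus, 1, axis=axis)                 # flux at left face
    return f_plus - f_minus

def solve_mean_field(U, eta=0.1, tau=2.0, q_min=0.0, q_max=5.0,
                     n_grid=60, dt=0.01, steps=1000):
    """Integrate the system of Equation 17 on a 2-D Q-value grid."""
    q = np.linspace(q_min, q_max, n_grid)
    dq = q[1] - q[0]
    Q1, Q2 = np.meshgrid(q, q, indexing="ij")
    e1, e2 = np.exp(tau * Q1), np.exp(tau * Q2)
    x1 = e1 / (e1 + e2)                  # Boltzmann P(action 1) at each grid point
    # illustrative initial density: a peak near Q = (1.0, 0.5)
    p = np.exp(-((Q1 - 1.0) ** 2 + (Q2 - 0.5) ** 2) / 0.1)
    p /= p.sum() * dq * dq
    U = np.asarray(U, dtype=float)
    for _ in range(steps):
        x_bar1 = (x1 * p).sum() * dq * dq            # mean policy (Equation 16)
        r = U @ np.array([x_bar1, 1.0 - x_bar1])     # expected payoff per action
        v1 = eta * x1 * (r[0] - Q1)                  # velocity field (Equation 10)
        v2 = eta * (1.0 - x1) * (r[1] - Q2)
        # Fokker-Planck update with zero diffusion (Equation 15)
        p = p - dt / dq * (upwind_div(p, v1, axis=0) + upwind_div(p, v2, axis=1))
        p = np.clip(p, 0.0, None)
        p /= p.sum() * dq * dq                       # keep p a probability density
    return p, q
```

The donor-cell fluxes cancel telescopically, so total mass is conserved up to the clipping step; the explicit renormalization keeps p a valid density regardless of grid resolution.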
For the agent-based simulations, we set the number n of agents to 1,000, and consider two cases of the number m of opponents per time step: m = 0.05n and m = n - 1. To smooth out the randomness, we run 100 simulations for each setting. For comparison, the learning rate η is set to 0.1 and the exploration temperature τ is set to 2 in both the model and the simulations.

Figure 2: Evolution of the expected Q-values derived from our model and that of the mean Q-values in agent-based simulations.

Figure 3: Evolution of the expected policy derived from our model and that of the mean policy in agent-based simulations.

To validate whether our model can well reflect the diverse population dynamics caused by different game settings, we select four typical types of symmetric bimatrix games to experiment on, namely, the prisoner's dilemma (PD), choosing side (CS), stag hunt (SH) and hawk dove (HD) games. The payoff bimatrices of these games are shown in Table 2. In the PD game, the dominant strategy is for both players to play D, and hence (D, D) is the unique Nash equilibrium. In the CS game, there are two equally good symmetric Nash equilibria (L, L) and (R, R). In the SH game, there are also two symmetric Nash equilibria, i.e., (S, S) and (H, H). However, while (S, S) Pareto dominates (H, H) and maximizes the social welfare, (H, H) risk dominates (S, S). 
In the HD game, the two Nash equilibria (D, H) and (H, D) are asymmetric, such that it is unfair for the player taking H in these two equilibria.

(a) Prisoner's Dilemma (C: cooperate, D: defect): (C, C) = 3, 3; (C, D) = 0, 5; (D, C) = 5, 0; (D, D) = 1, 1
(b) Choosing Side (L: left, R: right): (L, L) = 1, 1; (L, R) = -1, -1; (R, L) = -1, -1; (R, R) = 1, 1
(c) Stag Hunt (S: stag, H: hare): (H, H) = 1, 1; (H, S) = 2, 0; (S, H) = 0, 2; (S, S) = 4, 4
(d) Hawk Dove (D: dove, H: hawk): (D, D) = 1, 1; (D, H) = 0, 2; (H, D) = 2, 0; (H, H) = -1, -1

Table 2: The typical symmetric bimatrix games that we experiment on.

Without loss of generality, for each game, we assume that the initial Q-values of the first action and the second action follow Beta distributions Beta(20, 80, r_min, r_max) and Beta(10, 90, r_min, r_max), respectively. The first two parameters control the shape of the probability density function, and the latter two parameters prescribe the support to be [r_min, r_max], where r_min is the minimum payoff of the game and r_max is the maximum payoff. Consequently, for every game, the initial expected Q-value of the first action is slightly higher than that of the second action.

In Figures 2 and 3, we compare the expected Q-values $E[Q_t]$ and the expected policy $E[x_t]$ obtained by our model with the counterparts $\bar{Q}_t$ and $\bar{x}_t$ that are averaged over all of the agents in the agent-based simulations. It is clear that our model well captures the qualitatively different patterns of evolution in agent populations playing different kinds of games. In particular, as shown in Figure 2, the dynamics of the expected Q-values generally overlap the dynamics of the mean Q-values, which suggests that our model almost precisely describes how the Q-value distribution of the population evolves over time. Moreover, we note that in the agent-based simulations, the agent behaviours with m = 0.05n match those with m = n - 1. 
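Sampling the four-parameter Beta initialization used in the experiments is straightforward; a minimal sketch (the function name is ours) using the standard rescaling r_min + (r_max - r_min) * Beta(a, b):

```python
import random

def sample_initial_q(a, b, r_min, r_max, n=1000, seed=0):
    """Draw n initial Q-values from a Beta(a, b) distribution rescaled to the
    payoff range [r_min, r_max]."""
    rng = random.Random(seed)
    return [r_min + (r_max - r_min) * rng.betavariate(a, b) for _ in range(n)]
```

For instance, with the prisoner's dilemma payoff range [0, 5], Beta(20, 80, 0, 5) has mean 0 + 5 * 20/(20+80) = 1.0, so the first action starts with a slightly higher expected Q-value than the second action's Beta(10, 90, 0, 5), whose mean is 0.5.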
This implies that, although strictly speaking Equation 8 holds only as m, n → ∞, our mean field theoretic model should be practically valid whenever the values of m and n are sufficiently large.

Figure 4: The probability density functions of the different initial Q-value distributions that we experiment on. Yellow (light) color indicates the first action H and purple (dark) color indicates the second action S.

Figure 5: Evolution of the expected Q-values derived from our model and that of the mean Q-values in agent-based simulations.

Figure 6: Evolution of the expected policy derived from our model and that of the mean policy in agent-based simulations.

We proceed to change the initial Q-value distribution for stag hunt games. Given that the equilibrium (S, S) is Pareto dominant but the other equilibrium (H, H) is risk dominant, the population dynamics should be highly susceptible to the initial proportion of agents using each action. As shown in Figure 4, we consider three different cases of the initial Q-value distribution: 1) Q0(a1) ∼ Beta(80, 20, 0, 3) and Q0(a2) ∼ Beta(90, 10, 0, 3); 2) Q0(a1) ∼ Beta(80, 20, 0, 3) and Q0(a2) ∼ Beta(20, 80, 0, 3); and 3) Q0(a1) ∼ Beta(50, 50, 0, 3) and Q0(a2) ∼ Beta(5, 5, 0, 3). In Figures 5 and 6, we compare the expected Q-values and policy obtained by our model with the mean Q-values and policy in agent-based simulations. We can easily observe that the different settings of the initial Q-value distribution result in diverse patterns of evolution in the agent populations. Under each setting, the dynamics obtained by our model match those in the agent-based simulations, which again validates that our model describes the population dynamics well under different settings.

5 Conclusions and Future Work

In this paper, we model the dynamics of Q-learning in symmetric bimatrix games under an n-agent setting where n → ∞.
Using mean field theory, we derive an equation that universally describes the dynamics of Q-values for any individual agent. We also derive a Fokker-Planck equation that describes the evolution of the distribution of Q-values in the agent population. We show that the Q-learning dynamics under the n-agent setting can be described by a system of only three equations. The experiments on typical types of symmetric bimatrix games and different initial settings of Q-values validate that the expected agent behaviours obtained by our model match well their counterparts in agent-based simulations. As future work, we will extend our model to multiple-state games, asymmetric games, and multiple populations. Other learning algorithms will also be investigated.

References

[1] Adrian K Agogino and Kagan Tumer. A multiagent approach to managing air traffic flow. Autonomous Agents and Multi-Agent Systems, 24(1):1–25, 2012.

[2] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53:659–697, 2015.

[3] Lucian Busoniu, Robert Babuska, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.

[4] Shih-Fen Cheng, Daniel M Reeves, Yevgeniy Vorobeychik, and Michael P Wellman. Notes on equilibria in symmetric games. 2004.

[5] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of AAAI, 1998.

[6] Adriaan Daniël Fokker. Die mittlere Energie rotierender elektrischer Dipole im Strahlungsfeld. Annalen der Physik, 348(5):810–820, 1914.

[7] Junling Hu, Michael P Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242–250.
Citeseer, 1998.

[8] Minyi Huang, Roland P Malhamé, Peter E Caines, et al. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.

[9] Michael Kaisers, Daan Bloembergen, and Karl Tuyls. A common gradient in multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1393–1394. International Foundation for Autonomous Agents and Multiagent Systems, 2012.

[10] Michael Kaisers and Karl Tuyls. Frequency adjusted multi-agent Q-learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 309–316. International Foundation for Autonomous Agents and Multiagent Systems, 2010.

[11] Ardeshir Kianercy and Aram Galstyan. Dynamics of Boltzmann Q-learning in two-player two-action games. Physical Review E, 85(4):041145, 2012.

[12] Tomas Klos, Gerrit Jan Van Ahee, and Karl Tuyls. Evolutionary dynamics of regret minimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 82–96. Springer, 2010.

[13] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.

[14] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.

[15] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473.
International Foundation for Autonomous Agents and Multiagent Systems, 2017.

[16] Victor Lesser, Charles L Ortiz Jr, and Milind Tambe. Distributed sensor networks: A multiagent perspective, volume 9. Springer Science & Business Media, 2012.

[17] David Mguni, Joel Jennings, and Enrique Munoz de Cote. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[18] Liviu Panait, Karl Tuyls, and Sean Luke. Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. Journal of Machine Learning Research, 9(Mar):423–457, 2008.

[19] Lynne E Parker. Current state of the art in distributed autonomous mobile robotics. In Distributed Autonomous Robotic Systems 4, pages 3–12. Springer, 2000.

[20] M. Pipattanasomporn, H. Feroze, and S. Rahman. Multi-agent systems in a distributed smart grid: Design and implementation. In 2009 IEEE/PES Power Systems Conference and Exposition, pages 1–8, 2009.

[21] Max Planck. Über einen Satz der statistischen Dynamik und seine Erweiterung in der Quantentheorie. Sitzungsberichte der, 1917.

[22] Eduardo Rodrigues Gomes and Ryszard Kowalczyk. Dynamic analysis of multiagent Q-learning with ε-greedy exploration. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 369–376. ACM, 2009.

[23] Sandip Sen and Stéphane Airiau. Emergence of norms through social learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1507–1512, 2007.

[24] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365–377, 2007.

[25] Jayakumar Subramanian and Aditya Mahajan. Reinforcement learning in stationary mean-field games.
In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 251–259. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

[26] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[27] Karl Tuyls, Pieter Jan 't Hoen, and Bram Vanschoenwinkel. An evolutionary dynamical analysis of multi-agent learning in iterated games. Autonomous Agents and Multi-Agent Systems, 12(1):115–153, 2006.

[28] Karl Tuyls and Simon Parsons. What evolutionary game theory tells us about multiagent learning. Artificial Intelligence, 171(7):406–416, 2007.

[29] Karl Tuyls, Katja Verbeeck, and Tom Lenaerts. A selection-mutation model for Q-learning in multi-agent systems. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, pages 693–700. ACM, 2003.

[30] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4), 1992.

[31] Pierre Weiss. L'hypothèse du champ moléculaire et la propriété ferromagnétique. J. Phys. Theor. Appl., 6(1):661–690, 1907.

[32] Michael Wunder, Michael L Littman, and Monica Babes. Classes of multiagent Q-learning dynamics with ε-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1167–1174. Citeseer, 2010.

[33] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning.
In International Conference on Machine Learning, pages 5567–5576, 2018.