{"title": "Actor-Critic Policy Optimization in Partially Observable Multiagent Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 3422, "page_last": 3435, "abstract": "Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.", "full_text": "Actor-Critic Policy Optimization in Partially\n\nObservable Multiagent Environments\n\nSriram Srinivasan\u2217,1\n\nsrsrinivasan@\n\nMarc Lanctot\u2217,1\n\nlanctot@\n\nVinicius Zambaldi1\n\nvzambaldi@\n\nJulien P\u00e9rolat1\n\nperolat@\n\nKarl Tuyls1\nkarltuyls@\n\nR\u00e9mi Munos1\n\nmunos@\n\nMichael Bowling1\n\nbowlingm@\n\n...@google.com. 1DeepMind. \u2217These authors contributed equally.\n\nAbstract\n\nOptimization of parameterized policies for reinforcement learning (RL) is an impor-\ntant and challenging problem in arti\ufb01cial intelligence. 
Among the most common\napproaches are algorithms based on gradient ascent of a score function representing\ndiscounted return. In this paper, we examine the role of these policy gradient and\nactor-critic algorithms in partially-observable multiagent environments. We show\nseveral candidate policy update rules and relate them to a foundation of regret\nminimization and multiagent learning techniques for the one-shot and tabular cases,\nleading to previously unknown convergence guarantees. We apply our method to\nmodel-free multiagent reinforcement learning in adversarial sequential decision\nproblems (zero-sum imperfect information games), using RL-style function ap-\nproximation. We evaluate on commonly used benchmark Poker domains, showing\nperformance against \ufb01xed policies and empirical convergence to approximate Nash\nequilibria in self-play with rates similar to or better than a baseline model-free\nalgorithm for zero-sum games, without any domain-speci\ufb01c state space reductions.\n\n1\n\nIntroduction\n\nThere has been much success in learning parameterized policies for sequential decision-making\nproblems. One paradigm driving progress is deep reinforcement learning (Deep RL), which uses\ndeep learning [52] to train function approximators that represent policies, reward estimates, or both,\nto learn directly from experience and rewards [85]. These techniques have learned to play Atari\ngames beyond human-level [60], Go, chess, and shogi from scratch [82, 81], complex behaviors in\n3D environments [59, 97, 37], robotics [27, 73], character animation [67], among others.\nWhen multiple agents learn simultaneously, policy optimization becomes more complex. First, each\nagent\u2019s environment is non-stationary and naive approaches can be non-Markovian [58], violating the\nrequirements of many traditional RL algorithms. 
Second, the optimization problem is not as clearly defined as maximizing one\u2019s own expected reward, because each agent\u2019s policy affects the others\u2019 optimization problems. Consequently, game-theoretic formalisms are often used as the basis for representing interactions and decision-making in multiagent systems [17, 79, 64].\nComputer poker is a common multiagent benchmark domain. The presence of partial observability poses a challenge for traditional RL techniques that exploit the Markov property. Nonetheless, there has been steady progress in poker AI. Near-optimal solutions for heads-up limit Texas Hold\u2019em were found with tabular methods using state aggregation, powered by policy iteration algorithms based on regret minimization [102, 87, 12]. These approaches were founded on counterfactual regret minimization (CFR), which is the root of recent advances in no-limit Texas Hold\u2019em, such as Libratus [16] and DeepStack [61]. However, (i) both required Poker-specific domain knowledge, and (ii) neither was model-free, and hence neither is able to learn directly from experience without look-ahead search using a perfect model of the environment.\nIn this paper, we study the problem of multiagent reinforcement learning in adversarial games with partial observability, with a focus on the model-free case where agents (a) do not have a perfect description of their environment (and hence cannot do a priori planning), and (b) learn purely from their own experience without explicitly modeling the environment or other players. We show that actor-critics reduce to a form of regret minimization and propose several policy update rules inspired by this connection. We then analyze the convergence properties and present experimental results.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Background and Related Work\n\nWe briefly describe the necessary background. 
While we draw on game-theoretic formalisms, we align our terminology with RL. We include clarifications in Appendix A.\u00b9 For details, see [79, 85].\n\n2.1 Reinforcement Learning and Policy Gradient Algorithms\n\nAn agent acts by taking actions $a \\in \\mathcal{A}$ in states $s \\in \\mathcal{S}$ from their policy $\\pi : s \\to \\Delta(\\mathcal{A})$, where $\\Delta(X)$ is the set of probability distributions over $X$, which results in changing the state of the environment $s_{t+1} \\sim \\mathcal{T}(s_t, a_t)$; the agent then receives an observation $o(s_t, a_t, s_{t+1}) \\in \\Omega$ and reward $R_t$.\u00b2 A sum of rewards is a return $G_t = \\sum_{t'=t}^{\\infty} R_{t'}$, and the agent aims to find $\\pi^*$ that maximizes expected return $\\mathbb{E}_\\pi[G_0]$.\u00b3\nValue-based solution methods achieve this by computing estimates of $v_\\pi(s) = \\mathbb{E}_\\pi[G_t \\mid S_t = s]$, or $q_\\pi(s, a) = \\mathbb{E}_\\pi[G_t \\mid S_t = s, A_t = a]$, using temporal difference learning to bootstrap from other estimates, and produce a series of $\\epsilon$-greedy policies $\\pi(s, a) = \\epsilon/|\\mathcal{A}| + (1 - \\epsilon)\\,\\mathbb{I}(a = \\operatorname{argmax}_{a'} q_\\pi(s, a'))$. In contrast, policy gradient methods define a score function $J(\\pi_\\theta)$ of some parameterized (and differentiable) policy $\\pi_\\theta$ with parameters $\\theta$, and use gradient ascent directly on $J(\\pi_\\theta)$ to update $\\theta$.\nThere have been several recent successful applications of policy gradient algorithms in complex domains such as self-play learning in AlphaGo [80], Atari and 3D maze navigation [59], continuous control problems [76, 54, 21], robotics [27], and autonomous driving [78]. At the core of several recent state-of-the-art Deep RL algorithms [37, 22] is the advantage actor-critic (A2C) algorithm defined in [59]. In addition to learning a policy (actor), A2C learns a parameterized critic: an estimate of $v_\\pi(s)$, which it then uses both to estimate the remaining return after $k$ steps, and as a control variate (i.e. 
baseline) that reduces the variance of the return estimates.\n\n2.2 Game Theory, Regret Minimization, and Multiagent Reinforcement Learning\n\nIn multiagent RL (MARL), $n = |\\mathcal{N}| = |\\{1, 2, \\cdots, n\\}|$ agents interact within the same environment. At each step, each agent $i$ takes an action, and the joint action $a$ leads to a new state $s_{t+1} \\sim \\mathcal{T}(s_t, a_t)$; each player $i$ receives their own separate observation $o_i(s_t, a, s_{t+1})$ and reward $r_{t,i}$. Each agent maximizes their own return $G_{t,i}$, or their expected return which depends on the joint policy $\\pi$.\nMuch work in classical MARL focuses on Markov games where the environment is fully observable and agents take actions simultaneously, which in some cases admit Bellman operators [55, 103, 70, 69]. When the environment is partially observable, policies generally map to values and actions from agents\u2019 observation histories; even when the problem is cooperative, learning is hard [65].\nWe focus our attention on the setting of zero-sum games, where $\\sum_{i \\in \\mathcal{N}} r_{t,i} = 0$. In this case, polynomial-time algorithms exist for finding optimal policies in finite tasks for the two-player case. The guarantees that Nash equilibrium provides are less clear for the $(n > 2)$-player case, and finding one is hard [20]. Despite this, regret minimization approaches are known to filter out dominated actions, and have empirically found good (e.g. competition-winning) strategies in this setting [74, 26, 48].\n\n\u00b9 Appendices are included in the technical report version of the paper; see [84].\n\u00b2 Note that in fully-observable settings, $o(s_t, a_t, s_{t+1}) = s_{t+1}$. In partially observable environments [39, 65], an observation function $\\mathcal{O} : \\mathcal{S} \\times \\mathcal{A} \\to \\Delta(\\Omega)$ is used to sample $o(s_t, a_t, s_{t+1}) \\sim \\mathcal{O}(s_t, a_t)$.\n\u00b3 We assume finite episodic tasks of bounded length and leave out the discount factor $\\gamma$ to simplify the notation, without loss of generality. 
We use $\\gamma (= 0.99)$-discounted returns in our experiments.\n\nPartially observable environments require a few key definitions in order to define the notion of state. A history $h \\in \\mathcal{H}$ is a sequence of actions from all players including the environment taken from the start of an episode. The environment (also called \u201cnature\u201d) is treated as a player with a fixed policy and there is a deterministic mapping from any $h$ to the actual state of the environment. Define an information state, $s_t = \\{h \\in \\mathcal{H} \\mid \\text{player } i\\text{\u2019s sequence of observations, } o_{i,t'<t}(s_{t'}, a_{t'}, s_{t'+1})\\text{, is consistent with } h\\}$.\u2074 So, $s_t$ includes histories leading to $s_t$ that are indistinguishable to player $i$; e.g. in Poker, the $h \\in s_t$ differ only in the private cards dealt to opponents. A joint policy $\\pi$ is a Nash equilibrium if the incentive to deviate to a best response $\\delta_i(\\pi) = \\max_{\\pi'_i} \\mathbb{E}_{\\pi'_i, \\pi_{-i}}[G_{0,i}] - \\mathbb{E}_\\pi[G_{0,i}] = 0$ for each player $i \\in \\mathcal{N}$, where $\\pi_{-i}$ is the set of $i$\u2019s opponents\u2019 policies. Otherwise, $\\epsilon$-equilibria are approximate, with $\\epsilon = \\max_i \\delta_i(\\pi)$. Regret minimization algorithms produce iterates whose average policy $\\bar{\\pi}$ reduces an upper bound on $\\epsilon$, measured using $\\text{NashConv}(\\pi) = \\sum_i \\delta_i(\\pi)$. Nash equilibrium is minimax-optimal in two-player zero-sum games: using one minimizes worst-case losses.\nThere are well-known links between learning, game theory and regret minimization [9]. One method, counterfactual regret (CFR) minimization [102], has led to significant progress in Poker AI. Let $\\eta^\\pi(h_t) = \\prod_{t'<t} \\pi(s_{t'}, a_{t'})$, where $h_{t'} \\sqsubset h_t$ is a prefix, $h_{t'} \\in s_{t'}$, $h_t \\in s_t$, be the reach probability of $h$ under $\\pi$ from all policies\u2019 action choices. This can be split into player $i$\u2019s contribution and their opponents\u2019 (including nature\u2019s) contribution, $\\eta^\\pi(h) = \\eta^\\pi_i(h)\\,\\eta^\\pi_{-i}(h)$. Suppose player $i$ is to play at $s$: under perfect recall, player $i$ remembers the sequence of their own states reached, which is the same for all $h \\in s$, since they differ only in private information seen by opponent(s); as a result $\\forall h, h' \\in s$, $\\eta^\\pi_i(h) = \\eta^\\pi_i(h') := \\eta^\\pi_i(s)$. For some history $h$ and action $a$, we call $h$ a prefix history $h \\sqsubset ha$, where $ha$ is the history $h$ followed by action $a$; they may also be smaller, so $h \\sqsubset ha \\sqsubset hab \\Rightarrow h \\sqsubset hab$. Let $\\mathcal{Z} = \\{z \\in \\mathcal{H} \\mid z \\text{ is terminal}\\}$ and $\\mathcal{Z}(s, a) = \\{(h, z) \\in \\mathcal{H} \\times \\mathcal{Z} \\mid h \\in s, ha \\sqsubseteq z\\}$. CFR defines counterfactual values $v^c_i(\\pi, s_t, a_t) = \\sum_{(h,z) \\in \\mathcal{Z}(s_t,a_t)} \\eta^\\pi_{-i}(h)\\,\\eta^\\pi_i(z)\\,u_i(z)$, and $v^c_i(\\pi, s_t) = \\sum_a \\pi(s_t, a)\\,v^c_i(\\pi, s_t, a)$, where $u_i(z)$ is the return to player $i$ along $z$, and accumulates regrets $\\text{REG}_i(\\pi, s_t, a') = v^c_i(\\pi, s_t, a') - v^c_i(\\pi, s_t)$, producing new policies from cumulative regret using e.g. regret-matching [28] or exponentially-weighted experts [6, 15].\nCFR is a policy iteration algorithm that computes the expected values by visiting every possible trajectory, described in detail in Appendix B. Monte Carlo CFR (MCCFR) samples trajectories using an exploratory behavior policy, computing unbiased estimates $\\hat{v}^c_i(\\pi, s_t)$ and $\\widehat{\\text{REG}}_i(\\pi, s_t)$ corrected by importance sampling [49]. Therefore, MCCFR is an off-policy Monte Carlo method. In one MCCFR variant, model-free outcome sampling (MFOS), the behavior policy at opponent states is defined as $\\pi_{-i}$, enabling online regret minimization (player $i$ can update their policy independent of $\\pi_{-i}$ and $\\mathcal{T}$).\nThere are two main problems with (MC)CFR methods: (i) significant variance is introduced by sampling (off-policy) since quantities are divided by reach probabilities, (ii) there is no generalization across states except through expert abstractions and/or forward simulation with a perfect model. We show that actor-critics address both problems and that they are a form of on-policy MCCFR.\n\n2.3 Most Closely Related Work\n\nThere is a rich history of policy gradient approaches in MARL. Early uses of gradient ascent showed that cyclical learning dynamics could arise, even in zero-sum matrix games [83]. This was partly addressed by methods that used variable learning rates [13, 11], policy prediction [99], and weighted updates [1]. The main limitation with these classical works was scalability: there was no direct way to use function approximation, and empirical analyses focused almost exclusively on one-shot games.\nRecent work on policy gradient approaches to MARL addresses scalability by using newer algorithms such as A3C or TRPO [76]. However, this work focuses significantly less (if at all) on convergence guarantees. Naive approaches such as independent reinforcement learning fail to find optimal stochastic policies [55, 32] and can overfit the training data [50]. Much progress has been achieved for cooperative MARL: learning to communicate [51], StarCraft unit micromanagement [24], taxi fleet optimization [63], and autonomous driving [78]. 
There has also been progress for mixed cooperative/competitive environments: using a centralized critic [57], learning to negotiate [18], anticipating/learning opponent responses in social dilemmas [23, 53], and control in realistic physical environments [3, 7]. The most common methodology has been to train centrally (for decentralized execution), either having direct access to the other players\u2019 policy parameters or modeling them. As a result, assumptions are made about the other agents\u2019 policies, utilities, or learning mechanisms.\n\n\u2074 In defining $s_t$, we drop the reference to acting player $i$ in turn-based games without loss of generality.\n\nThere are also methods that attempt to model the opponents [36, 30, 4]. Our methods do no such modeling, and can be classified in the \u201cforget\u201d category of the taxonomy proposed in [33]: that is, due to their on-policy nature, actors and critics adapt to and learn mainly from new/current experience.\nWe focus on the model-free (and online) setting: other agents\u2019 policies are inaccessible; training is not separated from execution. Actor-critics were recently studied in this setting for multiagent games [68], whereas we focus on partially-observable environments; only tabular methods are known to converge. Fictitious Self-Play computes approximate best responses via RL [31, 32], and can also be model-free. Regression CFR (RCFR) uses regression to estimate cumulative regrets from CFR [93]. RCFR is closely related to Advantage Regret Minimization (ARM) [38]. ARM [38] shows regret estimation methods handle partial observability better than standard RL, but was not evaluated in multiagent environments. 
In contrast, we focus primarily on the multiagent setting.\n\n3 Multiagent Actor-Critics: Advantages and Regrets\n\nCFR defines policy update rules from thresholded cumulative counterfactual regret: $\\text{TCREG}_i(K, s, a) = \\left(\\sum_{k \\in \\{1, \\cdots, K\\}} \\text{REG}_i(\\pi^k, s, a)\\right)^+$, where $k$ indexes the iterations and $(x)^+ = \\max(0, x)$. In CFR, regret matching updates a policy to be proportional to $\\text{TCREG}_i(K, s, a)$.\nOn the other hand, REINFORCE [95] samples trajectories and computes gradients for each state $s_t$, updating $\\theta$ toward $\\nabla_\\theta \\log \\pi(s_t, a_t; \\theta)\\,G_t$. A baseline is often subtracted from the return: $G_t - v_\\pi(s_t)$, and policy gradients then become actor-critics, training $\\pi$ and $v_\\pi$ separately. The log appears because action $a_t$ is sampled from the policy: the value is divided by $\\pi(s_t, a_t)$ to ensure the estimate is properly estimating the true expectation [85, Section 13.3], and $\\nabla_\\theta \\pi(s_t, a_t; \\theta)/\\pi(s_t, a_t; \\theta) = \\nabla_\\theta \\log \\pi(s_t, a_t; \\theta)$. One could instead train $q_\\pi$-based critics from states and actions. This leads to a q-based Policy Gradient (QPG) (also known as Mean Actor-Critic [5]):\n\n$$\\nabla^{\\text{QPG}}_\\theta(s) = \\sum_a [\\nabla_\\theta \\pi(s, a; \\theta)] \\left( q(s, a; w) - \\sum_b \\pi(s, b; \\theta)\\,q(s, b; w) \\right), \\qquad (1)$$\n\nan advantage actor-critic algorithm differing from A2C in the (state-action) representation of the critics [56, 96] and summing over actions similar to the all-action algorithms [86, 71, 19, 5]. Interpreting $a_\\pi(s, a) = q_\\pi(s, a) - \\sum_b \\pi(s, b)\\,q_\\pi(s, b)$ as a regret, we can instead minimize a loss defined by an upper bound on the thresholded cumulative regret: $\\sum_k (a_{\\pi^k}(s, a))^+ \\geq \\left(\\sum_k a_{\\pi^k}(s, a)\\right)^+$, moving the policy toward a no-regret region. We call this Regret Policy Gradient (RPG):\n\n$$\\nabla^{\\text{RPG}}_\\theta(s) = -\\sum_a \\nabla_\\theta \\left( q(s, a; w) - \\sum_b \\pi(s, b; \\theta)\\,q(s, b; w) \\right)^+. \\qquad (2)$$\n\nThe minus sign on the front represents a switch from gradient ascent on the score to descent on the loss. Another way to implement an adaptation of the regret-matching rule is by weighting the policy gradient by the thresholded regret, which we call Regret Matching Policy Gradient (RMPG):\n\n$$\\nabla^{\\text{RMPG}}_\\theta(s) = \\sum_a [\\nabla_\\theta \\pi(s, a; \\theta)] \\left( q(s, a; w) - \\sum_b \\pi(s, b; \\theta)\\,q(s, b; w) \\right)^+. \\qquad (3)$$\n\nIn each case, the critic $q(s_t, a_t; w)$ is trained in the standard way, using $\\ell_2$ regression loss from sampled returns. The pseudo-code is given in Algorithm 2 in Appendix C. In Appendix F, we show that the QPG gradient is proportional to the RPG gradient at $s$: $\\nabla^{\\text{RPG}}_\\theta(s) \\propto \\nabla^{\\text{QPG}}_\\theta(s)$.\n\n3.1 Analysis of Learning Dynamics on Normal-Form Games\n\nThe first question is whether any of these variants can converge to an equilibrium, even in the simplest case. So, we now show phase portraits of the learning dynamics on Matching Pennies: a two-action version of Rock, Paper, Scissors. These analyses are common in multiagent learning as they allow visual depiction of the policy changes and how different factors affect the (convergence) behavior [83, 92, 13, 91, 11, 94, 1, 99, 98, 8, 89]. Convergence is difficult in Matching Pennies as the only Nash equilibrium $\\pi^* = ((\\frac{1}{2}, \\frac{1}{2}), (\\frac{1}{2}, \\frac{1}{2}))$ requires learning stochastic policies. We give more detail and results on different games that cause cyclic learning behavior in Appendix D.\n\nFigure 1 (panels: (a) Replicator Dynamics, (b) RPG Dynamics, (c) Average RPG Dynamics): Learning Dynamics in Matching Pennies: (a) and (b) show the vector field for $\\partial\\pi/\\partial t$ including example particle traces, where each point is each player\u2019s probability of their first action; (c) shows example traces of policies following a discrete approximation to $\\int_0^t \\partial\\pi/\\partial t$.\n\nIn Figure 1, we see the similarity of the regret dynamics to replicator dynamics [88, 75]. We also show the average policy dynamics and observe convergence to equilibrium in each game we tried, which is known to be guaranteed in two-player zero-sum games using CFR, fictitious play [14], and continuous replicator dynamics [35]. However, computing the average policy is complex [31, 102] and potentially worse with function approximation, requiring storing past data in large buffers [32].\n\n3.2 Partially Observable Sequential Games\n\nHow do the values $v^c_i(\\pi, s_t, a_t)$ and $q_{\\pi,i}(s_t, a_t)$ differ? The authors of [38] posit that they are approximately equal when $s_t$ rarely occurs more than once in a trajectory. First, note that $s_t$ cannot be reached more than once in a trajectory from our definition of $s_t$, because the observation histories (of the player to play at $s_t$) would be different in each occurrence (i.e. due to perfect recall). So, the two values are indeed equal in deterministic, single-agent environments. In general, counterfactual values are conditioned on player $i$ playing to reach $s_t$, whereas q-function estimates are conditioned on having reached $s_t$. 
So,\n\n$$\\begin{aligned} q_{\\pi,i}(s_t, a_t) &= \\mathbb{E}_{\\rho \\sim \\pi}[G_{t,i} \\mid S_t = s_t, A_t = a_t] \\\\ &= \\sum_{h,z \\in \\mathcal{Z}(s_t,a_t)} \\Pr(h \\mid s_t)\\,\\eta^\\pi(ha, z)\\,u_i(z) && \\text{where } \\eta^\\pi(ha, z) = \\frac{\\eta^\\pi(z)}{\\eta^\\pi(h)\\,\\pi(s, a)} \\\\ &= \\sum_{h,z \\in \\mathcal{Z}(s_t,a_t)} \\frac{\\Pr(s_t \\mid h)\\Pr(h)}{\\Pr(s_t)}\\,\\eta^\\pi(ha, z)\\,u_i(z) && \\text{by Bayes\u2019 rule} \\\\ &= \\sum_{h,z \\in \\mathcal{Z}(s_t,a_t)} \\frac{\\Pr(h)}{\\Pr(s_t)}\\,\\eta^\\pi(ha, z)\\,u_i(z) && \\text{since } h \\in s_t,\\ h \\text{ is unique to } s_t \\\\ &= \\sum_{h,z \\in \\mathcal{Z}(s_t,a_t)} \\frac{\\eta^\\pi(h)}{\\sum_{h' \\in s_t} \\eta^\\pi(h')}\\,\\eta^\\pi(ha, z)\\,u_i(z) \\\\ &= \\sum_{h,z \\in \\mathcal{Z}(s_t,a_t)} \\frac{\\eta^\\pi_i(h)\\,\\eta^\\pi_{-i}(h)}{\\sum_{h' \\in s_t} \\eta^\\pi_i(h')\\,\\eta^\\pi_{-i}(h')}\\,\\eta^\\pi(ha, z)\\,u_i(z) \\\\ &= \\sum_{h,z \\in \\mathcal{Z}(s_t,a_t)} \\frac{\\eta^\\pi_i(s)\\,\\eta^\\pi_{-i}(h)}{\\eta^\\pi_i(s) \\sum_{h' \\in s_t} \\eta^\\pi_{-i}(h')}\\,\\eta^\\pi(ha, z)\\,u_i(z) && \\text{due to def. of } s_t \\text{ and perfect recall} \\\\ &= \\sum_{h,z \\in \\mathcal{Z}(s_t,a_t)} \\frac{\\eta^\\pi_{-i}(h)}{\\sum_{h' \\in s_t} \\eta^\\pi_{-i}(h')}\\,\\eta^\\pi(ha, z)\\,u_i(z) = \\frac{1}{\\sum_{h \\in s_t} \\eta^\\pi_{-i}(h)}\\,v^c_i(\\pi, s_t, a_t). \\end{aligned}$$\n\nThe derivation is similar to show that $v_{\\pi,i}(s_t) = v^c_i(\\pi, s_t) / \\sum_{h \\in s_t} \\eta^\\pi_{-i}(h)$. Hence, counterfactual values and standard value functions are generally not equal, but are scaled by the Bayes normalizing constant $B_{-i}(\\pi, s_t) = \\sum_{h \\in s_t} \\eta^\\pi_{-i}(h)$. If there is a low probability of reaching $s_t$ due to the environment or due to opponents\u2019 policies, these values will differ significantly.\nThis leads to a new interpretation of actor-critic algorithms in the multiagent partially observable setting: the advantage values $q_{\\pi,i}(s_t, a_t) - v_{\\pi,i}(s_t)$ are immediate counterfactual regrets scaled by $1/B_{-i}(\\pi, s_t)$. This then determines requirements for convergence guarantees in the tabular case.\nNote that the standard policy gradient theorem holds: gradients can be estimated from samples. This follows from the derivation of the policy gradient in the tabular case (see Appendix E). When TD bootstrapping is not used, the Markov property is not required; having multiple agents and/or partial observability does not change this. For a proof using REINFORCE ($G_t$ only), see [78, Theorem 1]. The proof trivially follows using $G_{t,i} - v_{\\pi,i}$ since $v_{\\pi,i}$ is trained separately and does not depend on $\\rho$.\nPolicy gradient algorithms perform gradient ascent on $J^{PG}(\\pi_\\theta) = v_{\\pi_\\theta}(s_0)$, using $\\nabla_\\theta J^{PG}(\\pi_\\theta) \\propto \\sum_s \\mu(s) \\sum_a \\nabla_\\theta \\pi_\\theta(s, a)\\,q_\\pi(s, a)$, where $\\mu$ is the on-policy distribution under $\\pi$ [85, Section 13.2]. The actor-critic equivalent is $\\nabla_\\theta J^{AC}(\\pi_\\theta) \\propto \\sum_s \\mu(s) \\sum_a \\nabla_\\theta \\pi_\\theta(s, a) \\left( q_\\pi(s, a) - \\sum_b \\pi(s, b)\\,q_\\pi(s, b) \\right)$. Note that the baseline is unnecessary when summing over the actions and $\\nabla_\\theta J^{AC}(\\pi_\\theta) = \\nabla_\\theta J^{PG}(\\pi_\\theta)$ [5]. However, our analysis relies on a projected gradient descent algorithm that does not assume simplex constraints on the policy: in that case, in general $\\nabla_\\theta J^{AC}(\\pi_\\theta) \\neq \\nabla_\\theta J^{PG}(\\pi_\\theta)$.\nDefinition 1. Define policy gradient policy iteration (PGPI) as a process that iteratively runs $\\theta \\leftarrow \\theta + \\alpha \\nabla_\\theta J^{PG}(\\pi_\\theta)$, and actor-critic policy iteration (ACPI) similarly using $\\nabla_\\theta J^{AC}(\\pi_\\theta)$.\nIn two-player zero-sum games, PGPI/ACPI are gradient ascent-descent problems, because each player is trying to ascend their own score function, and when using tabular policies a solution exists due to the minimax theorem [79]. Define player $i$\u2019s external regret over $K$ steps as $R^K_i = \\max_{\\pi'_i \\in \\Pi_i} \\left( \\sum_{k=1}^{K} \\mathbb{E}_{\\pi'_i}[G_{0,i}] - \\mathbb{E}_{\\pi^k}[G_{0,i}] \\right)$, where $\\Pi_i$ is the set of deterministic policies.\nTheorem 1. In two-player zero-sum games, when using tabular policies and an $\\ell_2$ projection $P(\\theta) = \\operatorname{argmin}_{\\theta' \\in \\Delta(\\mathcal{S},\\mathcal{A})} \\|\\theta - \\theta'\\|_2$, where $\\Delta(\\mathcal{S},\\mathcal{A}) = \\{\\theta \\mid \\forall s \\in \\mathcal{S}, \\sum_{b \\in \\mathcal{A}} \\theta_{s,b} = 1\\}$ is the space of tabular simplices, if player $i$ uses learning rates of $\\alpha_{s,k} = k^{-\\frac{1}{2}}\\,\\eta^{\\pi^k}_i(s)\\,B_{-i}(\\pi, s)$ at $s$ on iteration $k$, and $\\theta^k_{s,a} > 0$ for all $k$ and $s$, then projected PGPI, $\\theta^{k+1}_{s,\\cdot} \\leftarrow P\\left(\\{\\theta^k_{s,a} + \\alpha_{s,k} \\frac{\\partial}{\\partial \\theta^k_{s,a}} J^{PG}(\\pi_{\\theta^k})\\}_a\\right)$, has regret $R^K_i \\leq \\frac{1}{\\eta^{\\min}_i} |\\mathcal{S}_i| \\left( \\sqrt{K} + (\\sqrt{K} - \\frac{1}{2})\\,|\\mathcal{A}|\\,(\\Delta r)^2 \\right)$, where $\\mathcal{S}_i$ is the set of player $i$\u2019s states, $\\Delta r$ is the reward range, and $\\eta^{\\min}_i = \\min_{s,k} \\eta^k_i(s)$. The same holds for projected ACPI (see appendix).\nThe proof is given in Appendix E. In the case of sampled trajectories, as long as every state is reached with positive probability, Monte Carlo estimators of $q_{\\pi,i}$ will be consistent. 
Therefore, we use exploratory policies and decay exploration over time. With a finite number of samples, the probability that an estimator $\\hat{q}_{\\pi,i}(s, a)$ deviates from its mean by some quantity is determined by Hoeffding\u2019s inequality and the reach probabilities. We suspect these errors could be accumulated to derive probabilistic regret bounds similar to the off-policy Monte Carlo case [46].\nWhat happens in the sampling case with a fixed per-state learning rate $\\alpha_s$? If player $i$ collects a batch of data from many sampled episodes and applies them all at once, then the effective learning rate (expected update rate relative to the other states) is scaled by the probability of reaching $s$: $\\eta^\\pi_i(s)\\,B_{-i}(\\pi, s)$, which matches the value in the condition of Theorem 1. This suggests using a globally decaying learning rate to simulate the remaining $k^{-\\frac{1}{2}}$.\nThe analysis so far has concentrated on establishing guarantees for the optimization problem that underlies the standard formulation of policy gradient and actor-critic algorithms. A better guarantee can be achieved by using stronger policy improvement (proof and details are found in Appendix E):\nTheorem 2. Define a state-local $J^{PG}(\\pi_\\theta, s) = v_{\\pi_\\theta,i}(s)$, composite gradient $\\{\\frac{\\partial}{\\partial \\theta_{s,a}} J^{PG}(\\pi_\\theta, s)\\}_{s,a}$, strong policy gradient policy iteration (SPGPI), and strong actor-critic policy iteration (SACPI) as in Definition 1 except replacing the gradient components with $\\frac{\\partial}{\\partial \\theta_{s,a}} J^{PG}(\\pi_\\theta, s)$. Then, in two-player zero-sum games, when using tabular policies and projection $P(\\theta)$ as defined in Theorem 1 with learning rates $\\alpha_k = k^{-\\frac{1}{2}}$ on iteration $k$, projected SPGPI, $\\theta^{k+1}_{s,\\cdot} \\leftarrow P\\left(\\{\\theta^k_{s,a} + \\alpha_k \\frac{\\partial}{\\partial \\theta^k_{s,a}} J^{PG}(\\pi_\\theta, s)\\}_a\\right)$, has regret $R^K_i \\leq |\\mathcal{S}_i| \\left( \\sqrt{K} + (\\sqrt{K} - \\frac{1}{2})\\,|\\mathcal{A}|\\,(\\Delta r)^2 \\right)$, where $\\mathcal{S}_i$ is the set of player $i$\u2019s states and $\\Delta r$ is the reward range. This also holds for projected SACPI (see appendix).\n\n4 Empirical Evaluation\n\nWe now assess the behavior of the actor-critic algorithms in practice. While the analyses in the previous section established guarantees for the tabular case, ultimately we want to assess scalability and generalization potential for larger settings. Our implementation parameterizes critics and policies using neural networks with two fully-connected layers of 128 units each, and rectified linear unit activation functions, followed by a linear layer to output a single value $q$ or softmax layer to output $\\pi$. We chose these architectures to remain consistent with previous evaluations [32, 50].\n\n4.1 Domains: Kuhn and Leduc Poker\n\nWe evaluate the actor-critic algorithms on two $n$-player games: Kuhn poker, and Leduc poker.\nKuhn poker is a toy game where each player starts with 2 chips, antes 1 chip to play, and receives one card face down from a deck of size $n + 1$ (one card remains hidden). Players proceed by betting (raise/call) by adding their remaining chip to the pot, or passing (check/fold) until all players are either in (contributed the same amount as all other players to the pot) or out (folded, passed after a raise). The player with the highest-ranked card that has not folded wins the pot.\nIn Leduc poker, players have a limitless number of chips, and the deck has size $2(n + 1)$, divided into two suits of identically-ranked cards. There are two rounds of betting, and after the first round a single public card is revealed from the deck. 
Each player antes 1 chip to play, and the bets are limited to two per round, with the number of chips per bet limited to 2 in the first round and 4 in the second round.\nThe reward to each player is the number of chips they had after the game minus the number they had before the game. To remain consistent with other baselines, we use the form of Leduc described in [50] which does not restrict the action space, adding reward penalties if/when illegal moves are chosen.\n\n4.2 Baseline: Neural Fictitious Self-Play\n\nWe compare to one main baseline. Neural Fictitious Self-Play (NFSP) is an implementation of fictitious play, where approximate best responses are used in place of full best responses [32]. Two transition buffers are used: $\\mathcal{D}^{RL}$ and $\\mathcal{D}^{ML}$; the former is used to train a DQN agent towards a best response $\\pi_i$ to $\\bar{\\pi}_{-i}$, while data in the latter is replaced using reservoir sampling and is used to train $\\bar{\\pi}_i$ by classification.\n\n4.3 Main Performance Results\n\nHere we show the empirical convergence to approximate Nash equilibria for each algorithm in self-play, and performance against fixed bots. The standard metric to use for this is $\\text{NashConv}(\\pi)$ defined in Section 2.2, which reports the accuracy of the approximation to a Nash equilibrium.\nTraining Setup. In the domains we tested, we observed that the variance in returns was high and hence we performed multiple policy evaluation updates (q-update for $\\nabla^{\\text{QPG}}$, $\\nabla^{\\text{RPG}}$, and $\\nabla^{\\text{RMPG}}$, and v-update for A2C) followed by policy improvement (policy gradient update). These updates used separate SGD optimizers with their own learning rates: a fixed rate of 0.001 for policy evaluation, and a rate annealed from a starting value to 0 over 20M steps for policy improvement (see Appendix G for exact values). Further, the policy improvement step is applied after $N_q$ policy evaluation updates. We treat $N_q$ and batch size as hyperparameters and sweep over a few reasonable values. 
In order to handle different scales of rewards in the multiple domains, we used streaming\nZ-normalization on the rewards, inspired by its use in Proximal Policy Optimization (PPO) [77]. In\naddition, the agent\u2019s policy is controlled by a(n inverse) temperature added as part of the softmax\noperator. The temperature is annealed from 1 to 0 over 1M steps to ensure adequate state space\ncoverage. An additional entropy cost hyper-parameter is added, as is standard practice with Deep RL\npolicy gradient methods such as A3C [59, 77]. For NFSP, we used the same values presented in [50].\nConvergence to Equilibrium. See Figure 2 for convergence results. Please note that we plot the\nNASHCONV for the average policy in the case of NFSP, and the current policy in the case of the policy\ngradient algorithms. We see that in 2-player Leduc, the actor-critic variants we tried are similar in\nperformance; NFSP has faster short-term convergence, but in the long term the actor-critics are comparable.\nEach converges significantly faster than A2C. However, RMPG seems to plateau.\n\nFigure 2: Empirical convergence rates for NASHCONV(\u03c0) and performance versus CFR agents. Panels: NASHCONV in 2-player Kuhn; NASHCONV in 3-player Kuhn; NASHCONV in 2-player Leduc; NASHCONV in 3-player Leduc; 2-player Leduc vs. CFR500; 3-player Leduc vs. CFR500.\n\nPerformance Against Fixed Bots. We also measure the expected reward against fixed bots, averaged\nover player seats. These bots, CFR500, correspond to the average policy after 500 iterations of CFR.\nQPG and RPG do well here, scoring higher than A2C and even beating NFSP in the long term.\n\n5 Conclusion\n\nIn this paper, we discuss several update rules for actor-critic algorithms in multiagent reinforcement\nlearning. One key property of this class of algorithms is that they are model-free, leading to a\npurely online algorithm, independent of the opponents and environment. 
We show a connection\nbetween these algorithms and (counterfactual) regret minimization, leading to previously unknown\nconvergence properties underlying model-free MARL in zero-sum games with imperfect information.\nOur experiments show that these actor-critic algorithms converge to approximate Nash equilibria in\ncommonly used benchmark Poker domains with rates similar to or better than baseline model-free\nalgorithms for zero-sum games. In addition, they may be easier to implement, and do not require storing\na large memory of transitions. Furthermore, the current policy of some variants does significantly better\nthan the baselines (including the average policy of NFSP) when evaluated against fixed bots. Of the\nactor-critic variants, RPG and QPG seem to outperform RMPG in our experiments.\nAs future work, we would like to formally develop the (probabilistic) guarantees of the sample-based\non-policy Monte Carlo CFR algorithms and/or extend to continuing tasks as in MDPs [41]. We are\nalso curious about what role the connections between actor-critic methods and CFR could play in\nderiving convergence guarantees in model-free MARL for cooperative and/or potential games.\nAcknowledgments. We would like to thank Martin Schmid, Audr\u016bnas Gruslys, Neil Burch, Noam\nBrown, Kevin Waugh, Rich Sutton, and Thore Graepel for their helpful feedback and support.\n\n[Figure 2 plot data omitted. Axes: Episodes vs. NashConv for the four convergence panels (curves: NFSP, A2C, RPG, QPG, RM); Episodes vs. Mean reward for the two CFR500 panels (curves: A2C, RPG, QPG, NFSP).]\n\nReferences\n\n[1] Sherief Abdallah and Victor Lesser. A multiagent reinforcement learning algorithm with non-linear dynamics. 
JAIR, 33(1):521\u2013549, 2008.\n\n[2] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Nicolas Heess, Remi Munos, and Martin Riedmiller. Maximum a posteriori policy optimisation. CoRR, abs/1806.06920, 2018.\n\n[3] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proceedings of the Sixth International Conference on Learning Representations, 2018.\n\n[4] Stefano V. Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66\u201395, 2018.\n\n[5] Cameron Allen, Melrose Roderick, Kavosh Asadi, Abdel-rahman Mohamed, George Konidaris, and Michael Littman. Mean actor critic. CoRR, abs/1709.00503, 2017.\n\n[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322\u2013331, 1995.\n\n[7] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In Proceedings of the Sixth International Conference on Learning Representations, 2018.\n\n[8] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: A survey. J. Artif. Intell. Res. (JAIR), 53:659\u2013697, 2015.\n\n[9] A. Blum and Y. Mansour. Learning, regret minimization, and equilibria. In Algorithmic Game Theory, chapter 4. Cambridge University Press, 2007.\n\n[10] Branislav Bo\u0161ansk\u00fd, Viliam Lis\u00fd, Marc Lanctot, Ji\u0159\u00ed \u010cerm\u00e1k, and Mark H.M. Winands. Algorithms for computing strategies in two-player simultaneous move games. Artificial Intelligence, 237:1\u201340, 2016.\n\n[11] Michael Bowling. Convergence and no-regret in multiagent learning. 
In Advances in Neural Information Processing Systems 17 (NIPS), pages 209\u2013216, 2005.\n\n[12] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up Limit Hold\u2019em Poker is solved. Science, 347(6218):145\u2013149, January 2015.\n\n[13] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215\u2013250, 2002.\n\n[14] G. W. Brown. Iterative solutions of games by fictitious play. In T.C. Koopmans, editor, Activity Analysis of Production and Allocation, pages 374\u2013376. John Wiley & Sons, Inc., 1951.\n\n[15] Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017.\n\n[16] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 360(6385), December 2017.\n\n[17] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156\u2013172, 2008.\n\n[18] Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.\n\n[19] Kamil Ciosek and Shimon Whiteson. Expected policy gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.\n\n[20] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. In Proceedings of the Thirty-eighth Annual ACM Symposium on Theory of Computing, STOC \u201906, pages 71\u201378, New York, NY, USA, 2006. ACM.\n\n[21] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. 
Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016.\n\n[22] Lasse Espeholt, Hubert Soyer, R\u00e9mi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. CoRR, abs/1802.01561, 2018.\n\n[23] Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2017.\n\n[24] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2017.\n\n[25] N. Gatti, F. Panozzo, and M. Restelli. Efficient evolutionary dynamics with extensive-form games. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 335\u2013341, 2013.\n\n[26] Richard Gibson. Regret minimization in non-zero-sum games with applications to building champion multiplayer computer poker agents. CoRR, abs/1305.0034, 2013.\n\n[27] Shixiang Gu, Ethan Holly, Timothy P. Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation. CoRR, abs/1610.00633, 2016.\n\n[28] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127\u20131150, 2000.\n\n[29] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3\u20134):157\u2013325, 2015.\n\n[30] He He, Jordan L. Boyd-Graber, Kevin Kwok, and Hal Daum\u00e9 III. Opponent modeling in deep reinforcement learning. 
In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016.\n\n[31] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.\n\n[32] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, abs/1603.01121, 2016.\n\n[33] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. CoRR, abs/1707.09183, 2017.\n\n[34] Josef Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.\n\n[35] Josef Hofbauer, Sylvain Sorin, and Yannick Viossat. Time average replicator and best-reply dynamics. Mathematics of Operations Research, 34(2):263\u2013269, 2009.\n\n[36] Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A deep policy inference Q-network for multi-agent systems. CoRR, abs/1712.07893, 2017.\n\n[37] Max Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations, 2017.\n\n[38] Peter H. Jin, Sergey Levine, and Kurt Keutzer. Regret minimization for partially observable deep reinforcement learning. CoRR, abs/1710.11424, 2017.\n\n[39] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99\u2013134, 1998.\n\n[40] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML \u201902, pages 267\u2013274, San Francisco, CA, USA, 2002. 
Morgan Kaufmann Publishers Inc.\n\n[41] Ian A. Kash and Katja Hofmann. Combining no-regret and Q-learning. In European Workshop on Reinforcement Learning (EWRL) 14, 2018.\n\n[42] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n\n[43] Vojt\u011bch Kova\u0159\u00edk and Viliam Lis\u00fd. Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games. CoRR, abs/1509.00149, 2015.\n\n[44] H. W. Kuhn. Extensive games and the problem of information. Contributions to the Theory of Games, 2:193\u2013216, 1953.\n\n[45] L. S. Shapley. Some topics in two-person games. In Advances in Game Theory. Princeton University Press, 1964.\n\n[46] M. Lanctot, K. Waugh, M. Bowling, and M. Zinkevich. Sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems (NIPS 2009), pages 1078\u20131086, 2009.\n\n[47] Marc Lanctot. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games. PhD thesis, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, June 2013.\n\n[48] Marc Lanctot. Further developments of extensive-form replicator dynamics using the sequence-form representation. In Proceedings of the Thirteenth International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1257\u20131264, 2014.\n\n[49] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1078\u20131086, 2009.\n\n[50] Marc Lanctot, Vinicius Zambaldi, Audr\u016bnas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien P\u00e9rolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. 
In Advances in Neural Information Processing Systems, 2017.\n\n[51] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. In Proceedings of the International Conference on Learning Representations (ICLR), April 2017.\n\n[52] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436\u2013444, 2015.\n\n[53] Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. CoRR, abs/1707.01068, 2017.\n\n[54] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.\n\n[55] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157\u2013163. Morgan Kaufmann, 1994.\n\n[56] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.\n\n[57] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6379\u20136390. Curran Associates, Inc., 2017.\n\n[58] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(01):1\u201331, 2012.\n\n[59] Volodymyr Mnih, Adri\u00e0 Puigdom\u00e8nech Badia, Mehdi Mirza, Alex Graves, Timothy P. 
Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1928\u20131937, 2016.\n\n[60] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529\u2013533, 2015.\n\n[61] Matej Morav\u010d\u00edk, Martin Schmid, Neil Burch, Viliam Lis\u00fd, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 358(6362), October 2017.\n\n[62] Todd W. Neller and Marc Lanctot. An introduction to counterfactual regret minimization. In Proceedings of Model AI Assignments, The Fourth Symposium on Educational Advances in Artificial Intelligence (EAAI-2013), 2013. http://modelai.gettysburg.edu/2013/cfr/index.html.\n\n[63] Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. Policy gradient with value function approximation for collective multiagent planning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4319\u20134329. Curran Associates, Inc., 2017.\n\n[64] A. Now\u00e9, P. Vrancx, and Y-M. De Hauwere. Game theory and multi-agent reinforcement learning. In Reinforcement Learning: State-of-the-Art, chapter 14, pages 441\u2013470. 2012.\n\n[65] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.\n\n[66] Fabio Panozzo, Nicola Gatti, and Marcello Restelli. 
Evolutionary dynamics of Q-learning over the sequence form. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 2034\u20132040, 2014.\n\n[67] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. CoRR, abs/1804.02717, 2018.\n\n[68] Julien P\u00e9rolat, Bilal Piot, and Olivier Pietquin. Actor-critic fictitious play in simultaneous move multistage games. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 919\u2013928, Playa Blanca, Lanzarote, Canary Islands, 09\u201311 Apr 2018. PMLR.\n\n[69] Julien P\u00e9rolat, Bilal Piot, Bruno Scherrer, and Olivier Pietquin. On the use of non-stationary strategies for solving two-player zero-sum Markov games. In The 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), 2016.\n\n[70] Julien P\u00e9rolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum Markov games. In Proceedings of the International Conference on Machine Learning (ICML), 2015.\n\n[71] Jan Peters. Policy gradient methods for control applications. Technical Report TR-CLMC-2007-1, University of Southern California, 2002.\n\n[72] Yu Qian, Fang Debin, Zhang Xiaoling, Jin Chen, and Ren Qiyu. Stochastic evolution dynamic of the rock\u2013scissors\u2013paper game based on a quasi birth and death process. Scientific Reports, 6(1):28585, 2016.\n\n[73] Deirdre Quillen, Eric Jang, Ofir Nachum, Chelsea Finn, Julian Ibarz, and Sergey Levine. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. CoRR, abs/1802.10264, 2018.\n\n[74] N. A. Risk and D. Szafron. 
Using counterfactual regret minimization to create competitive multiplayer poker agents. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 159\u2013166, 2010.\n\n[75] William H. Sandholm. Population Games and Evolutionary Dynamics. The MIT Press, 2010.\n\n[76] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.\n\n[77] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[78] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016.\n\n[79] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.\n\n[80] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484\u2013489, 2016.\n\n[81] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 
CoRR, abs/1712.01815, 2017.\n\n[82] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354\u2013359, 2017.\n\n[83] Satinder P. Singh, Michael J. Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, UAI \u201900, pages 541\u2013548, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.\n\n[84] Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien P\u00e9rolat, Karl Tuyls, R\u00e9mi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. CoRR, abs/1810.09026, 2018.\n\n[85] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.\n\n[86] Richard S. Sutton, Satinder Singh, and David McAllester. Comparing policy-gradient algorithms, 2001. Unpublished.\n\n[87] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas Hold\u2019em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015.\n\n[88] P. D. Taylor and L. B. Jonker. Evolutionarily stable strategies and game dynamics. Mathematical Biosciences, 40:145\u2013156, 1978.\n\n[89] Karl Tuyls, Julien P\u00e9rolat, Marc Lanctot, Joel Z Leibo, and Thore Graepel. A Generalised Method for Empirical Game Theoretic Analysis. In AAMAS, 2018.\n\n[90] Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37\u201357, 1985.\n\n[91] W. E. Walsh, D. C. Parkes, and R. Das. Choosing samples to compute heuristic-strategy Nash equilibrium. 
In Proceedings of the Fifth Workshop on Agent-Mediated Electronic Commerce, 2003.\n\n[92] William E Walsh, Rajarshi Das, Gerald Tesauro, and Jeffrey O Kephart. Analyzing Complex Strategic Interactions in Multi-Agent Systems. In AAAI, 2002.\n\n[93] Kevin Waugh, Dustin Morrill, J. Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.\n\n[94] Michael P. Wellman. Methods for empirical game-theoretic analysis. In Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pages 1552\u20131556, 2006.\n\n[95] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229\u2013256, 1992.\n\n[96] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.\n\n[97] Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In Proceedings of the International Conference on Learning Representations, 2017.\n\n[98] Michael Wunder, Michael Littman, and Monica Babes. Classes of multiagent Q-learning dynamics with \u03b5-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning, ICML \u201910, pages 1167\u20131174, 2010.\n\n[99] Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 927\u2013934, 2010.\n\n[100] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 
In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.\n\n[101] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. Technical Report CMU-CS-03-110, Carnegie Mellon University, 2003.\n\n[102] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.\n\n[103] Martin Zinkevich, Amy Greenwald, and Michael L. Littman. Cyclic equilibria in Markov games. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS \u201905, pages 1641\u20131648, Cambridge, MA, USA, 2005. MIT Press.", "award": [], "sourceid": 1760, "authors": [{"given_name": "Sriram", "family_name": "Srinivasan", "institution": "Google"}, {"given_name": "Marc", "family_name": "Lanctot", "institution": "DeepMind"}, {"given_name": "Vinicius", "family_name": "Zambaldi", "institution": "Deepmind"}, {"given_name": "Julien", "family_name": "Perolat", "institution": "DeepMind"}, {"given_name": "Karl", "family_name": "Tuyls", "institution": "DeepMind"}, {"given_name": "Remi", "family_name": "Munos", "institution": "DeepMind"}, {"given_name": "Michael", "family_name": "Bowling", "institution": "DeepMind / University of Alberta"}]}