{"title": "Finding Friend and Foe in Multi-Agent Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1251, "page_last": 1261, "abstract": "Recent breakthroughs in AI for multi-agent games like Go, Poker, and Dota, have seen great strides in recent years. Yet none of these games address the real-life challenge of cooperation in the presence of unknown and uncertain teammates. This challenge is a key game mechanism in hidden role games. Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on \"The Resistance: Avalon\", the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play. Our algorithm integrates deductive reasoning into vector-form CFR to reason about joint beliefs and deduce partially observable actions. We augment deep value networks with constraints that yield interpretable representations of win probabilities. These innovations enable DeepRole to scale to the full Avalon game. Empirical game-theoretic methods show that DeepRole outperforms other hand-crafted and learned agents in five-player Avalon. DeepRole played with and against human players on the web in hybrid human-agent teams. We find that DeepRole outperforms human players as both a cooperator and a competitor.", "full_text": "Finding Friend and Foe in Multi-Agent Games\n\nJack Serrino\u21e4\n\nMIT\n\njserrino@mit.edu\n\nMax Kleiman-Weiner\u21e4\nHarvard, MIT, Diffeo\n\nmaxkleimanweiner@fas.harvard.edu\n\nDavid C. Parkes\nHarvard University\n\nparkes@eecs.harvard.edu\n\nJoshua B. Tenenbaum\n\nMIT, CBMM\njbt@mit.edu\n\nAbstract\n\nRecent breakthroughs in AI for multi-agent games like Go, Poker, and Dota, have\nseen great strides in recent years. Yet none of these games address the real-life\nchallenge of cooperation in the presence of unknown and uncertain teammates.\nThis challenge is a key game mechanism in hidden role games. 
Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on The Resistance: Avalon, the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play. Our algorithm integrates deductive reasoning into vector-form CFR to reason about joint beliefs and deduce partially observable actions. We augment deep value networks with constraints that yield interpretable representations of win probabilities. These innovations enable DeepRole to scale to the full Avalon game. Empirical game-theoretic methods show that DeepRole outperforms other hand-crafted and learned agents in five-player Avalon. DeepRole played with and against human players on the web in hybrid human-agent teams. We find that DeepRole outperforms human players as both a cooperator and a competitor.\n\n1 Introduction\n\nCooperation enables agents to achieve feats together that no individual can achieve on her own [16, 39]. Cooperation is challenging, however, because it is embedded within a competitive world [15]. Many multi-party interactions start off by asking: who is on my team? Who will collaborate with me and who do I need to watch out for? These questions arise whether it is your first day of kindergarten or your first day at the stock exchange. Figuring out who to cooperate with and who to protect oneself against is a fundamental challenge for any agent in a diverse multi-agent world. This has been explored in cognitive science, economics, and computer science [2, 7, 8, 21, 23, 24, 25, 26, 28, 30, 31, 44].\n\nCore to this challenge is that information about who to cooperate with is often noisy and ambiguous. Typically, we only get this information indirectly through others' actions [1, 3, 21, 41]. Since different agents may act in different ways, these inferences must be robust and take into account ad-hoc factors that arise in an interaction. 
Furthermore, these inferences might be carried out in the presence of a sophisticated adversary with superior knowledge and the intention to deceive. These adversaries could intentionally hide their non-cooperative intentions and try to appear cooperative for their own benefit [36]. The presence of adversaries makes communication challenging: when intent to cooperate is unknown, simple communication is unreliable or \"cheap\" [14].\n\nThis challenge has not been addressed by recent work in multi-agent reinforcement learning (RL). In particular, the impressive results in imperfect-information two-player zero-sum games such as poker [4, 6, 27] are not straightforward to apply to problems where cooperation is ambiguous. In heads-up poker, there is no opportunity to actually coordinate or cooperate with others, since two-player zero-sum games are strictly adversarial. In contrast, games such as Dota and capture the flag have been used to train Deep RL agents that coordinate with each other to compete against other teams [17, 29]. However, in neither setting was there ambiguity about who to cooperate with. Further, in real-time games, rapid reflexes and reaction times give an inherent non-strategic advantage to machines [9].\n\n* indicates equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHere we develop DeepRole, a multi-agent reinforcement learning algorithm that addresses the challenge of learning who to cooperate with and how. We apply DeepRole to a five-player game of alliances, The Resistance: Avalon (Avalon), a popular hidden role game where the challenge of learning who to cooperate with is the central focus of play [13]. 
Hidden role games start with players joining particular teams and adopting roles that are not known to all players of the game. During the course of the game, the players try to infer and deduce the roles of their peers while others simultaneously try to prevent their role from being discovered. As of May 2019, Avalon is the most highly rated hidden role game on boardgamegeek.com. Hidden role games such as Mafia, Werewolf, and Saboteur are widely played around the world.\n\nRelated work. DeepRole builds on the recent success of heuristic search techniques that combine efficient depth-limited lookahead planning with a value function learned through self-play in two-player zero-sum games [27, 33, 34]. In particular, the DeepStack algorithm for no-limit heads-up poker combines counterfactual regret minimization (CFR) using a continual re-solving local search strategy with deep neural networks [27, 45]. While DeepStack was developed for games where all actions are public (such as poker), in hidden role games some actions are only observable by some agents and therefore must be deduced. In Avalon, players obtain new private information as the game progresses, while in poker the only hidden information is the initial set of cards.\n\nContributions. Our key contributions build on these recent successes. Our algorithm integrates deductive reasoning into vector-form CFR [19] to reason about joint beliefs and partially observable actions based on consistency with observed outcomes, and augments value networks with constraints that yield interpretable representations of win probabilities. This augmented network enables training with better sample efficiency and generalization. We conduct an empirical game-theoretic analysis in five-player Avalon and show that the DeepRole CFR-based algorithm outperforms existing approaches and hand-crafted systems. 
Finally, we had DeepRole play with a large sample of human players on a popular online Avalon site. DeepRole outperforms people as both a teammate and opponent when playing with and against humans, even though it was only trained through self-play. We conclude by discussing the value of hidden role games as a long-term challenge for multi-agent RL systems.\n\nFigure 1: Description of the public game dynamics in The Resistance: Avalon. (left) Each round (rectangle) has up to 5 proposals (white circles) and leads to either a mission that fails or succeeds. (right) Example dynamics within each round. Players (colored circles) alternate proposing subsets of players (2 or 3) to go on a mission, which are then put to a vote by all 5 players. If the majority approves, those players (1 & 5 in this example) privately and independently decide to succeed or fail the mission. If the majority disapproves, the next player proposes a subset.\n\n2 The Resistance: Avalon\n\nWe first briefly describe the game mechanics of The Resistance: Avalon played with five players. At the beginning of the game, 3 players are randomly assigned to the Resistance team and 2 players are assigned to the Spy team. The spies know which players are on the Spy team (and hence also know which players are on the Resistance team). One member of the Resistance team is randomly and privately chosen to be the Merlin role, who also knows all role assignments. 
One member of the Spy team is randomly chosen to be the Assassin. At the end of the game, if the Resistance team has won, the Assassin guesses the identity of Merlin. If the Assassin guesses Merlin correctly, then the Spy team wins.\n\nFigure 1 shows a visual description of the public game dynamics. There are five rounds in the game. During each round, a player proposes a subset of agents (two or three, depending on the round) to go on a mission. All players simultaneously and publicly vote (approve or not approve) on that subset. If a simple majority does not approve, another player is selected to propose a subset to go on the mission. If after five attempts no proposal receives a simple majority, the Spy team wins. If a simple majority approves, the subset of players privately selects whether the mission succeeds or fails. Players on the Resistance team must always choose success, but players on the Spy team can choose success or failure. If any of the Spies choose to fail the mission, the mission fails. Otherwise, the mission succeeds. The total number of success and fail votes is made public, but the identity of who made those votes is private. If three missions succeed, the Resistance team wins. If three missions fail, the Spy team wins. When people play Avalon, the games are usually rich in \"cheap talk,\" such as defending oneself, accusing others, or debunking others' claims [10]. In this work, we do not consider the strategic implications of natural language communication.\n\nAlthough Avalon is a simple game to describe, it has a large state space. We compute a lower bound of 10^56 distinct information sets in the 5-player version of Avalon (see Appendix D for details). 
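To make the public mechanics above concrete, here is a minimal simulator sketch in Python. This is our own illustrative code, not from the DeepRole implementation; the mission sizes follow the standard five-player setup, and the Assassin's final Merlin guess is left out of the public flow.

```python
MISSION_SIZES = [2, 3, 2, 3, 3]  # players per mission in rounds 1-5 (five-player game)

def play_public_game(propose, vote, mission_action):
    """Run one five-player game using caller-supplied policies.

    propose(r, proposer)     -> list of player indices of the required size
    vote(r, player, team)    -> True to approve the proposed team
    mission_action(r, player)-> True to succeed (Resistance must return True)
    """
    succeeded = failed = 0
    proposer = 0
    for r in range(5):
        for attempt in range(5):
            team = propose(r, proposer)
            approvals = sum(vote(r, p, team) for p in range(5))
            proposer = (proposer + 1) % 5
            if approvals >= 3:  # simple majority of the 5 players
                break
        else:
            return "Spy"  # five rejected proposals in a row: Spy team wins
        # Only the number of fail cards is public, never who played them.
        fails = sum(not mission_action(r, p) for p in team)
        if fails > 0:
            failed += 1
        else:
            succeeded += 1
        if succeeded == 3:
            return "Resistance"  # in the full game, the Assassin then guesses Merlin
        if failed == 3:
            return "Spy"
```

Note that the simulator exposes exactly the public information discussed in the text: proposals, votes, and fail counts, but not individual mission actions.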
This is larger than the state space of Chess (10^47) and larger than the number of information sets in heads-up limit poker (10^14) [18].\n\n3 Algorithm: DeepRole\n\nThe DeepRole algorithm builds on recent success in poker by combining DeepStack's innovations of deep value networks and depth-limited solving with deductive reasoning. Compared to MCTS-based methods like AlphaGo, CFR-based methods like DeepStack and DeepRole can soundly reason over hidden information. Unique to DeepRole, our innovations allow the algorithm to play games with simultaneous and hidden actions. In broad strokes, DeepRole is composed of two parts: (1) a CFR planning algorithm augmented with deductive reasoning; and (2) neural value networks that are used to reduce the size of the game tree. Source code and experimental data are available here: https://github.com/Detry322/DeepRole.\n\nBackground. Hidden role games like Avalon can be modeled as extensive-form games. We follow the notation of [19]. Briefly, these games have a game tree with nodes that correspond to different histories of actions, h ∈ H, with Z ⊆ H the set of terminal histories. For each h ∈ Z, let u_i(h) be the utility to player i in terminal history h. In extensive-form games, only a single player P(h) can move at any history h, but because Avalon's mechanics intimately involve simultaneous action, we extend this definition to let P'(h) be the set of players simultaneously moving at h. Histories are partitioned into information sets (I ∈ I_i) that represent the game states that player i cannot distinguish between. For example, a Resistance player does not know who is on the Spy team, and thus all h differing only in the role assignments to the other players are in a single information set. The actions available in a given information set are a ∈ A(I).\n\nA strategy σ_i for player i is a mapping from each I ∈ I_i to a probability distribution over A(I). Let σ = (σ_1, . . . , σ_p) be the joint strategy of all p players. We write σ_{I→a} to mean strategy σ, modified so that action a is always played at information set I. Then, we let π^σ(h) be the probability of reaching h if all players act according to σ. We write π^σ_i(h) to mean the contribution of player i to the joint probability π^σ(h) = ∏_{i=1...p} π^σ_i(h). Finally, let π^σ_{-i}(h) be the product of strategies for all players except i, and let π^σ(h, h') be the probability of reaching history h' under strategy σ, given that h has occurred.\n\nCounterfactual regret minimization (CFR) iteratively refines σ based on the regret accumulated through a self-play-like procedure. Specifically, in CFR+, at iteration T, the cumulative counterfactual regret is R^{+,T}_i(I, a) = max{Σ_{t=1}^{T} [CFV_i(σ^t_{I→a}, I) − CFV_i(σ^t, I)], 0}, where the counterfactual values for player i are defined as CFV_i(σ, I) = Σ_{z∈Z} u_i(z) π^σ_{-i}(z[I]) π^σ(z[I], z), where z[I] is the h ∈ I such that h ⊑ z [38]. At a high level, CFR iteratively improves σ by boosting the probability of actions that would have been beneficial to each player. In two-player zero-sum games, CFR provably converges to a Nash equilibrium. However, it does not necessarily converge to an equilibrium in games with more than two players [37]. We investigate whether CFR can generate strong strategies in a multi-agent hidden role game like Avalon.\n\n3.1 CFR with deductive logic\n\nThe CFR component of DeepRole is based on the vector-form public chance sampling (PCS) version of CFR introduced in [19], together with CFR+ regret matching [38]. Vector-form versions of CFR can result in faster convergence and take advantage of SIMD instructions, but require a public game tree [20]. In poker-like games, one can construct a public game tree from player actions, since all actions are public (e.g., bets, new cards revealed) except for the initial chance action (giving players cards). 
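For intuition, the CFR+ update at a single information set can be sketched as follows. This is our own illustrative code (real implementations operate on vectors over whole collections of information sets); it shows the clipped regret accumulation and regret matching that CFR+ uses.

```python
import numpy as np

def regret_matching_plus(cum_regret):
    """CFR+ strategy from nonnegative cumulative regrets: play each action in
    proportion to its positive regret; play uniformly if none is positive."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total
    return np.full_like(pos, 1.0 / len(pos))

def cfr_plus_update(cum_regret, cfv_per_action, strategy):
    """Accumulate counterfactual regret at one information set, clipping at
    zero as in CFR+.

    cfv_per_action[a] plays the role of CFV_i(sigma_{I->a}, I); the baseline
    CFV_i(sigma, I) is the strategy-weighted average of the action values.
    """
    baseline = strategy @ cfv_per_action
    return np.maximum(cum_regret + (cfv_per_action - baseline), 0.0)
```

Iterating these two steps boosts the probability of actions that would have been beneficial in hindsight, as described above.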
In hidden role games, however, key actions after the initial chance action are made privately, breaking the standard construction.\n\nTo support hidden role games, we extend the public game tree to be a history of third-person observations, o ∈ O(h), instead of just actions. This includes both public actions and observable consequences of private actions (lines 22-44 in Alg. 1 in the Appendix). Our extension works when deductive reasoning from these observations reveals the underlying private actions. For instance, if a mission fails and one of the players is known to be a Spy, one can deduce that the Spy failed the mission. deduceActions(h, o) carries out this deductive reasoning and returns the actions taken by each player under each information set (a_i[I]) (line 23). With a_i[I] and the player's strategy (σ_i), the player's reach probabilities are updated for the public game state following the observation (ho) (lines 24-26).\n\nUsing the public game tree, we maintain a human-interpretable joint posterior belief b(ρ|h) over the initial assignment of roles ρ. ρ represents a full assignment of roles to players (the result of the initial chance action), so our belief b(ρ|h) represents the joint probability that each player has the role specified in ρ, given the observed actions in the public game tree. See Figure 2 for an example belief b and assignment ρ. This joint posterior b(ρ|h) can be approximated by using the individual players' strategies as the likelihood in Bayes' rule:\n\nb(ρ|h) ∝ b(ρ) (1 − 1{h ⊢ ¬ρ}) ∏_{i∈1...p} π^σ_i(I_i(h, ρ))    (1)\n\nwhere b(ρ) is the prior over assignments (uniform over the 60 possible assignments), I_i(h, ρ) is the information set implied by public history h and assignment ρ, and the product is the likelihood of playing to h given each player's implied information set. 
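A toy version of this belief update, including the logical-consistency indicator from Eq. (1), might look like the sketch below. The code and names are ours, not DeepRole's; in the full game the prior ranges over the 60 role assignments, and the consistency check shown is the mission-failure deduction used as an example in the text.

```python
import numpy as np

def consistent_with_fails(spies_on_team, n_fails):
    """Deductive check for one mission: only Spies can play fail, so a role
    assignment is impossible if it puts fewer Spies on the mission than the
    number of observed fail cards."""
    return n_fails <= spies_on_team

def joint_belief(prior, reach_probs, consistent):
    """Joint posterior over role assignments rho, a sketch of Eq. (1).

    prior[k]          - prior probability of assignment rho_k
    reach_probs[i][k] - pi_i(I_i(h, rho_k)): player i's probability of playing
                        to h under the information set implied by rho_k
    consistent[k]     - False if h logically contradicts rho_k; this plays the
                        role of the (1 - 1{h |- not rho}) indicator
    """
    likelihood = np.prod(np.asarray(reach_probs, dtype=float), axis=0)
    post = np.asarray(prior, dtype=float) * likelihood * np.asarray(consistent, dtype=float)
    z = post.sum()
    return post / z if z > 0 else post
```

Zeroing inconsistent assignments before normalizing is exactly what keeps impossible role assignments from influencing the value and regret calculations.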
A problem is that this likelihood can put positive mass on assignments that are impossible given the history. This arises because vector-form CFR algorithms can only compute likelihoods for each player independently rather than jointly. For instance, consider two players that went on a failing mission. In the information sets implied by the ρ where they are both Resistance, each player is assumed to have passed the mission. However, this is logically inconsistent with the history, as one of them must have played fail. To address this, the indicator term (1 − 1{h ⊢ ¬ρ}) zeros the probability of any ρ that is logically inconsistent with the public game tree h. This zeroing removes any impact these impossible outcomes would have had on the value and regret calculations in CFR (line 20 in Alg. 2).\n\n3.2 Value network\n\nThe enhanced vector-form CFR cannot be run on the full public game tree of Avalon (or any real hidden role game). This is also the case for games like poker, so CFR-based poker systems [6, 27] rely on action abstraction and state abstraction to reduce the size of the game tree. However, actions in Avalon are not obviously related to each other. Betting 105 chips in poker is strategically similar to betting 104 chips, but voting up a mission in Avalon is distinct from voting it down. The size of Avalon's game tree does not come from the number of available actions, but rather from the number of players. 
Figure 2: DeepRole neural network architecture used to limit game tree depth. Tables (black headers) show example inputs. The uppercase characters represent the different roles: (R)esistance, (S)py, (M)erlin, (A)ssassin. The outputs are the probability-weighted value for each player in each of their information sets. While there is only one information set for Resistance (since they only know their own role), there are multiple for each of the other role types. \"M (2,3)\" should be read as Merlin who sees players 2 and 3 as Spy, and \"S (4)\" should be read as Spy who sees player 4 as Assassin.\n\nFurthermore, since until now Avalon has only received limited attention, there are no developed hand-crafted state abstractions available either (although see [5] for how these could be learned). We follow the general approach taken by [27], using deep neural networks to limit the size of the game tree that we traverse (lines 14-16 in Alg. 
1 in Appendix A).\n\nWe first partition the Avalon public game tree into individually solvable parts, segmented by a proposal for every possible number of succeeded and failed missions (white circles on the left side of Figure 1). This yields 45 neural networks. Each h corresponding to a proposal is mapped to one of these 45 networks. These networks take in a tuple θ ∈ Θ, θ = (i, b), where i is the proposing player and b is the posterior belief at that position in the game tree. Θ is the set of all possible game situations. The value networks are trained to predict the probability-weighted value of each information set (Figure 2).\n\nUnlike in DeepStack, our networks calculate the non-counterfactual (i.e., normal) values for every information set I for each player. This is because our joint belief representation loses the individual contribution of each player's likelihood, making it impossible to calculate a counterfactual. The value V_i(I) for private information I for player i can be written as:\n\nV_i(I) = π^σ_i(I) Σ_{h∈I} π^σ_{-i}(h) Σ_{z∈Z} π^σ(h, z) u_i(z) = π^σ_i(I) CFV_i(I)\n\nwhere players play according to a strategy σ. Since we maintain π^σ_i(I) during planning, we can convert the values produced by the network to the counterfactual values needed by CFR (line 15 in Alg. 2).\n\nValue network architecture. While it is possible to estimate these values using a generic feed-forward architecture, it may cause lower sample efficiency, require longer training time, or fail to achieve a low loss. We design an interpretable custom neural network architecture that takes advantage of restrictions imposed by the structure of many hidden role games. Our network feeds a one-hot encoded vector of the proposer player i and the belief vector b into two fully-connected hidden layers of 80 ReLU units. 
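As an aside, the conversion noted above (recovering the counterfactual values CFR needs from the network's probability-weighted outputs) amounts to dividing out the player's own reach probability; a toy numpy sketch, with our own names:

```python
import numpy as np

def to_counterfactual(v, own_reach, eps=1e-12):
    """Recover CFV_i(I) from the identity V_i(I) = pi_i(I) * CFV_i(I).

    v         - probability-weighted values V_i(I), one per information set
    own_reach - pi_i(I), player i's own reach probabilities, which
                vector-form CFR maintains during planning
    """
    v = np.asarray(v, dtype=float)
    own_reach = np.asarray(own_reach, dtype=float)
    # Where the player cannot reach I at all, the value contributes nothing,
    # so report 0 instead of dividing by zero.
    return np.where(own_reach > eps, v / np.maximum(own_reach, eps), 0.0)
```
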
These feed into a fully-connected win probability layer with sigmoid activation. This layer is designed to take into account the specific structure of V, respecting the binary nature of payoffs in Avalon (players can only win or lose). It explicitly represents the probability of a Resistance win, w = P(win|ρ), for each assignment ρ.\n\nUsing these probabilities, we then calculate the V_i(I) for each player and information set, constraining the network's outputs to sound values. To do this calculation, for each player i, win probabilities are first converted to expected values u_i ⊙ w + (−u_i) ⊙ (1 − w), elementwise over assignments, where u_i is i's payoff in each ρ if the Resistance win. This is then turned into the probability-weighted value of each information set, which is used and produced by CFR: V_i = M_i[(u_i ⊙ w + (−u_i) ⊙ (1 − w)) ⊙ b], where M_i is a (15 × 60) multi-one-hot matrix mapping each ρ to player i's information set, and b is the belief over roles passed to the network. This architecture is fully differentiable and is trained via back-propagation. A diagram and description of the full network is shown in Figure 2.\n\nFigure 3: DeepRole generalization and sample efficiency. (left) Generalization error on held-out samples averaged across the 45 neural networks. (right) Generalization error as a function of training data for the first deep value network (averaged over N=5 runs, intervals are SE).\n\nFigure 4: Comparing the expected win rate of DeepRole with other agents. The x-axis shows how many of the first four agents are DeepRole. The y-axis shows the expected win rate for the fifth agent if they played as DeepRole or the benchmark. Standard errors are smaller than the size of the dots. (top) Combined expected win rate. (middle) Spy-only win rate. (bottom) Resistance-only win rate.\n\nSee Appendix B and Alg. 
3 for details of the network training algorithm, procedure, parameters, and compute details.\n\nThe win probability layer enabled training with less training data and better generalization. When compared to a lesioned neural network that replaced the win probability layer with a zero-sum layer (like DeepStack), the average held-out loss per network was higher and more training data was required (Figure 3).\n\n4 Empirical game-theoretic analysis\n\nThe possibility of playing with diverse teammates who may be playing conflicting equilibrium strategies, out-of-equilibrium strategies, or even human strategies makes evaluation outside of two-player zero-sum games challenging. In two-player zero-sum games, all Nash equilibria are minimally exploitable, so algorithms that converge to Nash are provably optimal in that sense. However, evaluating 3+ player interactions requires considering multiple equilibria and metrics that account for coordinating with teammates. Further, Elo and its variants such as TrueSkill are only good measures of performance when relative skill is transitive, but have no predictive power in intransitive games (e.g., rock-paper-scissors) [40]. Thus, we turn to methods for empirical game-theoretic analysis, which require running agents against a wide variety of benchmark opponents [40, 42].\n\nWe compare the performance of DeepRole to 5 alternative baseline agents: CFR – an agent trained with MCCFR [22] over a hand-crafted game abstraction; LogicBot – a hand-crafted strategy that uses logical deduction; RandomBot – plays randomly; ISMCTS – a single-observer ISMCTS algorithm found in [11, 12, 35]; MOISMCTS – a multiple-observer variant of ISMCTS [43]. Details for these agents are found in Appendix C.\n\nFigure 5: Empirical game-theoretic evaluation. Arrow size and darkness are proportional to the size of the gradient. (left) DeepRole against hand-coded agents. (center) DeepRole compared to systems without our algorithmic improvements. (right) DeepRole against itself but with CFR iterations equal to the number next to the game.\n\nWe first investigated the conditional win rates for each baseline agent playing against DeepRole. We consider the effect of adding a 5th agent to a preset group of agents and compare DeepRole's win rate as the 5th agent with the win rate of a baseline strategy as the 5th agent in that same preset group. For each preset group (0-4 DeepRole agents) we simulated >20K games.\n\nFigure 4 shows the win probability of each of these bots when playing DeepRole, both overall and when conditioning on the role (Spy or Resistance). In most cases, adding a 5th DeepRole player yielded a higher win rate than adding any of the other bots. This was true in every case we tested when there were at least two other DeepRole agents playing. Thus, from an evolutionary perspective, DeepRole is robust to invasion from all of these agent types and in almost all cases outperforms the baselines even when it is the minority.\n\nTo formalize these intuitions, we construct a meta-game where players select a mixed meta-strategy over agent types rather than actions. Figure 5 shows the gradient of the replicator dynamic in these meta-games [40, 42]. The replicator dynamic gradient describes the direction in which a player playing meta-strategy σ can update their strategy for maximal gain, assuming other players are also playing σ. Both vector field sinks and points with zero gradient correspond to Nash equilibria in the meta-game.\n\nFirst, we compare DeepRole to the two hand-crafted strategies (LogicBot and CFR), and show that purely playing DeepRole is the equilibrium with the largest basin of attraction. 
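The replicator-dynamic gradient plotted in these meta-games has the standard form below. This is a generic sketch with our own names; the rock-paper-scissors payoff matrix in the usage example is illustrative, not the paper's measured meta-game.

```python
import numpy as np

def replicator_gradient(x, payoff):
    """Replicator dynamic x_k' = x_k * (f_k(x) - x . f(x)) for a symmetric
    meta-game, where payoff[k][j] is the payoff of pure meta-strategy k
    against meta-strategy j and x is the population mixture."""
    fitness = payoff @ x   # expected payoff of each pure meta-strategy
    avg = x @ fitness      # population-average payoff
    return x * (fitness - avg)
```

Sinks of this vector field, and any point where the gradient is zero, correspond to Nash equilibria of the meta-game, which is how the basins of attraction in Figure 5 are read.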
The ISMCTS agents are too computationally demanding to run in these contests, but in a pairwise evaluation, playing DeepRole is the sole equilibrium.\n\nNext, we test whether our innovations make DeepRole a stronger agent. We compare DeepRole to two lesioned alternatives. The first, DeepRole (No Win Layer), uses a zero-sum layer instead of our win probability layer in the neural network. Otherwise it is identical to DeepRole. In Figure 3, we saw that this neural network architecture did not generalize as well. We also compare to a version of DeepRole that does not include the logical deduction step in Equation 1 and also uses the zero-sum layer instead of the win probability layer (No Win Layer, No Deduction). The agent without deduction is the weakest, and the full DeepRole agent is the strongest, showing that our innovations lead to enhanced performance.\n\nFinally, we looked at the impact of CFR solving iterations during play (thinking time). More iterations make each move slower but may yield a better strategy. When playing DeepRole variants with 10, 30, and 100 iterations against each other, each variant was robust to invasion by the others, but the more iterations used, the larger the basin of attraction (Figure 5).\n\n5 Human evaluation\n\nPlaying with and against human players is a strong test of generalization. First, humans are likely to play a diverse set of strategies that will be challenging for DeepRole to respond to. During training time, it never learns from any human data, and so its ability to play with people must be the result of playing a strategy that generalizes to human play. Importantly, even if human players take the DeepRole neural networks \"out of distribution\", the online CFR iterations can still enable smart play in novel situations (as with MCTS in AlphaGo).\n\nFigure 6: Belief dynamics over the course of the game. (left) DeepRole's posterior belief in the ground truth Spy role assignments as a Resistance player with four humans. (right) DeepRole's posterior belief of the true Spy team while observing all-human games from the perspective of a Resistance player.\n\nHumans played with DeepRole on the popular online platform ProAvalon.com (see Appendix F for commentated games and brief descriptions of DeepRole's \"play style\"). In the 2189 mixed human/agent games we collected, all humans knew which players were human and which were DeepRole. There were no restrictions on chat usage for the human players, but DeepRole did not say anything and did not process sent messages. Table 1 shows the win rate of DeepRole compared to humans. On the left, we can see that DeepRole is robust; when four of the players were DeepRole, a player would do better playing the DeepRole strategy than playing as an average human, regardless of team. More interestingly, when considering a game of four humans, the humans were better off playing with the DeepRole agent as a teammate than with another human, again regardless of team. Although we have no way of quantifying the absolute skill level of these players, among this pool of avid Avalon players, DeepRole acted as both a superior cooperator and competitor – it cooperated with its teammates to compete against the others.\n\nFinally, DeepRole's interpretable belief state can be used to gain insights into play. In Figure 6 we show DeepRole's posterior probability estimate of the true set of Spies when playing as a Resistance player. When DeepRole played as the sole agent among four humans (left plot), the belief state rapidly converged to the ground truth in the situations where three missions passed, even though it had never been trained on human data. If three missions failed, it was often because it failed to learn correctly. 
Next, we analyze the belief state when fed actions and observations from the perspective of\na human resistance player playing against a group of humans (yoked actions). As shown in Figure 6,\nthe belief estimates increase as the game progresses, indicating DeepRole can make correct inferences\neven while just observing the game. The belief estimate converges to the correct state faster in games\nwith three passes, presumably because the data in these games was more informative to all players.\n\n6 Discussion\n\nWe developed a new algorithm for multi-agent games called DeepRole which effectively collaborates\nand competes with a diverse set of agents in The Resistance: Avalon. DeepRole surpassed both\nhumans and existing machines in both simulated contests against other agents and a real-world\nevaluation with human Avalon players. These results are achieved through the addition of a deductive\nreasoning system to vector-based CFR and a win probability layer in deep value networks for depth-\n\nAdding DeepRole or a Human\n\nto 4 DeepRole\n\nto 4 Human\n\nOverall\n\nResistance\nSpy\n\n+DeepRole\n\n+Human\n\n+DeepRole\n\nWin Rate (%)\n\n(N)\n\nWin Rate (%)\n\n(N)\n\nWin Rate (%)\n\n46.9 \u00b1 0.6\n34.4 \u00b1 0.7\n65.6 \u00b1 0.9\n\n(7500)\n(4500)\n(3000)\n\n38.8 \u00b1 1.3\n25.6 \u00b1 1.5\n57.8 \u00b1 2.0\n\n(1451)\n(856)\n(595)\n\n60.0 \u00b1 5.5\n51.4 \u00b1 8.2\n67.4 \u00b1 7.1\n\n(N)\n\n(80)\n(37)\n(43)\n\n+Human\n\nWin Rate (%)\n\n(N)\n\n48.1 \u00b1 1.2\n40.3 \u00b1 1.5\n59.7 \u00b1 1.9\n\n(1675)\n(1005)\n(670)\n\nTable 1: Win rates for humans playing with and against the DeepRole agent. When a human replaces\na DeepRole agent in a group of 5 DeepRole agents, the win rate goes down for the team that the\nhuman joins. When a DeepRole agent replaces a human in a group of 5 humans, the win rate goes up\nfor the team the DeepRole agent joins. Averages \u00b1 standard errors of the mean calculated over a\nbinary outcome.\n\n8\n\n\flimited search. 
Taken together, these innovations allow DeepRole to scale to the full game of Avalon, allowing CFR agents to play hidden role games for the first time. In future work, we will investigate whether the interpretable belief state of DeepRole could also be used to ground language, enabling better coordination through communication.
Looking forward, hidden role games are an exciting opportunity for developing AI agents. They capture the ambiguous nature of day-to-day interactions with others and go beyond the strictly adversarial nature of two-player zero-sum games. Only by studying 3+ player environments can we start to capture some of the richness of human social interactions, including alliances, relationships, teams, and friendships [32].

Acknowledgments
We thank Victor Kuo and ProAvalon.com for help integrating DeepRole with human players online. We also thank Noam Brown, Murray Campbell, and Michael Wellman for helpful discussions and comments. This work was supported by the Harvard Data Science Initiative, CRCS, and MBB, The Future of Life Institute, DARPA Ground Truth, and the Center for Brains, Minds and Machines (NSF STC award CCF-1231216).

References
[1] Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. arXiv preprint arXiv:1709.08071, 2017.

[2] Robert Axelrod. The Evolution of Cooperation. Basic Books, 1985.

[3] Chris L Baker, Julian Jara-Ettinger, Rebecca Saxe, and Joshua B Tenenbaum. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1:0064, 2017.

[4] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.

[5] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. arXiv preprint arXiv:1811.00164, 2018.

[6] Noam Brown and Tuomas Sandholm.
Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 2017.

[7] Colin Camerer. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, 2003.

[8] Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, pages 861–898, 2004.

[9] Rodrigo Canaan, Christoph Salge, Julian Togelius, and Andy Nealen. Leveling the playing field: Fairness in AI versus human game benchmarks. arXiv preprint arXiv:1903.07008, 2019.

[10] Gokul Chittaranjan and Hayley Hung. Are you a werewolf? Detecting deceptive roles and outcomes in a conversational role-playing game. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5334–5337. IEEE, 2010.

[11] Peter I Cowling, Edward J Powley, and Daniel Whitehouse. Information set Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 4(2):120–143, 2012.

[12] Peter I Cowling, Daniel Whitehouse, and Edward J Powley. Emergent bluffing and inference with Monte Carlo tree search. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 114–121. IEEE, 2015.

[13] Don Eskridge. The Resistance: Avalon, 2012.

[14] Joseph Farrell and Matthew Rabin. Cheap talk. Journal of Economic Perspectives, 10(3):103–118, 1996.

[15] Adam Galinsky and Maurice Schweitzer. Friend and Foe: When to Cooperate, When to Compete, and How to Succeed at Both. Random House, 2015.

[16] Joseph Henrich. The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter. Princeton University Press, 2015.

[17] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al.
Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.

[18] Michael Johanson. Measuring the size of large no-limit poker games. arXiv preprint arXiv:1302.7008, 2013.

[19] Michael Johanson, Nolan Bard, Marc Lanctot, Richard Gibson, and Michael Bowling. Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 837–846. International Foundation for Autonomous Agents and Multiagent Systems, 2012.

[20] Michael Johanson, Kevin Waugh, Michael Bowling, and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

[21] Max Kleiman-Weiner, Mark K Ho, Joseph L Austerweil, Michael L Littman, and Joshua B Tenenbaum. Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, 2016.

[22] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems, pages 1078–1086, 2009.

[23] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Julien Perolat, David Silver, Thore Graepel, et al. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4193–4206, 2017.

[24] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473.
International Foundation for Autonomous Agents and Multiagent Systems, 2017.

[25] Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.

[26] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In ICML, volume 94, pages 157–163, 1994.

[27] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

[28] Martin A Nowak. Five rules for the evolution of cooperation. Science, 314(5805):1560–1563, 2006.

[29] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[30] Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pages 3646–3655, 2017.

[31] David G Rand and Martin A Nowak. Human cooperation. Trends in Cognitive Sciences, 17(8):413, 2013.

[32] Michael Shum, Max Kleiman-Weiner, Michael L Littman, and Joshua B Tenenbaum. Theory of minds: Understanding behavior in groups through inverse planning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), 2019.

[33] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[34] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al.
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

[35] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.

[36] DJ Strouse, Max Kleiman-Weiner, Josh Tenenbaum, Matt Botvinick, and David J Schwab. Learning to share and hide intentions using information regularization. In Advances in Neural Information Processing Systems, pages 10270–10281, 2018.

[37] Duane Szafron, Richard Gibson, and Nathan Sturtevant. A parameterized family of equilibrium profiles for three-player Kuhn poker. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 247–254. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[38] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[39] Michael Tomasello. A Natural History of Human Thinking. Harvard University Press, 2014.

[40] Karl Tuyls, Julien Perolat, Marc Lanctot, Joel Z Leibo, and Thore Graepel. A generalised method for empirical game theoretic analysis. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 77–85. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

[41] Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, and Joshua B Tenenbaum. Help or hinder: Bayesian models of social goal inference. In Advances in Neural Information Processing Systems, pages 1874–1882, 2009.

[42] Michael P Wellman. Methods for empirical game-theoretic analysis. In AAAI, pages 1552–1556, 2006.

[43] Daniel Whitehouse. Monte Carlo tree search for games with hidden information and uncertainty.
PhD thesis, University of York, 2014.

[44] Michael Wunder, Michael Kaisers, John Robert Yaros, and Michael Littman. Using iterated reasoning to predict opponent strategies. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 593–600. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

[45] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pages 1729–1736, 2008.