{"title": "On the Utility of Learning about Humans for Human-AI Coordination", "book": "Advances in Neural Information Processing Systems", "page_first": 5174, "page_last": 5185, "abstract": "While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves. Agents that assume their partner to be optimal or similar to them can converge to coordination protocols that fail to understand and be understood by humans. To demonstrate this, we introduce a simple environment that requires challenging coordination, based on the popular game Overcooked, and learn a simple model that mimics human play. We evaluate the performance of agents trained via self-play and population-based training. These agents perform very well when paired with themselves, but when paired with our human model, they are significantly worse than agents designed to play with the human model. An experiment with a planning algorithm yields the same conclusion, though only when the human-aware planner is given the exact human model that it is playing with. A user study with real humans shows this pattern as well, though less strongly. Qualitatively, we find that the gains come from having the agent adapt to the human's gameplay. Given this result, we suggest several approaches for designing agents that learn about humans in order to better coordinate with them. Code is available at https://github.com/HumanCompatibleAI/overcooked_ai.", "full_text": "On the Utility of Learning about Humans\n\nfor Human-AI Coordination\n\nMicah Carroll\nUC Berkeley\n\nmdc@berkeley.edu\n\nRohin Shah\nUC Berkeley\n\nrohinmshah@berkeley.edu\n\nMark K. Ho\n\nPrinceton University\nmho@princeton.edu\n\nThomas L. Grif\ufb01ths\nPrinceton University\n\nSanjit A. Seshia\n\nUC Berkeley\n\nPieter Abbeel\nUC Berkeley\n\nAnca Dragan\nUC Berkeley\n\nAbstract\n\nWhile we would like agents that can coordinate with humans, current algorithms\nsuch as self-play and population-based training create agents that can coordinate\nwith themselves. Agents that assume their partner to be optimal or similar to them\ncan converge to coordination protocols that fail to understand and be understood\nby humans. To demonstrate this, we introduce a simple environment that requires\nchallenging coordination, based on the popular game Overcooked, and learn a\nsimple model that mimics human play. We evaluate the performance of agents\ntrained via self-play and population-based training. These agents perform very\nwell when paired with themselves, but when paired with our human model, they\nare signi\ufb01cantly worse than agents designed to play with the human model. An\nexperiment with a planning algorithm yields the same conclusion, though only\nwhen the human-aware planner is given the exact human model that it is playing\nwith. A user study with real humans shows this pattern as well, though less strongly.\nQualitatively, we \ufb01nd that the gains come from having the agent adapt to the\nhuman\u2019s gameplay. Given this result, we suggest several approaches for designing\nagents that learn about humans in order to better coordinate with them. Code is\navailable at https://github.com/HumanCompatibleAI/overcooked_ai.\n\n1\n\nIntroduction\n\nAn increasingly effective way to tackle two-player games is to train an agent to play with a set of\nother AI agents, often past versions of itself. 
This powerful approach has resulted in impressive\nperformance against human experts in games like Go [33], Quake [20], Dota [29], and Starcraft [34].\nSince the AI agents never encounter humans during training, when evaluated against human experts,\nthey are undergoing a distributional shift. Why doesn\u2019t this cause the agents to fail? We hypothesize\nthat it is because of the competitive nature of these games. Consider the canonical case of a two-player\nzero-sum game, as shown in Figure 1 (left). When humans play the minimizer role but take a branch\nin the search tree that is suboptimal, this only increases the maximizer\u2019s score.\nHowever, not all settings are competitive. Arguably, one of the main goals of AI is to generate agents\nthat collaborate, rather than compete, with humans. We would like agents that help people with\nthe tasks they want to achieve, augmenting their capabilities [10, 6]. Looking at recent results, it\nis tempting to think that self-play-like methods extend nicely to collaboration: AI-human teams\nperformed well in Dota [28] and Capture the Flag [20]. However, in these games, the advantage may\ncome from the AI system\u2019s individual ability, rather than from coordination with humans. We claim\nthat in general, collaboration is fundamentally different from competition, and will require us to go\nbeyond self-play to explicitly account for human behavior.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: The impact of incorrect expectations of optimality. Left: In a competitive game, the agent plans for\nthe worst case. AI expects that if it goes left, H will go left. So, it goes right where it expects to get 3 reward\n(since H would go left). When H suboptimally goes right, AI gets 7 reward: more than it expected. Right: In a\ncollaborative game, AI expects H to coordinate with it to choose the best option, and so it goes left to obtain\nthe 8 reward. However, when H suboptimally goes left, AI only gets 1 reward: the worst possible outcome!\n\nConsider the canonical case of a common-payoff game, shown in Figure 1 (right): since both agents\nare maximizers, a mistake on the human\u2019s side is no longer an advantage, but an actual problem,\nespecially if the agent did not anticipate it. Further, agents that are allowed to co-train might converge\nonto opaque coordination strategies. For instance, agents trained to play the collaborative game\nHanabi learned to use the hint \u201cred\u201d or \u201cyellow\u201d to indicate that the newest card is playable, which\nno human would immediately understand [12]. When such an agent is paired with a human, it will\nexecute the opaque policy, which may fail spectacularly when the human doesn\u2019t play their part.\nWe thus hypothesize that in true collaborative scenarios agents trained to play well with other AI\nagents will perform much more poorly when paired with humans. We further hypothesize that\nincorporating human data or models into the training process will lead to signi\ufb01cant improvements.\nTo test this hypothesis, we introduce a simple environment based on the popular game Overcooked\n[13], which is speci\ufb01cally designed to be challenging for humans to coordinate in (Figure 2).We\nuse this environment to compare agents trained with themselves to agents trained with a learned\nhuman model. For the former, we consider self-play [33], population-based training [19], and coupled\nplanning with replanning. 
For the latter, we collect human-human data and train a behavior cloning human model; we then train a reinforcement learning agent and a planning agent to collaborate well with this model. We evaluate how well these agents collaborate with a held-out "simulated" human model (henceforth the "proxy human"), trained on a different data set, as well as in a user study.

Figure 2: Our Overcooked environment. The goal is to place three onions in a pot (dark grey), take out the resulting soup on a plate (white) and deliver it (light grey), as many times as possible within the time limit. H, the human, is close to a dish dispenser and a cooked soup, and AI, the agent, is facing a pot that is not yet full. The optimal strategy is for H to put an onion in the partially full pot, and for AI to put the existing soup in a dish and deliver it. This is due to the layout structure, which gives H an advantage in placing onions in pots and AI an advantage in delivering soups. However, we can guess that H plans to pick up a plate to deliver the soup. If AI nonetheless expects H to be optimal, it will expect H to turn around to get the onion, and will continue moving towards its own dish dispenser, leading to a coordination failure.

We find that the agents which did not leverage human data in training perform very well with themselves, and drastically worse when paired with the proxy human. This is not explained only by human suboptimality, because the agent also significantly underperforms a "gold standard" agent that has access to the proxy human. The agent trained with the behavior-cloned human model is drastically better, showing the benefit of having a relatively good human model. We found the same trends even when we paired these agents with real humans, for whom our model has much poorer predictive power but nonetheless helps the agent be a better collaborator. We also experimented with using behavior cloning directly for the agent's policy, and found that it also outperforms self-play-like methods, but still underperforms relative to methods that leverage planning (with respect to the actual human model) or reinforcement learning with a proxy human model.

Overall, we learned a few important lessons in this work. First, our results showcase the importance of accounting for real human behavior during training: even using a behavior cloning model prone to failures of distributional shift seems better than treating the human as if they were optimal or similar to the agent. Second, leveraging planning or reinforcement learning to maximize the collaborative reward, again even when using such a simple human model, seems to already be better than vanilla imitation. These results are a cautionary tale against relying on self-play or vanilla imitation for collaboration, and advocate for methods that leverage models of human behavior, actively improve them, or even use them as part of a population to be trained with.

2 Related Work

Human-robot interaction (HRI). The field of human-robot interaction has already embraced our main point that we shouldn't model the human as optimal. Much work focuses on achieving collaboration by planning and learning with (non-optimal) models of human behavior [26, 21, 31], as well as on specific properties of robot behavior that aid collaboration [2, 14, 9].
However, to our knowledge, ours is the first work to analyze the optimal human assumption in the context of deep reinforcement learning, and to test potential solutions such as population-based training (PBT). Choudhury et al. [7] is particularly related to our work: it evaluates whether it is useful to learn a human model using deep learning, compared to a more structured "theory of mind" human model. We are instead evaluating how useful it is to have a human model at all.

Multiagent reinforcement learning. Deep reinforcement learning has also been applied to multiagent settings, in which multiple agents take actions in a potentially non-competitive game [24, 15]. Some work tries to teach agents collaborative behaviors [22, 18] in environments where rewards are not shared across agents. The Bayesian Action Decoder [12] learns communicative policies that allow two agents to collaborate, and has been applied to the cooperative card game Hanabi. However, most multiagent RL research focuses on AI-AI interaction, rather than the human-AI setting.

Lerer and Peysakhovich [23] start from the same observation that self-play will perform badly in general-sum games, and aim to do better given some data from the agents that their agent will be evaluated with at test time (analogous to our human data). However, they assume that the test-time agents have settled into an equilibrium that their agent only needs to replicate, and so they train their agent with Observational Self-Play (OSP): a combination of imitation learning and MARL. In contrast, we allow for the case where humans do not play an equilibrium strategy (because they are suboptimal), and so we only use imitation learning to create human models, and train our agent using pure RL.

Imitation learning. Imitation learning [1, 16] aims to train agents that mimic the policies of a demonstrator. In our work, we use behavior cloning [30] to imitate demonstrations collected from humans, in order to learn human models to collaborate with. However, the main focus of our work is in the design of agents that can collaborate with these models, and not the models themselves.

3 Preliminaries

Multi-agent MDP. A multiagent Markov decision process [5] is defined by a tuple ⟨S, α, {A_i}_{i∈α}, T, R⟩. S is a finite set of states, and R : S → ℝ is a real-valued reward function. α is a finite set of agents; A_i is the finite set of actions available to agent i. T : S × A_1 × ··· × A_n × S → [0, 1] is a transition function that determines the next state given all of the agents' actions.

Figure 3: Experiment layouts. From left to right: Cramped Room presents low-level coordination challenges: in this shared, confined space it is very easy for the agents to collide. Asymmetric Advantages tests whether players can choose high-level strategies that play to their strengths, as illustrated in Figure 2. In Coordination Ring, players must coordinate to travel between the bottom left and top right corners of the layout. Forced Coordination instead removes collision coordination problems, and forces players to develop a high-level joint strategy, since neither player can serve a dish by themselves. Counter Circuit involves a non-obvious coordination strategy, where onions are passed over the counter to the pot, rather than being carried around.

Behavior cloning.
One of the simplest approaches to imitation learning is given by behavior cloning,\nwhich learns a policy from expert demonstrations by directly learning a mapping from observations\nto actions with standard supervised learning methods [4]. Since we have a discrete action space, this\nis a traditional classi\ufb01cation task. Our model takes an encoding of the state as input, and outputs a\nprobability distribution over actions, and is trained using the standard cross-entropy loss function.\nPopulation Based Training. Population Based Training (PBT) [19] is an online evolutionary\nalgorithm which periodically adapts training hyperparameters and performs model selection. In\nmultiagent RL, PBT maintains a population of agents, whose policies are parameterized by neural\nnetworks, and trained with a DRL algorithm. In our case, we use Proximal Policy Optimization\n(PPO) [32]. During each PBT iteration, pairs of agents are drawn from the population, trained for a\nnumber of timesteps, and have their performance recorded. At the end of each PBT iteration, the\nworst performing agents are replaced with copies of the best agents with mutated hyperparameters.\n\n4 Environment and Agents\n\n4.1 Overcooked\n\nTo test our hypotheses, we would like an environment in which coordination is challenging, and where\ndeep RL algorithms work well. Existing environments have not been designed to be challenging for\ncoordination, and so we build a new one based on the popular video game Overcooked [13], in which\nplayers control chefs in a kitchen to cook and serve dishes. Each dish takes several high-level actions\nto deliver, making strategy coordination dif\ufb01cult, in addition to the challenge of motion coordination.\nWe implement a simpli\ufb01ed version of the environment, in which the only objects are onions, dishes,\nand soups (Figure 2). Players place 3 onions in a pot, leave them to cook for 20 timesteps, put the\nresulting soup in a dish, and serve it, giving all players a reward of 20. The six possible actions\nare: up, down, left, right, noop, and \"interact\", which does something based on the tile the player is\nfacing, e.g. placing an onion on a counter. Each layout has one or more onion dispensers and dish\ndispensers, which provide an unlimited supply of onions and dishes respectively. Most of our layouts\n(Figure 3) were designed to lead to either low-level motion coordination challenges or high-level\nstrategy challenges.\nAgents should learn how to navigate the map, interact with objects, drop the objects off in the right\nlocations, and \ufb01nally serve completed dishes to the serving area. All the while, agents should be\naware of what their partner is doing and coordinate with them effectively.\n\n4.2 Human models\n\nWe created a web interface for the game, from which we were able to collect trajectories of humans\nplaying with other humans for the layouts in Figure 3. 
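To make this setup concrete, the following is a minimal behavior-cloning sketch in the spirit of Section 3. The names, network sizes, and data format (each trajectory stored as a list of (featurized state, action index) pairs for a single player) are illustrative assumptions, not our actual implementation.

# Minimal behavior-cloning sketch (hypothetical names; assumes each trajectory
# is a list of (featurized_state, action_index) pairs for a single player).
import torch
import torch.nn as nn

N_FEATURES = 64   # assumed size of the manual state featurization
N_ACTIONS = 6     # up, down, left, right, noop, interact

class BCPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),  # logits over the discrete actions
        )

    def forward(self, states):
        return self.net(states)

def train_bc(trajectories, epochs=50, lr=1e-3):
    # Flatten trajectories into a supervised classification dataset.
    states = torch.tensor([s for traj in trajectories for (s, a) in traj],
                          dtype=torch.float32)
    actions = torch.tensor([a for traj in trajectories for (s, a) in traj],
                           dtype=torch.long)
    model = BCPolicy()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # standard cross-entropy classification loss
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(states), actions)
        loss.backward()
        opt.step()
    return model

At evaluation time one can sample from the resulting action distribution, or take its argmax when a deterministic model is required (as in the planning experiments described below).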
In preliminary experiments we found that human models learned through behavior cloning performed better than ones learned with Generative Adversarial Imitation Learning (GAIL) [16], so we decided to use the former throughout our experiments. To incentivize generalization in spite of the scarcity of human data, we perform behavior cloning over a manually designed featurization of the underlying game state.

For each layout we gathered ~16 human-human trajectories (for a total of 18k environment timesteps). We partition the joint trajectories into two subsets, and split each trajectory into two single-agent trajectories. For each layout and each subset, we learn a human model through behavior cloning. We treat one model, BC, as a human model we have access to, while the second model HProxy is treated as the ground truth human proxy to evaluate against at test time. On the first three layouts, when paired with themselves, most of these models perform similarly to an average human. Performance is significantly lower for the last two layouts (Forced Coordination and Counter Circuit).

The learned models sometimes got "stuck": they would perform the same action over and over again (such as walking into each other), without changing the state. We added a simple rule-based mechanism to get the agents unstuck by taking a random action. For more details see Appendix A.

4.3 Agents designed for self-play

We consider two DRL algorithms that train agents designed for self-play, and one planning algorithm.

DRL algorithms. For DRL we consider PPO trained in self-play (SP) and PBT. Since PBT involves training agents to perform well with a population of other agents, we might expect them to be more robust to potential partners compared to agents trained via self-play, and so PBT might coordinate better with humans. For PBT, all of the agents in the population used the same network architecture.

Planning algorithm. Working in the simplified Overcooked environment enables us to also compute near-optimal plans for the joint planning problem of delivering dishes. This establishes a baseline for performance and coordination, and is used to perform coupled planning with replanning. With coupled planning, we compute the optimal joint plan for both agents. However, rather than executing the full plan, we only execute the first action of the plan, and then replan the entire optimal joint plan after we see our partner's action (since it may not be the same as the action we computed for them). To achieve this, we pre-compute optimal joint motion plans for all possible start and goal locations for each agent. We then create high-level actions such as "get an onion", "serve the dish", etc. and use the motion plans to compute the cost of each action. We use A* search to find the optimal joint plan in this high-level action space. This planner does make some simplifying assumptions, detailed in Appendix E, making it near-optimal instead of optimal.

4.4 Agents designed for humans

We seek the simplest possible solution to having an agent that actually takes advantage of the human model. We embed our learned human model BC in the environment, treating its choice of action as part of the dynamics. We then directly train a policy on this single-agent environment with PPO.
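A rough sketch of this construction is given below; the interface is hypothetical (the actual overcooked_ai code differs in its details), but it illustrates how the human model's action choice becomes part of the environment dynamics seen by the learning agent.

# Sketch of embedding the learned human model in the environment dynamics.
# All names here are hypothetical; the actual overcooked_ai interface differs.

class HumanModelEnv:
    """Single-agent view of the two-player game: the BC model plays the other chef."""

    def __init__(self, base_env, human_model, featurize):
        self.base_env = base_env        # two-player Overcooked MDP
        self.human_model = human_model  # e.g. a behavior-cloned policy
        self.featurize = featurize      # state featurization the human model expects
        self.state = None

    def reset(self):
        self.state = self.base_env.reset()
        return self.state

    def step(self, agent_action):
        # The human model's action choice is treated as part of the dynamics.
        human_action = self.human_model.act(self.featurize(self.state))
        self.state, reward, done = self.base_env.step((agent_action, human_action))
        return self.state, reward, done

# A standard single-agent DRL algorithm (PPO in our case) can then be run on
# this environment without modification, e.g.:
#   env = HumanModelEnv(overcooked_env, bc_model, featurize)
#   ppo_agent.train(env)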
We found empirically that the policies achieved best performance by initially training them in self-play, and then linearly annealing to training with the human model (see Appendix C).

For planning, we implemented a model-based planner that uses hierarchical A* search to act near-optimally, assuming access to the policy of the other player (see Appendix F). In order to preserve near-optimality we make our training and test human models deterministic, as dealing with stochasticity would be too computationally expensive.

5 Experiments in Simulation

We pair each agent with the proxy human model and evaluate the team performance.

Independent variables. We vary the type of agent used. We have agents trained with themselves: self-play (SP), population-based training (PBT), and coupled planning (CP); agents trained with the human model BC: reinforcement-learning-based (PPOBC) and planning-based (PBC); and the imitation agent BC itself. Each agent is paired with HProxy in each of the layouts in Figure 3.

Dependent measures. As good coordination between teammates is essential to achieve high returns in this environment, we use cumulative rewards over a horizon of 400 timesteps for our agents as a proxy for coordination ability. For all DRL experiments, we report average rewards across 100 rollouts and standard errors across 5 different seeds. To aid in interpreting these measures, we also compute a "gold standard" performance by training the PPO and planning agents not with BC, but with HProxy itself, essentially giving them access to the ground truth human they will be paired with.

(a) Comparison with agents trained in self-play.

(b) Comparison with agents trained via PBT.

Figure 4: Rewards over trajectories of 400 timesteps for the different agents (agents trained with themselves – SP or PBT – in teal, agents trained with the human model – PPOBC – in orange, and imitation agents – BC – in gray), with standard error over 5 different seeds, paired with the proxy human HProxy. The white bars correspond to what the agents trained with themselves expect to achieve, i.e. their performance when paired with themselves (SP+SP and PBT+PBT). First, these agents perform much worse with the proxy human than with themselves. Second, the PPO agent that trains with human data performs much better, as hypothesized. Third, imitation tends to perform somewhere in between the two other agents. The red dotted lines show the "gold standard" performance achieved by a PPO agent with direct access to the proxy model itself – the difference in performance between this agent and PPOBC stems from the inaccuracy of the BC human model with respect to the actual HProxy. The hashed bars show results with the starting position of the agents switched. This makes the most difference for asymmetric layouts such as Asymmetric Advantages or Forced Coordination.

Analysis. We present quantitative results for DRL in Figure 4. Even though SP and PBT achieve excellent performance in self-play, when paired with a human model they struggle to even meet the performance of the imitation agent. There is a large gap between the imitation agent and gold standard performance. Note that the gold standard reward is lower than self-play methods paired with themselves, due to human suboptimality affecting the highest possible reward the agent can get.
We then see that\nPPOBC outperforms the agents trained with themselves,\ngetting closer to the gold standard. This supports our hy-\npothesis that 1) self-play-like agents perform drastically\nworse when paired with humans, and 2) it is possible to\nimprove performance signi\ufb01cantly by taking the human\ninto account. We show this holds (for these layouts) even\nwhen using an unsophisticated, behavior-cloning based\nmodel of the human.\n\n6\n\nFigure 5: Comparison across planning\nmethods. We see a similar trend: cou-\npled planning (CP) performs well with itself\n(CP+CP) and worse with the proxy human\n(CP+HP roxy). Having the correct model of\nthe human (the dotted line) helps, but a bad\nmodel (PBC+HP roxy) can be much worse\nbecause agents get stuck (see Appendix G).\n\n\fFigure 6: Average rewards over 400 timesteps of agents paired with real humans, with standard error across\nstudy participants. In most layouts, the PPO agent that trains with human data (PPOBC, orange) performs better\nthan agents that don\u2019t model the human (SP and PBT, teal), and in some layouts signi\ufb01cantly so. We also include\nthe performance of humans when playing with other humans (gray) for information. Note that for human-human\nperformance, we took long trajectories and evaluated the reward obtained at 400 timesteps. In theory, humans\ncould have performed better by optimizing for short-term reward near the end of the 400 timesteps, but we\nexpect that this effect is small.\n\nDue to computational complexity, model-based planning was only feasible on two layouts. Figure 5\nshows the results on these layouts, demonstrating a similar trend. As expected, coupled planning\nachieves better self-play performance than reinforcement learning. But when pairing it with the\nhuman proxy, performance drastically drops, far below the gold standard. Qualitatively, we notice that\na lot of this drop seems to happen because the agent expects optimal motion, whereas actual human\nplay is much slower. Giving the planner access to the true human model and planning with respect to\nit is suf\ufb01cient to improve performance (the dotted line above PBC). However, when planning with\nBC but evaluating with Hproxy, the agent gets stuck in loops (see Appendix G).\nOverall, these results showcase the bene\ufb01t of getting access to a good human model \u2013 BC is in all\nlikelihood closer to HP roxy than to a real human. Next, we study to what extent the bene\ufb01t is still\nthere with a poorer model, i.e. when still using BC, but this time testing on real users.\n\n6 User Study\n\nDesign. We varied the AI agent (SP vs. PBT vs. PPOBC) and measured the average reward per\nepisode when the agent was paired with a human user. We recruited 60 users (38 male, 19 female, 3\nother, ages 20-59) on Amazon Mechanical Turk and used a between-subjects design, meaning each\nuser was only paired with a single AI agent. Each user played all 5 task layouts, in the same order that\nwas used when collecting human-human data for training. See Appendix H for more information.\nAnalysis. We present results in Figure 6. PPOBC outperforms the self-play methods in three of the\nlayouts, and is roughly on par with the best one of them in the other two. 
While the effect is not as\nstrong as in simulation, it follows the same trend, where PPOBC is overall preferable.\nAn ANOVA with agent type as a factor and layout and player index as covariates showed a signi\ufb01cant\nmain effect for agent type on reward (F (2, 224) = 6.49, p < .01), and the post-hoc analysis with\nTukey HSD corrections con\ufb01rmed that PPOBC performed signi\ufb01cantly better than SP (p = .01) and\nPBT (p < .01). This supports our hypothesis.\nIn some cases, PPOBC also signi\ufb01cantly outperforms human-human performance. Since imitation\nlearning typically cannot exceed the performance of the demonstrator it is trained on, this suggests\nthat in these cases PPOBC would also outperform imitation learning.\nWe speculate that the differences across layouts are due to differences in the quality of BC and\nDRL algorithms across layouts. In Cramped Room, Coordination Ring, and the second setting of\nAsymmetric Advantages, we have both a good BC model as well as good DRL training, and so\nPPOBC outperforms both self-play methods and human-human performance. In the \ufb01rst setting of\nAsymmetric Advantages, the DRL training does not work very well, and the resulting policy lets the\nhuman model do most of the hard work. (In fact, in the second setting of Asymmetric Advantages in\n\n7\n\n\fFigure 7: Cross-entropy loss incurred when using various models as a predictive model of the human in the\nhuman-AI data collected, with standard error over 5 different seeds. Unsurprisingly, SP and PBT are poor\nmodels of the human, while BC and HP roxy are good models. See Appendix H for prediction accuracy.\n\nFigure 4a, the human-AI team beats the AI-AI team, suggesting that the role played by the human\nis hard to learn using DRL.) In Forced Coordination and Counter Circuit, BC is a very poor human\nmodel, and so PPOBC still has an incorrect expectation, and doesn\u2019t perform as well.\nWe also guess that the effects are not as strong in simulation because humans are able to adapt to\nagent policies and \ufb01gure out how to get the agent to perform well, a feat that our simple HP roxy\nis unable to do. This primarily bene\ufb01ts self-play based methods, since they typically have opaque\ncoordination policies, and doesn\u2019t help PPOBC as much, since there is less need to adapt to PPOBC.\nWe describe some particular scenarios in Section 7.\nFigure 7 shows how well each model performs as a predictive model of the human, averaged across\nall human-AI data, and unsurprisingly \ufb01nds that SP and PBT are poor models, while BC and HP roxy\nare decent. Since SP and PBT expect the other player to be like themselves, they are effectively using\na bad model of the human, explaining their poor performance with real humans. PPOBC instead\nexpects the other player to be BC, a much better model, explaining its superior performance.\n\n7 Qualitative Findings\n\nHere, we speculate on some qualitative behaviors that we observed. We found similar behaviors\nbetween simulation and real users, and SP and PBT had similar types of failures, though the speci\ufb01c\nfailures were different.\nAdaptivity to the human. We observed that over the course of training the SP agents became very\nspecialized, and so suffered greatly from distributional shift when paired with human models and real\nhumans. For example, in Asymmetric Advantages, the SP agents only use the top pot, and ignore the\nbottom one. However, humans use both pots. 
The SP agent ends up waiting unproductively for the human to deliver a soup from the top pot, while the human has instead decided to fill up the bottom pot. In contrast, PPOBC learns to use both pots, depending on the context.

Leader/follower behavior. In Coordination Ring, SP and PBT agents tend to be very headstrong: for any specific portion of the task, they usually expect either clockwise or counterclockwise motion, but not both. Humans have no such preference, and so the SP and PBT agents often collide with them, and keep colliding until the human gives way. The PPOBC agent instead can take on both leader and follower roles. If it is carrying a plate to get a soup from the pot, it will insist on following the shorter path, even if a human is in the way. On the other hand, when picking which route to carry onions to the pots, it tends to adapt to the human's choice of route.

Adaptive humans. Real humans learn throughout the episode to anticipate and work with the agent's particular coordination protocols. For example, in Cramped Room, after picking up a soup, SP and PBT insist upon delivering the soup via right-down-interact instead of down-right-down-interact, even when a human is in the top right corner, blocking the way. Humans can figure this out and make sure that they are not in the way. Notably, PPOBC cannot learn to take advantage of human adaptivity, because the BC model is not adaptive.

8 Discussion

Summary. While agents trained via general DRL algorithms in collaborative environments are very good at coordinating with themselves, they are not able to handle human partners well, since they have never seen humans during training. We introduced a simple environment based on the game Overcooked that is particularly well-suited for studying coordination, and demonstrated quantitatively the poor performance of such agents when paired with a learned human model, and with actual humans. Agents that were explicitly designed to work well with a human model, even in a very naive way, achieved significantly better performance. Qualitatively, we observed that agents that learned about humans were significantly more adaptive, and better able to take on both leader and follower roles, than agents that expected their partners to be optimal (or like them).

Limitations and future work. An alternative hypothesis for our results is that training against BC simply forces the trained agent to be robust to a wider variety of states, since BC is more stochastic than an agent trained via self-play, but it doesn't matter whether BC models real humans or not. We do not find this likely a priori, and we did try to check this: PBT is supposed to be more robust than self-play, but still has the same issue, and planning agents are automatically robust to a wide range of states, but still showed the same broad trend. Nonetheless, it is possible that DRL applied to a sufficiently wide set of states could recoup most of the lost performance. One particular experiment we would like to run is to train a single agent that works on arbitrary layouts. Since agents would be trained on a much wider variety of states, it could be that such agents require more general coordination protocols, and self-play-like methods will be more viable since they are forced to learn the same protocols that humans would use. In contrast, in this work, we trained separate agents for each of the layouts in Figure 3.
We limited the\nscope of each agent because of our choice to train the simplest human model, in order to showcase\nthe importance of human data: if a naive model is already better, then more sophisticated ones will be\ntoo. Our \ufb01ndings open the door to exploring such models and algorithms:\nBetter human models: Using imitation learning for the human model is prone to distributional shift\nthat reinforcement learning will exploit. One method to alleviate this would be to add inductive bias\nto the human model that makes it more likely to generalize out of distribution, for example by using\ntheory of mind [7] or shared planning [17]. However, we could also use the standard data aggregation\napproach, where we periodically query humans for a new human-AI dataset with the current version\nof the agent, and retrain the human model to correct any errors caused by distribution shift.\nBiasing population-based training towards humans: Agents trained via PBT should be able to\ncoordinate well with any of the agents that were present in the population during training. So, we\ncould train multiple human models using variants of imitation learning or theory of mind, and inject\nthese human models as agents in the population. The human models need not even be accurate, as\nlong as in aggregate they cover the range of possible human behavior. This becomes a variant of\ndomain randomization [3] applied to interaction with humans.\nAdapting to the human at test time: So far we have been assuming that we must deploy a static agent\nat test time, but we could have our agent adapt online. One approach would be to learn multiple\nhuman models (corresponding to different humans in the dataset). At test time, we can select the\nmost likely human model [27], and choose actions using a model-based algorithm such as model\npredictive control [25]. We could also use a meta-learning algorithm such as MAML [11] to learn a\npolicy that can quickly adapt to new humans at test time.\nHumans who learn: We modeled the human policy as stationary, preserving the Markov assumption.\nHowever, in reality humans will be learning and adapting as they play the game, which we would\nideally model. We could take this into account by using recurrent architectures, or by using a more\nexplicit model of how humans learn.\n\nAcknowledgments\n\nWe thank the researchers at the Center for Human Compatible AI and the Interact lab for valuable\nfeedback. This work was supported by the Open Philanthropy Project, NSF CAREER, the NSF\nVeHICaL project (CNS-1545126), and National Science Foundation Graduate Research Fellowship\nGrant No. DGE 1752814.\n\n9\n\n\fReferences\n[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning.\nIn Proceedings of the twenty-\ufb01rst international conference on Machine learning, page 1. ACM,\n2004.\n\n[2] Rachid Alami, Aur\u00e9lie Clodic, Vincent Montreuil, Emrah Akin Sisbot, and Raja Chatila.\nToward human-aware robot task planning. In AAAI spring symposium: to boldly go where no\nhuman-robot team has gone before, pages 39\u201346, 2006.\n\n[3] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub\nPachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous\nin-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.\n\n[4] Michael Bain and Claude Sammut. A framework for behavioural cloning.\n\nIn Machine\nIntelligence 15, Intelligent Agents [St. 
Catherine\u2019s College, Oxford, July 1995], pages\n103\u2013129, Oxford, UK, UK, 1999. Oxford University.\nISBN 0-19-853867-7. URL http:\n//dl.acm.org/citation.cfm?id=647636.733043.\n\n[5] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In\nProceedings of the 6th conference on Theoretical aspects of rationality and knowledge, pages\n195\u2013210. Morgan Kaufmann Publishers Inc., 1996.\n\n[6] Shan Carter and Michael Nielsen. Using arti\ufb01cial intelligence to augment human intelligence.\n\nDistill, 2017. doi: 10.23915/distill.00009. https://distill.pub/2017/aia.\n\n[7] Rohan Choudhury, Gokul Swamy, Dylan Had\ufb01eld-Menell, and Anca Dragan. On the utility of\n\nmodel learning in HRI. 01 2019.\n\n[8] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec\nRadford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI baselines.\nhttps://github.com/openai/baselines, 2017.\n\n[9] Anca D Dragan, Kenton CT Lee, and Siddhartha S Srinivasa. Legibility and predictability of\nrobot motion. In Proceedings of the 8th ACM/IEEE international conference on Human-robot\ninteraction, pages 301\u2013308. IEEE Press, 2013.\n\n[10] Douglas C Engelbart. Augmenting human intellect: A conceptual framework. Menlo Park, CA,\n\n1962.\n\n[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adap-\ntation of deep networks. In Proceedings of the 34th International Conference on Machine\nLearning-Volume 70, pages 1126\u20131135. JMLR. org, 2017.\n\n[12] Jakob N Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson,\nMatthew Botvinick, and Michael Bowling. Bayesian action decoder for deep multi-agent\nreinforcement learning. arXiv preprint arXiv:1811.01458, 2018.\n\n[13] Ghost Town Games. Overcooked, 2016.\n\n448510/Overcooked/.\n\nhttps://store.steampowered.com/app/\n\n[14] Michael J Gielniak and Andrea L Thomaz. Generating anticipation in robot motion. In 2011\n\nRO-MAN, pages 449\u2013454. IEEE, 2011.\n\n[15] Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control\nusing deep reinforcement learning. In International Conference on Autonomous Agents and\nMultiagent Systems, pages 66\u201383. Springer, 2017.\n\n[16] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in\n\nNeural Information Processing Systems, pages 4565\u20134573, 2016.\n\n[17] M. K. Ho, J. MacGlashan, A. Greenwald, M. L. Littman, E. M. Hilliard, C. Trimbach,\nS. Brawner, J. B. Tenenbaum, M. Kleiman-Weiner, and J. L. Austerweil. Feature-based joint\nplanning and norm learning in collaborative games. In A. Papafragou, D. Grodner, D. Mirman,\nand J.C. Trueswell, editors, Proceedings of the 38th Annual Conference of the Cognitive Science\nSociety, pages 1158\u20131163, Austin, TX, 2016. Cognitive Science Society.\n\n10\n\n\f[18] Edward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Due\u00f1ez-Guzman, Anto-\nnio Garc\u00eda Casta\u00f1eda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. Inequity\naversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Infor-\nmation Processing Systems, pages 3326\u20133336, 2018.\n\n[19] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali\nRazavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based\ntraining of neural networks. 
arXiv preprint arXiv:1711.09846, 2017.\n\n[20] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia\nCasta\u00f1eda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas\nSonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray\nKavukcuoglu, and Thore Graepel. Human-level performance in 3d multiplayer games with\npopulation-based reinforcement learning. Science, 364(6443):859\u2013865, 2019. ISSN 0036-8075.\ndoi: 10.1126/science.aau6249. URL https://science.sciencemag.org/content/364/\n6443/859.\n\n[21] Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight\n\noptimization. Robotics science and systems: online proceedings, 2015.\n\n[22] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent\nreinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference\non Autonomous Agents and MultiAgent Systems, pages 464\u2013473. International Foundation for\nAutonomous Agents and Multiagent Systems, 2017.\n\n[23] Adam Lerer and Alexander Peysakhovich. Learning existing social conventions via observa-\ntionally augmented self-play. AAAI / ACM conference on Arti\ufb01cial Intelligence, Ethics, and\nSociety, 2019.\n\n[24] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent\nactor-critic for mixed cooperative-competitive environments. In Advances in Neural Information\nProcessing Systems, pages 6379\u20136390, 2017.\n\n[25] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network\ndynamics for model-based deep reinforcement learning with model-free \ufb01ne-tuning. In 2018\nIEEE International Conference on Robotics and Automation (ICRA), pages 7559\u20137566. IEEE,\n2018.\n\n[26] Stefanos Nikolaidis and Julie Shah. Human-robot cross-training: computational formulation,\nmodeling and evaluation of a human team training strategy. In Proceedings of the 8th ACM/IEEE\ninternational conference on Human-robot interaction, pages 33\u201340. IEEE Press, 2013.\n\n[27] Eshed Ohn-Bar, Kris M. Kitani, and Chieko Asakawa. Personalized dynamics models for\n\nadaptive assistive navigation interfaces. CoRR, abs/1804.04118, 2018.\n\n[28] OpenAI.\n\nHow to train your OpenAI Five.\n\nhow-to-train-your-openai-five/.\n\n2019.\n\nhttps://openai.com/blog/\n\n[29] OpenAI. OpenAI Five \ufb01nals. 2019. https://openai.com/blog/openai-five-finals/.\n\n[30] Dean A Pomerleau. Ef\ufb01cient training of arti\ufb01cial neural networks for autonomous navigation.\n\nNeural Computation, 3(1):88\u201397, 1991.\n\n[31] Dorsa Sadigh, Shankar Sastry, Sanjit A Seshia, and Anca D Dragan. Planning for autonomous\ncars that leverage effects on human actions. In Robotics: Science and Systems, volume 2. Ann\nArbor, MI, USA, 2016.\n\n[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal\n\npolicy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[33] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur\nGuez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering\nchess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint\narXiv:1712.01815, 2017.\n\n11\n\n\f[34] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wo-\njciech M. 
Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo\nEwalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin\nDalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor\nCai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen,\nYuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul,\nTimothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaS-\ntar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/\nalphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.\n\n12\n\n\f", "award": [], "sourceid": 2815, "authors": [{"given_name": "Micah", "family_name": "Carroll", "institution": "UC Berkeley"}, {"given_name": "Rohin", "family_name": "Shah", "institution": "UC Berkeley"}, {"given_name": "Mark", "family_name": "Ho", "institution": "Princeton University"}, {"given_name": "Tom", "family_name": "Griffiths", "institution": "Princeton University"}, {"given_name": "Sanjit", "family_name": "Seshia", "institution": "UC Berkeley"}, {"given_name": "Pieter", "family_name": "Abbeel", "institution": "UC Berkeley & covariant.ai"}, {"given_name": "Anca", "family_name": "Dragan", "institution": "UC Berkeley"}]}