{"title": "Inequity aversion improves cooperation in intertemporal social dilemmas", "book": "Advances in Neural Information Processing Systems", "page_first": 3326, "page_last": 3336, "abstract": "Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet generated an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity averse social preferences. This promotes a particular resolution of the matrix game social dilemma wherein inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.", "full_text": "Inequity aversion improves cooperation in\n\nintertemporal social dilemmas\n\nEdward Hughes\u2217, Joel Z. Leibo\u2217, Matthew Phillips, Karl Tuyls, Edgar Due\u00f1ez-Guzman,\nAntonio Garc\u00eda Casta\u00f1eda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster,\n\nHeather Roff, Thore Graepel\n\nDeepMind, London, United Kingdom\n\n{edwardhughes, jzl, karltuyls, duenez, antoniogc, idunning, tinazhu,\n\nkevinrmckee, rkoster, hroff, thore}@google.com,\n\nmatthew.phillips.12@ucl.ac.uk\n\nAbstract\n\nGroups of humans are often able to \ufb01nd ways to cooperate with one another\nin complex, temporally extended social dilemmas. Models based on behavioral\neconomics are only able to explain this phenomenon for unrealistic stateless matrix\ngames. Recently, multi-agent reinforcement learning has been applied to generalize\nsocial dilemma problems to temporally and spatially extended Markov games.\nHowever, this has not yet generated an agent that learns to cooperate in social\ndilemmas as humans do. A key insight is that many, but not all, human individuals\nhave inequity averse social preferences. This promotes a particular resolution of\nthe matrix game social dilemma wherein inequity-averse individuals are personally\npro-social and punish defectors. Here we extend this idea to Markov games and\nshow that it promotes cooperation in several types of sequential social dilemma,\nvia a pro\ufb01table interaction with policy learnability. In particular, we \ufb01nd that\ninequity aversion improves temporal credit assignment for the important class\nof intertemporal social dilemmas. These results help explain how large-scale\ncooperation may emerge and persist.\n\nIntroduction\n\n1\nIn intertemporal social dilemmas, there is a tradeoff between short-term individual incentives and\nlong-term collective interest. Humans face such dilemmas when contributing to a collective food\nstorage during the summer in preparation for a harsh winter, organizing annual maintenance of\nirrigation systems, or sustainably sharing a local \ufb01shery. 
Classical models of human behavior based on rational choice theory predict that cooperation in these situations is impossible [1, 2]. This poses a puzzle, since humans evidently do find ways to cooperate in many everyday intertemporal social dilemmas, as documented by decades of fieldwork [3, 4] and laboratory experiments [5, 6]. Providing an empirically grounded explanation of how individual behavior gives rise to societal cooperation is seen as a core goal in several subfields of the social sciences and evolutionary biology [7, 8, 9].

Influential models based on behavioral game theory were proposed in [10, 11]. However, these models have limited applicability, since they only generate predictions when the problem can be cast as a matrix game (see e.g. [12, 13]). Here we consider a more realistic video-game setting, like those introduced in the behavioral research of [14, 15, 16]. In this environment, agents do not simply choose to cooperate or defect as they do in matrix games. Rather, they must learn policies to implement their strategic decisions, and must do so while coping with the non-stationarity arising from other agents learning simultaneously. Several papers have used multi-agent reinforcement learning [17, 18, 19] and planning [20, 21, 22, 23] to generate cooperation in this setting. However, this approach has not yet demonstrated robust cooperation in games with more than two players, a regime in which cooperation is often observed in human behavioral experiments. Moreover, naïvely optimizing group reward is also ineffective, due to the lazy agent problem [24].†

† For more detail on the motivations for our research program, see the supplementary information.

It is difficult for both natural and artificial agents to find cooperative solutions to intertemporal social dilemmas, for the following reasons:

1. Collective action – individuals must learn and coordinate policies at a group level to avoid falling into socially deficient equilibria.

2. Temporal credit assignment – rational defection in the short term must become associated with long-term negative consequences.

Many different research traditions, including economics, evolutionary biology, sociology, psychology, and political philosophy, have converged on the idea that fairness norms are involved in resolving social dilemmas [25, 26, 27, 28, 29, 30, 31]. In one well-known model, agents are assumed to have inequity-averse preferences [10]. They balance their selfish desire for individual rewards against a need to keep deviations between their own rewards and the rewards of others as small as possible. Inequity-averse individuals are able to solve social dilemmas by resisting the temptation to pull ahead of others or, if punishment is possible, by punishing and discouraging free-riding. The inequity aversion model has been successfully applied to explain human behavior in a variety of laboratory economic games, such as the ultimatum game, the dictator game, the gift exchange game, market games, the trust game and public goods games [32, 33].‡

‡ For alternative theories of the other-regarding preferences that may underlie human cooperative behavior in economic games, see [34, 35].

In this research, we generalize the inequity aversion model to Markov games and show that it resolves intertemporal social dilemmas.
Crucial to our analysis will be the distinction between disadvantageous inequity aversion (negative reward received by individuals who underperform relative to others) and advantageous inequity aversion (negative reward received by individuals who overperform relative to others). Colloquially, these may be thought of as reductionist models of envy (disadvantageous inequity aversion) and guilt (advantageous inequity aversion) respectively [36]. We hypothesize that these directly address the two challenges set out above, in the following way.

Inequity aversion mitigates the problem of collective action by changing the effective payoff structure experienced by agents, through both a direct and an indirect mechanism. In the direct mechanism, defectors experience advantageous inequity aversion, diminishing the marginal benefit of defection over cooperation. The indirect mechanism arises when cooperating agents are disadvantageous-inequity averse: this motivates them to punish defectors by sanctioning them, reducing the payoff incentive for free-riding. Since agents must learn a defecting strategy via exploration, initially cooperative agents are deterred from switching strategies if the payoff bonus does not outweigh the cost of inefficiently executing the defecting strategy while learning.

Inequity aversion also ameliorates the temporal credit assignment problem. Learning the association between short-term actions and long-term consequences is a high-variance and error-prone process, both for animals [37] and for reinforcement learning algorithms [38]. Inequity aversion short-circuits the need for such long-term temporal credit assignment by acting as an "early warning system" for intertemporal social dilemmas. As before, both a direct and an indirect mechanism are at work. With the direct mechanism, advantageous-inequity-averse defectors receive negative rewards in the short term, since the benefits of defection are delivered on that timescale. The indirect mechanism operates because cooperators experience disadvantageous inequity aversion at precisely the time when other agents defect. This leads cooperators to punish defectors on a short timescale. Both mechanisms have the effect of operant conditioning [39], incentivizing agents that cannot resolve long-term uncertainty to act in the lasting interest of the group.

2 Reinforcement learning in sequential social dilemmas

2.1 Partially observable Markov games

We consider multi-agent reinforcement learning in partially observable general-sum Markov games [40, 41]. In each game state, agents take actions based on a partial observation of the state space and receive an individual reward. Agents must learn, through experience, an appropriate behavior policy while interacting with one another. We formalize this as follows.

Figure 1: Screenshots from (A) the Cleanup game, (B) the Harvest game, (C) the Dictate apples game, and (D) the Take apples and Give apples games. The size of the agent-centered observation window is also shown in (B). The same size observation was used in all experiments.

Consider an N-player partially observable Markov game M defined on a finite set of states S. The observation function O : S × {1, ..., N} → R^d specifies each player's d-dimensional view of the state space. From each state, players may take actions from the sets A_1, ..., A_N (one for each player). As a result of their joint action (a_1, ..., a_N) ∈ A_1 × ··· × A_N, the state changes following the stochastic transition function T : S × A_1 × ··· × A_N → Δ(S), where Δ(S) denotes the set of discrete probability distributions over S. Write O_i = {o_i | s ∈ S, o_i = O(s, i)} for the observation space of player i. Each player i receives an individual extrinsic reward r_i : S × A_1 × ··· × A_N → R.§

§ In our games, N = 5, d = 15 × 15 × 3, and |A_i| ranges from 8 to 10, with actions comprising movement, rotation and firing.

Each agent learns, independently through its own experience of the environment, a behavior policy π_i : O_i → Δ(A_i) (written π_i(a_i|o_i)) based on its own observation o_i = O(s, i) and extrinsic reward r_i(s, a_1, ..., a_N). For the sake of simplicity we write ā = (a_1, ..., a_N), ō = (o_1, ..., o_N) and π̄(·|ō) = (π_1(·|o_1), ..., π_N(·|o_N)). Each agent's goal is to maximize a long-term γ-discounted payoff defined as follows:

$$ V^i_{\bar{\pi}}(s_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_i(s_t, \bar{a}_t) \,\middle|\, \bar{a}_t \sim \bar{\pi}_t,\; s_{t+1} \sim T(s_t, \bar{a}_t) \right]. \qquad (1) $$
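To make the objective concrete, the following minimal Python sketch estimates the discounted payoff of equation (1) by Monte Carlo rollouts. It is illustrative only: `env` and `policies` are hypothetical stand-ins for a partially observable Markov game and the N players' policies, not interfaces from our codebase.

```python
def estimate_returns(env, policies, gamma=0.99, episodes=100, max_steps=1000):
    """Average gamma-discounted return per player over sampled episodes."""
    n = len(policies)
    totals = [0.0] * n
    for _ in range(episodes):
        observations = env.reset()          # o_i = O(s, i) for each player i
        discount = 1.0                      # gamma^t, updated as t advances
        for _ in range(max_steps):
            # Each player acts on its own partial observation (Section 2.1).
            actions = [pi(o) for pi, o in zip(policies, observations)]
            observations, rewards, done = env.step(actions)
            for i in range(n):
                totals[i] += discount * rewards[i]   # accumulate gamma^t r_i
            discount *= gamma
            if done:
                break
    return [total / episodes for total in totals]
```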
2.2 Learning agents

We deploy asynchronous advantage actor-critic (A3C) as the learning algorithm for our agents [42]. A3C maintains both value (critic) and policy (actor) estimates using a deep neural network. The policy is updated according to the policy gradient method, using a value estimate as a baseline to reduce variance. Gradients are generated asynchronously by 24 independent copies of each agent, playing simultaneously in distinct instantiations of the environment. Explicitly, the gradients are $\nabla_\theta \log \pi(a_t|s_t; \theta) A(s_t, a_t; \theta, \theta_v)$, where $A(s_t, a_t; \theta, \theta_v)$ is the advantage function, estimated via k-step backups, $\sum_{i=0}^{k-1} \gamma^i u_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$, where $u_{t+i}$ is the subjective reward. In Section 3.1 we decompose this into an extrinsic reward from the environment and an intrinsic reward that defines the agent's inequity aversion.
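As a sketch of the advantage computation just described (our own code, not the training implementation), assume one truncated rollout of subjective rewards u_t and value estimates V(s_t; θ_v), plus the bootstrap value V(s_{t+k}; θ_v) at the cut:

```python
def k_step_advantages(subjective_rewards, values, bootstrap_value, gamma=0.99):
    """Backward pass computing, for each position t in the rollout,
    sum_{i<k} gamma^i u_{t+i} + gamma^k V(s_{t+k}) - V(s_t)."""
    advantages = []
    ret = bootstrap_value                   # V(s_{t+k}; theta_v) at the cut
    for u, v in zip(reversed(subjective_rewards), reversed(values)):
        ret = u + gamma * ret               # k-step discounted backup
        advantages.append(ret - v)          # subtract the value baseline
    return list(reversed(advantages))
```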
2.3 Intertemporal social dilemmas

An intertemporal social dilemma is a temporally extended multi-agent game in which individually optimal short-term strategies lead to poor long-term outcomes for the group. To define this term precisely, we employ a formalization of empirical game-theoretic analysis [43, 44]. Our definition is consistent with that of [17]. However, since that work was limited to the 2-player case, it relied on the empirical payoff matrix to represent the relative values of cooperation and defection. This quantity is unwieldy for N > 2, since it becomes a tensor. Therefore we base our definition on a different representation of the N-player game. Explicitly, a Schelling diagram [45, 18] depicts the relative payoffs for a single cooperator or defector given a fixed number of other cooperators. Thus Schelling diagrams are a natural and convenient generalization of payoff matrices to multi-agent settings. Game-theoretic properties like Nash equilibria are readily visible in Schelling diagrams; see [45] for additional details and intuition.

Figure 2: The public goods game (Cleanup) and the commons game (Harvest) are social dilemmas. (A) shows the Schelling diagram for Cleanup. (B) shows the Schelling diagram for Harvest. The dotted line shows the overall average return were the individual to choose defection.

An N-player sequential social dilemma is a tuple (M, Π = Π_c ⊔ Π_d) of a Markov game and two disjoint sets of policies, said to implement cooperation and defection respectively, satisfying the following properties. Consider the strategy profile (π_c^1, ..., π_c^ℓ, π_d^1, ..., π_d^m) ∈ Π_c^ℓ × Π_d^m with ℓ + m = N. We shall denote the average payoff for the cooperating policies by R_c(ℓ) and for the defecting policies by R_d(ℓ). A Schelling diagram plots the curves R_c(ℓ + 1) and R_d(ℓ). Intuitively, the diagram displays the two possible payoffs to the N-th player given that ℓ of the remaining players elect to cooperate and the rest defect. We say that (M, Π) is a sequential social dilemma iff the following hold:

1. Mutual cooperation is preferred over mutual defection: R_c(N) > R_d(0).

2. Mutual cooperation is preferred to being exploited by defectors: R_c(N) > R_c(0).

3. Either the fear property, the greed property, or both:
   - Fear: mutual defection is preferred to being exploited; R_d(i) > R_c(i) for sufficiently small i.
   - Greed: exploiting a cooperator is preferred to mutual cooperation; R_d(i) > R_c(i) for sufficiently large i.

We show that the matrix games Stag Hunt, Chicken and Prisoner's Dilemma satisfy these properties in Supplementary Fig. 1; the conditions can also be checked mechanically, as in the sketch below.
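A minimal sketch of such a check (our own code; `Rc[l]` and `Rd[l]` are assumed to hold the average cooperator and defector payoffs when l players in total cooperate, read off an empirical Schelling diagram, and "sufficiently small/large i" is represented here by single test points):

```python
def is_sequential_social_dilemma(Rc, Rd):
    """Rc, Rd: payoff lists indexed by the total number of cooperators, 0..N."""
    N = len(Rc) - 1
    if not (Rc[N] > Rd[0]):        # 1. mutual cooperation beats mutual defection
        return False
    if not (Rc[N] > Rc[0]):        # 2. cooperation beats being exploited
        return False
    fear = Rd[1] > Rc[1]           # 3a. defection preferred when few cooperate
    greed = Rd[N - 1] > Rc[N - 1]  # 3b. defection preferred when many cooperate
    return fear or greed
```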
A sequential social dilemma is intertemporal if the choice to defect is optimal in the short term. More precisely, consider an individual i and an arbitrary set of policies for the rest of the group. Given a starting state, for all sufficiently small k, the policy π_k^i ∈ Π with maximum return over the next k steps is a defecting policy. There is thus a tension between short-term personal gain and long-term group utility.

2.4 Examples

[46] divides all multi-person social dilemmas into two broad categories:

1. Public goods dilemmas, in which an individual must pay a personal cost in order to provide a resource that is shared by all.

2. Commons dilemmas, in which an individual is tempted by a personal benefit, depleting a resource that is shared by all.

Figure 3: Advantageous inequity aversion facilitates cooperation in the Cleanup game. (A) compares the collective return achieved by A3C and advantageous-inequity-averse agents, (B) shows contributions to the public good, and (C) shows equality over the course of training. (D-F) demonstrate that disadvantageous inequity aversion does not promote greater cooperation in the Cleanup game.

We consider two dilemmas in this paper, one of the public goods type and one of the commons type. Each was implemented as a partially observable Markov game on a 2D grid. Both are also intertemporal social dilemmas, because individually selfish actions produce immediate benefits while their impacts on the collective develop over a longer time horizon. The availability of costly punishment is of critical importance in human sequential social dilemmas [47, 48] and is therefore an action in the environments presented here.¶

¶ In both games, players can fine each other using a punishment beam. This contrasts with [18], in which a timeout beam was used.

In the Cleanup game, the aim is to collect apples from a field. Each apple provides a reward of 1. The spawning of apples is controlled by a geographically separate aquifer that supplies water and nutrients. Over time, this aquifer fills up with waste, lowering the respawn rate of apples linearly. For sufficiently high waste levels, no apples can spawn. At the start of each episode, the environment resets with waste just beyond this saturation point. To cause apples to spawn, agents must clean some of the waste (a toy sketch of this mechanic follows below).

Here we have a dilemma. Provided that some agents contribute to the public good by cleaning up the aquifer, it is individually more rewarding to stay in the apple field. However, if all players defect, then no one gets any reward. A successful group must balance the temptation to free-ride against the provision of the public good. Cooperative agents must make a positive commitment to group-level well-being to solve the task.
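To make the aquifer mechanic concrete, a toy model of the spawning rule might look as follows. The function name and constants are our own illustration; the paper's actual ecological parameters are given in its supplementary information.

```python
def apple_respawn_probability(waste, base_rate=0.05, saturation=100.0):
    """Respawn probability decays linearly in waste and is zero at saturation.
    Illustrative constants only, not the values used in the Cleanup game."""
    return max(0.0, base_rate * (1.0 - waste / saturation))
```

Episodes begin with waste just above the saturation point, so no apples spawn until agents remove some waste.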
The goal of the Harvest game is to collect apples. Each apple provides a reward of 1. The apple regrowth rate varies across the map, depending on the spatial configuration of uncollected apples: the more nearby apples, the higher the local regrowth rate. If all apples in a local area are harvested, then none ever grow back. After 1000 steps the episode ends, at which point the game resets to an initial state.

The dilemma is as follows. The short-term interest of each individual leads toward harvesting as rapidly as possible. However, the long-term interest of the group as a whole is advanced if individuals refrain from doing so, especially when many agents are in the same local region. Such situations are precarious because the more harvesting agents there are, the greater the chance of permanently depleting the local resources. Cooperators must abstain from a personal benefit for the good of the group.‖

‖ Precise details of the ecological dynamics may be found in the supplementary information.

2.5 Validating the environments

We would like to demonstrate that these environments are social dilemmas by plotting Schelling diagrams. In complex, spatially and temporally extended Markov games, it is not feasible to analytically determine cooperating and defecting policies. Instead, we must study the environment empirically. One method employs reinforcement learning to train such policies. We enforce cooperation or defection by making appropriate modifications to the environment, as follows.

In Harvest, we enforce cooperation by modifying the environment to prevent some agents from gathering apples in low-density areas. In Cleanup, we enforce free-riding by removing the ability of some agents to clean up waste. We also add a small group reward signal to encourage the remaining agents to cooperate. The resulting empirical Schelling diagrams in Figure 2 demonstrate that our environments are indeed social dilemmas.

Figure 4: Inequity aversion promotes cooperation in the Harvest game. When all 5 agents have advantageous inequity aversion, there is a small improvement over A3C in the three social outcome metrics: (A) collective return, (B) apple consumption, and (C) sustainability. Disadvantageous inequity aversion provides a much larger improvement over A3C, and works even when only 1 out of 5 agents is inequity averse. (D) shows collective return, (E) apple consumption, and (F) sustainability.

3 The model

We first introduce the inequity aversion model of [10], which is directly applicable only to stateless games. We then extend the model to sequential, multi-state problems, making use of deep reinforcement learning.

3.1 Inequity aversion

The utility function of [10] is as follows. Let r_1, ..., r_N be the extrinsic payoffs achieved by each of N players. Each agent receives a utility

$$ U_i(r_1, \ldots, r_N) = r_i - \frac{\alpha_i}{N-1} \sum_{j \neq i} \max(r_j - r_i, 0) - \frac{\beta_i}{N-1} \sum_{j \neq i} \max(r_i - r_j, 0), \qquad (2) $$

where the additional terms may be interpreted as intrinsic payoffs, in the language of [49].

The parameter α_i controls an agent's aversion to disadvantageous inequity: a larger value of α_i implies a larger utility loss when other agents achieve rewards greater than one's own. Likewise, the parameter β_i controls an agent's aversion to advantageous inequity, the utility lost when performing better than others. [10] argue that α > β, that is, that most people are loss averse in social comparisons. There is some empirical support for this prediction [50], though the evidence is mixed [51, 52]. In a sweep over values of α and β, we found our strongest results for α = 5 and β = 0.05.
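Equation (2) translates directly into code. The following NumPy sketch (ours, not the paper's implementation) computes all N utilities at once; `alpha` and `beta` may be scalars or per-agent arrays of length N:

```python
import numpy as np

def inequity_averse_utility(rewards, alpha, beta):
    """Fehr-Schmidt utilities U_i of equation (2) for extrinsic payoffs r_1..r_N."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    diff = r[None, :] - r[:, None]               # diff[i, j] = r_j - r_i
    envy = np.maximum(diff, 0.0).sum(axis=1)     # disadvantageous inequity
    guilt = np.maximum(-diff, 0.0).sum(axis=1)   # advantageous inequity
    return r - (alpha * envy + beta * guilt) / (n - 1)
```

For example, with rewards (3, 1) and (alpha, beta) = (5, 0.05), the advantaged player's utility is 3 - 0.05 * 2 = 2.9 and the disadvantaged player's is 1 - 5 * 2 = -9.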
Figure 5: Inequity aversion promotes cooperation by improving temporal credit assignment. (A) shows collective return for delayed advantageous inequity aversion in the Cleanup game. (B) shows apple consumption for delayed disadvantageous inequity aversion in the Harvest game.

3.2 Inequity aversion in sequential dilemmas

Experimental work in behavioral economics suggests that some proportion of natural human populations is inequity averse [8]. However, as a computational model, inequity aversion has only been expounded for the matrix game setting: equation (2) can be directly applied only to stateless games [53, 54]. In this section we extend this model of inequity aversion to the temporally extended Markov game case.

The main problem in redefining the social preference of equation (2) for Markov games is that the rewards of different players may occur on different timesteps. The key step in extending (2) to this case is therefore to introduce per-player temporal smoothing of the reward traces. Let r_i(s, a) denote the reward obtained by the i-th player when it takes action a from state s. For convenience, we also sometimes write it with a time index: r^t_i := r_i(s_t, a_t). We define the subjective reward u_i(s, a) received by the i-th player when it takes action a from state s to be

$$ u_i(s^t_i, a^t_i) = r_i(s^t_i, a^t_i) - \frac{\alpha_i}{N-1} \sum_{j \neq i} \max\big(e^t_j(s^t_j, a^t_j) - e^t_i(s^t_i, a^t_i), 0\big) - \frac{\beta_i}{N-1} \sum_{j \neq i} \max\big(e^t_i(s^t_i, a^t_i) - e^t_j(s^t_j, a^t_j), 0\big), \qquad (3) $$

where the temporally smoothed rewards e^t_j for the agents j = 1, ..., N are updated at each timestep t according to

$$ e^t_j(s^t_j, a^t_j) = \gamma \lambda \, e^{t-1}_j(s^{t-1}_j, a^{t-1}_j) + r^t_j(s^t_j, a^t_j), \qquad (4) $$

where γ is the discount factor and λ is a hyperparameter. This is analogous to the mathematical formalism used for eligibility traces [55]. Furthermore, we allow agents to observe the smoothed reward of every player on each timestep.
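A minimal sketch of equations (3) and (4) together (our own code, not the training implementation): a vector of per-player traces is decayed by γλ and incremented by the extrinsic rewards, and the Fehr-Schmidt comparison is then applied to the traces rather than to the instantaneous rewards.

```python
import numpy as np

class InequityAversionReward:
    """Computes the subjective rewards u_i of equation (3) from reward traces."""

    def __init__(self, n_agents, alpha, beta, gamma=0.99, lam=0.95):
        self.alpha, self.beta = alpha, beta
        self.gamma, self.lam = gamma, lam
        self.traces = np.zeros(n_agents)          # e^t_j; reset each episode

    def step(self, rewards):
        """rewards: extrinsic r^t_j for all players at the current timestep."""
        r = np.asarray(rewards, dtype=float)
        n = len(r)
        # Equation (4): e^t_j = gamma * lambda * e^{t-1}_j + r^t_j.
        self.traces = self.gamma * self.lam * self.traces + r
        e = self.traces
        diff = e[None, :] - e[:, None]            # diff[i, j] = e_j - e_i
        envy = np.maximum(diff, 0.0).sum(axis=1)
        guilt = np.maximum(-diff, 0.0).sum(axis=1)
        # Equation (3): extrinsic reward minus the inequity penalties.
        return r - (self.alpha * envy + self.beta * guilt) / (n - 1)
```

As with equation (2), `alpha` and `beta` may be scalars or per-agent arrays.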
4 Results

We show that advantageous inequity aversion is able to resolve certain intertemporal social dilemmas without resorting to punishment, by providing a temporally correct intrinsic reward. For this mechanism to be effective, the population must contain sufficiently many advantageous-inequity-averse individuals. By contrast, disadvantageous-inequity-averse agents can drive mutual cooperation even in small numbers. They achieve this by punishing defectors at a time concomitant with their offences. In addition, we find that advantageous inequity aversion is particularly effective for resolving public goods dilemmas, whereas disadvantageous inequity aversion is more powerful for addressing commons dilemmas. Our baseline A3C agent fails to find socially beneficial outcomes in either category of game. We define the metrics used to quantify our results in the supplementary information.

4.1 Advantageous inequity aversion promotes cooperation

Advantageous-inequity-averse agents are better than A3C at maintaining cooperation in both public goods and commons games. This effect is particularly pronounced in the Cleanup game (Figure 3). Here, groups of 5 advantageous-inequity-averse agents find solutions in which 2 consistently clean large amounts of waste, producing a large collective return.** We clarify the effect of advantageous inequity aversion on the intertemporal nature of the problem by delaying the delivery of the intrinsic reward signal. Figure 5 suggests that improving temporal credit assignment is an important function of inequity aversion, since delaying the time at which the intrinsic reward signal is delivered removes its beneficial effect.

** For a video of this behavior, visit https://youtu.be/N8BUzzFx7uQ.

4.2 Disadvantageous inequity aversion promotes cooperation

Disadvantageous-inequity-averse agents are better than A3C at maintaining cooperation via punishment in commons games (Figure 4). In particular, a single disadvantageous-inequity-averse agent can fine defectors, generating a sustainable outcome.†† In Figure 5, we see that the disadvantageous-inequity-aversion signal must be temporally aligned with over-consumption for effective policing to arise. Hence, it is plausible that inequity aversion bridges the temporal gap between short-term incentives and long-term outcomes. Disadvantageous inequity aversion has no such positive impact in the Cleanup game, for reasons that we discuss in Section 5.

†† For a video of this behavior, visit https://youtu.be/tz3ZpTTmxTk.

5 Discussion

In the Cleanup game, advantageous inequity aversion is an unambiguous feedback signal: it encourages agents to contribute to the public good. In the direct pathway, trial and error quickly discovers that the fastest way to diminish the negative rewards arising from advantageous inequity aversion is to clean up waste, since doing so creates more apples for others to consume. However, the indirect mechanism of disadvantageous inequity aversion and punishment lacks this property; while punishment may help exploration of new policies, it does not directly increase the attractiveness of waste cleaning.

The Harvest game requires passive abstention rather than active provision. In this setting, advantageous inequity aversion provides a noisy signal for sustainable behavior, because it is sensitive to the precise apple configuration in the environment, which changes rapidly over time. Hence advantageous inequity aversion does not greatly aid the exploration of policy space. Punishment, on the other hand, operates as a valuable shaping reward for learning, disincentivizing overconsumption at precisely the correct time and place.

In the Harvest game, disadvantageous inequity aversion generates cooperation in a grossly inefficient manner: huge amounts of collective resource are lost to fines (compare Figures 4D and 4E). This parallels human behavior in laboratory matrix games, e.g. [56, 57]. In the Cleanup game, advantageous-inequity-averse agents resolve the social dilemma without such losses, but must comprise a large proportion of the population to be successful. This mirrors the cultural modulation of advantageous inequity aversion in humans [58]. Evolution is hypothesized to have favored fairness as a mechanism for continued human cooperation [59]. It remains to be seen whether emergent inequity aversion can be obtained by evolving reinforcement learning agents.

We conclude by putting our approach in the context of prior work. Since our mechanism does not require explicitly training cooperating and defecting agents or modeling their behavior, it scales more easily to complex environments and large populations of agents. However, our method has several limitations. Firstly, our guilty agents are quite exploitable, as evidenced by the necessity of a homogeneous guilty population to achieve cooperation. Secondly, our agents use outcomes rather than predictions to inform their policies, which is known to be a problem in environments with high stochasticity [22]. Finally, the heterogeneity of the population is an additional hyperparameter in our model; clearly, one must set it appropriately, particularly in games with asymmetric outcomes. It is likely that a hybrid approach will be required to solve these challenging issues at scale.
References

[1] M. Olson, The Logic of Collective Action. Harvard University Press, 1965.
[2] G. Hardin, "The tragedy of the commons," Science, vol. 162, no. 3859, pp. 1243-1248, 1968.
[3] E. Ostrom, Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press, 1990.
[4] T. Dietz, E. Ostrom, and P. C. Stern, "The struggle to govern the commons," Science, vol. 302, no. 5652, pp. 1907-1912, 2003.
[5] E. Ostrom, J. Walker, and R. Gardner, "Covenants with and without a sword: Self-governance is possible," American Political Science Review, vol. 86, no. 2, pp. 404-417, 1992.
[6] E. Fehr and S. Gächter, "Altruistic punishment in humans," Nature, vol. 415, no. 6868, p. 137, 2002.
[7] E. Ostrom, "A behavioral approach to the rational choice theory of collective action: Presidential address, American Political Science Association, 1997," American Political Science Review, vol. 92, no. 1, pp. 1-22, 1998.
[8] E. Fehr and H. Gintis, "Human motivation and social cooperation: Experimental and analytical foundations," Annual Review of Sociology, vol. 33, pp. 43-64, 2007.
[9] D. G. Rand and M. A. Nowak, "Human cooperation," Trends in Cognitive Sciences, vol. 17, no. 8, pp. 413-425, 2013.
[10] E. Fehr and K. M. Schmidt, "A theory of fairness, competition, and cooperation," The Quarterly Journal of Economics, vol. 114, no. 3, pp. 817-868, 1999.
[11] A. Falk and U. Fischbacher, "A theory of reciprocity," Games and Economic Behavior, vol. 54, no. 2, pp. 293-315, 2006.
[12] T. W. Sandholm and R. H. Crites, "Multiagent reinforcement learning in the iterated prisoner's dilemma," Biosystems, vol. 37, no. 1-2, pp. 147-166, 1996.
[13] E. Munoz de Cote, A. Lazaric, and M. Restelli, "Learning to cooperate in multi-agent social dilemmas," in Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2006), pp. 783-790, 2006.
[14] M. A. Janssen, R. Holahan, A. Lee, and E. Ostrom, "Lab experiments for the study of social-ecological systems," Science, vol. 328, no. 5978, pp. 613-617, 2010.
[15] M. Janssen, "Introducing ecological dynamics into common-pool resource experiments," Ecology and Society, vol. 15, no. 2, 2010.
[16] M. Janssen, "The role of information in governing the commons: Experimental results," Ecology and Society, vol. 18, no. 4, 2013.
[17] J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel, "Multi-agent reinforcement learning in sequential social dilemmas," in Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), Sao Paulo, Brazil, 2017.
[18] J. Perolat, J. Z. Leibo, V. Zambaldi, C. Beattie, K. Tuyls, and T. Graepel, "A multi-agent reinforcement learning model of common-pool resource appropriation," in Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, 2017.
[19] J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch, "Learning with opponent-learning awareness," arXiv preprint arXiv:1709.04326, 2017.
[20] A. Lerer and A. Peysakhovich, "Maintaining cooperation in complex social dilemmas using deep reinforcement learning," arXiv preprint arXiv:1707.01068, 2017.
[21] A. Peysakhovich and A. Lerer, "Prosocial learning agents solve generalized stag hunts better than selfish ones," arXiv preprint arXiv:1709.02865, 2017.
[22] A. Peysakhovich and A. Lerer, "Consequentialist conditional cooperation in social dilemmas with imperfect information," arXiv preprint arXiv:1710.06975, 2017.
[23] M. Kleiman-Weiner, M. K. Ho, J. L. Austerweil, M. L. Littman, and J. B. Tenenbaum, "Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction," in CogSci, 2016.
[24] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, "Value-decomposition networks for cooperative multi-agent learning," arXiv preprint arXiv:1706.05296, 2017.
[25] J. J. Rousseau, Discourse on the Origin of Inequality. Marc-Michel Rey, 1755.
[26] H. L. A. Hart, "Are there any natural rights?," The Philosophical Review, vol. 64, no. 2, pp. 175-191, 1955.
[27] J. Rawls, "Justice as fairness," The Philosophical Review, vol. 67, no. 2, pp. 164-194, 1958.
[28] G. Klosko, "The principle of fairness and political obligation," Ethics, vol. 97, no. 2, pp. 353-362, 1987.
[29] B. S. Frey and I. Bohnet, "Institutions affect fairness: Experimental investigations," Journal of Institutional and Theoretical Economics, vol. 151, pp. 286-303, June 1995.
[30] C. Bicchieri and A. Chavez, "Behaving as expected: Public information and fairness norms," Journal of Behavioral Decision Making, vol. 23, pp. 161-178, Apr. 2010.
[31] J. Henrich, J. Ensminger, R. McElreath, A. Barr, C. Barrett, et al., "Markets, religion, community size, and the evolution of fairness and punishment," Science, vol. 327, pp. 1480-1484, Mar. 2010.
[32] R. Gibbons, A Primer in Game Theory. Harvester Wheatsheaf, 1992.
[33] C. Eckel and H. Gintis, "Blaming the messenger: Notes on the current state of experimental economics," Journal of Economic Behavior & Organization, vol. 73, no. 1, pp. 109-119, 2010.
[34] G. Charness and M. Rabin, "Understanding social preferences with simple tests," The Quarterly Journal of Economics, vol. 117, no. 3, pp. 817-869, 2002.
[35] D. Engelmann and M. Strobel, "Inequality aversion, efficiency, and maximin preferences in simple distribution experiments," American Economic Review, vol. 94, no. 4, pp. 857-869, 2004.
[36] C. Camerer, Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, 2003.
[37] G. R. Grice, "The relation of secondary reinforcement to delayed reward in visual discrimination learning," Journal of Experimental Psychology, vol. 38, no. 1, pp. 1-16, 1948.
[38] M. J. Kearns and S. P. Singh, "Bias-variance error bounds for temporal difference updates," in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT '00), San Francisco, CA, USA, pp. 142-147, Morgan Kaufmann Publishers Inc., 2000.
[39] B. F. Skinner, The Behavior of Organisms: An Experimental Analysis. D. Appleton-Century Company, 1938.
[40] L. S. Shapley, "Stochastic games," in Proceedings of the National Academy of Sciences of the United States of America, 1953.
[41] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the 11th International Conference on Machine Learning (ICML), pp. 157-163, 1994.
[42] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York City, NY, USA, June 19-24, 2016, pp. 1928-1937, 2016.
[43] W. E. Walsh, R. Das, G. Tesauro, and J. O. Kephart, "Analyzing complex strategic interactions in multi-agent systems," in AAAI-02 Workshop on Game-Theoretic and Decision-Theoretic Agents, pp. 109-118, 2002.
[44] M. P. Wellman, "Methods for empirical game-theoretic analysis," in Proceedings of the National Conference on Artificial Intelligence (AAAI), vol. 21, p. 1552, 2006.
[45] T. C. Schelling, "Hockey helmets, concealed weapons, and daylight saving: A study of binary choices with externalities," The Journal of Conflict Resolution, vol. 17, no. 3, pp. 381-428, 1973.
[46] P. Kollock, "Social dilemmas: The anatomy of cooperation," Annual Review of Sociology, vol. 24, no. 1, pp. 183-214, 1998.
[47] P. Oliver, "Rewards and punishments as selective incentives for collective action: Theoretical investigations," American Journal of Sociology, vol. 85, no. 6, pp. 1356-1375, 1980.
[48] Ö. Gürerk, B. Irlenbusch, and B. Rockenbach, "The competitive advantage of sanctioning institutions," Science, vol. 312, no. 5770, pp. 108-111, 2006.
[49] N. Chentanez, A. G. Barto, and S. P. Singh, "Intrinsically motivated reinforcement learning," in Advances in Neural Information Processing Systems, pp. 1281-1288, 2005.
[50] G. F. Loewenstein, L. Thompson, and M. H. Bazerman, "Social utility and decision making in interpersonal contexts," Journal of Personality and Social Psychology, vol. 57, no. 3, p. 426, 1989.
[51] C. Bellemare, S. Kröger, and A. Van Soest, "Measuring inequity aversion in a heterogeneous population using experimental decisions and subjective probabilities," Econometrica, vol. 76, no. 4, pp. 815-839, 2008.
[52] E. I. Hoppe and P. W. Schmitz, "Contracting under incomplete information and social preferences: An experimental study," The Review of Economic Studies, vol. 80, no. 4, pp. 1516-1544, 2013.
[53] K. Verbeeck, J. Parent, and A. Nowé, "Homo egualis reinforcement learning agents for load balancing," in Innovative Concepts for Agent-Based Systems, First International Workshop on Radical Agent Concepts (WRAC 2002), McLean, VA, USA, January 16-18, 2002, Revised Papers, pp. 81-91, 2002.
[54] S. de Jong and K. Tuyls, "Human-inspired computational fairness," Autonomous Agents and Multi-Agent Systems, vol. 22, no. 1, pp. 103-126, 2011.
[55] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1st ed., 1998.
[56] T. Yamagishi, "The provision of a sanctioning system as a public good," Journal of Personality and Social Psychology, vol. 51, no. 1, pp. 110-116, 1986.
[57] E. Fehr and S. Gachter, "Cooperation and punishment in public goods experiments," American Economic Review, vol. 90, pp. 980-994, September 2000.
[58] P. R. Blake, K. McAuliffe, J. Corbit, T. Callaghan, O. Barry, A. Bowie, L. Kleutsch, K. Kramer, E. Ross, H. Vongsachang, R. Wrangham, and F. Warneken, "The ontogeny of fairness in seven societies," Nature, vol. 528, 2015.
[59] S. F. Brosnan and F. B. M. de Waal, "Evolution of responses to (un)fairness," Science, vol. 346, no. 6207, p. 1251776, 2014.
", "award": [], "sourceid": 1687, "authors": [{"given_name": "Edward", "family_name": "Hughes", "institution": "DeepMind"}, {"given_name": "Joel", "family_name": "Leibo", "institution": "DeepMind"}, {"given_name": "Matthew", "family_name": "Phillips", "institution": "DeepMind"}, {"given_name": "Karl", "family_name": "Tuyls", "institution": "DeepMind"}, {"given_name": "Edgar", "family_name": "Dueñez-Guzman", "institution": "DeepMind"}, {"given_name": "Antonio", "family_name": "García Castañeda", "institution": "DeepMind"}, {"given_name": "Iain", "family_name": "Dunning", "institution": "DeepMind"}, {"given_name": "Tina", "family_name": "Zhu", "institution": "DeepMind"}, {"given_name": "Kevin", "family_name": "McKee", "institution": "DeepMind"}, {"given_name": "Raphael", "family_name": "Koster", "institution": "DeepMind"}, {"given_name": "Heather", "family_name": "Roff", "institution": "DeepMind"}, {"given_name": "Thore", "family_name": "Graepel", "institution": "DeepMind"}]}