{"title": "Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making", "book": "Advances in Neural Information Processing Systems", "page_first": 4712, "page_last": 4720, "abstract": "It is commonly believed that an agent making decisions on behalf of two or more principals who have different utility functions should adopt a Pareto optimal policy, i.e. a policy that cannot be improved upon for one principal without making sacrifices for another. Harsanyi's theorem shows that when the principals have a common prior on the outcome distributions of all policies, a Pareto optimal policy for the agent is one that maximizes a fixed, weighted linear combination of the principals\u2019 utilities. In this paper, we derive a more precise generalization for the sequential decision setting in the case of principals with different priors on the dynamics of the environment. We refer to this generalization as the Negotiable Reinforcement Learning (NRL) framework. In this more general case, the relative weight given to each principal\u2019s utility should evolve over time according to how well the agent\u2019s observations conform with that principal\u2019s prior. To gain insight into the dynamics of this new framework, we implement a simple NRL agent and empirically examine its behavior in a simple environment.", "full_text": "Negotiable Reinforcement Learning for Pareto\n\nOptimal Sequential Decision-Making\n\nNishant Desai\n\nCenter for Human-Compatible AI\nUniversity of California, Berkeley\nnishantdesai@berkeley.edu\n\nAndrew Critch\n\nDepartment of EECS\n\nUniversity of California, Berkeley\n\ncritch@berkeley.edu\n\nStuart Russell\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nrussell@cs.berkeley.edu\n\nAbstract\n\nIt is commonly believed that an agent making decisions on behalf of two or more\nprincipals who have different utility functions should adopt a Pareto optimal policy,\ni.e. 
a policy that cannot be improved upon for one principal without making\nsacri\ufb01ces for another. Harsanyi\u2019s theorem shows that when the principals have a\ncommon prior on the outcome distributions of all policies, a Pareto optimal policy\nfor the agent is one that maximizes a \ufb01xed, weighted linear combination of the\nprincipals\u2019 utilities. In this paper, we derive a more precise generalization for the\nsequential decision setting in the case of principals with different priors on the\ndynamics of the environment. We refer to this generalization as the Negotiable\nReinforcement Learning (NRL) framework. In this more general case, the relative\nweight given to each principal\u2019s utility should evolve over time according to how\nwell the agent\u2019s observations conform with that principal\u2019s prior. To gain insight\ninto the dynamics of this new framework, we implement a simple NRL agent and\nempirically examine its behavior in a simple environment.\n\n1\n\nIntroduction\n\nIt has been argued that the \ufb01rst AI systems with generally super-human cognitive abilities will play a\npivotal decision-making role in directing the future of civilization [Bostrom, 2014]. If that is the case,\nan important question will arise: Whose values will the \ufb01rst super-human AI systems serve? Since\nsafety is a crucial consideration in developing such systems, assuming the institutions building them\ncome to understand the risks and the time investments needed to address them [Baum, 2016], they\nwill have a large incentive to cooperate in their design rather than racing under time-pressure to build\ncompeting systems [Armstrong et al., 2016].\nTherefore, consider two nations\u2014allies or adversaries\u2014who must decide whether to cooperate in the\ndeployment of an extremely powerful AI system. Implicitly or explicitly, the resulting system would\nhave to strike compromises when con\ufb02icts arise between the wishes of those nations. 
How can they\nspecify the degree to which that system would be governed by the distinctly held principles of each\nnation? More mundanely, suppose a couple purchases a domestic robot. How should the robot strike\ncompromises when con\ufb02icts arise between the commands of its owners?\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIt is already an interesting and dif\ufb01cult problem to robustly align an AI system\u2019s values with those\nof a single human (or a group of humans in close agreement). Inverse reinforcement learning (IRL)\n\f[Russell, 1998] [Ng and Russell, 2000] [Abbeel and Ng, 2004] and cooperative inverse reinforcement\nlearning (CIRL) [Had\ufb01eld-Menell et al., 2016] represent successively more realistic early approaches to\nthis problem. But supposing some adequate solution eventually exists for aligning the values of a\nmachine intelligence with a single human decision-making unit, how should the values of a system\nserving multiple decision-makers be \u201caligned\u201d?\nIn the present work, we attempt to begin answering this question. We begin by observing some\nde\ufb01ciencies of optimizing a \ufb01xed, linear weighted sum of principals\u2019 utilities, as prescribed by\nHarsanyi\u2019s social aggregation theorem, in the case that those principals have differing beliefs about\nthe probability distributions dictating the agent\u2019s observations. We show that building a decision\nagent whose policy optimizes such an objective is not, in general, ex-ante Pareto optimal (i.e. \"as\nevaluated before the agent has taken any actions\"). 
Intuitively, linear weighted aggregation fails\nbecause, before the jointly owned agent has taken any actions, each principal evaluates its policy with\nrespect to their own beliefs, meaning that a policy that selectively prefers one principal over the other\nconditioned on its observations can be desirable to both principals.\nSection 3 addresses the shortcomings of Harsanyi-style preference aggregation by presenting the\nNegotiable Reinforcement Learning (NRL) framework. In this framework, we model each principal\u2019s\nprior on the environment and utility function as a Partially Observable Markov Decision Process\n(POMDP). We state necessary and suf\ufb01cient conditions for the Pareto optimality of an agent acting over\nthese POMDPs with policy \u03c0. We then construct a third POMDP and show that the optimal policy for\nthis single POMDP satis\ufb01es the conditions for Pareto optimality. We refer to an agent implementing\na policy that solves this reduced POMDP as an NRL agent.\nFollowing directly from this reduction is the intriguing property that a Pareto optimal policy must,\nover time, prefer the utility function of the principal whose beliefs are a better predictor of the agent\u2019s\nobservations. This counter-intuitive result constitutes the main theorem of this paper. It can be\nseen as settling a kind of bet between the two principals: whichever principal makes the correct\npredictions gets to have their utility prioritized. In Section 4, we implement a simple NRL agent and\nmake empirical observations of this bet-settling behavior.
The present work shows that this solution is inappropriate when\nprincipals have different beliefs, and Theorem 4 may be viewed as an extension of Harsanyi\u2019s form\nthat accounts simultaneously for differing priors and the prospect of future observations. Indeed,\nHarsanyi\u2019s form follows as a direct corollary of Theorem 4 when principals do share the same beliefs.\n\nMulti-agent systems. Zhang and Shah [2014] may be considered a sequential decision-making\napproach to social choice: they use MDPs to represent the decisions of principals in a competitive\ngame, and exhibit an algorithm for the principals that, if followed, arrives at a Pareto optimal Nash\nequilibrium satisfying a certain fairness criterion. Among the literature surveyed here, this paper\nis the closest to the present work in terms of its intended application: roughly speaking, achieving\nmutually desirable outcomes via sequential decision-making. However, the work is concerned with\nan ongoing interaction between the principals, rather than selecting a policy for a single agent to\nfollow as in this paper.\n\nMulti-objective sequential decision-making. There is also a good deal of work on Multi-\nObjective Optimization (MOO) [Tzeng and Huang, 2011], including for sequential decision-making,\nwhere solution methods have been called Multi-Objective Reinforcement Learning (MORL). For\ninstance, G\u00e1bor et al. [1998] introduce a MORL method called Pareto Q-learning for learning a set\nof Pareto optimal policies for a Multi-Objective MDP (MOMDP). Soh and Demiris [2011] de\ufb01ne\nMulti-Reward Partially Observable Markov Decision Processes (MR-POMDPs), and use genetic\nalgorithms to produce non-dominated sets of policies for them. Roijers et al. [2015] refer to the same\nproblems as Multi-objective POMDPs (MOPOMDPs), and provide a bounded approximation method\nfor the optimal solution set for all possible weightings of the objectives. 
Wang [2014] surveys MORL\n\fmethods, and contributes Multi-Objective Monte-Carlo Tree Search (MOMCTS) for discovering\nmultiple Pareto optimal solutions to a multi-objective optimization problem.\nHowever, none of these or related works address scenarios where the objectives are derived from\nprincipals with differing beliefs, from which the priority-shifting phenomenon of Theorem 4 arises.\nDiffering beliefs are likely to play a key role in negotiations, so for that purpose, the formulation of\nmulti-objective decision-making adopted here is preferable.\n\n3 Negotiable Reinforcement Learning\n\nConsider, informally, a scenario wherein two principals \u2014 perhaps individuals, companies, or states\n\u2014 are considering cooperating to build or otherwise obtain a machine that will then interact with an\nenvironment on their behalf.1 In such a scenario, the principals will tend to bargain for \u201chow much\u201d\nthe machine will prioritize their separate interests, so to begin, we need some way to quantify \u201chow\nmuch\u201d each principal is prioritized.\nFor instance, one might model the machine as maximizing the expected value, given its observations,\nof some utility function U of the environment that equals a weighted sum\n\nw(1)U (1) + w(2)U (2)    (1)\n\nof the principals\u2019 individual utility functions U (1) and U (2), as Harsanyi\u2019s social aggregation theorem\nrecommends [Harsanyi, 1980]. Then the bargaining process could focus on choosing the values of\nthe weights w(i).\nHowever, this turns out to be a bad idea. As we shall see in the following example, this solution form\nis not generally compatible with Pareto optimality when agents have different beliefs. Harsanyi\u2019s\nsetting does not account for agents having different priors, nor for decisions being made sequentially,\nafter future observations. In such a setting, we need a new form of solution, exhibited here.\n\nA cake-splitting scenario. 
Alice (principal 1) and Bob (principal 2) are about to be presented with\na cake which they can choose to split in half to share, or give entirely to one of them. They have\n(built or purchased) a robot that will make the cake-splitting decision on their behalf. Alice\u2019s utility\nfunction returns 0 if she gets no cake, 20 if she gets half a cake, or 30 if she gets a whole cake. Bob\u2019s\nutility function works similarly.\nHowever, Alice and Bob have slightly different beliefs about how the environment works. They both\nagree on the state of the environment that the robot will encounter at \ufb01rst: a room with a cake in\nit (s1 = \u201ccake\u201d). But Alice and Bob have different predictions about how the robot\u2019s sensors will\nperceive the cake: Alice thinks that when the robot perceives the cake, it is 90% likely to appear\nwith a red tint (o1 = \u201cred\u201d), and 10% likely to appear with a green tint (o1 = \u201cgreen\u201d), whereas Bob\nbelieves the exact opposite. In either case, upon seeing the cake, the robot will either give Alice the\nentire cake (a1 = (all, none)), split the cake half-and-half (a1 = (half, half)), or give Bob the entire\ncake (a1 = (none, all)). Moreover, Alice and Bob have common knowledge of all these facts.\nNow, consider the following Pareto optimal policy that favors Alice (principal 1) when o1 is red, and\nBob (principal 2) when o1 is green:\n\n\u02c6\u03c0(\u2212 | red) = 100%(all, none)\n\u02c6\u03c0(\u2212 | green) = 100%(none, all)\n\nThis policy can be viewed intuitively as a bet between Alice and Bob about the value of o1, and is\nhighly appealing to both principals:\n\nE(1)[U (1); \u02c6\u03c0] = 90%(30) + 10%(0) = 27\nE(2)[U (2); \u02c6\u03c0] = 10%(0) + 90%(30) = 27\n\nIn particular, \u02c6\u03c0 is more appealing to both Alice and Bob than an agreement to deterministically split\nthe cake (half, half). 
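The expectations above can be checked numerically. The following sketch (with hypothetical variable and function names, not taken from the paper's implementation) computes each principal's ex-ante expected utility, under their own beliefs, for both the betting policy \u02c6\u03c0 and the deterministic (half, half) split:

```python
# Each principal's subjective belief over the robot's observation o1.
beliefs = {
    "alice": {"red": 0.9, "green": 0.1},
    "bob":   {"red": 0.1, "green": 0.9},
}

# Utility of the share of cake a principal receives: none -> 0, half -> 20, all -> 30.
utility = {"none": 0, "half": 20, "all": 30}

# The betting policy: Alice gets everything on "red", Bob gets everything on "green".
pi_hat = {"red": ("all", "none"), "green": ("none", "all")}
# The fixed compromise: split the cake regardless of the observation.
pi_split = {"red": ("half", "half"), "green": ("half", "half")}

def expected_utility(principal_index, belief, policy):
    """Ex-ante expected utility of `policy` for one principal under their own belief."""
    return sum(
        prob * utility[policy[obs][principal_index]]
        for obs, prob in belief.items()
    )

for name, idx in [("alice", 0), ("bob", 1)]:
    eu_bet = expected_utility(idx, beliefs[name], pi_hat)
    eu_split = expected_utility(idx, beliefs[name], pi_split)
    print(name, eu_bet, eu_split)
```

Both principals assign the bet an expected utility of 27, versus 20 for the guaranteed split, so each strictly prefers \u02c6\u03c0 ex ante, even though ex post one of them will receive nothing.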
We start to see that when principals evaluate their expected returns under a\npolicy \u03c0 with respect to differing beliefs about action outcomes, they may mutually agree to a policy\nthat favors one principal over the other, contingent on the action-observation history. We formalize\nthis intuition and explore its consequences in the following sections.\n\n1The results here all generalize from two principals to n principals being combined successively in any order,\nbut for clarity of exposition, the two-person case is prioritized.\n\n\f3.1 A POMDP formulation\n\nLet us formalize the machine\u2019s decision-making situation using the structure of a Partially Observable\nMarkov Decision Process (POMDP) [Sondik, 1973]. It is assumed that the principals will have\ncommon knowledge of the policy \u03c0 = (\u03c01, . . . , \u03c0n) they select for the machine to implement, but that\nthe principals may have different beliefs about how the environment works, and of course different\nutility functions. We refer to this as the common knowledge assumption.\nWe encode each principal j\u2019s outlook as a POMDP, D(j) = (S (j),A, T (j), U (j),O, \u2126(j), n), which\nsimultaneously represents that principal\u2019s beliefs about the environment, and the principal\u2019s utility\nfunction.\nS (j) is a set of possible states s of the environment.\nA is the set of possible actions a available to the NRL agent.\nT (j) gives the conditional probabilities principal j believes will govern the environment state transitions,\ni.e., P(j)(si+1 | si, ai).\nU (j) is principal j\u2019s utility function from sequences of environmental states (s1, . . . , sn) to R. 
2\nO is the set of possible observations o of the NRL agent.\n\u2126(j) gives the conditional probabilities principal j believes will govern the agent\u2019s observations, i.e.,\nP(j)(oi | si).\nn is the time horizon.\nThus, principal j\u2019s subjective probability of an outcome (\u00afs, \u00afo, \u00afa), for any \u00afs \u2208 (S (j))n, is given by a\nprobability distribution P(j) that takes \u03c0 as a parameter:\n\nP(j)(\u00afs, \u00afo, \u00afa; \u03c0) := P(j)(s1) \u00b7 \u220f_{i=1}^{n} P(j)(oi | si) \u03c0(ai | o\u2264ia