{"title": "Multi-View Decision Processes: The Helper-AI Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 5443, "page_last": 5452, "abstract": "We consider a two-player sequential game in which agents have the same reward function but may disagree on the transition probabilities of an underlying Markovian model of the world. By committing to play a specific policy, the agent with the correct model can steer the behavior of the other agent, and seek to improve utility. We model this setting as a multi-view decision process, which we use to formally analyze the positive effect of steering policies. Furthermore, we develop an algorithm for computing the agents' achievable joint policy, and we experimentally show that it can lead to a large utility increase when the agents' models diverge.", "full_text": "Multi-View Decision Processes:\n\nThe Helper-AI Problem\n\nChalmers University of Technology & University of Lille\n\nChristos Dimitrakakis\n\nDavid C. Parkes\nHarvard University\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\nGoran Radanovic\nHarvard University\n\nPaul Tylkin\n\nHarvard University\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\nAbstract\n\nWe consider a two-player sequential game in which agents have the same reward\nfunction but may disagree on the transition probabilities of an underlying Marko-\nvian model of the world. 
By committing to play a specific policy, the agent with the correct model can steer the behavior of the other agent, and seek to improve utility. We model this setting as a multi-view decision process, which we use to formally analyze the positive effect of steering policies. Furthermore, we develop an algorithm for computing the agents' achievable joint policy, and we experimentally show that it can lead to a large utility increase when the agents' models diverge.

1 Introduction

In the past decade, we have been witnessing the fulfillment of Licklider's profound vision on AI [Licklider, 1960]:

    Man-computer symbiosis is an expected development in cooperative interaction between men and electronic computers.

Needless to say, such collaboration between humans and AIs is natural in many real-world AI problems. As a motivating example, consider the case of autonomous vehicles, where a human driver can override the AI driver if needed. With advances in AI, the human will benefit most if she allows the AI agent to assume control and drive optimally. However, this might not be achievable: due to human behavioral biases, such as over-weighting the importance of rare events, the human might incorrectly override the AI. In this way, the misaligned models of the two drivers can lead to a decrease in utility. In general, this problem may occur whenever two agents disagree on their view of reality, even if they cooperate to achieve a common goal.

Formalizing this setting leads to a class of sequential multi-agent decision problems that extend stochastic games. While in a stochastic game there is an underlying transition kernel to which all agents (players) agree, the same is not necessarily true in the described scenario. Each agent may have a different transition model.
We focus on a leader-follower setting in which the leader commits to a policy that the follower then best-responds to, according to the follower's model. Mapped to our motivating example, this would mean that the AI driver is aware of human behavioral biases and takes them into account when deciding how to drive.

To incorporate both sequential and stochastic aspects, we model this as a multi-view decision process. Our multi-view decision process is based on an MDP model, with two, possibly different, transition kernels. One of the agents, hereafter denoted as P1, is assumed to have the correct transition kernel and is chosen to be the leader of the Stackelberg game: it commits to a policy that the second agent (P2) best-responds to according to its own model. The agents have the same reward function, and are in this sense cooperative. In an application setting, while the human (P2) may not be a planner, we motivate our set-up as modeling the endpoint of an adaptive process that leads P2 to adopt a best response to the policy of P1.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Using the multi-view decision process, we analyze the effect of P2's imperfect model on the achieved utility. We place an upper bound on the utility loss due to this, and also provide a lower bound on how much P1 gains by knowing P2's model. One of our main analysis tools is the amount of influence an agent has, i.e., how much its actions affect the transition probabilities, both according to its own model and according to the model of the other agent. We also develop an algorithm, extending backwards induction for simultaneous-move sequential games [cf. Bošanský et al., 2016], to compute a pair of policies that constitute a subgame perfect equilibrium.

In our experiments, we introduce intervention games as a way to construct example scenarios.
In an intervention game, an AI and a human share control of a process, and the human can intervene to override the AI's actions but suffers some cost in doing so. This allows us to derive a multi-view process from any single-agent MDP. We consider two domains: the intervention-game variant of the shelter-food game introduced by Guo et al. [2013], and an autonomous driving problem that we introduce here. Our results show that the proposed approach provides a large increase in utility in each domain, thus overcoming the deficiencies of P2's model, when the latter model is known to the AI.

1.1 Related work

Environment design [Zhang et al., 2009, Zhang and Parkes, 2008] is a related problem, where a first agent seeks to modulate the behavior of a second agent. However, the interaction between agents occurs through finding a good modification of the second agent's reward function: the AI observes a human performing a task, and uses inverse reinforcement learning [Ng et al., 2000] to estimate the human's reward function. Then it can assign extrinsic reward to different states in order to improve the human's policy. A similar problem in single-agent reinforcement learning is how to use internal rewards to improve the performance of a computationally bounded reinforcement learning agent [Sorg et al., 2010]. For example, even a myopic agent can maximize expected utility over a long time horizon if augmented with appropriately designed internal rewards. Our model differs from these prior works in that the interaction between a 'helper agent' and a second agent is through taking actions in the same environment as the second agent.

In cooperative inverse reinforcement learning [Hadfield-Menell et al., 2016], an AI wants to cooperate with a human but does not initially understand the task.
While their framework allows for simultaneous moves of the AI and the human, they only apply it to two-stage games, where the human demonstrates a policy in the first stage and the AI imitates it in the second stage. They show that the human should take into account the AI's best response when providing demonstrations, and develop an algorithm for computing an appropriate demonstration policy. Our focus is on joint actions in a multi-period, uncertain environment, rather than teaching. The model of Amir et al. [2016] is also different, in that it considers the problem of how a teacher can optimally give advice to a sub-optimal learner, and is thus focused on communication and adaptation rather than interaction through actions. Finally, Elmalech et al. [2015] consider an advice-giving AI in single-shot games, where the human has an incorrect model. They experimentally find that when the AI heuristically models human expectations when giving advice, performance improves. We find that this also holds in our more general setting.

We cannot use standard methods for computing optimal strategies in stochastic games [Bošanský et al., 2015, Zinkevich et al., 2005], as the two agents have different models of the transitions between states. At the other extreme, a very general formalism for representing agent beliefs, such as that of Gal and Pfeffer [2008], is not well suited, because we have a Stackelberg setting and the problem of the follower is standard. Our approach is to extend backwards induction [cf. Bošanský et al., 2016, Sec. 4] to the case of misaligned models in order to obtain a subgame perfect policy for the AI.

Paper organization. Section 2 formalises the setting and its basic properties, and provides a lower bound on the improvement P1 obtains when P2's model is known. Section 3 introduces a backwards induction algorithm, while Section 4 discusses the experimental results.
We conclude with Section 5. Finally, Appendix A collects all the proofs, additional technical material and experimental details.

2 The Setting and Basic Properties

We consider a two-agent sequential stochastic game between agents P1 and P2, who disagree on the underlying model of the world, with the i-th agent's model being μi, but share the same reward function. More formally,

Definition 1 (Multi-view decision process (MVDP)). A multi-view decision process G = ⟨S, A, σ1, σ2, μ1, μ2, ρ, γ⟩ is a game between two agents, P1 and P2, who share the same reward function. The game has a state space S, with S ≜ |S|, action space A = A1 × A2, with A ≜ |A|, starting state distributions σ1, σ2, transition kernels μ1, μ2, reward function¹ ρ : S → [0, 1], and discount factor γ ∈ [0, 1].

At time t, the agents observe the state s_t, take a joint action a_t = (a_{t,1}, a_{t,2}), and receive reward r_t = ρ(s_t). However, the two agents may have a different view of the game, with agent i modelling the transition probabilities of the process as μi(s_{t+1} | s_t, a_t) for the probability of the next state s_{t+1} given the current state s_t and joint action a_t. Each agent's actions are drawn from a policy πi, which may be an arbitrary behavioral policy, fixed at the start of the game. For a given policy pair π = (π1, π2), with πi ∈ Πi and Π ≜ Π1 × Π2, the payoff from the point of view of the i-th agent, ui : Π → R, is defined to be:

    ui(π) = E^π_{μi}[U | s1 ∼ σi],   U ≜ Σ_{t=1}^T γ^{t−1} ρ(s_t).   (2.1)

For simplicity of presentation, we define the reward r_t = ρ(s_t) at time t as a function of the state only, although an extension to state-action reward functions is trivial.
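Definition 1 and the payoff in (2.1) can be made concrete with a small sketch. The Python below is illustrative only (the class layout and all names are ours, not from the paper): it stores one tabular transition kernel per agent and estimates ui(π) by Monte Carlo rollouts under agent i's own model μi.

```python
import random

class MVDP:
    """Minimal multi-view decision process: two agents share a reward
    function rho, but agent i evaluates outcomes under its own
    transition kernel mu[i].  Illustrative sketch, not the paper's code."""

    def __init__(self, states, actions1, actions2, sigma, mu, rho, gamma):
        self.states = states      # state space S
        self.actions1 = actions1  # A_1
        self.actions2 = actions2  # A_2
        self.sigma = sigma        # starting distribution: {s: prob}
        self.mu = mu              # mu[i][(s, a1, a2)] -> {s_next: prob}
        self.rho = rho            # rho[s] -> reward in [0, 1]
        self.gamma = gamma        # discount factor

    def payoff(self, policy1, policy2, agent, horizon=50, episodes=2000, seed=0):
        """Monte Carlo estimate of u_i(pi) = E^pi_{mu_i}[sum_t gamma^{t-1} rho(s_t)],
        i.e. the expected discounted utility under agent i's model."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(episodes):
            s = rng.choices(list(self.sigma), weights=self.sigma.values())[0]
            discount, u = 1.0, 0.0
            for _ in range(horizon):
                u += discount * self.rho[s]
                a1, a2 = policy1(s, rng), policy2(s, rng)
                nxt = self.mu[agent][(s, a1, a2)]  # agent's own view of the world
                s = rng.choices(list(nxt), weights=nxt.values())[0]
                discount *= self.gamma
            total += u
        return total / episodes
```

Evaluating `payoff` for the same joint policy with `agent=0` versus `agent=1` shows how identical behavior receives different payoffs under the two models, which is exactly the gap that the steering analysis below exploits.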
The reward, as well as the utility U (the discounted sum of rewards over time), is the same for both agents for a given sequence of states. However, the payoff for agent i is its expected utility under model μi, and can be different for each agent.

Any two-player stochastic game can be cast into an MVDP:

Lemma 1. Any two-player general-sum stochastic game (SG) can be reduced to a two-player MVDP in polynomial time and space.

The proof of Lemma 1 is in Appendix A.

2.1 Stackelberg setting

We consider optimal policies from the point of view of P1, who is trying to assist a misguided P2. For simplicity, we restrict our attention to the Stackelberg setting, i.e., where P1 commits to a specific policy π1 at the start of the game. This simplifies the problem for P2, who can play the optimal response according to its own model of the world. We begin by defining the (potentially unachievable) optimal joint policy, where both policies are chosen to maximise the same utility function:

Definition 2 (Optimal joint policy). A joint policy π̄ is optimal under σ and μ1 iff u1(π̄) ≥ u1(π) for all π ∈ Π. We furthermore use ū1 ≜ u1(π̄) to refer to the value of the jointly optimal policy.

This value may not be achievable, even though the two agents share a reward function, as the second agent's model does not agree with the first agent's, and so their expected utilities are different. To model this, we define the Stackelberg utility of policy π1 for the first agent as:

    u^St_1(π1) ≜ u1(π1, π^B_2(π1)),   π^B_2(π1) = arg max_{π2 ∈ Π2} u2(π1, π2),   (2.2)

i.e.
the value of the policy when the second agent best responds to agent one's policy under the second agent's model.² The following defines the highest utility that P1 can achieve.

¹For simplicity we consider state-dependent rewards bounded in [0, 1]. Our results are easily generalizable to ρ : S × A → [0, 1], through scaling by a factor of B and shifting by b for any reward function in [b, b + B].

²If there is no unique best response, we define the utility in terms of the worst-case best response.

Definition 3 (Optimal policy). The optimal policy for P1, denoted by π*1, is the one maximizing the Stackelberg utility, i.e., u^St_1(π*1) ≥ u^St_1(π1) for all π1 ∈ Π1, and we use u*1 ≜ u^St_1(π*1) to refer to the value of this optimal policy.

In the remainder of the technical discussion, we will characterize P1 policies in terms of how much worse they are than the jointly optimal policy, as well as how much better they can be than the policy that blithely assumes that P2 shares the same model.

We start with some observations about the nature of the game when one agent fixes its policy, and we argue how the difference between the models of the two agents affects the utility functions. We then combine this with a definition of influence to obtain bounds on the loss due to the difference in the models.

When agent i fixes a Markov policy πi, the game is an MDP for agent j. However, if agent i's policy is not Markovian, the resulting game is not an MDP on the original state space. We show that if P1 acts as if P2 has the correct transition kernel, then the resulting joint policy has value bounded by the L1 norm between the true kernel and agent 2's actual kernel. We begin by establishing a simple inequality to show that knowledge of the model μ2 is beneficial for P1.

Lemma 2.
For any MVDP, the utility of the jointly optimal policy is greater than that of the (achievable) optimal policy, which is in turn greater than that of the policy that assumes that μ2 = μ1:

    u1(π̄) ≥ u^St_1(π*1) ≥ u^St_1(π̄1).   (2.3)

Proof. The first inequality follows from the definition of the jointly optimal policy and u^St_1. For the second inequality, note that the middle term is a maximizer for the right-hand side.

Consequently, P1 must be able to do (weakly) better if it knows μ2 than if it just assumes that μ2 = μ1. However, this does not tell us how much (if any) improvement we can obtain. Our idea is to see what policy π1 we would need to play in order to make P2 play π̄2, and to measure the distance of this policy from π̄1. To obtain a useful bound, we need a measure of how much P1 must deviate from π̄1 in order for P2 to play π̄2. For this, we define the notion of influence. This captures the amount by which an agent i can affect the game in the eyes of agent j. In particular, it is the maximal amount by which agent i can affect the transition distribution of agent j by changing i's action at each state s:

Definition 4 (Influence). The influence of agent i on the transition distribution of model μj is defined as the vector:

    I_{i,j}(s) ≜ max_{a_{t,−i}} max_{a_{t,i}, a′_{t,i}} ‖μj(· | s_t = s, a_{t,i}, a_{t,−i}) − μj(· | s_t = s, a′_{t,i}, a_{t,−i})‖1,   (2.4)

where the norm is over the difference in next-state distributions s_{t+1} for the two actions.

Thus, I_{1,1} describes the actual influence of P1 on the transition probabilities, while I_{1,2} describes the perceived influence of P1 by P2.
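For tabular models, the influence in Definition 4 can be computed directly. A minimal sketch (our own illustrative code, with `mu_j` mapping a state and joint action to a next-state distribution) enumerates the action pairs and takes the maximal L1 change:

```python
def influence(mu_j, states, actions_i, actions_other, s, agent_i=0):
    """I_{i,j}(s) from Definition 4: the largest L1 change agent i can
    cause in mu_j's next-state distribution at state s by switching its
    own action, maximized over the other agent's action.
    mu_j maps (s, a1, a2) -> {s_next: prob}.  Illustrative sketch."""
    def key(ai, ao):
        # place agent i's action in the correct slot of the joint action
        return (s, ai, ao) if agent_i == 0 else (s, ao, ai)

    best = 0.0
    for ao in actions_other:          # max over a_{t,-i}
        for ai in actions_i:          # max over action pairs a_{t,i}, a'_{t,i}
            for ai2 in actions_i:
                p, q = mu_j[key(ai, ao)], mu_j[key(ai2, ao)]
                l1 = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in states)
                best = max(best, l1)
    return best
```

Since L1 distance between distributions lies in [0, 2], an influence of 2 means agent i can move μj between disjoint next-state supports, while 0 means its action is (believed to be) irrelevant at s.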
We will use influence to define a μ-dependent distance between policies, capturing the effect of an altered policy on the model:

Definition 5 (Policy distance). The distance between policies πi, π′i under model μj is:

    ‖πi − π′i‖_{μj} ≜ max_{s ∈ S} ‖πi(· | s) − π′i(· | s)‖1 · I_{i,j}(s).   (2.5)

These two definitions result in the following Lipschitz condition on the utility function, whose proof can be found in Appendix A.

Lemma 3. For any fixed π2, and any π1, π′1: ui(π1, π2) ≤ ui(π′1, π2) + ‖π1 − π′1‖_{μi} γ/(1 − γ)², with a symmetric result holding for any fixed policy π1 and any pair π2, π′2.

Lemma 3 bounds the change in utility due to a change in policy by P1 with respect to i's payoff. As shall be seen in the next section, it allows us to analyze how close the utility we can achieve comes to that of the jointly optimal policy, and how much can be gained by not naively assuming that the model of P2 is the same.

2.2 Optimality

In this section, we illuminate the relationship between different types of policies. First, we show that if P1 simply assumes μ2 = μ1, it only suffers a bounded loss relative to the jointly optimal policy. Subsequently, we prove that knowing μ2 allows P1 to find an improved policy.

Lemma 4. Consider the optimal policy π̄1 for the modified game G̃ = ⟨S, A, σ1, σ1, μ1, μ1, ρ, γ⟩ where P2's model is correct.
Then π̄1 is Markov and achieves utility ū in G̃, while its utility in G satisfies:

    u^St_1(π̄1) ≥ ū − 2‖μ1 − μ2‖1 / (1 − γ)²,   where   ‖μ1 − μ2‖1 ≜ max_{s_t, a_t} ‖μ1(· | s_t, a_t) − μ2(· | s_t, a_t)‖1.

As this bound depends on the maximum over all state-action pairs, we refine it in terms of the influence of each agent's actions. This also allows us to measure the loss in terms of the difference in P2's actual and desired response, rather than the difference between the two models, which can be much larger.

Corollary 1. If P2's best response to π̄1 is π^B_2(π̄1) ≠ π̄2, then our loss relative to the jointly optimal policy is bounded by u1(π̄1, π̄2) − u1(π̄1, π^B_2(π̄1)) ≤ ‖π^B_2(π̄1) − π̄2‖_{μ1} γ/(1 − γ)².

Proof. This follows from Lemma 3 by fixing π̄1 for the policy pairs π^B_2(π̄1), π̄2 under μ1.

While the previous corollary gave us an upper bound on the loss we incur if we ignore the beliefs of P2, we can bound the loss of the optimal Stackelberg policy in the same way:

Corollary 2. The difference between the optimal utility u1(π̄1, π̄2) and the optimal Stackelberg utility u^St_1(π*1) is bounded by u1(π̄1, π̄2) − u^St_1(π*1) ≤ ‖π^B_2(π̄1) − π̄2‖_{μ1} γ/(1 − γ)².

Proof. The result follows directly from Corollary 1 and Lemma 2.

This bound is not very informative by itself, as it does not suggest an advantage for the optimal Stackelberg policy.
Instead, we can use Lemma 3 to lower bound the increase in utility obtained relative to just playing the optimistic policy π̄1. We start by observing that when P2 responds with some π̂2 to π̄1, P1 could improve upon this by playing π̂1 = π^B_1(π̂2), the best response to π̂2, if P1 could somehow force P2 to stick to π̂2. We can define

    Δ ≜ u1(π̂1, π̂2) − u1(π̄1, π̂2)   (2.6)

to be the potential advantage from switching to π̂1. Theorem 1 characterizes how close to this advantage P1 can get by playing a stochastic policy π^α_1(a | s) ≜ α π̄1(a | s) + (1 − α) π̂1(a | s), while ensuring that P2 sticks to π̂2.

Theorem 1 (A sufficient condition for an advantage over the naive policy). Let π̂2 = π^B_2(π̄1) be the response of P2 to the optimistic policy π̄1 and assume Δ > 0. Then we can obtain an advantage of at least:

    Δ − γ‖π̄1 − π̂1‖_{μ1} / (1 − γ)² + (δ/2) · ‖π̄1 − π̂1‖_{μ1} / ‖π̄1 − π̂1‖_{μ2},   (2.7)

where δ ≜ u2(π̄1, π̂2) − max_{π2 ≠ π̂2} u2(π̄1, π2) is the gap between π̂2 and all other deterministic policies of P2 when P1 plays π̄1.

We have shown that knowledge of μ2 allows P1 to obtain improved policies compared to simply assuming μ2 = μ1, and that this improvement depends on both the real and perceived effects of a change in P1's policy.
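The mixture π^α_1 in Theorem 1 is easy to construct for tabular policies. The sketch below (function names are ours, for illustration) builds the convex combination and computes the per-state L1 term of Definition 5; the influence weighting I_{i,j}(s) is omitted here for brevity:

```python
def mix_policies(pi_bar, pi_hat, alpha):
    """Stochastic commitment pi^alpha(a|s) = alpha*pi_bar(a|s) + (1-alpha)*pi_hat(a|s),
    as in Theorem 1's construction.  Policies are dicts s -> {a: prob}.
    Illustrative sketch, not the paper's code."""
    mixed = {}
    for s in pi_bar:
        actions = set(pi_bar[s]) | set(pi_hat[s])
        mixed[s] = {a: alpha * pi_bar[s].get(a, 0.0)
                       + (1 - alpha) * pi_hat[s].get(a, 0.0)
                    for a in actions}
    return mixed

def policy_l1_distance(pi_a, pi_b):
    """max_s ||pi_a(.|s) - pi_b(.|s)||_1, the L1 ingredient of the policy
    distance in Definition 5 (without the influence weight I_{i,j}(s))."""
    dist = 0.0
    for s in pi_a:
        actions = set(pi_a[s]) | set(pi_b[s])
        dist = max(dist, sum(abs(pi_a[s].get(a, 0.0) - pi_b[s].get(a, 0.0))
                             for a in actions))
    return dist
```

Shrinking α moves the commitment toward π̂1 and grows the advantage term Δ, but also grows the distances in (2.7); the theorem's bound captures how far this trade-off can be pushed before P2 abandons π̂2.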
In the next section we develop an efficient dynamic programming algorithm for finding a good policy for P1.

3 Algorithms for the Stackelberg Setting

In the Stackelberg setting, we assume that P1 commits to a policy π1, and this policy is observed by P2. Because of this, it is sufficient for P2 to use a Markov policy, and this can be calculated in time polynomial in the number of states and actions.

However, there is a polynomial reduction from stochastic games to MVDPs (Lemma 1), and since Letchford et al. [2012] show that computing optimal commitment strategies is NP-hard, the planning problem for MVDPs is also NP-hard. Another difficulty is that dominating policies in the MDP sense may not exist in MVDPs.

Definition 6 (Dominating policies). A dominating policy π satisfies V^π(s) ≥ V^{π′}(s) for all s ∈ S, where V^π(s) = E^π(U | s0 = s).

Dominating policies have the nice property that they are also optimal for any starting distribution σ. However, dominating, stationary Markov policies need not exist in our setting.

Theorem 2. A dominating, stationary Markov policy may not exist in a given MVDP.

The proof of this theorem is given by a counterexample in Appendix A, where the optimal policy depends on the history of previously visited states.

In the trivial case when μ1 = μ2, the problem can be reduced to a Markov decision process, which can be solved in O(S²A) time [Mansour and Singh, 1999, Littman et al., 1995]. Generally, however, the commitment by P1 creates new dependencies that render the problem inherently non-Markovian with respect to the state s_t, and thus harder to solve.
In particular, even though the dynamics of the environment are Markovian with respect to the state s_t, the MVDP only becomes Markov in the Stackelberg setting with respect to the hyper-state η_t = (s_t, π_{t:T,1}), where π_{t:T,1} is the commitment by P1 for steps t, ..., T. To see that the game is non-Markovian, we only need to consider a single transition from s_t to s_{t+1}. P2's action depends not only on the action a_{t,1} of P1, but also on the expected utility the agent will obtain in the future, which in turn depends on π_{t:T,1}. Consequently, state s_t is not a sufficient statistic for the Stackelberg game.

3.1 Backwards Induction

These difficulties aside, we now describe a backwards induction algorithm for approximately solving MVDPs. The algorithm can be seen as a generalization of the backwards induction algorithm for simultaneous-move stochastic games [cf. Bošanský et al., 2016] to the case of disagreement on the transition distribution.

In our setting, at stage t of the interaction, P2 has observed the current state s_t and also knows the commitment of P1 for all future periods. P2 now chooses the action

    a*_{t,2}(π1) ∈ arg max_{a_{t,2}} ρ(s_t) + γ Σ_{a_{t,1}} π1(a_{t,1} | s_t) Σ_{s_{t+1}} μ2(s_{t+1} | s_t, a_{t,1}, a_{t,2}) · V_{2,t+1}(s_{t+1}).   (3.1)

Thus, for every state, there is a well-defined continuation for P2. Now, P1 needs to choose an action. This can be done easily, since we know P2's continuation, and so we can define a value for each state-action-action triplet for either agent:

    Q_{i,t}(s_t, a_{t,1}, a_{t,2}) = ρ(s_t) + γ Σ_{s_{t+1}} μi(s_{t+1} | s_t, a_{t,1}, a_{t,2}) · V_{i,t+1}(s_{t+1}).

As the agents act simultaneously, the policy of P1 needs to be stochastic. The local optimization problem can be formed as a set of linear programs (LPs), one for each action a2 ∈ A2:

    max_{π1} Σ_{a1} π1(a1 | s) · Q_{t,1}(s, a1, a2)

    s.t.
∀â2: Σ_{a1} π1(a1 | s) · Q_{t,2}(s, a1, a2) ≥ Σ_{a1} π1(a1 | s) · Q_{t,2}(s, a1, â2),

    ∀â1: 0 ≤ π1(â1 | s) ≤ 1, and Σ_{a1} π1(a1 | s) = 1.

Each LP yields the best possible policy at time t such that P2 is induced to play a2. From these, we select the best one. At the end, the algorithm, given the transition kernels (μ1, μ2) and the time horizon T, returns an approximately optimal joint policy (π*1, π*2) for the MVDP. The complete pseudocode is given in Appendix C, Algorithm 1.

As this solves a finite horizon problem, the policy is inherently non-stationary. In addition, because there is no guarantee that there is a dominating policy, we may never obtain a stationary policy (see below). However, we can extract a stationary policy from the policies played at individual time steps t, and select the one with the highest expected utility. We can also obtain a version of the algorithm that attains a deterministic policy, by replacing the linear program with a maximization over P1's actions.

Optimality. The policies obtained using this algorithm are subgame perfect, up to the time horizon adopted for backward induction; i.e., the continuation policies are optimal (considering the possibly incorrect transition kernel of P2) off the equilibrium path. As a dominating Markov policy may not exist, the algorithm may not converge to a stationary policy in the infinite horizon discounted setting, similarly to the cyclic equilibria examined by Zinkevich et al. [2005]. This is because the commitment of P1 affects the current action of P2, and hence the effective transition matrix for P1. More precisely, the transition actually depends on the future joint policy π_{t+1:T}, because this determines the value Q_{2,t} and hence the policy of P2.
Thus, the Bellman optimality condition does not hold, as the optimal continuation may depend on previous decisions.

4 Experiments

We focus on a natural subclass of multi-view decision processes, which we call intervention games. Therein, a human and an AI have joint control of a system, and the human can override the AI's actions at a cost. As an example, consider semi-autonomous driving, where the human always has an option to override the AI's decisions. The cost represents the additional effort of human intervention; if there were no cost, the human might always prefer to assume manual control and ignore the AI.

Definition 7 (c-intervention game). An MVDP is a c-intervention game if all of P2's actions override those of P1, apart from the null action a0 ∈ A2, which has no effect:

    μ1(s_{t+1} | s_t, a_{t,1}, a_{t,2}) = μ1(s_{t+1} | s_t, a′_{t,1}, a_{t,2})   for all a_{t,1}, a′_{t,1} ∈ A1, a_{t,2} ≠ a0.   (4.1)

In addition, the agents subtract a cost c(s) > 0 from the reward r_t = ρ(s_t) whenever P2 takes an action other than a0.

Any MDP with action space A′ and reward function ρ′ : S → [0, 1] can be converted into a c-intervention game, and modeled as an MVDP, with action space A = A1 × A2, where A1 = A′, A2 = A1 ∪ {a0}, a1 ∈ A1, a2 ∈ A2, a = (a1, a2) ∈ A,   (4.2)

    r_MIN = min_{s′ ∈ S, a′2 ∈ A2} ρ′(s′) − c(s′),   r_MAX = max_{s′ ∈ S, a′2 ∈ A2} ρ′(s′),   (4.3)

and reward function³ ρ : S × A → [0, 1], with

    ρ(s, a) = (ρ′(s) − c(s) · I{a2 ≠ a0} − r_MIN) / (r_MAX − r_MIN).   (4.4)

The reward function in the MVDP is defined so that it also has the range [0, 1].

Algorithms and scenarios.
We consider the main scenario, as well as three variant scenarios, with different assumptions about the AI's model. For the main scenario, the human has an incorrect model of the world, which the AI knows. For this, we consider three types of AI policies:

PURE: The AI only uses deterministic Markov policies.

³Note that although our original definition used a state-only reward function, we are using a state-action reward function here.

[Figure 1 appears here: illustrations of the two domains and six result plots, with panels (a) Multilane Highway, (b) Highway: Error, (c) Highway: Cost, (d) Food and Shelter, (e) Food and Shelter: Error, (f) Food and Shelter: Cost; each plot shows utility for the opt, pure, mixed, naive, human, and stat variants.]

Figure 1: Illustrations and experimental results for the 'multilane highway' and 'food and shelter' domains. Plots (b,e) show the effect of varying the error in the human's transition kernel with fixed intervention cost.
Plots (c,f) show the effect of varying the intervention cost for a fixed error in the human's transition kernel.

MIXED: The AI may use stochastic Markov policies.

STAT: As above, but use the best instantaneous deterministic policy of the first 25 time-steps found in PURE as a stationary Markov policy (running for the same time horizon as PURE).

We also have three variant scenarios of AI and human behaviour.

OPT: Both the AI and the human have the correct model of the world.

NAIVE: The AI assumes that the human's model is correct.

HUMAN: Both agents use the incorrect human model to take actions. This is equivalent to the human having full control without any intervention cost.

In all of these, the AI uses a MIXED policy. We consider two simulated problem domains in which to evaluate our methods. The first is a multilane highway scenario, where the human and AI have shared control of a car, and the second is a food and shelter domain where they must collect food and maintain a shelter. In all cases, we use a finite time horizon of 100 steps and a discount factor of γ = 0.95.

Multilane Highway. In this domain, a car is under joint control of an AI agent and a human, with the human able to override the AI's actions at any time. There are multiple lanes in a highway, with varying levels of risk and speed (faster lanes are more risky). Within each lane, there is some probability of having an accident. However, the human overestimates this probability, and so wants to travel in a slower lane than is optimal. We denote the starting state by A, the destination state by B, and, for lane i, intermediate states C_{i1}, ..., C_{iJ}, where J is the number of intermediate states in a lane, and an accident state D. See Figure 1(a) for an illustration of the domain, and for the simulation results.
In the plots, the error parameter represents a factor by which the human is wrong in assessing the accident probability (assumed to be small), while the cost parameter determines both the cost of safety (slow driving) of different lanes and the cost of the human intervening on these lanes. The latter is because our experimental model couples the cost of intervention with the safety cost. The rewards range from −10 to 10. More details are provided in the Appendix (Section B).

Food and Shelter Domain. The food and shelter domain [Guo et al., 2013] involves an agent simultaneously trying to find randomly placed food (in one of the top five locations) while maintaining a shelter. With positive probability at each time step, the shelter can collapse if it is not maintained. There is a negative reward for the shelter collapsing and a positive reward for finding food (food reappears whenever it is found). In order to exercise the abilities of our modeling, we make the original setting more complex by increasing the size of the grid to 5 × 5 and allowing diagonal moves. For our MVDP setting, we give the AI the correct model but assume the human overestimates the error probabilities of movements. Furthermore, the human believes that diagonal movements are more prone to error. See Figure 1(d) for an illustration of the domain, and for the simulation results. In the plots, the error parameter determines how skewed the human's belief about the error is towards the uniform distribution, while the cost parameter determines the cost of intervention. The rewards range from −1 to 1.
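The skewing of the human's belief towards the uniform distribution can be sketched as a convex combination of the true transition distribution and the uniform one (a hypothetical construction consistent with the description above; the exact parameterization is in Appendix B):

```python
import numpy as np

def skew_toward_uniform(p_true, skew):
    """Hypothetical human belief: mix the true transition distribution with
    the uniform distribution. skew=0 recovers the true model; larger skew
    makes every outcome, including errors, look more likely."""
    uniform = np.full_like(p_true, 1.0 / len(p_true))
    return (1.0 - skew) * p_true + skew * uniform

# True model of one (diagonal) move: mostly succeeds, occasionally slips.
p_move = np.array([0.9, 0.05, 0.05])   # [intended cell, slip left, slip right]
p_human = skew_toward_uniform(p_move, 0.25)
# The human's belief shifts mass from the intended outcome to the slip
# outcomes, making moves look more error-prone than they really are.
```

Since both inputs are probability distributions, the mixture remains a valid distribution, and the human's probability of the intended outcome is strictly below the true one whenever skew > 0.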
More details are provided in the Appendix (Section B).

Results. In the simulations, when we change the error parameter, we keep the cost parameter constant (0.15 for the multilane highway domain and 0.1 for the food and shelter domain); conversely, when we change the cost, we keep the error constant (25 for the multilane highway domain and 0.25 for the food and shelter domain). Overall, the results show that PURE, MIXED and STAT perform considerably better than NAIVE and HUMAN. Furthermore, for low costs, HUMAN is better than NAIVE. The reason is that in NAIVE the human agent overrides the AI, which is more costly than having the AI perform the same policy (as happens in HUMAN). Therefore, simply assuming that the human has the correct model not only leads to a larger error than knowing the human's model, but can also be worse than simply adopting the human's erroneous model when making decisions.

As the cost of intervention increases, the utilities approach the jointly optimal one (OPT scenario), with the exception of the utility for scenario HUMAN. This is not surprising, since the intervention cost has an important tempering effect: the human is less likely to take over control if interventions are costly. When the human error is small, the utility approaches that of the jointly optimal policy; increasing error leads to larger deviations from the optimal utility. Out of the three algorithms (PURE, MIXED and STAT), MIXED obtains slightly better performance, showing the additional benefit of allowing stochastic policies. PURE and STAT have quite similar performance, which indicates that in most cases the backwards induction algorithm converges to a stationary policy.

5 Conclusion

We have introduced the framework of multi-view decision processes to model value-alignment problems in human-AI collaboration.
In this problem, an AI and a human act in the same environment and share the same reward function, but the human may have an incorrect world model. We analyze the effect of knowledge of the human's world model on the policy selected by the AI.

More precisely, we developed a dynamic programming algorithm, and gave simulation results to demonstrate that an AI with this algorithm can adopt a useful policy in simple environments, even when the human adopts an incorrect model. This is important for modern applications involving close cooperation between humans and AI, such as home robots or automated vehicles, where the human can choose to intervene but may do so erroneously. Although backwards induction is efficient for discrete state and action spaces, it cannot usefully be applied to the continuous case; we would like to develop stochastic gradient algorithms for that setting. More generally, we see a number of immediate extensions to MVDPs: estimating the human's world model, studying a setting in which the human is learning to respond to the actions of the AI, and moving away from Stackelberg to the case of no commitment.

Acknowledgements. The research has received funding from: the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement 608743, the Swedish national science foundation (VR), the Future of Life Institute, the SEAS TomKat fund, and a SNSF Early Postdoc Mobility fellowship.

References

Ofra Amir, Ece Kamar, Andrey Kolobov, and Barbara Grosz. Interactive teaching strategies for agent training. In IJCAI 2016, 2016.

Branislav Bošanský, Simina Brânzei, Kristoffer Arnsfelt Hansen, Peter Bro Miltersen, and Troels Bjerre Sørensen.
Computation of Stackelberg equilibria of finite sequential games. 2015.

Branislav Bošanský, Viliam Lisý, Marc Lanctot, Jiří Čermák, and Mark H. M. Winands. Algorithms for computing strategies in two-player simultaneous move games. Artificial Intelligence, 237:1–40, 2016.

Avshalom Elmalech, David Sarne, Avi Rosenfeld, and Eden Shalom Erez. When suboptimal rules. In AAAI, pages 1313–1319, 2015.

Eyal Even-Dar and Yishai Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, COLT/Kernel 2003, Lecture Notes in Computer Science, pages 581–594, Washington, DC, USA, 2003. Springer.

Ya'akov Gal and Avi Pfeffer. Networks of influence diagrams: A formalism for representing agents' beliefs and decision-making processes. Journal of Artificial Intelligence Research, 33(1):109–147, 2008.

Xiaoxiao Guo, Satinder Singh, and Richard L. Lewis. Reward mapping for transfer in long-lived agents. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2130–2138. 2013.

Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning, 2016.

Joshua Letchford, Liam MacDermed, Vincent Conitzer, Ronald Parr, and Charles L. Isbell. Computing optimal strategies to commit to in stochastic games. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI'12, 2012.

J. C. R. Licklider. Man-computer symbiosis. IRE Transactions on Human Factors in Electronics, 1:4–11, 1960.

Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 394–402.
Morgan Kaufmann Publishers Inc., 1995.

Yishay Mansour and Satinder Singh. On the complexity of policy iteration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 401–408. Morgan Kaufmann Publishers Inc., 1999.

Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.

Jonathan Sorg, Satinder P. Singh, and Richard L. Lewis. Internal rewards mitigate agent boundedness. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1007–1014, 2010.

Haoqi Zhang and David C. Parkes. Value-based policy teaching with active indirect elicitation. In Proc. 23rd AAAI Conference on Artificial Intelligence (AAAI'08), pages 208–214, Chicago, IL, July 2008.

Haoqi Zhang, David C. Parkes, and Yiling Chen. Policy teaching through reward function learning. In 10th ACM Electronic Commerce Conference (EC'09), pages 295–304, 2009.

Martin Zinkevich, Amy Greenwald, and Michael Littman. Cyclic equilibria in Markov games. In Advances in Neural Information Processing Systems, 2005.