{"title": "MAP Inference for Bayesian Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1989, "page_last": 1997, "abstract": "The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms.", "full_text": "MAP Inference for\n\nBayesian Inverse Reinforcement Learning\n\nJaedeug Choi and Kee-Eung Kim\nbDepartment of Computer Science\n\nKorea Advanced Institute of Science and Technology\n\nDaejeon 305-701, Korea\n\njdchoi@ai.kaist.ac.kr, kekim@cs.kaist.ac.kr\n\nAbstract\n\nThe dif\ufb01culty in inverse reinforcement learning (IRL) arises in choosing the best\nreward function since there are typically an in\ufb01nite number of reward functions\nthat yield the given behaviour data as optimal. Using a Bayesian framework, we\naddress this challenge by using the maximum a posteriori (MAP) estimation for\nthe reward function, and show that most of the previous IRL algorithms can be\nmodeled into our framework. We also present a gradient method for the MAP es-\ntimation based on the (sub)differentiability of the posterior distribution. 
We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms.\n\n1 Introduction\n\nThe objective of inverse reinforcement learning (IRL) is to determine the decision-making agent's underlying reward function from its behaviour data and the model of the environment [1]. The significance of IRL has emerged from problems in diverse research areas. In animal and human behaviour studies [2], the agent's behaviour can be understood through the reward function, since the reward function reflects the agent's objectives and preferences. In robotics [3], IRL provides a framework for making robots learn to imitate the demonstrator's behaviour using the inferred reward function. In other areas related to reinforcement learning, such as neuroscience [4] and economics [5], IRL addresses the non-trivial problem of finding an appropriate reward function when building a computational model for decision making.\n\nIn IRL, we generally assume that the agent is an expert in the problem domain and hence it behaves optimally in the environment. Using the Markov decision process (MDP) formalism, the IRL problem is defined as finding the reward function that the expert is optimizing, given the behaviour data of state-action histories and the environment model of state transition probabilities. In the last decade, a number of studies have addressed IRL in direct (reward learning) and indirect (policy learning by inferring the reward function, i.e., apprenticeship learning) fashions. Ng and Russell [6] proposed a necessary and sufficient condition on the reward functions that guarantees the optimality of the expert's policy and formulated a linear programming (LP) problem to find the reward function from the behaviour data. 
Extending their work, Abbeel and Ng [7] presented an algorithm for finding the expert's policy from its behaviour data with a performance guarantee on the learned policy. Ratliff et al. [8] applied structured max-margin optimization to IRL and proposed a method for finding the reward function that maximizes the margin between the expert's policy and all other policies. Neu and Szepesvári [9] provided an algorithm for finding the policy that minimizes the deviation from the behaviour. Their algorithm unifies the direct method, which minimizes a loss function of the deviation, and the indirect method, which finds an optimal policy from the reward function learned using IRL. Syed and Schapire [10] proposed a method to find a policy that improves on the expert's policy using a game-theoretic framework. Ziebart et al. [11] adopted the principle of maximum entropy for learning the policy whose feature expectations are constrained to match those of the expert's behaviour. In addition, Neu and Szepesvári [12] provided a (non-Bayesian) unified view for comparing the similarities and differences among previous IRL algorithms.\n\nIRL is an inherently ill-posed problem since there may be an infinite number of reward functions that yield the expert's policy as optimal. The previous approaches summarized above employ various preferences on the reward function to address this non-uniqueness. For example, Ng and Russell [6] search for the reward function that maximizes the difference in the values of the expert's policy and the second best policy. 
More recently, Ramachandran and Amir [13] presented a Bayesian approach formulating the reward preference as the prior and the behaviour compatibility as the likelihood, and proposed a Markov chain Monte Carlo (MCMC) algorithm to find the posterior mean of the reward function.\n\nIn this paper, we propose a Bayesian framework subsuming most of the non-Bayesian IRL algorithms in the literature. This is achieved by searching for the maximum-a-posteriori (MAP) reward function, in contrast to computing the posterior mean. We show that the posterior mean can be problematic for the reward inference since the loss function is integrated over the entire reward space, even including those inconsistent with the behaviour data. Hence, the inferred reward function can induce a policy much different from the expert's policy. The MAP estimate, however, is more robust in the sense that the objective function (the posterior probability in our case) is evaluated on a single reward function. In order to find the MAP reward function, we present a gradient method using the differentiability result of the posterior, and show the effectiveness of our approach through experiments.\n\n2 Preliminaries\n\n2.1 MDPs\n\nA Markov decision process (MDP) is defined as a tuple ⟨S, A, T, R, γ, α⟩: S is the finite set of states; A is the finite set of actions; T is the state transition function, where T(s, a, s′) denotes the probability P(s′|s, a) of changing to state s′ from state s by taking action a; R is the reward function, where R(s, a) denotes the immediate reward of executing action a in state s, whose absolute value is bounded by R_max; γ ∈ [0, 1) is the discount factor; α is the initial state distribution, where α(s) denotes the probability of starting in state s. 
Using matrix notation, the transition function is denoted as an |S||A| × |S| matrix T, and the reward function is denoted as an |S||A|-dimensional vector R.\n\nA policy is defined as a mapping π : S → A. The value of policy π is the expected discounted return of executing the policy, defined as V^π = E[Σ_{t=0}^∞ γ^t R(s_t, a_t) | α, π], where the initial state s_0 is determined according to the initial state distribution α and action a_t is chosen by policy π in state s_t. The value function of policy π for each state s is computed by V^π(s) = R(s, π(s)) + γ Σ_{s′∈S} T(s, π(s), s′) V^π(s′), such that the value of policy π is calculated by V^π = Σ_s α(s) V^π(s). Similarly, the Q-function is defined as Q^π(s, a) = R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V^π(s′). We can rewrite the equations for the value function and the Q-function in matrix notation as\n\nV^π = R^π + γ T^π V^π,  Q^π_a = R_a + γ T_a V^π  (1)\n\nwhere T^π is an |S| × |S| matrix with the (s, s′) element being T(s, π(s), s′), T_a is an |S| × |S| matrix with the (s, s′) element being T(s, a, s′), R^π is an |S|-dimensional vector with the s-th element being R(s, π(s)), R_a is an |S|-dimensional vector with the s-th element being R(s, a), and Q^π_a is an |S|-dimensional vector with the s-th element being Q^π(s, a).\n\nAn optimal policy π* maximizes the value function for all the states, and thus should satisfy the Bellman optimality equation: π is an optimal policy if and only if, for all s ∈ S, π(s) ∈ argmax_{a∈A} Q^π(s, a). 
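Equation (1) is linear in V^π, so exact policy evaluation reduces to solving a linear system, V^π = (I − γT^π)^{−1} R^π. The following is a minimal NumPy sketch on a made-up 2-state, 2-action MDP (the transition and reward numbers are illustrative, not taken from the paper):

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
# T[a] is the |S| x |S| transition matrix of action a; R[s, a] is the reward.
gamma = 0.9
T = np.array([[[0.8, 0.2],
               [0.1, 0.9]],    # action 0
              [[0.5, 0.5],
               [0.4, 0.6]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])

def policy_value(pi, T, R, gamma):
    """Solve V^pi = R^pi + gamma T^pi V^pi exactly, as in Equation (1)."""
    n = R.shape[0]
    T_pi = T[pi, np.arange(n)]              # rows T(s, pi(s), .)
    R_pi = R[np.arange(n), pi]              # entries R(s, pi(s))
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    Q = R + gamma * np.einsum('aij,j->ia', T, V)  # Q^pi_a = R_a + gamma T_a V^pi
    return V, Q

pi = np.array([0, 1])                       # pi(s0) = a0, pi(s1) = a1
V, Q = policy_value(pi, T, R, gamma)
# Bellman consistency: Q^pi(s, pi(s)) must equal V^pi(s).
assert np.allclose(Q[np.arange(2), pi], V)
```

The same solve is what makes the reward optimality region computable in closed form: the term (I − γT^π)^{−1} appears again in Corollary 1.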
We denote V* = V^{π*} and Q* = Q^{π*}.\n\nWhen the state space is large, the reward function is often linearly parameterized: R(s, a) = Σ_{i=1}^d w_i φ_i(s, a) with the basis functions φ_i : S × A → R and the weight vector w = [w_1, w_2, ..., w_d]. Each basis function φ_i has a corresponding basis value V^π_i of policy π: V^π_i = E[Σ_{t=0}^∞ γ^t φ_i(s_t, a_t) | α, π].\n\nWe also assume that the expert's behaviour is given as the set X of M trajectories executed by the expert's policy π_E, where the m-th trajectory is an H-step sequence of state-action pairs: {(s^m_1, a^m_1), (s^m_2, a^m_2), ..., (s^m_H, a^m_H)}. Given the set of trajectories, the value and the basis value of the expert's policy π_E can be empirically estimated by\n\nˆV^E = (1/M) Σ_{m=1}^M Σ_{h=1}^H γ^{h−1} R(s^m_h, a^m_h),  ˆV^E_i = (1/M) Σ_{m=1}^M Σ_{h=1}^H γ^{h−1} φ_i(s^m_h, a^m_h).\n\nIn addition, we can empirically estimate the expert's policy ˆπ_E and its state visitation frequency ˆμ_E from the trajectories:\n\nˆπ_E(s, a) = [Σ_{m=1}^M Σ_{h=1}^H 1(s^m_h = s ∧ a^m_h = a)] / [Σ_{m=1}^M Σ_{h=1}^H 1(s^m_h = s)],  ˆμ_E(s) = (1/MH) Σ_{m=1}^M Σ_{h=1}^H 1(s^m_h = s).\n\nIn the rest of the paper, we use the notation f(R) or f(x; R) for function f in order to be explicit that f is computed using reward function R. For example, the value function V^π(s; R) denotes the value of policy π for state s using reward function R.\n\n2.2 Reward Optimality Condition\n\nNg and Russell [6] presented a necessary and sufficient condition for reward function R of an MDP to guarantee the optimality of policy π: Q^π_a(R) ≤ V^π(R) for all a ∈ A. 
From the condition, we obtain the following corollary (although it is a succinct reformulation of the theorem in [6], the proof is provided in the supplementary material).\n\nCorollary 1 Given an MDP\\R ⟨S, A, T, γ, α⟩, policy π is optimal if and only if reward function R satisfies\n\n[I − (I_A − γT)(I − γT^π)^{−1} E^π] R ≤ 0,  (2)\n\nwhere E^π is an |S| × |S||A| matrix with the (s, (s′, a′)) element being 1 if s = s′ and π(s′) = a′, and I_A is an |S||A| × |S| matrix constructed by stacking the |S| × |S| identity matrix |A| times.\n\nWe refer to Equation (2) as the reward optimality condition w.r.t. policy π. Since the linear inequalities define the region of the reward functions that yield policy π as optimal, we refer to the region bounded by Equation (2) as the reward optimality region w.r.t. policy π. Note that there exist infinitely many reward functions in the reward optimality region, even including constant reward functions (e.g. R = c1 where c ∈ [−R_max, R_max]). In other words, even when we are presented with the expert's policy, there are infinitely many reward functions to choose from, including the degenerate ones. To resolve this non-uniqueness in solutions, IRL algorithms in the literature employ various preferences on reward functions.\n\n2.3 Bayesian framework for IRL (BIRL)\n\nRamachandran and Amir [13] proposed a Bayesian framework for IRL by encoding the reward function preference as the prior and the optimality confidence of the behaviour data as the likelihood. We refer to their work as BIRL.\n\nAssuming the rewards are i.i.d., the prior in BIRL is computed by\n\nP(R) = Π_{s∈S, a∈A} P(R(s, a)).  (3)\n\nVarious distributions can be used as the prior. 
For example, the uniform prior can be used if we have no knowledge about the reward function other than its range, and a Gaussian or a Laplacian prior can be used if we prefer rewards to be close to some specific values.\n\nThe likelihood in BIRL is defined as an independent exponential distribution analogous to the softmax function:\n\nP(X|R) = Π_{m=1}^M Π_{h=1}^H P(a^m_h | s^m_h, R) = Π_{m=1}^M Π_{h=1}^H exp(β Q*(s^m_h, a^m_h; R)) / Σ_{a∈A} exp(β Q*(s^m_h, a; R))  (4)\n\nwhere β is a parameter that is equivalent to the inverse of the temperature in the Boltzmann distribution.\n\nFigure 1: (a) 5-state chain MDP. (b) Posterior for R(s1) and R(s5) of the 5-state chain MDP.\n\nThe posterior over the reward function is then formulated by combining the prior and the likelihood, using Bayes theorem:\n\nP(R|X) ∝ P(X|R) P(R).  (5)\n\nBIRL uses a Markov chain Monte Carlo (MCMC) algorithm to compute the posterior mean of the reward function.\n\n3 MAP Inference in Bayesian IRL\n\nIn the Bayesian approach to IRL, the reward function can be determined using different estimates, such as the posterior mean, median, or maximum-a-posteriori (MAP). The posterior mean is commonly used since it can be shown to be optimal under the mean square error function. However, the problem with the posterior mean in Bayesian IRL is that the error is integrated over the entire space of reward functions, even including infinitely many rewards that induce policies inconsistent with the behaviour data. This can yield a posterior mean reward function with an optimal policy again inconsistent with the data. 
On the other hand, the MAP does not involve an objective function\nthat is integrated over the reward function space; it is simply a point that maximizes the posterior\nprobability. Hence, it is more robust to in\ufb01nitely many inconsistent reward functions. We present a\nsimple example that compares the posterior mean and the MAP reward function estimation.\n\nConsider an MDP with 5 states arranged in a chain, 2 actions, and the discount factor 0.9. As shown\nin Figure 1(a), we denote the leftmost state as s1 and the rightmost state as s5. Action a1 moves to\nthe state on the right with probability 0.6 and to the state on the left with probability 0.4. Action a2\nalways moves to state s1. The true reward of each state is [0.1, 0, 0, 0, 1], hence the optimal policy\nchooses a1 in every state. Suppose that we already know R(s2), R(s3), and R(s4) which are all 0,\nand estimate R(s1) and R(s5) from the behaviour data X which contains optimal actions for all the\nstates. We can compute the posterior P (R(s1), R(s5)|X ) using Equations (3), (4), and (5) under the\nassumption that 0 \u2264 R \u2264 1 and priors P (R(s1)) being N (0.1, 1), and P (R(s5)) being N (1, 1).\nThe reward optimality region can be also computed using Equation (2).\n\nFigure 1(b) presents the posterior distribution of the reward function. The true reward, the MAP\nreward, and the posterior mean reward are marked with the black star, the blue circle, and the red\ncross, respectively. The black solid line is the boundary of the reward optimality region. Although\nthe prior mean is set to the true reward, the posterior mean is outside the reward optimality region.\nAn optimal policy for the posterior mean reward function chooses action a2 rather than action a1\nin state s1, while an optimal policy for the MAP reward function is identical to the true one. The\nsituation gets worse when using the uniform prior. 
An optimal policy for the posterior mean reward function chooses action a2 in states s1 and s2, while an optimal policy for the MAP reward function is again identical to the true one.\n\nIn the rest of this section, we additionally show that most of the IRL algorithms in the literature can be cast as searching for the MAP reward function in Bayesian IRL. The main insight comes from the fact that these algorithms try to optimize an objective function consisting of a regularization term for the preference on the reward function and an assessment term for the compatibility of the reward function with the behaviour data. The objective function is naturally formulated as the posterior in a Bayesian framework by encoding the regularization into the prior and the data compatibility into the likelihood.\n\nTable 1: IRL algorithms and their equivalent f(X; R) and prior for the Bayesian formulation. q ∈ {1, 2} is for representing L1 or L2 slack penalties.\n\nPrevious algorithm | f(X; R) | Prior\nNg and Russell's IRL from sampled trajectories [6] | fV | Uniform\nMMP without the loss function [8] | (fV)^q | Gaussian\nMWAL [10] | fG | Uniform\nPolicy matching [9] | fJ | Uniform\nMaxEnt [11] | fE | Uniform\n\nIn order to subsume the different approaches used in the literature, we generalize the likelihood in Equation (4) to the following:\n\nP(X|R) ∝ exp(β f(X; R))\n\nwhere β is a parameter for scaling the likelihood and f(X; R) is a function which will be defined appropriately to encode the data compatibility assessment used in each IRL algorithm. 
We then have the following result (the proof is provided in the supplementary material):\n\nTheorem 1 IRL algorithms listed in Table 1 are equivalent to computing the MAP estimates with the prior and the likelihood using f(X; R) defined as follows:\n\n• fV(X; R) = ˆV^E(R) − V*(R)\n• fG(X; R) = min_i [V_i^{π*(R)} − ˆV^E_i]\n• fJ(X; R) = −Σ_{s,a} ˆμ_E(s) (J(s, a; R) − ˆπ_E(s, a))^2\n• fE(X; R) = log P_MaxEnt(X|T, R)\n\nwhere π*(R) is an optimal policy induced by the reward function R, J(s, a; R) is a smooth mapping from reward function R to a greedy policy such as the soft-max function, and P_MaxEnt is the distribution on the behaviour data (trajectory or path) satisfying the principle of maximum entropy.\n\nThe MAP estimation approach provides a rich framework for explaining the previous non-Bayesian IRL algorithms in a unified manner, as well as encoding various types of a priori knowledge into the prior distribution. Note that this framework can exploit the insights behind apprenticeship learning algorithms even if they do not explicitly learn a reward function (e.g., MWAL [10]).\n\n4 A Gradient Method for Finding the MAP Reward Function\n\nWe have proposed a unifying framework for Bayesian IRL and suggested that the MAP estimate can be a better solution to the IRL problem. We can then reformulate the IRL problem as the posterior optimization problem, which is finding R_MAP that maximizes the (log unnormalized) posterior:\n\nR_MAP = argmax_R P(R|X) = argmax_R [log P(X|R) + log P(R)]\n\nBefore presenting a gradient method for the optimization problem, we show that the generalized likelihood is differentiable almost everywhere.\n\nThe likelihood is defined for measuring the compatibility of the reward function R with the behaviour data X. 
This is often accomplished using the optimal value function V* or the optimal Q-function Q* w.r.t. R. For example, the empirical value of X is compared with V* [6, 8], X is directly compared to the learned policy (e.g. the greedy policy from Q*) [9], or the probability of following the trajectories in X is computed using Q* [13]. Thus, we generally assume that P(X|R) = g(X, V*(R)) or g(X, Q*(R)), where g is differentiable w.r.t. V* or Q*. The remaining question is the differentiability of V* and Q* w.r.t. R, which we address in the following two theorems (the proofs are provided in the supplementary material):\n\nTheorem 2 V*(R) and Q*(R) are convex.\n\nTheorem 3 V*(R) and Q*(R) are differentiable almost everywhere.\n\nTheorems 2 and 3 relate to the previous work on gradient methods for IRL. Neu and Szepesvári [9] showed that Q*(R) is Lipschitz continuous, and except on a set of measure zero (almost everywhere), it is Fréchet differentiable by Rademacher's theorem. We have obtained the same result based on the reward optimality region, and additionally identified the condition under which V*(R) and Q*(R) are non-differentiable (refer to the proof for details). Ratliff et al. [8] used a subgradient of their objective function because it involves differentiating V*(R). Using Theorem 3 for computing the subgradient of their objective function yields an identical result.\n\nAssuming a differentiable prior, we can compute the gradient of the posterior using the result in Theorem 3 and the chain rule. If the posterior is convex, we will find the MAP reward function. Otherwise, as in [9], we will obtain a locally optimal solution. 
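As a concrete instance of g(X, Q*(R)), the BIRL softmax likelihood of Equation (4) is straightforward to evaluate once Q* is known. A small sketch, in which the Q* array, the demonstration pairs, and β are all made-up illustrative values:

```python
import numpy as np

beta = 1.0
# Stand-in for Q*(.; R): a |S| x |A| array (illustrative values only).
Q_star = np.array([[1.0, 0.2],
                   [0.8, 0.5],
                   [0.6, 0.6]])
demos = [(0, 0), (1, 0), (2, 1)]  # (state, action) pairs from X

def log_likelihood(Q, demos, beta):
    """log P(X|R) under Equation (4): a softmax over Q* per visited state."""
    logZ = np.log(np.exp(beta * Q).sum(axis=1))  # per-state normalizer
    return sum(beta * Q[s, a] - logZ[s] for s, a in demos)

ll = log_likelihood(Q_star, demos, beta)
assert ll < 0.0  # each factor is a probability strictly inside (0, 1)
```

Because each factor is a smooth function of Q*, the differentiability of this likelihood reduces to that of Q*(R), which is exactly what Theorem 3 establishes.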
In the next section, we will experimentally show that the locally optimal solutions are nonetheless better than the posterior mean in practice. This is due to the property that they are generally found within the reward optimality region w.r.t. the policy consistent with the behaviour data.\n\nThe gradient method uses the update rule R_new ← R + δ_t ∇_R P(R|X), where δ_t is an appropriate step-size (or learning rate). Since computing ∇_R P(R|X) involves computing an optimal policy for the current reward function and a matrix inversion, caching these results helps reduce repetitive computation. The idea is to compute the reward optimality region for checking whether we can reuse the cached result. If R_new is inside the reward optimality region of an already visited reward function R′, they share the same optimal policy and hence the same ∇_R V^π(R) or ∇_R Q^π(R). Given policy π, the reward optimality region is defined by H^π = I − (I_A − γT)(I − γT^π)^{−1} E^π, and we can reuse the cached result if H^π · R_new ≤ 0. 
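The reuse test H^π · R_new ≤ 0 follows directly from Corollary 1 and is cheap to evaluate. Below is a NumPy sketch on a toy 2-state, 2-action MDP; the numbers and helper names are hypothetical, not the paper's implementation:

```python
import numpy as np

gamma = 0.9
nS, nA = 2, 2
# T_a[a] is the |S| x |S| transition matrix of action a (illustrative numbers).
T_a = np.array([[[0.8, 0.2], [0.1, 0.9]],
                [[0.5, 0.5], [0.4, 0.6]]])
T = T_a.reshape(nS * nA, nS)        # |S||A| x |S|, rows ordered by (a, s)
I_A = np.vstack([np.eye(nS)] * nA)  # |S| x |S| identity stacked |A| times

def H_pi(pi):
    """Reward optimality region matrix of Equation (2) for policy pi."""
    T_pi = T_a[pi, np.arange(nS)]
    E_pi = np.zeros((nS, nS * nA))
    E_pi[np.arange(nS), pi * nS + np.arange(nS)] = 1.0
    G = np.linalg.inv(np.eye(nS) - gamma * T_pi)
    return np.eye(nS * nA) - (I_A - gamma * T) @ G @ E_pi

def optimal_policy(R):
    """Brute force over all |A|^|S| policies (fine for this toy size)."""
    best, best_val = None, -np.inf
    for p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        pi = np.array(p)
        T_pi = T_a[pi, np.arange(nS)]
        R_pi = R.reshape(nA, nS)[pi, np.arange(nS)]
        V = np.linalg.solve(np.eye(nS) - gamma * T_pi, R_pi)
        if V.sum() > best_val:
            best, best_val = pi, V.sum()
    return best

R = np.array([1.0, 0.0, 0.0, 0.5])  # reward vector in (a, s) order
pi_star = optimal_policy(R)
# R lies inside the reward optimality region of its own optimal policy,
# so a cached gradient for pi_star could be reused at R.
assert np.all(H_pi(pi_star) @ R <= 1e-8)
```

In Algorithm 1, this membership check is what lets the method skip both the MDP solve and the matrix inversion whenever consecutive iterates stay inside the same region.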
The gradient method using this idea is presented in Algorithm 1.\n\nAlgorithm 1 Gradient method for MAP inference in Bayesian IRL\nInput: MDP\\R, behaviour data X, step-size sequence {δ_t}, number of iterations N\n1: Initialize R\n2: π ← solveMDP(R)\n3: H^π ← computeRewardOptRgn(π)\n4: Π ← {⟨π, H^π⟩}\n5: for t = 1 to N do\n6:   R_new ← R + δ_t ∇_R P(R|X)\n7:   if isNotInRewardOptRgn(R_new, H^π) then\n8:     ⟨π, H^π⟩ ← findRewardOptRgn(R_new, Π)\n9:     if isEmpty(⟨π, H^π⟩) then\n10:      π ← solveMDP(R_new)\n11:      H^π ← computeRewardOptRgn(π)\n12:      Π ← Π ∪ {⟨π, H^π⟩}\n13:    end if\n14:  end if\n15:  R ← R_new\n16: end for\n\n5 Experimental Results\n\nThe first set of experiments was conducted in N × N gridworlds [7]. The agent can move west, east, north, or south, but with probability 0.3, it fails and moves in a random direction. The grids are partitioned into M × M non-overlapping regions, so there are (N/M)^2 regions. The basis function is defined by a 0-1 indicator function for each region. The linearly parameterized reward function is determined by the weight vector w sampled i.i.d. from a zero-mean Gaussian prior with variance 0.1 and |w_i| ≤ 1 for all i. The discount factor is set to 0.99.\n\nWe compared the performance of our gradient method to those of other IRL algorithms in the literature: Maximum Margin Planning (MMP) [8], Maximum Entropy (MaxEnt) [11], Policy Matching with the natural gradient (NatPM) and the plain gradient (PlainPM) [9], and Bayesian Inverse Reinforcement Learning (BIRL) [13]. We executed our gradient method for finding the MAP using three different choices of the likelihood: B denotes the BIRL likelihood, and E and J denote the likelihood with fE and fJ, respectively. 
For the Bayesian IRL algorithms (BIRL and MAP), two types of prior are prepared: U denotes the uniform prior and G denotes the true Gaussian prior. We evaluated the performance of the algorithms using the difference between V* (the value of the expert's policy) and V^L (the value of the optimal policy induced by the learned weight w^L, measured on the true weight w*), and the difference between w* and w^L in L2 norm.\n\nTable 2: Results in the gridworld problems (averages over 10 training sets), reporting V* − V^L and ||w* − w^L||_2 for |S| = 24 × 24 with dim(w) = 36, 144, 576 and |S| = 32 × 32 with dim(w) = 64, 256, 1024, for NatPM, PlainPM, MaxEnt, MMP, BIRL-U, BIRL-G, MAP-B-U, MAP-B-G, MAP-E-G, and MAP-J-G.\n\nFigure 2: CPU timing results of BIRL and 
MAP-B in the 24 × 24 gridworld problem. (a) dim(w) = 36. (b) dim(w) = 144.\n\nWe used training data with 10 trajectories of 50 time steps, collected from the simulated runs of the expert's policy. Table 2 shows the average performance over 10 training data sets. Most of the algorithms found a weight that induces an optimal policy whose performance is as good as that of the expert's policy (i.e., small V* − V^L), except for MMP and NatPM. The poor performance of MMP was due to the small size of the training data, as already noted in [14]. The poor performance of NatPM may be due to the ineffectiveness of the pseudo-metric in high-dimensional reward spaces, since PlainPM was able to produce good performance. Regarding the learned weights, the algorithms using the true prior (MMP, BIRL, and the variants of MAP) found weights close to the true one (i.e., small ||w* − w^L||_2). Comparing BIRL and MAP-B is especially meaningful since they share the same prior and likelihood; the only difference was in computing the mean versus the MAP of the posterior. MAP-B was consistently better than BIRL in terms of both ||w* − w^L||_2 and V* − V^L. Finally, we note that the correct prior yields small ||w* − w^L||_2 and V* − V^L when we compare PlainPM, MaxEnt, BIRL-U, and MAP-B-U (uniform prior) to MAP-J-G, MAP-E-G, BIRL-G, and MAP-B-G (Gaussian prior), respectively.\n\nFigure 2 compares the CPU timing results of the MCMC algorithm in BIRL and the gradient method in MAP-B for the 24 × 24 gridworld with 36 and 144 basis functions. BIRL takes much longer CPU time to converge than MAP-B since the former takes a much larger number of iterations to converge and, in addition, each iteration requires solving an MDP with a sampled reward function. The CPU time gap gets larger as we increase the dimension of the reward function. 
Caching the optimal policies and gradients sped up the gradient method by factors of 1.5 to 4.2 until convergence, although this is not explicitly shown in the figure.\n\nThe second set of experiments was performed on a simplified car race problem, modified from [14]. The racetrack is shown in Figure 3. The shaded and white cells indicate the off-track and on-track locations, respectively. The state consists of the location and velocity of the car. The velocities in the vertical and horizontal directions are represented as 0, 1, or 2, and the net velocity is computed as the squared sum of the directional velocities. The net velocity is regarded as high if greater than 2, zero if 0, and low otherwise. The car can increase, decrease, or maintain one of the directional velocities. The control of the car succeeds with p=0.9 if the net velocity is low, but p=0.6 if high. If the control fails, the velocity is maintained, and if the car attempts to move outside the racetrack, it remains in the previous location with velocity 0. The basis functions are 0-1 indicator functions for the goal locations, the off-track locations, and the 3 net velocity values (zero, low, high) while the car is on track. Hence, there are 3150 states, 5 actions, and 5 basis functions. The discount factor is set to 0.99.\n\nTable 3: True and learned weights in the car race problem.\n\n | Goal | Off-track | Velocity while on track: Zero | Low | High\nFast expert | 1.00 | 0.00 | 0.00 | 0.00 | 0.10\nBIRL | 0.96±0.02 | -0.20±0.03 | -0.04±0.01 | -0.12±0.02 | 0.32±0.02\nMAP-B | 1.00±0.00 | -0.19±0.02 | -0.03±0.01 | -0.13±0.01 | 0.29±0.01\n\nTable 4: Statistics of the policies simulated in the car race problem.\n\n | Avg. steps to goal | Avg. steps in locations: Off-track | On-track | Avg. steps in velocity: Zero | Low | High\nFast expert | 20.41 | 1.56 | 17.85 | 2.01 | 3.40 | 12.44\nBIRL | 32.98±6.42 | 2.13±0.60 | 29.85±6.03 | 3.33±0.52 | 4.34±0.79 | 22.18±4.84\nMAP-B | 24.77±1.92 | 1.68±0.26 | 22.09±1.71 | 2.70±0.16 | 3.38±0.18 | 16.01±1.48\n\nFigure 3: Racetrack.\n\nWe designed two experts. The slow expert prefers low velocity and avoids the off-track locations, w = [1, −0.1, 0, 0.1, 0]. The fast expert prefers high velocity, w = [1, 0, 0, 0, 0.1]. We compared the posterior mean and the MAP using the prior P(w1) = N(1, 1) and P(w2) = P(w3) = P(w4) = P(w5) = N(0, 1), assuming that we do not know the experts' preference on the locations or the velocity, but we know that the experts' ultimate goal is to reach one of the goal locations. We used BIRL for the posterior mean and MAP-B for the MAP estimation, hence using the identical prior and likelihood.\n\nWe used 10 training data sets, each consisting of 5 trajectories. We omit the results regarding the slow expert since both BIRL and MAP-B successfully found a weight similar to the true one, which induced the slow expert's policy as optimal. However, for the fast expert, MAP-B was significantly better than BIRL.¹ Table 3 shows the true and learned weights, and Table 4 shows some statistics characterizing the expert's and learned policies. The policy from BIRL tends to remain at high speed on the track for significantly more steps than the one from MAP-B, since BIRL converged to a larger ratio of w5 to w1.\n\n6 Conclusion\n\nWe have argued that, when using a Bayesian framework for learning reward functions in IRL, the MAP estimate is preferable over the posterior mean. Experimental results confirmed the effectiveness of our approach. 
We have also shown that the MAP estimation approach subsumes non-Bayesian IRL algorithms in the literature, and allows us to incorporate various types of a priori knowledge about the reward functions and the measurement of the compatibility with behaviour data.\n\nWe proved that the generalized posterior is differentiable almost everywhere, and proposed a gradient method to find a locally optimal solution to the MAP estimation. We provided a theoretical result equivalent to the previous work on gradient methods for non-Bayesian IRL, but used a different proof based on the reward optimality region.\n\nOur work could be extended in a number of ways. For example, the IRL algorithm for partially observable environments in [15] mostly relies on Ng and Russell [6]'s heuristics for MDPs, but our work opens up new opportunities to leverage the insight behind other IRL algorithms for MDPs.\n\nAcknowledgments\n\nThis work was supported by the National Research Foundation of Korea (Grant# 2009-0069702) and the Defense Acquisition Program Administration and the Agency for Defense Development of Korea (Contract# UD080042AD).\n\n¹All the results in Table 4 except for the average number of steps in the off-track locations are statistically significant at the 95% confidence level.\n\nReferences\n\n[1] S. Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of COLT, 1998.\n[2] P. R. Montague and G. S. Berns. Neural economics and the biological substrates of valuation. Neuron, 36(2), 2002.\n[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 2009.\n[4] Y. Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 2009.\n[5] E. Hopkins. Adaptive learning models of consumer behavior. Journal of Economic Behavior and Organization, 64(3–4), 2007.\n[6] A. Y. Ng and S. 
Russell. Algorithms for inverse reinforcement learning. In Proceedings of ICML, 2000.\n[7] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of ICML, 2004.\n[8] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. In Proceedings of ICML, 2006.\n[9] G. Neu and C. Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of UAI, 2007.\n[10] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Proceedings of NIPS, 2008.\n[11] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of AAAI, 2008.\n[12] G. Neu and C. Szepesvári. Training parsers by inverse reinforcement learning. Machine Learning, 77(2), 2009.\n[13] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In Proceedings of IJCAI, 2007.\n[14] A. Boularias and B. Chaib-Draa. Bootstrapping apprenticeship learning. In Proceedings of NIPS, 2010.\n[15] J. Choi and K. Kim. Inverse reinforcement learning in partially observable environments. In Proceedings of IJCAI, 2009.\n", "award": [], "sourceid": 1121, "authors": [{"given_name": "Jaedeug", "family_name": "Choi", "institution": null}, {"given_name": "Kee-eung", "family_name": "Kim", "institution": null}]}