{"title": "On the Correctness and Sample Complexity of Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7112, "page_last": 7121, "abstract": "Inverse reinforcement learning (IRL) is the problem of finding a reward function that generates a given optimal policy for a given Markov Decision Process. This paper looks at  an algorithmic-independent geometric analysis of the IRL problem with finite states and actions. A L1-regularized Support Vector Machine formulation of the IRL problem motivated by the geometric analysis is then proposed with the basic objective of the inverse reinforcement problem in mind: to find a reward function that generates a specified optimal policy. The paper further analyzes the proposed formulation of inverse reinforcement learning with $n$ states and $k$ actions, and shows a sample complexity of $O(d^2 \\log (nk))$  for transition probability matrices with at most $d$ non-zeros per row, for recovering a reward function that generates a policy that satisfies Bellman's optimality condition with respect to the true transition probabilities.", "full_text": "On the Correctness and Sample Complexity of\n\nInverse Reinforcement Learning\n\nAbi Komanduru\nPurdue University\n\nWest Lafayette IN 47906\nakomandu@purdue.edu\n\nAbstract\n\nJean Honorio\n\nPurdue University\n\nWest Lafayette IN 47906\njhonorio@purdue.edu\n\nInverse reinforcement learning (IRL) is the problem of \ufb01nding a reward function\nthat generates a given optimal policy for a given Markov Decision Process. This\npaper looks at an algorithmic-independent geometric analysis of the IRL problem\nwith \ufb01nite states and actions. A L1-regularized Support Vector Machine formula-\ntion of the IRL problem motivated by the geometric analysis is then proposed with\nthe basic objective of the inverse reinforcement problem in mind: to \ufb01nd a reward\nfunction that generates a speci\ufb01ed optimal policy. 
The paper further analyzes\nthe proposed formulation of inverse reinforcement learning with n states and k\nactions, and shows a sample complexity of O(d2 log(nk)) for transition probability\nmatrices with at most d nonzeros per row, for recovering a reward function that\ngenerates a policy that satis\ufb01es Bellman\u2019s optimality condition with respect to the\ntrue transition probabilities.\n\n1\n\nIntroduction\n\nReinforcement Learning is the process of generating an optimal policy for a given Markov Decision\nProcess (MDP) along with a reward function. Often, in situations including apprenticeship learning,\nthe reward function is unknown but optimal policy can be observed through the actions of an expert.\nIn cases such as these, it is desirable to learn a reward function generating the observed optimal policy.\nThis problem is referred to as Inverse Reinforcement Learning (IRL) Ng & Russel (2000). It is well\nknown that such a reward function is not necessarily unique. The IRL problem can be formulated\nin two different ways. The \ufb01rst is the form considered in Ng & Russel (2000) as well as this paper,\nwhich considers a standard MDP. Several other approaches instead consider the linearly-solvable\nMDP (LMDP) formulation presented in Dvijotham & Todorov (2010). Various algorithms to solve\nthe IRL problem have been proposed including linear programming Ng & Russel (2000), Hybrid IRL\nNeu & Szepesv\u00e1ri (2007), Maximum Margin Planning Ratliff et al. (2006), Multiplicative Weights\nfor Apprenticeship Learning Syed et al. (2008) and Bayesian estimation Ramachandran & Amir\n(2007). Approaches to the LMDP formulation of IRL include Maximum Entropy IRL Ziebart et al.\nand Gaussian Process IRL Levine et al. (2011). Methods such as Abbeel & Ng (2004) looked at using\nIRL to solve the apprenticeship learning problem by trying to \ufb01nd a reward function that maximizes\nthe margin of the expert\u2019s policy. 
However, none of the prior works provide a formal guarantee that\nthe reward function obtained from the empirical data is optimal for the true transition probabilities\nin inverse reinforcement learning. We note that the objective for the LMDP formulation of IRL is\ndifferent than that of the standard MDP formulation. Indeed, given the true transition probabilities\n(although in practice we only access estimated probabilities), methods using the standard MDP\nformulation recover a policy that is Bellman optimal while this may not be the case for methods using\nthe LMDP formulation.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThis paper looks at formulating the IRL problem by using the basic objective of inverse reinforcement:\nto \ufb01nd a reward function that generates a speci\ufb01ed optimal policy. The paper also looks at establishing\na sample complexity to meet this basic goal when the transition probabilities are estimated from\nobserved trajectories. To achieve this, an algorithmic-independent geometric analysis of the IRL\nproblem with \ufb01nite states and action is provided. A L1-regularized Support Vector Machine (SVM)\nformulation of the IRL problem motivated by the geometric analysis is then proposed. The formulation\nprovided is a nonparametric approach as compared to approaches that use features derived from states,\nsuch as Neu & Szepesv\u00e1ri (2007) and Ziebart et al.. Theoretical analysis of the sample complexity of\nthe L1 SVM formulation is then performed. Finally, experimental results comparing the L1 SVM\nformulation to the linear-programming method presented in Ng & Russel (2000) as well as several\nother standard and LMDP methods are presented showing the improved performance of the L1\nSVM formulation with respect to the basic objective, i.e Bellman optimality with respect to the true\ntransition probabilities. 
To the best of our knowledge, we are the first to provide an algorithm with formal guarantees for inverse reinforcement learning.

2 Preliminaries

The formulation of the IRL problem is based on a standard Markov Decision Process (MDP) (S, A, {Psa}, γ, R), where

• S is a finite set of n states.
• A = {a1, . . . , ak} is a set of k actions.
• Pa ∈ [0, 1]^(n×n) are the state transition probabilities for action a. We use Pa(s) ∈ [0, 1]^n and Psa ∈ [0, 1]^n to represent the state transition probabilities for action a in state s, and Pa(i, j) ∈ [0, 1] to represent the probability of going from state i to state j when taking action a.
• γ ∈ [0, 1] is the discount factor.
• R : S → R is the reinforcement or reward function.

It is important to note that the state transition probability matrices are right stochastic. Mathematically, this can be stated as Pa(i, j) ≥ 0 ∀ i, j and Σ_j Pa(i, j) = 1 ∀ i.
In this paper the reward function is assumed to be a function of the state alone, rather than of the state and the action. This assumption is also made for the initial results in Ng & Russel (2000). A policy is defined as a map π : S → A. Given a policy π, we can define two functions.

The value function at a state s1 is defined as

V^π(s1) = E[ R(s1) + γ R(θ(s1)) + γ² R(θ(θ(s1))) + . . . | π ]

where θ(s) represents the trajectory under policy π.
The Q function is defined as

Q^π(s, a) = R(s) + γ E_{s′ ∼ Pa(s)}[ V^π(s′) ]

The Bellman optimality equation states that a policy π∗ is an optimal policy for an MDP if and only if

π∗(s) ∈ argmax_{a ∈ A} Q^{π∗}(s, a), ∀ s ∈ S

As shown in Ng & Russel (2000), for a finite-state MDP with reward R, and for π∗ ≡ a1, the Bellman optimality equation is equivalent to the following condition:

(Pa1(i) − Pa(i))(I − γPa1)^(−1) R ≥ 0 ∀ i = 1, . . . , n; a ≠ a1

It can also be shown that π∗ ≡ a1 is the unique optimal policy if the above inequality is strict. We note that this condition is necessary and sufficient for the policy to be optimal for the reward. Thus, this condition results in the fundamental constraint for standard MDP IRL problems. Further analysis of this condition is presented in the following section. In the subsequent sections, we use the notation

Fai := (Pa1(i) − Pa(i))(I − γPa1)^(−1)

The empirical maximum likelihood estimates of the transition probabilities from sampled trajectories are denoted by P̂a(i), and in a similar fashion we use the notation

F̂ai := (P̂a1(i) − P̂a(i))(I − γP̂a1)^(−1)

By enforcing the Bellman optimality of the policy π∗, linear constraints on the reward function can be formed in the IRL problem. This leads to different formulations of the IRL problem, including linear programming, Bayesian estimation, Multiplicative Weights for Apprenticeship Learning and Maximum Margin Planning. The IRL problem can be formed as an optimization problem by minimizing some loss function. 
For instance, one such formulation presented in Ng & Russel (2000) is as follows:

maximize_R   Σ_{i=1}^{n} min_{a ∈ {a2,...,ak}} ( F̂ai^T R ) − λ ||R||1
subject to   F̂ai^T R ≥ 0  ∀ a ∈ A \ a1, i = 1, . . . , n        (2.1)
             ||R||∞ ≤ Rmax

The following norms are used throughout this paper. The infinity norm of a matrix A = [aij] is defined as ||A||∞ = sup_{i,j} |aij|. The L1 norm of a vector is defined as ||b||1 = Σ_i |bi|. The induced matrix norm is defined as |||A|||∞ = sup_j ||aj||1, where aj is the j-th row of the matrix A. Note that for a right stochastic matrix P, we have |||P|||∞ = 1 and ||P||∞ ≤ 1.

3 Geometric analysis of the IRL problem

The objective of the Inverse Reinforcement Learning problem is to find a reward function that generates an optimal policy. As stated above, the necessary and sufficient conditions for a policy π∗ (without loss of generality π∗ ≡ a1) to be optimal are given by the Bellman optimality principle and can be stated mathematically as

Fai^T R ≥ 0  ∀ a ∈ A \ a1, i = 1, . . . , n

Clearly, R = 0 is always a solution. However, this solution is degenerate in the sense that it allows any and every other policy to be \"optimal\" as well, and as a result it is not of practical use. If the constraint R ≠ 0 is imposed, then, noticing that the points Fai ∈ R^n, the set of reward functions generating the optimal policy π1 is the set of normals of hyperplanes passing through the origin for which the entire collection of points {Fai} lies in one half space. The problem of Inverse Reinforcement Learning is then equivalent to the problem of finding such a separating hyperplane passing through the origin for the points {Fai}. 
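As a concrete illustration of this geometric view, the sketch below builds the points Fai for a small hypothetical MDP (3 states, 2 actions; the matrices, γ and candidate reward are invented for illustration only) and checks whether all of them lie in the closed half space {x : x^T R ≥ 0}:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP used only for illustration.
# Under action a1 every state moves to state 0; under a2, to state 2.
gamma = 0.5
P_a1 = np.array([[1., 0., 0.],
                 [1., 0., 0.],
                 [1., 0., 0.]])
P_a2 = np.array([[0., 0., 1.],
                 [0., 0., 1.],
                 [0., 0., 1.]])

M = np.linalg.inv(np.eye(3) - gamma * P_a1)

# Row i of F is the point F_{a2,i} = (P_a1(i) - P_a2(i)) (I - gamma P_a1)^{-1}.
F = (P_a1 - P_a2) @ M

# A reward generates pi* = a1 iff every point F_{ai} lies in the closed
# half space {x : x^T R >= 0}, i.e. F @ R >= 0 elementwise.
R = np.array([1., 0., 0.])      # candidate reward: only state 0 is rewarding
print(np.all(F @ R >= 0))       # True: a1 is Bellman optimal for this R
```

Replacing R with a reward concentrated on state 2 makes some products negative, so a1 is no longer optimal; this half-space test is exactly what the three regimes classify.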
Here we also assume none of the Fai = 0 as this would mean that there is no\ndistinction between the policies \u03c0 = a and \u03c01 = a1.\nThis geometric perspective of the IRL problem allows the classi\ufb01cation of all \ufb01nite state, \ufb01nite action\nIRL problems into 3 regimes, graphically visualized in Figure 1:\n\n3\n\n\fFigure 1: Left: An example graphical visualization of Regime 1 where the origin lies inside the\nconvex hull of {Fai}. Here no hyperplane passing through the origin exists for which all the points\n{Fai} lie in one half space. Center: An example graphical visualization of Regime 2 where the\norigin lies on the boundary of the convex hull of {Fai}. Here only one hyperplane passing through\nthe origin exists for which all the points {Fai} lie in one half space. Right: An example graphical\nvisualization of Regime 3 where the origin lies outside the convex hull of {Fai}. Here in\ufb01nitely many\nhyperplanes passing through the origin exist for which all the points {Fai} lie in one half space.\n\nRegime 1: In this regime, there is no hyperplane passing through the origin for which all the points\n{Fai} lie in one half space. This is equivalent to saying that the origin is in the interior of the convex\nhull of the points {Fai}. In this case, independent of the algorithm, there is no nonzero reward\nfunction for which the policy \u03c01 is optimal.\nRegime 2: In this regime, up to scaling by a constant, there can be one or more hyperplanes passing\nthrough the origin for which all the points {Fai} lie in one half space, however the hyperplanes\nalways contain one of the points {Fai}. This is equivalent to saying that the origin is on the boundary\nof the convex hull of the points {Fai} but is not one of the vertices since by assumption Fai (cid:54)= 0. In\nthis case, up to a constant scaling, there are one or more nonzero reward functions that generates\nthe optimal policy \u03c01. 
In this case, it is also important to notice that the policy π1 cannot be strictly optimal for any of these reward functions.
Regime 3: In this regime, up to scaling by a constant, there are infinitely many hyperplanes passing through the origin for which all the points {Fai} lie in one half space. This is equivalent to saying that the origin is outside the convex hull of the points {Fai}. In this case, up to a constant scaling, there are infinitely many nonzero reward functions that generate the optimal policy π1, and it is possible to find a reward function for which the policy π1 is strictly optimal.
These geometric regimes and their implications for the finite state, finite action inverse reinforcement learning problem are summed up in the following theorem.
Theorem 3.1. There exists a hyperplane passing through the origin such that all the points {Fai} lie on one side of the hyperplane (or on the hyperplane) if and only if there is a non-zero reward function R ≠ 0 that generates the optimal policy π = a1 for the inverse reinforcement learning problem {S, A, Pa, γ}, i.e., ∃ R ≠ 0 such that Fai^T R ≥ 0 ∀ a, i.
Remark 3.1. Notice that, as an extension of Theorem 3.1, there is an R for which the policy π = a1 is strictly optimal iff there exists a hyperplane for which all the points {Fai} are strictly on one side.
Remark 3.2. Note that it is possible to find a separating hyperplane between the origin and the collection of points {Fai} if and only if the problem is in Regime 3. Therefore, the problem of inverse reinforcement learning can be viewed as a one class support vector machine (or as a two class support vector machine with the origin as the negative class) problem in this regime. This, along with the objective of determining sample complexity, leads into the formulation of the problem discussed in the next section.

4 Formulation of optimization problem

The objective function formulation of the inverse reinforcement problem described in Ng & Russel (2000) was formed by imposing the conditions that the value from the optimal policy be as far as possible from that of the next best action at each state, as well as sparseness of the reward function. These were choices made by the authors to enable a unique solution to the proposed linear programming problem. We propose a different formulation, in terms of a one class L1-regularized support vector machine, that allows for a geometric interpretation as well as provides an efficient sample complexity.
The Inverse Reinforcement Learning problem is now considered in Regime 3. Here it is known that there is a separating hyperplane between the origin and {Fai}, so the strict inequality Fai^T R > 0 holds, which by scaling of R is equivalent to Fai^T R ≥ 1. Formally, this assumption is stated as follows.
Definition 4.1 (β-Strict Separability). An inverse reinforcement learning problem {S, A, Pa, γ} satisfies β-strict separability if and only if there exists a {β, R∗} such that

||R∗||1 = 1 and Fai^T R∗ ≥ β > 0 ∀ a ∈ A \ a1, i = 1, . . . , n

Notice that the IRL problem is in Regime 3 (i.e., ∃ w such that w^T Fai > 0 ∀ a, i) if and only if the strict separability assumption is satisfied.
Strict nonzero assumptions are well-accepted in the statistical learning theory community, and have been used for instance in compressed sensing Wainwright (2009), Markov random fields Ravikumar et al. (2010), nonparametric regression Liu et al. 
(2008), and diffusion networks Daneshmand et al. (2014).

Figure 2: An example graphical visualization of Regime 2, where the origin lies on the boundary of the convex hull of {Fai}. Perturbation from statistical estimation of the transition probability matrices from empirical data (solid red) makes the problem easily tip into Regime 1 (shown) or Regime 3. An infinite number of samples would be required to solve IRL problems falling into Regime 2.

Problems in Regime 2 are avoided since, based on the statistical estimation of the transition probability matrices from empirical data, the problem can easily tip into Regime 1 or Regime 3, as shown in Figure 2. To solve problems in Regime 2, an infinite number of samples would be required, whereas problems in Regime 3 can be solved with a large enough number of samples.
Given the strict separability assumption, the optimization problem proposed is as follows:

minimize_R   ||R||1
subject to   F̂ai^T R ≥ 1  ∀ a ∈ A \ a1, i = 1, . . . , n        (4.1)

This problem is in the form of a one class L1-regularized Support Vector Machine Zhu et al. (2004), except that we use hard margins instead of soft margins. The minimization of the L1 norm plays a twofold role in this formulation. First, it promotes a sparse reward function, in keeping with the idea of simplicity. Second, it plays a role in establishing the sample complexity bounds of the inverse reinforcement learning problem, as shown in the subsequent section. The constraints derive from strict Bellman optimality in the separable case (Regime 3) of inverse reinforcement learning and help avoid the degenerate solution R = 0. 
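Once R is split into its positive and negative parts, problem 4.1 becomes a linear program. The sketch below solves it with SciPy's linprog; the toy 3-state, 2-action MDP and its matrices are invented for illustration, and SciPy's availability is assumed:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 3-state, 2-action MDP (hypothetical): a1 sends every state to state 0.
gamma = 0.5
P_a1 = np.tile([1., 0., 0.], (3, 1))
P_a2 = np.array([[0., 1., 0.],
                 [0., 0., 1.],
                 [0., 0., 1.]])
# Rows of F_hat are the vectors F_hat_{a2, i}.
F_hat = (P_a1 - P_a2) @ np.linalg.inv(np.eye(3) - gamma * P_a1)

# Problem 4.1: minimize ||R||_1 subject to F_hat R >= 1 (elementwise).
# Standard split R = Rp - Rm with Rp, Rm >= 0 makes the L1 objective linear.
n = F_hat.shape[1]
c = np.ones(2 * n)                     # objective: sum(Rp) + sum(Rm)
A_ub = np.hstack([-F_hat, F_hat])      # -F_hat Rp + F_hat Rm <= -1
b_ub = -np.ones(F_hat.shape[0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * (2 * n), method="highs")
R_hat = res.x[:n] - res.x[n:]          # recovered reward
```

For this instance the recovered reward satisfies every constraint F̂ai^T R̂ ≥ 1 with ||R̂||1 = 1; the hard margin plays the role of the strict Bellman optimality constraints.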
We now use this optimization problem, along with the objective of finding a reward function for which the policy π = a1 is optimal, to establish the correctness and sample complexity of the inverse reinforcement learning problem.

5 Correctness and sample complexity of Inverse Reinforcement Learning

Consider the inverse reinforcement learning problem in the strictly separable case (Regime 3). We have ∃ {β, R∗} such that

Fai^T R∗ ≥ β > 0 ∀ a ∈ A \ a1, i = 1, . . . , n

Let ||Fai − F̂ai||∞ ≤ ε. Let R̂ be the solution to the optimization problem 4.1 with F̂ai. We desire that

Fai^T R̂ ≥ 0 ∀ a ∈ A \ a1, i = 1, . . . , n

i.e., the reward we obtain from the problem using the estimated transition probability matrices also generates π = a1 as the optimal policy for the problem with the true transition probabilities. This can be done by reducing ε, i.e., by using more samples. The result in the strictly separable case follows from the following theorem.
Theorem 5.1. Let {S, A, Pa, γ} be an inverse reinforcement learning problem that is β-strictly separable. Let F̂ai be the values of Fai using estimates of the transition probability matrices such that ||Fai − F̂ai||∞ ≤ ε. Let R̂ be the solution to the optimization problem 4.1 with F̂ai. Let 1 ≥ c ≥ 0 and

ε ≤ ((1 − c) / (2 − c)) β

Then we have Fai^T R̂ ≥ c ∀ a ∈ A \ a1, i = 1, . . . , n.
Remark 5.1. It is important to note that since K, β > 0, ε ≥ 0 and c ≤ 1 − εK/β, we have c ≤ 1, with equality holding only when ε = 0, i.e., with infinitely many samples. This shows the equivalence of the problems with the true and the estimated transition probabilities in the case of infinite samples.

Our desired result then follows as a corollary of the above theorem.
Corollary 5.1. Let {S, A, Pa, γ} be an inverse reinforcement learning problem that is β-strictly separable. Let F̂ai be the values of Fai using estimates of the transition probability matrices such that ||Fai − F̂ai||∞ ≤ ε. Let R̂ be the solution to the optimization problem 4.1 with F̂ai. Let

ε ≤ β/2

Then we have Fai^T R̂ ≥ 0 ∀ a ∈ A \ a1, i = 1, . . . , n.

Proof. Straightforwardly, by setting c = 0 in Theorem 5.1.
Theorem 5.2. Let {S, A, Pa, γ} be an inverse reinforcement learning problem that is β-strictly separable. Let the transition matrices Pa have at most d ∈ {1, . . . , n} non-zero elements per row. Let every state be reachable from the starting state in one step with probability at least α. Let R̂ be the solution to the optimization problem 4.1 with F̂ai, with transition probability matrices P̂a that are maximum likelihood estimates of Pa formed from m samples, where

m ≥ (64 / (αβ²)) ( ((d − 1)γ + 1) / (1 − γ)² )² log(4nk/δ)

Then with probability at least (1 − δ), we have Fai^T R̂ ≥ 0 ∀ a ∈ A \ a1, i = 1, . . . , n.
The theorem above follows from concentration inequalities for the estimation of the transition probabilities, which are detailed in the following section. (All missing proofs are included in the Supplementary Material.)

6 Concentration inequalities

In this section we look at the propagation of the concentration of the empirical estimates of the transition probabilities around their true values.

Lemma 6.1. 
Let A and B be two matrices. Then

||AB||∞ ≤ |||A|||∞ ||B||∞

Next we look at the propagation of the concentration of a right stochastic matrix P to the concentration of its k-th power.
Lemma 6.2. Let P be an n × n right stochastic matrix with at most d ∈ {1, . . . , n} non-zero elements per row and let P̂ be an estimate of P such that

||P̂ − P||∞ ≤ ε

then,

||P̂^k − P^k||∞ ≤ ((k − 1)d + 1) ε

Now we can consider the concentration of the expression Fai = (Pa1(i) − Pa(i))(I − γPa1)^(−1). Notice that since Pa1 is a right stochastic matrix and γ < 1, we can expand (I − γPa1)^(−1) as

(I − γPa1)^(−1) = Σ_{j=0}^{∞} (γPa1)^j

and therefore

(Pa1(i) − Pa(i))(I − γPa1)^(−1) = (Pa1(i) − Pa(i)) Σ_{j=0}^{∞} (γPa1)^j

Theorem 6.1. Let Pa and Pa1 be n × n right stochastic matrices with at most d ∈ {1, . . . , n} non-zero elements per row, corresponding to actions a and a1, and let γ < 1. Let P̂a and P̂a1 be estimates of Pa and Pa1 such that

||P̂a − Pa||∞ ≤ ε and ||P̂a1 − Pa1||∞ ≤ ε

Then, ∀ a, a1 ∈ A

|| (P̂a1 − P̂a)(I − γP̂a1)^(−1) − (Pa1 − Pa)(I − γPa1)^(−1) ||∞ ≤ 2ε ((d − 1)γ + 1) / (1 − γ)²

Note that this result is for each action. The concentration over all actions can be found by using the union bound over the set of actions.
An estimate of the value of ε when the estimation is done using m samples can be shown using the Dvoretzky-Kiefer-Wolfowitz inequality A. 
Dvoretzky & Wolfowitz (1956) to be on the order of

ε ∈ O( √( (2/m) log(2n/δ) ) )

This result is shown in the following Theorem 6.2.
Theorem 6.2. Let Pa be an n × n right stochastic matrix for an action a ∈ A and let P̂a be a maximum likelihood estimate of Pa formed from m samples. If m ≥ (2/ε²) log(2n/δ), then we have

P[ ||P̂a − Pa||∞ ≤ ε ] ≥ 1 − δ

The theorem above assumes that it is possible to start in any given state. However, this may not always be the case. In that case, as long as every state is reachable from an initial state with probability at least α, the result presented in Theorem 5.2 can be modified to use Theorem 6.3 instead of Theorem 6.2.
Theorem 6.3. Let Pa be an n × n right stochastic matrix for an action a ∈ A and let P̂a be a maximum likelihood estimate of Pa formed from m samples. Let every state be reachable from the starting state in one step with probability at least α. If m ≥ (4/(αε²)) log(4nk/δ), then

P[ ||P̂a − Pa||∞ ≤ ε ∀ a ∈ A ] ≥ 1 − δ,  δ ∈ (0, 1)

7 Discussion

The result of Theorem 5.2 shows that the number of samples required to solve a β-strict separable inverse reinforcement learning problem and obtain a reward that generates the desired optimal policy is on the order of m ∈ O( (d²/β²) log(nk) ) for transition probability matrices with at most d ∈ {1, . . . , n} non-zero elements per row. Notice that the number of samples is inversely proportional to β². 
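The maximum likelihood estimation step that Theorems 6.2 and 6.3 reason about can be sketched as follows; the transition matrix, sample size, and seed here are invented for illustration, and the final bound is the Dvoretzky-Kiefer-Wolfowitz-style rate quoted above:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, m, delta = 5, 2000, 0.05

# A hypothetical n-state right stochastic matrix for one action.
P = rng.dirichlet(np.ones(n), size=n)

# MLE of each row: sample m next-states from state i, take frequencies.
P_hat = np.zeros_like(P)
for i in range(n):
    nxt = rng.choice(n, size=m, p=P[i])
    P_hat[i] = np.bincount(nxt, minlength=n) / m

eps_emp = np.abs(P_hat - P).max()        # entrywise infinity norm of the error
eps_dkw = math.sqrt(2 * math.log(2 * n / delta) / m)
print(eps_emp, eps_dkw)                  # bound holds with prob >= 1 - delta
```

As the theorems predict, shrinking the error ε below the threshold of Corollary 5.1 is purely a matter of taking m large enough.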
Thus, by viewing the case of Regime 2 as the limit β → 0 of the β-strict separable case (Regime 3), it is easy to see that an infinite number of samples is required to guarantee that the reward obtained will generate the optimal policy for the MDP with the true transition probability matrices.
In practical applications, however, it may be difficult to determine whether an inverse reinforcement learning problem is β-strict separable (Regime 3) or not. In this case, the result of equation (A.1) can be used as a witness to determine that the obtained R̂ satisfies Bellman's optimality condition with respect to the true transition probability matrices with high probability, as shown in the following remark.

Remark 7.1. Let {S, A, Pa, γ} be an inverse reinforcement learning problem. Let the transition probability matrices Pa each have at most d ∈ {1, . . . , n} non-zero elements per row. Let every state be reachable from the starting state in one step with probability at least α. Let R̂ be the solution to the optimization problem 4.1 with F̂ai, with transition probability matrices P̂a that are maximum likelihood estimates of Pa formed from m samples, and let

ε = 2 √( (4/(αm)) log(4nk/δ) ) · ((d − 1)γ + 1) / (1 − γ)²

If ||R̂||1 ≪ 1/ε, then with probability at least (1 − δ), we have Fai^T R̂ ≥ 0 ∀ a ∈ A \ a1, i = 1, . . . , n.

8 Experimental results

Figure 3: Empirical probability of success versus number of samples for an inverse reinforcement learning problem performed with n = 5 states and k = 5 actions (Left) and with n = 7 states and k = 7 actions (Right), using both our L1-regularized support vector machine formulation and the linear programming formulation proposed in Ng & Russel (2000). 
The vertical blue line represents the sample complexity for our method, as stated in Theorem 5.2.

Figure 4: Empirical probability of success versus number of samples for an inverse reinforcement learning problem performed with n = 7 states and k = 7 actions, using our L1-regularized support vector machine formulation, the linear programming formulation proposed in Ng & Russel (2000), Multiplicative Weights for Apprenticeship Learning Syed et al. (2008), Bayesian IRL with Laplacian prior Ramachandran & Amir (2007) and Gaussian Process IRL Levine et al. (2011). The vertical blue line represents the sample complexity for our method, as stated in Theorem 5.2.

Experiments were performed using randomly generated transition probability matrices with d = n non-zero elements per row, for β-strictly separable MDPs with n = 5 states, k = 5 actions, γ = 0.1 and with n = 7 states, k = 7 actions, γ = 0.1. Both experiments were done with a1 as the optimal policy. Thirty randomly generated MDPs were considered in each case, and a varying number of samples was used to find estimates of the transition probability matrices in each trial. Reward functions R̂ were found by solving Problem 4.1 for our L1-regularized SVM formulation and Problem 2.1 for the method of Ng & Russel (2000), along with the code available at https://graphics.stanford.edu/projects/gpirl/ to solve Multiplicative Weights for Apprenticeship Learning, Bayesian IRL with Laplacian prior and Gaussian Process IRL, all using the same set of estimated transition probabilities, i.e., F̂ai. The resulting reward functions were then tested using the true transition probabilities for Fai^T R̂ ≥ 0. The percentage of trials for which Fai^T R̂ ≥ 0 held true is shown in Figure 3 and Figure 4 for different numbers of samples used. 
As\nF T\nprescribed by Theorem 5.2, for \u03b2 \u2248 0.0032, the suf\ufb01cient number of samples for the success of\nai\nour method is O\n. As we can observe, the success rate increases with the number of\nsamples as expected. The L1-regularized support vector machine, however, signi\ufb01cantly outperforms\nthe linear programming formulation proposed in Ng & Russel (2000), Multiplicative Weights for\nApprenticeship Learning Syed et al. (2008), Bayesian IRL with Laplacian prior Ramachandran &\nAmir (2007) and Gaussian Process IRL Levine et al. (2011), reaching 100% success shortly after the\nsuf\ufb01cient number of samples while the other methods fall far behind. The result is that the reward\nfunction given by the L1-regularized support vector machine formulation successfully generates the\noptimal policy \u03c0 = a1 in almost 100% of the trials given O\nsamples while the reward\nfunction estimated by the other methods fail to generate the desired optimal policy.\n\n(cid:16) n2\n\n\u03b22 log (nk)\n\n(cid:17)\n\n(cid:16) n2\n\n\u03b22 log (nk)\n\n(cid:17)\n\n9 Concluding remarks\n\nThe L1-regularized support vector formulation along with the geometric interpretation provide a\nuseful way of looking at the inverse reinforcement learning problem with strong, formal guarantees.\nPossible future work on this problem includes extension to the inverse reinforcement learning problem\nwith continuous states by using sets of basis functions as presented in Ng & Russel (2000).\n\nReferences\nA. Dvoretzky, J. K. and Wolfowitz, J. Asymptotic minimax character of the sample distribution\nfunction and of the classical multinomial estimator. The Annals of Mathematical Statistics, pp.\n642\u2013669, 1956.\n\n9\n\n\fAbbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. ICML \u201904, pp.\n\n1\u2013, New York, NY, USA, 2004. ACM. doi: 10.1145/1015330.1015430.\n\nDaneshmand, H., Gomez-Rodriguez, M., Song, L., and Scholkopf, B. 
Estimating diffusion network\nstructures: Recovery conditions, sample complexity & soft-thresholding algorithm. In International\nConference on Machine Learning, pp. 793\u2013801, 2014.\n\nDvijotham, K. and Todorov, E. Inverse optimal control with linearly-solvable mdps. In ICML 2010,\n\npp. 335\u2013342, 2010.\n\nLevine, S., Popovic, Z., and Koltun, V. Nonlinear inverse reinforcement learning with gaussian\n\nprocesses. In NeurIPS 2011, pp. 19\u201327, 2011.\n\nLiu, H., Wasserman, L., Lafferty, J. D., and Ravikumar, P. K. Spam: Sparse additive models. In\n\nAdvances in Neural Information Processing Systems, pp. 1201\u20131208, 2008.\n\nNeu, G. and Szepesv\u00e1ri, C. Apprenticeship learning using inverse reinforcement learning and gradient\n\nmethods. In UAI 2007, pp. 295\u2013302. AUAI Press, 2007.\n\nNg, A. Y. and Russel, S. Algorithms for inverse reinforcement learning. In ICML 2000, pp. 663 \u2013\n\n670, 2000.\n\nRamachandran, D. and Amir, E. Bayesian inverse reinforcement learning. IJCAI 2007, 51(61801):\n\n1\u20134, 2007.\n\nRatliff, N. D., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In ICML 2006, pp.\n\n729\u2013736. ACM, 2006.\n\nRavikumar, P., Wainwright, M. J., Lafferty, J. D., et al. High-dimensional ising model selection using\n\nl1-regularized logistic regression. The Annals of Statistics, 38(3):1287\u20131319, 2010.\n\nSyed, U., Bowling, M., and Schapire, R. Apprenticeship learning using linear programming. In\n\nICML 2008, pp. 1032\u20131039. ACM, 2008.\n\nWainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-\nconstrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):\n2183\u20132202, 2009.\n\nZhu, J., Rosset, S., Tibshirani, R., and Hastie, T. J. 1-norm support vector machines. In Advances in\n\nneural information processing systems, pp. 49\u201356, 2004.\n\nZiebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. 
Maximum entropy inverse reinforcement learning. AAAI 2008.", "award": [], "sourceid": 3848, "authors": [{"given_name": "Abi", "family_name": "Komanduru", "institution": "Purdue University"}, {"given_name": "Jean", "family_name": "Honorio", "institution": "Purdue University"}]}