{"title": "Feature Construction for Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1342, "page_last": 1350, "abstract": "The goal of inverse reinforcement learning is to find a reward function for a Markov decision process, given example traces from its optimal policy. Current IRL techniques generally rely on user-supplied features that form a concise basis for the reward. We present an algorithm that instead constructs reward features from a large collection of component features, by building logical conjunctions of those component features that are relevant to the example policy. Given example traces, the algorithm returns a reward function as well as the constructed features. The reward function can be used to recover a full, deterministic, stationary policy, and the features can be used to transplant the reward function into any novel environment on which the component features are well defined.", "full_text": "Feature Construction for Inverse Reinforcement Learning

Sergey Levine
Stanford University
svlevine@cs.stanford.edu

Zoran Popović
University of Washington
zoran@cs.washington.edu

Vladlen Koltun
Stanford University
vladlen@cs.stanford.edu

Abstract

The goal of inverse reinforcement learning is to find a reward function for a Markov decision process, given example traces from its optimal policy. Current IRL techniques generally rely on user-supplied features that form a concise basis for the reward. We present an algorithm that instead constructs reward features from a large collection of component features, by building logical conjunctions of those component features that are relevant to the example policy. 
Given example traces, the algorithm returns a reward function as well as the constructed features. The reward function can be used to recover a full, deterministic, stationary policy, and the features can be used to transplant the reward function into any novel environment on which the component features are well defined.

1 Introduction

Inverse reinforcement learning aims to find a reward function for a Markov decision process, given only example traces from its optimal policy. IRL solves the general problem of apprenticeship learning, in which the goal is to learn the policy from which the examples were taken. The MDP formalism provides a compact method for specifying a task in terms of a reward function, and IRL further simplifies task specification by requiring only a demonstration of the task being performed. However, current IRL methods generally require not just expert demonstrations, but also a set of features or basis functions that concisely capture the structure of the reward function [1, 7, 9, 10].

Incorporating feature construction into IRL has been recognized as an important problem for some time [1]. It is often easier to enumerate all potentially relevant component features ("components") than to manually specify a set of features that is both complete and fully relevant. For example, when emulating a human driver, it is easier to list all known aspects of the environment than to construct a complete and fully relevant reward basis. The difficulty of performing IRL given only such components is that many of them may have important logical relationships that make it impossible to represent the reward function as their linear combination, while enumerating all possible relationships is intractable. In our example, some of the components, like the color of the road, may be irrelevant. 
Others, like the car's speed and the presence of police, might have an important logical relationship for a driver who prefers to speed.

We present an IRL algorithm that constructs reward features out of a large collection of component features, many of which may be irrelevant for the expert's policy. The Feature construction for Inverse Reinforcement Learning (FIRL) algorithm constructs features as logical conjunctions of the components that are most relevant for the observed examples, thus capturing their logical relationships. At the same time, it finds a reward function for which the optimal policy matches the examples. The reward function can be used to recover a deterministic, stationary policy for the expert, and the features can be used to transplant the reward to any novel environment on which the component features are well defined. In this way, the features act as a portable explanation for the expert's policy, enabling the expert's behavior to be predicted in unfamiliar surroundings.

2 Algorithm Overview

We define a Markov decision process as M = {S, A, θ, γ, R}, where S is a state space, A is a set of actions, θ_{sas'} is the probability of a transition from s ∈ S to s' ∈ S under action a ∈ A, γ ∈ [0, 1) is a discount factor, and R(s, a) is a reward function. The optimal policy π* is the policy that maximizes the expected discounted sum of rewards E[∑_{t=0}^∞ γ^t R(s_t, a_t) | π*, θ]. FIRL takes as input M \ R, as well as a set of traces from π*, denoted by D = {(s_{1,1}, a_{1,1}), ..., (s_{n,T}, a_{n,T})}, where s_{i,t} is the tth state in the ith trace. FIRL also accepts a set of component features of the form δ : S → Z, which are used to construct a set of relevant features for representing R.

The algorithm iteratively constructs both the features and the reward function. 
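As a concrete companion to these definitions, here is a minimal value-iteration sketch (our own illustrative code, not the authors' implementation) that recovers the optimal policy π* of M = {S, A, θ, γ, R}:

```python
import numpy as np

def value_iteration(theta, R, gamma, tol=1e-8):
    """Compute the optimal value function and policy of M = {S, A, theta, gamma, R}.

    theta: (S, A, S) array, theta[s, a, s'] = transition probability.
    R:     (S, A) array, R[s, a] = reward.
    """
    n_states = theta.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} theta[s, a, s'] * V(s')
        Q = R + gamma * theta @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)
```

IRL inverts this computation: rather than running value iteration on a known R, FIRL must find an R whose optimal policy reproduces the example traces D.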
Each iteration consists of an optimization step and a fitting step. The algorithm begins with an empty feature set Φ^(0). The optimization step of the ith iteration computes a reward function R^(i) using the current set of features Φ^(i−1), and the following fitting step determines a new set of features Φ^(i).

The objective of the optimization step is to find a reward function R^(i) that best fits the last feature hypothesis Φ^(i−1) while remaining consistent with the examples D. This appears similar to the objective of standard IRL methods. However, prior IRL algorithms generally minimize some measure of deviation from the examples, subject to the constraints of the provided features [1, 7, 8, 9, 10]. In contrast, the FIRL optimization step aims to discover regions where the current features are insufficient, and must be able to step outside of the constraints of these features. To this end, the reward function R^(i) is found by solving a quadratic program, with constraints that keep R^(i) consistent with D, and an objective that penalizes the deviation of R^(i) from its projection onto the linear basis formed by the features Φ^(i−1).

The fitting step analyzes the reward function R^(i) to generate a new feature hypothesis Φ^(i) that better captures the variation in the reward function. Intuitively, the regions where R^(i) is poorly represented by Φ^(i−1) correspond to features that must be refined further, while regions where different features take on similar rewards are indicative of redundant features that should be merged. The hypothesis is constructed by building a regression tree on S for R^(i), with the components acting as tests at each node. Each leaf ℓ contains some subset of S, denoted φ_ℓ. The new features are the set of indicator functions for membership in φ_ℓ. 
A simple explanation of the reward function is often more likely to be the correct one [7], so we prefer the smallest tree that produces a sufficiently rich feature set to represent a reward function consistent with the examples. To obtain such a tree, we stop subdividing a node ℓ when setting the reward for all states in φ_ℓ to their average induces an optimal policy consistent with the examples.

The constructed features are iteratively improved through the interaction between the optimization and fitting steps. Since the optimization is constrained to be consistent with D, if the current set of features is insufficient to represent a consistent reward function, R^(i) will not be well-represented by the features Φ^(i−1). This intra-feature reward variance is detected in the fitting step, and the features that were insufficiently refined are subdivided further, while redundant features that have little variance between them are merged.

3 Optimization Step

During the ith optimization step, we compute a reward function R^(i) using the examples D and the current feature set Φ^(i−1). This reward function is chosen so that the optimal policy under the reward is consistent with the examples D and so that it minimizes the sum of squared errors between R^(i) and its projection onto the linear basis of features Φ^(i−1). Formally, let T_{R→Φ} be a |Φ^(i−1)| by |S| matrix for which T_{R→Φ}(φ, s) = |φ|^{−1} if s ∈ φ, and 0 otherwise, and let T_{Φ→R} be a |S| by |Φ^(i−1)| matrix for which T_{Φ→R}(s, φ) = 1 if s ∈ φ, and 0 otherwise. Thus, T_{Φ→R}T_{R→Φ}R is a vector where the reward in each state is the average over all rewards in the feature that state belongs to. 
Letting π_R denote the optimal policy under R, the reward optimization problem can be expressed as:

    min_R  ‖R − T_{Φ→R}T_{R→Φ}R‖²
    s.t.   π_R(s) = a   ∀ (s, a) ∈ D                                     (1)

Unfortunately, the constraint (1) is not convex, making it difficult to solve the optimization efficiently. We can equivalently express it in terms of the value function corresponding to R as

    V(s) = R(s, a) + γ ∑_{s'} θ_{sas'} V(s')             ∀ (s, a) ∈ D
    V(s) = max_a [ R(s, a) + γ ∑_{s'} θ_{sas'} V(s') ]   ∀ s ∈ S        (2)

These constraints are also not convex, but we can construct a convex relaxation by using a pseudo-value function that bounds the value function from above, replacing (2) with the linear constraint

    V(s) ≥ R(s, a) + γ ∑_{s'} θ_{sas'} V(s')   ∀ s ∉ D

In the special case that the MDP transition probabilities θ are deterministic, these constraints are equivalent to the original constraint (1). We prove this by considering the true value function V* obtained by value iteration, initialized with the pseudo-value function V. Let V' be the result obtained by performing one step of value iteration. Note that V'(s) ≤ V(s) for all s ∈ S: since V(s) ≥ R(s, a) + γ ∑_{s'} θ_{sas'} V(s'), we must have V(s) ≥ max_a [R(s, a) + γ ∑_{s'} θ_{sas'} V(s')] = V'(s). Since the MDP is deterministic and the example set D consists of traces from the optimal policy, we have a unique next state for each state-action pair. Let (s_{i,t}, a_{i,t}) ∈ D be the tth state-action pair from the ith expert trace. 
Since the constraints ensure that V(s_{i,t}) = max_a [R(s_{i,t}, a) + γV(s_{i,t+1})], we have V'(s_{i,t}) = V(s_{i,t}) for all i, t, and since V'(s) for s ∉ D can only decrease, we know that the optimal actions in all s_{i,t} must remain the same. Therefore, for each example state s_{i,t}, a_{i,t} remains the optimal action under the true value function V*, and the convex relaxation is equivalent to the original constraint (1).

In the case that θ is not deterministic, not all successors of an example state s_{i,t} are always observed, and their values under the pseudo-value function may not be sufficiently constrained. However, empirical tests presented in Figure 2(b) suggest that the constraint (1) is rarely violated under the convex relaxation, even in highly non-deterministic MDPs.

In practice, we prefer a reward function under which the examples are not just part of an optimal policy, but are part of the unique optimal policy [7]. To prevent rewards under which example actions "tie" for the optimal choice, we require that a_{i,t} be better than all other actions in state s_{i,t} by some margin ε, which we accomplish by adding ε to all inequality constraints for state s_{i,t}. The precise value of ε is not important, since changing it only scales the reward function by a constant.

All of the constraints in the final optimization are sparse, but the matrix T_{Φ→R}T_{R→Φ} in the original objective can be arbitrarily dense (if, for instance, there is only one feature which contains all states). 
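The operators T_{R→Φ} (per-feature averaging) and T_{Φ→R} (broadcasting a feature's value back to its states) are easy to assemble as sparse matrices. A sketch of this construction (our own illustration, treating R as a state-indexed vector for simplicity):

```python
import numpy as np
from scipy import sparse

def feature_projection_matrices(features, n_states):
    """Build T_{R->Phi} and T_{Phi->R} for a feature set that partitions the states.

    features: list of arrays of state indices, one array per feature phi.
    """
    rows_a, cols_a, vals_a = [], [], []  # T_{R->Phi}(phi, s) = 1/|phi| for s in phi
    rows_b, cols_b = [], []              # T_{Phi->R}(s, phi) = 1 for s in phi
    for f, phi in enumerate(features):
        for s in phi:
            rows_a.append(f); cols_a.append(int(s)); vals_a.append(1.0 / len(phi))
            rows_b.append(int(s)); cols_b.append(f)
    T_r2phi = sparse.csr_matrix((vals_a, (rows_a, cols_a)),
                                shape=(len(features), n_states))
    T_phi2r = sparse.csr_matrix((np.ones(len(rows_b)), (rows_b, cols_b)),
                                shape=(n_states, len(features)))
    return T_r2phi, T_phi2r
```

Applying `T_phi2r @ (T_r2phi @ R)` replaces each state's reward with the average over its feature, which is exactly the projection penalized in the objective.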
Since both T_{Φ→R} and T_{R→Φ} are sparse, and in fact only contain |S||A| non-zero entries, we can make the optimization fully sparse by introducing a new set of variables R_Φ defined as R_Φ = T_{R→Φ}R, yielding the sparse objective ‖R − T_{Φ→R}R_Φ‖².

Recall that the fitting step must determine not only which features must be refined further, but also which features can be merged. We therefore add a second term to the objective to discourage nearby features from taking on different values when it is unnecessary. To that end, we construct a sparse matrix N, where each row k of N corresponds to a pair of features φ_{k1} and φ_{k2} (for a total of K rows). We define N as N_{k,φ_{k1}} = −N_{k,φ_{k2}} = ∆(φ_{k1}, φ_{k2}), so that [N R_Φ]_k = (R_{Φφ_{k1}} − R_{Φφ_{k2}}) ∆(φ_{k1}, φ_{k2}). The loss factor ∆(φ_{k1}, φ_{k2}) indicates how much we believe a priori that the features φ_{k1} and φ_{k2} should be merged, and is discussed further in Section 4. Since the purpose of the added term is to allow superfluous features to be merged because they take on similar values, we prefer for a feature to be very similar to one of its neighbors, rather than to have minimal distance to all of them. We therefore use a linear rather than quadratic penalty. Since we would like to make nearby features similar so long as it does not adversely impact the primary objective, we give this adjacency penalty a low weight. In our implementation, this weight was set to w_N = 10^{−5}. Normalizing the two objectives by the number of entries, we get the following sparse quadratic program:
    min_{R, R_Φ, V}  (1/|S||A|) ‖R − T_{Φ→R}R_Φ‖²₂ + (w_N/K) ‖N R_Φ‖₁
    s.t.  R_Φ = T_{R→Φ}R
          V(s) = R(s, a) + γ ∑_{s'} θ_{sas'} V(s')         ∀ (s, a) ∈ D
          V(s) ≥ R(s, a) + γ ∑_{s'} θ_{sas'} V(s') + ε     ∀ s ∈ D, (s, a) ∉ D
          V(s) ≥ R(s, a) + γ ∑_{s'} θ_{sas'} V(s')         ∀ s ∉ D

This program can be solved efficiently with any quadratic programming solver. It contains on the order of |S||A| variables and constraints, and the constraint matrix is sparse with O(|S||A|μ_a) non-zero entries, where μ_a is the average sparsity of θ_{sa}, that is, the average number of states s' that have a non-zero probability of being reached from s using action a. In our implementation, we use the cvx Matlab package [6] to solve this optimization efficiently.

4 Fitting Step

Once the reward function R^(i) for the current feature set Φ^(i−1) is computed, we formulate a new feature hypothesis Φ^(i) that is better able to represent this reward function. The objective of this step is to construct a set of features that gives greater resolution in regions where the old features are too coarse, and lower resolution in regions where the old features are unnecessarily fine. We obtain Φ^(i) by building a regression tree for R^(i) over the state-space S, using the standard intra-cluster variance splitting criterion [3]. The tree is rooted at the node t_0, and each node of the tree is defined as t_j = {δ_j, φ_j, t_{j−}, t_{j+}}. t_{j−} and t_{j+} are the left and right subtrees, φ_j ⊆ S is the set of states belonging to node j (initialized as φ_0 = S), and δ_j is the component feature that acts as the splitting test at node j. 
States s ∈ φ_j for which δ_j(s) = 0 are assigned to the left subtree, and states for which δ_j(s) = 1 are assigned to the right subtree. In our implementation, all component features are binary, though the generalization to multivariate components and non-binary trees is straightforward. The new set of features consists of indicators for each of the leaf clusters φ_ℓ (where t_ℓ is a leaf node), and can be equivalently expressed as a conjunction of components: letting j_0, ..., j_n, ℓ be the sequence of nodes on the path from the root to t_ℓ, and defining r_0, ..., r_n so that r_k is 1 if t_{j_{k+1}} = t_{j_k+} and 0 otherwise, s ∈ φ_ℓ if and only if δ_{j_k}(s) = r_k for all k ∈ {0, ..., n}.

As discussed in Section 2, we prefer the smallest tree that produces a rich enough feature set to represent a reward function consistent with the examples D. We therefore terminate the splitting procedure at node t_ℓ when we detect that further splitting of the node is unnecessary to maintain consistency with the example set. This is done by constructing a new reward function R̂^(i) for which R̂^(i)(s, a) = |φ_ℓ|^{−1} ∑_{s'∈φ_ℓ} R^(i)(s', a) if s ∈ φ_ℓ, and R̂^(i)(s, a) = R^(i)(s, a) otherwise. The optimal policy under R̂^(i) is determined with value iteration and, if the policy is consistent with the examples D, t_ℓ becomes a leaf and R^(i) is updated to be equal to R̂^(i). Although value iteration ordinarily can take many iterations, since the changes we are considering often make small, local changes to the optimal policy compared to the current reward function R^(i), we can often converge in only a few iterations by starting with the value function V^(i) for the current reward R^(i). 
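The stopping test described here can be sketched as follows (illustrative code; names and tolerances are ours). It averages R^(i) over a candidate leaf and checks, via warm-started value iteration, whether the resulting policy still matches the example actions:

```python
import numpy as np

def splitting_is_unnecessary(theta, R, gamma, leaf_states, examples, V_init,
                             iters=1000, tol=1e-8):
    """Stopping test for a candidate leaf phi_ell.

    theta:       (S, A, S) transition probabilities
    R:           (S, A) current reward R^(i)
    leaf_states: state indices in the candidate leaf
    examples:    list of (state, action) pairs D
    V_init:      value function of the current reward, used as a warm start
    """
    R_hat = R.copy()
    # Average the reward over the leaf's states, per action.
    R_hat[leaf_states] = R[leaf_states].mean(axis=0)
    V = V_init.copy()
    for _ in range(iters):  # warm start often converges in a few iterations
        V_new = (R_hat + gamma * theta @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = (R_hat + gamma * theta @ V_new).argmax(axis=1)
    return all(policy[s] == a for s, a in examples)
```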
We therefore store this value function and update it along with R^(i).

In addition to this stopping criterion, we can also employ the loss factor ∆(φ_{k1}, φ_{k2}) to encourage the next optimization step to assign similar values to nearby features, allowing them to be merged in subsequent iterations. Recall that ∆(φ_{k1}, φ_{k2}) is a linear penalty on the difference between the average rewards of states in φ_{k1} and φ_{k2}, and can be used to drive the rewards in these features closer together so that they can be merged in a subsequent iteration. Features found deeper in the tree exhibit greater complexity, since they are formed by a conjunction of a larger number of components. These complex features are more likely to be the result of overfitting, and can be merged to form smaller trees. To encourage such mergers, we set ∆(φ_{k1}, φ_{k2}) to be proportional to the depth of the deepest common ancestor of φ_{k1} and φ_{k2}. 

Gridworld   Total    LPAL      MMP      Abbeel & Ng   FIRL          Optimization   Fitting
size        states   (sec)     (sec)    (sec)         (sec total)   (sec each)     (sec each)
16×16       256      27.05     0.29     0.24          8.34          0.39           0.11
32×32       1024     74.66     0.66     0.42          29.00         1.01           0.73
64×64       4096     272.10    2.22     1.26          165.29        4.26           5.80
128×128     16384    876.18    19.33    7.58          1208.47       24.44          48.44
256×256     65536    1339.87   52.60    81.26         10389.59      170.14         428.49

Table 1: Performance comparison of FIRL, LPAL, MMP, and Abbeel & Ng on gridworlds of varying size. FIRL ran for 15 iterations. Individual iterations were comparable in length to prior methods.
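The resulting loss factor ∆(φ_{k1}, φ_{k2}) = D_a(k1, k2)/D_t can be computed from the two leaves' root-to-leaf paths; the depth convention (root at depth 0) is our assumption, as the paper leaves the indexing implicit:

```python
def merge_loss_factor(path1, path2, tree_depth):
    """Loss factor Delta(phi_k1, phi_k2) = D_a(k1, k2) / D_t.

    path1, path2: root-to-leaf node-id paths for the two features' leaves.
    tree_depth:   total depth D_t of the tree (root at depth 0).
    """
    shared = 0
    for n1, n2 in zip(path1, path2):
        if n1 != n2:
            break
        shared += 1
    # The deepest common ancestor is the last node of the shared prefix.
    return (shared - 1) / tree_depth
```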
The loss factor is therefore set to\n\u2206(\u03c6k1, \u03c6k2) = Da(k1, k2)/Dt, where Da gives the depth of the deepest common ancestor of two\nnodes, and Dt is the total depth of the tree.\nFinally, we found that limiting the depth of the tree and iteratively increasing that limit reduced\nover\ufb01tting and produced features that more accurately described the true reward function, since the\noptimization and \ufb01tting steps could communicate more frequently before committing to a set of\ncomplex features. We therefore begin with a depth limit of one, and increase the limit by one on\neach successive iteration. We experimented with a variety of other depth limiting schemes and found\nthat this simple iterative deepening procedure produced the best results.\n\n5 Experiments\n\n5.1 Gridworld\n\nIn the \ufb01rst experiment, we compare FIRL with the MMP algorithm [9], the LPAL algorithm [10],\nand the algorithm of Abbeel & Ng [1] on a gridworld modeled after the one used by Abbeel & Ng.\nThe purpose of this experiment is to determine how well FIRL performs on a standard IRL example,\nwithout knowledge of the relevant features. A gridworld consists of an N\u00d7N grid of states, with\n\ufb01ve actions possible in each state, corresponding to movement in each of the compass directions and\nstanding in place. In the deterministic gridworld, each action deterministically moves the agent into\nthe corresponding state. In the non-deterministic world, each action has a 30% chance of causing a\ntransition to another random neighboring state. The world is partitioned into 64 equal-sized regions,\nand all the cells in a single region are assigned the same randomly selected reward. The expert\u2019s\npolicy is the optimal policy under this reward. 
The example set D is generated by randomly sampling\nstates and following the expert\u2019s policy for 100 steps.\nSince the prior algorithms do not perform feature construction, they were tested either with indica-\ntors for each of the 64 regions (referred to as \u201cperfect\u201d features), or with indicators for each state\n(the \u201cprimitive\u201d features). FIRL was instead provided with 2N component features corresponding\nto splits on the x and y axes, so that \u03b4x,i(sx,y) = 1 if x \u2265 i, and \u03b4y,i(sx,y) = 1 if y \u2265 i. By\ncomposing such splits, it is possible to represent any rectangular partitioning of the state space.\nWe \ufb01rst compare the running times of the algorithms (using perfect features for prior methods) on\ngridworlds of varying sizes, shown in Table 1. Performance was tested on an Intel Core i7 2.66\nGHz computer. Each trial was repeated 10 times on random gridworlds, with average running times\npresented. For FIRL, running time is given for 15 iterations, and is also broken down into the\naverage length of each optimization and \ufb01tting step. Although FIRL is often slower than methods\nthat do not perform feature construction, the results suggest that it scales gracefully with the size of\nthe problem. The optimization time scales almost linearly, while the tree construction scales worse\nthan linearly but better than quadratically. The latter can likely be improved for large problems by\nusing heuristics to minimize evaluations of the expensive stopping test.\nIn the second experiment, shown in Figure 1, we evaluate accuracy on 64\u00d7 64 gridworlds with\nvarying numbers of examples, again repeating each trial 10 times. 
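The axis-split components δ_{x,i} and δ_{y,i} described above can be sketched as (illustrative code; names are ours):

```python
def axis_split_components(N):
    """Component features delta_{x,i}(s) = [x >= i] and delta_{y,i}(s) = [y >= i]
    for an N x N gridworld. Returns 2N components, each a function of (x, y)."""
    components = []
    for i in range(N):
        components.append(lambda x, y, i=i: int(x >= i))  # delta_{x,i}
    for i in range(N):
        components.append(lambda x, y, i=i: int(y >= i))  # delta_{y,i}
    return components
```

A conjunction such as δ_{x,2} ∧ ¬δ_{x,5} ∧ δ_{y,1} ∧ ¬δ_{y,3} selects exactly the rectangle 2 ≤ x < 5, 1 ≤ y < 3, which is why these 2N components suffice for any rectangular partitioning of the state space.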
We measured the percentage of states in which each algorithm failed to predict the expert's optimal action ("percent misprediction"), as well as the Euclidean distance between the expectations of the perfect features under the learned policy and the expert's policy (normalized by (1 − γ) as suggested by Abbeel & Ng [1]). For the mixed policies produced by Abbeel & Ng, we computed the metrics for each policy and mixed them using the policy weights λ [1]. For the non-deterministic policies of LPAL, percent misprediction is the mean probability of taking an incorrect action in each state. Results for prior methods are shown with both the perfect and primitive features. FIRL again ran for 15 iterations, and generally achieved comparable accuracy to prior algorithms, even when they were provided with perfect features.

Figure 1: Accuracy comparison between FIRL, LPAL, MMP, and Abbeel & Ng, the latter provided with either perfect or primitive features. Shaded regions show standard error. Although FIRL was not provided the perfect features, it achieved similar accuracy to prior methods that were.

5.2 Transfer Between Environments

While the gridworld experiments demonstrate that FIRL performs comparably to existing methods on this standard example, even without knowing the correct features, they do not evaluate the two key advantages of FIRL: its ability to construct features from primitive components, and its ability to generalize learned rewards to different environments. To evaluate reward transfer and see how the method performs with more realistic component features, we populated a world with objects. This environment also consists of an N×N grid of states, with the same actions as the gridworld. Objects are randomly placed with 5% probability in each state, and each object has 1 of C "inner" and "outer" colors, selected uniformly at random. 
The algorithm was provided with components of\nthe form \u201cis the nearest X at most n units away,\u201d where X is a wall or an object with a speci\ufb01c\ninner or outer color, giving a total of (2C + 1)N component features. The expert received a reward\nof \u22122 for being within 3 units of an object with inner color 1, otherwise a reward of \u22121 for being\nwithin 2 units of a wall, otherwise a reward of 1 for being within 1 unit of an object with inner color\n2, and 0 otherwise. All other colors acted as distractors, allowing us to evaluate the robustness of\nfeature construction to irrelevant components. For each trial, the learned reward tree was used to test\naccuracy on 10 more random environments, by specifying a reward for each state according to the\nregression tree. We will refer to these experiments as \u201ctransfer.\u201d Each trial was repeated 10 times.\nIn Figure 2(a), we evaluate how FIRL performs\nwith varying numbers of iterations on both the\ntraining and transfer environments, as well as\non the gridworld from the previous section. The\nresults indicate that FIRL converged to a sta-\nble hypothesis more quickly than in the grid-\nworld, since the square regions in the gridworld\nrequired many more partitions than the object-\nrelative features. However, the required number\nof iterations was low on both environments.\nIn Figure 2(b), we evaluate how often the non-\nconvex constraints discussed in Section 3 are\nviolated under our convex approximation. We\nmeasure the percent of examples that are vio-\nlated with varying amounts of non-determinism, by varying the probability \u03b2 with which an action\nmoves the agent to the desired state. \u03b2 = 1 is deterministic, and \u03b2 = 0.2 gives a uniform distribution\nover neighboring states. 
The results suggest that the constraint is rarely violated under the convex relaxation, even in highly non-deterministic MDPs, and the number of violations decreases sharply as the MDP becomes more deterministic.

Figure 2(a): FIRL converged after a small number of iterations.

Figure 2(b): Constraint violation was low in non-deterministic MDPs.

Figure 3: Comparison of FIRL and Abbeel & Ng on training environments and randomly generated transfer environments, with increasing numbers of component features. FIRL maintained higher transfer accuracy in the presence of distractors by constructing features out of relevant components.

We compared FIRL's accuracy on the transfer task with Abbeel & Ng and MMP. LPAL was not used in the comparison because it does not return a reward function, and therefore cannot transfer 
Since prior methods do not perform feature construction, they were\nprovided with all of the component features. The experiments used 64\u00d764 environments and 64\nexamples. The number of colors C was varied from 2 to 20 to test how well the algorithms handle\nirrelevant \u201cdistractors.\u201d FIRL ran for 10 iterations on each trial. The results in Figure 3 indicate\nthat accuracy on the training environment remained largely stable, while transfer accuracy gradu-\nally decreased with more colors due to the ambiguity caused by large numbers of distractors. Prior\nalgorithms were more affected by distractors on the training environments, and their inability to con-\nstruct features prevented them from capturing a portable \u201cexplanation\u201d of the expert\u2019s reward. They\ntherefore could not transfer the learned policy to other environments with comparable accuracy.\nIn contrast to the gridworld experiments, the expert\u2019s reward function in these environments was\nencoded in terms of logical relationships between the component features, which standard IRL al-\ngorithms cannot capture. In the next section, we examine another environment that also exempli\ufb01es\nthe need for feature construction.\n\n5.3 Highway Driving Behaviors\n\nTo demonstrate FIRL\u2019s ability to learn meaningful behaviors, we implemented a driving simulator\ninspired by the environments in [1] and [10]. The task is to navigate a car on a three-lane highway.\nAll other vehicles are moving at speed 1. The agent can drive at speeds 1 through 4, and can move\none lane left or one lane right. The other vehicles can be cars or motorcycles, and can be either\ncivilian or police, for a total of 4 possibilities. The component features take the form \u201cis a vehicle\nof type X at most n car-lengths in front/behind me,\u201d where X can be either all vehicles, cars,\nmotorcycles, police, or civilian, and n is in the range from 0 to 5 car-lengths. 
There are equivalent\nfeatures for checking for cars in front or behind in the lanes to the left and to the right of the agent\u2019s,\nas well as a feature for each of the four speeds and each lane the agent can occupy.\nThe rich feature set of this driving simulator enables interesting behaviors to be demonstrated. For\nthis experiment, we implemented expert policies for two behaviors: a \u201clawful\u201d driver and an \u201cout-\nlaw\u201d driver. The lawful driver prefers to drive fast, but does not exceed speed 2 in the right lane, or\nspeed 3 in the middle lane. The outlaw driver also prefers to drive fast, but slows down to speed 2\nor below when within 2 car-lengths of a police vehicle (to avoid arrest).\nIn Table 2, we compare the policies learned from traces of the two experts by FIRL, MMP, and\nAbbeel & Ng\u2019s algorithm. As before, prior methods were provided with all of the component fea-\ntures. All algorithms were trained on 30 traces on a stretch of highway 100 car-lengths long, and\ntested on 10 novel highways. As can be seen in the supplemental videos, the policy learned by FIRL\nclosely matched that of the expert, maintaining a high speed whenever possible but not driving fast in\nthe wrong lane or near police vehicles. The policies learned by Abbeel & Ng\u2019s algorithm and MMP\ndrove at the minimum speed when trained on either the lawful or outlaw expert traces. Because\nprior methods only represented the reward as a linear combination of the provided features, they\nwere unable to determine the logical connection between speed and the other features. The policies\nlearned by these methods found the nearest \u201coptimal\u201d position with respect to their learned feature\nweights, accepting the cost of violating the speed expectation in exchange for best matching the\nexpectation of all other (largely irrelevant) features. 
FIRL, on the other hand, correctly established the logical connection between speed and police vehicles or lanes, and drove fast when appropriate, as indicated by the average speed in Table 2. As a baseline, the table also shows the performance of a random policy generated by picking weights for the component features uniformly at random.

[Figure 3: percent misprediction and feature expectation distance as a function of the number of colors (2 to 20), in deterministic and non-deterministic environments, for A&N, MMP, and FIRL on both the training and transfer environments.]

                   "Lawful" policies                         "Outlaw" policies
            percent       feature exp.   average      percent       feature exp.   average
            misprediction distance       speed        misprediction distance       speed
Expert      0.0%          0.000          2.410        0.0%          0.000          2.375
FIRL        22.9%         0.025          2.314        24.2%         0.027          2.376
MMP         27.0%         0.111          1.068        27.2%         0.096          1.056
A&N         38.6%         0.202          1.054        39.3%         0.164          1.055
Random      42.7%         0.220          1.053        41.4%         0.184          1.053

Table 2: Comparison of FIRL, MMP and Abbeel & Ng on the highway environment. The policies learned by FIRL closely match the expert's average speed, while those of other methods do not. The difference between the policies is particularly apparent in the supplemental videos, which can be found at http://graphics.stanford.edu/projects/firl/index.htm

6 Discussion and Future Work

This paper presents an IRL algorithm that constructs reward features, represented as a regression tree, out of a large collection of component features.
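To make the construction concrete, the sketch below fits a toy regression tree to per-state reward values over binary component features and reads each leaf off as a logical conjunction of component literals. This is an illustration of the idea, not the FIRL fitting step itself (which also interacts with the optimization constraints); the variance-based splitting rule and all names are assumptions.

```python
# Toy regression tree over binary component features. Each leaf corresponds
# to a conjunction of literals (feature j is True/False along the root-leaf
# path), which is exactly the kind of constructed reward feature discussed.

def variance(ys):
    """Sum of squared deviations from the mean (split quality measure)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_tree(X, y, depth, path=()):
    """X: list of binary feature tuples; y: reward value per state.
    Returns a list of (conjunction, mean reward) pairs, one per leaf,
    where a conjunction is a tuple of (feature_index, truth_value)."""
    if depth == 0 or len(set(y)) == 1:
        return [(path, sum(y) / len(y))]
    best = None
    for j in range(len(X[0])):
        left = [i for i, x in enumerate(X) if x[j]]
        right = [i for i, x in enumerate(X) if not x[j]]
        if not left or not right:
            continue  # split does not separate the data
        score = variance([y[i] for i in left]) + variance([y[i] for i in right])
        if best is None or score < best[0]:
            best = (score, j, left, right)
    if best is None:
        return [(path, sum(y) / len(y))]
    _, j, left, right = best
    return (fit_tree([X[i] for i in left], [y[i] for i in left],
                     depth - 1, path + ((j, True),))
            + fit_tree([X[i] for i in right], [y[i] for i in right],
                       depth - 1, path + ((j, False),)))
```

For example, with components (fast, police_near) and a reward that is positive only when driving fast with no police nearby, a depth-2 tree recovers the leaf ((0, True), (1, False)), i.e. the conjunction "fast AND NOT police_near", which no linear combination of the two components can express.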
By combining relevant components into logical conjunctions, the FIRL algorithm is able to discover logical precedence relationships that would not otherwise be apparent. The learned regression tree concisely captures the structure of the reward function and acts as a portable "explanation" of the observed behavior in terms of the provided components, allowing the learned reward function to be transplanted onto different environments.

Feature construction for IRL may be a valuable tool for analyzing the motivations of an agent (such as a human or an animal) from observed behavior. Research indicates that animals learn optimal policies for a pattern of rewards [4], suggesting that it may be possible to learn such behavior with IRL. While it can be difficult to manually construct a complete list of relevant reward features for such an agent, it is comparatively easier to list all aspects of the environment that a human or animal is aware of. With FIRL, such a list can be used to form hypotheses about reward features, possibly leading to increased understanding of the agent's motivations. In fact, models that perform a variant of IRL have been shown to correspond well to goal inference in humans [2].

While FIRL achieves good performance on discrete MDPs, in its present form it is unable to handle continuous state spaces, since the optimization constraints require an enumeration of all states in S. Approximate linear programming has been used to solve MDPs with continuous state spaces [5], and a similar approach could be used to construct a tractable set of constraints for the optimization step, making it possible to perform feature construction on continuous or extremely large state spaces.

Although we found that FIRL converged to a stable hypothesis quickly, it is difficult to provide an accurate convergence test. Theoretical analysis of convergence is complicated by the fact that regression trees provide few guarantees.
The conventional training error metric is not a good measure of convergence, because the optimization constraints keep training error consistently low. Instead, we can use cross-validation, or heuristics such as leaf count and tree depth, to estimate convergence. In practice, we found this unnecessary, as FIRL consistently converged in very few iterations. Defining a practical convergence test and analyzing convergence is an interesting avenue for future work.

FIRL may also benefit from future work on the fitting step. A more intelligent hypothesis proposal scheme, perhaps with a Bayesian approach, could more readily incorporate priors on potential features to penalize excessively deep trees or prevent improbable conjunctions of components. Furthermore, while regression trees provide a principled method for constructing logical conjunctions of component features, if the desired features are not readily expressible as conjunctions of simple components, other regression methods may be used in the fitting step. For example, the algorithm could be modified to perform feature adaptation by using the fitting step to adapt a set of continuously-parameterized features to best fit the reward function.

Acknowledgments. We thank Andrew Y. Ng, Emanuel Todorov, and Sameer Agarwal for helpful feedback and discussion. This work was supported in part by NSF grant CCF-0641402.

References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML '04: Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.

[2] C. L. Baker, J. B. Tenenbaum, and R. R. Saxe. Goal inference as inverse planning. In Proceedings of the 29th Annual Conference of the Cognitive Science Society, 2007.

[3] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.

[4] P. Dayan and B. W.
Balleine. Reward, motivation, and reinforcement learning. Neuron, 36(2):285–298, 2002.

[5] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

[6] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming (web page and software), 2008. http://stanford.edu/~boyd/cvx.

[7] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In ICML '00: Proceedings of the 17th International Conference on Machine Learning, pages 663–670. Morgan Kaufmann Publishers Inc., 2000.

[8] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In IJCAI '07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2586–2591. Morgan Kaufmann Publishers Inc., 2007.

[9] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 729–736. ACM, 2006.

[10] U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039. ACM, 2008.