{"title": "Generative Adversarial Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4565, "page_last": 4573, "abstract": "Consider learning a policy from example expert behavior, without interaction with the expert or access to a reinforcement signal. One approach is to recover the expert's cost function with inverse reinforcement learning, then extract a policy from that cost function with reinforcement learning. This approach is indirect and can be slow. We propose a new general framework for directly extracting a policy from data as if it were obtained by reinforcement learning following inverse reinforcement learning. We show that a certain instantiation of our framework draws an analogy between imitation learning and generative adversarial networks, from which we derive a model-free imitation learning algorithm that obtains significant performance gains over existing model-free methods in imitating complex behaviors in large, high-dimensional environments.", "full_text": "Generative Adversarial Imitation Learning\n\nJonathan Ho\n\nOpenAI\n\nhoj@openai.com\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nAbstract\n\nConsider learning a policy from example expert behavior, without interaction with\nthe expert or access to a reinforcement signal. One approach is to recover the\nexpert\u2019s cost function with inverse reinforcement learning, then extract a policy\nfrom that cost function with reinforcement learning. This approach is indirect\nand can be slow. We propose a new general framework for directly extracting a\npolicy from data as if it were obtained by reinforcement learning following inverse\nreinforcement learning. 
We show that a certain instantiation of our framework draws an analogy between imitation learning and generative adversarial networks, from which we derive a model-free imitation learning algorithm that obtains significant performance gains over existing model-free methods in imitating complex behaviors in large, high-dimensional environments.

1 Introduction

We are interested in a specific setting of imitation learning, the problem of learning to perform a task from expert demonstrations, in which the learner is given only samples of trajectories from the expert, is not allowed to query the expert for more data while training, and is not provided a reinforcement signal of any kind. There are two main approaches suitable for this setting: behavioral cloning [18], which learns a policy as a supervised learning problem over state-action pairs from expert trajectories; and inverse reinforcement learning [23, 16], which finds a cost function under which the expert is uniquely optimal.

Behavioral cloning, while appealingly simple, only tends to succeed with large amounts of data, due to compounding error caused by covariate shift [21, 22]. Inverse reinforcement learning (IRL), on the other hand, learns a cost function that prioritizes entire trajectories over others, so compounding error, a problem for methods that fit single-timestep decisions, is not an issue. Accordingly, IRL has succeeded in a wide range of problems, from predicting behaviors of taxi drivers [29] to planning footsteps for quadruped robots [20].

Unfortunately, many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop. Scaling IRL methods to large environments has thus been the focus of much recent work [6, 13]. Fundamentally, however, IRL learns a cost function, which explains expert behavior but does not directly tell the learner how to act.
Given that the learner's true goal often is to take actions imitating the expert (indeed, many IRL algorithms are evaluated on the quality of the optimal actions of the costs they learn), why, then, must we learn a cost function, if doing so possibly incurs significant computational expense yet fails to directly yield actions?

We desire an algorithm that tells us explicitly how to act by directly learning a policy. To develop such an algorithm, we begin in Section 3, where we characterize the policy given by running reinforcement learning on a cost function learned by maximum causal entropy IRL [29, 30]. Our characterization introduces a framework for directly learning policies from data, bypassing any intermediate IRL step. Then, we instantiate our framework in Sections 4 and 5 with a new model-free imitation learning algorithm. We show that our resulting algorithm is intimately connected to generative adversarial networks [8], a technique from the deep learning community that has led to recent successes in modeling distributions of natural images: our algorithm harnesses generative adversarial training to fit distributions of states and actions defining expert behavior. We test our algorithm in Section 6, where we find that it outperforms competing methods by a wide margin in training policies for complex, high-dimensional physics-based control tasks over various amounts of expert data.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Background

Preliminaries   R̄ will denote the extended real numbers R ∪ {∞}. Section 3 will work with finite state and action spaces S and A, but our algorithms and experiments later in the paper will run in high-dimensional continuous environments. Π is the set of all stationary stochastic policies that take actions in A given states in S; successor states are drawn from the dynamics model P(s′|s, a).
We work in the γ-discounted infinite horizon setting, and we will use an expectation with respect to a policy π ∈ Π to denote an expectation with respect to the trajectory it generates: E_π[c(s, a)] ≜ E[∑_{t=0}^∞ γ^t c(s_t, a_t)], where s_0 ∼ p_0, a_t ∼ π(·|s_t), and s_{t+1} ∼ P(·|s_t, a_t) for t ≥ 0. We will use Ê_τ to denote an empirical expectation with respect to trajectory samples τ, and we will always use π_E to refer to the expert policy.

Inverse reinforcement learning   Suppose we are given an expert policy π_E that we wish to rationalize with IRL. For the remainder of this paper, we will adopt and assume the existence of solutions of maximum causal entropy IRL [29, 30], which fits a cost function from a family of functions C with the optimization problem

    maximize_{c∈C} ( min_{π∈Π} −H(π) + E_π[c(s, a)] ) − E_{π_E}[c(s, a)]    (1)

where H(π) ≜ E_π[− log π(a|s)] is the γ-discounted causal entropy [3] of the policy π. In practice, π_E will only be provided as a set of trajectories sampled by executing π_E in the environment, so the expected cost of π_E in Eq. (1) is estimated using these samples.
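The empirical expectation Ê_τ over sampled trajectories can be computed directly from the definitions above. A minimal sketch (representing a trajectory as a list of (state, action) pairs is an assumption of this illustration, not a convention from the paper):

```python
import numpy as np

def discounted_cost(trajectory, cost_fn, gamma):
    """Discounted cumulative cost sum_t gamma^t * c(s_t, a_t) of one trajectory."""
    return sum(gamma**t * cost_fn(s, a) for t, (s, a) in enumerate(trajectory))

def estimate_expected_cost(trajectories, cost_fn, gamma):
    """Empirical estimate of E_pi[c(s, a)] from sampled trajectories."""
    return np.mean([discounted_cost(tau, cost_fn, gamma) for tau in trajectories])
```

For a constant cost of 1 and a length-T trajectory, the estimate is (1 − γ^T)/(1 − γ), approaching the infinite-horizon value 1/(1 − γ) as T grows.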
Maximum causal entropy IRL looks for a cost function c ∈ C that assigns low cost to the expert policy and high cost to other policies, thereby allowing the expert policy to be found via a certain reinforcement learning procedure:

    RL(c) = arg min_{π∈Π} −H(π) + E_π[c(s, a)]    (2)

which maps a cost function to high-entropy policies that minimize the expected cumulative cost.

3 Characterizing the induced optimal policy

To begin our search for an imitation learning algorithm that both bypasses an intermediate IRL step and is suitable for large environments, we will study policies found by reinforcement learning on costs learned by IRL on the largest possible set of cost functions C in Eq. (1): all functions R^{S×A} = {c : S × A → R}. Using expressive cost function classes, like Gaussian processes [14] and neural networks [6], is crucial to properly explain complex expert behavior without meticulously hand-crafted features. Here, we investigate the best IRL can do with respect to expressiveness by examining its capabilities with C = R^{S×A}.

Of course, with such a large C, IRL can easily overfit when provided a finite dataset. Therefore, we will incorporate a (closed, proper) convex cost function regularizer ψ : R^{S×A} → R̄ into our study. Note that convexity is not a particularly restrictive requirement: ψ must be convex as a function defined on all of R^{S×A}, not as a function defined on a small parameter space; indeed, the cost regularizers of Finn et al. [6], effective for a range of robotic manipulation tasks, satisfy this requirement.
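The RL primitive in Eq. (2) can be computed exactly in small tabular MDPs: the entropy-regularized Bellman backup replaces the hard minimum over actions with a soft minimum, and the optimal policy is a softmax over negated Q-values. A minimal sketch, assuming a known tabular dynamics model `P` (an assumption of this illustration):

```python
import numpy as np

def soft_value_iteration(c, P, gamma, iters=500):
    """RL(c): entropy-regularized (maximum causal entropy) RL in a tabular MDP.

    c: cost array of shape (S, A); P: transition array of shape (S, A, S).
    Returns the soft-optimal policy pi(a|s) minimizing E_pi[c] - H(pi).
    """
    S, A = c.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = c + gamma * P @ V                  # Q(s,a) = c(s,a) + gamma * E[V(s')]
        V = -np.log(np.exp(-Q).sum(axis=1))    # soft minimum over actions
    return np.exp(V[:, None] - Q)              # pi(a|s) = exp(V(s) - Q(s,a))
```

With γ = 0 this reduces to π(a|s) ∝ exp(−c(s, a)), the one-step softmin policy.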
Interestingly, ψ will play a central role in our discussion and will not serve as a nuisance in our analysis.

Let us define an IRL primitive procedure, which finds a cost function such that the expert performs better than all other policies, with the cost regularized by ψ:

    IRL_ψ(π_E) = arg max_{c∈R^{S×A}} −ψ(c) + ( min_{π∈Π} −H(π) + E_π[c(s, a)] ) − E_{π_E}[c(s, a)]    (3)

Let c̃ ∈ IRL_ψ(π_E). We are interested in a policy given by RL(c̃): this is the policy given by running reinforcement learning on the output of IRL. To characterize RL(c̃), let us first define for a policy π ∈ Π its occupancy measure ρ_π : S × A → R as ρ_π(s, a) = π(a|s) ∑_{t=0}^∞ γ^t P(s_t = s | π). The occupancy measure can be interpreted as the unnormalized distribution of state-action pairs that an agent encounters when navigating the environment with the policy π, and it allows us to write E_π[c(s, a)] = ∑_{s,a} ρ_π(s, a) c(s, a) for any cost function c. We will also need the concept of a convex conjugate: for a function f : R^{S×A} → R̄, its convex conjugate f* : R^{S×A} → R̄ is given by f*(x) = sup_{y∈R^{S×A}} xᵀy − f(y).

Now, we are prepared to characterize RL(c̃), the policy learned by RL on the cost recovered by IRL:

Proposition 3.1. RL ∘ IRL_ψ(π_E) = arg min_{π∈Π} −H(π) + ψ*(ρ_π − ρ_{π_E})    (4)

The proof of Proposition 3.1 can be found in Appendix A.1. It relies on the observation that the optimal cost function and policy form a saddle point of a certain function.
IRL finds one coordinate of this saddle point, and running RL on the output of IRL reveals the other coordinate.

Proposition 3.1 tells us that ψ-regularized inverse reinforcement learning implicitly seeks a policy whose occupancy measure is close to the expert's, as measured by ψ*. Enticingly, this suggests that various settings of ψ lead to various imitation learning algorithms that directly solve the optimization problem given by Proposition 3.1. We explore such algorithms in Sections 4 and 5, where we show that certain settings of ψ lead to both existing algorithms and a novel one.

The special case when ψ is a constant function is particularly illuminating, so we state and show it directly using concepts from convex optimization.

Proposition 3.2. Suppose ρ_{π_E} > 0. If ψ is a constant function, c̃ ∈ IRL_ψ(π_E), and π̃ ∈ RL(c̃), then ρ_π̃ = ρ_{π_E}.

In other words, if there were no cost regularization at all, the recovered policy will exactly match the expert's occupancy measure. (The condition ρ_{π_E} > 0, inherited from Ziebart et al. [30], simplifies our discussion and in fact guarantees the existence of c̃ ∈ IRL_ψ(π_E). Elsewhere in the paper, as mentioned in Section 2, we assume the IRL problem has a solution.) To show Proposition 3.2, we need the basic result that the set of valid occupancy measures D ≜ {ρ_π : π ∈ Π} can be written as a feasible set of affine constraints [19]: if p_0(s) is the distribution of starting states and P(s′|s, a) is the dynamics model, then

    D = { ρ : ρ ≥ 0 and ∑_a ρ(s, a) = p_0(s) + γ ∑_{s′,a} P(s|s′, a) ρ(s′, a) ∀ s ∈ S }.

Furthermore, there is a one-to-one correspondence between Π and D:

Lemma 3.1 (Theorem 2 of Syed et al. [27]). If ρ ∈ D, then ρ is the occupancy measure for π_ρ(a|s) ≜ ρ(s, a) / ∑_{a′} ρ(s, a′), and π_ρ is the only policy whose occupancy measure is ρ.

We are therefore justified in writing π_ρ to denote the unique policy for an occupancy measure ρ. We also need a lemma that lets us speak about causal entropies of occupancy measures:

Lemma 3.2. Let H̄(ρ) = −∑_{s,a} ρ(s, a) log(ρ(s, a) / ∑_{a′} ρ(s, a′)). Then H̄ is strictly concave, and for all π ∈ Π and ρ ∈ D, we have H(π) = H̄(ρ_π) and H̄(ρ) = H(π_ρ).

The proof of this lemma is in Appendix A.1. Lemma 3.1 and Lemma 3.2 together allow us to freely switch between policies and occupancy measures when considering functions involving causal entropy and expected costs, as in the following lemma:

Lemma 3.3. If L(π, c) = −H(π) + E_π[c(s, a)] and L̄(ρ, c) = −H̄(ρ) + ∑_{s,a} ρ(s, a) c(s, a), then, for all cost functions c, L(π, c) = L̄(ρ_π, c) for all policies π ∈ Π, and L̄(ρ, c) = L(π_ρ, c) for all occupancy measures ρ ∈ D.

Now, we are ready to verify Proposition 3.2.

Proof of Proposition 3.2.
Define L̄(ρ, c) = −H̄(ρ) + ∑_{s,a} c(s, a)(ρ(s, a) − ρ_E(s, a)). Given that ψ is a constant function, we have the following, due to Lemma 3.3:

    c̃ ∈ IRL_ψ(π_E) = arg max_{c∈R^{S×A}} min_{π∈Π} −H(π) + E_π[c(s, a)] − E_{π_E}[c(s, a)] + const.    (5)
      = arg max_{c∈R^{S×A}} min_{ρ∈D} −H̄(ρ) + ∑_{s,a} ρ(s, a) c(s, a) − ∑_{s,a} ρ_E(s, a) c(s, a) = arg max_{c∈R^{S×A}} min_{ρ∈D} L̄(ρ, c).    (6)

This is the dual of the optimization problem

    minimize_{ρ∈D} −H̄(ρ)  subject to  ρ(s, a) = ρ_E(s, a) ∀ s ∈ S, a ∈ A    (7)

with Lagrangian L̄, for which the costs c(s, a) serve as dual variables for the equality constraints. Thus, c̃ is a dual optimum for (7). In addition, strong duality holds for (7): D is compact and convex, −H̄ is convex, and, since ρ_E > 0, there exists a feasible point in the relative interior of the domain D. Moreover, Lemma 3.2 guarantees that −H̄ is in fact strictly convex, so the primal optimum can be uniquely recovered from the dual optimum [4, Section 5.5.5] via ρ̃ = arg min_{ρ∈D} L̄(ρ, c̃) = arg min_{ρ∈D} −H̄(ρ) + ∑_{s,a} c̃(s, a)ρ(s, a) = ρ_E, where the first equality indicates that ρ̃ is the unique minimizer of L̄(·, c̃), and the third follows from the constraints in the primal problem (7). But if π̃ ∈ RL(c̃), then Lemma 3.3 implies ρ_π̃ = ρ̃ = ρ_E.

Let us summarize our conclusions. First, IRL is a dual of an occupancy measure matching problem, and the recovered cost function is the dual optimum.
Classic IRL algorithms that solve reinforcement learning repeatedly in an inner loop, such as the algorithm of Ziebart et al. [29] that runs a variant of value iteration in an inner loop, can be interpreted as a form of dual ascent, in which one repeatedly solves the primal problem (reinforcement learning) with fixed dual values (costs). Dual ascent is effective if solving the unconstrained primal is efficient, but in the case of IRL, it amounts to reinforcement learning! Second, the induced optimal policy is the primal optimum. The induced optimal policy is obtained by running RL after IRL, which is exactly the act of recovering the primal optimum from the dual optimum; that is, optimizing the Lagrangian with the dual variables fixed at the dual optimum values. Strong duality implies that this induced optimal policy is indeed the primal optimum, and therefore matches occupancy measures with the expert. IRL is traditionally defined as the act of finding a cost function such that the expert policy is uniquely optimal, but we can alternatively view IRL as a procedure that tries to induce a policy that matches the expert's occupancy measure.

4 Practical occupancy measure matching

We saw in Proposition 3.2 that if ψ is constant, the resulting primal problem (7) simply matches occupancy measures with the expert at all states and actions. Such an algorithm is not practically useful. In reality, the expert trajectory distribution will be provided only as a finite set of samples, so in large environments, most of the expert's occupancy measure values will be small, and exact occupancy measure matching will force the learned policy to rarely visit these unseen state-action pairs simply due to lack of data.
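The occupancy measure machinery underlying this discussion (the Bellman-flow constraints defining D, and Lemma 3.1's one-to-one correspondence) can be checked numerically: compute ρ_π for a random tabular MDP from its definition via a truncated geometric series, verify the affine constraints, and recover π from ρ. A self-contained sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random tabular MDP and random stochastic policy.
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
p0 = np.ones(S) / S

# rho_pi(s,a) = pi(a|s) * sum_t gamma^t P(s_t = s | pi), via a truncated series.
p_t, d = p0.copy(), np.zeros(S)
for t in range(2000):
    d += gamma**t * p_t
    p_t = np.einsum('s,sa,sat->t', p_t, pi, P)   # push state distribution forward
rho = pi * d[:, None]

# Bellman-flow constraint: sum_a rho(s,a) = p0(s) + gamma * sum_{s',a} P(s|s',a) rho(s',a)
lhs = rho.sum(axis=1)
rhs = p0 + gamma * np.einsum('sat,sa->t', P, rho)
assert np.allclose(lhs, rhs, atol=1e-6)

# One-to-one correspondence (Lemma 3.1): pi is recovered from rho.
assert np.allclose(rho / rho.sum(axis=1, keepdims=True), pi)
```

The total mass of ρ is ∑_t γ^t = 1/(1 − γ), consistent with the "unnormalized distribution" interpretation.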
Furthermore, in the cases in which we would like to use function approximation to learn parameterized policies π_θ, the resulting optimization problem of finding an appropriate θ would have an intractably large number of constraints when the environment is large: as many constraints as points in S × A.

Keeping in mind that we wish to eventually develop an imitation learning algorithm suitable for large environments, we would like to relax Eq. (7) into the following form, motivated by Proposition 3.1:

    minimize_π d_ψ(ρ_π, ρ_E) − H(π)    (8)

by modifying the IRL regularizer ψ so that d_ψ(ρ_π, ρ_E) ≜ ψ*(ρ_π − ρ_E) smoothly penalizes violations in difference between the occupancy measures.

Entropy-regularized apprenticeship learning   It turns out that with certain settings of ψ, Eq. (8) takes on the form of regularized variants of existing apprenticeship learning algorithms, which indeed do scale to large environments with parameterized policies [10]. For a class of cost functions C ⊂ R^{S×A}, an apprenticeship learning algorithm finds a policy that performs better than the expert across C, by optimizing the objective

    minimize_π max_{c∈C} E_π[c(s, a)] − E_{π_E}[c(s, a)]    (9)

Classic apprenticeship learning algorithms restrict C to convex sets given by linear combinations of basis functions f_1, ..., f_d, which give rise to a feature vector f(s, a) = [f_1(s, a), ..., f_d(s, a)] for each state-action pair. Abbeel and Ng [1] and Syed et al.
[27] use, respectively,

    C_linear = { ∑_i w_i f_i : ‖w‖_2 ≤ 1 }  and  C_convex = { ∑_i w_i f_i : ∑_i w_i = 1, w_i ≥ 0 ∀ i }.    (10)

C_linear leads to feature expectation matching [1], which minimizes ℓ2 distance between expected feature vectors: max_{c∈C_linear} E_π[c(s, a)] − E_{π_E}[c(s, a)] = ‖E_π[f(s, a)] − E_{π_E}[f(s, a)]‖_2. Meanwhile, C_convex leads to MWAL [26] and LPAL [27], which minimize worst-case excess cost among the individual basis functions, as max_{c∈C_convex} E_π[c(s, a)] − E_{π_E}[c(s, a)] = max_{i∈{1,...,d}} E_π[f_i(s, a)] − E_{π_E}[f_i(s, a)].

We now show how Eq. (9) is a special case of Eq. (8) with a certain setting of ψ. With the indicator function δ_C : R^{S×A} → R̄, defined by δ_C(c) = 0 if c ∈ C and +∞ otherwise, we can write the apprenticeship learning objective (9) as

    max_{c∈C} E_π[c(s, a)] − E_{π_E}[c(s, a)] = max_{c∈R^{S×A}} −δ_C(c) + ∑_{s,a} (ρ_π(s, a) − ρ_{π_E}(s, a)) c(s, a) = δ_C*(ρ_π − ρ_{π_E})

Therefore, we see that entropy-regularized apprenticeship learning

    minimize_π −H(π) + max_{c∈C} E_π[c(s, a)] − E_{π_E}[c(s, a)]    (11)

is equivalent to performing RL following IRL with cost regularizer ψ = δ_C, which forces the implicit IRL procedure to recover a cost function lying in C. Note that we can scale the policy's entropy regularization strength in Eq.
(11) by scaling C by a constant α as {αc : c ∈ C}, recovering the original apprenticeship objective (9) by taking α → ∞.

Cons of apprenticeship learning   It is known that apprenticeship learning algorithms generally do not recover expert-like policies if C is too restrictive [27, Section 1], which is often the case for the linear subspaces used by feature expectation matching, MWAL, and LPAL, unless the basis functions f_1, ..., f_d are very carefully designed. Intuitively, unless the true expert cost function (assuming it exists) lies in C, there is no guarantee that if π performs better than π_E on all of C, then π equals π_E. With the aforementioned insight based on Proposition 3.1 that apprenticeship learning is equivalent to RL following IRL, we can understand exactly why apprenticeship learning may fail to imitate: it forces π_E to be encoded as an element of C. If C does not include a cost function that explains expert behavior well, then attempting to recover a policy from such an encoding will not succeed.

Pros of apprenticeship learning   While restrictive cost classes C may not lead to exact imitation, apprenticeship learning with such C can scale to large state and action spaces with policy function approximation. Ho et al. [10] rely on the following policy gradient formula for the apprenticeship objective (9) for a parameterized policy π_θ:

    ∇_θ max_{c∈C} E_{π_θ}[c(s, a)] − E_{π_E}[c(s, a)] = ∇_θ E_{π_θ}[c*(s, a)] = E_{π_θ}[∇_θ log π_θ(a|s) Q_{c*}(s, a)]    (12)
    where c* = arg max_{c∈C} E_{π_θ}[c(s, a)] − E_{π_E}[c(s, a)],  Q_{c*}(s̄, ā) = E_{π_θ}[c*(s̄, ā) | s_0 = s̄, a_0 = ā]

Observing that Eq. (12) is the policy gradient for a reinforcement learning objective with cost c*, Ho et al. propose an algorithm that alternates between two steps:

1. Sample trajectories of the current policy π_{θ_i} by simulating in the environment, and fit a cost function c*_i, as defined in Eq. (12). For the cost classes C_linear and C_convex (10), this cost fitting amounts to evaluating simple analytical expressions [10].

2. Form a gradient estimate with Eq. (12) with c*_i and the sampled trajectories, and take a trust region policy optimization (TRPO) [24] step to produce π_{θ_{i+1}}.

This algorithm relies crucially on the TRPO policy step, which is a natural gradient step constrained to ensure that π_{θ_{i+1}} does not stray too far from π_{θ_i}, as measured by KL divergence between the two policies averaged over the states in the sampled trajectories. This carefully constructed step scheme ensures that the algorithm does not diverge due to high noise in estimating the gradient (12). We refer the reader to Schulman et al. [24] for more details on TRPO.

With the TRPO step scheme, Ho et al. were able to train large neural network policies for apprenticeship learning with linear cost function classes (10) in environments with hundreds of observation dimensions. Their use of these linear cost function classes, however, limits their approach to settings in which expert behavior is well-described by such classes. We will draw upon their algorithm to develop an imitation learning method that both scales to large environments and imitates arbitrarily complex expert behavior.
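The "simple analytical expressions" for cost fitting in step 1 can be illustrated for C_linear: the maximizing weight vector is the normalized gap between the policy's and the expert's expected feature vectors. A sketch under the assumption that the feature expectations are given as arrays (this is an illustration, not the authors' implementation):

```python
import numpy as np

def fit_linear_cost(feat_exp_pi, feat_exp_expert):
    """c*(s,a) = w . f(s,a), with w maximizing w . (E_pi[f] - E_piE[f])
    subject to ||w||_2 <= 1: the normalized feature-expectation gap."""
    gap = np.asarray(feat_exp_pi) - np.asarray(feat_exp_expert)
    norm = np.linalg.norm(gap)
    return gap / norm if norm > 0 else gap
```

The attained objective value w · (E_π[f] − E_{π_E}[f]) is then ‖E_π[f] − E_{π_E}[f]‖_2, exactly the feature expectation matching distance quoted for C_linear above.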
To do so, we first turn to proposing a new regularizer ψ that wields more expressive power than the regularizers corresponding to C_linear and C_convex (10).

5 Generative adversarial imitation learning

As discussed in Section 4, the constant regularizer leads to an imitation learning algorithm that exactly matches occupancy measures, but is intractable in large environments. The indicator regularizers for the linear cost function classes (10), on the other hand, lead to algorithms incapable of exactly matching occupancy measures without careful tuning, but are tractable in large environments. We propose the following new cost regularizer that combines the best of both worlds, as we will show in the coming sections:

    ψ_GA(c) ≜ { E_{π_E}[g(c(s, a))] if c < 0;  +∞ otherwise }   where   g(x) = { −x − log(1 − e^x) if x < 0;  +∞ otherwise }    (13)

This regularizer places low penalty on cost functions c that assign an amount of negative cost to expert state-action pairs; if c, however, assigns large costs (close to zero, which is the upper bound for costs feasible for ψ_GA) to the expert, then ψ_GA will heavily penalize c. An interesting property of ψ_GA is that it is an average over expert data, and therefore can adjust to arbitrary expert datasets. The indicator regularizers δ_C, used by the linear apprenticeship learning algorithms described in Section 4, are always fixed, and cannot adapt to data as ψ_GA can.
Perhaps the most important difference between ψ_GA and δ_C, however, is that δ_C forces costs to lie in a small subspace spanned by finitely many basis functions, whereas ψ_GA allows for any cost function, as long as it is negative everywhere. Our choice of ψ_GA is motivated by the following fact, shown in the appendix (Corollary A.1.1):

    ψ_GA*(ρ_π − ρ_{π_E}) = sup_{D∈(0,1)^{S×A}} E_π[log(D(s, a))] + E_{π_E}[log(1 − D(s, a))]    (14)

where the supremum ranges over discriminative classifiers D : S × A → (0, 1). Equation (14) is proportional to the optimal negative log loss of the binary classification problem of distinguishing between state-action pairs of π and π_E. It turns out that this optimal loss is, up to a constant shift and scaling, the Jensen-Shannon divergence D_JS(ρ̄_π, ρ̄_{π_E}) ≜ D_KL(ρ̄_π ‖ (ρ̄_π + ρ̄_E)/2) + D_KL(ρ̄_E ‖ (ρ̄_π + ρ̄_E)/2), which is a squared metric between the normalized occupancy distributions ρ̄_π = (1 − γ)ρ_π and ρ̄_{π_E} = (1 − γ)ρ_{π_E} [8, 17].
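This connection to the Jensen-Shannon divergence can be verified numerically for discrete distributions standing in for the normalized occupancy measures: the supremum in Eq. (14) is attained pointwise at D* = ρ̄_π / (ρ̄_π + ρ̄_{π_E}), and the optimal value equals D_JS minus 2 log 2. A sketch under these assumptions:

```python
import numpy as np

def optimal_discriminator_value(p, q):
    """sup_D E_p[log D] + E_q[log(1 - D)] for discrete p, q, attained at D* = p/(p+q)."""
    d_star = p / (p + q)
    return np.sum(p * np.log(d_star) + q * np.log(1 - d_star)), d_star

p = np.array([0.7, 0.2, 0.1])   # stand-in for the policy's normalized occupancy
q = np.array([0.2, 0.3, 0.5])   # stand-in for the expert's normalized occupancy
val, d_star = optimal_discriminator_value(p, q)

# The same value via the Jensen-Shannon form, using the paper's convention
# D_JS(p, q) = KL(p || (p+q)/2) + KL(q || (p+q)/2): val = D_JS(p, q) - 2 log 2.
m = (p + q) / 2
d_js = np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m))
assert np.isclose(val, d_js - 2 * np.log(2))
```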
Treating the causal entropy H as a policy regularizer controlled by λ ≥ 0 and dropping the 1 − γ occupancy measure normalization for clarity, we obtain a new imitation learning algorithm:

    minimize_π ψ_GA*(ρ_π − ρ_{π_E}) − λH(π) = D_JS(ρ_π, ρ_{π_E}) − λH(π),    (15)

which finds a policy whose occupancy measure minimizes Jensen-Shannon divergence to the expert's. Equation (15) minimizes a true metric between occupancy measures, so, unlike linear apprenticeship learning algorithms, it can imitate expert policies exactly.

Algorithm   Equation (15) draws a connection between imitation learning and generative adversarial networks [8], which train a generative model G by having it confuse a discriminative classifier D. The job of D is to distinguish between the distribution of data generated by G and the true data distribution. When D cannot distinguish data generated by G from the true data, then G has successfully matched the true data. In our setting, the learner's occupancy measure ρ_π is analogous to the data distribution generated by G, and the expert's occupancy measure ρ_{π_E} is analogous to the true data distribution.

We now present a practical imitation learning algorithm, called generative adversarial imitation learning or GAIL (Algorithm 1), designed to work in large environments. GAIL solves Eq. (15) by finding a saddle point (π, D) of the expression

    E_π[log(D(s, a))] + E_{π_E}[log(1 − D(s, a))] − λH(π)    (16)

with both π and D represented using function approximators: GAIL fits a parameterized policy π_θ, with weights θ, and a discriminator network D_w : S × A → (0, 1), with weights w. GAIL alternates between an Adam [11] gradient step on w to increase Eq.
(16) with respect to D, and a TRPO step on θ to decrease Eq. (16) with respect to π (we derive an estimator for the causal entropy gradient ∇_θ H(π_θ) in Appendix A.2). The TRPO step serves the same purpose as it does with the apprenticeship learning algorithm of Ho et al. [10]: it prevents the policy from changing too much due to noise in the policy gradient. The discriminator network can be interpreted as a local cost function providing learning signal to the policy: specifically, taking a policy step that decreases expected cost with respect to the cost function c(s, a) = log D(s, a) will move toward expert-like regions of state-action space, as classified by the discriminator.

Algorithm 1 Generative adversarial imitation learning
1: Input: Expert trajectories τ_E ∼ π_E, initial policy and discriminator parameters θ_0, w_0
2: for i = 0, 1, 2, ... do
3:   Sample trajectories τ_i ∼ π_{θ_i}
4:   Update the discriminator parameters from w_i to w_{i+1} with the gradient
         Ê_{τ_i}[∇_w log(D_w(s, a))] + Ê_{τ_E}[∇_w log(1 − D_w(s, a))]    (17)
5:   Take a policy step from θ_i to θ_{i+1}, using the TRPO rule with cost function log(D_{w_{i+1}}(s, a)). Specifically, take a KL-constrained natural gradient step with
         Ê_{τ_i}[∇_θ log π_θ(a|s) Q(s, a)] − λ∇_θ H(π_θ),  where  Q(s̄, ā) = Ê_{τ_i}[log(D_{w_{i+1}}(s, a)) | s_0 = s̄, a_0 = ā]    (18)
6: end for

6 Experiments

We evaluated GAIL against baselines on 9 physics-based control tasks, ranging from low-dimensional control tasks from the classic RL literature (the cartpole [2], acrobot [7], and mountain car [15]) to difficult high-dimensional tasks such as 3D humanoid locomotion, solved only recently by model-free reinforcement learning [25, 24].
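The alternation in Algorithm 1 can be exercised end-to-end in a toy problem. The sketch below replaces TRPO with an exact softmax policy-gradient step and Adam with plain gradient ascent, and uses a single-state environment with tabular parameters; these are simplifications for illustration only, not the configuration used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A = 3                                      # toy single-state environment, 3 actions
expert = np.zeros(200, dtype=int)          # expert demonstrations: always action 0
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

theta = np.zeros(A)                        # policy logits: pi(a) = softmax(theta)
w = np.zeros(A)                            # discriminator logits: D(a) = sigmoid(w[a])

for _ in range(300):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    acts = rng.choice(A, size=200, p=pi)   # "trajectories" sampled from the policy

    # Discriminator step: ascend E_pi[log D] + E_piE[log(1 - D)]  (cf. Eq. (17)).
    pol_freq = np.bincount(acts, minlength=A) / len(acts)
    exp_freq = np.bincount(expert, minlength=A) / len(expert)
    for _ in range(5):
        w += pol_freq * (1 - sigmoid(w)) - exp_freq * sigmoid(w)

    # Policy step: descend E_pi[c] with surrogate cost c(a) = log D(a).
    # (Exact softmax gradient here, standing in for the paper's TRPO step.)
    cost = np.log(sigmoid(w))
    theta -= pi * (cost - pi @ cost)

pi = np.exp(theta - theta.max()); pi /= pi.sum()
```

After training, the policy concentrates on the expert's action: the discriminator pushes D low on expert-like actions, so the surrogate cost log D steers the policy toward them, mirroring the dynamics of the full algorithm.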
All environments, other than the classic control tasks, were simulated with MuJoCo [28]. See Appendix B for a complete description of all the tasks.

Each task comes with a true cost function, defined in the OpenAI Gym [5]. We first generated expert behavior for these tasks by running TRPO [24] on these true cost functions to create expert policies. Then, to evaluate imitation performance with respect to sample complexity of expert data, we sampled datasets of varying trajectory counts from the expert policies. The trajectories constituting each dataset each consisted of about 50 state-action pairs. We tested GAIL against three baselines:

1. Behavioral cloning: a given dataset of state-action pairs is split into 70% training data and 30% validation data. The policy is trained with supervised learning, using Adam [11] with minibatches of 128 examples, until validation error stops decreasing.

2. Feature expectation matching (FEM): the algorithm of Ho et al. [10] using the cost function class C_linear (10) of Abbeel and Ng [1].

3. Game-theoretic apprenticeship learning (GTAL): the algorithm of Ho et al. [10] using the cost function class C_convex (10) of Syed and Schapire [26].

We used all algorithms to train policies of the same neural network architecture for all tasks: two hidden layers of 100 units each, with tanh nonlinearities in between. The discriminator networks for GAIL also used the same architecture. All networks were always initialized randomly at the start of each trial. For each task, we gave FEM, GTAL, and GAIL exactly the same amount of environment interaction for training. We ran all algorithms 5-7 times over different random seeds in all environments except Humanoid, due to time restrictions.

Figure 1 depicts the results, and Appendix B provides exact performance numbers and details of our experiment pipeline, including expert data sampling and algorithm hyperparameters.
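The behavioral cloning baseline described above can be sketched as follows. A linear softmax policy and plain gradient descent stand in for the paper's two-hidden-layer network and Adam; the 70%/30% split and validation-based early stopping follow the description in the list (this is an illustrative sketch, not the experiment code):

```python
import numpy as np

def behavioral_cloning(states, actions, n_actions, lr=0.5, max_epochs=200, seed=0):
    """Fit pi(a|s) by supervised learning on expert state-action pairs,
    with a 70%/30% train/validation split and early stopping when
    validation error stops decreasing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(states))
    split = int(0.7 * len(states))
    tr, va = idx[:split], idx[split:]
    W = np.zeros((states.shape[1], n_actions))
    best_err, best_W = np.inf, W.copy()
    for _ in range(max_epochs):
        logits = states[tr] @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(tr)), actions[tr]] -= 1          # softmax cross-entropy gradient
        W -= lr * states[tr].T @ p / len(tr)
        err = np.mean((states[va] @ W).argmax(axis=1) != actions[va])
        if err < best_err:
            best_err, best_W = err, W.copy()
        elif err > best_err:
            break                                        # validation error went back up
    return best_W
```

On separable data (e.g., states that determine the expert's action), this recovers the expert's action choices exactly.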
We found that on the classic control tasks (cartpole, acrobot, and mountain car), behavioral cloning generally suffered in expert data efficiency compared to FEM and GTAL, which for the most part were able to produce policies with near-expert performance across a wide range of dataset sizes, albeit with large variance over different random initializations of the policy. On these tasks, GAIL consistently produced policies performing better than behavioral cloning, FEM, and GTAL. However, behavioral cloning performed excellently on the Reacher task, on which it was more sample efficient than GAIL. We were able to slightly improve GAIL\u2019s performance on Reacher using causal entropy regularization\u2014in the 4-trajectory setting, the improvement from \u03bb = 0 to \u03bb = 10\u207b\u00b3 was statistically significant over training reruns, according to a one-sided Wilcoxon rank-sum test with p = .05. We used no causal entropy regularization for all other tasks.\n\nFigure 1: (a) Performance of learned policies. The y-axis is negative cost, scaled so that the expert achieves 1 and a random policy achieves 0. (b) Causal entropy regularization \u03bb on Reacher. Except for Humanoid, shading indicates standard deviation over 5\u20137 reruns.\n\nOn the other MuJoCo environments, GAIL almost always achieved at least 70% of expert performance for all dataset sizes we tested and reached it exactly with the larger datasets, with very little variance among random seeds. The baseline algorithms generally could not reach expert performance even with the largest datasets. FEM and GTAL performed poorly for Ant, producing policies consistently worse than a policy that chooses actions uniformly at random.
Behavioral cloning was able to reach satisfactory performance with enough data on HalfCheetah, Hopper, Walker, and Ant, but was unable to achieve more than 60% of expert performance for Humanoid, on which GAIL achieved exact expert performance for all tested dataset sizes.\n\n7 Discussion and outlook\n\nAs we demonstrated, GAIL is generally quite sample efficient in terms of expert data. However, it is not particularly sample efficient in terms of environment interaction during training. The number of such samples required to estimate the imitation objective gradient (18) was comparable to the number needed for TRPO to train the expert policies from reinforcement signals. We believe that we could significantly improve learning speed for GAIL by initializing policy parameters with behavioral cloning, which requires no environment interaction at all.\nFundamentally, our method is model free, so it will generally need more environment interaction than model-based methods. Guided cost learning [6], for instance, builds upon guided policy search [12] and inherits its sample efficiency, but also inherits its requirement that the model be well approximated by iteratively fitted time-varying linear dynamics. Interestingly, both GAIL and guided cost learning alternate between policy optimization steps and cost fitting (which we called discriminator fitting), even though the two algorithms are derived completely differently.\nOur approach builds upon a vast line of work on IRL [29, 1, 27, 26], and hence, just like IRL, our approach does not interact with the expert during training.
Our method explores randomly to determine which actions bring a policy\u2019s occupancy measure closer to the expert\u2019s, whereas methods that do interact with the expert, like DAgger [22], can simply ask the expert for such actions. Ultimately, we believe that a method that combines well-chosen environment models with expert interaction will win in terms of sample complexity of both expert data and environment interaction.\n\nAcknowledgments\n\nWe thank Jayesh K. Gupta, John Schulman, and the anonymous reviewers for assistance, advice, and critique. This work was supported by the SAIL-Toyota Center for AI Research and by an NSF Graduate Research Fellowship (grant no. DGE-114747).\n\nReferences\n\n[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.\n\n[2] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834\u2013846, 1983.\n\n[3] M. Bloem and N. Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. In 53rd IEEE Conference on Decision and Control (CDC), pages 4911\u20134916. IEEE, 2014.\n\n[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[5] G. Brockman, V. Cheung, L.
Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.\n\n[6] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.\n\n[7] A. Geramifard, C. Dann, R. H. Klein, W. Dabney, and J. P. How. RLPy: A value-function-based reinforcement learning framework for education and research. JMLR, 2015.\n\n[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672\u20132680, 2014.\n\n[9] J.-B. Hiriart-Urruty and C. Lemar\u00e9chal. Convex Analysis and Minimization Algorithms, volume 305. Springer, 1996.\n\n[10] J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.\n\n[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[12] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071\u20131079, 2014.\n\n[13] S. Levine and V. Koltun. Continuous inverse optimal control with locally optimal examples. In Proceedings of the 29th International Conference on Machine Learning, pages 41\u201348, 2012.\n\n[14] S. Levine, Z. Popovic, and V. Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 19\u201327, 2011.\n\n[15] A. W. Moore and T. Hall. Efficient memory-based learning for robot control. 1990.\n\n[16] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In ICML, 2000.\n\n[17] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On surrogate loss functions and f-divergences. The Annals of Statistics, pages 876\u2013904, 2009.\n\n[18] D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88\u201397, 1991.\n\n[19] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.\n\n[20] N. D. Ratliff, D. Silver, and J. A. Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25\u201353, 2009.\n\n[21] S. Ross and D. Bagnell. Efficient reductions for imitation learning. In AISTATS, pages 661\u2013668, 2010.\n\n[22] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, pages 627\u2013635, 2011.\n\n[23] S. Russell. Learning agents for uncertain environments. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 101\u2013103. ACM, 1998.\n\n[24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889\u20131897, 2015.\n\n[25] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.\n\n[26] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pages 1449\u20131456, 2007.\n\n[27] U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032\u20131039, 2008.\n\n[28] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026\u20135033. IEEE, 2012.\n\n[29] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.\n\n[30] B. D. Ziebart, J. A. Bagnell, and A. K. Dey. Modeling interaction via the principle of maximum causal entropy. In ICML, pages 1255\u20131262, 2010.\n", "award": [], "sourceid": 2278, "authors": [{"given_name": "Jonathan", "family_name": "Ho", "institution": "Stanford"}, {"given_name": "Stefano", "family_name": "Ermon", "institution": "Stanford"}]}