{"title": "InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations", "book": "Advances in Neural Information Processing Systems", "page_first": 3812, "page_last": 3822, "abstract": "The goal of imitation learning is to mimic expert behavior without access to an explicit reward signal. Expert demonstrations provided by humans, however, often show significant variability due to latent factors that are typically not explicitly modeled. In this paper, we propose a new algorithm that can infer the latent structure of expert demonstrations in an unsupervised way. Our method, built on top of Generative Adversarial Imitation Learning, can not only imitate complex behaviors, but also learn interpretable and meaningful representations of complex behavioral data, including visual demonstrations. In the driving domain, we show that a model learned from human demonstrations is able to both accurately reproduce a variety of behaviors and accurately anticipate human actions using raw visual inputs. Compared with various baselines, our method can better capture the latent structure underlying expert demonstrations, often recovering semantically meaningful factors of variation in the data.", "full_text": "InfoGAIL: Interpretable Imitation Learning from\n\nVisual Demonstrations\n\nYunzhu Li\n\nMIT\n\nliyunzhu@mit.edu\n\nJiaming Song\n\nStanford University\n\ntsong@cs.stanford.edu\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nAbstract\n\nThe goal of imitation learning is to mimic expert behavior without access to an\nexplicit reward signal. Expert demonstrations provided by humans, however, often\nshow signi\ufb01cant variability due to latent factors that are typically not explicitly\nmodeled. In this paper, we propose a new algorithm that can infer the latent\nstructure of expert demonstrations in an unsupervised way. 
Our method, built on top of Generative Adversarial Imitation Learning, can not only imitate complex behaviors, but also learn interpretable and meaningful representations of complex behavioral data, including visual demonstrations. In the driving domain, we show that a model learned from human demonstrations is able to both accurately reproduce a variety of behaviors and accurately anticipate human actions using raw visual inputs. Compared with various baselines, our method can better capture the latent structure underlying expert demonstrations, often recovering semantically meaningful factors of variation in the data.\n\n1 Introduction\n\nA key limitation of reinforcement learning (RL) is that it involves the optimization of a predefined reward function or reinforcement signal [1–6]. Explicitly defining a reward function is straightforward in some cases, e.g., in games such as Go or chess. However, designing an appropriate reward function can be difficult in more complex and less well-specified environments, e.g., for autonomous driving, where there is a need to balance safety, comfort, and efficiency.\nImitation learning methods have the potential to close this gap by learning how to perform tasks directly from expert demonstrations, and have succeeded in a wide range of problems [7–11]. Among them, Generative Adversarial Imitation Learning (GAIL, [12]) is a model-free imitation learning method that is highly effective and scales to relatively high-dimensional environments. The training process of GAIL can be thought of as building a generative model, which is a stochastic policy that, when coupled with a fixed simulation environment, produces behaviors similar to the expert demonstrations. 
Similarity is achieved by jointly training a discriminator to distinguish expert trajectories from ones produced by the learned policy, as in GANs [13].\nIn imitation learning, example demonstrations are typically provided by human experts. These demonstrations can show significant variability. For example, they might be collected from multiple experts, each employing a different policy. External latent factors of variation that are not explicitly captured by the simulation environment can also significantly affect the observed behavior. For example, expert demonstrations might be collected from users with different skills and habits. The goal of this paper is to develop an imitation learning framework that is able to automatically discover and disentangle the latent factors of variation underlying expert demonstrations. Analogous to the goal of uncovering style, shape, and color in generative modeling of images [14], we aim to automatically learn similar interpretable concepts from human demonstrations in an unsupervised manner.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWe propose a new method for learning a latent variable generative model that can produce trajectories in a dynamic environment, i.e., sequences of state-action pairs in a Markov decision process. Not only can the model accurately reproduce expert behavior, but it also empirically learns a latent space of the observations that is semantically meaningful. Our approach is an extension of GAIL, where the objective is augmented with a mutual information term between the latent variables and the observed state-action pairs. We first illustrate the core concepts in a synthetic 2D example and then demonstrate an application in autonomous driving, where we learn to imitate complex driving behaviors while recovering semantically meaningful structure, without any supervision beyond the expert trajectories. 
Remarkably, our method works directly on raw visual inputs, using raw pixels as the only source of perceptual information.¹ The code for reproducing the experiments is available at https://github.com/ermongroup/InfoGAIL.\nIn particular, the contributions of this paper are threefold:\n\n1. We extend GAIL with a component that approximately maximizes the mutual information between the latent space and trajectories, similar to InfoGAN [14], resulting in a policy where low-level actions can be controlled through more abstract, high-level latent variables.\n\n2. We extend GAIL to use raw pixels as input and produce human-like behaviors in complex high-dimensional dynamic environments.\n\n3. We demonstrate an application to autonomous highway driving using the TORCS driving simulator [15]. We first demonstrate that the learned policy is able to correctly navigate the track without collisions. Then, we show that our model learns to reproduce different kinds of human-like driving behaviors by exploring the latent variable space.\n\n2 Background\n\n2.1 Preliminaries\n\nWe use the tuple (S, A, P, r, ρ_0, γ) to define an infinite-horizon, discounted Markov decision process (MDP), where S represents the state space, A represents the action space, P : S × A × S → ℝ denotes the transition probability distribution, r : S → ℝ denotes the reward function, ρ_0 : S → ℝ is the distribution of the initial state s_0, and γ ∈ (0, 1) is the discount factor. Let π denote a stochastic policy π : S × A → [0, 1], and π_E denote the expert policy, for which we only have access to demonstrations. The expert demonstrations τ_E are a set of trajectories generated using policy π_E, each of which consists of a sequence of state-action pairs. 
We use an expectation with respect to a policy π to denote an expectation with respect to the trajectories it generates: E_π[f(s, a)] ≜ E[Σ_{t=0}^∞ γ^t f(s_t, a_t)], where s_0 ∼ ρ_0, a_t ∼ π(a_t|s_t), s_{t+1} ∼ P(s_{t+1}|a_t, s_t).\n\n2.2 Imitation learning\n\nThe goal of imitation learning is to learn how to perform a task directly from expert demonstrations, without any access to the reinforcement signal r. Typically, there are two approaches to imitation learning: 1) behavior cloning (BC), which learns a policy through supervised learning over the state-action pairs from the expert trajectories [16]; and 2) apprenticeship learning (AL), which assumes the expert policy is optimal under some unknown reward and learns a policy by recovering the reward and solving the corresponding planning problem. BC tends to have poor generalization properties due to compounding errors and covariate shift [17, 18]. AL, on the other hand, has the advantage of learning a reward function that can be used to score trajectories [19–21], but is typically expensive to run because it requires solving a reinforcement learning (RL) problem inside a learning loop.\n\n2.3 Generative Adversarial Imitation Learning\n\nRecent work on AL has adopted a different approach by learning a policy without directly estimating the corresponding reward function. In particular, Generative Adversarial Imitation Learning (GAIL, [12]) is a recent AL method inspired by Generative Adversarial Networks (GAN, [13]). 
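As a concrete illustration of the discounted-expectation notation defined in the preliminaries, the following minimal sketch estimates E_π[f(s, a)] = E[Σ_t γ^t f(s_t, a_t)] by rolling out a policy. The chain MDP, the stand-in policy, and the constant f below are toy assumptions for illustration only, not part of the paper's setup:

```python
GAMMA = 0.9  # discount factor gamma from the MDP tuple

def rollout_return(policy, env_step, s0, f, horizon=1000):
    """Estimate sum_t gamma^t f(s_t, a_t) along one trajectory."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                 # a_t ~ pi(a_t | s_t)
        total += discount * f(s, a)
        s = env_step(s, a)            # s_{t+1} ~ P(. | a_t, s_t)
        discount *= GAMMA
    return total

# Toy deterministic chain MDP: the state is an integer, the action shifts it.
policy = lambda s: 1        # stand-in policy, always move right
env_step = lambda s, a: s + a
f = lambda s, a: 1.0        # constant reward-like function

# With f == 1, the discounted sum converges to 1 / (1 - gamma) = 10.
estimate = rollout_return(policy, env_step, 0, f)
```

With a stochastic policy and transition kernel, one would average this quantity over many sampled trajectories to obtain the Monte-Carlo estimate of the expectation.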
In the GAIL framework, the agent imitates the behavior of an expert policy π_E by matching the generated state-action distribution with the expert's distribution, where the optimum is achieved when the distance between these two distributions is minimized, as measured by the Jensen-Shannon divergence. The formal GAIL objective is\n\nmin_π max_{D ∈ (0,1)^{S×A}} E_π[log D(s, a)] + E_{π_E}[log(1 − D(s, a))] − λH(π)    (1)\n\nwhere π is the policy that we wish to imitate π_E with, D is a discriminative classifier that tries to distinguish state-action pairs from the trajectories generated by π and π_E, and H(π) ≜ E_π[−log π(a|s)] is the γ-discounted causal entropy of the policy π [22]. Instead of directly learning a reward function, GAIL relies on the discriminator to guide π into imitating the expert policy.\nGAIL is model-free: it requires interaction with the environment to generate rollouts, but it does not need to construct a model for the environment. Unlike GANs, GAIL considers the environment/simulator as a black box, and thus the objective is not differentiable end-to-end. Hence, optimization of the GAIL objective requires RL techniques based on Monte-Carlo estimation of policy gradients. Optimization over the GAIL objective is performed by alternating between a gradient step to increase (1) with respect to the discriminator parameters, and a Trust Region Policy Optimization (TRPO, [2]) step to decrease (1) with respect to π.\n\n¹ A video showing the experimental results is available at https://youtu.be/YtNPBAW6h5k.\n\n3 Interpretable Imitation Learning through Visual Inputs\n\nDemonstrations are typically collected from human experts. The resulting trajectories can show significant variability among different individuals due to internal latent factors of variation, such as levels of expertise and preferences for different strategies. 
Even the same individual might make different decisions when encountering the same situation, potentially resulting in demonstrations generated from multiple near-optimal but distinct policies. In this section, we propose an approach that can 1) discover and disentangle salient latent factors of variation underlying expert demonstrations without supervision, 2) learn policies that produce trajectories corresponding to these latent factors, and 3) use visual inputs as the only external perceptual information.\nFormally, we assume that the expert policy is a mixture of experts π_E = {π_E^0, π_E^1, . . .}, and we define the generative process of the expert trajectory τ_E as: s_0 ∼ ρ_0, c ∼ p(c), π ∼ p(π|c), a_t ∼ π(a_t|s_t), s_{t+1} ∼ P(s_{t+1}|a_t, s_t), where c is a discrete latent variable that selects a specific policy π from the mixture of expert policies through p(π|c) (which is unknown and needs to be learned), and p(c) is the prior distribution of c (which is assumed to be known before training). Similar to the GAIL setting, we consider the apprenticeship learning problem as the dual of an occupancy measure matching problem, and treat the trajectory τ_E as a set of state-action pairs. Instead of learning a policy based solely on the current state, we extend it to include an explicit dependence on the latent variable c. The objective is to recover a policy π(a|s, c) as an approximation of π_E; when c is sampled from the prior p(c), the trajectories τ generated by the conditional policy π(a|s, c) should be similar to the expert trajectories τ_E, as measured by a discriminative classifier.\n\n3.1 Interpretable Imitation Learning\n\nLearning from demonstrations generated by a mixture of experts is challenging, as we have no access to the policies employed by the individual experts. 
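The generative process above can be made concrete with a toy instantiation. Everything in this sketch is a stand-in for illustration: p(π|c) is taken to be deterministic, the "experts" are two hand-written policies on an integer state, and ρ_0 is a point mass:

```python
import random

random.seed(0)

# c ~ p(c) selects an expert; here p(pi | c) is deterministic for simplicity.
EXPERT_POLICIES = {
    0: lambda s: +1,   # toy expert 0: always move right
    1: lambda s: -1,   # toy expert 1: always move left
}

def sample_expert_trajectory(T=5):
    c = random.choice([0, 1])      # c ~ p(c), uniform prior
    policy = EXPERT_POLICIES[c]    # pi ~ p(pi | c)
    s = 0                          # s_0 ~ rho_0 (a point mass here)
    pairs = []
    for _ in range(T):
        a = policy(s)              # a_t ~ pi(a_t | s_t)
        pairs.append((s, a))
        s = s + a                  # s_{t+1} ~ P(. | a_t, s_t)
    return c, pairs

c, tau = sample_expert_trajectory()
```

The key point mirrored from the text: the learner observes only the state-action pairs in `tau`, never the mode label `c`, which is what makes the problem an unsupervised one.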
We have to proceed in an unsupervised way, similar to clustering. The original Generative Adversarial Imitation Learning method would fail here, as it assumes all the demonstrations come from a single expert, and there is no incentive to separate and disentangle variations observed in the data. A method that can automatically disentangle the demonstrations in a meaningful way is thus needed.\nWe address this problem by introducing a latent variable c into our policy function, π(a|s, c). Without further constraints over c, applying GAIL directly to this π(a|s, c) could simply ignore c and fail to separate different types of behaviors present in the expert trajectories.² To incentivize the model to use c as much as possible, we utilize an information-theoretic regularization enforcing that there should be high mutual information between c and the state-action pairs in the generated trajectory. This concept was introduced by InfoGAN [14], where latent codes are utilized to discover the salient semantic features of the data distribution and guide the generating process. In particular, the regularization seeks to maximize the mutual information between latent codes and trajectories, denoted as I(c; τ), which is hard to maximize directly as it requires access to the posterior P(c|τ). Hence we introduce a variational lower bound, L_I(π, Q), of the mutual information I(c; τ)³:\n\nL_I(π, Q) = E_{c∼p(c), a∼π(·|s,c)}[log Q(c|τ)] + H(c) ≤ I(c; τ)    (2)\n\nwhere Q(c|τ) is an approximation of the true posterior P(c|τ).\n\n² For a fair comparison, we consider this form as our GAIL baseline in the experiments below. 
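The lower-bound property of L_I in (2) can be checked numerically on a tiny discrete model. In this sketch, the joint distribution (a binary latent code and a binary "trajectory" that flips the code with probability 0.1) and both posteriors are illustrative assumptions; the point is only that plugging in the exact posterior makes the bound tight, while any weaker Q falls below I(c; τ):

```python
from math import log

# Toy joint model: c ~ uniform{0,1}; the "trajectory" tau flips c w.p. 0.1.
p_c = {0: 0.5, 1: 0.5}
p_tau_given_c = lambda tau, c: 0.9 if tau == c else 0.1

def mutual_information():
    # I(c; tau) = sum_{c,tau} p(c) p(tau|c) log [ p(tau|c) / p(tau) ]
    p_tau = {t: sum(p_c[c] * p_tau_given_c(t, c) for c in p_c) for t in (0, 1)}
    return sum(
        p_c[c] * p_tau_given_c(t, c) * log(p_tau_given_c(t, c) / p_tau[t])
        for c in p_c for t in (0, 1)
    )

def variational_bound(q):
    # L_I(Q) = E_{c,tau}[log Q(c|tau)] + H(c)
    h_c = -sum(p * log(p) for p in p_c.values())
    e_log_q = sum(
        p_c[c] * p_tau_given_c(t, c) * log(q(c, t)) for c in p_c for t in (0, 1)
    )
    return e_log_q + h_c

exact_posterior = lambda c, t: 0.9 if c == t else 0.1  # P(c|tau), by symmetry
blurred_q = lambda c, t: 0.7 if c == t else 0.3        # a weaker approximation

mi = mutual_information()
tight = variational_bound(exact_posterior)  # equals I(c; tau)
loose = variational_bound(blurred_q)        # strictly below I(c; tau)
```

This mirrors why maximizing L_I over Q drives Q toward the true posterior P(c|τ) while simultaneously encouraging the policy to keep c informative about the trajectory.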
The objective under this regularization, which we call Information Maximizing Generative Adversarial Imitation Learning (InfoGAIL), then becomes:\n\nmin_{π,Q} max_D E_π[log D(s, a)] + E_{π_E}[log(1 − D(s, a))] − λ1 L_I(π, Q) − λ2 H(π)    (3)\n\nwhere λ1 > 0 is the hyperparameter for the information maximization regularization term, and λ2 > 0 is the hyperparameter for the causal entropy term. By introducing the latent code, InfoGAIL is able to identify the salient factors in the expert trajectories through mutual information maximization, and imitate the corresponding expert policy through generative adversarial training. This allows us to disentangle trajectories that may arise from a mixture of experts, such as different individuals performing the same task.\nTo optimize the objective, we use a simplified posterior approximation Q(c|s, a), since directly working with entire trajectories τ would be too expensive, especially when the dimension of the observations is very high (such as images). We then parameterize the policy π, discriminator D, and posterior approximation Q with weights θ, ω, and ψ respectively. We optimize L_I(π_θ, Q_ψ) with stochastic gradient methods, π_θ using TRPO [2], and Q_ψ using the Adam optimizer [23]. An outline of the optimization procedure is shown in Algorithm 1.\n\nAlgorithm 1 InfoGAIL\n\nInput: Initial parameters of policy, discriminator and posterior approximation θ_0, ω_0, ψ_0; expert trajectories τ_E ∼ π_E containing state-action pairs.\nOutput: Learned policy π_θ\nfor i = 0, 1, 2, . . . 
do\n  Sample a batch of latent codes: c_i ∼ p(c)\n  Sample trajectories: τ_i ∼ π_{θ_i}(c_i), with the latent code fixed during each rollout.\n  Sample state-action pairs χ_i ∼ τ_i and χ_E ∼ τ_E with the same batch size.\n  Update ω_i to ω_{i+1} by ascending with the gradient\n\n    Δω_i = Ê_{χ_i}[∇_{ω_i} log D_{ω_i}(s, a)] + Ê_{χ_E}[∇_{ω_i} log(1 − D_{ω_i}(s, a))]\n\n  Update ψ_i to ψ_{i+1} by descending with the gradient\n\n    Δψ_i = −λ1 Ê_{χ_i}[∇_{ψ_i} log Q_{ψ_i}(c|s, a)]\n\n  Take a policy step from θ_i to θ_{i+1}, using the TRPO update rule with the following objective:\n\n    Ê_{χ_i}[log D_{ω_{i+1}}(s, a)] − λ1 L_I(π_{θ_i}, Q_{ψ_{i+1}}) − λ2 H(π_{θ_i})\n\nend for\n\n3.2 Reward Augmentation\n\nIn complex and less well-specified environments, imitation learning methods have the potential to perform better than reinforcement learning methods, as they do not require manual specification of an appropriate reward function. However, if the expert is performing sub-optimally, then any policy trained under the recovered rewards will also be suboptimal; in other words, the imitation learning agent's potential is bounded by the capabilities of the expert that produced the training data. In many cases, while it is very difficult to fully specify a suitable reward function for a given task, it is relatively straightforward to come up with constraints that we would like to enforce over the policy. This motivates the introduction of reward augmentation [8], a general framework to incorporate prior knowledge in imitation learning by providing additional incentives to the agent without interfering with the imitation learning process. We achieve this by specifying a surrogate state-based reward η(π_θ) = E_{s∼π_θ}[r(s)] that reflects our bias over the desired agent's behavior:\n\nmin_{θ,ψ} max_ω E_{π_θ}[log D_ω(s, a)] + E_{π_E}[log(1 − D_ω(s, a))] − λ0 η(π_θ) − λ1 L_I(π_θ, Q_ψ) − λ2 H(π_θ)    (4)\n\nwhere λ0 > 0 is a hyperparameter. This approach can be seen as a hybrid between imitation and reinforcement learning, where part of the reinforcement signal for the policy optimization comes from the surrogate reward and part from the discriminator, i.e., from mimicking the expert. For example, in our autonomous driving experiment below, we show that by providing the agent with a penalty if it collides with other cars or drives off the road, we are able to significantly improve the average rollout distance of the learned policy.\n\n³ [14] presents a proof for the lower bound.\n\n3.3 Improved Optimization\n\nWhile GAIL is successful in tasks with low-dimensional inputs (in [12], the largest observation has 376 continuous variables), few have explored tasks where the input dimension is very high (such as images: 110 × 200 × 3 pixels in our driving experiments). In order to effectively learn a policy that relies solely on high-dimensional input, we make the following improvements over the original GAIL framework.\nIt is well known that the traditional GAN objective suffers from vanishing gradient and mode collapse problems [24, 25]. We propose to use the Wasserstein GAN (WGAN, [26]) technique to alleviate these problems and augment our objective function as follows:\n\nmin_{θ,ψ} max_ω E_{π_θ}[D_ω(s, a)] − E_{π_E}[D_ω(s, a)] − λ0 η(π_θ) − λ1 L_I(π_θ, Q_ψ) − λ2 H(π_θ)    (5)\n\nWe note that this modification is especially important in our setting, where we want to model complex distributions over trajectories that can potentially have a large number of modes.\nWe also use several variance reduction techniques, including baselines [27] and replay buffers [28]. Besides the baseline, we have three models to update in the InfoGAIL framework, which are represented as neural networks: the discriminator network D_ω(s, a), the policy network π_θ(a|s, c), and the posterior estimation network Q_ψ(c|s, a). We update D_ω using RMSprop (as suggested in the original WGAN paper), and update Q_ψ and π_θ using Adam and TRPO respectively. We include the detailed training procedure in Appendix C. To speed up training, we initialize our policy from behavior cloning, as in [12].\nNote that the discriminator network D_ω and the posterior approximation network Q_ψ are treated as distinct networks, as opposed to the InfoGAN approach, where they share the same network parameters until the final output layer. This is because the current WGAN training framework requires weight clipping and momentum-free optimization methods when training D_ω. These changes would interfere with the training of an expressive Q_ψ if D_ω and Q_ψ shared the same network parameters.\n\n4 Experiments\n\nWe demonstrate the performance of our method by applying it first to a synthetic 2D example and then in a challenging driving domain where the agent imitates driving behaviors from visual inputs. 
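The WGAN-style critic update in (5), including the weight clipping it requires, can be sketched on a toy linear critic. Everything here is a stand-in (a linear D_ω over hand-made feature batches, plain gradient ascent rather than RMSprop, made-up batch sizes); the sign convention follows the paper, where D_ω scores generated pairs higher and expert pairs lower:

```python
import numpy as np

rng = np.random.default_rng(0)
CLIP = 0.01  # weight-clipping bound from the WGAN recipe

def critic(w, x):
    return x @ w  # linear stand-in for D_omega(s, a) on feature vectors x

def critic_step(w, x_gen, x_exp, lr=0.05):
    """One critic update: ascend E_gen[D] - E_exp[D], then clip the weights."""
    grad = x_gen.mean(axis=0) - x_exp.mean(axis=0)  # gradient of the objective
    w = w + lr * grad
    return np.clip(w, -CLIP, CLIP)  # keep the critic (roughly) 1-Lipschitz

# Toy (s, a) feature batches: expert and generated pairs differ in mean.
x_exp = rng.normal(loc=0.0, size=(64, 4))
x_gen = rng.normal(loc=1.0, size=(64, 4))

w = np.zeros(4)
for _ in range(100):
    w = critic_step(w, x_gen, x_exp)

# After training, the clipped critic separates the two batches.
gap = critic(w, x_gen).mean() - critic(w, x_exp).mean()
```

In the full method this critic score, rather than a log-probability, is what feeds the TRPO policy step, which is why the discriminator must stay bounded via clipping.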
By conducting experiments in these two environments, we show that our learned policy π_θ can 1) imitate expert behaviors using high-dimensional inputs with only a small number of expert demonstrations, 2) cluster expert behaviors into different and semantically meaningful categories, and 3) reproduce different categories of behaviors by setting the high-level latent variables appropriately.\nThe driving experiments are conducted in the TORCS (The Open Racing Car Simulator, [15]) environment. The demonstrations are collected by manually driving along the race track, and show typical behaviors such as staying within lanes, avoiding collisions, and passing other cars. The policy accepts raw visual inputs as the only external inputs for the state, and produces a three-dimensional continuous action that consists of steering, acceleration, and braking. We assume that our policies are Gaussian distributions with fixed standard deviations, so H(π) is constant.\n\nFigure 1: Learned trajectories in the synthetic 2D plane environment ((a) Expert, (b) Behavior cloning, (c) GAIL, (d) Ours). Each color denotes one specific latent code. Behavior cloning deviates from the expert demonstrations due to compounding errors. GAIL does produce circular trajectories but fails to capture the latent structure, as it assumes that the demonstrations are generated from a single expert and tries to learn an average policy. Our method (InfoGAIL) successfully distinguishes expert behaviors and imitates each mode accordingly (colors are ordered in accordance with the expert for visualization purposes, but are not identifiable).\n\n4.1 Learning to Distinguish Trajectories\n\nWe demonstrate the effectiveness of InfoGAIL on a synthetic example. The environment is a 2D plane where the agent can move around freely at a constant velocity by selecting its direction p_t at (discrete) time t. 
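A minimal sketch of such a constant-velocity 2D environment is given below. The class name, the unit speed, and the steadily rotating "expert" heading are illustrative assumptions, not the paper's actual implementation:

```python
import math

class PointEnv2D:
    """Toy 2D plane: the agent moves at constant speed by picking a heading."""

    def __init__(self, speed=1.0):
        self.speed = speed
        self.pos = (0.0, 0.0)

    def step(self, direction):
        """direction: heading angle (radians) chosen at this discrete step."""
        x, y = self.pos
        self.pos = (x + self.speed * math.cos(direction),
                    y + self.speed * math.sin(direction))
        return self.pos

# A circle-like "expert": rotate the heading by a fixed amount each step.
env = PointEnv2D()
trajectory = [env.pos]
for t in range(36):
    trajectory.append(env.step(direction=2 * math.pi * t / 36))

# Constant velocity: every step covers exactly the same distance.
steps = [math.dist(p, q) for p, q in zip(trajectory, trajectory[1:])]
```

Distinct expert modes (e.g., circles of different radii or orientations) would correspond to different heading-update rules, which is the structure the latent code is meant to recover.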
For the agent, the observations at time t are the positions from time t − 4 to t. The (unlabeled) expert demonstrations contain three distinct modes, each generated with a stochastic expert policy that produces a circle-like trajectory (see Figure 1, panel a). The objective is to distinguish these three distinct modes and imitate the corresponding expert behavior. We consider three methods: behavior cloning, GAIL, and InfoGAIL (details are included in Appendix A). In particular, for all the experiments we assume the same architecture and that the latent code is a one-hot encoded vector with 3 dimensions and a uniform prior; only InfoGAIL regularizes the latent code. Figure 1 shows that the introduction of latent variables allows InfoGAIL to distinguish the three types of behavior and imitate each behavior successfully; the other two methods, however, fail to distinguish distinct modes. BC suffers from the compounding error problem, and the learned policy tends to deviate from the expert trajectories; GAIL does learn to generate circular trajectories, but it fails to separate different modes due to the lack of a mechanism that can explicitly account for the underlying structure.\nIn the rest of Section 4, we show how InfoGAIL can infer the latent structure of human decision-making in a driving domain. In particular, our agent relies only on visual inputs to sense the environment.\n\n4.2 Utilizing Raw Visual Inputs via Transfer Learning\n\nThe high-dimensional nature of visual inputs poses a significant challenge to learning a policy. Intuitively, the policy has to simultaneously learn how to identify meaningful visual features, and how to leverage them to achieve the desired behavior, using only a small number of expert demonstrations. Therefore, methods that mitigate the high sample complexity of the problem are crucial to success in this domain. In this paper, we take a transfer learning approach. 
Features extracted using a CNN pre-trained on ImageNet contain high-level information about the input images, which can be adapted to new vision tasks via transfer learning [29]. However, it is not yet clear whether these relatively high-level features can be directly applied to tasks where perception and action are tightly interconnected; we demonstrate through our experiments that this is possible. We perform transfer learning by exploiting features from a pre-trained neural network that effectively converts raw images into relatively high-level information [30]. In particular, we use a Deep Residual Network [31] pre-trained on the ImageNet classification task [32] to obtain the visual features used as inputs for the policy network.\n\n4.3 Network Structure\n\nOur policy accepts certain auxiliary information as internal input to serve as a short-term memory, which can be accessed along with the raw visual inputs. In our experiments, the auxiliary information for the policy at time t consists of the following: 1) the velocity at time t, which is a three-dimensional vector; 2) the actions at times t − 1 and t − 2, which are both three-dimensional vectors; and 3) the damage of the car, which is a real value. The auxiliary input has 10 dimensions in total.\n\nFigure 2: Visualizing the training process of turn. Here we show the trajectories of InfoGAIL at different stages of training. Blue and red indicate policies under different latent codes, which correspond to "turning from inner lane" and "turning from outer lane" respectively. 
The rightmost figure shows the trajectories under latent codes [1, 0] (red), [0, 1] (blue), and [0.5, 0.5] (purple), which suggests that, to some extent, our method is able to generalize to cases previously unseen in the training data.\n\nFor the policy network, the input visual features are passed through two convolutional layers, and then combined with the auxiliary information vector and (in the case of InfoGAIL) the latent code c. We parameterize the baseline as a network with the same architecture except for the final layer, which is just a scalar output that indicates the expected accumulated future rewards.\nThe discriminator D_ω accepts three elements as input: the input image, the auxiliary information, and the current action. The output is a score for the WGAN training objective, which is supposed to be lower for expert state-action pairs and higher for generated ones. The posterior approximation network Q_ψ adopts the same architecture as the discriminator, except that the output is a softmax over the discrete latent variables or a factored Gaussian over continuous latent variables. We include the details of our architecture in Appendix B.\n\n4.4 Interpretable Imitation Learning from Visual Demonstrations\n\nIn this experiment, we consider two subsets of human driving behaviors: turn, where the expert takes a turn using either the inside lane or the outside lane; and pass, where the expert passes another vehicle from either the left or the right. In both cases, the expert policy has two significant modes. Our goal is to have InfoGAIL capture these two separate modes from expert demonstrations in an unsupervised way.\nWe use a discrete latent code, which is a one-hot encoded vector with two possible states. For both settings, there are 80 expert trajectories in total, with 100 frames in each trajectory; our prior for the latent code is a uniform discrete distribution over the two states. 
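The latent-code setup described above (one-hot code, uniform prior, code held fixed for a whole rollout, as in Algorithm 1) can be sketched as follows; the stand-in policy is a made-up placeholder for the conditional policy network:

```python
import random

random.seed(0)

def sample_latent_code(k=2):
    """Draw a one-hot latent code c from a uniform prior over k states."""
    c = [0] * k
    c[random.randrange(k)] = 1
    return c

def rollout_with_code(policy, horizon=100, k=2):
    """Sample c once, then keep it fixed for the entire rollout."""
    c = sample_latent_code(k)
    return c, [policy(t, c) for t in range(horizon)]

# Stand-in conditional policy: the action depends only on the (fixed) code.
policy = lambda t, c: +0.1 if c[0] == 1 else -0.1

c, actions = rollout_with_code(policy)
```

Holding c fixed per rollout is what lets the code index a whole behavior mode (e.g., inside versus outside lane) rather than a per-step choice.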
The performance of a learned policy is quantified with two metrics: the average distance, determined by the distance traveled by the agent before a collision (and bounded by the length of the simulation horizon), and the accuracy, defined as the classification accuracy of the expert state-action pairs according to the latent code inferred with Q_ψ. As reward augmentation, we add a constant reward at every time step, which encourages the car to "stay alive" as long as possible and can be regarded as another way of reducing collisions and off-lane driving (as these lead to the termination of the episode).\nThe average distance and sampled trajectories at different stages of training are shown in Figures 2 and 3 for turn and pass respectively. During the initial stages of training, the model does not distinguish the two modes and has a high chance of colliding and driving off-lane, due to the limitations of behavior cloning (which we used to initialize the policy). As training progresses, the trajectories provided by the learned policy begin to diverge. Towards the end of training, the two types of trajectories are clearly distinguishable, with only a few exceptions. In turn, [0, 1] corresponds to using the inside lane, while [1, 0] corresponds to the outside lane. In pass, the two kinds of latent codes correspond to passing from the right and from the left respectively. Meanwhile, the average distance of the rollouts steadily increases with more training.\nLearning the two modes separately requires accurate inference of the latent code. To examine the accuracy of posterior inference, we select state-action pairs from the expert trajectories (where the state is represented as a concatenation of the raw image and the auxiliary variables) and obtain the corresponding latent code through Q_ψ(c|s, a); see Table 1. 
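The accuracy metric can be computed as in the following sketch. The posterior outputs and mode labels below are invented toy numbers; note that, as the text points out, latent codes are only identifiable up to permutation, so the sketch scores both labelings and keeps the better one:

```python
def latent_code_accuracy(q_probs, true_modes):
    """Fraction of expert state-action pairs whose inferred latent code
    (argmax of Q(c|s,a)) matches the ground-truth mode label.

    Labels are used only for evaluation, never during training. With two
    modes, codes are identifiable only up to permutation, so we score both
    labelings and return the better one."""
    preds = [max(range(len(p)), key=p.__getitem__) for p in q_probs]
    hits = sum(p == t for p, t in zip(preds, true_modes))
    return max(hits, len(true_modes) - hits) / len(true_modes)

# Toy posterior outputs Q(c | s, a) for six expert pairs, two modes.
q_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7],
           [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
true_modes = [0, 0, 1, 1, 1, 0]

acc = latent_code_accuracy(q_probs, true_modes)  # 4 of 6 pairs match
```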
Although we did not explicitly provide any labels, our model is able to correctly distinguish over 81% of the state-action pairs in pass (and almost all the pairs in turn, confirming the clear separation between generated trajectories with different latent codes in Figure 2).\n\nFigure 3: Experimental results for pass. Left: Trajectories of InfoGAIL at different stages of training (epochs 1 to 37). Blue and red indicate policies using different latent code values, which correspond to passing from the right or the left. Middle: Traveled distance denotes the absolute distance from the start position, averaged over 60 rollouts of the InfoGAIL policy trained at different epochs. Right: Trajectories of pass produced by an agent trained on the original GAIL objective. Compared to InfoGAIL, GAIL fails to distinguish between different modes.\n\nTable 1: Classification accuracies for pass.\n\nMethod            Accuracy\nChance            50%\nK-means           55.4%\nPCA               61.7%\nInfoGAIL (Ours)   81.9%\nSVM               85.8%\nCNN               90.8%\n\nTable 2: Average rollout distances.\n\nMethod             Avg. rollout distance\nBehavior Cloning   701.83\nGAIL               914.45\nInfoGAIL \\ RB      1031.13\nInfoGAIL \\ RA      1123.89\nInfoGAIL \\ WGAN    1177.72\nInfoGAIL (Ours)    1226.68\nHuman              1203.51\n\nFor comparison, we also visualize the trajectories of pass for the original GAIL objective in Figure 3, where there is no mutual information regularization. GAIL learns the expert trajectories as a whole, and cannot distinguish the two modes in the expert policy. Interestingly, instead of learning two separate trajectories, GAIL tries to fit the left trajectory by swinging the car suddenly to the left after it has passed the other car on the right. We believe this reflects a limitation of the discriminator. Since D_ω(s, a) only requires state-action pairs as input, the policy is only required to match most of the state-action pairs; matching each rollout as a whole with the expert trajectories is not necessary. 
InfoGAIL with discrete latent codes alleviates this problem by forcing the model to learn separate trajectories.

4.5 Ablation Experiments

We conduct a series of ablation experiments to demonstrate that the improved optimization techniques proposed in Sections 3.2 and 3.3 are crucial for learning an effective policy. Our policy drives a car on the race track alongside other cars; the human expert provides 20 trajectories of 500 frames each, attempting to drive as fast as possible without collision. Reward augmentation is performed by adding a reward that encourages the car to drive faster. The performance of a policy is measured by its average rollout distance; a longer distance indicates a better policy.
In the ablation experiments, we selectively remove some of the improved optimization methods from Sections 3.2 and 3.3 (no latent codes are used in these experiments). InfoGAIL (Ours) includes all the optimization techniques; GAIL excludes all of them; InfoGAIL \ WGAN replaces the WGAN objective with the regular GAN objective; InfoGAIL \ RA removes reward augmentation; InfoGAIL \ RB removes the replay buffer and samples only from the most recent rollouts; Behavior Cloning is the behavior cloning baseline, and Human is the expert policy. Table 2 shows the average rollout distances of the different policies. Our method outperforms the expert with the help of reward augmentation; policies without reward augmentation or WGANs perform slightly worse than the expert; removing the replay buffer causes performance to deteriorate significantly due to the increased variance in gradient estimation.

5 Related work

There are two major paradigms for vision-based driving systems [33].
Mediated perception is a two-step approach that first obtains scene information and then makes a driving decision [34-36]; behavior reflex, on the other hand, adopts a direct approach that maps visual inputs to driving actions [37, 16]. Many current autonomous driving methods rely on the two-step approach, which requires hand-crafted features such as detectors for lane markings and cars [38, 33]. Our approach, in contrast, attempts to learn a mapping directly from vision to actions. While mediated perception approaches are currently more prevalent, we believe that end-to-end learning methods are more scalable and may lead to better performance in the long run.
[39] introduces an end-to-end imitation learning framework that learns to drive entirely from visual information and tests the approach in real-world scenarios. However, that method uses behavior cloning, performing supervised learning over state-action pairs, which is well known to generalize poorly to more sophisticated tasks such as changing lanes or passing vehicles. With the use of GAIL, our method can learn to perform these sophisticated operations. [40] performs end-to-end visual imitation learning in TORCS through DAgger [18], which queries the reference policy during training and is difficult in many cases.
Most imitation learning methods for end-to-end driving rely heavily on LIDAR-like inputs to obtain precise distance measurements [21, 41]. These inputs are not usually available to humans during driving. In particular, [41] applies GAIL to the task of modeling human driving behavior on highways. In contrast, our policy requires only raw visual information as external input, which in practice is all the information humans need in order to drive.
[42] and [9] have also introduced pre-trained deep neural networks to achieve better performance in imitation learning with relatively few demonstrations.
Specifically, they introduce a pre-trained model to learn dense, incremental reward functions that are suitable for downstream reinforcement learning tasks, such as real-world robotic experiments. This differs from our approach in that transfer learning is performed over the critic instead of the policy. It would be interesting to combine that reward with our approach through reward augmentation.

6 Conclusion

In this paper, we present a method to imitate complex behaviors while identifying salient latent factors of variation in the demonstrations. Discovering these latent factors does not require direct supervision beyond expert demonstrations, and the whole process can be trained directly with standard policy optimization algorithms. We also introduce several techniques to successfully perform imitation learning using visual inputs, including transfer learning and reward augmentation. Our experimental results in the TORCS simulator show that our method can automatically distinguish certain behaviors in human driving, while learning a policy that can imitate and even outperform the human expert using visual information as the sole external input. We hope that our work can further inspire end-to-end learning approaches to autonomous driving in more realistic scenarios.

Acknowledgements

We thank Shengjia Zhao and Neal Jean for their assistance and advice. Toyota Research Institute (TRI) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. This research was also supported by Intel Corporation, FLI, and NSF grants 1651565, 1522054, and 1733686.

References
[1] S. Levine and V. Koltun, "Guided policy search," in ICML, pp. 1-9, 2013.

[2] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization," in ICML, pp.
1889-1897, 2015.

[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.

[4] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.

[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.

[6] A. Tamar, S. Levine, P. Abbeel, Y. Wu, and G. Thomas, "Value iteration networks," in Advances in Neural Information Processing Systems, pp. 2146-2154, 2016.

[7] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in AAAI, vol. 8, pp. 1433-1438, Chicago, IL, USA, 2008.

[8] P. Englert and M. Toussaint, "Inverse KKT: learning cost functions of manipulation tasks from demonstrations," in Proceedings of the International Symposium of Robotics Research, 2015.

[9] C. Finn, S. Levine, and P. Abbeel, "Guided cost learning: Deep inverse optimal control via policy optimization," in Proceedings of the 33rd International Conference on Machine Learning, vol. 48, 2016.

[10] B. Stadie, P. Abbeel, and I. Sutskever, "Third person imitation learning," in ICLR, 2017.

[11] S. Ermon, Y. Xue, R. Toth, B. N. Dilkina, R. Bernstein, T. Damoulas, P. Clark, S. DeGloria, A. Mude, C. Barrett, et al., "Learning large-scale dynamic discrete choice models of spatio-temporal preferences with application to migratory pastoralism in East Africa," in AAAI, pp. 644-650, 2015.

[12] J. Ho and S.
Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems, pp. 4565-4573, 2016.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

[14] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2172-2180, 2016.

[15] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, "TORCS, the open racing car simulator," software available at http://torcs.sourceforge.net, 2000.

[16] D. A. Pomerleau, "Efficient training of artificial neural networks for autonomous navigation," Neural Computation, vol. 3, no. 1, pp. 88-97, 1991.

[17] S. Ross and D. Bagnell, "Efficient reductions for imitation learning," in AISTATS, pp. 3-5, 2010.

[18] S. Ross, G. J. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in AISTATS, p. 6, 2011.

[19] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proceedings of the 21st International Conference on Machine Learning, p. 1, ACM, 2004.

[20] U. Syed, M. Bowling, and R. E. Schapire, "Apprenticeship learning using linear programming," in Proceedings of the 25th International Conference on Machine Learning, pp. 1032-1039, ACM, 2008.

[21] J. Ho, J. K. Gupta, and S. Ermon, "Model-free imitation learning with policy optimization," in Proceedings of the 33rd International Conference on Machine Learning, 2016.

[22] M. Bloem and N.
Bambos, "Infinite time horizon maximum causal entropy inverse reinforcement learning," in Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 4911-4916, IEEE, 2014.

[23] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[24] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.

[25] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, "Generalization and equilibrium in generative adversarial nets (GANs)," arXiv preprint arXiv:1703.00573, 2017.

[26] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.

[27] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229-256, 1992.

[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.

[29] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," in Advances in Neural Information Processing Systems, pp. 3320-3328, 2014.

[30] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806-813, 2014.

[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
770-778, 2016.

[32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.

[33] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2722-2730, 2015.

[34] M. Aly, "Real time detection of lane markers in urban streets," in Intelligent Vehicles Symposium, 2008 IEEE, pp. 7-12, IEEE, 2008.

[35] P. Lenz, J. Ziegler, A. Geiger, and M. Roser, "Sparse scene flow segmentation for moving object detection in urban environments," in Intelligent Vehicles Symposium (IV), 2011 IEEE, pp. 926-932, IEEE, 2011.

[36] K. Kitani, B. Ziebart, J. Bagnell, and M. Hebert, "Activity forecasting," Computer Vision - ECCV 2012, pp. 201-214, 2012.

[37] D. A. Pomerleau, "ALVINN, an autonomous land vehicle in a neural network," tech. rep., Carnegie Mellon University, Computer Science Department, 1989.

[38] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, 2013.

[39] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.

[40] J. Zhang and K. Cho, "Query-efficient imitation learning for end-to-end autonomous driving," arXiv preprint arXiv:1605.06450, 2016.

[41] A. Kuefler, J. Morton, T. Wheeler, and M.
Kochenderfer, "Imitating driver behavior with generative adversarial networks," arXiv preprint arXiv:1701.06699, 2017.

[42] P. Sermanet, K. Xu, and S. Levine, "Unsupervised perceptual rewards for imitation learning," arXiv preprint arXiv:1612.06699, 2016.