{"title": "Robust Imitation of Diverse Behaviors", "book": "Advances in Neural Information Processing Systems", "page_first": 5320, "page_last": 5329, "abstract": "Deep generative models have recently shown great promise in imitation learning for motor control. Given enough data, even supervised approaches can do one-shot imitation learning; however, they are vulnerable to cascading failures when the agent trajectory diverges from the demonstrations. Compared to purely supervised methods, Generative Adversarial Imitation Learning (GAIL) can learn more robust controllers from fewer demonstrations, but is inherently mode-seeking and more difficult to train. In this paper, we show how to combine the favourable aspects of these two approaches. The base of our model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. We show that these embeddings can be learned on a 9 DoF Jaco robot arm in reaching tasks, and then smoothly interpolated with a resulting smooth interpolation of reaching behavior. Leveraging these policy representations, we develop a new version of GAIL that (1) is much more robust than the purely-supervised controller, especially with few demonstrations, and (2) avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not. We demonstrate our approach on learning diverse gaits from demonstration on a 2D biped and a 62 DoF 3D humanoid in the MuJoCo physics environment.", "full_text": "Robust Imitation of Diverse Behaviors\n\nZiyu Wang*, Josh Merel*, Scott Reed, Greg Wayne, Nando de Freitas, Nicolas Heess\n\nDeepMind\n\nziyu,jsmerel,reedscot,gregwayne,nandodefreitas,heess@google.com\n\nAbstract\n\nDeep generative models have recently shown great promise in imitation learning\nfor motor control. 
Given enough data, even supervised approaches can do one-shot\nimitation learning; however, they are vulnerable to cascading failures when the\nagent trajectory diverges from the demonstrations. Compared to purely supervised\nmethods, Generative Adversarial Imitation Learning (GAIL) can learn more robust\ncontrollers from fewer demonstrations, but is inherently mode-seeking and more\ndifficult to train. In this paper, we show how to combine the favourable aspects\nof these two approaches. The base of our model is a new type of variational\nautoencoder on demonstration trajectories that learns semantic policy embeddings.\nWe show that these embeddings can be learned on a 9 DoF Jaco robot arm in\nreaching tasks, and then smoothly interpolated with a resulting smooth interpolation\nof reaching behavior. Leveraging these policy representations, we develop a new\nversion of GAIL that (1) is much more robust than the purely-supervised controller,\nespecially with few demonstrations, and (2) avoids mode collapse, capturing many\ndiverse behaviors when GAIL on its own does not. We demonstrate our approach\non learning diverse gaits from demonstration on a 2D biped and a 62 DoF 3D\nhumanoid in the MuJoCo physics environment.\n\n1 Introduction\n\nBuilding versatile embodied agents, both in the form of real robots and animated avatars, capable\nof a wide and diverse set of behaviors is one of the long-standing challenges of AI. State-of-the-art\nrobots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced\nby toddlers. Towards addressing this challenge, in this work we combine several deep generative\napproaches to imitation learning in a way that accentuates their individual strengths and addresses\ntheir limitations. 
The end product is a robust neural network policy that can imitate a large and\ndiverse set of behaviors using few training demonstrations.\nWe first introduce a variational autoencoder (VAE) [15, 26] for supervised imitation, consisting of a\nbi-directional LSTM [13, 32, 9] encoder mapping demonstration sequences to embedding vectors,\nand two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory\nembedding and the current state to a continuous action vector. The second is a dynamics model\nmapping the embedding and previous state to the present state, while modelling correlations among\nstates with a WaveNet [39]. Experiments with a 9 DoF Jaco robot arm and a 9 DoF 2D biped walker,\nimplemented in the MuJoCo physics engine [38], show that the VAE learns a structured semantic\nembedding space, which allows for smooth policy interpolation.\nWhile supervised policies that condition on demonstrations (such as our VAE or the recent approach\nof Duan et al. [6]) are powerful models for one-shot imitation, they require large training datasets in\norder to work for non-trivial tasks. They also tend to be brittle and fail when the agent diverges too\nmuch from the demonstration trajectories. These limitations of supervised learning for imitation, also\nknown as behavioral cloning (BC) [24], are well known [28, 29].\n\n*Joint first authors.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nRecently, Ho and Ermon [12] showed a way to overcome the brittleness of supervised imitation using\nanother type of deep generative model called Generative Adversarial Networks (GANs) [8]. Their\ntechnique, called Generative Adversarial Imitation Learning (GAIL), uses reinforcement learning,\nallowing the agent to interact with the environment during training. 
GAIL allows one to learn more\nrobust policies with fewer demonstrations, but adversarial training introduces another difficulty called\nmode collapse [7]. This refers to the tendency of adversarial generative models to cover only a subset\nof modes of a probability distribution, resulting in a failure to produce adequately diverse samples.\nThis will cause the learned policy to capture only a subset of control behaviors (which can be viewed\nas modes of a distribution), rather than allocating capacity to cover all modes.\nRoughly speaking, VAEs can model diverse behaviors without dropping modes, but do not learn\nrobust policies, while GANs give us robust policies but insufficiently diverse behaviors. In section\n3, we show how to engineer an objective function that takes advantage of both GANs and VAEs to\nobtain robust policies capturing diverse behaviors. In section 4, we show that our combined approach\nenables us to learn diverse behaviors for a 9 DoF 2D biped and a 62 DoF humanoid, where the VAE\npolicy alone is brittle and GAIL alone does not capture all of the diverse behaviors.\n\n2 Background and Related Work\n\nWe begin our brief review with generative models. One canonical way of training generative\nmodels is to maximize the likelihood of the data: $\\max \\sum_i \\log p_\\theta(x_i)$. This is equivalent to\nminimizing the Kullback-Leibler divergence between the distribution of the data and the model:\n$D_{KL}(p_{data}(\\cdot) \\| p_\\theta(\\cdot))$. For highly-expressive generative models, however, optimizing the\nlog-likelihood is often intractable.\nOne class of highly-expressive yet tractable models are the auto-regressive models which decompose\nthe log likelihood as $\\log p(x) = \\sum_i \\log p_\\theta(x_i | x_{<i})$. 
Auto-regressive models have been highly\neffective in both image and audio generation [40, 39].\nInstead of optimizing the log-likelihood directly, one can introduce a parametric inference model\nover the latent variables, $q_\\phi(z|x)$, and optimize a lower bound of the log-likelihood:\n\n$$\\mathbb{E}_{q_\\phi(z|x_i)}[\\log p_\\theta(x_i|z)] - D_{KL}(q_\\phi(z|x_i) \\| p(z)) \\le \\log p(x). \\tag{1}$$\n\nFor continuous latent variables, this bound can be optimized efficiently via the re-parameterization\ntrick [15, 26]. This class of models is often referred to as VAEs.\nGANs, introduced by Goodfellow et al. [8], have become very popular. GANs use two networks:\na generator G and a discriminator D. The generator attempts to generate samples that are\nindistinguishable from real data. The job of the discriminator is then to tell apart the data and the samples,\npredicting 1 with high probability if the sample is real and 0 otherwise. More precisely, GANs\noptimize the following objective function\n\n$$\\min_G \\max_D \\; \\mathbb{E}_{p_{data}(x)}[\\log D(x)] + \\mathbb{E}_{p(z)}[\\log(1 - D(G(z)))]. \\tag{2}$$\n\nAuto-regressive models, VAEs and GANs are all highly effective generative models, but have different\ntrade-offs. GANs were noted for their ability to produce sharp image samples, unlike the blurrier\nsamples from contemporary VAE models [8]. However, unlike VAEs and autoregressive models\ntrained via maximum likelihood, they suffer from the mode collapse problem [7]. Recent work has\nfocused on alleviating mode collapse in image modeling [2, 4, 19, 25, 42, 11, 27], but so far these\nhave not been demonstrated in the control domain. Like GANs, autoregressive models produce sharp\nand at times realistic image samples [40], but they tend to be slow to sample from and unlike VAEs\ndo not immediately provide a latent vector representation of the data. This is why we used VAEs to\nlearn representations of demonstration trajectories.\nWe turn our attention to imitation. 
Imitation is the problem of learning a control policy that mimics a\nbehavior provided via a demonstration. It is natural to view imitation learning from the perspective\nof generative modeling. However, unlike in image and audio modeling, in imitation the generation\nprocess is constrained by the environment and the agent\u2019s actions, with observations becoming\naccessible through interaction. Imitation learning brings its own unique challenges.\nIn this paper, we assume that we have been provided with demonstrations $\\{\\tau_i\\}_i$ where the i-th\ntrajectory of state-action pairs is $\\tau_i = \\{x^i_1, a^i_1, \\cdots, x^i_{T_i}, a^i_{T_i}\\}$. These trajectories may have been\nproduced by either an artificial or natural agent.\nAs in generative modeling, we can easily apply maximum likelihood to imitation learning. For\ninstance, if the dynamics are tractable, we can maximize the likelihood of the states directly:\n$\\max_\\theta \\sum_i \\sum_{t=1}^{T_i} \\log p(x^i_{t+1} | x^i_t, \\pi_\\theta(x^i_t))$. If a model of the dynamics is unavailable, we can instead\nmaximize the likelihood of the actions: $\\max_\\theta \\sum_i \\sum_{t=1}^{T_i} \\log \\pi_\\theta(a^i_t | x^i_t)$. The latter approach is what\nwe referred to as behavioral cloning (BC) in the introduction.\nWhen demonstrations are plentiful, BC is effective [24, 30, 6]. Without abundant data, BC is known\nto be inadequate [28, 29, 12]. The inefficiencies of BC stem from the sequential nature of the problem.\nWhen using BC, even the slightest errors in mimicking the demonstration behavior can quickly\naccumulate as the policy is unrolled. A good policy should correct for mistakes made previously, but\nfor BC to achieve this, the corrective behaviors have to appear frequently in the training data.\nGAIL [12] avoids some of the pitfalls of BC by allowing the agent to interact with the environment and\nlearn from these interactions. 
It constructs a reward function using GANs to measure the similarity\nbetween the policy-generated trajectories and the expert trajectories. As in GANs, GAIL adopts the\nfollowing objective function\n\n$$\\min_\\theta \\max_\\psi \\; \\mathbb{E}_{\\pi_E}[\\log D_\\psi(x, a)] + \\mathbb{E}_{\\pi_\\theta}[\\log(1 - D_\\psi(x, a))], \\tag{3}$$\n\nwhere $\\pi_E$ denotes the expert policy that generated the demonstration trajectories.\nTo avoid differentiating through the system dynamics, policy gradient algorithms are used to train\nthe policy by maximizing the discounted sum of rewards $r_\\psi(x_t, a_t) = -\\log(1 - D_\\psi(x_t, a_t))$.\nMaximizing this reward, which may differ from the expert reward, drives $\\pi_\\theta$ to expert-like regions\nof the state-action space. In practice, trust region policy optimization (TRPO) is used to stabilize\nthe learning process [31]. GAIL has become a popular choice for imitation learning [16] and there\nalready exist model-based [3] and third-person [36] extensions. Two recent GAIL-based approaches\n[17, 10] introduce additional reward signals that encourage the policy to make use of latent variables\nwhich would correspond to different types of demonstrations after training. These approaches are\ncomplementary to ours. Neither paper, however, demonstrates the ability to do one-shot imitation.\nThe literature on imitation including BC, apprenticeship learning and inverse reinforcement learning\nis vast. We cannot cover this literature at the level of detail it deserves, and instead refer readers to\nrecent authoritative surveys on the topic [5, 1, 14]. 
Inspired by recent works, including [12, 36, 6],\nwe focus on taking advantage of the dramatic recent advances in deep generative modelling to learn\nhigh-dimensional policies capable of learning a diverse set of behaviors from few demonstrations.\nIn graphics, a significant effort has been devoted to the design of physics controllers that take advantage\nof motion capture data, or key-frames and other inputs provided by animators [33, 35, 43, 22]. Yet,\nas pointed out in a recent hierarchical control paper [23], the design of such controllers often requires\nsignificant human insight. Our focus is on flexible, general imitation methods.\n\n3 A Generative Modeling Approach to Imitating Diverse Behaviors\n\n3.1 Behavioral cloning with variational autoencoders suited for control\nIn this section, we follow a similar approach to Duan et al. [6], but opt for stochastic VAEs, as having\na distribution $q_\\phi(z|x_{1:T})$ helps to better regularize the latent space.\nIn our VAE, an encoder maps a demonstration sequence to an embedding vector z. Given z, we\ndecode both the state and action trajectories as shown in Figure 1. To train the model, we minimize\nthe following loss:\n\n$$L(\\alpha, w, \\phi; \\tau_i) = -\\mathbb{E}_{q_\\phi(z|x^i_{1:T_i})}\\left[\\sum_{t=1}^{T_i} \\log \\pi_\\alpha(a^i_t | x^i_t, z) + \\log p_w(x^i_{t+1} | x^i_t, z)\\right] + D_{KL}\\left(q_\\phi(z|x^i_{1:T_i}) \\| p(z)\\right)$$\n\nOur encoder q uses a bi-directional LSTM. To produce the final embedding, it calculates the average\nof all the outputs of the second layer of this LSTM before applying a final linear transformation to\ngenerate the mean and standard deviation of a Gaussian. We take one sample from this Gaussian as\nour demonstration encoding.\nThe action decoder is an MLP that maps the concatenation of the state and the embedding to the\nparameters of a Gaussian policy. The state decoder is similar to a conditional WaveNet model [39].\n\nFigure 1: Schematic of the encoder decoder architecture. 
LEFT: Bidirectional LSTM on demonstration states, followed by action and state decoders at each time step. RIGHT: State decoder model\nwithin a single time step, which is autoregressive over the state dimensions.\n\nIn particular, it conditions on the embedding z and previous state $x_{t-1}$ to generate the vector $x_t$\nautoregressively. That is, the autoregression is over the components of the vector $x_t$. WaveNet lessens\nthe load of the encoder, which no longer has to carry information that can be captured by modeling\nauto-correlations between components of the state vector. Finally, instead of a Softmax, we use a\nmixture of Gaussians as the output of the WaveNet.\n\n3.2 Diverse generative adversarial imitation learning\nAs pointed out earlier, it is hard for BC policies to mimic experts under environmental perturbations.\nOur solution to obtain more robust policies from few demonstrations, which are also capable of\ndiverse behaviors, is to build on GAIL. Specifically, to enable GAIL to produce diverse solutions,\nwe condition the discriminator on the embeddings generated by the VAE encoder and integrate out\nthe GAIL objective with respect to the variational posterior $q_\\phi(z|x_{1:T})$. In particular, we train the\ndiscriminator by optimizing the following objective\n\n$$\\max_\\psi \\; \\mathbb{E}_{\\tau_i \\sim \\pi_E}\\left\\{ \\mathbb{E}_{q_\\phi(z|x^i_{1:T_i})}\\left[ \\frac{1}{T_i} \\sum_{t=1}^{T_i} \\log D_\\psi(x^i_t, a^i_t | z) + \\mathbb{E}_{\\pi_\\theta}[\\log(1 - D_\\psi(x, a | z))] \\right] \\right\\}. \\tag{4}$$\n\nA related work [20] introduces a conditional GAIL objective to learn controllers for multiple behaviors\nfrom state trajectories, but the discriminator conditions on an annotated class label, as in conditional\nGANs [21].\nWe condition on unlabeled trajectories, which have been passed through a powerful encoder, and\nhence our approach is capable of one-shot imitation learning. 
Moreover, the VAE encoder enables us\nto obtain a continuous latent embedding space where interpolation is possible, as shown in Figure 3.\nSince our discriminator is conditional, the reward function is also conditional: $r^t_\\psi(x_t, a_t|z) = -\\log(1 - D_\\psi(x_t, a_t|z))$. We also clip the reward so that it is upper-bounded. Conditioning on z\nallows us to generate an infinite number of reward functions, each of them tailored to imitating a\ndifferent trajectory. Policy gradients, though mode seeking, will not cause collapse into one particular\nmode due to the diversity of reward functions.\nTo better motivate our objective, let us temporarily leave the context of imitation learning and consider\nthe following alternative value function for training GANs\n\n$$\\min_G \\max_D V(G, D) = \\int_y p(y) \\int_z q(z|y) \\left[ \\log D(y|z) + \\int_{\\hat{y}} G(\\hat{y}|z) \\log(1 - D(\\hat{y}|z)) \\, d\\hat{y} \\right] dy \\, dz.$$\n\nThis function is a simplification of our objective function. Furthermore, it satisfies the following\nproperty.\nLemma 1. 
Assuming that q computes the true posterior distribution, that is $q(z|y) = \\frac{p(y|z)p(z)}{p(y)}$, then\n\n$$V(G, D) = \\int_z p(z) \\left[ \\int_y p(y|z) \\log D(y|z) \\, dy + \\int_{\\hat{y}} G(\\hat{y}|z) \\log(1 - D(\\hat{y}|z)) \\, d\\hat{y} \\right] dz.$$\n\nAlgorithm 1 Diverse generative adversarial imitation learning.\n\nINPUT: Demonstration trajectories $\\{\\tau_i\\}_i$ and VAE encoder q.\nrepeat\n  for $j \\in \\{1, \\cdots, n\\}$ do\n    Sample trajectory $\\tau_j$ from the demonstration set and sample $z_j \\sim q(\\cdot|x^j_{1:T_j})$.\n    Run policy $\\pi_\\theta(\\cdot|z_j)$ to obtain the trajectory $\\hat{\\tau}_j$.\n  end for\n  Update policy parameters via TRPO with rewards $r^j_t(x^j_t, a^j_t|z_j) = -\\log(1 - D_\\psi(x^j_t, a^j_t|z_j))$.\n  Update discriminator parameters from $\\psi_i$ to $\\psi_{i+1}$ with gradient:\n  $$\\nabla_\\psi \\left\\{ \\frac{1}{n} \\sum_{j=1}^{n} \\left[ \\frac{1}{T_j} \\sum_{t=1}^{T_j} \\log D_\\psi(x^j_t, a^j_t|z_j) \\right] + \\left[ \\frac{1}{\\hat{T}_j} \\sum_{t=1}^{\\hat{T}_j} \\log(1 - D_\\psi(\\hat{x}^j_t, \\hat{a}^j_t|z_j)) \\right] \\right\\}$$\nuntil Max iteration or time reached.\n\nIf we further assume an optimal discriminator [8], the cost optimized by the generator then becomes\n\n$$C(G) = 2 \\int_z p(z) \\, \\mathrm{JSD}[p(\\cdot|z) \\| G(\\cdot|z)] \\, dz - \\log 4, \\tag{5}$$\n\nwhere JSD stands for the Jensen-Shannon divergence. We know that GANs approximately optimize\nthis divergence, and it is well documented that optimizing it leads to mode seeking behavior [37].\nThe objective defined in (5) alleviates this problem. Consider an example where p(x) is a mixture\nof Gaussians and p(z) describes the distribution over the mixture components. In this case, the\nconditional distribution p(x|z) is not multi-modal, and therefore minimizing the Jensen-Shannon\ndivergence is no longer problematic. In general, if the latent variable z removes most of the ambiguity,\nwe can expect the conditional distributions to be close to uni-modal and therefore our generators to\nbe non-degenerate. 
In light of this analysis, we would like q to be as close to the posterior as possible\nand hence our choice of training q with VAEs.\nWe now turn our attention to some algorithmic considerations. We can use the VAE policy $\\pi_\\alpha(a_t|x_t, z)$\nto accelerate the training of $\\pi_\\theta(a_t|x_t, z)$. One possible route is to initialize the weights $\\theta$ to $\\alpha$.\nHowever, before the policy behaves reasonably, the noise injected into the policy for exploration\n(when using stochastic policy gradients) can cause poor initial performance. Instead, we fix $\\alpha$ and\nstructure the conditional policy as follows\n\n$$\\pi_\\theta(\\cdot|x, z) = \\mathcal{N}\\left(\\cdot \\,|\\, \\mu_\\theta(x, z) + \\mu_\\alpha(x, z), \\sigma_\\theta(x, z)\\right),$$\n\nwhere $\\mu_\\alpha$ is the mean of the VAE policy. Finally, the policy parameterized by $\\theta$ is optimized with\nTRPO [31] while holding parameters $\\alpha$ fixed, as shown in Algorithm 1.\n\n4 Experiments\n\nThe primary focus of our experimental evaluation is to demonstrate that the architecture allows\nlearning of robust controllers capable of producing the full spectrum of demonstration behaviors for\na diverse range of challenging control problems. We consider three bodies: a 9 DoF robotic arm,\na 9 DoF planar walker, and a 62 DoF complex humanoid (56 actuated joint angles, and a freely\ntranslating and rotating 3D root joint). While for the reaching task BC is sufficient to obtain a working\ncontroller, for the other two problems our full learning procedure is critical.\nWe analyze the resulting embedding spaces and demonstrate that they exhibit rich and sensible\nstructure that can be exploited for control. Finally, we show that the encoder can be used to capture\nthe gist of novel demonstration trajectories which can then be reproduced by the controller.\nAll experiments are conducted with the MuJoCo physics engine [38]. 
For details of the simulation\nand the experimental setup please see the appendix.\n4.1 Robotic arm reaching\nWe first demonstrate the effectiveness of our VAE architecture and investigate the nature of the\nlearned embedding space on a reaching task with a simulated Jaco arm. The physical Jaco is a\nrobotic arm developed by Kinova Robotics.\n\nFigure 3: Interpolation in the latent space for the Jaco arm. Each column shows three frames of a\ntarget-reach trajectory (time increases across rows). The leftmost and rightmost columns correspond to the\ndemonstration trajectories in between which we interpolate. Intermediate columns show trajectories\ngenerated by our VAE policy conditioned on embeddings which are convex combinations of the\nembeddings of the demonstration trajectories. Interpolating in the latent space indeed corresponds to\ninterpolation in the physical dimensions.\n\nTo obtain demonstrations, we trained 60 independent policies to reach random target locations\u00b2 in\nthe workspace starting from the same initial configuration. We generated 30 trajectories from each of\nthe first 50 policies. These serve as training data for the VAE model (1500 training trajectories in\ntotal). The remaining 10 policies were used to generate test data.\nThe reaching task is relatively simple, so with this amount of data the VAE policy is fairly robust.\nAfter training, the VAE encodes and reproduces the demonstrations as shown in Figure 2.\nRepresentative examples can be found in the video in the supplemental material.\nTo further investigate the nature of the embedding space we encode two trajectories. Next, we\nconstruct the embeddings of interpolating policies by taking convex combinations of the embedding\nvectors of the two trajectories. We condition the VAE policy on these interpolating embeddings and\nexecute it. The results of this experiment are illustrated with a representative pair in Figure 3. 
We\nobserve that interpolating in the latent space indeed corresponds to interpolation in task (trajectory\nendpoint) space, highlighting the semantic meaningfulness of the discovered latent space.\n\nFigure 2: Trajectories for the Jaco arm\u2019s end-effector on test set demonstrations. The trajectories\nproduced by the VAE policy and the corresponding demonstration are plotted with the same color,\nillustrating that the policy can imitate well.\n\n4.2 2D Walker\nWe found reaching behavior to be relatively easy to imitate, presumably because it does not involve\nmuch physical contact. As a more challenging test we consider bipedal locomotion. We train 60\nneural network policies for a 2D walker to serve as demonstrations\u00b3. These policies are each trained\nto move at different speeds both forward and backward depending on a label provided as additional\ninput to the policy. Target speeds for training were chosen from a set of four different speeds (m/s):\n-1, 0, 1, 3 (for the distribution of speeds that the trained policies actually achieve, see Figure 4, top\nright). Besides the target speed the reward function imposes few constraints on the behavior. The\nresulting policies thus form a diverse set with several rather idiosyncratic movement styles. While for\nmost purposes this diversity is undesirable, for the present experiment we consider it a feature.\n\n\u00b2 See appendix for details.\n\u00b3 See section A.2 in the appendix for details.\n\nFigure 4: LEFT: t-SNE plot of the embedding vectors of the training trajectories; marker color\nindicates average speed. The plot reveals a clear clustering according to speed. Insets show pairs\nof frames from selected example trajectories. Trajectories nearby in the plot tend to correspond to\nsimilar movement styles even when differing in speed (e.g. see pair of trajectories on the right hand\nside of plot). 
RIGHT, TOP: Distribution of walker speeds for the demonstration trajectories. RIGHT,\nBOTTOM: Difference in speed between the demonstration and imitation trajectories. Measured\nagainst the demonstration trajectories, we observe that the fine-tuned controllers tend to have a smaller\ndifference in speed compared to controllers without fine-tuning.\n\nWe trained our model with 20 episodes per policy (1200 demonstration trajectories in total, each\nwith a length of 400 steps or 10s of simulated time). In this experiment our full approach is required:\ntraining the VAE with BC alone can imitate some of the trajectories, but it performs poorly in general,\npresumably because our relatively small training set does not cover the space of trajectories sufficiently\ndensely. On this generated dataset, we also train policies with GAIL using the same architecture and\nhyper-parameters. Due to the lack of conditioning, GAIL does not reproduce coherent trajectories.\nInstead, it simply meshes different behaviors together. In addition, the policies trained with GAIL\nalso exhibit dramatically less diversity; see video.\nA general problem of adversarial training is that there is no easy way to quantitatively assess the\nquality of learned models. Here, since we aim to imitate particular demonstration trajectories that\nwere trained to achieve particular target speed(s), we can use the difference between the speed of the\ndemonstration trajectory and the trajectory produced by the decoder as a surrogate measure of the quality\nof the imitation (cf. also [12]).\nThe general quality of the learned model and the improvement achieved by the adversarial stage of\nour training procedure are quantified in Fig. 4. 
We draw 660 trajectories (11 trajectories each for all\n60 policies) from the training set, compute the corresponding embedding vectors using the encoder,\nand use both the VAE policy as well as the improved policy from the adversarial stage to imitate\neach of the trajectories. We determine the absolute values of the difference between the average\nspeed of the demonstration and the imitation trajectories (measured in m/s). As shown in Fig. 4, the\nadversarial training greatly improves reliability of the controller as well as the ability of the model to\naccurately match the speed of the demonstration. We also include additional quantitative analysis of\nour approach using this speed metric in Appendix B. Video of our agent imitating a diverse set of\nbehaviors can be found in the supplemental material.\nTo assess generalization to novel trajectories we encode and subsequently imitate trajectories not\ncontained in the training set. The supplemental video contains several representative examples,\ndemonstrating that the style of movement is successfully imitated for previously unseen trajectories.\nFinally, we analyze the structure of the embedding space. We embed training trajectories and perform\ndimensionality reduction with t-SNE [41]. The result is shown in Fig. 4. It reveals a clear clustering\naccording to movement speeds thus recovering the nature of the task context for the demonstration\ntrajectories. We further find that trajectories that are nearby in embedding space tend to correspond\nto similar movement styles even when differing in speed.\n\nFigure 5: Left: examples of the demonstration trajectories in the CMU humanoid domain. The\ntop row shows demonstrations from both the training and test set. The bottom row shows the\ncorresponding imitation. 
Right: Percentage of episodes in which the humanoid falls down before the end of the episode, with and\nwithout fine-tuning.\n\n4.3 Complex humanoid\nWe consider a humanoid body of high dimensionality that poses a hard control problem. The\nconstruction of this body and associated control policies is described in [20], and is briefly summarized\nin the appendix (section A.3) for completeness. We generate training trajectories with the existing\ncontrollers, which can produce instances of one of six different movement styles (see section A.3).\nExamples of such trajectories are shown in Fig. 5 and in the supplemental video.\nThe training set consists of 250 random trajectories from 6 different neural network controllers that\nwere trained to match 6 different movement styles from the CMU motion capture database\u2074. Each\ntrajectory is 334 steps or 10s long. We use a second set of 5 controllers from which we generate\ntrajectories for evaluation (3 of these policies were trained on the same movement styles as the\npolicies used for generating training data).\nSurprisingly, despite the complexity of the body, supervised learning is quite effective at producing\nsensible controllers: the VAE policy is reasonably good at imitating the demonstration trajectories,\nalthough it lacks the robustness to be practically useful. Adversarial training dramatically improves\nthe stability of the controller. We analyze the improvement quantitatively by computing the percentage\nof the humanoid falling down before the end of an episode while imitating either training or test\npolicies. The results are summarized in Figure 5 right. The figure further shows sequences of frames\nof representative demonstration and associated imitation trajectories. 
Demonstration and\nimitation behaviors can be seen in the supplemental video.\nFor practical purposes it is desirable to allow the controller to transition from one behavior to another.\nWe test this possibility in an experiment similar to the one for the Jaco arm: we determine the\nembedding vectors of pairs of demonstration trajectories, start the trajectory by conditioning on\nthe first embedding vector, and then transition from one behavior to the other half-way through the\nepisode by linearly interpolating the embeddings of the two demonstration trajectories over a window\nof 20 control steps. Although not always successful, the learned controller often transitions robustly,\ndespite not having been trained to do so. Representative examples of these transitions can be found in\nthe supplemental video.\n\n5 Conclusions\n\nWe have proposed an approach for imitation learning that combines the favorable properties of\ntechniques for density modeling with latent variables (VAEs) with those of GAIL. The result is a\nmodel that learns, from a moderate number of demonstration trajectories, (1) a semantically well\nstructured embedding of behaviors, (2) a corresponding multi-task controller that allows robust\nexecution of diverse behaviors from this embedding space, as well as (3) an encoder that can map new\ntrajectories into the embedding space and hence allows for one-shot imitation.\nOur experimental results demonstrate that our approach can work on a variety of control problems,\nand that it scales even to very challenging ones such as the control of a simulated humanoid with a\nlarge number of degrees of freedom.\n\n\u2074 See appendix for details.\n\nReferences\n[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration.\nRobotics and Autonomous Systems, 57(5):469\u2013483, 2009.\n\n[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. 
Preprint arXiv:1701.07875, 2017.\n\n[3] N. Baram, O. Anschel, and S. Mannor. Model-based adversarial imitation learning. Preprint arXiv:1612.02179, 2016.\n\n[4] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. Preprint arXiv:1703.10717, 2017.\n\n[5] A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Robot programming by demonstration. In Springer handbook of robotics, pages 1371\u20131394. 2008.\n\n[6] Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. Preprint arXiv:1703.07326, 2017.\n\n[7] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. Preprint arXiv:1701.00160, 2016.\n\n[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672\u20132680, 2014.\n\n[9] A. Graves, S. Fern\u00e1ndez, and J. Schmidhuber. Bidirectional LSTM networks for improved phoneme classification and recognition. Artificial Neural Networks: Formal Models and Their Applications\u2013ICANN 2005, pages 753\u2013753, 2005.\n\n[10] K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. Preprint arXiv:1705.10479, 2017.\n\n[11] R. D. Hjelm, A. P. Jacob, T. Che, K. Cho, and Y. Bengio. Boundary-seeking generative adversarial networks. Preprint arXiv:1702.08431, 2017.\n\n[12] J. Ho and S. Ermon. Generative adversarial imitation learning. In NIPS, pages 4565\u20134573, 2016.\n\n[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[14] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys, 50(2):21, 2017.\n\n[15] D. Kingma and M. Welling. Auto-encoding variational bayes. 
Preprint arXiv:1312.6114, 2013.\n\n[16] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer. Imitating driver behavior with generative adversarial networks. Preprint arXiv:1701.06699, 2017.\n\n[17] Y. Li, J. Song, and S. Ermon. Inferring the latent structure of human decision-making from raw visual inputs. Preprint arXiv:1703.08840, 2017.\n\n[18] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. Preprint arXiv:1509.02971, 2015.\n\n[19] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. Preprint arXiv:1611.04076, 2016.\n\n[20] J. Merel, Y. Tassa, TB. Dhruva, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess. Learning human behaviors from motion capture by adversarial imitation. Preprint arXiv:1707.02201, 2017.\n\n[21] M. Mirza and S. Osindero. Conditional generative adversarial nets. Preprint arXiv:1411.1784, 2014.\n\n[22] U. Muico, Y. Lee, J. Popovi\u0107, and Z. Popovi\u0107. Contact-aware nonlinear control of dynamic characters. In SIGGRAPH, 2009.\n\n[23] X. B. Peng, G. Berseth, K. Yin, and M. van de Panne. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. In SIGGRAPH, 2017.\n\n[24] D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88\u201397, 1991.\n\n[25] G. J. Qi. Loss-sensitive generative adversarial networks on Lipschitz densities. Preprint arXiv:1701.06264, 2017.\n\n[26] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.\n\n[27] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. Preprint arXiv:1706.04987, 2017.\n\n[28] S. Ross and A. Bagnell. 
Efficient reductions for imitation learning. In AIStats, 2010.\n\n[29] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AIStats, 2011.\n\n[30] A. Rusu, S. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. Preprint arXiv:1511.06295, 2015.\n\n[31] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.\n\n[32] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673\u20132681, 1997.\n\n[33] D. Sharon and M. van de Panne. Synthesis of controllers for stylized planar bipedal walking. In ICRA, pages 2387\u20132392, 2005.\n\n[34] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.\n\n[35] K. W. Sok, M. Kim, and J. Lee. Simulating biped behaviors from human motion data. 2007.\n\n[36] B. C. Stadie, P. Abbeel, and I. Sutskever. Third-person imitation learning. Preprint arXiv:1703.01703, 2017.\n\n[37] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. Preprint arXiv:1511.01844, 2015.\n\n[38] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026\u20135033, 2012.\n\n[39] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499, 2016.\n\n[40] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves. Conditional image generation with pixelCNN decoders. In NIPS, 2016.\n\n[41] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579\u20132605, 2008.\n\n[42] R. Wang, A. Cully, H. Jin Chang, and Y. 
Demiris. MAGAN: Margin adaptation for generative adversarial networks. Preprint arXiv:1704.03817, 2017.\n\n[43] K. Yin, K. Loken, and M. van de Panne. SIMBICON: Simple biped locomotion control. In SIGGRAPH, 2007.\n", "award": [], "sourceid": 2754, "authors": [{"given_name": "Ziyu", "family_name": "Wang", "institution": "Deepmind"}, {"given_name": "Josh", "family_name": "Merel", "institution": "DeepMind"}, {"given_name": "Scott", "family_name": "Reed", "institution": "Google DeepMind"}, {"given_name": "Nando", "family_name": "de Freitas", "institution": "DeepMind"}, {"given_name": "Gregory", "family_name": "Wayne", "institution": "Google DeepMind"}, {"given_name": "Nicolas", "family_name": "Heess", "institution": "Google DeepMind"}]}