{"title": "SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies", "book": "Advances in Neural Information Processing Systems", "page_first": 7881, "page_last": 7891, "abstract": "Imitation Learning (IL) has been successfully applied to complex sequential decision-making problems where standard Reinforcement Learning (RL) algorithms fail. A number of recent methods extend IL to few-shot learning scenarios, where a meta-trained policy learns to quickly master new tasks using limited demonstrations. However, although Inverse Reinforcement Learning (IRL) often outperforms Behavioral Cloning (BC) in terms of imitation quality, most of these approaches build on BC due to its simple optimization objective. In this work, we propose SMILe, a scalable framework for Meta Inverse Reinforcement Learning (Meta-IRL) based on maximum entropy IRL, which can learn high-quality policies from few demonstrations. We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILe significantly outperforms Meta-BC. Furthermore, we observe that SMILe performs comparably or outperforms Meta-DAgger, while being applicable in the state-only setting and not requiring online experts. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the function approximator setting. 
For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md.", "full_text": "SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies

Seyed Kamyar Seyed Ghasemipour
University of Toronto
Vector Institute
kamyar@cs.toronto.edu

Richard Zemel
University of Toronto
Vector Institute
zemel@cs.toronto.edu

Shixiang Gu
Google Brain
shanegu@google.com

Abstract

Imitation Learning (IL) has been successfully applied to complex sequential decision-making problems where standard Reinforcement Learning (RL) algorithms fail. A number of recent methods extend IL to few-shot learning scenarios, where a meta-trained policy learns to quickly master new tasks using limited demonstrations. However, although Inverse Reinforcement Learning (IRL) often outperforms Behavioral Cloning (BC) in terms of imitation quality, most of these approaches build on BC due to its simple optimization objective. In this work, we propose SMILe, a scalable framework for Meta Inverse Reinforcement Learning (Meta-IRL) based on maximum entropy IRL, which can learn high-quality policies from few demonstrations. We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILe significantly outperforms Meta-BC. Furthermore, we observe that SMILe performs comparably or outperforms Meta-DAgger, while being applicable in the state-only setting and not requiring online experts. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the function approximator setting.
For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md.

1 Introduction

Modern advances in reinforcement learning aim to alleviate the need for hand-engineered decision-making and control algorithms by designing general purpose methods that learn to optimize provided reward functions. In many cases however, it is either too challenging to optimize a given reward (e.g. due to sparsity of signal), or it is simply impossible to design a reward function that captures the intricate details of desired outcomes. One approach to overcoming such hurdles is Imitation Learning (IL) (also known as Learning from Demonstrations (LfD)) [29, 28, 21], where algorithms are provided with expert demonstrations of how to accomplish desired tasks.

In practical applications such as robotics we are interested in learning policies for accomplishing a variety of tasks with very few demonstrations. As a result, a number of recent works study IL in the context of few-shot or meta learning [3, 7, 35, 33]. By leveraging the shared structure among tasks, these approaches can learn a generic policy that adapts quickly to related new tasks using small sets of new demonstrations.

Despite the importance and popularity of Meta-IL, existing approaches [3, 7, 35] mostly rely on Behavioral Cloning (BC) [28], where learning from demonstrations is treated as a supervised learning problem and policies are trained to regress expert actions from a dataset of expert demonstrations. On the other hand, much less work explores meta learning through the lens of Inverse Reinforcement Learning (IRL) [21, 1], whose aim is to infer the reward function of the expert, and subsequently train a policy to optimize this reward.
In single-task benchmarks for continuous control [2], recent IRL methods [17, 9] have been shown to outperform BC by a wide margin, particularly in the low data regime where a very limited number of expert trajectories are available. Furthermore, IRL algorithms can be employed even in settings where we only observe the expert's states, and not their actions. As a result, within the problem of Meta-IL, it is desirable to develop scalable methods for Meta-IRL in particular. However, due to IRL algorithms being significantly more complex than their BC counterparts, prior approaches to Meta-IRL have either failed to scale to the function approximator setting [34], or require pre-training procedures that are beyond the state of the art and easily become computational bottlenecks [33].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this work, we interpret the problem of Meta-IRL as IRL conditioned upon a context, where the context takes the form of expert demonstrations. We present an efficient method that scales to the function approximator setting, and demonstrate its performance on difficult high-dimensional continuous control environments. We call our algorithm Scalable Meta-Inverse reinforcement Learning, or SMILe. Reflecting the results from prior work in the single-task setting, we observe that even with exceedingly large amounts of demonstrations, Meta Behavioral Cloning performs very poorly. However, even in the state-only setting - where expert actions are unobserved - and with a very limited set of demonstrations, our proposed Meta-IRL approach can achieve close to expert performance.
Additionally, as our method is not dependent on any particular form of context, our work opens avenues for Meta-IRL conditioned on different forms of data such as linguistic task descriptions or images of desired goal states.

2 Background

In this section we discuss relevant background on Max-Ent IRL and modern approaches to this problem, which our approach to Meta-IRL builds on.

2.1 Maximum Entropy Inverse Reinforcement Learning

Consider a Markov Decision Process (MDP) represented as a tuple (S, A, P, r) with state-space S, action-space A, dynamics P, and reward function r(s, a). In Maximum Entropy (Max-Ent) reinforcement learning [30, 32, 26, 8, 15, 16], the goal is to find a policy π such that trajectories sampled using this policy follow the distribution

p(τ) = (1/Z) exp(R(τ)) p(s_0) ∏_t p(s_{t+1} | s_t, a_t)

where τ = (s_0, a_0, s_1, a_1, ...) denotes a trajectory, R(τ) = Σ_t r(s_t, a_t), and Z = ∫_τ exp(R(τ)). Hence, trajectories that accumulate more reward are exponentially more likely to be sampled, provided they have similar likelihood under the MDP dynamics.

Conversely to the standard RL setting, in Max-Ent IRL [37, 36] we are instead presented with an optimal policy π_exp - or, more realistically, sample trajectories from such a policy - and we seek to find a reward function r that maximizes the likelihood of the trajectories sampled from π_exp. Formally, the simplified form of the objective is: max_r E_{τ∼π_exp}[R(τ) − log Z]. As in all energy-based models, optimizing a maximum likelihood learning objective is difficult due to the need to estimate the partition function.
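To make the trajectory distribution concrete, its unnormalized log-density can be computed in a few lines. This is an illustrative sketch only; the callables `log_p0`, `log_dyn`, and `reward` are hypothetical stand-ins for the MDP's initial-state model, dynamics, and reward, not code from the paper:

```python
def maxent_unnorm_logp(traj, log_p0, log_dyn, reward):
    """Unnormalized Max-Ent log-density of a trajectory:
    log p(s0) + sum_t [ log p(s_{t+1} | s_t, a_t) + r(s_t, a_t) ].

    `traj` is a list of (s, a, s_next) transitions; `log_p0`, `log_dyn`,
    and `reward` are caller-supplied functions (hypothetical stand-ins
    for the MDP components). Subtracting log Z would give the normalized
    log-density, but the partition function is intractable in general.
    """
    logp = log_p0(traj[0][0])  # log-probability of the initial state
    for s, a, s_next in traj:
        logp += log_dyn(s_next, s, a) + reward(s, a)
    return logp
```

Exponentiating this quantity shows why higher-reward trajectories are exponentially more likely under the model, holding their likelihood under the dynamics fixed.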
Initial methods addressed this problem using dynamic programming [37, 36], and recent approaches present methods aimed at intractable domains with unknown dynamics [6, 17, 5, 9, 20].

2.2 Adversarial Max-Ent IRL

Instead of recovering the expert's reward function and subsequently searching for the optimal policy, recent successful methods in Max-Ent IRL aim to directly recover the policy that would result from this process. Since such methods only recover the policy, it would be more accurate to refer to them as Imitation Learning algorithms.

[17] present one of the first adversarial formulations of Max-Ent IRL, with a successful special case dubbed GAIL. Subsequent to the advent of GAIL [17], [5] present a theoretical discussion of an adversarial training method for Max-Ent IRL that could recover the Max-Ent reward function from expert demonstrations and simultaneously train the Max-Ent policy corresponding to that reward. [9] present a practical implementation of this method, dubbed Adversarial Inverse Reinforcement Learning (AIRL), alongside additional contributions that are not directly relevant to our work.

[Figure 1: graphical models for (a) Meta (Max-Ent) IRL and (b) Meta BC]
Figure 1: Graphical models describing the generative process in Meta-IRL and Meta-BC. E represents the initial state distribution and dynamics (i.e. the environment) for task T, and C represents a context. Grey and white nodes denote observed and unobserved variables respectively. In Meta-IRL the aim is to match the state-action marginal of a context-conditioned policy to that of the task expert, whereas in Meta-BC the aim is to match expert actions in states visited by the expert.
The method for AIRL is formulated as follows: let ρ_exp(s, a) and ρ_π(s, a) denote the state-action marginal distributions of the expert and student policy respectively, and let D(s, a): S × A → [0, 1] be the discriminator. AIRL defines the reward function r(s, a) := log D(s, a) − log(1 − D(s, a)), and sets the objective for the student policy to be the RL objective, max_π E_{ρ_π(s,a)}[Σ_t r(s_t, a_t)]. This leads to an iterative optimization process alternating between optimizing the discriminator and the policy, and it is shown that this procedure implicitly optimizes the Max-Ent IRL objective.

Performance With Respect to BC   Methods such as GAIL and AIRL have demonstrated significant performance gains compared to BC. In particular, in standard Mujoco benchmarks [31, 2], direct methods for Max-Ent IRL achieve strong performance using limited amounts of demonstrations from an expert policy, an important failure scenario for BC.

2.3 Max-Ent IRL as Distribution Matching

As noted above, [17] present a class of methods for Max-Ent IRL that directly retrieve the expert policy without explicitly finding the reward function of the expert. They demonstrate that the Max-Ent IRL problem is equivalent to matching ρ_π(s, a) to ρ_exp(s, a), and that GAIL minimizes the Jensen-Shannon divergence between these two distributions. In concurrently published work, we have demonstrated that the AIRL [9] objective optimizes the reverse KL divergence KL(ρ_π(s, a) || ρ_exp(s, a)).
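For intuition, the AIRL reward defined above is simply the discriminator's logit; the following minimal sketch (an illustration, not the authors' implementation) makes that identity explicit:

```python
import math

def airl_reward(d):
    """AIRL-style reward r(s, a) = log D(s, a) - log(1 - D(s, a)) for a
    discriminator output d = D(s, a) in (0, 1). This equals the
    discriminator's logit: confident 'expert' classifications yield
    positive reward, confident 'policy' classifications negative reward,
    and an uninformative D = 0.5 yields zero reward.
    """
    return math.log(d) - math.log(1.0 - d)
```

In practice the logit is usually read off the discriminator network directly rather than recomputed from a clamped probability, which avoids numerical issues as D approaches 0 or 1.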
Using a connection to f-GANs [23], we presented a generalization of AIRL for minimizing any f-divergence between the two distributions.

The objective for standard Behavioral Cloning, max_π E_{ρ_exp(s,a)}[log π(a|s)], does not depend on the states visited by the student policy and only encourages the policy's actions to match those of the expert in the states visited by the expert. At test-time, due to not exactly matching the expert, BC policies will visit new states for which they did not receive expert supervision. This covariate shift results in compounding errors in subsequent timesteps and has been identified as one of the most important factors contributing to Behavioral Cloning's poor performance [27]. In contrast, the results from prior work discussed above demonstrate that the core defining characteristic of Max-Ent IRL is not its interpretation of recovering the expert's reward function, but instead the matching of ρ_exp(s, a) and ρ_π(s, a). This encourages the imitation policy to learn to visit similar states as the expert.

3 Meta Inverse Reinforcement Learning

3.1 Problem Formulation

We begin by framing the Meta-IRL problem we seek to solve. Let p(T) denote a distribution of tasks we are interested in acquiring a policy for. Per task, we have its initial state and transition distributions (i.e. the environment for that task), which we denote using the variable E. p(C|T) represents a distribution over contexts for task T. Contexts for a task can take many forms; they can be a set of demonstrations, language instructions, or even an image representing a desired goal state. Given access to an environment E, and provided with a context C, a context-conditional policy induces a joint distribution over states and actions, ρ_π(s, a | C, E). This generative process can be described using the graphical model shown in Figure 1a. Let ρ_exp(s, a | C, E) = ∫_T p(T | C, E) ρ_exp(s, a | T, E) denote the posterior expert state-action distribution. Building on the discussion in section 2.3, our Meta-IRL objective takes the form,

min_π E_{p(C,E)} [ M( ρ_π(s, a | C, E) || ρ_exp(s, a | C, E) ) ]    (1)

where M is any measure of discrepancy between distributions¹. Intuitively, this objective requires the policy to match expert behaviour for different tasks, proportional to the likelihood of each task given the context and environment. The counterpart Meta-BC objective takes the following form,

min_π E_{p(C,E)} [ E_{ρ_exp(s | C, E)} [ M( ρ_π(a | s, C, E) || ρ_exp(a | s, C, E) ) ] ]    (2)

3.2 Approach

Inspired by AIRL [9], the specific form of objective 1 we choose to optimize is,

min_π E_{p(C,E)} [ KL( ρ_π(s, a | C, E) || ρ_exp(s, a | C, E) ) ]    (3)

Similar to AIRL, we employ the following iterative optimization scheme,

max_D E_{p(C,E)} [ E_{ρ_exp(s,a | C,E)} [log D(s, a, C, E)] + E_{ρ_π(s,a | C,E)} [log(1 − D(s, a, C, E))] ]    (4)

max_π E_{p(C,E)} [ E_{τ∼π(·|C,E)} [ Σ_t log D(s_t, a_t, C, E) − log(1 − D(s_t, a_t, C, E)) ] ]    (5)

where the discriminator model is conditioned upon the context and environment, and τ ∼ π(·|C, E) denotes sampling trajectories in environment E using the policy conditioned on context C. As in the AIRL formulation, the objective for each conditional policy is an RL objective using a reward defined by the discriminator. Many practical considerations are necessary in order to convert equations 4 and 5 into an efficient and effective algorithm.
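A Monte-Carlo estimate of the discriminator objective (4) reduces to the familiar GAN cross-entropy form: average log D over expert samples and log(1 − D) over policy samples for the sampled tasks and contexts. The sketch below is an illustration under that assumption, not the paper's code:

```python
import math

def discriminator_objective(d_expert, d_policy):
    """Monte-Carlo estimate of the context-conditional GAN objective:
    mean log D over expert state-action samples plus mean log(1 - D)
    over policy samples. Inputs are discriminator outputs D(s, a, C)
    in (0, 1) for minibatches of expert and policy state-action pairs;
    the quantity is maximized with respect to the discriminator.
    """
    pos = sum(math.log(d) for d in d_expert) / len(d_expert)
    neg = sum(math.log(1.0 - d) for d in d_policy) / len(d_policy)
    return pos + neg
```

An uninformative discriminator (D = 0.5 everywhere) scores −2 log 2; a discriminator that separates expert from policy samples scores higher, which is what gradient ascent on this estimate drives toward.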
The following sections outline such practicalities.

3.3 Optimizing the Discriminator Objective

Conditioning on E   In addition to conditioning on the state, action, and context, the discriminator must take into account the environment E. However, as there does not exist a convenient representation for an environment, we decide to not condition the discriminator model on this variable. This choice can be particularly hindering in settings where the different tasks are defined by their varying dynamics. In such scenarios we rely on the context variable C conveying the necessary information regarding the environment. For example, when the context consists of a set of expert demonstrations, it is reasonable to assume we can acquire information about the dynamics through observing the expert's trajectories.

Monte-Carlo Estimation   We would like to optimize the discriminator objective using Monte-Carlo estimates. To do so, recalling that ρ_exp(s, a | C, E) = ∫_T p(T | C, E) ρ_exp(s, a | T, E), and removing E from the discriminator's input, we can rewrite equation 4 as,

max_D E_{p(T,C,E)} [ E_{ρ_exp(s,a | T,E)} [log D(s, a, C)] + E_{ρ_π(s,a | C,E)} [log(1 − D(s, a, C))] ]    (6)

In this form we can see how to obtain estimates of the discriminator objective. After obtaining a sample task, context, and environment, the first term can be estimated by sampling state-action pairs through rollouts of the task's expert in the given environment. Similarly, the second term can be estimated through rollouts of the conditional policy.

¹ e.g. KL divergence or Wasserstein Distance

Off-Policy Training   An important computational bottleneck in the GAIL [17] and AIRL [9] algorithms is that the discriminator and policy must be trained with on-policy samples.
This is a large computational burden since after every update to the policy we must generate multiple trajectories in order to train the discriminator. The meta-learning setting exacerbates this problem as we must now generate multiple trajectories for many tasks. In this work we take inspiration from [20] and instead maintain a replay buffer of policy rollouts per task. We then estimate the second term of equation 6 by sampling a task T ∼ p(T), sampling C, E ∼ p(C, E | T), and finally sampling state-action pairs from the replay buffer of task T. The main effect of this approach is that the discriminator now estimates the likelihood ratio between the experts and the mixture of policies that populated the replay buffers.

3.4 Optimizing the Policy Objective

Obtaining a Context Representation   In order to optimize their respective objectives, both the policy and the discriminator models must learn to use provided contexts. The naïve approach would be to allow the discriminator and policy to learn their own representations. However, since the discriminator is trained using state-action pairs from both the experts as well as the policy, it is much better poised for learning effective representations of the context. On the other hand, the policy must learn to extract important aspects of the context solely through the reward signal it obtains from the discriminator. As a result, we choose to allow the discriminator to train its own context-encoder model, and have the policy adapt itself to the discriminator's representation of the context. Not only does this design choice remove the computational cost of training a second encoder, it also leads to a stable and effective learning algorithm.

Off-Policy Training   As mentioned above, we would like to leverage off-policy training for improved sample-efficiency.
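The per-task replay buffers described above can be sketched as follows; the class name and interface here are hypothetical illustrations of the scheme, not the paper's implementation:

```python
import random

class PerTaskReplay:
    """Minimal per-task replay scheme: one buffer of transitions per
    task; sampling first selects a task, then draws transitions from
    that task's buffer. Mixing many past policies' rollouts in each
    buffer is what makes the discriminator estimate a ratio against
    that mixture rather than the current policy alone.
    """
    def __init__(self):
        self.buffers = {}  # task_id -> list of stored transitions

    def add(self, task_id, transition):
        self.buffers.setdefault(task_id, []).append(transition)

    def sample(self, task_id, batch_size):
        buf = self.buffers[task_id]
        # Sample with replacement, as is common for replay buffers.
        return [random.choice(buf) for _ in range(batch_size)]
```

A production version would typically bound each buffer's size and store contexts alongside transitions, but the task-then-transition sampling order is the essential point.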
To correctly optimize the policy objective using samples from the replay buffer, importance sampling must be employed. However, it can be quite challenging to accurately estimate the necessary importance weights. [20] demonstrate that omitting these weights works well in practice. We follow the same approach and obtain the sample-efficiency of off-policy training without empirical hindrance to policy performance.

3.5 Algorithm

Taking the considerations above into account, we arrive at the final form of our approach, SMILe. In this section we present a conceptual overview of SMILe and defer exact details to Algorithm 1 in Appendix A. The SMILe training procedure alternates between generating rollouts and updating models.

Generating Rollouts   In each iteration we add m trajectories to the task replay buffers by sampling m tasks from the meta-train set, sampling a context for each task, and generating a trajectory using the context-conditioned policies in the corresponding environments. After generating the rollouts, we alternate d times between updating the discriminator and the policy.

Updating the Discriminator   To update the discriminator model, including its context encoder, we first obtain a minibatch of meta-train tasks as well as a context for each. For a given task T, positive samples for the discriminator objective have the form (s, a, C_T), where (s, a) is a state-action pair sampled from T's expert demonstrations, and C_T is the task context. Similarly, negative batches are formed by concatenating (s, a) pairs from each task's replay buffer with corresponding contexts.

Updating the Policy   Throughout this work we perform policy updates using the Soft Actor-Critic (SAC) algorithm [16]. In this step, we first sample a minibatch of tasks and contexts, and obtain a representation for each context using the discriminator's encoder.
To form an update batch for the policy, Q, and value functions, we sample (s, a, s′) transitions from each task's replay buffer and concatenate the context representation to the states. Subsequently, an SAC update step is done using the rewards defined through the discriminator. We do not update the encoder in this step.

An important advantage of IRL compared to BC is that it also enables Imitation Learning using state-only demonstrations. In such settings the discriminator model takes as input state-next-state pairs (s, s′) instead of state-action pairs (s, a).

Method               | 4 demos, 20 sub | 16 demos, 20 sub | 64 demos, 20 sub | 64 demos, 1 sub
Meta-BC              | 0.931 ± 0.620   | 0.643 ± 0.410    | 0.468 ± 0.274    | 0.340 ± 0.188
SMILe (State-Action) | 0.181 ± 0.095   | 0.125 ± 0.060    | 0.102 ± 0.100    | 0.118 ± 0.078
SMILe (State-Only)   | 0.186 ± 0.107   | 0.113 ± 0.058    | 0.121 ± 0.086    | 0.138 ± 0.074
Meta-DAgger          | —               | —                | —                | 0.082 ± 0.059

Table 1: Mean and standard deviation of the absolute difference between target velocity and the velocity of trained models (lower is better). "n demos, m sub" indicates that n expert trajectories were provided per task, with each trajectory uniformly subsampled by a factor of m starting at a random offset. As a reminder, meta-test target velocities lie in [0.05, 2.95]. Even with a very large number of demonstrations per task, Meta-BC obtains velocities quite distant from the target, whereas with quite few demonstrations, both variants of Meta-IRL obtain respectable performance. Meta-DAgger outperforms SMILe in this experiment, at the expense of a much larger number of queries to an online expert. Full evaluation details in Appendix C.

4 Experiments

As discussed above, contexts may take many forms.
In this work, however, we focus exclusively on the Meta Imitation Learning problem and use expert trajectories as contexts. Through our experiments we seek to empirically evaluate the following questions:

• Can SMILe significantly outperform Meta-BC, analogous to the single-task setting?
• Can expert demonstrations be an effective source of information about dynamics?
• Can SMILe be an effective learning algorithm when it is necessary to combine information from multiple demonstration trajectories?

To answer these questions we compare state-action and state-only variants of SMILe to a strong Meta-BC baseline on a varied range of simulated continuous control problems. For each experiment the Meta-BC baseline model is obtained by composing SMILe's context encoder and policy models. We also compare SMILe to a meta-learning variant of DAgger [27], a Behavioural Cloning algorithm designed to overcome the covariate shift problems associated with standard BC. Meta-DAgger requires online access to the experts, and hence is not directly comparable to SMILe. Nonetheless, it can serve as an informative upper bound on the performance we may expect from standard BC. Full baseline model and training details are presented in Appendix B. In all experiments below we use the same form of context encoder model, inspired by permutation-invariant encoders used in recent meta-learning literature [10, 13]. This modelling choice is discussed in Appendix A.

4.1 HalfCheetah Random Velocity

The HalfCheetah Random Velocity task is a popular baseline for meta-learning in standard RL. To evaluate SMILe, we adapt this task for the Few-Shot Imitation Learning setup. Each task is defined by a target velocity that we wish a HalfCheetah agent to maintain over the duration of an episode; episodes are of length 1000 and start with the agent at standstill.
Target velocities for meta-train tasks range from 0 to 3, uniformly spaced at 0.1 intervals, and meta-test tasks are defined by the range 0.05 to 2.95, uniformly spaced at 0.1 intervals. In this experiment, a sampled context is a random number (1 to 4) of expert trajectories. To obtain expert demonstrations, we train an expert policy using Soft Actor-Critic [16] which observes as part of the state the desired target velocity. We train all models using various amounts of total expert demonstrations and evaluate on the meta-test tasks using context trajectories generated by the pre-trained expert. Table 1 presents evaluation results. As can be seen, both variants of Meta-IRL very significantly outperform Meta-BC: with the least amount of demonstrations, Meta-IRL is able to achieve respectable performance, whereas Meta-BC obtains quite poor results even with the exceedingly large maximum amount of demonstrations used. In this experiment we observe that Meta-DAgger can outperform SMILe, at the expense of a much larger number of queries to an online expert.

4.2 Ant 2D Goal

The second set of tasks we experiment with is 2D navigation using a simulated Ant agent [2]. The agent starts each episode in the center of the environment at coordinates (0, 0) and has to navigate to a certain point on the perimeter of a circle with radius 2.0 centered at the origin. Episodes have a horizon of 100 timesteps. The meta-training set consists of 32 target positions located at every integer multiple of π/16 radians on the circle. The meta-testing set consists of 16 targets located at every 2nπ/16 + π/32 angle on the circle. In this experiment contexts are single expert trajectories. To obtain expert demonstrations, we found it challenging to train a single expert policy that was uniformly good at navigating to the different targets. As a result, we trained separate experts for each train and test target location.

[Figure 2: panels for SMILe (State-Only), SMILe (State-Action), Meta-DAgger, and Meta-BC]
Figure 2: Mean and standard deviation of minimum distance when attempting to reach meta-test targets (lower is better) on the Ant 2D tasks. The x-axis denotes the angle of the target on the circle, and the y-axis denotes how closely the agent approached the target. Rows from top to bottom present results when trained on 4, 16, and 64 demonstrations per task. Despite its stronger assumptions, Meta-DAgger performs comparably to SMILe. Full evaluation details in Appendix D.2. In Appendix D.3 we also demonstrate that RL is not sufficient for learning this task and that Imitation Learning is necessary to overcome the exploration problem.

Figure 2 presents results when training on 4, 16, and 64 demonstrations per meta-train task (4 random seeds per model per setting). As can be seen, both variants of our method significantly outperform Meta-BC: models trained with Meta-IRL learn to approach most meta-test target locations well, whereas Meta-BC models barely get close to most targets. We also observe that despite its stronger assumptions Meta-DAgger performs comparably to both variants of SMILe.

4.3 Handling Varying Dynamics

In section 3.3 we described that we do not condition directly on the environment variable E and rely on contexts when it is necessary to take dynamics into account. In this section we evaluate the feasibility of this approach by designing an experiment where tasks are defined by their varying dynamics. Specifically, we begin with the Walker environment [2] and generate different tasks by randomizing various dynamics variables². We use 50 meta-train tasks and perform evaluations on 25 meta-test tasks.
As in the previous section we train separate expert policies for each of the meta-train and meta-test tasks.³ Per task we generate 32 expert trajectories (with a subsampling factor of 20) and the contexts used are single expert trajectories.

Despite having separate experts for each task, we continue to observe large variability in expert rewards obtained for each environment. This leads us to believe that some dynamics settings produce intrinsically harder tasks. As a result, we introduce the following custom evaluation metric. First, we trained a policy using RL on the unmodified Walker environment. Let R^i_base denote its performance on task i. Additionally, let R^i_expert denote expert performance for task i. For a trained meta-policy π, the evaluation metric we propose is: Eval(π) := (1/N) Σ_{i=1}^{N} (R^i_π − R^i_base) / (R^i_expert − R^i_base), where R^i_π is the performance of the meta-policy on task i, estimated by sampling multiple contexts for task i and averaging the performance of the respective conditional policies.

Table 2 (Left) presents evaluations using our custom metric as well as Walker rewards.⁴ Indeed we observe that policies trained with SMILe learn to adapt to new dynamics through conditioning on a single expert trajectory, whereas Meta-BC policies fail to outperform a baseline policy trained only on the unmodified environment.

² We discuss exact details of dynamics randomizations in Appendix E.2
³ The reasoning behind this choice is discussed in Appendix E.1
⁴ As defined in the original Walker task [2]

(Left) Walker Random Dynamics
Method               | Returns        | Custom Metric
Meta-BC              | 1289.9 ± 63.0  | -0.403 ± 0.054
SMILe (State-Action) | 3106.6 ± 67.6  | 0.710 ± 0.036
SMILe (State-Only)   | 3159.3 ± 84.9  | 0.738 ± 0.061
Meta-DAgger          | 2404.1 ± 91.7  | 0.275 ± 0.089

(Right) Ant Linear Classification
Method               | Success Rate | No-Op Rate
Meta-BC              | 0.50 ± 0.00  | 0.00 ± 0.00
SMILe (State-Action) | 0.68 ± 0.01  | 0.02 ± 0.02
SMILe (State-Only)   | 0.65 ± 0.01  | 0.01 ± 0.00
Meta-DAgger          | 0.67 ± 0.00  | 0.14 ± 0.01

Table 2: (Left) Walker random dynamics. Results demonstrate that policies trained with SMILe learn to adapt to varying meta-test dynamics by conditioning on single expert trajectories. Note that the custom metric compares models to expert policies trained separately for each dynamics setting. (Right) Ant Linear Classification. Results show that while policies trained with Meta-BC learned the locomotion aspect of the task, they failed to connect their movements to the linear classification problem.

Interestingly, we observe that despite using orders of magnitude more queries from online task experts, Meta-DAgger cannot compete with SMILe. We believe this demonstrates the strong advantage of matching state-marginals using Inverse Reinforcement Learning (Section 2.3 and Figure 1): due to the complexity of handling varying dynamics, not matching expert actions well can result in the Meta-DAgger policies quickly tumbling over. In contrast, to match the expert state distributions the SMILe policies will directly try not to tumble over, and can hence reach the later states.

4.4 Aggregating Information Across Demonstrations

In the HalfCheetah experiment described in section 4.1, contexts could be between 1 and 4 expert trajectories. Figure 4 indicates that for this task very little gain is made from conditioning on an increasing number of demonstrations. As a result, in this section we present an experiment where combining information across multiple trajectories is necessary for good policy performance; we train a Mujoco Ant model to learn few-shot linear classification in 4 dimensions.

Each task is defined by a 4-dimensional hyperplane passing through the origin. For a given task, the Ant starts each episode at position (0, 0) and observes as part of its state an ordered pair of 4-dimensional vectors sampled from [−1, 1]⁴, one from each side of the hyperplane.
To make a classification decision, the Ant must navigate to one of two target locations, positioned at (1.41, 1.41) and (-1.41, 1.41), depending on whether it believes the first vector is from the positive or the negative class. Contexts are a random number of expert demonstrations (1 to 8), and a classification decision is registered when the Ant is within a radius of 0.5 of either target. Each episode is 50 timesteps long. To generate expert demonstrations we scripted a policy that manually drives two of the experts trained for the "Ant 2D Goal" experiment. We trained using 64 tasks with 16 demonstrations each, and evaluated models on 32 meta-test tasks.
Table 2 (Right) demonstrates that policies trained with SMILe can successfully combine information from multiple trajectories in order to perform classification, while Meta-BC performs at random. Interestingly, given the large number of demonstrations for only two target locations, the poor performance of Meta-BC models is not due to locomotion failure; it is due to their inability to connect locomotion to the classification task at hand. Lastly, we observe that Meta-DAgger achieves a success rate similar to SMILe's, yet the number of episodes in which no classification decision is made is significantly higher than for all other methods.

5 Related Work

In this section we situate our work with respect to prior meta-learning algorithms for inverse RL, specifically those geared towards the Few-Shot setting and towards scaling to intractable environments.
[12] present two methods for Meta-IRL, one extending the maximum causal entropy formulation [36], and the other extending the Adversarial IRL (AIRL) formulation [9]. To adapt AIRL to the Few-Shot setting, the discriminator is optimized using the Reptile meta-learning method [22].
Every time a new task is sampled, the discriminator and policy are reinitialized and reoptimized, the hope being that, due to Reptile, the initialization of the discriminator will lead to fast adaptation. This is a computationally expensive procedure at both train and test time. Instead, in our work we learn amortized versions of the discriminator and policy models, which removes this computational overhead. Furthermore, the most challenging domain used in [12] is a modified Mountain-Car domain, a single-dimensional control problem.
Concurrent with [12], [34] extend IRL to the meta-learning setup using the MAML algorithm [4], a meta-learning method closely related to Reptile [22]. This work focuses on the setting where the observation space is high-dimensional, yet the state-action space is discrete and tractable. As a result of this choice, they use a model to predict the reward for every state-action pair in the task. Extending this work to intractable continuous control environments requires significant modifications to the algorithm, likely leading to a method similar to that of [12], which inherits the challenges described above.
The most similar prior work to our method is that of [33]. In this work, first a powerful VAE [19] is trained on the full dataset of expert demonstrations; a bidirectional LSTM model encodes a single trajectory and infers a posterior distribution, while a WaveNet [24] decoder reconstructs the states and actions of this trajectory. After this pre-training step, the VAE is frozen.
When training on a task from the task distribution, an expert trajectory is passed through the VAE encoder to obtain a sample from the posterior distribution, and both the IRL algorithm and the policy condition on this sample. While this work demonstrates impressive experimental results, training good generative models of full trajectories is a difficult and expensive procedure that requires powerful models (consider, for example, image-based domains). Indeed, learning good dynamics models is a difficult, active problem studied by the model-based RL research community. Our work requires no such generative model to be trained.
Interestingly, concurrent with our work, [25] present a very effective Meta-RL algorithm that incorporates many choices similar to SMILe's: a permutation-invariant context encoder is used to encode samples from a task's replay buffer, and the meta-policy learns to use this information to adapt itself more effectively. As there is no discriminator involved, the gradients from the Q function are used to update the encoder.

6 Conclusion & Future Work

In this work we propose SMILe, a novel framework for Meta Inverse Reinforcement Learning. Interpreting the Meta Imitation Learning problem from a context-conditional graphical model perspective, we develop, to our knowledge, the first efficient algorithm for Meta-IRL that scales to the function approximator setting. We demonstrated the effectiveness of our method on a series of difficult Meta Imitation Learning problems in high-dimensional continuous control, and observed that SMILe consistently outperforms Meta-BC. Furthermore, we observed that SMILe performed comparably to or outperformed Meta-DAgger, while being applicable in the state-only setting and not requiring online experts.
In future work we hope that our framework and practical implementation details can be applied to Imitation Learning from other forms of contexts, such as linguistic input or images of desired goal states.

References

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.

[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[3] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, pages 1087-1098, 2017.

[4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126-1135, 2017.

[5] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

[6] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49-58, 2016.

[7] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.

[8] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.

[9] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning.
In International Conference on Learning Representations, 2018.

[10] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.

[11] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Gabriel Huang, Remi Lepriol, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740, 2018.

[12] Adam Gleave and Oliver Habryka. Multi-task maximum entropy inverse reinforcement learning. arXiv preprint arXiv:1805.08882, 2018.

[13] Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard Turner. Meta-learning probabilistic inference for prediction. 2018.

[14] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.

[15] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352-1361, 2017.

[16] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[17] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[20] Ilya Kostrikov, Kumar Krishna Agrawal, Sergey Levine, and Jonathan Tompson.
Addressing sample inefficiency and reward bias in inverse reinforcement learning. arXiv preprint arXiv:1809.02925, 2018.

[21] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.

[22] Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.

[23] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271-279, 2016.

[24] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[25] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.

[26] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

[27] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627-635, 2011.

[28] Stuart J Russell. Learning agents for uncertain environments. In COLT, volume 98, pages 101-103, 1998.

[29] Stefan Schaal. Learning from demonstration. In Advances in neural information processing systems, pages 1040-1046, 1997.

[30] Emanuel Todorov. General duality between optimal control and estimation.
In 2008 47th IEEE Conference on Decision and Control, pages 4286-4292. IEEE, 2008.

[31] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033. IEEE, 2012.

[32] Marc Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pages 1049-1056. ACM, 2009.

[33] Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pages 5320-5329, 2017.

[34] Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, and Chelsea Finn. Few-shot intent inference via meta-inverse reinforcement learning. 2018.

[35] Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.

[36] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.

[37] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438, 2008.