Meta-Reinforcement Learning of Structured Exploration Strategies

Advances in Neural Information Processing Systems, pages 5302–5311

Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, Sergey Levine
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{abhigupta, pabbeel, svlevine}@eecs.berkeley.edu
{russellm, yuxuanliu}@berkeley.edu

Abstract

Exploration is a fundamental challenge in reinforcement learning (RL).
Many current exploration methods for deep RL use task-agnostic objectives, such as information gain or bonuses based on state visitation. However, many practical applications of RL involve learning more than a single task, and prior tasks can be used to inform how exploration should be performed in new tasks. In this work, we study how prior tasks can inform an agent about how to explore effectively in new situations. We introduce a novel gradient-based fast adaptation algorithm – model agnostic exploration with structured noise (MAESN) – to learn exploration strategies from prior experience. The prior experience is used both to initialize a policy and to acquire a latent exploration space that can inject structured stochasticity into a policy, producing exploration strategies that are informed by prior knowledge and are more effective than random action-space noise. We show that MAESN is more effective at learning exploration strategies when compared to prior meta-RL methods, RL without learned exploration strategies, and task-agnostic exploration methods. We evaluate our method on a variety of simulated tasks: locomotion with a wheeled robot, locomotion with a quadrupedal walker, and object manipulation.

1 Introduction

Deep reinforcement learning methods have been shown to learn complex tasks ranging from games [17] to robotic control [14, 20] with minimal supervision, by simply exploring the environment and experiencing rewards. As tasks become more complex or temporally extended, simple exploration strategies become less effective. Prior works have proposed guiding exploration based on criteria such as intrinsic motivation [23, 26, 25], state-visitation counts [16, 27, 2], Thompson sampling and bootstrapped models [4, 18], optimism in the face of uncertainty [3, 12], and parameter-space exploration [19, 8].
These exploration strategies are largely task agnostic, in that they aim to\nprovide good exploration without exploiting the particular structure of the task itself.\nHowever, an intelligent agent interacting with the real world will likely need to learn many tasks, not\njust one, in which case prior tasks should be used to inform how exploration in new tasks should be\nperformed. For example, a robot that is tasked with learning a new household chore likely has prior\nexperience of learning other related chores. It can draw on these experiences to decide how to explore\nthe environment to acquire the new skill more quickly. Similarly, a walking robot that has previously\nlearned to navigate different buildings doesn\u2019t need to reacquire the skill of walking when it must\nlearn to navigate through a maze, but simply needs to explore in the space of navigation strategies.\nIn this work, we study how experience from multiple distinct but related prior tasks can be used to\nautonomously acquire directed exploration strategies via meta-learning. Meta-learning, or learning to\nlearn, refers to the problem of learning strategies which can adapt quickly to novel tasks by using\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fprior experience on different but related tasks [23, 28, 10, 29, 1, 21, 22]. In the context of RL, meta-\nlearning algorithms typically fall into one of the following categories - RNN based learners [5, 30]\nand gradient descent based learners [6, 15].\nRNN meta-learners address meta-RL by training recurrent models [5, 30] that ingest past states,\nactions, and rewards, and predict new actions that will maximize rewards, with memory across\nseveral episodes of interaction. These methods are not ideal for learning to explore. 
First, good\nexploration strategies are qualitatively different from optimal policies: while an optimal policy is\ntypically deterministic in fully observed environments, exploration depends critically on stochasticity.\nMethods that simply recast the meta-RL problem into an RL problem generally acquire behaviors\nthat exhibit insuf\ufb01cient variability to explore effectively in new settings for dif\ufb01cult tasks. The same\npolicy has to represent highly exploratory behavior and adapt very quickly to optimal behavior, which\nbecomes dif\ufb01cult with typical time-invariant representations for action distributions. Second, these\nmethods aim to learn the entire \u201clearning algorithm,\u201d using a recurrent model. While this allows them\nto adapt very quickly, via a single forward pass of the RNN, it limits their asymptotic performance\nwhen compared to learning from scratch, since the learned \u201calgorithm\u201d (i.e., RNN) generally does not\ncorrespond to a convergent iterative optimization procedure, and is not guaranteed to keep improving.\nGradient descent based meta-learners such as model-agnostic meta-learning (MAML) [6], directly\ntrain for model parameters that can adapt quickly with gradient descent for new tasks. These methods\nhave the bene\ufb01t of allowing for similar asymptotic performance as learning from scratch, since\nadaptation is performed using gradient descent, while also enabling acceleration from meta-training.\nHowever, our experiments show that MAML alone is not very effective at learning to explore, due to\nthe lack of structured stochasticity in the exploration strategy.\nWe aim to address these challenges by devising a meta-RL algorithm that adapts to new tasks by\nfollowing the policy gradient, while also injecting learned structured stochasticity into a latent space to\nenable effective exploration. 
Our algorithm, which we call model agnostic exploration with structured\nnoise (MAESN), uses prior experience both to initialize a policy and to learn a latent exploration\nspace from which it can sample temporally coherent structured behaviors. This produces exploration\nstrategies that are stochastic, informed by prior knowledge, and more effective than random noise.\nImportantly, the policy and latent space are explicitly trained to adapt quickly to new tasks with the\npolicy gradient. Since adaptation is performed by following the policy gradient, our method achieves\nat least the same asymptotic performance as learning from scratch (and often performs substantially\nbetter), while the structured stochasticity allows for randomized but task-aware exploration. Latent\nspace models have been explored in prior works [9, 7, 13], though not in the context of meta-learning\nor learning exploration strategies. These methods do not explicitly train for fast adaptation, and\ncomparisons in Section 4 illustrate the advantages of our method.\nOur experimental evaluation shows that existing meta-RL methods, including MAML [6] and RNN-\nbased algorithms [5, 30], are limited in their ability to acquire complex exploratory policies, likely\ndue to limitations on their ability to acquire a strategy that is both stochastic and structured with policy\nparameterizations that can only introduce time-invariant stochasticity into the action space. While\nin principle certain RNN based architectures could capture time-correlated stochasticity, we \ufb01nd\nexperimentally that current methods fall short. Effective exploration strategies must select randomly\nfrom among the potentially useful behaviors, while avoiding behaviors that are highly unlikely\nto succeed. 
MAESN leverages this insight to acquire significantly better exploration strategies by incorporating learned time-correlated noise through its meta-learned latent space, and by training both the policy parameters and the latent exploration space explicitly for fast adaptation. In our experiments, we find that we are able to explore coherently and adapt quickly on a number of simulated manipulation and locomotion tasks with challenging exploration components.
One natural question that arises with meta-learning exploration is: if our goal is to learn exploration strategies that solve challenging tasks with sparse or delayed rewards, how can we solve the diverse and challenging tasks at meta-training time to acquire those strategies in the first place? One approach that we can take with MAESN is to use dense or shaped reward tasks to meta-learn exploration strategies that work well for sparse or delayed reward tasks. In this setting, we assume that the meta-training tasks are provided with well-shaped rewards (e.g., distances to a goal), while the more challenging tasks that will be seen at meta-test time will have sparse rewards (e.g., an indicator for being within a small distance of the goal). As we will see in Section 4, this enables MAESN to solve challenging tasks significantly better than prior methods at meta-test time for task families where existing meta-RL methods cannot meta-learn effectively from only sparse rewards.

2 Preliminaries: Meta-Reinforcement Learning

In meta-RL, we consider a distribution τi ∼ p(τ) over tasks, where each task τi is a different Markov decision process (MDP) Mi = (S, A, Pi, Ri), with state space S, action space A, transition distribution Pi, and reward function Ri.
The reward function and transitions vary across tasks. Meta-RL aims to learn a policy that can adapt to maximize the expected reward for novel tasks from p(τ) as efficiently as possible.
We build on the gradient-based meta-learning framework of MAML [6], which trains a model in such a way that it can adapt quickly with standard gradient descent, which in RL corresponds to the policy gradient. The meta-training objective for MAML can be written as

\max_\theta \sum_{\tau_i} E_{\pi_{\theta'_i}} \Big[ \sum_t R_i(s_t) \Big], \qquad \theta'_i = \theta + \alpha \, E_{\pi_\theta} \Big[ \sum_t R_i(s_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]   (1)

The intuition behind this optimization objective is that, since the policy will be adapted at meta-test time using the policy gradient, we can optimize the policy parameters so that one step of policy gradient improves its performance on any meta-training task as much as possible.
Since MAML reverts to conventional policy gradient when faced with out-of-distribution tasks, it provides a natural starting point for us to consider the design of a meta-exploration algorithm: by starting with a method that is essentially on par with task-agnostic RL methods that learn from scratch in the worst case, we can improve on it to incorporate the ability to acquire stochastic exploration strategies from experience, while preserving asymptotic performance.

3 Model Agnostic Exploration with Structured Noise

While meta-learning has been shown to be effective for fast adaptation on several RL problems [6, 5], prior methods generally focus on tasks where exploration is trivial and a few random trials are sufficient to identify the goals of the task [6], or where the policy should acquire a consistent "search" strategy, for example to find the exit in new mazes [5].
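The inner update in Eq. (1) is one step of REINFORCE applied to the meta-learned initialization. As a minimal illustration, here is our own numpy sketch with a toy linear-Gaussian policy (the function names and policy form are our assumptions, not the authors' implementation):

```python
import numpy as np

def grad_log_gaussian_policy(theta, s, a, sigma=1.0):
    # For pi_theta(a|s) = N(a | theta @ s, sigma^2):
    #   grad_theta log pi = (a - theta @ s) * s / sigma^2
    return (a - theta @ s) * s / sigma**2

def maml_inner_update(theta, trajectory, alpha=0.1):
    # One REINFORCE step: theta' = theta + alpha * sum_t R_i(s_t) * grad log pi.
    # trajectory is a list of (state, action, return-to-go) tuples.
    grad = np.zeros_like(theta)
    for s, a, ret in trajectory:
        grad += ret * grad_log_gaussian_policy(theta, s, a)
    return theta + alpha * grad
```

At meta-training time, gradients are additionally propagated through this update so that the initialization itself is optimized for post-update reward.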
Both of these adaptation regimes differ\nsubstantially from stochastic exploration. Tasks where discovering the goal requires exploration that\nis both stochastic and structured cannot be easily captured by such methods, as demonstrated in our\nexperiments. Speci\ufb01cally, there are two major shortcomings with these methods: (1) The stochasticity\nof the policy is limited to time-invariant noise from action distributions, which fundamentally limits\nthe exploratory behavior it can represent. (2) For RNN based methods, the policy is limited in\nits ability to adapt to new environments, since adaptation is performed with a forward pass of the\nrecurrent network. If this single forward pass does not produce good behavior, there is no further\nmechanism for improvement. Methods that adapt by gradient descent, such as MAML, simply revert\nto standard policy gradient and can make slow but steady improvement in the worst case, but do not\naddress (1). In this section, we introduce a novel method for learning structured exploration behavior\nbased on gradient based meta-learning which is able to learn good exploratory behavior and adapt\nquickly to new tasks that require signi\ufb01cant exploration, without suffering in asymptotic performance.\n\n3.1 Overview\n\nOur algorithm, which we call model agnostic exploration with structured noise (MAESN), combines\nstructured stochasticity with MAML. MAESN is a gradient-based meta-learning algorithm that\nintroduces stochasticity not just by perturbing the actions, but also through a learned latent space\nwhich allows exploration to be time-correlated. Both the policy and the latent space are trained\nwith meta-learning to explicitly provide for fast adaptation to new tasks. When solving new tasks at\nmeta-test time, a different sample is generated from this latent space for each episode (and kept \ufb01xed\nthroughout the episode), providing structured and temporally correlated stochasticity. 
Because of\nmeta-training, the distribution over latent variables is adapted to the task quickly via policy gradient\nupdates. We \ufb01rst show how structured stochasticity can be introduced through latent spaces, and then\ndescribe how both the policy and the latent space can be meta-trained to form our overall algorithm.\n\n3\n\n\f3.2 Policies with Latent State\nTypical stochastic policies parameterize action distributions \u03c0\u03b8(a|s) in a way that is indepen-\ndent for each time step. This representation has no notion of temporally coherent randomness\nthroughout the trajectory, since stochasticity is added independently at each step. Under this\nrepresentation, additive noise is sampled independently for every time step. This limits the\nrange of possible exploration strategies, since the policy essentially \u201cchanges its mind\u201d about\nwhat it wants to explore at each time step. The distribution \u03c0\u03b8(a|s) is also typically repre-\nsented with simple parametric families, such as unimodal Gaussians, which restrict its expressivity.\nTo incorporate temporally coherent exploration and allow the policy to model\nmore complex time-correlated stochastic processes, we can condition the policy\non per-episode random variables drawn from a learned latent distribution, as\nshown on the right. Since these latent variables are sampled only once per\nepisode, they provide temporally coherent stochasticity. Intuitively, the policy\ndecides only once what it will try to do in each episode, and commits to this\nplan. Since the random sample is provided as an input, a nonlinear neural\nnetwork policy can transform this sample into arbitrarily complex distributions.\nThe resulting policies can be written as \u03c0\u03b8(a|s, z), where z \u223c q\u03c9(z), and q\u03c9(z) is the latent variable\ndistribution with parameters \u03c9. 
For example, in our experiments we consider diagonal Gaussian\ndistributions of the form q\u03c9(z) = N (\u00b5, \u03c3), such that \u03c9 = {\u00b5, \u03c3}. Structured stochasticity of this\nform can provide more coherent exploration, by sampling entire behaviors or goals, rather than simply\nrelying on independent random actions.\nWe discuss how to meta-learn latent representations and adapt quickly to new tasks. Related\nrepresentations have been explored in prior work [9, 7] but simply inputting random variables\ninto a policy does not by itself provide for rapid adaptation to new tasks. To achieve fast adaptation,\nwe can incorporate meta-learning as discussed below.\n\n3.3 Meta-Learning Latent Variable Policies\n\nFigure 1: Computation graph for MAESN. Meta-\nlearn pre-update latent parameters \u03c9i, and policy\nparameters \u03b8, such that after a gradient step, the\npost-update latent parameters \u03c9(cid:48)\ni, policy parame-\nters \u03b8(cid:48), are optimal for the task. The sampling\nprocedure introduces time correlated noise.\n\nGiven a latent variable conditioned policy as\ndescribed above, our goal is to train it so as to\ncapture coherent exploration strategies from a\nfamily of training tasks that enable fast adap-\ntation to new tasks from a similar distribution.\nWe use a combination of variational inference\nand gradient-based meta-learning to achieve this.\nSpeci\ufb01cally, our aim is to meta-train the policy\nparameters \u03b8 so that they can make use of the\nlatent variables to perform coherent exploration\non a new task and the behavior can be adapted as\nfast as possible. To that end, we jointly learn a\nset of policy parameters and a set of latent space\ndistribution parameters, such that they achieve\noptimal performance for each task after a policy gradient adaptation step. This procedure encourages\nthe policy to actually make use of the latent variables for exploration. 
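Concretely, the per-episode sampling described above can be sketched as follows (a minimal numpy illustration; the function names and the linear policy are our own, not the paper's architecture):

```python
import numpy as np

def sample_episode_latent(mu, log_sigma, rng):
    # Draw z ~ N(mu, diag(sigma^2)) ONCE per episode; it is held fixed
    # for every timestep, giving temporally coherent stochasticity.
    return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)

def act(theta, s, z):
    # Latent-conditioned policy pi_theta(a|s, z): the network sees [s; z]
    # at every step, so a single latent sample shapes the whole episode.
    return theta @ np.concatenate([s, z])
```

Because `z` is an input, a nonlinear policy network can transform this single sample into arbitrarily complex per-episode behavior.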
From one perspective, MAESN can be understood as augmenting MAML with a latent space to inject structured noise. From a different perspective, it amounts to learning a structured latent space, similar to [9], but trained for quick adaptation to new tasks. While [6] enables quick adaptation for simple tasks, and [9] learns structured latent spaces, MAESN achieves both structured exploration and fast adaptation. As shown in our experiments, neither of the prior methods alone effectively learns complex and stochastic exploration strategies.
To formalize the objective for meta-training, we introduce a model parameterization with policy parameters θ shared across all tasks, and per-task variational parameters ωi for tasks i = 1, 2, ..., N, which parameterize a per-task latent distribution qωi(zi). We refer to θ, ωi as the pre-update parameters. Meta-training involves optimizing the pre-update parameters on a set of training tasks, so as to maximize expected reward after a policy gradient update. As is standard in variational inference, we also add to the objective the KL-divergence between the per-task pre-update distributions qωi(zi) and a prior p(z), which in our experiments is simply a unit Gaussian. Without this additional loss, the per-task parameters ωi can simply memorize task-specific information.
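For the diagonal Gaussians used here, this KL regularizer against a unit Gaussian prior has a standard closed form; a small numpy sketch (ours, for illustration only):

```python
import numpy as np

def kl_to_unit_gaussian(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) )
    #   = 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```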
The KL loss ensures that sampling z ∼ p(z) for a new task at meta-test time still produces effective structured exploration.
For every iteration of meta-training, we sample from the latent-variable-conditioned policies represented by the pre-update parameters θ, ωi, perform an "inner" gradient update on the variational parameters for each task (and, optionally, the policy parameters) to obtain the task-specific post-update parameters θ'i, ω'i, and then propagate gradients through this update to obtain a meta-gradient for θ, ω1, ..., ωN such that the sum of expected task rewards over all tasks under the post-update latent-conditioned policies θ'i, ω'i is maximized, while the KL divergence of the pre-update distributions qωi(zi) against the prior p(zi) is minimized. Note that the KL-divergence loss is applied to the pre-update distributions qωi, not the post-update distributions, so the policy can exhibit very different behaviors on each task after the inner update. Computing the gradient of the reward under the post-update parameters requires differentiating through the inner policy gradient term, as in MAML [6].
A concise description of the meta-training procedure is provided in Algorithm 1, and the computation graph representing MAESN is shown in Fig 1.
The full meta-training problem can be stated mathematically as

\max_{\theta, \omega_i} \sum_{i \in \text{tasks}} E_{a_t \sim \pi(a_t|s_t; \theta'_i, z'_i),\; z'_i \sim q_{\omega'_i}(\cdot)} \Big[ \sum_t R_i(s_t) \Big] - \sum_{i \in \text{tasks}} D_{KL}\big( q_{\omega_i}(\cdot) \,\|\, p(z) \big)   (2)

\omega'_i = \omega_i + \alpha_\omega \circ \nabla_{\omega_i} E_{a_t \sim \pi(a_t|s_t; \theta, z_i),\; z_i \sim q_{\omega_i}(\cdot)} \Big[ \sum_t R_i(s_t) \Big]   (3)

\theta'_i = \theta + \alpha_\theta \circ \nabla_\theta E_{a_t \sim \pi(a_t|s_t; \theta, z_i),\; z_i \sim q_{\omega_i}(\cdot)} \Big[ \sum_t R_i(s_t) \Big]   (4)

The two objective terms are the expected reward under the post-update parameters for each task and the KL-divergence between each task's pre-update latent distribution and the prior. The α values are per-parameter step sizes, and ◦ is an elementwise product. The last update (to θ) is optional. We found that we could in fact obtain better results simply by omitting this update, which corresponds to meta-training the initial policy parameters θ simply to use the latent space efficiently, without training the parameters themselves explicitly for fast adaptation. Including the θ update makes the resulting optimization problem more challenging.
MAESN enables structured exploration by using the latent variables z, while explicitly training for fast adaptation via policy gradient. We could in principle train such a model without meta-training for adaptation at all, which resembles the model proposed by [9]. However, as we will show in our experimental evaluation, meta-training produces substantially better results.
Interestingly, during the course of meta-training, we find that the pre-update variational parameters ωi for each task are usually close to the prior at convergence.
This has a simple explanation: meta-training optimizes for post-update rewards, after ωi have been updated to ω'i, so even if ωi matches the prior, it does not match the prior after the inner update. This allows the learned policy to succeed on new tasks at meta-test time, for which we do not have a good initialization for ω and have no choice but to begin with the prior, as discussed in the next section.

Algorithm 1 MAESN meta-RL algorithm
1: Initialize variational parameters ωi for each training task τi
2: for iteration k ∈ {1, . . . , K} do
3:   Sample a batch of N training tasks from p(τ)
4:   for task τi ∈ {1, . . . , N} do
5:     Gather data using the latent-conditioned policy θ, (ωi)
6:     Compute the inner policy gradient on the variational parameters via Equation (3) (optionally (4))
7:   end for
8:   Compute the meta update on both latents and policy parameters by optimizing (2) with TRPO
9: end for

3.4 Using the Latent Space for Exploration

Let us consider a new task τi with reward Ri, and a learned model with policy parameters θ. The variational parameters ωi are specific to the tasks used during meta-training, and will not be useful for a new task. However, since the KL-divergence loss in Equation (2) encourages the pre-update parameters to be close to the prior, all of the variational parameters ωi are driven to the prior at convergence (Fig 5a). Hence, for exploration in a new task, we can initialize the latent distribution to the prior qω(z) = p(z). In our experiments, we use the prior with µ = 0 and σ = I. Adaptation to a new task is then done by simply using the policy gradient to adapt ω via backpropagation on the RL objective max_ω E_{a_t ∼ π(a_t|s_t, θ, z), z ∼ q_ω(·)} [Σ_t R(s_t)], where R represents the sum of rewards along the trajectory. Since we meta-trained to adapt ω in the inner loop, we adapt these parameters at meta-test time as well.
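Since q_ω is a diagonal Gaussian, one concrete way to carry out this adaptation is to estimate the gradient with respect to ω from sampled latents and episode returns using the score-function (likelihood-ratio) estimator. A toy numpy sketch of one such update (our own simplification; the paper wraps this in per-parameter step sizes and TRPO-style machinery):

```python
import numpy as np

def adapt_latent(mu, log_sigma, zs, returns, alpha=0.1):
    # Score-function estimate of grad_omega E_{z ~ q_omega}[R], using:
    #   grad_mu        log q(z) = (z - mu) / sigma^2
    #   grad_log_sigma log q(z) = ((z - mu) / sigma)^2 - 1
    sigma = np.exp(log_sigma)
    eps = (zs - mu) / sigma                        # one row per episode
    g_mu = np.mean(returns[:, None] * eps / sigma, axis=0)
    g_ls = np.mean(returns[:, None] * (eps**2 - 1.0), axis=0)
    return mu + alpha * g_mu, log_sigma + alpha * g_ls
```

Episodes whose latents earned high return pull the mean toward them, so repeated updates sharpen q_ω around the behaviors that work for the new task.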
To compute the gradients with respect to ω, we need to backpropagate through the sampling operation z ∼ qω(z), using either the likelihood ratio estimator or the reparameterization trick (where possible). The likelihood ratio update is

\nabla_\omega \eta = E_{a_t \sim \pi(a_t|s_t; \theta, z),\; z \sim q_\omega(\cdot)} \Big[ \Big( \sum_t R(s_t) \Big) \nabla_\omega \log q_\omega(z) \Big]   (5)

This adaptation scheme has the advantage of quick learning on new tasks because of meta-training, while maintaining good asymptotic performance, since we are simply using the policy gradient.

4 Experiments

Our experiments aim to comparatively evaluate our meta-learning method and study the following questions: (1) Can meta-learned exploration strategies with MAESN explore coherently and adapt quickly to new tasks, providing a significant advantage over learning from scratch? (2) How does meta-learning with MAESN compare with prior meta-learning methods such as MAML [6] and RL2 [5], as well as latent space learning methods [9]? (3) Can we visualize the exploration behavior and see coherent exploration strategies with MAESN? (4) Can we better understand which components of MAESN are the most critical? Videos and experimental details for all our experiments can be found at https://sites.google.com/view/meta-explore/

4.1 Experimental Details

During meta-training, the "inner" update corresponds to standard REINFORCE, while the meta-optimizer is trust region policy optimization (TRPO) [24]. Hyperparameters of each algorithm are listed in the supplementary materials, and were selected via a hyperparameter sweep (also detailed in the appendix). All experiments were initially run on a local 2-GPU machine, and run at scale using Amazon Web Services.
While our goal is to adapt quickly with sparse and delayed rewards at meta-test time, this poses a major challenge at meta-training time: if the tasks themselves are too difficult to learn from scratch, they will also be difficult to solve at meta-training time, making it hard for the meta-learner to make progress. In fact, none of the methods we evaluated, including MAESN, were able to make any learning progress on the sparse reward tasks at meta-training time (see the meta-training progress in supplementary materials Fig 2).
While this issue could potentially be addressed by using many more samples or existing task-agnostic exploration strategies during meta-training only, our method allows for a simpler solution. As discussed in Section 1, we can make use of shaped rewards during meta-training (both for our method and for the baselines), while only the sparse rewards are used to adapt at meta-test time. As shown below, exploration strategies meta-trained with MAESN under reward shaping generalize effectively to sparse and delayed rewards, despite the mismatch in the reward function.

4.2 Task Setup

We evaluated our method on three task distributions p(τ). For each family of tasks we used 100 distinct meta-training tasks, each with a different reward function Ri. After meta-training on a particular distribution of tasks, MAESN is able to explore well and adapt quickly to tasks drawn from this distribution (with sparse rewards). The input state of the environments does not contain the goal – instead, the agent must locate the goal through exploration. The details of the meta-train and test reward functions can be found in the supplementary materials.
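The shaped/sparse split described above can be made concrete. These are hypothetical reward functions of the kind the text describes (the exact reward definitions are in the paper's supplementary materials; the radius and forms below are our assumptions):

```python
import numpy as np

def shaped_reward(pos, goal):
    # Meta-training: dense, well-shaped negative distance to the goal.
    return -np.linalg.norm(pos - goal)

def sparse_reward(pos, goal, radius=0.1):
    # Meta-test: indicator reward, nonzero only near the goal.
    return 1.0 if np.linalg.norm(pos - goal) < radius else 0.0
```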
For robotic manipulation, orange indicates block location region\nacross tasks, and blue indicates the goal regions. For both locomotion tasks, the red circles indicate\ngoal positions across tasks from the distribution.\n\nFigure 3: Learning progress on novel tasks with sparse rewards for wheeled locomotion, legged\nlocomotion, and object manipulation. Rewards are averaged over 100 validation tasks, which have\nsparse rewards as described in supplementary material. MAESN learns signi\ufb01cantly better policies,\nand learns much quicker than prior meta-learning approaches and learning from scratch.\n\nRobotic Manipulation. The goal in these tasks is to push blocks to target locations with a robotic\nhand. Only one block (unknown to the agent) is relevant for each task, and that block must be moved\nto a goal location (see Fig. 2a). The position of the blocks and the goals are randomized across\ntasks. A coherent exploration strategy should pick random blocks to move to the goal location, trying\ndifferent blocks on each episode to discover the right one. This task is generally representative of\nexploration challenges in robotic manipulation: while a robot might perform a variety of different\nmanipulation skills, only motions that actually interact with objects in the world are useful for\ncoherent exploration.\n\nWheeled Locomotion. We consider a wheeled robot which controls its two wheels independently\nto move to different goal locations. The task family is illustrated in Fig. 2b. Coherent exploration on\nthis family of tasks requires driving to random locations in the world, which requires a coordinated\npattern of actions that is dif\ufb01cult to achieve purely with action-space noise.\n\nLegged Locomotion. To understand if we can scale to more complex locomotion tasks, we consider\na quadruped (\u201cant\u201d) tasked to walk to randomly placed goals (see Fig. 2c). 
This task presents a further exploration challenge, since only carefully coordinated leg motion produces movement to different positions, so an ideal exploration strategy would always walk, but to different places.

4.3 Comparisons

We compare MAESN with RL2 [5], MAML [6], and simply learning latent spaces without fast adaptation (LatentSpace), analogously to [9]. For training from scratch, we compare with TRPO [24], REINFORCE [31], and training from scratch with VIME [11], a general-purpose exploration algorithm. Further details can be found in the supplementary materials.
In Figure 3, we report results for our method and prior approaches when adapting to new tasks at meta-test time, using sparse rewards. We plot the performance of all methods in terms of the reward (averaged across 30 validation tasks) that the methods obtain while adapting to tasks drawn from a test set of tasks. Our results on the tasks discussed above show that MAESN is able to explore and adapt quickly in sparse reward environments. In comparison, MAML and RL2 do not learn behaviors that explore as effectively. The pure latent space model (LatentSpace in Figure 3) achieves reasonable performance, but is limited in its capacity to improve beyond the initial identification of latent space parameters and is not optimized for fast adaptation in the latent space. Since MAESN trains the latent space explicitly for fast adaptation, it achieves better results faster.
We also observe that, for many tasks, learning from scratch actually provides a competitive baseline to prior meta-learning methods in terms of asymptotic performance. This indicates that the task distributions are quite challenging, and simply memorizing the meta-training tasks is insufficient to succeed.
However, in all cases, we see that MAESN is able to outperform learning from scratch and task-agnostic exploration in terms of both learning speed and asymptotic performance. On the challenging legged locomotion task, which requires coherent walking behaviors to random locations in the world to discover the sparse rewards, we find that only MAESN is able to adapt effectively.

4.4 Exploration Strategies

To understand the exploration strategies learned by MAESN, we visualize the trajectories obtained by sampling from the meta-learned latent-conditioned policy πθ with the latent distribution qω(z) set to the prior N(0, I). The resulting trajectories show the 2D position of the hand for the block pushing task and the 2D position of the center of mass for the locomotion tasks. Task distributions for each family of tasks are shown in Fig. 2a, 2b, 2c. We can see from these trajectories (Fig. 4) that the learned exploration strategies explore the space of coherent behaviors broadly and effectively, especially in comparison with random exploration and standard MAML.

Figure 4: Plot of exploration behavior, visualizing the 2D position of the manipulator (for block pushing) and the CoM for locomotion, for MAESN, MAML, and random initialization. Top: Block Manipulation. Bottom: Wheeled Locomotion. Goals are indicated by the translucent overlays. MAESN captures the task distribution better than the other methods.

4.5 Analysis of Structured Latent Space

We investigate the structure of the learned latent space in the manipulation task by visualizing the pre-update parameters ωi = (µi, σi) and post-update parameters ω′i = (µ′i, σ′i) for a 2D latent space. The variational distributions are plotted as ellipses.
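The structure analyzed here is shaped by the KL term that regularizes the variational parameters toward the prior N(0, I); for a diagonal Gaussian this divergence has a simple closed form. A minimal sketch (the function name is ours):

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over
    latent dimensions: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # → 0.0 at the prior itself
print(kl_to_standard_normal([1.0, -0.5], [0.5, 2.0]))  # → 1.75
```

The penalty vanishes exactly when (µ, σ) = (0, I) and grows as the parameters move away, which is why meta-training drives the pre-update parameters onto the prior.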
As can be seen from Fig. 5a, the pre-update parameters are all driven to the prior N(0, I), while the post-update parameters move to different locations in the latent space to adapt to their respective tasks. This indicates that the meta-training process effectively utilizes the latent variables, but also minimizes the KL-divergence against the prior, ensuring that initializing ω to the prior for a new task will produce effective exploration.

Figure 5: Analysis of the learned latent space. (a) Latent distributions in MAESN visualized for a 2D latent space. Left: pre-update latents. Right: post-update latents. (Each number in the post-update plot corresponds to a different task.) (b) Visualization of exploration for legged locomotion. Left: CoM visitations using structured noise. Right: CoM visitations with no structured noise. The increased spread of exploration and wider trajectory distribution suggest that structured noise is being used.

We also evaluate whether the noise injected from the latent space learned by MAESN is actually used for exploration. We observe the exploratory behavior displayed by a policy trained with MAESN when the latent variable z is kept fixed, as compared to when it is sampled from the learned latent distribution. We can see from Fig. 5b that, although there is some random exploration even without latent space sampling, the range of trajectories is much broader when z is sampled from the prior.

5 Conclusion

We presented MAESN, a meta-RL algorithm that explicitly learns to explore by combining gradient-based meta-learning with a learned latent exploration space. MAESN learns a latent space that can be used to inject temporally correlated, coherent stochasticity into the policy to explore effectively at
A good exploration strategy must randomly sample from among the useful behaviors,\nwhile omitting behaviors that are never useful. Our experimental evaluation illustrates that MAESN\ndoes precisely this, outperforming both prior meta-learning methods and learning from scratch,\nincluding methods that use task-agnostic exploration strategies. It\u2019s worth noting, however, that our\napproach is not mutually exclusive with these methods, and in fact a promising direction for future\nwork would be to combine our approach with these methods [11].\n\n6 Acknowledgements\n\nThe authors would like to thank Chelsea Finn, Gregory Kahn, Ignasi Clavera for thoughtful discus-\nsions and Justin Fu, Marvin Zhang for comments on an early version of the paper. This work was\nsupported by a National Science Foundation Graduate Research Fellowship for Abhishek Gupta,\nONR PECASE award for Pieter Abbeel, and the National Science Foundation through IIS-1651843\nand IIS-1614653, as well as an ONR Young Investigator Program award for Sergey Levine.\n\nReferences\n[1] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and\nIn D. D. Lee,\n\nN. de Freitas. Learning to learn by gradient descent by gradient descent.\nM. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors, NIPS, 2016.\n\n[2] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying\n\ncount-based exploration and intrinsic motivation. In NIPS, 2016.\n\n[3] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-\noptimal reinforcement learning. Journal of Machine Learning Research, 3:213\u2013231, Mar.\n2003.\n\n[4] O. Chapelle and L. Li. An empirical evaluation of thompson sampling. In J. Shawe-Taylor, R. S.\nZemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information\nProcessing Systems 24, pages 2249\u20132257. Curran Associates, Inc., 2011.\n\n[5] Y. Duan, J. Schulman, X. Chen, P. L. 
Bartlett, I. Sutskever, and P. Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016.

[6] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In D. Precup and Y. W. Teh, editors, ICML, 2017.

[7] C. Florensa, Y. Duan, and P. Abbeel. Stochastic neural networks for hierarchical reinforcement learning. CoRR, abs/1704.03012, 2017.

[8] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg. Noisy networks for exploration. CoRR, abs/1706.10295, 2017.

[9] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In Proceedings of the International Conference on Learning Representations, ICLR, 2018.

[10] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In Artificial Neural Networks – ICANN 2001, International Conference, Vienna, Austria, August 21–25, 2001, Proceedings, pages 87–94, 2001.

[11] R. Houthooft, X. Chen, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. VIME: Variational information maximizing exploration. In NIPS, 2016.

[12] M. J. Kearns and S. P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[13] J. Z. Kolter and A. Y. Ng. Learning omnidirectional path following using dimensionality reduction. In RSS, 2007.

[14] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(39):1–40, 2016.

[15] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning. CoRR, abs/1707.09835, 2017.

[16] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer. Exploration in model-based reinforcement learning by empirically estimating learning progress.
In NIPS, 2012.

[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[18] I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep exploration via bootstrapped DQN. In NIPS, 2016.

[19] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter space noise for exploration. CoRR, abs/1706.01905, 2017.

[20] A. Rajeswaran, V. Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. CoRR, abs/1709.10087, 2017.

[21] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations, ICLR, 2017.

[22] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. P. Lillicrap. Meta-learning with memory-augmented neural networks. In M. Balcan and K. Q. Weinberger, editors, ICML, 2016.

[23] J. Schmidhuber. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universität München, Germany, 14 May 1987.

[24] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.

[25] S. P. Singh, A. G. Barto, and N. Chentanez. Intrinsically motivated reinforcement learning. In NIPS, 2004.

[26] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.

[27] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci., 74(8):1309–1331, 2008.

[28] S. Thrun and L. Pratt. Learning to learn.
chapter Learning to Learn: Introduction and Overview, pages 3–17. Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[29] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors, NIPS, 2016.

[30] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. CoRR, abs/1611.05763, 2016.

[31] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.