{"title": "Unsupervised Curricula for Visual Meta-Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10519, "page_last": 10531, "abstract": "In principle, meta-reinforcement learning algorithms leverage experience across many tasks to learn fast and effective reinforcement learning (RL) strategies. However, current meta-RL approaches rely on manually-defined distributions of training tasks, and hand-crafting these task distributions can be challenging and time-consuming. Can ``useful'' pre-training tasks be discovered in an unsupervised manner? We develop an unsupervised algorithm for inducing an adaptive meta-training task distribution, i.e. an automatic curriculum, by modeling unsupervised interaction in a visual environment. \nThe task distribution is scaffolded by a parametric density model of the meta-learner's trajectory distribution. \nWe formulate unsupervised meta-RL as information maximization between a latent task variable and the meta-learner\u2019s data distribution, and describe a practical instantiation which alternates between integration of recent experience into the task distribution and meta-learning of the updated tasks. Repeating this procedure leads to iterative reorganization such that the curriculum adapts as the meta-learner's data distribution shifts. Moreover, we show how discriminative clustering frameworks for visual representations can support trajectory-level task acquisition and exploration in domains with pixel observations, avoiding the pitfalls of alternatives.\nIn experiments on vision-based navigation and manipulation domains, we show that the algorithm allows for unsupervised meta-learning that both transfers to downstream tasks specified by hand-crafted reward functions and serves as pre-training for more efficient meta-learning of test task distributions.", "full_text": "Unsupervised Curricula\n\nfor Visual Meta-Reinforcement Learning\n\nAllan Jabri\u03b1 Kyle Hsu\u03b2,\u2020 Benjamin Eysenbach\u03b3\nAbhishek Gupta\u03b1 Sergey Levine\u03b1 Chelsea Finn\u03b4\n\nAbstract\n\nIn principle, meta-reinforcement learning algorithms leverage experience across\nmany tasks to learn fast reinforcement learning (RL) strategies that transfer to\nsimilar tasks. However, current meta-RL approaches rely on manually-de\ufb01ned\ndistributions of training tasks, and hand-crafting these task distributions can be\nchallenging and time-consuming. Can \u201cuseful\u201d pre-training tasks be discovered in\nan unsupervised manner? We develop an unsupervised algorithm for inducing an\nadaptive meta-training task distribution, i.e. an automatic curriculum, by modeling\nunsupervised interaction in a visual environment. The task distribution is scaffolded\nby a parametric density model of the meta-learner\u2019s trajectory distribution. We\nformulate unsupervised meta-RL as information maximization between a latent\ntask variable and the meta-learner\u2019s data distribution, and describe a practical\ninstantiation which alternates between integration of recent experience into the task\ndistribution and meta-learning of the updated tasks. Repeating this procedure leads\nto iterative reorganization such that the curriculum adapts as the meta-learner\u2019s\ndata distribution shifts. In particular, we show how discriminative clustering for\nvisual representation can support trajectory-level task acquisition and exploration\nin domains with pixel observations, avoiding pitfalls of alternatives. 
In experiments on vision-based navigation and manipulation domains, we show that the algorithm allows for unsupervised meta-learning that transfers to downstream tasks specified by hand-crafted reward functions and serves as pre-training for more efficient supervised meta-learning of test task distributions.

1 Introduction
The discrepancy between animals and learning machines in their capacity to gracefully adapt and generalize is a central issue in artificial intelligence research. The simple nematode C. elegans is capable of adapting foraging strategies to varying scenarios [9], while many higher animals are driven to acquire reusable behaviors even without extrinsic task-specific rewards [64, 45]. It is unlikely that we can build machines as adaptive as even the simplest of animals by exhaustively specifying shaped rewards or demonstrations across all possible environments and tasks. This has inspired work in reward-free learning [28], intrinsic motivation [55], multi-task learning [11], meta-learning [50], and continual learning [59].

An important aspect of generalization is the ability to share and transfer ability between related tasks. In reinforcement learning (RL), a common strategy for multi-task learning is conditioning the policy on side-information related to the task. For instance, contextual policies [49] are conditioned on a task description (e.g. a goal) that is meant to modulate the strategy enacted by the policy. Meta-learning of reinforcement learning (meta-RL) is yet more general, as it places the burden of inferring the task on the learner itself, such that task descriptions can take a wider range of forms, the most general being an MDP. In principle, meta-RL requires an agent to distill previous experience into fast and effective adaptation strategies for new, related tasks. However, the meta-RL framework by itself does not prescribe where this experience should come from; typically, meta-RL algorithms rely on being provided fixed, hand-specified task distributions, which can be tedious to specify for simple behaviors and intractable to design for complex ones [27]. These issues raise the question of whether "useful" task distributions for meta-RL can be generated automatically.

αUC Berkeley  βUniversity of Toronto  γCarnegie Mellon University  δStanford University
†Work done as a visiting student researcher at UC Berkeley.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An illustration of CARML, our approach for unsupervised meta-RL. We choose the behavior model qφ to be a Gaussian mixture model in a jointly, discriminatively learned embedding space. An automatic curriculum arises from periodically re-organizing past experience via fitting qφ and meta-learning an RL algorithm for performance over tasks specified using reward functions from qφ.

In this work, we seek a procedure through which an agent in an environment with visual observations can automatically acquire useful (i.e. utility-maximizing) behaviors, as well as how and when to apply them, in effect allowing for unsupervised pre-training in visual environments. Two key aspects of this goal are: 1) learning to operationalize strategies so as to adapt to new tasks, i.e. meta-learning, and 2) unsupervised learning and exploration in the absence of explicitly specified tasks, i.e.
skill acquisition without supervised reward functions. These aspects interact insofar as the former implicitly relies on a task curriculum, while the latter is most effective when compelled by what the learner can and cannot do. Prior work has offered a pipelined approach for unsupervised meta-RL consisting of unsupervised skill discovery followed by meta-learning of discovered skills, experimenting mainly in environments that expose low-dimensional ground truth state [25]. Yet, the aforementioned relation between skill acquisition and meta-learning suggests that they should not be treated separately.

Here, we argue for closing the loop between skill acquisition and meta-learning in order to induce an adaptive task distribution. Such co-adaptation introduces a number of challenges related to the stability of learning and exploration. Most recent unsupervised skill acquisition approaches optimize for the discriminability of induced modes of behavior (i.e. skills), typically expressing the discovery problem as a cooperative game between a policy and a learned reward function [24, 16, 1]. However, relying solely on discriminability becomes problematic in environments with high-dimensional (image-based) observation spaces, as it results in an issue akin to mode-collapse in the task space. This problem is further complicated in the setting we propose to study, wherein the policy data distribution is that of a meta-learner rather than a contextual policy. We will see that this can be ameliorated by specifying a hybrid discriminative-generative model for parameterizing the task distribution.

The main contribution of this paper is an approach for inducing a task curriculum for unsupervised meta-RL in a manner that scales to domains with pixel observations. Through the lens of information maximization, we frame our unsupervised meta-RL approach as variational expectation-maximization (EM), in which the E-step corresponds to fitting a task distribution to a meta-learner's behavior and the M-step to meta-RL on the current task distribution with reinforcement for both skill acquisition and exploration. For the E-step, we show how deep discriminative clustering allows for trajectory-level representations suitable for learning diverse skills from pixel observations. Through experiments in vision-based navigation and robotic control domains, we demonstrate that the approach i) enables an unsupervised meta-learner to discover and meta-learn skills that transfer to downstream tasks specified by human-provided reward functions, and ii) can serve as pre-training for more efficient supervised meta-reinforcement learning of downstream task distributions.

2 Preliminaries: Meta-Reinforcement Learning
Supervised meta-RL optimizes an RL algorithm fθ for performance on a hand-crafted distribution of tasks p(T), where fθ might take the form of a recurrent neural network (RNN) implementing a learning algorithm [13, 61], or a function implementing a gradient-based learning algorithm [18]. Tasks are Markov decision processes (MDPs) Ti = (S, A, ri, P, γ, ρ, T) consisting of state space S,
action space A, reward function ri : S × A → R, probabilistic transition dynamics P(st+1|st, at), discount factor γ, initial state distribution ρ(s1), and finite horizon T. Often, and in our setting, tasks are assumed to share S, A. For a given T ∼ p(T), fθ learns a policy πθ(a|s, DT) conditioned on task-specific experience. Thus, a meta-RL algorithm optimizes fθ for expected performance of πθ(a|s, DT) over p(T), such that it can generalize to unseen test tasks also sampled from p(T).

Figure 2: A step for the meta-learner. (Left) Unsupervised pre-training. The policy meta-learns self-generated tasks based on the behavior model qφ. (Right) Transfer. Faced with new tasks, the policy transfers acquired meta-learning strategies to maximize unseen reward functions.

For example, RL2 [13, 61] chooses fθ to be an RNN with weights θ. For a given task T, fθ hones πθ(a|s, DT) as it recurrently ingests DT = (s1, a1, r(s1, a1), d1, ...), the sequence of states, actions, and rewards produced via interaction within the MDP. Crucially, the same task is seen several times, and the hidden state is not reset until the next task. The loss is the negative discounted return obtained by πθ across episodes of the same task, and fθ can be optimized via standard policy gradient methods for RL, backpropagating gradients through time and across episode boundaries.

Unsupervised meta-RL aims to break the reliance of the meta-learner on an explicit, upfront specification of p(T). Following Gupta et al. [25], we consider a controlled Markov process (CMP) C = (S, A, P, γ, ρ, T), which is an MDP without a reward function. We are interested in the problem of learning an RL algorithm fθ via unsupervised interaction within the CMP such that once a reward function r is specified at test-time, fθ can be readily applied to the resulting MDP to efficiently maximize the expected discounted return.

Prior work [25] pipelines skill acquisition and meta-learning by pairing an unsupervised RL algorithm, DIAYN [16], and a meta-learning algorithm, MAML [18]: first, a contextual policy is used to discover skills in the CMP, yielding a finite set of learned reward functions distributed as p(r); then, the CMP is combined with a frozen p(r) to yield p(T), which is fed to MAML to meta-learn fθ. In the next section, we describe how we can generalize and improve upon this pipelined approach by jointly performing skill acquisition as the meta-learner learns and explores in the environment.
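To make the RL2 trial structure above concrete, the following is a minimal sketch of how a recurrent meta-learner consumes a trial. This is our illustration, not the authors' released code: `env` is assumed to expose a gym-like reset/step interface, and `policy` an RNN-like interface with `initial_state` and `act`.

```python
def run_trial(env, policy, episodes_per_trial=2):
    """Collect one RL^2 trial: several episodes of the SAME task T.

    Sketch under assumed interfaces. The RNN hidden state is reset only at
    trial boundaries, so the policy can accumulate task information across
    episode boundaries, as described for RL^2 above.
    """
    hidden = policy.initial_state()       # reset ONLY at trial boundaries
    prev_reward, prev_done = 0.0, False
    trajectory = []
    for _ in range(episodes_per_trial):
        obs, done = env.reset(), False    # same task each episode
        while not done:
            # RL^2 input: current observation plus the previous reward and
            # done flag, so the network effectively ingests D_T.
            action, hidden = policy.act(obs, prev_reward, prev_done, hidden)
            obs, reward, done, _ = env.step(action)
            trajectory.append((obs, action, reward, done))
            prev_reward, prev_done = reward, done
    # The meta-RL loss is the negative discounted return summed across the
    # episodes of this trial, backpropagated through time via `hidden`.
    return trajectory
```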
3 Curricula for Unsupervised Meta-Reinforcement Learning
Meta-learning is intended to prepare an agent to efficiently solve new tasks related to those seen previously. To this end, the meta-RL agent must balance 1) exploring the environment to infer which task it should solve, and 2) visiting states that maximize reward under the inferred task. The duty of unsupervised meta-RL is thus to present the meta-learner with tasks that allow it to practice task inference and execution, without the need for human-specified task distributions. Ideally, the task distribution should exhibit both structure and diversity. That is, the tasks should be distinguishable and not excessively challenging so that a developing meta-learner can infer and execute the right skill, but, for the sake of generalization, they should also encompass a diverse range of associated stimuli and rewards, including some beyond the current scope of the meta-learner. Our aim is to strike this balance by inducing an adaptive task distribution.

With this motivation, we develop an algorithm for unsupervised meta-reinforcement learning in visual environments that constructs a task distribution without supervision. The task distribution is derived from a latent-variable density model of the meta-learner's cumulative behavior, with exploration based on the density model driving the evolution of the task distribution. As depicted in Figure 1, learning proceeds by alternating between two steps: organizing experiential data (i.e., trajectories generated by the meta-learner) by modeling it with a mixture of latent components forming the basis of "skills", and meta-reinforcement learning by treating these skills as a training task distribution.

Learning the task distribution in a data-driven manner ensures that tasks are feasible in the environment. While the induced task distribution is in no way guaranteed to align with test task distributions, it may yet require an implicit understanding of structure in the environment. This can indeed be seen from our visualizations in §5, which demonstrate that acquired tasks show useful structure, though in some settings this structure is easier to meta-learn than others.
In the following, we formalize our approach, CARML, through the lens of information maximization and describe a concrete instantiation that scales to the vision-based environments considered in §5.

3.1 An Overview of CARML
We begin from the principle of information maximization (IM), which has been applied across unsupervised representation learning [4, 3, 41] and reinforcement learning [39, 24] for organization of data involving latent variables. In what follows, we organize data from our policy by maximizing the mutual information (MI) between state trajectories τ := (s1, ..., sT) and a latent task variable z. This objective provides a principled manner of trading off structure and diversity: from I(τ; z) := H(τ) − H(τ|z), we see that H(τ) promotes coverage in policy data space (i.e. diversity), while −H(τ|z) encourages a lack of diversity under each task (i.e. structure that eases task inference).

We approach maximizing the I(τ; z) exhibited by the meta-learner fθ via variational EM [3], introducing a variational distribution qφ that can intuitively be viewed as a task scaffold for the meta-learner. In the E-step, we fit qφ to a reservoir of trajectories produced by fθ, re-organizing the cumulative experience.
In turn, qφ gives rise to a task distribution p(T): each realization of the latent variable z induces a reward function rz(s), which we combine with the CMP C to produce an MDP T (Line 8 of Algorithm 1). In the M-step, fθ meta-learns the task distribution p(T). Repeating these steps forms a curriculum in which the task distribution and meta-learner co-adapt: each M-step adapts the meta-learner fθ to the updated task distribution, while each E-step updates the task scaffold qφ based on the data collected during meta-training. Pseudocode for our method is presented in Algorithm 1.

Algorithm 1: CARML – Curricula for Automatic Reinforcement of Meta-Learning
1: Require: C, an MDP without a reward function
2: Initialize fθ, an RL algorithm parameterized by θ.
3: Initialize D, a reservoir of state trajectories, via a randomly initialized policy.
4: while not done do
5:     Fit a task scaffold qφ to D, e.g. by using Algorithm 2.                ▷ E-step (§3.2)
6:     for a desired mixture model-fitting period do                          ▷ M-step (§3.3)
7:         Sample a latent task variable z ∼ qφ(z).
8:         Define the reward function rz(s), e.g. by Eq. 8, and a task T = C ∪ rz(s).
9:         Apply fθ on task T to obtain a policy πθ(a|s, DT) and trajectories {τi}.
10:        Update fθ via a meta-RL algorithm, e.g. RL2 [13].
11:        Add the new trajectories to the reservoir: D ← D ∪ {τi}.
12: Return: a meta-learned RL algorithm fθ tailored to C
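For concreteness, a compact Python rendering of Algorithm 1 follows. This is a sketch under assumed interfaces, not the authors' implementation: `fit_task_scaffold` stands in for the E-step (Algorithm 2), `make_reward` for the reward construction of Eq. 8, and `meta_learner` for an RL2-style learner; the paper maintains D with reservoir sampling, whereas a plain list is used here for brevity.

```python
def carml(cmp_env, meta_learner, fit_task_scaffold, make_reward,
          init_trajectories, num_outer_iters=5, fitting_period=1000):
    """Sketch of Algorithm 1; every argument is an assumed interface."""
    reservoir = list(init_trajectories)              # D, seeded by a random policy
    for _ in range(num_outer_iters):
        scaffold = fit_task_scaffold(reservoir)      # E-step (Section 3.2)
        for _ in range(fitting_period):              # M-step (Section 3.3)
            z = scaffold.sample_z()                  # z ~ q_phi(z)
            task = cmp_env.with_reward(make_reward(scaffold, z))  # T = C u r_z(s)
            trajectories = meta_learner.run_trial(task)  # pi_theta(a|s, D_T)
            meta_learner.update(trajectories)        # e.g. an RL^2 / PPO step
            reservoir.extend(trajectories)           # D <- D u {tau_i}
    return meta_learner
```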
3.2 E-Step: Task Acquisition
The purpose of the E-step is to update the task distribution by integrating changes in the meta-learner's data distribution with previous experience, thereby allowing for re-organization of the task scaffold. This data is from the post-update policy, meaning that it comes from a policy πθ(a|s, DT) conditioned on data collected by the meta-learner for the respective task. In the following, we abuse notation by writing πθ(a|s, z), conditioning on the latent task variable z rather than the task experience DT.

The general strategy followed by recent approaches for skill discovery based on IM is to lower-bound the objective by introducing a variational posterior qφ(z|s) in the form of a classifier. In these approaches, the E-step amounts to updating the classifier to discriminate between data produced by different skills as much as possible. A potential failure mode of such an approach is an issue akin to mode-collapse in the task distribution, wherein the policy drops modes of behavior to favor easily discriminable trajectories, resulting in a lack of diversity in the task distribution and no incentive for exploration; this is especially problematic when considering high-dimensional observations. Instead, here we derive a generative variant, which allows us to account for explicitly capturing modes of behavior (by optimizing for likelihood), as well as a direct mechanism for exploration.

We introduce a variational distribution qφ, which could be e.g. a (deep) mixture model with discrete z or a variational autoencoder (VAE) [34] with continuous z, lower-bounding the objective:

I(\tau; z) = -\sum_{\tau} \pi_\theta(\tau) \log \pi_\theta(\tau) + \sum_{\tau, z} \pi_\theta(\tau, z) \log \pi_\theta(\tau \mid z)   (1)
\geq -\sum_{\tau} \pi_\theta(\tau) \log \pi_\theta(\tau) + \sum_{\tau, z} \pi_\theta(\tau \mid z)\, q_\phi(z) \log q_\phi(\tau \mid z)   (2)

The E-step corresponds to optimizing Eq. 2 with respect to φ, and thus amounts to fitting qφ to a reservoir of trajectories D produced by πθ:

\max_{\phi} \; \mathbb{E}_{z \sim q_\phi(z),\, \tau \sim \mathcal{D}} \left[ \log q_\phi(\tau \mid z) \right]   (3)

What remains is to determine the form of qφ. We choose the variational distribution to be a state-level mixture density model qφ(s, z) = qφ(s|z)qφ(z). Despite using a state-level generative model, we can treat z as a trajectory-level latent by computing the trajectory-level likelihood as the factorized product of state likelihoods (Algorithm 2, Line 4). This is useful for obtaining trajectory-level tasks; in the M-step (§3.3), we map samples from qφ(z) to reward functions to define tasks for meta-learning.

Algorithm 2: Task Acquisition via Discriminative Clustering
1: Require: a set of trajectories D = {(s1, ..., sT)}_{i=1}^{N}
2: Initialize (φw, φm), encoder and mixture parameters.
3: while not converged do
4:     Compute L(φm; τ, z) = Σ_{st ∈ τ} log q_{φm}(g_{φw}(st) | z).
5:     φm ← argmax_{φ′m} Σ_{i=1}^{N} L(φ′m; τi, z)   (via MLE)
6:     D′ := {(s, y := argmax_k q_{φm}(z = k | g_{φw}(s)))}.
7:     φw ← argmax_{φ′w} Σ_{(s,y) ∈ D′} log q(y | g_{φ′w}(s))
8: Return: a mixture model qφ(s, z)

Figure 3: Conditional independence assumption for states along a trajectory.

Modeling Trajectories of Pixel Observations. While models like the variational autoencoder have been used in related settings [40], a basic issue is that optimizing for reconstruction treats all pixels equally. We, rather, will tolerate lossy representations as long as they capture discriminative features useful for stimulus-reward association. Drawing inspiration from recent work on unsupervised feature learning by clustering [6, 10], we propose to fit the trajectory-level mixture model via discriminative clustering, striking a balance between discriminative and generative approaches.

We adopt the optimization scheme of DeepCluster [10], which alternates between i) clustering representations to obtain pseudo-labels and ii) updating the representation by supervised learning of pseudo-labels. In particular, we derive a trajectory-level variant (Algorithm 2) by forcing the responsibilities of all observations in a trajectory to be the same (see Appendix A.1 for a derivation), leading to state-level visual representations optimized with trajectory-level supervision.
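The sketch below illustrates this trajectory-level discriminative clustering under simplifying assumptions of our own: k-means over mean-pooled trajectory embeddings stands in for fitting the mixture in embedding space, and `encoder` is assumed to be any torch module with an `out_dim` attribute. It is not the authors' code, but it instantiates the two alternating steps of Algorithm 2.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def fit_task_scaffold(trajectories, encoder, n_components=8, n_rounds=10):
    """Trajectory-level variant of deep discriminative clustering (a sketch
    of Algorithm 2). `trajectories` is a list of [T, obs_dim] float arrays;
    `encoder` maps observations to `encoder.out_dim`-dimensional features."""
    classifier = nn.Linear(encoder.out_dim, n_components)
    params = list(encoder.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(n_rounds):
        # (i) Cluster at the TRAJECTORY level: mean-pool state embeddings so
        # that all observations in a trajectory share one responsibility.
        with torch.no_grad():
            traj_embs = torch.stack(
                [encoder(torch.as_tensor(t, dtype=torch.float32)).mean(dim=0)
                 for t in trajectories])
        labels = KMeans(n_clusters=n_components, n_init=10).fit_predict(
            traj_embs.numpy())
        # (ii) Update the encoder by STATE-level supervised learning of the
        # pseudo-labels: each state inherits its trajectory's cluster label.
        states = torch.cat(
            [torch.as_tensor(t, dtype=torch.float32) for t in trajectories])
        targets = torch.cat(
            [torch.full((len(t),), int(y), dtype=torch.long)
             for t, y in zip(trajectories, labels)])
        loss = nn.functional.cross_entropy(classifier(encoder(states)), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder, classifier, labels
```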
The conditional independence assumption in Algorithm 2 is a simplification insofar as it discards the order of states in a trajectory. However, if the dynamics exhibit continuity and causality, the visual representation might yet capture temporal structure, since, for example, attaining certain observations might imply certain antecedent subtrajectories. We hypothesize that a state-level model can regulate issues of over-expressive sequence encoders, which have been found to lead to skills with undesirable attention to details in dynamics [1]. As we will see in §5, learning representations under this assumption still allows for learning visual features that capture trajectory-level structure.

3.3 M-Step: Meta-Learning
Using the task scaffold updated via the E-step, we meta-learn fθ in the M-step so that πθ can be quickly adapted to tasks drawn from the task scaffold. To define the task distribution, we must specify a form for the reward functions rz(s). To allow for state-conditioned Markovian rewards rather than non-Markovian trajectory-level rewards, we lower-bound the trajectory-level MI objective:

I(\tau; z) = H(z) - H(z \mid s_1, \dots, s_T) \geq \frac{1}{T} \sum_{t=1}^{T} \left[ H(z) - H(z \mid s_t) \right]   (4)
\geq \mathbb{E}_{z \sim q_\phi(z),\, s \sim \pi_\theta(s \mid z)} \left[ \log q_\phi(s \mid z) - \log \pi_\theta(s) \right]   (5)

We would like to optimize the meta-learner under the variational objective in Eq. 5, but optimizing the second term, the policy's state entropy, is in general intractable. Thus, we make the simplifying assumption that the fitted variational marginal distribution matches that of the policy:

\max_{\theta} \; \mathbb{E}_{z \sim q_\phi(z),\, s \sim \pi_\theta(s \mid z)} \left[ \log q_\phi(s \mid z) - \log q_\phi(s) \right]   (6)
= \max_{\theta} \; I(\pi_\theta(s); q_\phi(z)) - D_{\mathrm{KL}}(\pi_\theta(s \mid z) \,\|\, q_\phi(s \mid z)) + D_{\mathrm{KL}}(\pi_\theta(s) \,\|\, q_\phi(s))   (7)

Optimizing Eq. 6 amounts to maximizing the reward rz(s) = log qφ(s|z) − log qφ(s). As shown in Eq. 7, this corresponds to information maximization between the policy's state marginal and the latent task variable, along with terms for matching the task-specific policy data distribution to the corresponding mixture mode and deviating from the mixture's marginal density. We can trade off between component-matching and exploration by introducing a weighting term λ ∈ [0, 1] into rz(s):

r_z(s) = \lambda \log q_\phi(s \mid z) - \log q_\phi(s)   (8)
= (\lambda - 1) \log q_\phi(s \mid z) + \log q_\phi(z \mid s) + C   (9)

where C is a constant with respect to the optimization of θ. From Eq. 9, we can interpret λ as trading off between discriminability of skills and task-specific exploration. Figure 4 shows the effect of tuning λ on the structure-diversity trade-off alluded to at the beginning of §3.

Figure 4: Balancing consistency and exploration with λ in a simple 2D maze environment. Each row shows a progression of tasks developed over the course of training. Each box presents the mean reconstructions under a VAE qφ (Appendix C) of 2048 trajectories. Varying λ of Eq. 8 across rows, we observe that a small λ (top) results in aggressive exploration; a large λ (bottom) yields relatively conservative behavior; and a moderate λ (middle) produces sufficient exploration and a smooth task distribution.
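Before turning to related work, here is a direct translation of Eq. 8 into code, assuming the task scaffold is a Gaussian mixture over encoder embeddings with parameters we name `means`, `covs`, and `log_weights` (our naming, not the paper's):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def task_reward(state_emb, z, means, covs, log_weights, lam=0.5):
    """r_z(s) = lam * log q(s|z) - log q(s) (Eq. 8) for a Gaussian mixture
    task scaffold over embedded states. lam in [0, 1] trades off component
    matching (structure) against exploration (diversity)."""
    # log q(s|z') for every component z'
    log_comp = np.array([multivariate_normal.logpdf(state_emb, mean=m, cov=c)
                         for m, c in zip(means, covs)])
    # log q(s) = logsumexp over z' of [ log q(z') + log q(s|z') ]
    log_marginal = logsumexp(log_weights + log_comp)
    return lam * log_comp[z] - log_marginal
```

With λ = 1 this reduces, up to a constant, to the discriminability reward log qφ(z|s) of Eq. 9, and with λ = 0 to pure density-based exploration, matching the interpretation above.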
4 Related Work
Unsupervised Reinforcement Learning. Unsupervised learning in the context of RL is the problem of enabling an agent to learn about its environment and acquire useful behaviors without human-specified reward functions. A large body of prior work has studied exploration and intrinsic motivation objectives [51, 48, 43, 22, 8, 5, 35, 42]. These algorithms do not aim to acquire skills that can be operationalized to solve tasks, but rather try to achieve wide coverage of the state space; our objective (Eq. 8) reduces to pure density-based exploration with λ = 0. Hence, these algorithms still rely on slow RL [7] in order to adapt to new tasks posed at test-time. Some prior works consider unsupervised pre-training for efficient RL, but these works typically focus on settings in which exploration is not as much of a challenge [63, 17, 14], focus on goal-conditioned policies [44, 40], or have not been shown to scale to high-dimensional visual observation spaces [36, 54]. Perhaps most relevant to our work are unsupervised RL algorithms for learning reward functions via optimizing information-theoretic objectives involving latent skill variables [24, 1, 16, 62]. In particular, with a choice of λ = 1 in Eq. 9 we recover the information maximization objective used in prior work [1, 16], apart from the fact that we simultaneously perform meta-learning. The setting of training a contextual policy with a classifier as qφ in our proposed framework (see Appendix A.3) provides an interpretation of DIAYN as implicitly doing trajectory-level clustering. Warde-Farley et al. [62] also consider accumulation of tasks, but with a focus on goal-reaching and by maintaining a goal reservoir via heuristics that promote diversity.

Meta-Learning. Our work is distinct from the above works in that it formulates a meta-learning approach to explicitly train, without supervision, for the ability to adapt to new downstream RL tasks. Prior work [31, 33, 2] has investigated this unsupervised meta-learning setting for image classification; the setting considered herein is complicated by the added challenges of RL-based policy optimization and exploration. Gupta et al. [25] provide an initial exploration of the unsupervised meta-RL problem, proposing a straightforward combination of unsupervised skill acquisition (via DIAYN) followed by MAML [18], with experiments restricted to environments with fully observed, lower-dimensional state. Unlike these works and other meta-RL works [61, 13, 38, 46, 18, 30, 26, 47, 56, 58], we close the loop to jointly perform task acquisition and meta-learning so as to achieve an automatic curriculum that facilitates joint meta-learning and task-level exploration.

Automatic Curricula. The idea of automatic curricula has been widely explored both in supervised learning and RL. In supervised learning, interest in automatic curricula is based on the hypothesis that exposure to data in a specific order (i.e. a non-uniform curriculum) may allow for learning harder tasks more efficiently [15, 51, 23].
In RL, an additional challenge is exploration; hence, related work in RL considers the problem of curriculum generation, whereby the task distribution is designed to guide exploration towards solving complex tasks [20, 37, 19, 52] or unsupervised pre-training [57, 21]. Our work is driven by similar motivations, though we consider a curriculum in the setting of meta-RL and frame our approach as information maximization.

5 Experiments
We experiment in visual navigation and visuomotor control domains to study the following questions:

• What kind of tasks are discovered through our task acquisition process (the E-step)?
• Do these tasks allow for meta-training of strategies that transfer to test tasks?
• Does closing the loop to jointly perform task acquisition and meta-learning bring benefits?
• Does pre-training with CARML accelerate meta-learning of test task distributions?

Videos are available at the project website https://sites.google.com/view/carml.

5.1 Experimental Setting
The following experimental details are common to the two vision-based environments we consider. Other experimental details are explained in more depth in Appendix B.

Meta-RL. CARML is agnostic to the meta-RL algorithm used in the M-step. We use the RL2 algorithm [13], which has previously been evaluated on simpler visual meta-RL domains, with a PPO [53] optimizer. Unless otherwise stated, we use four episodes per trial (compared to the two episodes per trial used in [13]), since the settings we consider involve more challenging task inference.

Baselines. We compare against: 1) PPO from scratch on each evaluation task, 2) pre-training with random network distillation (RND) [8] for unsupervised exploration, followed by fine-tuning on evaluation tasks, and 3) supervised meta-learning on the test-time task distribution, as an oracle.

Variants. We consider variants of our method to ablate the role of design decisions related to task acquisition and joint training: 4) pipelined (most similar to [25]) – task acquisition with a contextual policy, followed by meta-RL with RL2; 5) online discriminator – task acquisition with a purely discriminative qφ (akin to online DIAYN); and 6) online pretrained-discriminator – task acquisition with a discriminative qφ initialized with visual features trained via Algorithm 2.

5.2 Visual Navigation
The first domain we consider is first-person visual navigation in ViZDoom [32], involving a room filled with five different objects (drawn from a set of 50). We consider a setup akin to those featured in [12, 65] (see Figure 3). The true state consists of continuous 2D position and continuous orientation, while observations are egocentric images with a limited field of view. Three discrete actions allow for turning right or left, and moving forward. We consider two ways of sampling the CMP C. Fixed: fix a set of five objects and positions for both unsupervised meta-training and testing. Random: sample five objects and randomly place them (thereby randomizing the state space and dynamics).

Visualizing the task distribution. Modeling pixel observations reveals trajectory-level organization in the underlying true state space (Figure 5). Each map portrays trajectories of a mixture component, with position encoded in 2D space and orientation encoded in the jet color-space; an example of interpreting the maps is shown left of the legend.
The components of the mixture model reveal structured groups of trajectories: some components correspond to exploration of the space (marked with a green border), while others are more strongly directed towards specific areas (blue border). The skill maps of the fixed and random environments are qualitatively different: tasks in the fixed room tend towards interactions with objects or walls, while many of the tasks in the random setting sweep the space in a particular direction.

Figure 5: Skill maps for visual navigation (panels: Fixed, Random; Step 1, Step 5). We visualize some of the discovered tasks by projecting trajectories of certain mixture components into the true state space. White dots correspond to fixed objects. The legend indicates orientation as color; on its left is an interpretation of the depicted component. Some tasks seem to correspond to exploration of the space (green border), while others are more directed towards specific areas (blue border). Comparing tasks earlier and later in the curriculum (step 1 to step 5), we find an increase in structure.

Figure 6 (panels: (a) ViZDoom, (b) Sawyer, (c) Variants (ViZDoom Random)): CARML enables unsupervised meta-learning of skills that transfer to downstream tasks. Direct transfer curves (marker and dotted line) represent a meta-learner deploying for just 200 time steps at test time. Compared to CARML, PPO and RND Init sample the test reward function orders of magnitude more times to perform similarly on a single task. Finetuning the CARML policy also allows for solving individual tasks with significantly fewer samples. The ablation experiments (c) assess both direct transfer and finetuning for each variant. Compared to variants, the CARML task acquisition procedure results in improved transfer due to mitigation of task mode-collapse and adaptation of the task distribution.

We can also see the evolution of the task distribution at earlier and later stages of Algorithm 1. While initial tasks (produced by a randomly initialized policy) tend to be less structured, we later see refinement of certain tasks as well as the emergence of others as the agent collects new data and acquires strategies for performing existing tasks.

Do acquired skills transfer to test tasks? We evaluate how well the CARML task distribution prepares the agent for unseen tasks. For both the fixed and randomized CMP experiments, each test task specifies a dense goal-distance reward for reaching a single object in the environment. In the randomized environment setting, the target objects at test-time are held out from meta-training. The PPO and RND-initialized baseline policies, and the finetuned CARML meta-policy, are trained for a single target (a specific object in a fixed environment), with 100 episodes per PPO policy update.

In Figure 6a, we compare the success rates on test tasks as a function of the number of samples with supervised rewards seen from the environment. Direct transfer performance of meta-learners is shown as points, since in this setting the RL2 agent sees only four episodes (200 samples) at test-time, without any parameter updates. We see that direct transfer is significant, achieving up to 71% and 59% success rates on the fixed and randomized settings, respectively.
The baselines require over two orders of magnitude more test-time samples to solve a single task at the same level.

While the CARML meta-policy does not consistently solve the test tasks, this is not surprising since no information is assumed about target reward functions during unsupervised meta-learning; inevitable discrepancies between the meta-train and test task distributions mean that meta-learned strategies will be suboptimal for the test tasks. For instance, during testing, the agent sometimes 'stalls' before the target object (once inferred), in order to exploit the inverse distance reward. Nevertheless, we also see that finetuning the CARML meta-policy trained on random environments on individual tasks is more sample efficient than learning from scratch. This suggests that deriving reward functions from our mixture model yields useful tasks insofar as they facilitate learning of strategies that transfer.

Benefit of reorganization. In Figure 6a, we also compare performance across early and late outer-loop iterations of Algorithm 1, to study the effect of adapting the task distribution (the CARML E-step) by reorganizing tasks and incorporating new data. In both cases, the number of outer-loop iterations is K = 5. Overall, the refinement of the task distribution, which we saw in Figure 5, leads to improved transfer performance. The effect of reorganization is further visualized in Appendix F.

Variants. From Figure 6c, we see that the purely online discriminator variant suffers in direct transfer performance; this is due to the issue of mode-collapse in the task distribution, wherein the task distribution lacks diversity. Pretraining the discriminator encoder with Algorithm 2 mitigates mode-collapse to an extent, improving task diversity as the features and task decision boundaries are first fit on a corpus of (randomly collected) trajectories. Finally, while the distribution of tasks eventually discovered by the pipelined variant may be diverse and structured, meta-learning the corresponding tasks from scratch is harder. More detailed analysis and visualization is given in Appendix E.

5.3 Visual Robotic Manipulation
To experiment in a domain with different challenges, we consider a simulated Sawyer arm interacting with an object in MuJoCo [60], with end-effector continuous control in the 2D plane. The observation is a bottom-up view of a surface supporting an object (Figure 7); the camera is stationary, but the view is no longer egocentric and part of the observation is proprioceptive. The test tasks involve pushing the object to a goal (drawn from the set of reachable states), where the reward function is the negative distance to the goal state. A subset of the skill maps is provided in Figure 7.

Figure 7: (Left) Skill maps for visuomotor control. Red encodes the true position of the object, and light blue that of the end-effector. Tasks correspond to moving the object to various regions (see Appendix D for more skill maps and analysis).
(Right) Observation and third-person view from the environment, respectively.

Figure 8 (panels: (a) ViZDoom (random), (b) Sawyer; curves: RL2 from Scratch, CARML init (Ours), Encoder init (Ours)): Finetuning the CARML meta-policy allows for accelerated meta-learning of the target task distribution. Curves reflect error bars across three random seeds.

Do acquired skills directly transfer to test tasks? In Figure 6b, we evaluate the meta-policy on the test task distribution, comparing against baselines as previously. Despite the increased difficulty of control, our approach allows for meta-learning skills that transfer to the goal-distance reward task distribution. We find that transfer is weaker compared to the visual navigation (fixed version): one reason may be that the environment is not as visually rich, resulting in a significant gap between the CARML and the object-centric test task distributions.

5.4 CARML as Meta-Pretraining
Another compelling form of transfer is pre-training of an initialization for accelerated supervised meta-RL of target task distributions. In Figure 8, we see that the initialization learned by CARML enables effective supervised meta-RL with significantly fewer samples. To separate the effect of learning the recurrent meta-policy from that of the visual representation, we also compare to initializing only the pre-trained encoder. Thus, while direct transfer of the meta-policy may not directly result in optimal behavior on test tasks, accelerated learning of the test task distribution suggests that the acquired meta-learning strategies may be useful for learning related task distributions, effectively acting as a pre-training procedure for meta-RL.

6 Discussion
We proposed a framework for inducing unsupervised, adaptive task distributions for meta-RL that scales to environments with high-dimensional pixel observations. Through experiments in visual navigation and manipulation domains, we showed that this procedure enables unsupervised acquisition of meta-learning strategies that transfer to downstream test task distributions in terms of direct evaluation, more sample-efficient fine-tuning, and more sample-efficient supervised meta-learning. Nevertheless, the following key issues are important to explore in future work.

Task distribution mismatch. While our results show that useful structure can be meta-learned in an unsupervised manner, results like the stalling behavior in ViZDoom (see §5.2) suggest that direct transfer of unsupervised meta-learning strategies suffers from a no-free-lunch issue: there will always be a gap between unsupervised and downstream task distributions, and more so with more complex environments. Moreover, the semantics of target tasks may not necessarily align with especially discriminative visual features. This is part of the reason why transfer in the Sawyer domain is less successful. Capturing other forms of structure useful for stimulus-reward association might involve incorporating domain-specific inductive biases into the task-scaffold model.
Another way forward is the semi-supervised setting, whereby data-driven bias is incorporated at meta-training time.

Validation and early stopping. Since the objective optimized by the proposed method is non-stationary and in no way guaranteed to be correlated with the objectives of test tasks, one must provide some mechanism for validation of iterates.

Form of skill-set. For the main experiments, we fixed the number of discrete tasks to be learned (without tuning this), but one should consider how the set of skills can be grown or parameterized to have higher capacity (e.g. a multi-label or continuous latent). Otherwise, the task distribution may become overloaded (complicating task inference) or limited in capacity (preventing coverage).

Accumulation of skill. We mitigate forgetting with the simple solution of reservoir sampling. Better solutions involve studying an intersection of continual learning and meta-learning.

Acknowledgments
We thank the BAIR community for helpful discussion, and Michael Janner and Oleh Rybkin in particular for feedback on an earlier draft. AJ thanks Alexei Efros for his steadfastness and advice, and Sasha Sax and Ashish Kumar for discussion. KH thanks his family for their support. AJ is supported by the PD Soros Fellowship. This work was supported in part by the National Science Foundation, IIS-1651843, IIS-1700697, and IIS-1700696, as well as Google.

References
[1] Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.
[2] Antreas Antoniou and Amos Storkey. Assume, augment and learn: unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884v3, 2019.
[3] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Neural Information Processing Systems (NeurIPS), 2004.
[4] Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1995.
[5] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Neural Information Processing Systems (NeurIPS), 2016.
[6] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In International Conference on Machine Learning (ICML), 2017.
[7] Matthew Botvinick, Sam Ritter, Jane X Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis. Reinforcement learning, fast and slow. Trends in Cognitive Science, 23(5), 2019.
[8] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations (ICLR), 2019.
[9] Adam J. Calhoun, Sreekanth H. Chalasani, and Tatyana O. Sharpee. Maximally informative foraging by Caenorhabditis elegans. eLife, 3, 2014.
[10] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), 2018.
[11] Rich Caruana. Multitask learning.
Machine Learning, 28(1), 1997.
[12] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In AAAI Conference on Artificial Intelligence, 2018.
[13] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[14] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning (CoRL), 2017.
[15] Jeffrey L Elman. Learning and development in neural networks: the importance of starting small. Cognition, 48(1), 1993.
[16] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: learning skills without a reward function. In International Conference on Learning Representations (ICLR), 2019.
[17] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In International Conference on Robotics and Automation (ICRA), 2017.
[18] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.
[19] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning (ICML), 2017.
[20] Carlos Florensa, David Held, Markus Wulfmeier, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning (CoRL), 2017.
[21] Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
[22] Justin Fu, John Co-Reyes, and Sergey Levine. EX2: exploration with exemplar models for deep reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2017.
[23] Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In International Conference on Machine Learning (ICML), 2017.
[24] Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
[25] Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, and Sergey Levine. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640, 2018.
[26] Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. In Neural Information Processing Systems (NeurIPS), 2018.
[27] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Neural Information Processing Systems (NeurIPS), 2017.
[28] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised learning. In The Elements of Statistical Learning. Springer, 2009.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[30] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients.
In Neural Information Processing Systems (NeurIPS), 2018.
[31] Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. In International Conference on Learning Representations (ICLR), 2019.
[32] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: a Doom-based AI research platform for visual reinforcement learning. In Conference on Computational Intelligence and Games (CIG), 2016.
[33] Siavash Khodadadeh, Ladislau Bölöni, and Mubarak Shah. Unsupervised meta-learning for few-shot image and video classification. arXiv preprint arXiv:1811.11819v1, 2018.
[34] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2014.
[35] Joel Lehman and Kenneth O Stanley. Abandoning objectives: evolution through the search for novelty alone. Evolutionary Computation, 19(2), 2011.
[36] Manuel Lopes, Tobias Lang, Marc Toussaint, and Pierre-Yves Oudeyer. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Neural Information Processing Systems (NeurIPS), 2012.
[37] Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. Transactions on Neural Networks and Learning Systems, 2019.
[38] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations (ICLR), 2018.
[39] Shakir Mohamed and Danilo J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2015.
[40] Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Neural Information Processing Systems (NeurIPS), 2018.
[41] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[42] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2018.
[43] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.
[44] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. In International Conference on Learning Representations (ICLR), 2018.
[45] Jean Piaget. The Construction of Reality in the Child. Basic Books, 1954.
[46] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning (ICML), 2019.
[47] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: proximal meta-policy search. In International Conference on Learning Representations (ICLR), 2019.
[48] Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment – an introduction.
In\n\nGuided Self-Organization: Inception. Springer, 2014.\n\n[49] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approxi-\n\nmators. In International Conference on Machine Learning, pages 1312\u20131320, 2015.\n\n[50] J\u00fcrgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to\n\nlearn: the meta-meta-... hook. PhD thesis, Technische Universit\u00e4t M\u00fcnchen, 1987.\n\n[51] J\u00fcrgen Schmidhuber. Driven by compression progress: a simple principle explains essential\naspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art,\nscience, music, jokes. In Anticipatory Behavior in Adaptive Learning Systems. Springer-Verlag,\n2009.\n\n[52] J\u00fcrgen Schmidhuber. POWERPLAY: training an increasingly general problem solver by\ncontinually searching for the simplest still unsolvable problem. arXiv preprint arXiv:1112.5309,\n2011.\n\n[53] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal\n\npolicy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[54] Pranav Shyam, Wojciech Ja\u00b4skowski, and Faustino Gomez. Model-based active exploration. In\n\nInternational Conference on Machine Learning (ICML), 2019.\n\n[55] Satinder Singh, Andrew G Barto, and Nuttapong Chentanez. Intrinsically motivated reinforce-\n\nment learning. In Neural Information Processing Systems (NeurIPS), 2005.\n\n[56] Bradly C Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and\nIlya Sutskever. Some considerations on learning to explore via meta-reinforcement learning. In\nNeural Information Processing Systems (NeurIPS), 2018.\n\n[57] Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob\nFergus. Intrinsic motivation and automatic curricula via asymmetric self-play. In International\nConference on Learning Representations (ICLR), 2018.\n\n[58] Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to learn:\n\nmeta-critic networks for sample ef\ufb01cient learning. arXiv preprint arXiv:1706.09529, 2017.\n\n[59] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media,\n\n1998.\n\n[60] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: a physics engine for model-based\n\ncontrol. In International Conference on Intelligent Robots and Systems (IROS), 2012.\n\n[61] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos,\nCharles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. In\nAnnual Meeting of the Cognitive Science Society (CogSci), 2016.\n\n[62] David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and\nVolodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In\nInternational Conference on Learning Representations (ICLR), 2019.\n\n12\n\n\f[63] Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed\nto control: a locally linear latent dynamics model for control from raw images. In Neural\nInformation Processing Systems (NeurIPS), 2015.\n\n[64] Robert W White. Motivation reconsidered: the concept of competence. Psychological Review,\n\n66(5), 1959.\n\n[65] Annie Xie, Avi Singh, Sergey Levine, and Chelsea Finn. Few-shot goal inference for visuomotor\n\nlearning and planning. 