NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5531
Title: Unsupervised Curricula for Visual Meta-Reinforcement Learning

Reviewer 1

This work clearly describes its method and contrasts it against a similar approach (DIAYN). In short, the authors use a data-driven method to learn and approximate salient visual features along trajectories as tasks, using the DeepCluster method to aggregate new task-level information. This is additionally important because it avoids the GAN-style problems that other methods are prone to, which should, in theory, result in more stable and potentially interpretable results. This work is significant in its simultaneous contributions to unsupervised meta-learning and to vision-based clustering in an RL setting, as the authors discuss in Section 3.4.

Reviewer 2

Post-rebuttal: Thank you for the responses; they have cleared up some of the concerns raised.

-------

The authors present CARML, an unsupervised method to generate tasks for meta RL. Previous approaches either require the manual definition of task spaces, which is especially hard, or rely on pipelined approaches in which interactions with a CMP yield a task distribution. The authors propose to use a latent variable density model of the meta-learner's behaviour in order to adapt the task distribution, and then to meta-learn on those tasks. The paper is well written; everything is motivated, and the choices made are clearly explained. The authors provide results for their proposed algorithm on VizDoom and a Sawyer arm, in which they show that their method requires significantly fewer samples to learn than baseline approaches.

I really liked reading the paper; I think unsupervised task generation is very important for meta-learning, and this paper provides a nice way of doing it. I do wonder, however, how D is handled. It seems to me that every trajectory is added to the reservoir, which increases the complexity of fitting the task giver q. Don't some trajectories become obsolete as the meta-learner evolves? Fitting q on D this way could harm performance. I feel that having a smart way of dealing with the ever-increasing D is important.

One thing that would have been nice to include is how the approach compares to [22], as it seems to be a motivating paper for this research.
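One "smart way of dealing with the ever-increasing D" would be classical reservoir sampling, which keeps a fixed-size, uniformly sampled subset of all trajectories seen so far. The sketch below is purely illustrative (the class and names are my own, not from the paper), showing how the cost of fitting q could be bounded:

```python
import random

class TrajectoryReservoir:
    """Fixed-capacity trajectory buffer maintained via reservoir sampling.

    Keeps a uniform random subset of all trajectories ever added, so the
    cost of refitting a task model q on the buffer stays O(capacity)
    no matter how long the meta-learner runs. (Illustrative sketch only;
    not the mechanism described in the paper.)
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, trajectory):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(trajectory)
        else:
            # After n_seen additions, each trajectory remains in the
            # buffer with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.buffer[j] = trajectory
```

A time-decayed variant (biasing toward recent trajectories) might address the staleness concern more directly, since uniform retention still keeps some behaviour from early, obsolete policies.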

Reviewer 3

This paper presents a method for learning a distribution of tasks to feed to an agent that's learning via meta RL, while simultaneously optimizing the agent to adapt more quickly to tasks sampled from this distribution. The task distribution is trained using an objective that maximizes mutual information between a latent task variable and the trajectories produced by the meta RL agent, and the meta RL agent is, more or less, trained to maximize this same mutual information. The overall optimization relies on variational lower bounds on mutual information and on the RL^2 algorithm for meta RL. Experiments are provided which show that the task distributions and meta RL agents trained in this co-adaptive manner exhibit some potentially useful behaviors, e.g. an improved ability to quickly solve new tasks sampled from an "actual" task distribution, i.e. a task distribution which is not the one co-adapted with the agent.

I think the ideas explored in this paper are reasonably interesting, and may spark some more practical insights in the rest of the research community. The paper was clear and didn't make unnecessary claims. My main criticism is that it's a bit hard to predict the strength and future value of this work in the absence of prior work that it outperforms, or of a strong argument for why "automatic curriculum learning for unsupervised meta RL" is a conjunction of keywords worth exploring. Extending that point, there also isn't much effort to explain why this particular approach to this particular problem is especially worthwhile, which would be helpful in the absence of prior efforts to solve the problem. E.g., as an alternative, one could train an agent and task distribution using one of the dozens of DIAYN-like approaches, and then train another agent to solve tasks from this distribution via meta RL, using reward functions constructed similarly to the ones used in the proposed method.
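For readers unfamiliar with the objective being discussed: mutual-information objectives of this kind are typically made tractable with a Barber–Agakov-style variational lower bound. A sketch of the standard bound (notation is mine, not taken from the paper) is:

```latex
I(\tau; z) \;=\; \mathcal{H}(z) - \mathcal{H}(z \mid \tau)
\;\ge\; \mathcal{H}(z) + \mathbb{E}_{p(\tau, z)}\!\left[\log q_\phi(z \mid \tau)\right],
```

where $\tau$ is a trajectory, $z$ the latent task variable, and $q_\phi(z \mid \tau)$ a learned variational approximation to the true posterior $p(z \mid \tau)$; the gap in the bound is the expected KL divergence between the two, so the bound tightens as $q_\phi$ improves.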
Post Rebuttal: I appreciate the new experiments aimed at separating the benefits of improved diversity-based skill learning and co-adaptive vs pipelined approaches to incorporating this in a meta RL setting. I'd recommend refactoring the presentation a bit as well, to emphasize that these two contributions -- i.e. improved diverse skill learning compared to DIAYN and a co-adaptive approach to using diverse skill learning for meta RL -- are separate but complementary. It may be difficult for people who aren't already familiar with these topics to perform the appropriate factorization if it is not explicit in the technical presentation.