NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5115
Title:Guided Meta-Policy Search

Reviewer 1

The proposed method is a novel (and elegant) combination of existing techniques from off-policy learning, imitation learning, guided policy search, and meta-learning. The resulting algorithm is new and I believe it can be valuable to researchers in this field. I think the paper does not sufficiently discuss the related work of Rakelly et al. [1] (PEARL). That work is cited in the related work section as one of the methods based on "recurrent or recursive neural networks". If I remember correctly, they don't use recurrence (only in one of their ablation studies, to show that their method *outperforms* a recurrent encoder). Furthermore, PEARL should be pointed out as being an *off-policy* meta-RL algorithm. While I think the methods are sufficiently different, this difference should be explained in the paper. I also believe that a comparison in terms of performance and sample efficiency (during meta-training) would be an interesting addition to the experimental evaluation. The quality and clarity of the paper are high. The method is clearly motivated, and the technical contribution is well explained. I think the pseudo-code helps a lot, and together with the supplementary material and code provided, an expert reader should be able to reproduce the results. Overall, I think this paper has high significance. The proposed method is novel and well executed, and there are multiple good ideas that go along with it. I'm surprised the paper does not compare to PEARL [1], but I believe this can be easily addressed.
References: [1] Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." arXiv preprint arXiv:1903.08254 (2019).
------------------------------------------------------------------------------------------------------
- UPDATE - I would like to thank the authors for their rebuttal and for adding the comparison to PEARL; these are very interesting insights!
Together with these results, I think this is a strong submission and I therefore increase my score to 7, and vote for acceptance.

Reviewer 2

This work builds on the Model-Agnostic Meta-Learning (MAML) algorithm and adapts it to make it more sample efficient and able to use expert demonstrations. To the best of my knowledge, the work is original. The paper is well presented and polished. Notation is clear. The prose is easy to understand and I could see no obvious typos. The article is clear and well organized. The supplementary material is quite thorough at presenting information that could be used to reproduce the experimental results. I believe that this is an important work in the field of meta-learning for reinforcement learning. Improving sample efficiency is key to making those algorithms useful in practice, and this paper makes a clear contribution in that respect.

Reviewer 3

originality: This paper follows on from the state of the art in meta-learning. Meta-learning consists of improving the sample efficiency of any reinforcement learning algorithm by finding an initial set of parameters from which it can quickly adapt to any task of interest. This paper proposes a new meta-learning algorithm that decouples meta-objective learning and meta-training by using two nested loops: the inner loop can be any reinforcement learning method, while the outer loop optimizes the meta-objective using a supervised learning method. It goes pretty much against the current trend of a unified deep learning black box and is therefore sufficiently original on its own.
quality: The most significant application cases are discussed thoroughly and the authors provide strong theoretical convergence guarantees. However, the theoretical analysis covers only the particular case of on-policy learning, while the method is intended to be used for off-policy learning. The examples are appropriate to assess the performance of the method, although they are quite simple and not very challenging control problems in the first place. I'm afraid the method may still be intractable in practice for real-world applications.
clarity: The previous works are very well introduced. The relation to other works and the theoretical background are made very clear throughout the paper, making this paper readable for those without knowledge of meta-learning. The implementation details are very helpful and make the results pretty straightforward to reproduce.
significance: This paper is a valuable contribution to the literature on meta-learning, showing significant improvement over the state of the art. The idea of designing a meta-learning algorithm that explicitly separates meta-training and meta-objective learning into two distinct phases, and that can combine any reinforcement learning method with any supervised learning method, is clear and surprisingly effective. This architecture makes it convenient to build on the method, try different combinations, and improve its performance even further. The paper presents a theoretical convergence analysis and a significant improvement in sample efficiency relative to some reference examples. The method applies successfully to the challenging case of sparse rewards, which, to my knowledge, is not the case for other meta-learning methods.
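
To make the nested-loop structure described above concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the scalar "policy" `theta`, the quadratic task loss standing in for a reinforcement learning update, the assumption that each task's expert action equals its goal, and all function names and step sizes (`inner_adapt`, `outer_gradient`, `meta_train`, `alpha`, `beta`) are hypothetical choices made only for illustration.

```python
import random


def inner_adapt(theta, goal, alpha):
    """Inner loop: one gradient step on the toy task loss 0.5 * (theta - goal)**2.

    This is only a stand-in for a reinforcement learning update.
    """
    return theta - alpha * (theta - goal)


def outer_gradient(theta, goal, expert_action, alpha):
    """Gradient of the supervised loss 0.5 * (adapted - expert_action)**2
    with respect to the initial parameter theta, taken through the inner step.
    """
    adapted = inner_adapt(theta, goal, alpha)
    # Chain rule: d(adapted)/d(theta) = 1 - alpha for the toy inner step.
    return (adapted - expert_action) * (1.0 - alpha)


def meta_train(goals, alpha=0.5, beta=0.5, steps=200, seed=0):
    """Outer loop: fit the initial parameter by imitating per-task experts."""
    rng = random.Random(seed)
    theta = rng.uniform(-1.0, 1.0)
    # Assumption: per-task experts are already available; here the expert
    # action for a task is simply that task's goal.
    experts = list(goals)
    for _ in range(steps):
        grads = [
            outer_gradient(theta, goal, expert, alpha)
            for goal, expert in zip(goals, experts)
        ]
        theta -= beta * sum(grads) / len(grads)
    return theta


if __name__ == "__main__":
    task_goals = [-2.0, 0.0, 4.0]
    theta0 = meta_train(task_goals)
    for g in task_goals:
        print(g, inner_adapt(theta0, g, alpha=0.5))
```

In this toy setting the outer loop never needs reward signals, only expert targets, which is the separation the review highlights; in a realistic setting the inner step and the expert data would come from actual reinforcement learning and demonstration sources.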