NeurIPS 2020

Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards

Meta Review

The paper presents an approach to sparse-reward settings that stores high-reward trajectories/states from past experience and uses them to perform more directed exploration. The idea of reusing past trajectories to direct exploration at future timesteps has been attempted several times in the literature, but this paper finally seems to get it right without relying on a deterministic environment or requiring expert demonstration trajectories. All the reviewers liked the idea of the paper but had concerns regarding the baseline comparisons. The authors provided a rebuttal that addressed some of these concerns. Given the initial reviews and the authors' rebuttal, the reviewers agreed that the paper provides sufficient insights to be accepted. However, the baseline comparisons still need to be improved for the final version, for instance by comparing to off-policy and model-based RL methods. Please refer to the reviewers' final comments and address their concerns in the camera-ready.