NeurIPS 2020

Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction


Meta Review

Summary: This paper proposes a new model-based RL algorithm in which, instead of learning single-step state transition probabilities, the agent learns the discounted state-occupancy distribution over an infinite horizon. The method can be seen as an extension of the successor representation to continuous state-action spaces and to generative, infinite-horizon prediction. The occupancy distribution is modeled as an energy function and trained with temporal-difference (TD) learning, using a GAN. Experiments on a few MuJoCo problems clearly show the advantages of the proposed approach over RL algorithms such as PPO and SAC. The reviewers agree that the proposed method is novel and interesting, and that it is validated by the simulation experiments. There are some concerns about the limited scope of the experiments and about scalability to high-dimensional observations such as images, where learning occupancy distributions is clearly challenging. The fact that the occupancy distribution is policy-dependent, and must therefore be relearned when the policy changes, is also a major limitation.
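For context, the TD structure the summary refers to can be sketched as follows. This is a schematic of the discounted occupancy's Bellman-style self-consistency condition, written here in the meta reviewer's own notation rather than the paper's; symbols ($\mu_\theta$ for the learned occupancy model, $p$ for the environment transition distribution, $\pi$ for the policy, $s_e$ for the predicted future state) are illustrative assumptions:

$$
\mu_\theta(s_e \mid s, a) \;=\; (1-\gamma)\, p(s_e \mid s, a) \;+\; \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')} \big[ \mu_\theta(s_e \mid s', a') \big]
$$

With probability $1-\gamma$ the model predicts the next state directly; otherwise it bootstraps from its own prediction at the next state-action pair, which is what makes the target horizon infinite and, because $a'$ is drawn from $\pi$, what makes the learned distribution policy-dependent.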