NeurIPS 2020

On Reward-Free Reinforcement Learning with Linear Function Approximation

Meta Review

The authors study sequential decision processes without reward function. The goal is to learn the transition dynamics such that various reward functions could be optimised efficiently in the future. The authors extend recent work to the linear function approximation case. They provide an analysis of the sample complexity, and show that while for linear MDPs complexity is polynomial, this is not true for MDPs with a linear optimal value functions, providing insight on the hardness of this second class of problems. The strengths of the paper are the theoretical development of the algorithm and the lower bound for MDPs with linear optimal Q functions. The proofs are rigorous. A limitation of the paper is that it is relatively close to the earlier work on reward-free RL in the tabular case and MDPs with linear Q functions. This limits the novelty. One reviewer suggested adding a (toy) experiment to better understand the practical implications of the theory. Although all reviewers agreed that the paper is somewhat incremental, most reviewers agreed that the results are nevertheless strong and of interest to the community. Especially the lower bound was an interesting insight to many of the reviewers. Thus, I’d like to recommend this paper for acceptance. In their rebuttal, the authors have clarified several aspects of the paper. I’d like to ask the authors to add these clarifications to the final draft of the paper.