NeurIPS 2020

Is Long Horizon RL More Difficult Than Short Horizon RL?


Meta Review

This paper addresses a COLT 2018 open problem regarding whether there exists an algorithm for tabular, episodic ("long horizon") RL whose sample complexity depends only poly-logarithmically, rather than polynomially, on the horizon length, H. It answers in the affirmative, providing an algorithm and a PAC bound with poly-log dependence on H. This result furthers our understanding of episodic RL and shows that, under a few assumptions, it is indeed not much more difficult than "short horizon" RL.

The reviewers agree that this work is interesting and timely, and that the paper is largely clear and well written. However, the reviewers also agree that the paper somewhat overstates its contributions. The proposed bound is stronger than prior work in the regime where H > 1/epsilon, but weaker when this does not hold, due to large polynomial factors in the state and action space sizes, S and A. For the sake of honesty, the paper should be upfront about this limitation. Another critique is that the proposed algorithm is computationally inefficient. This is somewhat expected of theoretical results, however, and does not detract from the fact that there exists an algorithm settling the open problem. More efficient algorithms may well achieve the same sample complexity; establishing this would make nice follow-up work.

I strongly encourage the authors to incorporate *all* of the feedback from the reviews -- especially w.r.t. over-claiming -- when finalizing the paper.
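To make the regime comparison above concrete, the trade-off has roughly the following shape. This is a schematic only: the exponents p, q, c are placeholders I am using for illustration, not the actual constants from the paper or from prior work.

\[
N_{\text{prior}} \;=\; \tilde{O}\!\left( \frac{S A \cdot \mathrm{poly}(H)}{\epsilon^{2}} \right)
\qquad \text{vs.} \qquad
N_{\text{proposed}} \;=\; \tilde{O}\!\left( \frac{S^{p} A^{q} \cdot \mathrm{polylog}(H)}{\epsilon^{c}} \right),
\quad p, q > 1 .
\]

Roughly speaking, when H > 1/epsilon the horizon is long enough that the poly(H) factor in the prior bound outweighs the extra S, A, and 1/epsilon factors in the proposed bound; when H <= 1/epsilon, the comparison reverses and the proposed bound is the weaker one.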