NeurIPS 2020

Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis


Meta Review

The results in this work are novel and non-trivial, and the reviewers generally supported the work. The key limitation to address is to be clearer about the theoretical and empirical outcomes of the paper. One reviewer highlighted that the variance reduction approach does not improve the rate in the Markovian setting, and that the experiments are not compelling in showing a notable rate improvement. It would be more useful to state explicitly that no rate improvement is obtained in the Markovian setting, and to discuss this (somewhat negative, but nonetheless realistic) result. Additionally, it would be useful to explain more clearly the improved rate of VRTDC over VRTD: showing that a TDC variant converges faster than a TD variant is a significant result, and explaining why it is possible would better motivate the approach. Here, it would also be useful to contrast the rate of VRTDC with that of TDC, not only with VRTD.