Reviews: Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

This paper analyzes the non-asymptotic convergence for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. The results are new and important to the field, and the analysis in this setting seems nontrivial. In addition, the paper also develops a new variant of TDC under a blockwise diminishing stepsize, and proves it asymptotically convergent with an arbitrarily small training error at linear convergence rate. Extensive experiments demonstrate that the new TDC variant can converge as fast as vanilla TDC with constant stepsize, and at the same time it enjoys comparable accuracy as TDC with diminishing stepsize. Overall, the paper has both analytical as well as practical value. However, the following issues need to be addressed. 1. The non-asymptotic convergence analysis for other GTD algorithms under a non-i.i.d. Markovian sample path has been studied in e.g., [30,34]. Hence, the new challenges of analyzing TDC relative to GTD in [30, 34] need to be compared and highlighted. 2. The paper generalizes the stagewise stepsize in the conventional (one timescale) optimization e.g., [34] to the considered two-timescale optimization. The new challenges of analyzing algorithms with blockwise diminishing stepsize in this new settings need to be discussed. 3. It is mentioned that the non-asymptotic analysis can be applied to studying other off-policy algorithms such as the actor-critic and the gradient Q-learning algorithms. A comment is due on how the theoretical guarantees can be affected in these settings?

Reviewer 2

This paper provides finite-time bounds for TD with gradient correction (TDC). While non-asymptotic behavior of TD and asymptotic behavior of TDC have been studied before, non-asymptotic analysis for TDC is new and interesting given the importance of off-policy learning and the challenge of step-size tuning in two time-scale algorithms. The paper is well-written, the discussion on the impact of the two step-sizes is clear and is supported by experiments. Questions: - The plots show the error between \theta and \theta^*. How is \theta^* obtained for these domains? - How would the worst-case errors predicted by the bound compare to the errors observed empirically in the experiments? - Besides implications for the choice of step-size, do these bounds provide insight on what properties of the problem, the behavior policy, and the representation affect the rate of convergence? - The first paragraph says that gradient-based TD algorithms are "more flexible than on-policy learning in practice." What does more flexible mean here?

Reviewer 3

To my knowledge the proposed analysis about two time-scale TDC under diminishing step-size and constant step-size is novel. However I need to acknowledge that I am not familiar with all related work in this area and I did not go through the proof details. Later, an blockwise diminishing step-size method is proposed to combine the advantage of constant step-size and diminishing step-size. However, to me, it looks like this ideal property need an careful choice of blocksize, and Thm 3 seems verified that. I have some concern about how to set the blocksize properly here without prior knowledge and how robust the algorithm is with respect to the blocksize hyperparameter. === After Rebuttal === I'm glad that the authors could show the algorithm performance is very robust to the blocksize. The author's further explanation addressed my concern. So I will change my score accordingly.

Paper ID:	5672
Title:	Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

Reviewer 1

Reviewer 2

Reviewer 3