NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
*** Update: The authors have addressed my comments. But I also agree with certain points brought up by Reviewer 4 (e.g., the writing could be more focused on TD), so I have decided to keep the same score. ***

I like the linear system perspective on the update rule of temporal difference learning. Although similar ideas have appeared implicitly in proofs here and there, it is good to have a comprehensive treatment based on control theory. Overall I like the perspective provided by the paper. However, I am also worried that, while one can certainly formulate TD learning as a linear system and then apply known results from control theory, this does not seem to provide strong insights into temporal difference learning per se if it introduces additional unwarranted assumptions to characterize the system's behavior. Below I list some concerns about certain technical points which I hope the authors can clarify.

1. The assumption E[b(z^k)] = 0. I am wondering whether this assumption is reasonable in practice. For example, is it true for the b(z^k) of the classic rule (13)? It seems to depend on the distribution that determines \theta^* and on whether the problem is discounted. Also, do we still need this assumption in Theorem 2? It is not clear, though it seems to be required according to line 259.

2. The stability of the linear systems. The authors provide perturbation-based analyses to show that the system can be stable when the step size is small enough. Nonetheless, in the end, the argument still hinges on \bar{A} being Hurwitz. How can this be ensured? For example, under what assumptions is this true for the simple update rule in (13)?

Minor questions:
- Why is there a minus sign in line 213? Is it due to the Hurwitz assumption?

Typos:
- In (9), shouldn't it be q_i^k?
- A parenthesis is missing in line 142.
- Line 167: x^k -> s^k
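To make the Hurwitz question concrete: in the classical on-policy TD(0) setting, the mean-dynamics matrix \bar{A} = \Phi^T D (\gamma P - I) \Phi is known to be Hurwitz whenever the features are full rank and D carries the stationary distribution of P (Tsitsiklis and Van Roy, 1997). Below is a minimal numerical sketch of that check; it assumes this classical form of \bar{A} rather than the paper's exact construction in (13), so it is illustrative only.

```python
import numpy as np

np.random.seed(0)
n, d, gamma = 6, 3, 0.9            # number of states, feature dim, discount

# Random ergodic transition matrix P and its stationary distribution pi
P = np.random.rand(n, n)
P /= P.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                      # Perron eigenvector, normalized to a distribution
D = np.diag(pi)

Phi = np.random.randn(n, d)         # feature matrix, full rank a.s. for Gaussian entries

# Classical on-policy TD(0) mean dynamics: d(theta)/dt = A_bar @ theta + b_bar
A_bar = Phi.T @ D @ (gamma * P - np.eye(n)) @ Phi

# Hurwitz check: every eigenvalue must have strictly negative real part
print(np.linalg.eigvals(A_bar).real.max())   # negative in this on-policy setting
```

Whether the paper's rule (13) reduces to this on-policy form, and hence inherits the Hurwitz property for free, is exactly the clarification being requested.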
Reviewer 2
The paper considers TD algorithms with linear function approximation. The key observation is that the learning dynamics of the weights in a TD algorithm follow MJLS dynamics. The paper then uses MJLS theory to analyze the dynamics of TD algorithms with linear function approximation, and from the dynamics provides a stability analysis for the learning algorithms. The method provides a new systematic way to analyze the behavior of TD learning algorithms, which is important for RL research. The technical proofs are correct, but the presentation could be improved. Some comments:
- The definitions of \mu^k, q^k and Q^k are hard to find in the paper, and they are a bit confusing. It would probably be better to define them in Section 3 together with the jump dynamics (12) and to provide more explanation.
- Theorems 1 and 2 are basically algebraic manipulations of the TD dynamics, while the stability and limiting behavior are the main results of the paper. Therefore, providing formal theorems for the stability and limiting behavior might better highlight the main results.
- In Theorem 1, (14) uses \mu^k but (16) uses q^k. It seems like a typo in (16).
- The assumption \sum_i p_i b_i = 0 is desirable, but it is not clear whether it would actually hold. This assumption seems related to the representability of the linear function approximation. More discussion of this assumption is important. In particular, what are the dynamics when this expected value is not zero? What is the limiting behavior when this assumption does not hold? (See the simulation sketch below for one way to probe this.)
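For readers unfamiliar with MJLS dynamics, here is a minimal simulation sketch of the kind of system the review refers to: a parameter vector updated by a mode-dependent affine rule, where the mode follows a Markov chain. All matrices below are randomly generated placeholders, not the paper's jump dynamics (12).

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, alpha = 4, 2, 0.05            # number of modes, parameter dim, step size

# Markov chain over modes: T[j, i] = P(next mode = i | current mode = j)
T = rng.random((m, m))
T /= T.sum(axis=1, keepdims=True)

# Mode-dependent affine updates; each A_i is a small perturbation of -I (stable)
A = [-np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(m)]
b = [rng.standard_normal(d) for _ in range(m)]

theta, mode = rng.standard_normal(d), 0
for k in range(5000):
    theta = theta + alpha * (A[mode] @ theta + b[mode])   # jump-linear update
    mode = rng.choice(m, p=T[mode])                        # Markov mode transition
print(theta)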
Reviewer 3
Summary:
- The main contribution of the paper is to write the TD update as an MJLS over an augmented parameter space, with one parameter vector for each pair of states in the underlying MDP.
- After presenting MJLS and the idea of the augmented parameter space, the authors first consider the IID case, where pairs of states are chosen IID, and give formulas for the expected error and its covariance.
- They then move to the Markov noise case and derive formulas for the augmented error and covariance. Under an additional ergodicity assumption, they give a convergence rate to limiting quantities. For small learning rates (it is not exactly clear how small in terms of problem parameters), a perturbation analysis gives an estimate of what this convergence rate is (although the value of \lambda_{max-real}(\bar{A}) remains unclear in terms of the parameters of the problem).

Pros:
- The originality of the connection between TD dynamics and MJLS is a good contribution that could increase the flow of ideas from control theory to RL. In addition, the formulation of the augmented state space seems to be a potentially useful analysis tool.
- The proofs seem correct, with references to the appropriate work. Section D of the supplement provides an honest analysis of how the conclusions of the analysis are very similar to Srikant and Ying [22], while being more exact but less interpretable. The quality of the paper is solid.

Cons:
- In my mind, the main issue with this paper is a lack of significance. The approach is slightly different, but the conclusions seem to be the same as those of two recent papers with finite-time analyses of TD learning (references [3] and [22] in the paper). However, the results in this paper are much harder to interpret, as they are presented with cumbersome notation, and the impact of the derived formulas is unclear. Moreover, the paper never brings the analysis back down from the augmented space to the error in the parameters of the actual TD algorithm, making it even more difficult to interpret the results and compare them on equal footing with related work. To me, the improvement of more exact convergence rates (but of the same order) does not provide useful insight into the way TD learning works, since the analysis is so difficult to interpret.
- In general, the clarity and focus of the exposition could be improved. The paper gets bogged down in presenting several pages about MJLS before explaining the connection to TD (which is the main contribution). The notation is cumbersome and confusing, leading to long and difficult-to-interpret theorems with sparse analysis/exposition. Connections to variants of TD are hinted at but omitted from the paper. Clarity issues are further addressed in the "improvements" section below.
- Related to the above point about significance, the results lack originality. The two recent papers with finite-time analyses of TD learning (references [3] and [22] in the paper) derive similar results, with bounds on the MSE instead of exact formulas in an augmented space for the covariance, which is not quite the same thing, as stated in the paper. And while the connection to MJLS is novel in reinforcement learning, it seems to have been inspired by a similar connection made in optimization in the paper by Hu et al. (reference [15]).

-----------------------------------------------------------------

Update and response to authors:

Interpretability: Adding corollaries that express the results in a form comparable to the ones in the literature will help interpretability. Also, splitting the theorems into smaller pieces and simplifying the notation (and perhaps moving some of the results to the appendix) will clean up the presentation and make things more understandable. Moreover, as the other reviewers said, clearer explanations of the assumptions should be added. Finally, spelling out the implications of the theorems more clearly will go a long way towards making the paper more impactful.

Significance: I am still a bit wary of the impact on understanding TD learning. The analysis is novel and the tight rates are nice, but the claims about informing the choice of learning rate are suspect, as they rely on knowing the eigenvalues of \bar{A}, a matrix that will not be known in practice. Expanding on the implications of the tight rates and/or a lower bound on the error could help explain the significance. An example or toy experiment could help get this point across.

Originality: I agree that I missed the distinction from prior work that the authors raise, and I think this MJLS perspective is a nice idea to bring to the RL community.

After the rebuttal and talking to the other reviewers, I will update my score to a 6 to reflect expected changes to the paper to improve clarity and interpretability, and to recognise the novelty of the MJLS perspective and tight bounds. My doubts remain about the significance, but better presentation should improve this.
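The "formulas for the expected error" mentioned in the summary can be illustrated with the standard MJLS conditional first-moment recursion from the control literature (Costa, Fragoso, and Marques): for x^{k+1} = A_{w_k} x^k + b_{w_k} with Markov modes w_k, the quantities q_i^k = E[x^k 1{w_k = i}] and \mu_i^k = P(w_k = i) satisfy q_i^{k+1} = \sum_j T_{ji} (A_j q_j^k + \mu_j^k b_j) and \mu^{k+1} = T^T \mu^k. The sketch below uses random placeholder matrices and this textbook recursion, which may differ in notation from the paper's Theorem 1; it cross-checks the deterministic recursion against a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, K = 3, 2, 200                 # modes, state dim, horizon

T = rng.random((m, m))
T /= T.sum(axis=1, keepdims=True)   # mode transition matrix
A = [0.9 * np.eye(d) + 0.05 * rng.standard_normal((d, d)) for _ in range(m)]
b = [rng.standard_normal(d) for _ in range(m)]

# Deterministic recursion for q_i^k = E[x^k 1{w_k = i}], mu_i^k = P(w_k = i)
mu = np.full(m, 1.0 / m)
q = [np.zeros(d) for _ in range(m)]                 # x^0 = 0
for k in range(K):
    q = [sum(T[j, i] * (A[j] @ q[j] + mu[j] * b[j]) for j in range(m))
         for i in range(m)]
    mu = T.T @ mu
print("recursion:  ", sum(q))                       # E[x^K] = sum_i q_i^K

# Monte Carlo check of the same expectation
runs, acc = 2000, np.zeros(d)
for _ in range(runs):
    x, w = np.zeros(d), rng.integers(m)             # w^0 uniform over modes
    for k in range(K):
        x = A[w] @ x + b[w]
        w = rng.choice(m, p=T[w])
    acc += x / runs
print("monte carlo:", acc)                          # approximately equal
```

Corollaries stated at this level of concreteness, mapped back from the augmented space to the error of the actual TD iterate, are the kind of addition that would make the theorems easier to compare with [3] and [22].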