Summary and Contributions: This paper analyzes the convergence rate and sample complexity of actor-critic algorithms with linear function approximation for the critic and mini-batch updates for both the critic and the actor, in the infinite-horizon discounted setting. The main technical idea is to use a constant step size while controlling the bias error introduced by mini-batching, thereby obtaining better sample complexity than single-sample methods.
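For concreteness, my reading of the mini-batch TD step being analyzed is roughly the following (a sketch in my own notation with a generic environment interface; the paper's exact algorithm, projection steps, and constants may differ):

import numpy as np

def minibatch_td_update(env_step, phi, theta, s, alpha=0.1, gamma=0.99, B=64):
    """One mini-batch TD(0) update with a linear critic V(s) = phi(s) @ theta.

    The B transitions are taken consecutively from a single Markovian
    trajectory (no i.i.d. restarts); their TD(0) semi-gradients are
    averaged and applied with a single constant step size alpha.
    """
    g = np.zeros_like(theta)
    for _ in range(B):
        s_next, r = env_step(s)                      # one Markovian transition
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        g += delta * phi(s)                          # TD(0) semi-gradient
        s = s_next
    return theta + alpha * g / B, s                  # averaged update, constant step size

If I understand correctly, averaging over B correlated samples is what shrinks the Markovian bias of the semi-gradient enough that the constant step size can be retained, which is where the improved sample complexity comes from.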
Strengths: The paper introduces some interesting techniques that could be of interest beyond the scope of this work. I looked over some proofs in the appendix, and the theoretical claims seem correct and interesting overall. The significance is that the authors are able to show that AC algorithms enjoy better sample complexity than PG algorithms. In this sense, the contribution is significant and interesting.
Weaknesses: While the authors compare against existing analyses of PG-based methods and show, via a new style of analysis, that their guarantees for AC and NAC improve on the PG counterparts, I wonder whether their analysis techniques, when applied to PG methods, would improve those guarantees as well. And if they do improve the PG guarantees too, how would the comparison look then? It would be good if the authors could elaborate on this aspect. Is this essentially a question of 1/B vs 1/\sqrt{B}?
Correctness: The theorem proofs seem correct to me (but that said, I haven't verified each step of the proof).
Clarity: The paper is well written, although I would suggest adding a few more definitions, e.g., what is meant by "single-sample SA" or by "Markovian sampling" -- these are fairly standard, but it is good to state them once.
Relation to Prior Work: The discussion of prior work is adequate, and the differences from existing analyses are brought out reasonably well.
Reproducibility: Yes
Additional Feedback: Main comments are in the other questions. I would like a discussion of whether their technique also improves guarantees for policy gradient methods, independent of the critic, under their chosen assumptions. ----------------------- Thanks for the response. Your answer as to why these techniques may not be applicable to policy gradient methods makes sense, and in my opinion it would be great to add it to the final paper -- to rule out the potential confound that the better bounds for AC/NAC come from a sharper analysis, compared to a looser analysis of PG, rather than from the algorithms themselves. The toy example pointed out by R3 should be included as well, and some quantification of the critic approximation error (raised by R4) would also be good to discuss.
Summary and Contributions: This paper improves sample complexity bounds for actor-critic (AC) and natural AC (NAC) methods. It gives the first sample complexity bounds for actor-critic methods under Markovian (non-i.i.d.) sampling, and the first results showing that AC and NAC have tighter sample complexity bounds than policy gradient and natural policy gradient algorithms (the latter result better reflects empirical observations). The three core theoretical results of the paper are:
1. A sample complexity bound for mini-batch TD.
2. A sample complexity bound for mini-batch AC.
3. A sample complexity bound for mini-batch NAC.
All results are derived under Markovian sampling, as opposed to the i.i.d. sampling used in other analyses.
Strengths: Overall, I like the paper and am glad to see results that move theoretical bounds closer to the practical settings of these algorithms. I believe the contributions are significant, novel, and relevant to the theoretical RL community.
Weaknesses: Admittedly this is a bit nitpicky, but it would be interesting to complement the theoretical results with empirical results on toy problems (maybe random MDPs? a sketch of the kind of experiment I have in mind is given below). Such results would give an idea of how tight the new bounds are in practice and could also offer better insight into how parameters like the mini-batch size affect the bounds. With that said, I don't see this as a major limitation of the work.
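To be concrete, a minimal version of such an experiment is sketched below (this is entirely my own toy construction; the state space size, random features, step size, and sample budget are arbitrary choices, not the paper's):

import numpy as np

rng = np.random.default_rng(0)
S, d, gamma = 20, 5, 0.95                        # states, feature dim, discount

# Random MDP with the policy folded in: row-stochastic P and state-dependent rewards.
P = rng.dirichlet(np.ones(S), size=S)
r = rng.uniform(0.0, 1.0, size=S)
Phi = rng.normal(size=(S, d))                    # random linear features for the critic

def run(B, total_samples=100_000, alpha=0.05):
    """Mini-batch TD(0) along one Markovian trajectory, fixed total sample budget."""
    theta, s = np.zeros(d), 0
    for _ in range(total_samples // B):
        g = np.zeros(d)
        for _ in range(B):
            s_next = rng.choice(S, p=P[s])
            delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
            g += delta * Phi[s]
            s = s_next
        theta += alpha * g / B                   # constant step size, averaged gradient
    return theta

v_true = np.linalg.solve(np.eye(S) - gamma * P, r)   # exact value function for reference
for B in (1, 8, 64):
    print(f"B={B:3d}  ||Phi@theta - V|| = {np.linalg.norm(Phi @ run(B) - v_true):.3f}")

Holding the total number of transitions fixed while varying B would show directly how the batch size trades off bias and variance against the number of parameter updates; the residual error would also expose the linear-critic approximation error, which would be worth reporting separately.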
Correctness: The paper's arguments are presented entirely theoretically. As far as I can tell the proofs of results are correct.
Clarity: The paper is clearly written. I have some minor comments and questions given as feedback to the authors.
Relation to Prior Work: Yes, the authors include a related work section and a table showing how their contributions relate to existing work.
Reproducibility: Yes
Additional Feedback: Thank you for your response. I've left my review as is.
Questions for the authors:
1. The section on mini-batch TD learning discusses a bias error. I'm not sure "bias" is the right way to describe this, particularly with respect to the part that disappears with a larger mini-batch (line 211). Isn't this more of a variance error? (A schematic decomposition of what I mean is given at the end of this review.)
2. Does Theorem 1 depend on both assumptions, or just on Assumption 2?
3. In Assumption 1, should L_phi be L_psi?
4. Is there a typo in Assumption 2's equation? What is P(s_t \in \cdot | s_0=s)?
Minor comments:
1. Line 37: infinity -> infinite.
2. Don't use numerical citations as nouns; it is better to give author names.
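To spell out the bias vs. variance question above, my own schematic decomposition of the averaged update direction (in my notation, not the paper's, and assuming the chain mixes geometrically) is

\frac{1}{B}\sum_{t=1}^{B} g(\theta; x_t) - \bar g(\theta)
  = \underbrace{\mathbb{E}\Big[\frac{1}{B}\sum_{t=1}^{B} g(\theta; x_t)\Big] - \bar g(\theta)}_{\text{Markovian bias, } O(1/B) \text{ up to mixing-time factors}}
  + \underbrace{\frac{1}{B}\sum_{t=1}^{B} g(\theta; x_t) - \mathbb{E}\Big[\frac{1}{B}\sum_{t=1}^{B} g(\theta; x_t)\Big]}_{\text{zero-mean fluctuation, typically } O(1/\sqrt{B})}

where g(\theta; x_t) is the semi-gradient at sample x_t and \bar g(\theta) is its expectation under the stationary distribution. My question is whether the term around line 211 that decays as 1/B is the first (bias) piece above, or whether it is really the second, variance-like piece.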
Summary and Contributions: The paper provides a finite-time analysis for actor-critic (AC) and natural actor-critic (NAC) algorithms under infinite-horizon discounting, Markovian sampling, mini-batch training, general actor approximation, and linear critic approximation. The presented convergence rates improve upon existing results (which do not consider mini-batch training).
Strengths: 1. The paper is the first to analyze Markovian mini-batch training for AC and NAC. Mini-batch training is rarely studied but widely used in practice, so the contribution is both novel and significant for theoretical analysis. 2. The presented techniques (linear and nonlinear stochastic approximation analysis) are demonstrated to improve upon prior convergence rates, and thus provide new insights for theoretical analysis.
Weaknesses: Though the paper demonstrates faster convergence of AC compared to policy gradient (PG), its proof assumes a linear critic, which can introduce a large approximation error in practice (see Theorem 2). It is unclear whether the gain in convergence speed outweighs the approximation error.
Correctness: I have read through the main text, but only skimmed the appendix due to lack of time and domain expertise. The overall proof appears correct, but it is possible that I have missed important details.
Clarity: The paper is very well written. The presentation is easy to follow. Main ideas are summarized and explained concisely.
Relation to Prior Work: The paper clearly shows differences from previous contributions in terms of mini-batch training and Markovian sampling.
Reproducibility: Yes
Additional Feedback: The authors’ response has addressed my questions. I will keep my score. ----------------------------
1. Can the nonlinear SA analysis be applied to a nonlinear critic as well? This is a natural question to ask, so it could be worth an explanation somewhere.
2. In practice, some have observed faster convergence of NAC compared to AC (e.g., ACKTR vs. A2C in [1]). However, this paper suggests that NAC is slower by a factor of (1-\gamma)^{-2}. What could cause this difference, and how could the theory here guide the development of deep RL algorithms?
[1] Wu, Y., Mansimov, E., Grosse, R.B., Liao, S. and Ba, J., 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems (pp. 5279-5288).