Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
This paper is very inspiring. Slow convergence and sample inefficiency remain challenging problems in RL, especially when handling continuous, high-dimensional state spaces. This paper applies a general method drawn from regularized Anderson acceleration (RAA) to accelerate convergence and improve sample efficiency for model-free, off-policy deep RL. Overall the paper is well constructed and approaches the problem from a novel point of view (the observation that RL is closely linked to fixed-point iteration). The results show that the approach substantially improves both the learning speed and final performance of state-of-the-art deep RL algorithms, and the literature review shows that the authors are knowledgeable in this field.

Here is my major concern: a description of policy iteration is missing in Section 3.2 and should be added, since value iteration is a special variant of policy iteration.

Some minor suggestions:
- Lines 116 to 125: specify m > 1.
- Line 149: besides the l2-norm, do other choices (e.g. the l1-norm or l∞-norm) work effectively here?
- Broken "Appendix ??" references at lines 151, 191, 239, 245, 286, and 294.
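For readers unfamiliar with the fixed-point view mentioned above: value iteration repeatedly applies the Bellman optimality operator T until it reaches the fixed point V* = T(V*). A minimal sketch of this iteration (the two-state MDP below is a toy example of my own, not from the paper):

```python
# Toy illustration (not from the paper): value iteration is fixed-point
# iteration on the Bellman optimality operator T, with V* = T(V*).
# Hypothetical 2-state MDP: states {0, 1}, actions {0, 1}, deterministic.

GAMMA = 0.9
# P[s][a] = list of (next_state, prob); R[s][a] = immediate reward
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def bellman(V):
    """Apply the Bellman optimality operator T once."""
    return [max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                for a in P[s]) for s in P]

V = [0.0, 0.0]
for _ in range(200):          # plain fixed-point iteration V <- T(V)
    V = bellman(V)
# V converges to V*: V*[1] = 2 / (1 - GAMMA) = 20, V*[0] = 1 + GAMMA * 20 = 19
```

Since T is a gamma-contraction, the plain iteration converges linearly at rate gamma; Anderson acceleration targets exactly this kind of iteration.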
The main contribution of this paper is to apply Anderson acceleration to the setting of deep reinforcement learning. The authors first propose a regularized form of Anderson acceleration (RAA), and then show how it can be applied to two practical deep RL algorithms: DQN and TD3.

Originality: This paper falls in the vein of applying existing techniques to a novel domain. While the idea of introducing Anderson acceleration to RL is not new, as the authors mention, it has not previously been applied to deep RL methods. Although the originality is somewhat limited in this respect, developing a practical and functional improvement for deep RL algorithms is not trivial.

Quality: The paper is technically sound, and the experimental analysis is fair and supports the main thesis. I like the fact that introducing RAA yields similar performance gains on both TD3 and Dueling-DQN. This is promising from a reproducibility standpoint: it makes it more likely that the gain from RAA is real and meaningful, rather than an artifact of an odd quirk in one particular deep RL method.

Clarity: The paper is well written. The background on Anderson acceleration is clearly explained, as are the motivation and the extension to the deep RL domain.

Significance: Developing good, stable off-policy RL algorithms is an important area of research. This work proposes a well-motivated modification to existing off-policy algorithms that appears simple to implement and promises moderate performance gains. I believe this work will be of interest to the deep RL community.

Minor typos and clarifications:
- Appendix links are broken (e.g. lines 191, 239, 286, 294, and more).
- It is not immediately clear what value of m was used for the experiments. I assume m corresponds to the "number of previous estimates" hyperparameter, but this could be made more explicit.
- There is some informal language in the paper.
I have proposed some minor modifications:
- Line 35: "this mapping is essentially a kind of fixed-point problem" -> "Iterating this mapping results in a fixed-point problem".
- Line 173: "may make the iteration get stuck" -> delete?
Clarity: The writing in the paper is clear, as is the presentation of Anderson acceleration and the proposed bound on the iterates.

Originality: The contributions in this paper appear novel, and past work is appropriately cited.

Significance & Quality: The performance benefit of this method is not sufficiently clear from the chosen experiments. As presented in this paper, the key idea of Anderson acceleration is to combine past iterates of a contraction to move faster toward the fixed point. The performance benefit over the baselines in Figure 1 appears small, and it could be conflated with other factors such as the choice of step size and the optimizer used; the experiments do not disentangle these confounding factors. In particular, common deep-learning optimizers (Adam, RMSProp) contain momentum terms that could (weakly) mimic some of the benefits of Anderson acceleration by combining information across iterates for faster convergence. There should be experiments that examine this.
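To make the acceleration idea concrete, here is a minimal sketch (a toy example of my own, not from the paper) comparing plain fixed-point iteration with scalar Anderson acceleration of memory m = 1 on the contraction g(x) = cos(x):

```python
import math

def plain(g, x0, tol=1e-10, max_iter=1000):
    """Plain fixed-point iteration x <- g(x); returns (x, steps)."""
    x = x0
    for k in range(1, max_iter + 1):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

def anderson_m1(g, x0, tol=1e-10, max_iter=1000):
    """Anderson acceleration with memory m = 1 (scalar case):
    mix the two most recent iterates so the combined residual is zero."""
    x_prev, x = x0, g(x0)
    for k in range(2, max_iter + 1):
        f_prev, f = g(x_prev) - x_prev, g(x) - x
        if abs(f) < tol:
            return x, k
        alpha = f / (f - f_prev)    # weight that zeros the mixed residual
        x_prev, x = x, alpha * g(x_prev) + (1 - alpha) * g(x)
    return x, max_iter

g = math.cos                        # contraction with fixed point x* ~ 0.739085
x_plain, n_plain = plain(g, 0.0)
x_aa, n_aa = anderson_m1(g, 0.0)
# Both reach the same fixed point; the accelerated run takes far fewer steps.
```

This toy setting isolates the acceleration effect; the reviewer's point is that the deep RL experiments do not provide a comparably clean separation from step-size and optimizer effects.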