Summary and Contributions: For most parallel actor-learner methods in reinforcement learning, high throughput and stable training are mutually exclusive; this paper proposes a synchronous training scheme that strikes a balance between the two. The method performs learning and rollouts concurrently, and it claims to avoid the “stale policy” problem that often leads to unstable training. The approach is evaluated on Atari games and the Google Research Football environment, and the results show that the scheme achieves competitive throughput and higher rewards.
Strengths: The topic is relevant and significant for the RL community; distributed RL and scaling RL are important research directions that can make RL applicable to more complicated environments and shorten training time. I like the engineering techniques and ideas in the proposed training scheme: compared to most previous methods, it further decouples the env-step function from the actor forward pass. It accounts for the variance in env-step times, which is indeed an issue in practical parallel training schemes. Besides, it ensures that the latency between the target and behavior policies is fixed to one update period and adopts a delayed gradient to remedy this latency. The experiments are sufficient, with various environments and an ablation study; the configurations and details are presented and the implementation code is uploaded, so I am confident in the reliability and reproducibility of the experimental results.
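To make the step-time-variance point concrete, here is a small simulation I put together (my own illustration, not the authors' code) comparing per-step synchronization with synchronizing once every alpha steps under i.i.d. exponential step times; `n_envs`, `alpha`, and `beta` are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_envs, alpha, beta, trials = 16, 4, 1.0, 10_000

# Draw the same per-step execution times for both schemes.
step_times = rng.exponential(beta, size=(trials, alpha, n_envs))

# Per-step synchronization: every step waits for the slowest of the n_envs executors.
per_step_sync = step_times.max(axis=2).sum(axis=1)

# Batch synchronization: each executor runs alpha steps on its own, and we wait
# once for the slowest *total* over the alpha-step window.
batch_sync = step_times.sum(axis=1).max(axis=1)

print(f"mean wall-clock per {alpha} steps, per-step sync: {per_step_sync.mean():.2f}")
print(f"mean wall-clock per {alpha} steps, batch sync:    {batch_sync.mean():.2f}")
# Batch sync waits less because each executor's total over alpha steps is relatively
# less variable than a single step, so the slowest straggler costs less overall.
```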
Weaknesses: Although the experiments are fairly solid, I would still like to see more comparisons with typical scaled training methods such as SEED RL [1], since SEED RL is also evaluated on the Google Research Football environment.
Correctness: In Section 4.2, the authors assume that X_i^(j) follows an Exp(\beta) distribution and that the learners' data-consumption times also follow an exponential distribution; are these assumptions suitable and realistic?
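For readers weighing this assumption, a standard identity (my own addition, not from the paper, and reading Exp(\beta) as an exponential with mean \beta) shows why it makes per-step synchronization costly:

```latex
% With n executors and i.i.d. step times X_i^{(j)} \sim \mathrm{Exp}(\beta),
% the expected wall-clock time for all executors to finish one step is the
% expected maximum of n exponentials,
\[
  \mathbb{E}\!\left[\max_{1 \le i \le n} X_i^{(j)}\right]
  = \beta \sum_{k=1}^{n} \frac{1}{k}
  \;\approx\; \beta \left(\ln n + \gamma\right),
\]
% which grows logarithmically in n even though each executor's mean step time
% is only \beta -- exactly the straggler effect that batch synchronization targets.
```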
Clarity: Yes, this paper is well written and well organized; I can clearly follow the authors' points.
Relation to Prior Work: Some of the ideas (decoupling the env-step function from the policy forward pass) are similar to Google's SEED RL, but the authors neither mention this nor provide an experimental comparison.
Reproducibility: Yes
Additional Feedback: * The captions of Figures 3 and 4 are confusing; the text layout is disordered. Updated after author rebuttal: Thanks for adding the comparison to SEED RL; I keep my score for this paper.
Summary and Contributions: The paper proposes a new engineering approach to combine the benefits of asynchronous and synchronous RL while avoiding some of their pitfalls. The authors describe a method, HTS-RL, which they claim allows multiple environment interactions and learning updates to happen in parallel while ensuring deterministic training behavior. They also describe the problem of “stale policies” as a possible explanation for training instability and propose a fix for it. The architecture has several key features:
- Batch synchronization: actors are synchronized every \alpha > 1 steps, which helps in environments with high variance in per-step execution time.
- Concurrent rollout and learning.
- Two separate data storages, ensuring that the behavior policy is only one update ahead of the target policy, thus avoiding the off-policy issue present in some existing asynchronous RL approaches (a toy sketch of this double-buffer idea follows below).
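My reading of the two-storage mechanism, as a toy double-buffering sketch (placeholder functions and lists standing in for the paper's components, not the authors' implementation):

```python
import threading

# Toy stand-ins; the point is only the buffer-swapping pattern.
def collect_rollouts(storage, policy_version):
    storage.extend((policy_version, step) for step in range(4))  # pretend 4 env steps

def update_policy(storage):
    return len(storage)  # pretend a gradient update on the buffered rollouts

storages = [[], []]
write_idx = 0
for update in range(3):
    collect, learn = storages[write_idx], storages[1 - write_idx]
    t = threading.Thread(target=collect_rollouts, args=(collect, update))
    t.start()                  # rollout collection with the current behavior policy ...
    update_policy(learn)       # ... overlaps with learning on the previous iteration's data
    t.join()                   # synchronization point: wait for both sides to finish
    learn.clear()
    write_idx = 1 - write_idx  # swap buffers; behavior policy stays one update behind
```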
Strengths:
+ This is a good engineering step towards better utilizing available compute resources to run deep RL. The approach avoids the low throughput of synchronous RL, while hedging against the stale policy updates that plague asynchronous RL.
+ They run experiments on several Atari and Google Research Football domains, and show that across most domains their method:
  - achieves a higher Atari score in a fixed amount of time;
  - takes less time to achieve a fixed score in GFootball.
Weaknesses: - While this is an impressive effort, I am concerned about the limited scientific contribution of this work. This is clearly a useful result for practitioners who want to make the best use of their available compute, but the scientific contribution is very limited.
Correctness: Yes.
Clarity: One criticism about how some of the figures are structured: Figure 1 comes very early in the paper, but it contains several entities (such as executors, polls, etc.) that are only defined later. When the figure is then referred to in the description of HTS-RL, I had to scroll back and forth to make sense of the architecture.
Relation to Prior Work: I give the paper a lot of credit for situating itself relative to existing work such as A2C, A3C, and IMPALA.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: The authors propose a new synchronous deep RL framework to make the best use of hardware while avoiding the stale policy issue via delayed gradient updates.
Strengths: As far as I know, the delayed gradient update and the decoupling of actors and executors are novel in deep RL frameworks. Those concepts can possibly be used in a wide range of RL implementations and eventually make deep RL more accessible. The evaluation metric is convincing and the results support the claims.
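To illustrate what the actor/executor decoupling means in practice, here is a toy queue-based sketch (my own, with made-up names and a trivial "policy"; the real framework batches observations across executors for each forward pass):

```python
import queue, threading
import numpy as np

# Executors only step environments; a single actor thread serves policy outputs.
obs_q = queue.Queue()
act_q = {i: queue.Queue() for i in range(4)}

def executor(i, n_steps=5):
    obs = np.zeros(8)                       # pretend initial observation
    for _ in range(n_steps):
        obs_q.put((i, obs))                 # hand the observation to the actor
        action = act_q[i].get()             # block only until *this* env's action is ready
        obs = obs + action                  # pretend environment step

def actor(total_steps=20):
    for _ in range(total_steps):            # in reality: batch several obs per forward pass
        i, obs = obs_q.get()
        act_q[i].put(float(obs.mean() + 1.0))   # pretend policy output

threads = [threading.Thread(target=executor, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=actor))
for t in threads: t.start()
for t in threads: t.join()
print("executors never wait on each other, only on their own pending action")
```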
Weaknesses:
1. The baselines are somewhat weak. Though TorchBeast is a strong baseline, the PPO and A2C from Kostrikov seem weak; as far as I know, fast training is not the goal of Kostrikov's implementation. For PPO, the implementation from OpenAI Baselines is stronger, featuring parallelization with MPI and all-reduce gradients. For A2C, one could consider rlpyt ("rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch"), where various sampling schemes (including batch synchronization) and optimization schemes can be used. The paper could benefit a lot from a comparison with OpenAI Baselines and rlpyt.
2. The explanation of the framework is not easy to follow. In particular, understanding how the two storages and the delayed gradient updates interact with each other is not straightforward. The paper could benefit a lot from pseudo code, as well as a more detailed Figure 2d that additionally shows the flow of parameters and gradients.
3. The utility of batch synchronization. As far as I understand, batch synchronization helps if the following assumption holds: let X_k be the time that an executor needs to run k steps; then the variance of X_k, relative to its mean, decreases as k increases. There are indeed exceptions, e.g., one executor might need 10 s for each of the next 4 steps while the other executors need only 1 s; in that case I think batch synchronization won't help. Assuming the X_i^(j) are i.i.d. in Claim 1 seems impractical, and the paper could benefit from explicitly stating the above assumption (if I understand it correctly) and clarifying this point.
4. Number of envs vs. number of threads (CPUs). From Line 281, I assume that in Figure 4 (right), # of envs = # of threads. But this is not a fair comparison for PPO: in the OpenAI Baselines PPO implementation for Atari games, one thread can have up to 128 envs, and in that implementation SPS is not expected to change much with the # of envs. However, I do expect SPS to scale linearly with the # of threads given the use of MPI and all-reduce gradients. The paper could benefit a lot from studying performance vs. # of threads for all the compared algorithms, which would be very useful for practitioners.
5. Credit assignment in Sec. 4.1. If I remember correctly, the two-storage feature is similar to the double replay buffer in rlpyt, and batch synchronization is also used in rlpyt. I do believe the delayed gradient and the asynchronous actors and executors are novel. The paper could benefit a lot from explicitly clarifying which of the features listed in Sec. 4.1 are novel and which are not.
6. The word "updates" in Line 197 is confusing. If I understand it correctly, each learner performs only one update; if there were multiple updates, the delayed gradients would be delayed by more than one step. As you have already pointed out, an update is a forward and a backward pass, so it might be better to avoid the word "updates": as far as I can tell, the learner just accumulates gradients and updates the parameters only once (a generic sketch of this accumulate-then-step pattern follows after this list).
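Regarding point 6, the pattern I believe is meant, gradient accumulation followed by a single optimizer step, looks like the following generic PyTorch sketch (my illustration, not the paper's code; the Linear layer and random data are placeholders):

```python
import torch

model = torch.nn.Linear(4, 2)                  # stand-in for the learner's policy network
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

opt.zero_grad()
for shard in torch.randn(8, 16, 4).unbind(0):  # e.g. 8 data shards from one storage
    loss = model(shard).pow(2).mean() / 8      # scale so gradients average over shards
    loss.backward()                            # backward passes accumulate into .grad
opt.step()                                     # a single parameter update per storage
```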
Correctness: Yes
Clarity: Yes
Relation to Prior Work: Yes
Reproducibility: Yes
Additional Feedback: I read the author response and would like to keep my score. I think the authors should use the MPI version of PPO (i.e., 'ppo1' in OpenAI Baselines) to make Figure 4 (right) a fair comparison. I do expect the authors' method to still outperform the MPI version of PPO, but the margin should be much smaller.
Summary and Contributions: This paper presents a new implementation methodology for synchronous deep RL algorithms that rely on on-policy data. The shortcoming it seeks to overcome is that asynchronous methods typically run faster and distribute their computation better than synchronous methods, due to the variance in per-step execution times in the environments, whereas synchronous methods tend to be more reproducible and more data-efficient. This paper presents a middle road by synchronizing the execution every time a batch of data is collected. This setup ensures that the data-collection policy and the policy being updated are one update apart, leading to fairly on-policy updates and hence better data efficiency. Reproducibility is ensured by placing the pseudo-randomness at the level of the executors. The paper analyzes and validates the time taken by the approach to generate a certain amount of data, and also analyzes the latency between the behavior and target policies in asynchronous algorithms. The experiments evaluate this approach against both synchronous and asynchronous algorithms, checking the performance achieved within a fixed time limit as well as the time taken to reach a fixed performance level.
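How I read "pseudo-randomness at the level of the executors": each executor owns its own seeded RNG (and environment), so rollouts are identical across runs regardless of thread scheduling. A toy sketch with made-up stand-ins, not the authors' code:

```python
import numpy as np

def make_executor_rng(run_seed, executor_id):
    return np.random.default_rng([run_seed, executor_id])    # one stream per executor

def rollout(rng, n_steps=3):
    return [int(rng.integers(0, 4)) for _ in range(n_steps)]  # pretend env transitions

run_a = [rollout(make_executor_rng(123, i)) for i in range(4)]
run_b = [rollout(make_executor_rng(123, i)) for i in range(4)]
assert run_a == run_b  # deterministic no matter which executor happened to run first
```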
Strengths: Increasing the throughput of techniques that are more easily reproducible will be useful to the community at large, and this aspiration of the paper is to be commended. The analysis and comparison of synchronous and asynchronous algorithms, with respect to the amount of data they can generate and the lag or off-policyness they suffer from, is a valuable part of this paper. The proofs of the claims in this respect seem to be correct.
Weaknesses: **AFTER AUTHOR RESPONSE** The authors addressed my specific concerns regarding the off-policy nature of the updates and determinism in the execution. These points are addressed in the appendix; however, they seem to be important evidence supporting the proposed modification, and I encourage the authors to highlight these findings in the main paper. Given the theoretical and empirical analysis of this practical modification to data gathering, along with discussions with reviewers, I lean more positively regarding this paper and have updated my score accordingly.
_____________________________________________________________________
1. While the practical benefit to the community due to the above points is clear, I am not completely sure about the scientific value the paper brings. The insights from the analysis seem fairly minor, and the algorithmic innovation seems fairly minor as well.
2. The paper uses samples from a policy that is slightly off-policy without any mention of correcting for this off-policyness, or even acknowledging the off-policy update (the standard form of such a correction is sketched after this list).
3. One of the reasons for introducing synchrony was to ensure determinism in execution. An experiment showcasing this determinism would have been useful.
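For context on point 2, the textbook first-order correction when data come from a behavior policy \mu that lags the target policy \pi (ignoring the state-distribution mismatch) is an importance-sampling reweighting; this is a standard note I am adding for readers, not the paper's method:

```latex
\[
  \nabla_\theta J(\theta)
  \;\approx\;
  \mathbb{E}_{(s_t, a_t) \sim \mu}\!\left[
    \frac{\pi_\theta(a_t \mid s_t)}{\mu(a_t \mid s_t)}\,
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t
  \right],
\]
% typically with clipped or truncated ratios as in V-trace or PPO; the original
% concern was that the paper did not discuss whether data that is one update
% stale needs such a correction.
```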
Correctness: Nothing in the claim proofs or empirical evaluation methodology raises red flags for me.
Clarity: The paper is well written. The authors present the reason for the bottleneck in current synchronous algorithms quite clearly. Figure 1 (e) is not that easy to understand before reading section 4.1, though it is not a major problem.
Relation to Prior Work: Prior work has been presented clearly. The relation to prior work is discussed at various points in the paper. It is both sufficient and clear.
Reproducibility: Yes
Additional Feedback: