Reviews: Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning

The paper proposes to use a Gossip based algorithm for performing gradient updates across distributed agents in RL. In this approach each agent is connected to a set of other agents to which it sends updated parameters, and from which it receives updated parameters (the outgoing connections do not have to be the same as incoming ones, in theory). At each step of learning (every N actions) each agent performs updates to the value function using TD error gradient, and to the policy using policy-gradients. The updated parameters are sent out to the outgoing peers that it is connected to and new parameters are received from incoming peers it is connected to. The incoming messages are combined (averaged, it seems, although it appears that the matrix P does not have to be this way) and parameters are updated. The results shown in the paper seem to highlight improved performance over A2C under the same wall clock time. I could use some clarification however for Fig 1b -- how do I compare the number of steps for A2C vs the steps for Gala-a2c -- were the same number of workers being used in both cases ? From the differernce between 1(b) and 1(d) it would appear that steps pass faster for Gala-A2C, and I'm wondering why.. In fig 1a, why does the performance degrade so catastrophically with larger number of simulators ? Can this be ameliorate by changing the sparsity of P ? Did the authors try stochastically sampling P, or is that assumed to happen just by variability in communication ?

Reviewer 2

The paper should make it clear what exactly is new, and what was previously known in the gossip algorithm community. The “beneficial effect” of noise is hypothesized, but not verified: please provide evidence for that claim. Algo 1: this reads as if L6 is blocking, yet the text states the opposite? How does it actually work? The “directed ring” topology should be describe in more depth, earlier, and justified. What is a “nominal A2C score” if A2C fails? I also think “convergent” is a really bad term for 50%+ runs, as compared to some baseline. Steps vs frames: I assume you use action repeat, in which case each environment step is 4 frames? And then the 200M reference number for impala should be 50M steps or 200M frames? My reading of Fig 1a is that the “stability” results are weak, possibly even inconclusive: under which hypothesis would you have statistical significance? The mentioned random start action mitigation: did you use that? There are mismatches between the results in the figures and the table, eg impala on space invaders, a3c on pong, etc: where do these come from? The shading in Fig 2b is unreadable, and it misses a legend for the absolute scales, as well as an explanation of “peer”. Spelling: a bunch of “s” missing, eg L44, L87, L88. ----- I liked the author response, and it addressed a number of my concerns, however R3 also raised some complementary issues to my own, so I think on balance the paper remains borderline, although I'm leaning toward accept.

Reviewer 3

The introduction of gossip algorithms to Deep-RL is original. The work is generally clearly presented, but some of the reported baseline results do not match previous published works. Figure 1: The IMPALA results look completely off, as do the A3C results on pong, and the A3C results in the appendix. There shouldn’t be such a discrepancy between A3C and IMPALA when running with the same hyperparameters (there is a larger discrepancy on some games, but not these ones). I suspect a bug in the IMPALA implementation, or at least an unfair comparison due to all other results using a more recent (and hence more tuned) set of hyperparameters from [Stooke&Abbeel 2018]. It should be noted that individual Atari games are particularly sensitive to individual hyperparameters, and this can easily dominate the difference between algorithms. Comparing on a larger number of games, and using comparable hyperparameters across methods, would alleviate this. In particular, the difference of the unroll length N (5 and 20 for the two sets of hyperparameters) is quite significant. These differences make me doubt the quality of the conclusions presented. 140-144: I can’t help but feel that this definition for an update is arrived at in a backwards manner, when it could be defined more simply. 157: Related works define beta inline. This could too, rather than alluding to it being “related to the spectral radius of the graph sequence”. Table 1: Evaluation by sampling directly from the policy is more standard, and can sometimes outperform the argmax action. Table 1: The performance of A2C and GALA-A2C on BeamRider are with in stderr of each other, so A2C should be bolded too. This makes 3 games where GALA-A2C is on par with baseline methods in terms of learning performance, outperforms in 2 games & underperforms in 1 game. On its own, this is a positive but minor result. The main benefits of GALA-A2C over A2C are computational as described later. 202: What other hyperparameters are affected when increasing the number of simulators? Does this change the size of the batches used for training? Please provide more details of this experiment. 203: Convergent at what point in training? Figure 2b: What does it mean for an A2C agent to have multiple ‘agents’? Do these refer to different shards of the training batch, on different devices, whose gradients are summed using allreduce in [Stooke & Abbeel 2018]? Figure 3: This is a nice figure demonstrating the computational potential of gossip based algorithms over A2C. Would it be possible to add the other baseline methods to these plots? (A3C & IMPALA). Supplementary C.2, Figure 4. The results presented for `A3C (Mnih et al)` do not reproduce the published results from that paper, suggesting a bug in the implementation. ------------- Edits in response to author feedback: Thanks for answering the majority of my questions. To clarify: IMPALA was trained for 200 million frames (see section 5.3.2. of the paper, third paragraph). Are the rest of your results using `environment_steps == frames/action_repeats`? Using `frames` as a way of counting experience is more common - `steps` are clearly ambiguous when using action repeats. I suggest using `frames` throughout the paper. I'm still unconvinced by your baseline results for IMPALA - but those concerns aside, GALA-A2C demonstrates a computational improvement on top of Stooke's A3C or A2C with a solid theoretical basis, and the comparisons of energy usage and hardware utilisation are nice. I was perhaps a little ungenerous when writing my review and have increased my score. On reflection, I am actually curious to see how gossip approaches would perform in similar setups in my lab.

Paper ID:	7307
Title:	Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning

Reviewer 1

Reviewer 2

Reviewer 3