{"title": "Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 13320, "page_last": 13330, "abstract": "Multi-simulator training has contributed to the recent success of Deep Reinforcement Learning (Deep RL) by stabilizing learning and allowing for higher training throughputs. In this work, we propose Gossip-based Actor-Learner Architectures (GALA) where several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology, and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. We prove that GALA agents remain within an epsilon-ball of one-another during training when using loosely coupled asynchronous communication. By reducing the amount of synchronization between agents, GALA is more computationally efficient and scalable compared to A2C, its fully-synchronous counterpart. GALA also outperforms A2C, being more robust and sample efficient. 
We show that we can run several loosely coupled GALA agents in parallel on a single GPU and achieve significantly higher hardware utilization and frame-rates than vanilla A2C at comparable power draws.", "full_text": "Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning

Mahmoud Assran
Facebook AI Research & Department of Electrical and Computer Engineering, McGill University
mahmoud.assran@mail.mcgill.ca

Joshua Romoff
Facebook AI Research & Department of Computer Science, McGill University
joshua.romoff@mail.mcgill.ca

Nicolas Ballas
Facebook AI Research
ballasn@fb.com

Joelle Pineau
Facebook AI Research
jpineau@fb.com

Michael Rabbat
Facebook AI Research
mikerabbat@fb.com

Abstract

Multi-simulator training has contributed to the recent success of Deep Reinforcement Learning by stabilizing learning and allowing for higher training throughputs. We propose Gossip-based Actor-Learner Architectures (GALA) where several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology, and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. We prove that GALA agents remain within an ε-ball of one another during training when using loosely coupled asynchronous communication. By reducing the amount of synchronization between agents, GALA is more computationally efficient and scalable compared to A2C, its fully-synchronous counterpart. GALA also outperforms A3C, being more robust and sample efficient.
We show that we can run several loosely coupled\nGALA agents in parallel on a single GPU and achieve signi\ufb01cantly higher hardware\nutilization and frame-rates than vanilla A2C at comparable power draws.\n\n1\n\nIntroduction\n\nDeep Reinforcement Learning (Deep RL) agents have reached superhuman performance in a few\ndomains [Silver et al., 2016, 2018, Mnih et al., 2015, Vinyals et al., 2019], but this is typically at\nsigni\ufb01cant computational expense [Tian et al., 2019]. To both reduce running time and stabilize\ntraining, current approaches rely on distributed computation wherein data is sampled from many\nparallel simulators distributed over parallel devices [Espeholt et al., 2018, Mnih et al., 2016]. Despite\nthe growing ubiquity of multi-simulator training, scaling Deep RL algorithms to a large number of\nsimulators remains a challenging task.\nOn-policy approaches train a policy by using samples generated from that same policy, in which case\ndata sampling (acting) is entangled with the training procedure (learning). To perform distributed\ntraining, these approaches usually introduce multiple learners with a shared policy, and multiple\nactors (each with its own simulator) associated to each learner. The shared policy can either be\nupdated in a synchronous fashion (e.g., learners synchronize gradients before each optimization\nstep [Stooke and Abbeel, 2018]), or in an asynchronous fashion [Mnih et al., 2016]. 
Both approaches\nhave drawbacks: synchronous approaches suffer from straggler effects (bottlenecked by the slowest\nindividual simulator), and therefore may not exhibit strong scaling ef\ufb01ciency; asynchronous methods\nare robust to stragglers, but prone to gradient staleness, and may become unstable with a large number\nof actors [Clemente et al., 2017].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fAlternatively, off-policy approaches typically train a policy by sampling from a replay buffer of\npast transitions [Mnih et al., 2015]. Training off-policy allows for disentangling data-generation\nfrom learning, which can greatly increase computational ef\ufb01ciency when training with many parallel\nactors [Espeholt et al., 2018, Horgan et al., 2018, Kapturowski et al., 2019, Gruslys et al., 2018].\nGenerally, off-policy updates need to be handled with care as the sampled transitions may not conform\nto the current policy and consequently result in unstable training [Fujimoto et al., 2018].\nWe propose Gossip-based Actor-Learner Architectures (GALA), which aim to retain the robustness\nof synchronous on-policy approaches, while improving both their computational ef\ufb01ciency and\nscalability. GALA leverages multiple agents, where each agent is composed of one learner and\npossibly multiple actors/simulators. Unlike classical on-policy approaches, GALA does not require\nthat each agent share the same policy, but rather it inherently enforces (through gossip) that each\nagent\u2019s policy remain \u270f-close to all others throughout training. 
Relaxing this constraint allows us to reduce the synchronization needed between learners, thereby improving the algorithm's computational efficiency.

Instead of computing an exact average between all the learners after a local optimization step, gossip-based approaches compute an approximate average using loosely coupled and possibly asynchronous communication (see Nedić et al. [2018] and references therein). While this approximation implicitly injects some noise into the aggregate parameters, we prove that this is in fact a principled approach, as the learners' policies stay within an ε-ball of one another (even with non-linear function approximation), the size of which is directly proportional to the spectral radius of the agent communication topology and their learning rates.

As a practical algorithm, we propose GALA-A2C, an algorithm that combines gossip with A2C agents. We compare our approach on six Atari games [Machado et al., 2018], following Stooke and Abbeel [2018], with vanilla A2C, A3C and the IMPALA off-policy method [Dhariwal et al., 2017, Mnih et al., 2016, Espeholt et al., 2018]. Our main empirical findings are:

1. Following the theory, GALA-A2C is empirically stable. Moreover, we observe that GALA can be more stable than A2C when using a large number of simulators, suggesting that the noise introduced by gossiping can have a beneficial effect.

2. GALA-A2C has similar sample efficiency to A2C and greatly improves its computational efficiency and scalability.

3. GALA-A2C achieves significantly higher hardware utilization and frame-rates than vanilla A2C at comparable power draws, when using a GPU.

4. GALA-A2C is competitive in terms of performance relative to A3C and IMPALA.

Perhaps most remarkably, our empirical findings for GALA-A2C are obtained by simply using the default hyper-parameters from A2C.
Our implementation of GALA-A2C is publicly available at https://github.com/facebookresearch/gala.

2 Technical Background

Reinforcement Learning. We consider the standard Reinforcement Learning setting [Sutton and Barto, 1998], where the agent's objective is to maximize the expected value from each state,

V(s) = E[ Σ_{i=0}^{∞} γ^i r_{t+i} | s_t = s ],

where γ is the discount factor, which controls the bias towards nearby rewards. To maximize this quantity, the agent chooses at each discrete time step t an action a_t in the current state s_t based on its policy π(a_t|s_t), and transitions to the next state s_{t+1}, receiving reward r_t based on the environment dynamics.

Temporal difference (TD) learning [Sutton, 1984] aims at learning an approximation of the expected return parameterized by θ, i.e., the value function V(s; θ), by iteratively updating its parameters via gradient descent:

∇_θ ( G_t^N − V(s_t; θ) )²,    (1)

where G_t^N = Σ_{i=0}^{N−1} γ^i r_{t+i} + γ^N V(s_{t+N}; θ_t) is the N-step return. Actor-critic methods [Sutton et al., 2000, Mnih et al., 2016] simultaneously learn both a parameterized policy π(a_t|s_t; ω) with parameters ω and a critic V(s_t; θ). They do so by training a value function via the TD error defined in (1) and then proceed to optimize the policy using the policy gradient (PG) with the value function as a baseline:

∇_ω ( log π(a_t|s_t; ω) A_t ),    (2)

where A_t = G_t^N − V(s_t; θ_t) is the advantage function, which represents the relative value the current action has over the average. In order to both speed up training time and decorrelate observations, Mnih et al. [2016] collect samples and perform updates with several asynchronous actor-learners.
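To make the N-step quantities above concrete, here is a minimal sketch (not the authors' code; all names and numbers are illustrative) of how the N-step return G_t^N and the advantage A_t are typically computed from a rollout by backward recursion:

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Backward recursion for the N-step return:
    G_t = r_t + gamma * G_{t+1}, seeded with the bootstrap G_N = V(s_{t+N}; theta)."""
    returns = np.empty(len(rewards))
    g = bootstrap_value
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Toy rollout of length N = 3 (hypothetical rewards and value estimates).
rewards = [1.0, 0.0, 1.0]
values = np.array([0.5, 0.4, 0.9])                 # V(s_t; theta)
returns = n_step_returns(rewards, bootstrap_value=0.7)
advantages = returns - values                      # A_t = G_t^N - V(s_t; theta)
```

The backward recursion means later time steps in the rollout bootstrap over fewer rewards, exactly as in multi-step actor-critic implementations.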
Specifically, each worker i ∈ {1, 2, ..., W}, where W is the number of parallel workers, collects samples according to its current version of the policy weights ω_i, and computes updates via the standard actor-critic gradient defined in (2), with an additional entropy penalty term that prevents premature convergence to deterministic policies:

∇_{ω_i} ( log π(a_t|s_t; ω_i) A_t − η Σ_a π(a|s_t; ω_i) log π(a|s_t; ω_i) ).    (3)

The workers then perform HOGWILD! [Recht et al., 2011] style updates (asynchronous writes) to a shared set of master weights before synchronizing their weights with the master's. More recently, Dhariwal et al. [2017] removed the asynchrony from A3C, referred to as A2C, by instead synchronously collecting transitions in parallel environments i ∈ {1, 2, ..., W} and then performing a large batched update:

∇_ω [ (1/W) Σ_{i=1}^{W} ( log π(a_t^i|s_t^i; ω) A_t^i − η Σ_a π(a|s_t^i; ω) log π(a|s_t^i; ω) ) ].    (4)

Gossip algorithms. Gossip algorithms are used to solve the distributed averaging problem. Suppose there are n agents connected in a peer-to-peer graph topology, each with parameter vector x_i^(0) ∈ R^d. Let X^(0) ∈ R^{n×d} denote the row-wise concatenation of these vectors. The objective is to iteratively compute the average vector (1/n) Σ_{i=1}^{n} x_i^(0) across all agents. Typical gossip iterations have the form X^(k+1) = P^(k) X^(k), where P^(k) ∈ R^{n×n} is referred to as the mixing matrix and defines the communication topology.
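As a toy numerical illustration of these gossip iterations (not from the paper; the ring topology and values are hypothetical), a doubly stochastic mixing matrix drives every agent's parameter vector toward the initial average:

```python
import numpy as np

# Ring of n agents: each agent averages itself with its two neighbours,
# giving a symmetric, doubly stochastic mixing matrix P.
n, d = 5, 2
P = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        P[i, j % n] = 1.0 / 3.0

X = np.arange(n * d, dtype=float).reshape(n, d)   # initial parameters X^(0)
target = X.mean(axis=0)                           # the true average

for _ in range(100):
    X = P @ X                                     # gossip iteration X <- P X
# Every row of X is now numerically equal to the initial average.
```

With a merely row-stochastic P, the iterates would instead converge to a weighted average determined by the ergodic limit of the associated Markov chain, as discussed next.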
This corresponds to the update x_i^(k+1) = Σ_{j=1}^{n} p_{i,j}^(k) x_j^(k) for an agent v_i. At an iteration k, an agent v_i only needs to receive messages from other agents v_j for which p_{i,j}^(k) ≠ 0, so sparser matrices P^(k) correspond to less communication and less synchronization between agents. The mixing matrices P^(k) are designed to be row stochastic (each entry is greater than or equal to zero, and each row sums to 1) so that lim_{K→∞} Π_{k=0}^{K} P^(k) = 1 π^T, where π is the ergodic limit of the Markov chain defined by the P^(k) and 1 is a vector with all entries equal to 1 [Seneta, 1981].¹ Consequently, the gossip iterations converge to a limit X^(∞) = 1 (π^T X^(0)); meaning the value at an agent i converges to x_i^(∞) = Σ_{j=1}^{n} π_j x_j^(0). In particular, if the matrices P^(k) are symmetric and doubly stochastic (each row and each column must sum to 1), we obtain an algorithm such that π_j = 1/n for all j, and therefore x_i^(∞) = (1/n) Σ_{j=1}^{n} x_j^(0) converges to the average of the agents' initial vectors. For the particular case of GALA, we only require the matrices P^(k) to be row stochastic in order to show the ε-ball guarantees.

3 Gossip-based Actor-Learner Architectures

We consider the distributed RL setting where n agents (each composed of a single learner and several actors) collaborate to maximize the expected return V(s). Each agent v_i has a parameterized policy network π(a_t|s_t; ω_i) and value function V(s_t; θ_i). Let x_i = (ω_i, θ_i) denote agent v_i's complete set of trainable parameters. We consider the specific case where each v_i corresponds to a single A2C agent, and the agents are configured in a directed, peer-to-peer communication topology defined by the mixing matrix P ∈ R^{n×n}. In order to maximize the expected reward, each GALA-A2C agent alternates between one local policy-gradient and TD update, and one iteration of asynchronous gossip with its peers.
Pseudocode is provided in Algorithm 1, where N_i^in := {v_j | p_{i,j} > 0} denotes the set of agents that send messages to agent v_i (in-peers), and N_i^out := {v_j | p_{j,i} > 0} denotes the set of agents that v_i sends messages to (out-peers).

Algorithm 1 Gossip-based Actor-Learner Architectures for agent v_i using A2C
Require: Initialize trainable policy and critic parameters x_i = (ω_i, θ_i).
1: for t = 0, 1, 2, ... do
2:   Take N actions {a_t} according to π_{ω_i} and store transitions {(s_t, a_t, r_t, s_{t+1})}
3:   Compute returns G_t^N = Σ_{i=0}^{N−1} γ^i r_{t+i} + γ^N V(s_{t+N}; θ_i) and advantages A_t = G_t^N − V(s_t; θ_i)
4:   Perform A2C optimization step on x_i using TD in (1) and batched policy-gradient in (4)
5:   Broadcast (non-blocking) new parameters x_i to all out-peers in N_i^out
6:   if the receive buffer contains a message m_j from each in-peer v_j in N_i^in then
7:     x_i ← (1/(1+|N_i^in|)) (x_i + Σ_j m_j)    ▷ Average parameters with messages
8:   end if
9: end for
Note: the non-zero mixing weights for agent v_i are set to p_{i,j} = 1/(1+|N_i^in|).

During the gossip phase, agents broadcast their parameters to their out-peers asynchronously (i.e., they do not wait for messages to reach their destination), and update their own parameters via a convex combination of all received messages. Agents broadcast new messages when old transmissions are completed, and aggregate all received messages once they have received a message from each in-peer. Note that the GALA agents use non-blocking communication, and therefore operate asynchronously. Local iteration counters may be out-of-sync, and physical message delays may result in agents incorporating outdated messages from their peers.

¹ Assuming that information from every agent eventually reaches all other agents.
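The broadcast-and-average phase of Algorithm 1 can be sketched as follows. This is a synchronous, single-process simulation (illustrative only, not the paper's asynchronous implementation) of the message exchange, using the mixing weights p_{i,j} = 1/(1+|N_i^in|) on a directed ring:

```python
import numpy as np

def gossip_phase(params, in_peers):
    """Each agent v_i averages its own parameters with the messages m_j
    received from its in-peers: x_i <- (x_i + sum_j m_j) / (1 + |N_i^in|)."""
    messages = {i: [params[j] for j in in_peers[i]] for i in range(len(params))}
    return [
        (params[i] + sum(messages[i])) / (1 + len(messages[i]))
        for i in range(len(params))
    ]

# Directed ring over 4 agents: agent i receives only from agent i - 1.
n = 4
in_peers = {i: [(i - 1) % n] for i in range(n)}
params = [np.full(3, float(i)) for i in range(n)]   # toy parameter vectors

for _ in range(60):
    params = gossip_phase(params, in_peers)
# Repeated gossip drives all agents toward consensus (here the mean 1.5,
# since this particular ring's mixing matrix happens to be doubly stochastic).
```

In GALA proper, this mixing is interleaved with each agent's local A2C update and runs over non-blocking communication, so agents need not be at the same iteration when they mix.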
One can algorithmically enforce an upper bound on the message staleness by having the agent block and wait for communication to complete if more than τ ≥ 0 local iterations have passed since the agent last received a message from its in-peers.

Theoretical ε-ball guarantees: Next we provide the ε-ball theoretical guarantees for the asynchronous GALA agents, proofs of which can be found in Appendix B. Let k ∈ N denote the global iteration counter, which increments whenever any agent (or subset of agents) completes an iteration of the loop defined in Algorithm 1. We define x_i^(k) ∈ R^d as the value of agent v_i's trainable parameters at iteration k, and X^(k) ∈ R^{n×d} as the row-concatenation of these parameters.

For our theoretical guarantees we let the communication topologies be directed and time-varying graphs, and we do not make any assumptions about the base GALA learners. In particular, let the mapping T_i : x_i^(k) ∈ R^d ↦ x_i^(k) − α g_i^(k) ∈ R^d characterize agent v_i's local training dynamics (i.e., agent v_i optimizes its parameters by computing T_i(x_i^(k)) = x_i^(k) − α g_i^(k)), where α > 0 is a reference learning rate, and g_i^(k) ∈ R^d can be any update vector. Lastly, let G^(k) ∈ R^{n×d} denote the row-concatenation of these update vectors.

Proposition 1. For all k ≥ 0, it holds that

‖X^(k+1) − X̄^(k+1)‖ ≤ α Σ_{s=0}^{k} β^{k+1−s} ‖G^(s)‖,

where X̄^(k+1) := (1/n) 1 1^T X^(k+1) denotes the average of the learners' parameters at iteration k+1, and β ∈ [0, 1] is related to the joint spectral radius of the graph sequence defining the communication topology at each iteration.

Proposition 1 shows that the distance of a learner's parameters from consensus is bounded at each iteration. However, without additional assumptions on the communication topology, the constant β may equal 1, and the bound in Proposition 1 can be trivial.
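As an illustrative numerical check of a Proposition-1-style bound (a simplified setting, not the paper's proof: a static, symmetric, doubly stochastic P, exact synchronous gossip, and β taken as the second-largest singular value of P), one can simulate the dynamics X^(k+1) = P(X^(k) − αG^(k)) with bounded updates and compare the consensus distance to the geometric-series limit αβL/(1−β):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, L = 8, 5, 0.1, 1.0

# Static symmetric doubly stochastic mixing matrix (ring, self + 2 neighbours).
P = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        P[i, j % n] = 1.0 / 3.0

# beta: largest singular value of P after projecting out the consensus
# direction, i.e. the second-largest singular value of P.
beta = np.linalg.svd(P - np.ones((n, n)) / n, compute_uv=False)[0]

X = np.zeros((n, d))        # all learners start from identical parameters
max_dev = 0.0
for k in range(500):
    G = rng.uniform(-1.0, 1.0, (n, d))
    G *= L / max(1.0, np.linalg.norm(G))     # enforce ||G^(s)|| <= L
    X = P @ (X - alpha * G)                  # local step, then gossip
    dev = np.linalg.norm(X - X.mean(axis=0, keepdims=True))
    max_dev = max(max_dev, dev)

bound = alpha * beta * L / (1.0 - beta)      # epsilon-ball radius
```

Unrolling ‖X^(k+1) − X̄^(k+1)‖ ≤ β(‖X^(k) − X̄^(k)‖ + α‖G^(k)‖) in this simplified setting gives the geometric series that `bound` sums, so `max_dev` stays below `bound` for every iteration.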
In the following proposition, we make sufficient assumptions with respect to the graph sequence that ensure β < 1.

Proposition 2. Suppose there exists a finite integer B ≥ 0 such that the (potentially time-varying) graph sequence is B-strongly connected, and suppose that the upper bound τ on the message delays in Algorithm 1 is finite. If learners run Algorithm 1 from iteration 0 to k+1, where k ≥ τ + B, then it holds that

‖X^(k+1) − X̄^(k+1)‖ ≤ α β̃ L / (1 − β),

where β < 1 is related to the joint spectral radius of the graph sequence, α is the reference learning rate, β̃ := β^{(τ+B)/(τ+B+1)}, and L := sup_{s=1,2,...} ‖G^(s)‖ denotes an upper bound on the magnitude of the local optimization updates during training.

Proposition 2 states that the agents' parameters are guaranteed to reside within an ε-ball of their average at all iterations k ≥ τ + B. The size of this ball is proportional to the reference learning rate, the spectral radius of the graph topology, and the upper bound on the magnitude of the local gradient updates. One may also be able to control the constant L in practice, since Deep RL agents are typically trained with some form of gradient clipping.

4 Related work

Several recent works have approached scaling up RL by using parallel environments. Mnih et al. [2016] used parallel asynchronous agents to perform HOGWILD! [Recht et al., 2011] style updates to a shared set of parameters. Dhariwal et al. [2017] proposed A2C, which maintains the parallel data collection, but performs updates synchronously, and found this to be more stable empirically. While A3C was originally designed as a purely CPU-based method, Babaeizadeh et al. [2017] proposed GA3C, a GPU implementation of the algorithm. Stooke and Abbeel [2018] also scaled up various RL algorithms by using significantly larger batch sizes and distributing computation onto several GPUs.
Differently from those works, we propose the use of Gossip Algorithms to aggregate\ninformation between different agents and thus simulators. Nair et al. [2015], Horgan et al. [2018],\nEspeholt et al. [2018], Kapturowski et al. [2019], Gruslys et al. [2018] use parallel environments as\nwell, but disentangle the data collection (actors) from the network updates (learners). This provides\nseveral computational bene\ufb01ts, including better hardware utilization and reduced straggler effects. By\ndisentangling acting from learning these algorithms must use off-policy methods to handle learning\nfrom data that is not directly generated from the current policy (e.g., slightly older policies).\nGossip-based approaches have been extensively studied in the control-systems literature as a way\nto aggregate information for distributed optimization algorithms [Nedi\u00b4c et al., 2018]. In particular,\nrecent works have proposed to combine gossip algorithms with stochastic gradient descent in order to\ntrain Deep Neural Networks [Lian et al., 2018, 2017, Assran et al., 2019], but unlike our work, focus\nonly on the supervised classi\ufb01cation paradigm.\n\n5 Experiments\n\nWe evaluate GALA for training Deep RL agents on Atari-2600 games [Machado et al., 2018]. We\nfocus on the same six games studied in Stooke and Abbeel [2018]. Unless otherwise-stated, all\nlearning curves show averages over 10 random seeds with 95% con\ufb01dence intervals shaded in. We\nfollow the reproducibility checklist [Pineau, 2018], see Appendix A for details.\nWe compare A2C [Dhariwal et al., 2017], A3C [Mnih et al., 2016], IMPALA [Espeholt et al., 2018],\nand GALA-A2C. All methods are implemented in PyTorch [Paszke et al., 2017]. While A3C was\noriginally proposed with CPU-based agents with 1-simulator per agent, Stooke and Abbeel [2018]\npropose a large-batch variant in which each agent manages 16-simulators and performs batched\ninference on a GPU. 
We found this large-batch variant to be more stable and computationally\nef\ufb01cient (cf. Appendix C.1). We use the Stooke and Abbeel [2018] variant of A3C to provide a more\ncompetitive baseline. We parallelize A2C training via the canonical approach outlined in Stooke and\nAbbeel [2018], whereby individual A2C agents (running on potentially different devices), all average\ntheir gradients together before each update using the ALLREDUCE primitive.2 For A2C and A3C we\nuse the hyper-parameters suggested in Stooke and Abbeel [2018]. For IMPALA we use the hyper-\nparameters suggested in Espeholt et al. [2018]. For GALA-A2C we use the same hyper-parameters as\nthe original (non-gossip-based) method. All GALA agents are con\ufb01gured in a directed ring graph. All\nimplementation details are described in Appendix C. For the IMPALA baseline, we use a prerelease of\nTorchBeast [K\u00fcttler et al., 2019] available at https://github.com/facebookresearch/torchbeast.\n\n2This is mathematically equivalent to a single A2C agent with multiple simulators (e.g., n agents, with b\n\nsimulators each, are equivalent to a single agent with nb simulators).\n\n5\n\n\fTable 1: Across all training seeds we select the best \ufb01nal policy produced by each method at the end\nof training and evaluate it over 10 evaluation episodes (up to 30 no-ops at the start of the episode).\nEvaluation actions generated from arg maxa \u21e1(a|s). 
The table depicts the mean and standard error across these 10 evaluation episodes.

Method    | Steps | BeamRider   | Breakout | Pong  | Qbert       | Seaquest | SpaceInvaders
IMPALA¹   | 50M   | 8220        | 641      | 21    | 18902       | 1717     | 1727
IMPALA    | 40M   | 7118 ±2536  | 127 ±65  | 21 ±0 | 7878 ±2573  | 462 ±2   | 4071 ±393
A3C       | 40M   | 5674 ±752   | 414 ±56  | 21 ±0 | 14923 ±460  | 1840 ±0  | 2232 ±302
A2C       | 25M   | 8755 ±811   | 419 ±3   | 21 ±0 | 16805 ±172  | 1850 ±5  | 2846 ±22
A2C       | 40M   | 9829 ±1355  | 495 ±57  | 21 ±0 | 19928 ±99   | 1894 ±6  | 3021 ±36
GALA-A2C  | 25M   | 9500 ±1020  | 690 ±72  | 21 ±0 | 18810 ±37   | 1874 ±4  | 2726 ±189
GALA-A2C  | 40M   | 10188 ±1316 | 690 ±72  | 21 ±0 | 20150 ±28   | 1892 ±6  | 3074 ±69

¹ Espeholt et al. [2018] results using a shallow network (identical to the network used in our experiments).

Convergence and stability: We begin by empirically studying the convergence and stability properties of A2C and GALA-A2C. Figure 1a depicts the percentage of successful runs (out of 10 trials) of standard policy-gradient A2C when we sweep the number of simulators across six different games. We define a run as successful if it achieves better than 50% of nominal 16-simulator A2C scores. When using A2C, we observe an identical trend across all games, in which the number of successful runs decreases significantly as we increase the number of simulators. Note that the A2C batch size is proportional to the number of simulators, and when increasing the number of simulators we adjust the learning rate following the recommendation in Stooke and Abbeel [2018].

Figure 1a also depicts the percentage of successful runs when A2C agents communicate their parameters using gossip algorithms (GALA-A2C).
In every simulator sweep across the six games (600 runs), the gossip-based architecture increases or maintains the percentage of successful runs relative to vanilla A2C, when using identical hyper-parameters. We hypothesize that exercising slightly different policies at each learner using gossip algorithms can provide enough decorrelation in gradients to improve learning stability. We revisit this point later on (cf. Figure 3b). We note that Stooke and Abbeel [2018] find that stepping through a random number of uniform random actions at the start of training can partially mitigate this stability issue. We did not use this random-start-action mitigation in the reported experiments.

While Figure 1a shows that GALA can be used to stabilize multi-simulator A2C and increase the number of successful runs, it does not directly say anything about the final performance of the learned models. Figures 1b and 1c show the rewards plotted against the number of environment steps when training with 64 simulators. Using gossip-based architectures stabilizes and maintains the peak performance and sample efficiency of A2C across all six games (Figure 1b), and also increases the number of convergent runs (Figure 1c).

Figures 1d and 1e compare the wall-clock time convergence of GALA-A2C to vanilla A2C. Not only is GALA-A2C more stable than A2C, but it also runs at a higher frame-rate by mitigating straggler effects.
In particular, since GALA-A2C learners do not need to synchronize their gradients, each learner is free to run at its own rate without being hampered by variance in peer stepping times.

Comparison with distributed Deep RL approaches: Figure 1 also compares GALA-A2C to state-of-the-art methods like IMPALA and A3C.³ In each game, the GALA-A2C learners exhibited good sample efficiency and computational efficiency, and achieved highly competitive final game scores. Next we evaluate the final policies produced by each method at the end of training. After training across 10 different seeds, we are left with 10 distinct policies per method. We select the best final policy and evaluate it over 10 evaluation episodes, with actions generated from arg max_a π(a|s). In almost every single game, the GALA-A2C learners achieved the highest evaluation scores of any method. Notably, the GALA-A2C learners that were trained for 25M steps achieved (and in most cases surpassed) the scores for IMPALA learners trained for 50M steps [Espeholt et al., 2018].

³ We report results for both the TorchBeast implementation of IMPALA, and from Table C.1 of Espeholt et al. [2018].

Figure 1: (a) Simulator sweep (percentage of convergent runs out of 10 trials): GALA increases or maintains the percentage of convergent runs relative to A2C. (b)-(c) Sample complexity (best 3 runs for each method; average across 10 runs): GALA maintains the best performance of A2C while being more robust. (d)-(e) Computational complexity (best 3 runs; average across 10 runs): GALA achieves competitive scores in each game and in the shortest amount of time.
(f)-(g) Energy efficiency (best 3 runs for each method; average across 10 runs): GALA achieves competitive game scores while being energy efficient.

Figure 2: (a) The radius of the ε-ball within which the agents' parameters reside during training. The theoretical upper bound in Proposition 1 is explicitly calculated and compared to the true empirical quantity. The bound in Proposition 1 is remarkably tight. (b) Average correlation between agents' gradients during training (darker colors depict low correlation and lighter colors depict higher correlations). Neighbours in the GALA-A2C topology are annotated with the label "peer." The GALA-A2C heatmap is generally much darker than the A2C heatmap, indicating that GALA-A2C agents produce more diverse gradients with significantly less correlation.

Figure 3: Comparing GALA-A2C hardware utilization to that of A2C when using one NVIDIA V100 GPU and 48 Intel CPUs. (a) Samples of instantaneous GPU utilization and power draw plotted against each other. Bubble sizes indicate frame-rates obtained by the corresponding algorithms; larger bubbles depict higher frame-rates. GALA-A2C achieves higher hardware utilization than A2C at comparable power draws. This translates to much higher frame-rates and increased energy efficiency. (b) Hardware utilization/energy efficiency vs. number of simulators. GALA-A2C benefits from increased parallelism and achieves a 10-fold improvement in GPU utilization over A2C.

Effects of gossip: To better understand the stabilizing effects of GALA, we evaluate the diversity in learner policies during training. Figure 2a shows the distance of the agents' parameters from consensus throughout training. The theoretical upper bound in Proposition 1 is also explicitly calculated and plotted in Figure 2a.
As expected, the learner policies remain within an ε-ball of one another in weight-space, and the size of this ball is remarkably well predicted by Proposition 1.

Next, we measure the diversity in the agents' gradients. We hypothesize that the ε-diversity in the policies predicted by Proposition 1, and empirically observed in Figure 2a, may lead to less correlation in the agents' gradients. The categorical heatmap in Figure 2b shows the pair-wise cosine similarity between agents' gradients throughout training, computed after every 500 local environment steps, and averaged over the first 10M training steps. Dark colors depict low correlations and light colors depict high correlations. We observe that GALA-A2C agents exhibited lower gradient correlations than A2C agents. Interestingly, we also observe that GALA-A2C agents' gradients are more correlated with those of peers that they explicitly communicate with (graph neighbours), and less correlated with those of agents that they do not explicitly communicate with.

Computational performance: Figure 3 showcases the hardware utilization and energy efficiency of GALA-A2C compared to A2C as we increase the number of simulators. Specifically, Figure 3a shows that GALA-A2C achieves significantly higher hardware utilization than vanilla A2C at comparable power draws. This translates to much higher frame-rates and increased energy efficiency. Figure 3b shows that GALA-A2C is also better able to leverage increased parallelism and achieves a 10-fold improvement in GPU utilization over vanilla A2C. Once again, the improved hardware utilization and frame-rates translate to increased energy efficiency. In particular, GALA-A2C steps through roughly 20 thousand more frames per kilojoule than vanilla A2C. Figures 1f and 1g compare game scores as a function of energy utilization in kilojoules.
GALA-A2C is distinctly more energy efficient than the other methods, achieving higher game scores with less energy utilization.

6 Conclusion

We propose Gossip-based Actor-Learner Architectures (GALA) for accelerating Deep Reinforcement Learning by leveraging parallel actor-learners that exchange information through asynchronous gossip. We prove that the GALA agents' policies are guaranteed to remain within an ε-ball during training, and verify this empirically as well. We evaluate our approach on six Atari games, and find that GALA-A2C improves the computational efficiency of A2C, while also providing extra stability and robustness by decorrelating gradients. GALA-A2C also achieves significantly higher hardware utilization than vanilla A2C at comparable power draws, and is competitive with state-of-the-art methods like A3C and IMPALA.

Acknowledgments

We would like to thank the authors of TorchBeast for providing their PyTorch implementation of IMPALA.

References

Mahmoud Assran. Asynchronous subgradient push: Fast, robust, and scalable multi-agent optimization. Master's thesis, McGill University, 2018.

Mahmoud Assran and Michael Rabbat. Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950, 2018.

Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic gradient push for distributed deep learning. Proceedings of the 36th International Conference on Machine Learning, 97:344–353, 2019.

Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, and Jan Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Vincent D Blondel, Julien M Hendrickx, Alex Olshevsky, and John N Tsitsiklis. Convergence in multiagent coordination, consensus, and flocking.
Proceedings of the 44th IEEE Conference on Decision and Control, pages 2996–3000, 2005.

Alfredo V Clemente, Humberto N Castejón, and Arjun Chandra. Efficient parallel methods for deep reinforcement learning. arXiv preprint arXiv:1705.04862, 2017.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines, 2017.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures. Proceedings of the 35th International Conference on Machine Learning, 80:1407–1416, 2018.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. Proceedings of the 35th International Conference on Machine Learning, 80:1582–1591, 2018.

Audrunas Gruslys, Mohammad Gheshlaghi Azar, Marc G. Bellemare, and Rémi Munos. The Reactor: A sample-efficient actor-critic architecture. Proceedings of the 6th International Conference on Learning Representations, 2018.

Christoforos N Hadjicostis and Themistoklis Charalambous. Average consensus in the presence of delays in directed graph topologies. IEEE Transactions on Automatic Control, 59(3):763–768, 2013.

Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. Proceedings of the 6th International Conference on Learning Representations, 2018.

Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, and Remi Munos. Recurrent experience replay in distributed reinforcement learning.
Proceedings of the 7th International Conference on Learning Representations, 2019.

Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rocktäschel, and Edward Grefenstette. TorchBeast: A PyTorch platform for distributed RL. arXiv preprint arXiv:1910.03552, 2019.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, pages 5330–5340, 2017.

Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. Proceedings of the 35th International Conference on Machine Learning, 80:3043–3052, 2018.

Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5573–5577, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, 48:1928–1937, 2016.

Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver.
Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015.

Angelia Nedić, Alex Olshevsky, and Michael G Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Joelle Pineau. The machine learning reproducibility checklist (version 1.0), 2018.

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 24:693–701, 2011.

Eugene Seneta. Non-negative Matrices and Markov Chains. Springer, 1981.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

Adam Stooke and Pieter Abbeel. Accelerated methods for deep reinforcement learning. arXiv preprint arXiv:1803.02811, 2018.

Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

Richard Stuart Sutton.
Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 1984.

Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, and C Lawrence Zitnick. ELF OpenGo: An analysis and open reimplementation of AlphaZero. Proceedings of the 36th International Conference on Machine Learning, 97:6244–6253, 2019.

John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

O Vinyals, I Babuschkin, J Chung, M Mathieu, M Jaderberg, W Czarnecki, A Dudzik, A Huang, P Georgiev, R Powell, et al. AlphaStar: Mastering the real-time strategy game StarCraft II, 2019.

Jacob Wolfowitz. Products of indecomposable, aperiodic, stochastic matrices. Proceedings of the American Mathematical Society, 14(5):733–737, 1963.