{"title": "Reconciling \u03bb-Returns with Experience Replay", "book": "Advances in Neural Information Processing Systems", "page_first": 1133, "page_last": 1142, "abstract": "Modern deep reinforcement learning methods have departed from the incremental learning required for eligibility traces, rendering the implementation of the \u03bb-return difficult in this context. In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of \u03bb-returns. Yet replay-based methods are often the most sample efficient, and incorporating \u03bb-returns into them is a viable way to achieve new state-of-the-art performance. Towards this, we propose the first method to enable practical use of \u03bb-returns in arbitrary replay-based methods without relying on other forms of decorrelation such as asynchronous gradient updates. By promoting short sequences of past transitions into a small cache within the replay memory, adjacent \u03bb-returns can be efficiently precomputed by sharing Q-values. Computation is not wasted on experiences that are never sampled, and stored \u03bb-returns behave as stable temporal-difference (TD) targets that replace the target network. Additionally, our method grants the unique ability to observe TD errors prior to sampling; for the first time, transitions can be prioritized by their true significance rather than by a proxy to it. Furthermore, we propose the novel use of the TD error to dynamically select \u03bb-values that facilitate faster learning. We show that these innovations can enhance the performance of DQN when playing Atari 2600 games, even under partial observability. 
While our work specifically focuses on \u03bb-returns, these ideas are applicable to any multi-step return estimator.", "full_text": "Reconciling \u03bb-Returns with Experience Replay\n\nBrett Daley\n\nChristopher Amato\n\nKhoury College of Computer Sciences\n\nKhoury College of Computer Sciences\n\nNortheastern University\n\nBoston, MA 02115\n\nb.daley@northeastern.edu\n\nNortheastern University\n\nBoston, MA 02115\n\nc.amato@northeastern.edu\n\nAbstract\n\nModern deep reinforcement learning methods have departed from the incremental\nlearning required for eligibility traces, rendering the implementation of the \u03bb-return\ndif\ufb01cult in this context. In particular, off-policy methods that utilize experience\nreplay remain problematic because their random sampling of minibatches is not\nconducive to the ef\ufb01cient calculation of \u03bb-returns. Yet replay-based methods are\noften the most sample ef\ufb01cient, and incorporating \u03bb-returns into them is a viable\nway to achieve new state-of-the-art performance. Towards this, we propose the\n\ufb01rst method to enable practical use of \u03bb-returns in arbitrary replay-based methods\nwithout relying on other forms of decorrelation such as asynchronous gradient\nupdates. By promoting short sequences of past transitions into a small cache within\nthe replay memory, adjacent \u03bb-returns can be ef\ufb01ciently precomputed by sharing\nQ-values. Computation is not wasted on experiences that are never sampled, and\nstored \u03bb-returns behave as stable temporal-difference (TD) targets that replace the\ntarget network. Additionally, our method grants the unique ability to observe TD\nerrors prior to sampling; for the \ufb01rst time, transitions can be prioritized by their\ntrue signi\ufb01cance rather than by a proxy to it. Furthermore, we propose the novel\nuse of the TD error to dynamically select \u03bb-values that facilitate faster learning. 
We\nshow that these innovations can enhance the performance of DQN when playing\nAtari 2600 games, even under partial observability. While our work speci\ufb01cally\nfocuses on \u03bb-returns, these ideas are applicable to any multi-step return estimator.\n\nIntroduction\n\n1\nEligibility traces [1, 15, 36] have been a historically successful approach to the credit assignment\nproblem in reinforcement learning. By applying time-decaying 1-step updates to recently visited\nstates, eligibility traces provide an ef\ufb01cient and online mechanism for generating the \u03bb-return at each\ntimestep [34]. The \u03bb-return (equivalent to an exponential average of all n-step returns [39]) often\nyields faster empirical convergence by interpolating between low-variance temporal-difference (TD)\nreturns [33] and low-bias Monte Carlo returns. Eligibility traces can be effective when the reward\nsignal is sparse or the environment is partially observable.\nMore recently, deep reinforcement learning has shown promise on a variety of high-dimensional\ntasks such as Atari 2600 games [25], Go [32], 3D maze navigation [23], Doom [17], and robotic\nlocomotion [6, 11, 18, 19, 29]. While neural networks are theoretically compatible with eligibility\ntraces [34], training a non-linear function approximator online can cause divergence due to the\nstrong correlations between temporally successive states [37]. Circumventing this issue has required\nunconventional solutions like experience replay [21], in which gradient updates are conducted using\nrandomly sampled past experience to decorrelate the training data. Experience replay is also important\nfor sample ef\ufb01ciency because environment transitions are reused multiple times rather than being\ndiscarded immediately. 
For this reason, well-tuned algorithms using experience replay such as\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fRainbow [12] and ACER [38] are still among the most sample-ef\ufb01cient deep reinforcement learning\nmethods today for playing Atari 2600 games.\nThe dependency of the \u03bb-return on many future Q-values makes it prohibitively expensive to com-\nbine directly with minibatched experience replay when the Q-function is a deep neural network.\nConsequently, replay-based methods that use \u03bb-returns (or derivative estimators like Retrace(\u03bb)\n[26]) have been limited to algorithms that can learn from long, sequential trajectories [8, 38] or\nutilize asynchronous parameter updates [24] to decorrelate such trajectories [26]. A general method\nfor combining \u03bb-returns with minibatch sampling would be useful for a vast array of off-policy\nalgorithms including DQN [25], DRQN [10], SDQN [22], DDPG [20], NAF [7], and UVFA [30]\nthat cannot learn from sequential trajectories like these.\nIn this paper, we present a general strategy for rectifying \u03bb-returns and replayed minibatches of\nexperience. We propose the use of a cache within the replay memory to store precomputed \u03bb-returns\nand replace the function of a target network. The cache is formed from short sequences of experience\nthat allow the \u03bb-returns to be computed ef\ufb01ciently via recursion while maintaining an acceptably low\ndegree of sampling bias. A unique bene\ufb01t to this approach is that each transition\u2019s TD error can be\nobserved before it is sampled, enabling novel sampling techniques that utilize this information. We\nexplore these opportunities by prioritizing samples according to their actual TD error magnitude \u2014\nrather than a proxy to it like in prior work [31] \u2014 and also dynamically selecting \u03bb-values to facilitate\nfaster learning. 
Together, these methods can significantly increase the sample efficiency of DQN\nwhen playing Atari 2600 games, even when the complete environment state is obscured. The ideas\nintroduced here are general enough to be incorporated into any replay-based reinforcement learning\nmethod, where similar performance improvements would be expected.\n\n2 Background\n\nReinforcement learning is the problem where an agent must interact with an unknown environment\nthrough trial-and-error in order to maximize its cumulative reward [34]. We first consider the standard\nsetting where the environment can be formulated as a Markov Decision Process (MDP) defined by\nthe 4-tuple (S,A,P,R). At a given timestep t, the environment exists in state st \u2208 S. The agent\ntakes an action at \u2208 A according to policy \u03c0(at|st), causing the environment to transition to a new\nstate st+1 \u223c P(st, at) and yield a reward rt \u223c R(st, at, st+1). Hence, the agent\u2019s goal can be\nformalized as finding a policy that maximizes the expected discounted return E\u03c0[\u2211_{i=0}^{H} \u03b3^i r_i] up to\nsome horizon H. The discount \u03b3 \u2208 [0, 1] affects the relative importance of future rewards and allows\nthe sum to converge in the case where H \u2192 \u221e, \u03b3 \u2260 1. An important property of the MDP is that\nevery state s \u2208 S satisfies the Markov property; that is, the agent needs to consider only the current\nstate st when selecting an action in order to perform optimally.\nIn reality, most problems of interest violate the Markov property. Information presently accessible to\nthe agent may be incomplete or otherwise unreliable, and therefore is no longer a sufficient statistic for\nthe environment\u2019s history [13]. 
We can extend our previous formulation to the more general case of the\nPartially Observable Markov Decision Process (POMDP) defined by the 6-tuple (S,A,P,R, \u2126,O).\nAt a given timestep t, the environment exists in state st \u2208 S and reveals observation ot \u223c O(st), ot \u2208 \u2126.\nThe agent takes an action at \u2208 A according to policy \u03c0(at|o0, . . . , ot) and receives a reward\nrt \u223c R(st, at, st+1), causing the environment to transition to a new state st+1 \u223c P(st, at). In this\nsetting, the agent may need to consider arbitrarily long sequences of past observations when selecting\nactions in order to perform well.1\nWe can mathematically unify MDPs and POMDPs by introducing the notion of an approximate\nstate \u02c6st = \u03c6(o0, . . . , ot), where \u03c6 defines an arbitrary transformation of the observation history. In\npractice, \u03c6 might consider only a subset of the history \u2014 even just the most recent observation. This\nallows for the identical treatment of MDPs and POMDPs by generalizing the notion of a Bellman\nbackup, and greatly simplifies our following discussion. However, it is important to emphasize that\n\u02c6st \u2260 st in general, and that the choice of \u03c6 can affect the solution quality.\n\n1 To achieve optimality, the policy must additionally consider the action history in general.\n\n2.1 \u03bb-returns\n\nIn the control setting, value-based reinforcement learning algorithms seek to produce an accurate\nestimate Q(\u02c6st, at) of the expected discounted return achieved by following the optimal policy \u03c0\u2217\nafter taking action at in state \u02c6st. Suppose the agent acts according to the (possibly suboptimal) policy\n\u00b5 and experiences the finite trajectory \u02c6st, at, rt, \u02c6st+1, at+1, rt+1, . . . , \u02c6sT . 
The estimate at time t can\nbe improved, for example, by using the n-step TD update [34]:\n\nQ(\u02c6st, at) \u2190 Q(\u02c6st, at) + \u03b1[R(n)_t \u2212 Q(\u02c6st, at)]    (1)\n\nwhere R(n)_t is the n-step return2 and \u03b1 is the learning rate controlling the magnitude of the update.\nWhen n = 1, Equation (1) is equivalent to Q-Learning [39]. In practice, the 1-step update suffers from\nslow credit assignment and high estimation bias. Increasing n enhances the immediate sensitivity to\nfuture rewards and decreases the bias, but at the expense of greater variance, which may require more\nsamples to converge to the true expectation. Any valid return estimator can be substituted for the\nn-step return in Equation (1), including weighted averages of multiple n-step returns [34]. A popular\nchoice is the \u03bb-return, defined as the exponential average of every n-step return [39]:\n\nR\u03bb_t = (1 \u2212 \u03bb) \u2211_{n=1}^{N\u22121} \u03bb^{n\u22121} R(n)_t + \u03bb^{N\u22121} R(N)_t    (2)\n\nwhere N = T \u2212 t and \u03bb \u2208 [0, 1] is a hyperparameter that controls the decay rate. When \u03bb = 0,\nEquation (2) reduces to the 1-step return. When \u03bb = 1 and \u02c6sT is terminal, Equation (2) reduces to the\nMonte Carlo return. The \u03bb-return can thus be seen as a smooth interpolation between these methods.3\nWhen learning offline, it is often the case that a full sequence of \u03bb-returns needs to be calculated.\nComputing Equation (2) repeatedly for each state in an N-step trajectory would require roughly\nN + (N \u2212 1) + \u00b7\u00b7\u00b7 + 1 = O(N^2) operations, which is impractical. 
Alternatively, given the full\ntrajectory, the \u03bb-returns can be calculated efficiently with recursion:\n\nR\u03bb_t = R(1)_t + \u03b3\u03bb[R\u03bb_{t+1} \u2212 max_{a'\u2208A} Q(\u02c6st+1, a')]    (3)\n\nWe include a derivation in Appendix D for reference, but this formulation4 has been commonly used\nin prior work [5, 27]. Because R\u03bb_t can be computed given R\u03bb_{t+1} in a constant number of operations,\nthe entire sequence of \u03bb-returns can be generated with O(N) time complexity. Note that the \u03bb-return\npresented here unconditionally conducts backups using the maximizing action for each n-step return,\nregardless of which actions were actually selected by the behavioral policy \u00b5. This is equivalent to\nPeng\u2019s Q(\u03bb) [27]. Although Peng\u2019s Q(\u03bb) has been shown to perform well empirically, its mixture of\non- and off-policy data does not guarantee convergence [34]. One possible alternative is Watkin\u2019s\nQ(\u03bb) [39], which terminates the \u03bb-return calculation by setting \u03bb = 0 whenever an exploratory\naction is taken. Watkin\u2019s Q(\u03bb) provably converges to the optimal policy in the tabular case [26], but\nterminating the returns in this manner can slow learning [34].
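The backward recursion in Equation (3) translates directly into code. Below is a minimal Python sketch of this O(N) computation; the function name, argument names, and the list-based trajectory format are illustrative assumptions, not the paper's implementation:

```python
def lambda_returns(rewards, max_q_next, discount, lam):
    """Compute a full sequence of Peng's Q(lambda) returns backwards.

    rewards[t]    -- reward r_t for each of the N trajectory steps
    max_q_next[t] -- max over a' of Q(s_{t+1}, a'); use 0.0 if s_{t+1} is terminal
    """
    N = len(rewards)
    returns = [0.0] * N
    # Base case: the final lambda-return is just the 1-step return.
    returns[N - 1] = rewards[N - 1] + discount * max_q_next[N - 1]
    # Each step reuses R_{t+1}, so the whole sequence costs O(N):
    # R_t = R1_t + discount * lam * (R_{t+1} - max_q_next[t])
    for t in reversed(range(N - 1)):
        one_step = rewards[t] + discount * max_q_next[t]
        returns[t] = one_step + discount * lam * (returns[t + 1] - max_q_next[t])
    return returns
```

Setting lam = 0 recovers the 1-step returns, while lam = 1 on a terminated trajectory recovers the Monte Carlo returns, matching the limiting cases of Equation (2).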
Unfortunately,\ndirectly updating Q according to a gradient-based version of Equation (1) does not work well [25, 37].\nTo overcome this, transitions (\u02c6st, at, rt, \u02c6st+1) are stored in a replay memory D and gradient descent\nis performed on uniformly sampled minibatches of past experience. A target network with stale\nparameters \u03b8\u2212 copied from \u03b8 every F timesteps helps prevent oscillations of Q. Hence, DQN\nbecomes a minimization problem where the following loss is iteratively approximated and reduced:\n\nL(\u03b8) = E_{(\u02c6s,a,r,\u02c6s')\u223cU(D)}[(r + \u03b3 max_{a'\u2208A} Q(\u02c6s', a'; \u03b8\u2212) \u2212 Q(\u02c6s, a; \u03b8))^2]\n\nDQN assumes Markovian inputs, but a single Atari 2600 game frame is partially observable. Hence,\nthe four most-recent observations were concatenated together to form an approximate state in [25].\n\n2 Defined as R(n)_t = rt + \u03b3rt+1 + \u00b7\u00b7\u00b7 + \u03b3^{n\u22121}rt+n\u22121 + \u03b3^n max_{a'\u2208A} Q(\u02c6st+n, a'), n \u2208 {1, 2, . . . , T \u2212 t}.\n3 Additionally, the monotonically decreasing weights can be interpreted as the recency heuristic, which\nassumes that recent states and actions are likelier to have contributed to a given reward [34].\n4 The equation is sometimes rewritten: R\u03bb_t = rt + \u03b3[\u03bbR\u03bb_{t+1} + (1 \u2212 \u03bb) max_{a'\u2208A} Q(\u02c6st+1, a')].\n\n3 Experience replay with \u03bb-returns\n\nDeep reinforcement learning invariably utilizes offline learning schemes, making the recursive \u03bb-return\nin Equation (3) ideal for these methods. Nevertheless, combining \u03bb-returns with experience\nreplay remains challenging. 
This is because the \u03bb-return theoretically depends on all future Q-values.\nCalculating Q-values is notoriously expensive for deep reinforcement learning due to the neural\nnetwork \u2014 an important distinction from tabular methods where Q-values can merely be retrieved\nfrom a look-up table. Even if the \u03bb-return calculation were truncated after 10 timesteps, it would still\nrequire 10 times the computation of a 1-step method. This would be useful only in rare cases where\nmaximal sample ef\ufb01ciency is desired at all costs.\nAn ideal \u03bb-return algorithm using experience replay would more favorably balance computation and\nsample ef\ufb01ciency, while simultaneously allowing for arbitrary function approximators and learning\nmethods. In this section, we propose several techniques to implement such an algorithm. For the\npurposes of our discussion, we use DQN to exemplify the ideas in the following sections, but they are\napplicable to any off-policy reinforcement learning method. We refer to this particular instantiation\nof our methods as DQN(\u03bb); the pseudocode is provided in Appendix B.\n\n3.1 Refreshed \u03bb-returns\n\nBecause the \u03bb-return is substantially more expensive than the 1-step return, the ideal replay-based\nmethod minimizes the number of times each return estimate is computed. Hence, our principal\nmodi\ufb01cation of DQN is to store each return R\u03bb\nt along with its corresponding transition in the replay\nmemory D. Training becomes a matter of sampling minibatches of precomputed \u03bb-returns from D\nand reducing the squared error. Of course, the calculation of R\u03bb\nt must be suf\ufb01ciently deferred because\nof its dependency on future states and rewards; one choice might be to wait until a terminal state is\nreached and then transfer the episode\u2019s \u03bb-returns to D. 
The new loss function becomes the following:\n\nL(\u03b8) = E_{(\u02c6s,a,R\u03bb)\u223cU(D)}[(R\u03bb \u2212 Q(\u02c6s, a; \u03b8))^2]\n\nThere are two major advantages to this strategy. First, no computation is repeated when a transition is\nsampled more than once. Second, adjacent \u03bb-returns in the replay memory can be calculated very\nefficiently with the recursive update in Equation (3). The latter point is crucial; while computing\nrandomly accessed \u03bb-returns may require 10 or more Q-values per \u03bb-return as discussed earlier,\ncomputing them in reverse chronological order requires only one Q-value per \u03bb-return.\nOne remaining challenge is that the stored \u03bb-returns become outdated as the Q-function evolves,\nslowing learning when the replay memory is large. Fortunately, this presents an opportunity to\neliminate the target network altogether. Rather than copying parameters \u03b8 to \u03b8\u2212 every F timesteps,\nwe refresh the \u03bb-returns in the replay memory using the present Q-function. This achieves the same\neffect by providing stable TD targets, but eliminates the redundant target network.\n\n3.2 Cache\n\nRefreshing all of the \u03bb-returns in the replay memory using the recursive formulation in Equation (3)\nachieves maximum Q-value efficiency by exploiting adjacency, and removes the need for a target\nnetwork. However, this process is still prohibitively expensive for typical DQN implementations\nthat have a replay memory capacity on the order of millions of transitions. To make the runtime\ninvariant to the size of the replay memory, we propose a novel strategy where S/B contiguous \"blocks\"\nof B transitions are randomly promoted from the replay memory to build a cache C of size S. By\nrefreshing only this small memory and sampling minibatches directly from it, calculations are not\nwasted on \u03bb-returns that are ultimately never used. 
Furthermore, each block can still be ef\ufb01ciently\nrefreshed using Equation (3) as before. Every F timesteps, the cache is regenerated from newly\nsampled blocks (Figure 1), once again obviating the need for a target network.\n\n4\n\n\fFigure 1: Our proposed cache-building process. For each randomly sampled index, a sequence\n(\"block\") of \u03bb-returns is ef\ufb01ciently generated backwards via recursion. Together, the blocks form the\nnew cache, which is treated as a surrogate for the replay memory for the following F timesteps.\n\nCaching is crucial to achieve practical runtime performance with \u03bb-returns, but it introduces minor\nsample correlations that violate DQN\u2019s theoretical requirement of independently and identically\ndistributed (i.i.d.) data. An important question to answer is how pernicious such correlations are in\npractice; if performance is not adversely affected \u2014 or, at the very least, the bene\ufb01ts of \u03bb-returns\novercome such effects \u2014 then we argue that the violation of the i.i.d. assumption is justi\ufb01ed. In\nFigure 2, we compare cache-based DQN with standard target-network DQN on Seaquest and Space\nInvaders using n-step returns (all experimental procedures are detailed later in Section 5). Although\nthe sampling bias of the cache decreases performance on Seaquest, the loss can be mostly recovered\nby increasing the cache size S. On the other hand, Space Invaders provides an example of a game\nwhere the cache actually outperforms the target network despite this bias. 
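A cache-building pass of the kind described above can be sketched in a few lines. This is a simplified illustration under assumed data structures (a flat list of (state, action, reward, done) tuples and a q_max callback evaluating the current network); it is not the authors' code:

```python
import random

def build_cache(replay, q_max, S, B, discount, lam):
    """Promote S // B random contiguous blocks of B transitions and refresh
    their lambda-returns backwards, bootstrapping a 1-step return at each
    block boundary and at episode terminals."""
    cache = []
    for _ in range(S // B):
        start = random.randrange(len(replay) - B + 1)
        R = 0.0
        block = []
        for i in reversed(range(start, start + B)):
            state, action, reward, done = replay[i]
            q = 0.0 if done else q_max(i)       # max over a' of Q(s_{i+1}, a')
            one_step = reward + discount * q
            if done or i == start + B - 1:      # truncate: no future return to blend
                R = one_step
            else:                               # recursion of Equation (3)
                R = one_step + discount * lam * (R - q)
            block.append((state, action, R))
        cache.extend(reversed(block))
    return cache
```

Minibatches would then be drawn from the returned cache for the next F timesteps, after which the cache is rebuilt with fresh Q-values.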
In our later experiments,\nwe find that the choice of the return estimator has a significantly larger impact on performance than\nthese sampling correlations do, and therefore the bias matters little in practice.\n\n3.3 Directly prioritized replay\n\nTo our knowledge, DQN(\u03bb) is the first method with experience replay to compute returns before they\nare sampled, meaning it is possible to observe the TD errors of transitions prior to replaying them.\nThis allows for the opportunity to specifically select samples that will facilitate the fastest learning.\nWhile prioritized experience replay has been explored in prior work [31], these techniques rely on\nthe previously seen (and therefore outdated) TD error as a proxy for ranking samples. This is because\nthe standard target-network approach to DQN computes TD errors as transitions are sampled, only to\nimmediately render them inaccurate by the subsequent gradient update. Hence, we call our approach\ndirectly prioritized replay to emphasize that the true TD error is initially used. The tradeoff of our\nmethod is that only samples within the cache \u2014 not the full replay memory \u2014 can be prioritized.\nWhile any prioritization distribution is possible, we propose a mixture between a uniform distribution\nover C and a uniform distribution over the samples in C whose absolute TD errors exceed some\nquantile. An interesting case arises when the chosen quantile is the median; the distribution becomes\nsymmetric and has a simple analytic form. Letting p \u2208 [0, 1] be our interpolation hyperparameter and\n\u03b4i represent the (unique) error of sample xi \u2208 C, we can write the sampling probability explicitly:\n\nP(xi) = (1 + p)/S  if |\u03b4i| > median(|\u03b40|, |\u03b41|, . . . , |\u03b4S\u22121|)\n       = 1/S       if |\u03b4i| = median(|\u03b40|, |\u03b41|, . . . , |\u03b4S\u22121|)\n       = (1 \u2212 p)/S  if |\u03b4i| < median(|\u03b40|, |\u03b41|, . . . , |\u03b4S\u22121|)\n\nA distribution of this form is appealing because it is scale-invariant and insensitive to noisy TD errors,\nhelping it to perform consistently across a wide variety of reward functions. Following previous work\n[31], we linearly anneal p to 0 during training to alleviate the bias caused by prioritization.\n\nFigure 2: Ablation analysis of our caching method on Seaquest and Space Invaders. Using the\n3-step return with DQN for all experiments, we compared the scores obtained by caches of size\nS \u2208 {80000, 160000, 240000} against a target-network baseline. As expected, the cache\u2019s violation\nof the i.i.d. assumption has a negative performance impact on Seaquest, but this can be mostly\nrecovered by increasing S. Surprisingly, the trend is reversed for Space Invaders, indicating that\nthe cache\u2019s sample correlations do not always harm performance. Because the target network is\nimpractical for computing \u03bb-returns, the cache is effective when \u03bb-returns outperform n-step returns.\n\n3.4 Dynamic \u03bb selection\n\nThe ability to analyze \u03bb-returns offline presents a unique opportunity to dynamically choose \u03bb-values\naccording to certain criteria. In previous work, tabular reinforcement learning methods have\nutilized variable \u03bb-values to adjust credit assignment according to the number of times a state has\nbeen visited [34, 35] or a model of the n-step return variance [16]. In our setting, where function\napproximation is used to generalize across a high-dimensional state space, it is difficult to track\nstate-visitation frequencies and states might not be visited more than once. Alternatively, we propose\nto select \u03bb-values based on their TD errors. 
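The median-based sampling distribution of Section 3.3 has a direct implementation. The sketch below (illustrative function name; assumes unique absolute TD errors and, for exact normalization, an odd cache size S) computes each cached sample's probability:

```python
def priority_probs(abs_td_errors, p):
    """Mixture distribution: (1 + p)/S above the median |TD error|,
    (1 - p)/S below it, and 1/S at the median itself."""
    S = len(abs_td_errors)
    med = sorted(abs_td_errors)[(S - 1) // 2]  # lower median if S is even
    probs = []
    for e in abs_td_errors:
        if e > med:
            probs.append((1 + p) / S)
        elif e < med:
            probs.append((1 - p) / S)
        else:
            probs.append(1.0 / S)
    return probs
```

Annealing p toward 0 during training, as described in Section 3.3, smoothly recovers uniform sampling over the cache.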
One strategy we found to work well empirically is to\ncompute several different \u03bb-returns and then select the median return at each timestep. Formally, we\nredefine R\u03bb_t = median(R^{\u03bb=0/k}_t, R^{\u03bb=1/k}_t, . . . , R^{\u03bb=k/k}_t), where k + 1 is the number of evenly spaced\ncandidate \u03bb-values. We used k = 20 for all of our experiments; larger values yielded marginal\nbenefit. Median-based selection is appealing because it integrates multiple \u03bb-values in an intuitive\nway and is robust to outliers that could cause destructive gradient updates. In Appendix C, we also\nexperimented with selecting \u03bb-values that bound the mean absolute error of each cache block, but we\nfound median-based selection to work better in practice.\n\n4 Related work\n\nThe \u03bb-return has been used in prior work to improve the sample efficiency of Deep Recurrent Q-Network\n(DRQN) for Atari 2600 games [8]. Because recurrent neural networks (RNNs) produce a\nsequence of Q-values during truncated backpropagation through time, these precomputed values can\nbe exploited to calculate \u03bb-returns with little additional expense over standard DRQN. The problem\nwith this approach is its lack of generality; the Q-function is restricted to RNNs, and the length\nN over which the \u03bb-return is computed must be constrained to the length of the training sequence.\nConsequently, increasing N to improve credit assignment forces the training sequence length to be\nincreased as well, introducing undesirable side effects like exploding and vanishing gradients [3]\nand a substantial runtime cost. 
Additionally, the use of a target network means \u03bb-returns must be\nrecalculated on every training step, even when the input sequence and Q-function do not change.\nIn contrast, our proposed caching mechanism only periodically updates stored \u03bb-returns, thereby\navoiding repeated calculations and eliminating the need for a target network altogether. This strategy\nprovides maximal flexibility by decoupling the training sequence length from the \u03bb-return length and\nmakes no assumptions about the function approximator. This allows it to be incorporated into any\nreplay-based algorithm and not just DRQN.\n\nFigure 3: Sample efficiency comparison of DQN(\u03bb) with \u03bb \u2208 {0.25, 0.5, 0.75, 1} against 3-step\nDQN on six Atari games.\n\n5 Experiments\n\nIn order to characterize the performance of DQN(\u03bb), we conducted numerous experiments on six\nAtari 2600 games. We used the OpenAI Gym [4] to provide an interface to the Arcade Learning\nEnvironment [2], where observations consisted of the raw frame pixels. We compared DQN(\u03bb)\nagainst a standard target-network implementation of DQN using the 3-step return, which was shown\nto work well in [12]. We matched the hyperparameters and procedures in [25], except we trained the\nneural networks with Adam [14]. Unless stated otherwise, \u03bb-returns were formulated as Peng\u2019s Q(\u03bb).\nFor all experiments in this paper, agents were trained for 10 million timesteps. 
An agent\u2019s performance\nat a given time was evaluated by averaging the earned scores of its past 100 completed episodes.\nEach experiment was averaged over 10 random seeds with the standard error of the mean indicated.\nOur complete experimental setup is discussed in Appendix A.\nPeng\u2019s Q(\u03bb): We compared DQN(\u03bb) using Peng\u2019s Q(\u03bb) for \u03bb \u2208 {0.25, 0.5, 0.75, 1} against the\nbaseline on each of the six Atari games (Figure 3). For every game, at least one \u03bb-value matched or\noutperformed the 3-step return. Notably, \u03bb \u2208 {0.25, 0.5} yielded huge performance gains over the\nbaseline on Breakout and Space Invaders. This \ufb01nding is quite interesting because n-step returns have\nbeen shown to perform poorly on Breakout [12], suggesting that \u03bb-returns can be a better alternative.\nWatkin\u2019s Q(\u03bb): Because Peng\u2019s Q(\u03bb) is a biased return estimator, we repeated the previous exper-\niments using Watkin\u2019s Q(\u03bb). The results are included in Appendix E. Surprisingly, Watkin\u2019s Q(\u03bb)\nfailed to outperform Peng\u2019s Q(\u03bb) on every environment we tested. The worse performance is likely\ndue to the cut traces, which slow credit assignment in spite of their bias correction.\nDirectly prioritized replay and dynamic \u03bb selection: We tested DQN(\u03bb) with prioritization p = 0.1\nand median-based \u03bb selection on the six Atari games. The results are shown in Figure 4. In general,\nwe found that dynamic \u03bb selection did not improve performance over the best hand-picked \u03bb-value;\nhowever, it always matched or outperformed the 3-step baseline without any manual \u03bb tuning.\nPartial observability: In Appendix F, we repeated the experiments in Figure 4 but provided agents\nwith only a 1-frame input to make the environments partially observable. 
We hypothesized that the\nrelative performance difference between DQN(\u03bb) and the baseline would be greater under partial\nobservability, but we found that it was largely unchanged.\n\nFigure 4: Sample efficiency comparison of DQN(\u03bb) with prioritization p = 0.1 and median-based\ndynamic \u03bb selection against 3-step DQN on six Atari games.\n\nReal-time sample efficiency: In certain scenarios, it may be desirable to train a model as quickly as\npossible without regard to the number of environment samples. For the best \u03bb-value we tested on\neach game in Figure 3, we plotted the score as a function of wall-clock time and compared it against\nthe target-network baseline in Appendix G. Significantly, DQN(\u03bb) completed training faster than\nDQN on five of the six games. This shows that the cache can be more computationally efficient than\na target network. We believe the speedup is attributed to greater GPU parallelization when computing\nQ-values because the cache blocks are larger than a typical minibatch.\n\n6 Conclusion\n\nWe proposed a novel technique that allows for the efficient integration of \u03bb-returns into any off-policy\nmethod with minibatched experience replay. By storing \u03bb-returns in a periodically refreshed\ncache, we eliminate the need for a target network and enable offline analysis of the TD errors prior\nto sampling. 
This latter feature is particularly important, making our method the first to directly prioritize samples according to their actual loss contribution. To our knowledge, our method is also the first to explore dynamically selected λ-values for deep reinforcement learning. Our experiments showed that these contributions can increase the sample efficiency of DQN by a large margin.

While our work focused specifically on λ-returns, our proposed methods are equally applicable to any multi-step return estimator. One avenue for future work is to utilize a lower-variance, bias-corrected return such as Tree Backup [28], Q*(λ) [9], or Retrace(λ) [26] for potentially better performance. Furthermore, although our method does not require asynchronous gradient updates, a multi-threaded implementation of DQN(λ) could feasibly enhance both absolute and real-time sample efficiency. The ideas presented here should prove useful to a wide range of off-policy reinforcement learning methods, improving performance while limiting training duration.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback. We also gratefully acknowledge NVIDIA Corporation for its GPU donation. This research was funded by NSF award 1734497 and an Amazon Research Award (ARA).

References

[1] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems.
IEEE Transactions on Systems, Man, and Cybernetics, 13(5):834–846, 1983.

[2] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[5] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.

[6] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.

[7] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.

[8] Jean Harb and Doina Precup. Investigating recurrence and eligibility traces in deep Q-networks. arXiv preprint arXiv:1704.05495, 2017.

[9] Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Rémi Munos. Q(λ) with off-policy corrections. In International Conference on Algorithmic Learning Theory, pages 305–320. Springer, 2016.

[10] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.

[11] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al.
Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

[12] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[13] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] A Harry Klopf. Brain function and adaptive systems: A heterostatic theory. Technical report, Air Force Cambridge Research Labs, Hanscom AFB, MA, 1972.

[16] George Konidaris, Scott Niekum, and Philip S Thomas. TDγ: Re-evaluating complex backups in temporal difference learning. In Advances in Neural Information Processing Systems, pages 2402–2410, 2011.

[17] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In AAAI, pages 2140–2146, 2017.

[18] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[19] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.

[20] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[21] Long-Ji Lin.
Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

[22] Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035, 2017.

[23] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.

[24] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[26] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.

[27] Jing Peng and Ronald J Williams. Incremental multi-step Q-learning. In Machine Learning Proceedings 1994, pages 226–232. Elsevier, 1994.

[28] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

[29] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.

[30] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators.
In International Conference on Machine Learning, pages 1312–1320, 2015.

[31] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[32] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[33] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

[34] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

[35] Richard S Sutton and Satinder P Singh. On step-size and bias in temporal-difference learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pages 91–96. Citeseer, 1994.

[36] Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts, Amherst, 1984.

[37] John N Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.

[38] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

[39] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.