{"title": "Meta-Gradient Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2396, "page_last": 2407, "abstract": "The goal of reinforcement learning algorithms is to estimate and/or optimise\nthe value function. However, unlike supervised learning, no teacher or oracle is\navailable to provide the true value function. Instead, the majority of reinforcement\nlearning algorithms estimate and/or optimise a proxy for the value function. This\nproxy is typically based on a sampled and bootstrapped approximation to the true\nvalue function, known as a return. The particular choice of return is one of the\nchief components determining the nature of the algorithm: the rate at which future\nrewards are discounted; when and how values should be bootstrapped; or even the\nnature of the rewards themselves. It is well-known that these decisions are crucial\nto the overall success of RL algorithms. We discuss a gradient-based meta-learning\nalgorithm that is able to adapt the nature of the return, online, whilst interacting\nand learning from the environment. When applied to 57 games on the Atari 2600\nenvironment over 200 million frames, our algorithm achieved a new state-of-the-art\nperformance.", "full_text": "Meta-Gradient Reinforcement Learning\n\nZhongwen Xu\n\nDeepMind\n\nzhongwen@google.com\n\nHado van Hasselt\n\nDeepMind\n\nhado@google.com\n\nDavid Silver\nDeepMind\n\ndavidsilver@google.com\n\nAbstract\n\nThe goal of reinforcement learning algorithms is to estimate and/or optimise\nthe value function. However, unlike supervised learning, no teacher or oracle is\navailable to provide the true value function. Instead, the majority of reinforcement\nlearning algorithms estimate and/or optimise a proxy for the value function. This\nproxy is typically based on a sampled and bootstrapped approximation to the true\nvalue function, known as a return. 
The particular choice of return is one of the\nchief components determining the nature of the algorithm: the rate at which future\nrewards are discounted; when and how values should be bootstrapped; or even the\nnature of the rewards themselves. It is well-known that these decisions are crucial\nto the overall success of RL algorithms. We discuss a gradient-based meta-learning\nalgorithm that is able to adapt the nature of the return, online, whilst interacting\nand learning from the environment. When applied to 57 games on the Atari 2600\nenvironment over 200 million frames, our algorithm achieved a new state-of-the-art\nperformance.\n\nThe central goal of reinforcement learning (RL) is to optimise the agent\u2019s return (cumulative reward);\nthis is typically achieved by a combination of prediction and control. The prediction subtask is to\nestimate the value function \u2013 the expected return from any given state. Ideally, this would be achieved\nby updating an approximate value function towards the true value function. The control subtask is to\noptimise the agent\u2019s policy for selecting actions, so as to maximise the value function. Ideally, the\npolicy would simply be updated in the direction that increases the true value function. However, the\ntrue value function is unknown and therefore, for both prediction and control, a sampled return is\ninstead used as a proxy. A large family of RL algorithms [Sutton, 1988, Rummery and Niranjan,\n1994, van Seijen et al., 2009, Sutton and Barto, 2018], including several state-of-the-art deep RL\nalgorithms [Mnih et al., 2015, van Hasselt et al., 2016, Harutyunyan et al., 2016, Hessel et al., 2018,\nEspeholt et al., 2018], are characterised by different choices of the return.\nThe discount factor \u03b3 determines the time-scale of the return. 
A discount factor close to \u03b3 = 1\nprovides a long-sighted goal that accumulates rewards far into the future, while a discount factor\nclose to \u03b3 = 0 provides a short-sighted goal that prioritises short-term rewards. Even in problems\nwhere long-sightedness is clearly desired, it is frequently observed that discounts \u03b3 < 1 achieve\nbetter results [Prokhorov and Wunsch, 1997], especially during early learning. It is known that many\nalgorithms converge faster with lower discounts [Bertsekas and Tsitsiklis, 1996], but of course too\nlow a discount can lead to highly sub-optimal policies that are too myopic. In practice it can be better\nto optimise first for a myopic horizon, e.g., with \u03b3 = 0, and then to repeatedly increase the\ndiscount only after learning has become somewhat successful [Prokhorov and Wunsch, 1997].\nThe return may also be bootstrapped at different time horizons. An n-step return accumulates rewards\nover n time-steps and then adds the value function at the nth time-step. The \u03bb-return [Sutton, 1988,\nSutton and Barto, 2018] is a geometrically weighted combination of n-step returns. In either case,\nthe meta-parameter n or \u03bb can be important to the performance of the algorithm, trading off bias and\nvariance. 
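The claim that the \u03bb-return geometrically mixes n-step returns can be checked numerically. The sketch below is our own illustration (the rewards, values and step sizes are hypothetical, not taken from the paper): it forms the truncated mixture (1 \u2212 \u03bb) \u03a3 \u03bb^(n\u22121) G(n) over n = 1..N\u22121, with the final n-step return G(N) absorbing the remaining \u03bb^(N\u22121) weight.

```python
# Numerical check (hypothetical numbers) that the lambda-return is a
# geometrically weighted combination of n-step returns.

def n_step_return(rewards, values, gamma, n):
    """G^(n): accumulate n discounted rewards, then bootstrap from v(S_{t+n})."""
    g = sum(gamma ** k * rewards[k] for k in range(n))
    return g + gamma ** n * values[n]

rewards = [0.0, 1.0, 0.5]        # R_{t+1}, R_{t+2}, R_{t+3}
values = [0.3, 0.2, 0.1, 0.4]    # v(S_t), ..., v(S_{t+3})
gamma, lam = 0.9, 0.7

N = len(rewards)
# Geometric mixture; the last n-step return takes the residual lam^(N-1) mass.
mix = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, gamma, n)
          for n in range(1, N)) + lam ** (N - 1) * n_step_return(rewards, values, gamma, N)
```

The same quantity can equivalently be computed by the backward recursion that the paper uses to define the \u03bb-return.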
Many researchers have sought to automate the selection of these parameters [Kearns and\nSingh, 2000, Downey and Sanner, 2010, Konidaris et al., 2011, White and White, 2016].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThere are potentially many other design choices that may be represented in the return, including\noff-policy corrections [Espeholt et al., 2018, Munos et al., 2016], target networks [Mnih et al., 2015],\nemphasis on certain states [Sutton et al., 2016], reward clipping [Mnih et al., 2013], or even the nature\nof the rewards themselves [Randl\u00f8v and Alstr\u00f8m, 1998, Singh et al., 2005, Zheng et al., 2018].\nIn this work, we are interested in one of the fundamental problems in reinforcement learning: what\nwould be the best form of return for the agent to maximise? Specifically, we propose to learn the\nreturn function by treating it as a parametric function with tunable meta-parameters \u03b7, for instance\nincluding the discount factor \u03b3, or the bootstrapping parameter \u03bb [Sutton, 1988]. The meta-parameters\n\u03b7 are adjusted online during the agent\u2019s interaction with the environment, allowing the return to\nboth adapt to the specific problem, and also to dynamically adapt over time to the changing context\nof learning. We derive a practical gradient-based meta-learning algorithm and show that this can\nsignificantly improve performance on large-scale deep reinforcement learning applications.\n\n1 Meta-Gradient Reinforcement Learning Algorithms\n\nIn deep reinforcement learning, the value function and policy are approximated by a neural network\nwith parameters \u03b8, denoted by v\u03b8(S) and \u03c0\u03b8(A|S) respectively. At the core of the algorithm is an\nupdate function,\n\n\u03b8\u2032 = \u03b8 + f(\u03c4, \u03b8, \u03b7), (1)\n\nthat adjusts parameters from a sequence of experience \u03c4t = {St, At, Rt+1, . . .
} consisting of states\nS, actions A and rewards R. The nature of the function is determined by meta-parameters \u03b7.\nOur meta-gradient RL approach is based on the principle of online cross-validation [Sutton, 1992],\nusing successive samples of experience. The underlying RL algorithm is applied to the first sample\n(or samples), and its performance is measured in a subsequent sample. Specifically, the algorithm\nstarts with parameters \u03b8, and applies the update function to the first sample(s), resulting in new\nparameters \u03b8\u2032. The gradient d\u03b8\u2032/d\u03b7 of these updates indicates how the meta-parameters affected\nthese new parameters.\nThe algorithm then measures the performance of the new parameters \u03b8\u2032 on a second sample \u03c4\u2032. For\ninstance, when learning online, \u03c4\u2032 could be the next time-step immediately following \u03c4. Performance\nis measured by a differentiable meta-objective \u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7) that uses a fixed meta-parameter \u00af\u03b7.\nThe gradient of the meta-objective with respect to the meta-parameters \u03b7 is obtained by applying the\nchain rule:\n\n\u2202\u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7)/\u2202\u03b7 = \u2202\u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7)/\u2202\u03b8\u2032 \u00b7 d\u03b8\u2032/d\u03b7. (2)\n\nTo compute the gradient of the updates, d\u03b8\u2032/d\u03b7, we note that the parameters form an additive sequence,\nand the gradient can therefore be accumulated online [Williams and Zipser, 1989],\n\nd\u03b8\u2032/d\u03b7 = d\u03b8/d\u03b7 + \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b7 + \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b8 \u00b7 d\u03b8/d\u03b7 = (I + \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b8) d\u03b8/d\u03b7 + \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b7. (3)\n\nThis update has the form\n\nz\u2032 = Az + \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b7,\n\nwhere z = d\u03b8/d\u03b7 and z\u2032 = d\u03b8\u2032/d\u03b7.\nThe exact gradient is given by A = I + \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b8. In practice, the gradient \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b8 is\nlarge and challenging to compute \u2014 it is an n \u00d7 n matrix, where n is the number of parameters in \u03b8.\nIn practice, we approximate the gradient, z \u2248 d\u03b8/d\u03b7. One possibility is to use an alternate update\nA = I + \u02c6\u2202f(\u03c4, \u03b8, \u03b7)/\u02c6\u2202\u03b8 using a cheap approximate derivative \u02c6\u2202f(\u03c4, \u03b8, \u03b7)/\u02c6\u2202\u03b8 \u2248 \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b8,\nfor instance using a diagonal approximation [Sutton, 1992, Schraudolph, 1999]. Furthermore, the\ngradient accumulation defined above assumes that the meta-parameters \u03b7 are held fixed throughout\ntraining. In practice, we are updating \u03b7 and therefore it may be desirable to decay the trace into the\npast [Schraudolph, 1999], A = \u00b5(I + \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b8), using decay rate \u00b5 \u2208 [0, 1]. The simplest\napproximation is to use A = 0 (or equivalently \u00b5 = 0), which means that we only consider the effect\nof the meta-parameters \u03b7 on a single update; this approximation is especially cheap to compute.\n\n\fFinally, the meta-parameters \u03b7 are updated to optimise the meta-objective, for example by applying\nstochastic gradient descent (SGD) to update \u03b7 in the direction of the meta-gradient,\n\n\u2206\u03b7 = \u2212\u03b2 \u2202\u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7)/\u2202\u03b8\u2032 \u00b7 z\u2032, (4)\n\nwhere \u03b2 is the learning rate for updating the meta-parameters \u03b7.
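As a toy illustration of Equations (1)\u2013(4) (our sketch, not the paper's implementation), the loop below runs the additive update, accumulates the trace z \u2248 d\u03b8/d\u03b7 using the scalar analogue of A = \u00b5(I + \u2202f/\u2202\u03b8), and descends the meta-gradient of a held-out squared error. All quantities are scalars, and the inner update f, the meta-target and the step sizes are invented for the example.

```python
# Scalar sketch of the meta-gradient accumulation: theta' = theta + f(tau, theta, eta),
# trace z' = mu * (1 + df/dtheta) * z + df/deta, meta-update eta' = eta - beta * dJbar/dtheta' * z'.

ALPHA = 0.1  # inner-update step size (hypothetical)

def f(theta, eta):
    """Toy update rule: pull theta towards a target controlled by eta."""
    return ALPHA * (eta - theta)          # df/dtheta = -ALPHA, df/deta = ALPHA

def meta_step(theta, eta, z, meta_target, beta=0.5, mu=1.0):
    theta_new = theta + f(theta, eta)                 # Equation (1)
    z_new = mu * (1.0 - ALPHA) * z + ALPHA            # Equation (3), with decay mu
    # Meta-objective Jbar = (meta_target - theta')^2 on "held-out" data;
    # its gradient w.r.t. theta' is -2 * (meta_target - theta').
    d_jbar_d_theta = -2.0 * (meta_target - theta_new)
    eta_new = eta - beta * d_jbar_d_theta * z_new     # Equation (4)
    return theta_new, eta_new, z_new

theta, eta, z = 0.0, 0.0, 0.0
for _ in range(200):
    theta, eta, z = meta_step(theta, eta, z, meta_target=1.0)
# eta adapts so that the inner updates drive theta towards the meta-target.
```

With \u00b5 = 1 the trace converges to the exact scalar d\u03b8/d\u03b7; setting \u00b5 = 0 recovers the cheap single-update approximation described above.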
The pseudo-code for the meta-gradient\nreinforcement learning algorithm is provided in Appendix A.\nIn the following sections we instantiate this idea more specifically to RL algorithms based on\npredicting or controlling returns. We begin with a pedagogical example of using meta-gradients for\nprediction using a temporal-difference update. We then consider meta-gradients for control, using a\ncanonical actor-critic update function and a policy gradient meta-objective. Many other instantiations\nof meta-gradient RL would be possible, since the majority of deep reinforcement learning updates are\ndifferentiable functions of the return, including, for instance, value-based methods like SARSA(\u03bb)\n[Rummery and Niranjan, 1994, Sutton and Barto, 2018] and DQN [Mnih et al., 2015], policy-gradient\nmethods [Williams, 1992], or actor-critic algorithms like A3C [Mnih et al., 2016] and\nIMPALA [Espeholt et al., 2018].\n\n1.1 Applying Meta-Gradients to Returns\n\nWe define the return g\u03b7(\u03c4t) to be a function of an episode or a truncated n-step sequence of experience\n\u03c4t = {St, At, Rt+1, . . . , St+n}. The nature of the return is determined by the meta-parameters \u03b7.\nThe n-step return [Sutton and Barto, 2018] accumulates rewards over the sequence and then bootstraps\nfrom the value function,\n\ng\u03b7(\u03c4t) = Rt+1 + \u03b3Rt+2 + \u03b3\u00b2Rt+3 + . . . + \u03b3^(n\u22121)Rt+n + \u03b3^n v\u03b8(St+n), (5)\n\nwhere \u03b7 = {\u03b3}.\nThe \u03bb-return is a geometric mixture of n-step returns [Sutton, 1988],\n\ng\u03b7(\u03c4t) = Rt+1 + \u03b3(1 \u2212 \u03bb)v\u03b8(St+1) + \u03b3\u03bbg\u03b7(\u03c4t+1), (6)\n\nwhere \u03b7 = {\u03b3, \u03bb}. The \u03bb-return has the advantage of being fully differentiable with respect to the\nmeta-parameters. The meta-parameters \u03b7 may be viewed as gates that cause the return to terminate\n(\u03b3 = 0) or bootstrap (\u03bb = 0), or to continue onto the next step (\u03b3 = 1 and \u03bb = 1).
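The recursion of Equation (6), and the gating behaviour just described, can be written out directly. The following sketch is our illustration (the rewards and values are hypothetical): it computes the \u03bb-return for a truncated sequence and checks the two gating extremes, where \u03bb = 0 yields the one-step TD target and \u03bb = 1 yields the full n-step return of Equation (5).

```python
# Backward recursion for the lambda-return of Equation (6):
# g(t) = R[t+1] + gamma * ((1 - lam) * v[t+1] + lam * g(t+1)),
# bootstrapping from the value at the end of the truncated sequence.

def lambda_return(rewards, values, gamma, lam):
    """rewards[t] = R_{t+1}; values[t] = v(S_t); len(values) == len(rewards) + 1."""
    g = values[-1]                      # bootstrap at the sequence end
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
    return g

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.2]           # v(S_t), ..., v(S_{t+3})
gamma = 0.9

# lam = 0 gates bootstrapping on immediately: the one-step TD target.
assert abs(lambda_return(rewards, values, gamma, 0.0) - (1.0 + 0.9 * 0.4)) < 1e-9
# lam = 1 never bootstraps before the end: the full n-step return of Equation (5).
n_step = 1.0 + 0.9 * 0.0 + 0.9 ** 2 * 2.0 + 0.9 ** 3 * 0.2
assert abs(lambda_return(rewards, values, gamma, 1.0) - n_step) < 1e-9
```

Because the recursion is built from sums and products of \u03b3 and \u03bb, it is differentiable in both, which is what the meta-gradient relies on.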
The n-step or \u03bb-return can be augmented with off-policy corrections [Precup et al., 2000, Sutton et al., 2014,\nEspeholt et al., 2018] if it is necessary to correct for the distribution used to generate the data.\nA typical RL algorithm would hand-select the meta-parameters, such as the discount factor \u03b3 and\nbootstrapping parameter \u03bb, and these would be held fixed throughout training. Instead, we view the\nreturn g as a function parameterised by meta-parameters \u03b7, which may be differentiated to understand\nits dependence on \u03b7. This in turn allows us to compute the gradient \u2202f/\u2202\u03b7 of the update function\nwith respect to the meta-parameters \u03b7, and hence the meta-gradient \u2202\u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7)/\u2202\u03b7. In essence,\nour agent asks itself the question, \u201cwhich return results in the best performance?\u201d, and adjusts its\nmeta-parameters accordingly.\n\n1.2 Meta-Gradient Prediction\n\nWe begin with a simple instantiation of the idea, based on the canonical TD(\u03bb) algorithm for prediction.\nThe objective of the TD(\u03bb) algorithm (according to the forward view [Sutton and Barto, 2018]) is to\nminimise the squared error between the value function approximator v\u03b8(S) and the \u03bb-return g\u03b7(\u03c4),\n\nJ(\u03c4, \u03b8, \u03b7) = (g\u03b7(\u03c4) \u2212 v\u03b8(S))\u00b2, \u2202J(\u03c4, \u03b8, \u03b7)/\u2202\u03b8 = \u22122(g\u03b7(\u03c4) \u2212 v\u03b8(S)) \u2202v\u03b8(S)/\u2202\u03b8, (7)\n\nwhere \u03c4 is a sampled trajectory starting with state S, and \u2202J(\u03c4, \u03b8, \u03b7)/\u2202\u03b8 is a semi-gradient [Sutton\nand Barto, 2018], i.e.
the \u03bb-return is treated as constant with respect to \u03b8.\nThe TD(\u03bb) update function f(\u00b7) applies SGD to update the agent\u2019s parameters \u03b8 to descend the\ngradient of the objective with respect to the parameters,\n\nf(\u03c4, \u03b8, \u03b7) = \u2212(\u03b1/2) \u2202J(\u03c4, \u03b8, \u03b7)/\u2202\u03b8 = \u03b1(g\u03b7(\u03c4) \u2212 v\u03b8(S)) \u2202v\u03b8(S)/\u2202\u03b8, (8)\n\nwhere \u03b1 is the learning rate for updating the agent\u2019s parameters \u03b8. We note that this update is itself a differentiable\nfunction of the meta-parameters \u03b7,\n\n\u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b7 = \u2212(\u03b1/2) \u2202\u00b2J(\u03c4, \u03b8, \u03b7)/\u2202\u03b8\u2202\u03b7 = \u03b1 \u2202g\u03b7(\u03c4)/\u2202\u03b7 \u00b7 \u2202v\u03b8(S)/\u2202\u03b8. (9)\n\nThe key idea of the meta-gradient prediction algorithm is to adjust the meta-parameters \u03b7 in the direction\nthat achieves the best predictive accuracy. This is measured by cross-validating the new parameters \u03b8\u2032\non a second trajectory \u03c4\u2032 that starts from state S\u2032, using a mean squared error (MSE) meta-objective\nand taking its semi-gradient,\n\n\u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7) = (g\u00af\u03b7(\u03c4\u2032) \u2212 v\u03b8\u2032(S\u2032))\u00b2, \u2202\u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7)/\u2202\u03b8\u2032 = \u22122(g\u00af\u03b7(\u03c4\u2032) \u2212 v\u03b8\u2032(S\u2032)) \u2202v\u03b8\u2032(S\u2032)/\u2202\u03b8\u2032. (10)\n\nThe meta-objective in this case could make use of an unbiased and long-sighted return1, for example\nusing \u00af\u03b7 = {\u00af\u03b3, \u00af\u03bb} where \u00af\u03b3 = 1 and \u00af\u03bb = 1.\n\n1.3 Meta-Gradient Control\n\nWe now provide a practical example of meta-gradients applied to control.
We focus on the A2C\nalgorithm \u2013 an actor-critic update function that combines both prediction and control into a single\nupdate. This update function is widely used in several state-of-the-art agents [Mnih et al., 2016, Jaderberg\net al., 2017b, Espeholt et al., 2018]. The semi-gradient of the A2C objective, \u2202J(\u03c4, \u03b8, \u03b7)/\u2202\u03b8, is\ndefined as follows,\n\n\u2212\u2202J(\u03c4, \u03b8, \u03b7)/\u2202\u03b8 = (g\u03b7(\u03c4) \u2212 v\u03b8(S)) \u2202log \u03c0\u03b8(A|S)/\u2202\u03b8 + b(g\u03b7(\u03c4) \u2212 v\u03b8(S)) \u2202v\u03b8(S)/\u2202\u03b8 + c \u2202H(\u03c0\u03b8(\u00b7|S))/\u2202\u03b8. (11)\n\nThe first term represents a control objective, encouraging the policy \u03c0\u03b8 to select actions that maximise\nthe return. The second term represents a prediction objective, encouraging the value function\napproximator v\u03b8 to more accurately estimate the return g\u03b7(\u03c4). The third term regularises the policy\naccording to its entropy H(\u03c0\u03b8), and b, c are scalar coefficients that weight the different components\nin the objective function.\nThe A2C update function f(\u00b7) applies SGD to update the agent\u2019s parameters \u03b8. This update function\nis a differentiable function of the meta-parameters \u03b7,\n\nf(\u03c4, \u03b8, \u03b7) = \u2212\u03b1 \u2202J(\u03c4, \u03b8, \u03b7)/\u2202\u03b8, \u2202f(\u03c4, \u03b8, \u03b7)/\u2202\u03b7 = \u03b1 \u2202g\u03b7(\u03c4)/\u2202\u03b7 [\u2202log \u03c0\u03b8(A|S)/\u2202\u03b8 + b \u2202v\u03b8(S)/\u2202\u03b8]. (12)\n\nNow we come to the choice of meta-objective \u00afJ to use for control. Our goal is to identify the return\nfunction that maximises overall performance in our agents.
This may be directly measured by a\nmeta-objective focused exclusively on optimising returns \u2013 in other words, a policy gradient objective,\n\n\u2202\u00afJ(\u03c4\u2032, \u03b8\u2032, \u00af\u03b7)/\u2202\u03b8\u2032 = (g\u00af\u03b7(\u03c4\u2032) \u2212 v\u03b8\u2032(S\u2032)) \u2202log \u03c0\u03b8\u2032(A\u2032|S\u2032)/\u2202\u03b8\u2032. (13)\n\nThis equation evaluates how good the updated policy \u03b8\u2032 is in terms of returns computed under \u00af\u03b7, when\nmeasured on \u201cheld-out\u201d experiences \u03c4\u2032, e.g. the subsequent n-step trajectory. When cross-validating\nperformance using this meta-objective, we use fixed meta-parameters \u00af\u03b7, ideally representing a good\nproxy to the true objective of the agent. In practice this typically means selecting reasonable values of\n\u00af\u03b7; the agent is free to adapt its meta-parameters \u03b7 and choose values that perform better in practice.\nWe now put the meta-gradient control algorithm together. First, the parameters \u03b8 are updated on\na sample of experience \u03c4 using the A2C update function (Equation (11)), and the gradient of the\nupdate (Equation (12)) is accumulated into trace z. Second, the performance is cross-validated on a\nsubsequent sample of experience \u03c4\u2032 using the policy gradient meta-objective (Equation (13)). Finally,\nthe meta-parameters \u03b7 are updated according to the gradient of the meta-objective (Equation (4)).\n\n1The meta-objective could even use a discount factor that is longer-sighted than the original problem, perhaps spanning over many episodes.\n\n\f1.4 Conditioned Value and Policy Functions\n\nOne complication of the approach outlined above is that the return function g\u03b7(\u03c4) is non-stationary,\nadapting along with the meta-parameters throughout the training process.
As a result, there is a\ndanger that the value function v\u03b8 becomes inaccurate, since it may be approximating old returns.\nFor example, the value function may initially form a good approximation of a short-sighted return\nwith \u03b3 = 0, but if \u03b3 subsequently adapts to \u03b3 = 1 then the value function may suddenly find its\napproximation is rather poor. The same principle applies to the policy \u03c0, which again may have\nspecialised to old returns.\nTo deal with non-stationarity in the value function and policy, we utilise an idea similar to universal\nvalue function approximation (UVFA) [Schaul et al., 2015]. The key idea is to provide the meta-\nparameters \u03b7 as an additional input to condition the value function and policy, as follows:\n\nv^\u03b7_\u03b8(S) = v\u03b8([S; e\u03b7]), \u03c0^\u03b7_\u03b8(S) = \u03c0\u03b8([S; e\u03b7]),\n\nwhere e\u03b7 is the embedding of \u03b7, and [S; e\u03b7] denotes the concatenation of the vectors S and e\u03b7. The embedding\nnetwork e\u03b7 is updated by backpropagation during training, but the gradient does not flow through \u03b7.\nIn this way, the agent explicitly learns value functions and policies that are appropriate for various \u03b7.\nThe approximation problem becomes a little harder, but the payoff is that the algorithm can freely\nshift the meta-parameters without needing to wait for the approximator to \u201ccatch up\u201d.\n\n1.5 Meta-Gradient Reinforcement Learning in Practice\n\nTo scale up the meta-gradient approach, several additional steps were taken. For efficiency, the A2C\nobjective and meta-objective were accumulated over all time-steps within an n-step trajectory of\nexperience. The A2C objective was optimised by RMSProp [Tieleman and Hinton, 2012] without\nmomentum [Mnih et al., 2015, 2016, Espeholt et al., 2018].
This is a differentiable function of\nthe meta-parameters, and can therefore be substituted similarly to SGD (see Equation (12)); this\nprocess may be simplified by automatic differentiation (Appendix C.2). As in IMPALA, an off-policy\ncorrection was used, based on a V-trace return (see Appendix C.1). For efficient implementation,\nmini-batches of trajectories were computed in parallel; each trajectory was used twice, once in the\nupdate function and once for cross-validation (see Appendix C.3).\n\n2 Illustrative Examples\n\nTo illustrate the key idea of our meta-gradient approach, we provide two examples that show how\nthe discount factor \u03b3 and temporal-difference parameter \u03bb, respectively, can be meta-learned. We\nfocus on meta-gradient prediction using the TD(\u03bb) algorithm and an MSE meta-objective with \u00af\u03b3 = 1\nand \u00af\u03bb = 1, as described in Section 1.2. For these illustrative examples, we consider state-dependent\nmeta-parameters that can take on a different value in each state.\nThe first example is a 10-step Markov reward process (MRP) that alternates between \u201csignal\u201d and\n\u201cnoise\u201d transitions. Transitions from odd-numbered \u201csignal\u201d states receive a small positive reward,\nR = +0.1. Transitions from even-numbered \u201cnoise\u201d states receive a random reward, R \u223c N(0, 1).\nTo ensure that the signal can overwhelm the noise, it is beneficial to terminate the return (low \u03b3) in\n\u201cnoise\u201d states, but to continue the return (high \u03b3) in \u201csignal\u201d states.\nThe second example is a 9-step MRP that alternates between random rewards and the negation of\nwhatever reward was received on the previous step. The sum of rewards over each such pair of\ntime-steps is zero. There are 9 transitions, so the last reward is always random.
To predict accurately,\nit is beneficial to bootstrap (low \u03bb) in states for which the value function is well-known and equal to\nzero, but to avoid bootstrapping (high \u03bb) in the noisier, partially observed state for which the return\nwill depend on the previous reward, which cannot be inferred from the state itself.\nFigure 1 shows the results of meta-gradient prediction using the TD(\u03bb) algorithm. The meta-gradient\nalgorithm was able to adapt both \u03bb and \u03b3 to form returns that alternate between high or low values in\nodd or even states respectively.\n\n\f(a) Chain MRP. For the adaptive \u03b3 experiments, the rewards alternate between +0.1 and zero-mean Gaussian on\neach step. For the adaptive \u03bb experiments, the rewards alternate between zero-mean Gaussians and the negative\nof whatever reward was received on the previous step: Rt+1 = \u2212Rt.\n\nFigure 1: Illustrative results of meta-gradient learning of a state-dependent (a) bootstrapping parameter\n\u03bb or (b) discount factor \u03b3, in the respective Markov reward processes (top). In each pair of plots shown at the bottom, the first shows how the meta-parameter \u03b3 or \u03bb adapts over the course of\ntraining (averaged over 10 seeds; shaded regions cover the 20th\u201380th percentiles). The second plot\nshows the final value of \u03b3 or \u03bb in each state, identifying appropriately high/low values for odd/even\nstates respectively (violin plots show the distribution over seeds).\n\n3 Deep Reinforcement Learning Experiments\n\nIn this section, we demonstrate the advantages of the proposed meta-gradient learning approach using\nthe state-of-the-art actor-critic framework IMPALA [Espeholt et al., 2018]. We focused on adapting the\ndiscount factor \u03b7 = {\u03b3} (which we found to be the most effective meta-parameter in preliminary\nexperiments). We also investigated adapting the bootstrapping parameter \u03bb.
For these experiments,\nthe meta-parameters were state-independent, adapting one scalar value for \u03b3 and \u03bb respectively\n(state-dependent meta-parameters did not provide significant benefit in preliminary experiments).2\n\n3.1 Experiment Setup\n\nWe validate the proposed approach on Atari 2600 video games from the Arcade Learning Environment\n(ALE) [Bellemare et al., 2013], a standard benchmark for deep reinforcement learning algorithms.\nWe build our agent with the IMPALA framework [Espeholt et al., 2018], an efficient distributed\nimplementation of the actor-critic architecture [Sutton and Barto, 2018, Mnih et al., 2016]. We utilise the\ndeep ResNet architecture [He et al., 2016] specified in Espeholt et al. [2018], which has shown great\nadvantages over the shallow architecture [Mnih et al., 2015]. Following Espeholt et al., we train our\nagent for 200 million frames. Our algorithm does not require extra data compared to the baseline\nalgorithms, as each experience can be utilised both in training the agent itself and in training the\nmeta-parameters \u03b7 (i.e., each experience can serve as validation data for other experiences). We describe\nthe detailed implementation in Appendix C.3. For full details about the IMPALA implementation\nand the specific off-policy correction g\u03b7(\u03c4), please refer to Espeholt et al. [2018].\nThe agents are evaluated on 57 different Atari games and the median of human-normalised scores [Nair\net al., 2015, van Hasselt et al., 2016, Wang et al., 2016b, Mnih et al., 2016] is reported. There are two\ndifferent evaluation protocols. The first protocol is \u201chuman starts\u201d [Nair et al., 2015, Wang et al.,\n2016b, van Hasselt et al., 2016], which initialises episodes to a state that is randomly sampled from\nhuman play. The second protocol is \u201cno-op starts\u201d, which initialises each episode with a random\nsequence of no-op actions; this protocol is also used during training.
We keep the configuration (e.g.,\nbatch size, unroll length, learning rate, entropy cost) the same as specified in Espeholt et al. [2018] for\na fair comparison. To be self-contained, we provide all of the important hyper-parameters used\nin this paper, including the ones following Espeholt et al. [2018] and the additional meta-learning\noptimisation hyper-parameters (i.e., meta batch size, meta learning rate \u03b2, embedding size for \u03b7), in\nAppendix B.\n\n2In practice we parameterise \u03b7 = \u03c3(x), where \u03c3 is the logistic function \u03c3(x) = 1/(1 + e\u2212x); i.e. the meta-parameters are actually the logits of \u03b3 and \u03bb.\n\n\f[Figure 1 plots: per-state \u03b3 and \u03bb during training and after training.]\n\n\u03b7 | Human starts, \u03b3 (or \u00af\u03b3) = 0.99 | Human starts, \u03b3 (or \u00af\u03b3) = 0.995 | No-op starts, \u03b3 (or \u00af\u03b3) = 0.99 | No-op starts, \u03b3 (or \u00af\u03b3) = 0.995\nIMPALA, \u03b7 = {}: 144.4% | 211.9% | 191.8% | 257.1%\nMeta-gradient, \u03b7 = {\u03bb}: 156.6% | 214.2% | 185.5% | 246.5%\nMeta-gradient, \u03b7 = {\u03b3}: 233.2% | 267.9% | 280.9% | 275.5%\nMeta-gradient, \u03b7 = {\u03b3, \u03bb}: 221.6% | 292.9% | 242.6% | 287.6%\n\nTable 1: Results of meta-learning the discount parameter \u03b3, the temporal-difference learning parameter\n\u03bb, or both \u03b3 and \u03bb, compared to the baseline IMPALA algorithm, which meta-learns neither. Results\nare given both for the discount factor \u03b3 = 0.99 originally reported in [Espeholt et al., 2018] and also\nfor a tuned discount factor \u03b3 = 0.995 (see Appendix D.1); the cross-validated discount factor \u00af\u03b3 in\nthe meta-objective was set to the same value for a fair comparison.
The meta-learning hyper-parameters are chosen according to the performance on six\nAtari games, as is common practice in deep RL Atari experiments [van Hasselt et al., 2016, Mnih et al.,\n2016, Wang et al., 2016b]. Additional implementation details are provided in Appendix C.\n\n3.2 Experiment Results\n\nWe compared four variants of the IMPALA algorithm: the original baseline algorithm without meta-\ngradients, i.e. \u03b7 = {}; using meta-gradients with \u03b7 = {\u03bb}; using meta-gradients with \u03b7 = {\u03b3}; and\nusing meta-gradients with \u03b7 = {\u03b3, \u03bb}. The original IMPALA algorithm used a discount factor of\n\u03b3 = 0.99; however, when we manually tuned the discount factor we found that a discount factor of\n\u03b3 = 0.995 performed considerably better (see Appendix D.1). For a fair comparison, we tested our\nmeta-gradient algorithm in both cases. When the discount factor is not adapted, \u03b7 = {} or \u03b7 = {\u03bb},\nwe used a fixed value of \u03b3 = 0.99 or \u03b3 = 0.995. When the discount factor is adapted, \u03b7 = {\u03b3} or\n\u03b7 = {\u03b3, \u03bb}, we cross-validate with a meta-parameter of \u00af\u03b3 = 0.99 or \u00af\u03b3 = 0.995 accordingly in the\nmeta-objective \u00afJ (Equation (13)). Manual tuning of the \u03bb parameter did not have a significant impact\non performance and we therefore compared only to the original value of \u03bb = 1.\nWe summarise the median human-normalised scores in Table 1; individual improvements on each\ngame, compared to the IMPALA baseline, are given in Appendix E.1; and individual plots demonstrating\nthe adaptation of \u03b3 and \u03bb are provided in Appendix E.2.
The meta-gradient RL algorithm\nincreased the median performance, compared to the baseline algorithm, by a margin between 30%\nand 80% across \u201chuman starts\u201d and \u201cno-op starts\u201d conditions, and with both \u03b3 = 0.99 and \u03b3 = 0.995.\nWe also verified the architecture choice of conditioning the value function v and policy \u03c0 on the\nmeta-parameters \u03b7. We compared the proposed algorithm with an identical meta-gradient algorithm\nthat adapts the discount factor \u03b7 = {\u03b3}, but does not provide an embedding of the discount factor as\nan input to \u03c0 and v. For this experiment, we used a cross-validation discount factor of \u00af\u03b3 = 0.995. The\nhuman-normalised median score was only 183%, well below the IMPALA baseline with \u03b3 = 0.995\n(211.9%), and much worse than the full meta-gradient algorithm that includes the discount factor\nembedding (267.9%).\nFinally, we compare against the state-of-the-art agent trained on Atari games, namely Rainbow [Hessel\net al., 2018], which combines DQN [Mnih et al., 2015] with double Q-learning [van Hasselt et al.,\n2016, van Hasselt, 2010], prioritised replay [Schaul et al., 2016], dueling networks [Wang et al.,\n2016b], multi-step targets [Sutton, 1988, Sutton and Barto, 2018], distributional RL [Bellemare\net al., 2017], and parameter noise for exploration [Fortunato et al., 2018]. Rainbow obtains a median\nhuman-normalised score of 153% on the human starts protocol and 223% on the no-ops protocol. In\ncontrast, the meta-gradient agent achieved a median score of 292.9% on human starts and 287.6% on\nno-ops, with the same number (200M) of frames.
We note, however, that there are many differences between the two algorithms, including the deeper neural network architecture used in our work.

4 Related Work

Among the earliest studies on meta-learning (or learning to learn [Thrun and Pratt, 1998]), Schmidhuber [1987] applied genetic programming to itself to evolve better genetic programming algorithms. Hochreiter et al. [2001] used recurrent neural networks such as Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] as meta-learners. A recent direction of research has been to meta-learn an optimiser using a recurrent parameterisation [Andrychowicz et al., 2016, Wichrowska et al., 2017]. Duan et al. [2016] and Wang et al. [2016a] proposed to learn a recurrent meta-policy that itself learns to solve the reinforcement learning problem, so that the recurrent policy can generalise to new tasks faster than learning the policy from scratch. Model-Agnostic Meta-Learning (MAML) [Finn et al., 2017a, Finn and Levine, 2018, Finn et al., 2017b, Grant et al., 2018, Al-Shedivat et al., 2018] learns a good initialisation of the model that can adapt quickly to other tasks within a few gradient update steps. These works focus on a multi-task setting in which meta-learning takes place on a distribution of training tasks, to facilitate fast adaptation to an unseen test task. In contrast, our work emphasises the (arguably) more fundamental problem of meta-learning within a single task. In other words, we return to the standard formulation of RL as maximising rewards during a single lifetime of interactions with an environment.
Contemporaneously with our own work, Zheng et al. [2018] also proposed a similar algorithm to learn meta-parameters of the return: in their case an auxiliary reward function that is added to the external rewards.
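For contrast with the single-lifetime setting above, the multi-task inner/outer loop of MAML can be sketched as follows. This first-order variant on a toy quadratic loss family is an illustrative assumption (MAML proper backpropagates through the inner updates):

```python
def fomaml_step(theta, tasks, alpha=0.1, beta=0.5, k=3):
    # First-order MAML-style meta-step on a toy family of tasks, where task c
    # has loss L_c(x) = 0.5 * (x - c)**2, so grad L_c(x) = x - c.
    outer_grads = []
    for c in tasks:
        x = theta
        for _ in range(k):          # inner loop: adapt the initialisation to the task
            x = x - alpha * (x - c)
        outer_grads.append(x - c)   # first-order: task gradient at the adapted point
    return theta - beta * sum(outer_grads) / len(outer_grads)
```

With a symmetric task distribution such as `tasks=[1.0, -1.0]`, the meta-step leaves θ at the point from which inner adaptation is equally fast for all tasks; the meta-gradient approach in this paper instead adapts η continually for one task over one lifetime.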
They do not condition their value function or policy on the meta-parameters, and reuse the same samples for both the update function and the cross-validation step, which may be problematic in stochastic domains when the noise in these updates becomes highly correlated.
There are many works focusing on adapting the learning rate through gradient-based methods [Sutton, 1992, Schraudolph, 1999, Maclaurin et al., 2015, Pedregosa, 2016, Franceschi et al., 2017], Bayesian optimisation methods [Snoek et al., 2012], or evolution-based hyper-parameter tuning [Jaderberg et al., 2017a, Elfwing et al., 2017]. In particular, Sutton [1992] introduced the idea of online cross-validation; however, this method was limited in scope to adapting the learning rate for linear updates in supervised learning (later extended to non-linear updates by Schraudolph [1999]), whereas we focus on the fundamental problem of reinforcement learning, i.e., adapting the return function to maximise the rewards we can achieve from the environment.
There has also been significant prior work on automatically adapting the bootstrapping parameter λ. Singh and Dayan [1998] empirically analyse the effect of λ in terms of bias, variance and MSE. Kearns and Singh [2000] derive upper bounds on the error of temporal-difference algorithms, and use these bounds to derive schedules for λ. Downey and Sanner [2010] introduced a Bayesian model averaging approach to scheduling λ. Konidaris et al. [2011] derive a maximum-likelihood estimator, TD(γ), that weights the n-step returns according to the discount factor, leading to a parameter-free algorithm for temporal-difference learning with linear function approximation. White and White [2016] introduce an algorithm that explicitly estimates the bias and variance, and greedily adapts λ to locally minimise the MSE of the λ-return. Unlike our meta-gradient approach, these prior approaches exploit i.i.d.
assumptions on the trajectory of experience that are not realistic in many applications.

5 Conclusion

In this work, we discussed how to learn the meta-parameters of a return function. Our meta-learning algorithm runs online, while interacting with a single environment, and successfully adapts the return to produce better performance. We demonstrated, by adjusting the meta-parameters of a state-of-the-art deep learning algorithm, that we could achieve much higher performance than previously observed on 57 Atari 2600 games from the Arcade Learning Environment.
Our proposed method is more general, and can be applied not just to the discount factor or bootstrapping parameter, but also to other components of the return, and even more generally to the learning update itself. Hyper-parameter tuning has been a thorn in the side of reinforcement learning research for several decades. Our hope is that this approach will allow agents to automatically tune their own hyper-parameters, by exposing them as meta-parameters of the learning update. This may also result in better performance, because the parameters can change over time and adapt to novel environments.

Acknowledgements

The authors would like to thank Matteo Hessel, Lasse Espeholt, Hubert Soyer, Dan Horgan, Aedan Pope and Tim Harley for their kind engineering support; and Joseph Modayil and Andre Barreto for their suggestions and comments on an early version of the paper. The authors would also like to thank the anonymous reviewers for their constructive suggestions on improving the paper.

References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments.
In ICLR, 2018.\n\nM. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas.\n\nLearning to learn by gradient descent by gradient descent. In NIPS, pages 3981\u20133989, 2016.\n\nM. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An\n\nevaluation platform for general agents. J. Artif. Intell. Res.(JAIR), 47:253\u2013279, 2013.\n\nM. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning.\n\nIn ICML, 2017.\n\nD. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scienti\ufb01c, 1996.\n\nC. Downey and S. Sanner. Temporal difference bayesian model averaging: A bayesian perspective\n\non adapting lambda. In ICML, pages 311\u2013318. Citeseer, 2010.\n\nY. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement\n\nlearning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.\n\nS. Elfwing, E. Uchibe, and K. Doya. Online meta-learning by parallel algorithm competition. CoRR,\n\nabs/1702.07490, 2017.\n\nL. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley,\nI. Dunning, et al. IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner\narchitectures. ICML, 2018.\n\nC. Finn and S. Levine. Meta-learning and universality: Deep representations and gradient descent\n\ncan approximate any learning algorithm. ICLR, 2018.\n\nC. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks.\n\nIn ICML, 2017a.\n\nC. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-\n\nlearning. In CoRL, 2017b.\n\nM. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis,\n\nO. Pietquin, et al. Noisy networks for exploration. In ICLR, 2018.\n\nL. Franceschi, M. Donini, P. Frasconi, and M. Pontil. 
Forward and reverse gradient-based hyperparameter optimization. In ICML, 2017.

E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR, 2018.

A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos. Q(λ) with off-policy corrections. In ALT, pages 305–320. Springer, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In ICANN, pages 87–94. Springer, 2001.

M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017a.

M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017b.

M. J. Kearns and S. P. Singh. Bias-variance error bounds for temporal difference updates. In COLT, pages 142–147, 2000.

D. P. Kingma and J. Ba. ADAM: A method for stochastic optimization. ICLR, 2015.

G. Konidaris, S. Niekum, and P. S. Thomas. TDγ: Re-evaluating complex backups in temporal difference learning. In NIPS, pages 2402–2410, 2011.

D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, pages 2113–2122, 2015.

A. Mahmood. Incremental Off-policy Reinforcement Learning Algorithms.
PhD thesis, University of Alberta, 2017.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. NIPS workshop, 2013.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.

R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In NIPS, pages 1054–1062, 2016.

A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

F. Pedregosa. Hyperparameter optimization with approximate gradient. In ICML, pages 737–746, 2016.

D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In ICML, pages 759–766, 2000.

D. V. Prokhorov and D. C. Wunsch. Adaptive critic designs. TNN, 8(5):997–1007, 1997.

J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In ICML, volume 98, pages 463–471, 1998.

G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG-TR 166, Cambridge University, UK, 1994.

T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In ICML, pages 1312–1320, 2015.

T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In ICLR, 2016.

J. Schmidhuber.
Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. In ICANN. IET, 1999.

S. Singh and P. Dayan. Analytical mean squared error curves for temporal difference learning. Machine Learning, 32(1):5–40, 1998.

S. P. Singh, A. G. Barto, and N. Chentanez. Intrinsically motivated reinforcement learning. In NIPS, pages 1281–1288, 2005.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 2018.

R. S. Sutton, A. R. Mahmood, D. Precup, and H. van Hasselt. A new Q(λ) with interim forward view and Monte Carlo equivalence. In ICML, pages 568–576, 2014.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. JMLR, 17(1):2603–2631, 2016.

S. Thrun and L. Pratt. Learning to Learn. Springer Science & Business Media, 1998.

T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

H. van Hasselt. Double Q-learning. In NIPS, pages 2613–2621, 2010.

H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094–2100, 2016.

H. H. van Seijen, H. P. van Hasselt, S. Whiteson, and M. A. Wiering.
A theoretical and empirical analysis of Expected Sarsa. In ADPRL, pages 177–184, 2009.

J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016a.

Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. ICML, 2016b.

M. White and A. White. A greedy approach to adapting the trace parameter for temporal difference learning. In AAMAS, pages 557–565, 2016.

O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein. Learned optimizers that scale and generalize. In ICML, 2017.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992.

R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

Z. Zheng, J. Oh, and S. Singh. On learning intrinsic rewards for policy gradient methods. In NeurIPS, 2018.