{"title": "Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8151, "page_last": 8162, "abstract": "Gradient-based methods for optimisation of objectives in stochastic settings with unknown or intractable dynamics require estimators of derivatives. We derive an objective that, under automatic differentiation, produces low-variance unbiased estimators of derivatives at any order. Our objective is compatible with arbitrary advantage estimators, which allows the control of the bias and variance of any-order derivatives when using function approximation. Furthermore, we propose a method to trade off bias and variance of higher order derivatives by discounting the impact of more distant causal dependencies. We demonstrate the correctness and utility of our estimator in analytically tractable MDPs and in meta-reinforcement-learning for continuous control.", "full_text": "Loaded DiCE: Trading off Bias and Variance in\n\nAny-Order Score Function Estimators for\n\nReinforcement Learning\n\nGregory Farquhar \u2217\nUniversity of Oxford\n\nShimon Whiteson\nUniversity of Oxford\n\nJakob Foerster\n\nFacebook AI Research\n\nAbstract\n\nGradient-based methods for optimisation of objectives in stochastic settings with\nunknown or intractable dynamics require estimators of derivatives. We derive an\nobjective that, under automatic differentiation, produces low-variance unbiased\nestimators of derivatives at any order. Our objective is compatible with arbitrary\nadvantage estimators, which allows the control of the bias and variance of any-order\nderivatives when using function approximation. Furthermore, we propose a method\nto trade off bias and variance of higher order derivatives by discounting the impact\nof more distant causal dependencies. 
We demonstrate the correctness and utility of\nour objective in analytically tractable MDPs and in meta-reinforcement-learning\nfor continuous control.\n\n1 Introduction\n\nIn stochastic settings, such as reinforcement learning (RL), it is often impossible to compute the\nderivative of our objectives, because they depend on an unknown or intractable distribution (such\nas the transition function of an RL environment). In these cases, gradient-based optimisation is\nonly possible through the use of stochastic gradient estimators. Great successes in these domains\nhave been found by building estimators of first-order derivatives which are amenable to automatic\ndifferentiation, and using them to optimise the parameters of deep neural networks [Fran\u00e7ois-Lavet\net al., 2018].\nNonetheless, for a number of exciting applications, first-order derivatives are insufficient. Meta-learning\nand multi-agent learning often involve differentiating through the learning step of a gradient-based\nlearner [Finn et al., 2017, Stadie et al., 2018, Zintgraf et al., 2019, Foerster et al., 2018a].\nHigher-order optimisation methods can also improve sample efficiency [Furmston et al., 2016].\nHowever, estimating these higher order derivatives correctly, with low variance, and easily in the\ncontext of automatic differentiation, has proven challenging.\nFoerster et al. [2018b] propose tools for constructing estimators for any-order derivatives that are\neasy to use because they avoid the cumbersome manipulations otherwise required to account for the\ndependency of the gradient estimates on the distributions they are sampled from. 
However, their\nformulation relies on pure Monte-Carlo estimates of the objective, introducing unacceptable variance\nin estimates of first- and higher-order derivatives and limiting the uptake of methods relying on these\nderivatives.\nMeanwhile, great strides have been made in the development of estimators for first-order derivatives\nof stochastic objectives. In reinforcement learning, the use of learned value functions as both critics\nand baselines has been extensively studied. The trade-off between bias and variance in gradient\nestimators can be made explicit in mixed objectives that combine Monte-Carlo samples of the\nobjective with learned value functions [Schulman et al., 2015b]. These techniques create families\nof advantage estimators that can be used to reduce variance and accelerate credit assignment in\nfirst-order optimisation, but have not been applied in full generality to higher-order derivatives.\n\n\u2217Correspondence to gregory.farquhar@cs.ox.ac.uk\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this work, we derive an objective that can be differentiated any number of times to produce correct\nestimators of higher-order derivatives in Stochastic Computation Graphs (SCGs) that have a Markov\nproperty, such as those found in RL and sequence modeling. Unlike prior work, this objective is fully\ncompatible with arbitrary choices of advantage estimators. When using approximate value functions,\nthis allows for explicit trade-offs between bias and variance in any-order derivative estimates to\nbe made using known techniques (or using any future advantage estimation methods designed for\nfirst-order derivatives). 
Furthermore, we propose a method for trading off bias and variance of higher\norder derivatives by discounting the impact of more distant causal dependencies.\nEmpirically, we first use small random MDPs that admit analytic solutions to show that our estimator\nis unbiased and low variance when using a perfect value function, and that bias and variance\nmay be flexibly traded off using two hyperparameters. We further study our objective in more\nchallenging meta-reinforcement-learning problems for simulated continuous control, and show\nthe impact of various parameter choices on training. Demonstration code is available at\nhttps://github.com/oxwhirl/loaded-dice. Only a handful of additional lines of code are needed to\nimplement our objective in any existing codebase that uses higher-order derivatives for RL.\n\n2 Background\n\n2.1 Gradient estimators\n\nWe are commonly faced with objectives that have the form of an expectation over random variables.\nIn order to calculate the gradient of the expectation with respect to parameters of interest, we must\noften employ gradient estimators, because the gradient cannot be computed exactly. For example, in\nreinforcement learning the environment dynamics are unknown and form a part of our objective, the\nexpected returns. The polyonymous \u201clikelihood ratio\u201d, \u201cscore function\u201d, or \u201cREINFORCE\u201d estimator\nis given by\n\n$$\\nabla_\\theta \\mathbb{E}_x[f(x, \\theta)] = \\mathbb{E}_x[f(x, \\theta) \\nabla_\\theta \\log p(x; \\theta) + \\nabla_\\theta f(x, \\theta)]. \\tag{1}$$\n\nThe expectation on the RHS may now be estimated from Monte-Carlo samples drawn from p(x; \u03b8).\nOften f is independent of \u03b8 and the second term is dropped. If f depends on \u03b8, but the random\nvariable does not (or may be reparameterised to depend only deterministically on \u03b8) we may instead\ndrop the first term. See Fu [2006] or Mohamed et al. 
[2019] for a more comprehensive review.\n\n2.2 Stochastic Computation Graphs and MDPs\n\nStochastic computation graphs (SCGs) are directed acyclic graphs in which nodes are deterministic\nor stochastic functions, and edges indicate functional dependencies [Schulman et al., 2015a]. The\ngradient estimators described above may be used to estimate the gradients of the objective (the sum\nof cost nodes) with respect to parameters \u03b8. Schulman et al. [2015a] propose a surrogate loss, a\nsingle objective that produces the desired gradient estimates under differentiation.\nWeber et al. [2019] apply more advanced first-order gradient estimators to SCGs. They formalise\nMarkov properties for SCGs that allow the most flexible and powerful of these estimators, originally\ndeveloped in the context of reinforcement learning, to be applied. We describe these estimators in\nthe following subsection, but first define the relevant subset of SCGs. To keep the main body of this\npaper simple and highlight the most important known use case for our method, we adopt the notation\nof reinforcement learning rather than the more cumbersome notation of generic SCGs.\nThe graph in reinforcement learning describes a Markov Decision Process (MDP), and begins with\nan initial state $s_0$ at time $t = 0$. At each timestep, an action $a_t$ is sampled from a stochastic policy\n$\\pi_\\theta$, parameterised by \u03b8, that maps states to actions. This adds a stochastic node $a_t$ to the graph. The\nstate-action pair leads to a reward $r_t$, and a next state $s_{t+1}$, from which the process continues. A\nsimple MDP graph is shown in Figure 1. In the figure, as in many problems, the reward conditions\nonly on the state rather than the state and action. We consider episodic problems that terminate after\n$T$ steps, although all of our results may be extended to the non-terminating case. 
The (discounted)\nrewards are the cost nodes of this graph, leading to the familiar reinforcement learning objective of an\nexpected discounted sum of rewards: $J = \\mathbb{E}[\\sum_{t=0}^{T} \\gamma^t r_t]$, where the expectation is taken with respect\nto the policy as well as the unknown transition dynamics of the underlying MDP.\n\nFigure 1: Some example SCGs that support our new objective. From left to right: (a) Vanilla MDP, (b)\nMDP with stochastic latent goal variable g, (c) POMDP.\n\nA generalisation of our results holds for a slightly more general class of SCGs as well, whose objective\nis still a sum of rewards over time. We may have any number of stochastic and deterministic nodes\n$X_t$ corresponding to each timestep $t$. However, these nodes may only influence the future rewards\nthrough their influence on the next timestep. More formally, this Markov property states that for any\nnode $w$ such that there exists a directed path from $w$ to any $r_{t'}$, $t' \\geq t$, not blocked by $X_t$, none of the\ndescendants of $w$ are in $X_t$ (definition 6 of Weber et al. [2019]). This class of SCGs can capture a\nbroad class of MDP-like models, such as those in Figure 1.\n\n2.3 Gradient estimators with advantages\n\nA value function for a set of nodes in an SCG is the expectation of the objective over the other stochastic\nvariables (excluding that set of nodes). These can reduce variance by serving as control variates\n(\u201cbaselines\u201d), or as critics that also condition on the sampled values taken by the corresponding\nstochastic nodes (i.e. the sampled actions). The difference of the critic and baseline value functions is\nknown as the advantage, which replaces sampled costs in the gradient estimator.\nBaseline value functions only affect the variance of gradient estimators [Weaver and Tao, 2001].\nHowever, using learned, imperfect critic value functions results in biased gradient estimators. 
We\nmay trade off bias and variance by using different mixtures of sampled costs (unbiased, high variance)\nand learned critic value functions (biased, low variance). This choice of advantage estimator and its\nhyperparameters can be used to tune the bias and variance of the resulting gradient estimator to suit\nthe problem at hand.\nThere are many ways to model the advantage function in RL. A popular and simple family of\nadvantage estimators is proposed by Schulman et al. [2015b]:\n\n$$A^{\\mathrm{GAE}(\\gamma,\\tau)}(s_t, a_t) = \\sum_{t'=t}^{\\infty} (\\gamma\\tau)^{t'-t} \\big( r_{t'} + \\gamma \\hat{V}(s_{t'+1}) - \\hat{V}(s_{t'}) \\big). \\tag{2}$$\n\nThe parameter \u03c4 trades off bias and variance: when \u03c4 = 1, $A$ is formed only of sampled rewards and\nis unbiased, but high variance; when \u03c4 = 0, $A^{\\mathrm{GAE}}$ uses only the next sampled reward $r_t$ and relies\nheavily on the estimated value function $\\hat{V}$, reducing variance at the cost of bias.\n\n2.4 Higher order estimators\n\nTo construct higher order gradient estimators, we may recursively apply the above techniques, treating\ngradient estimates as objectives in a new SCG. Foerster et al. [2018b] note several shortcomings of\nthe surrogate loss approach of Schulman et al. [2015a] for higher-order derivatives. The surrogate\nloss cannot itself be differentiated again to produce correct higher-order estimators. Even estimates\nproduced using the surrogate loss cannot be treated as objectives in a new SCG, because the surrogate\nloss severs dependencies of the sampled costs on the sampling distribution.\nTo address this, Foerster et al. [2018b] introduce DiCE, a single objective that may be differentiated\nrepeatedly (using automatic differentiation) to produce unbiased estimators of derivatives of any\norder. 
The DiCE objective for reinforcement learning is given by\n\n$$J = \\sum_{t=0}^{T} \\gamma^t \\Box(a_{\\leq t}) r_t, \\tag{3}$$\n\nwhere $a_{\\leq t}$ indicates the set of stochastic nodes (i.e. actions) occurring at timestep $t$ or earlier.\n$\\Box$ is a special operator that acts on a set of stochastic nodes $W$. $\\Box(\\cdot)$ always evaluates to 1, but has a\nspecial behaviour under differentiation:\n\n$$\\nabla_\\theta \\Box(W) = \\Box(W) \\sum_{w \\in W} \\nabla_\\theta \\log p(w; \\theta). \\tag{4}$$\n\nThis operator in effect automates the likelihood-ratio trick for differentiation of expectations, while\nmaintaining dependencies such that the same trick will be applied when computing higher order\nderivatives. For notational convenience in our later derivation, we extend the definition of $\\Box$ slightly\nby defining its operation on the empty set: $\\Box(\\emptyset) = 1$, so it has a zero derivative.\nThe original version of DiCE has two critical drawbacks compared to the state-of-the-art methods\ndescribed above for estimating first-order derivatives of stochastic objectives. First, it has no mechanism\nfor using baselines to reduce the variance of estimators of higher order derivatives. Mao et al.\n[2019], and Liu et al. [2019] (subsequently but independently) suggest the same partial solution\nfor this problem, but neither provides a proof of unbiasedness of their estimator beyond second order.\nSecond, DiCE (and the estimator of Mao et al. [2019] and Liu et al. [2019]) is formulated in a way\nthat requires the use of Monte-Carlo sampled costs. Without a form that permits the use of critic\nvalue functions, there is no way to make use of the full range of possible advantage estimators.\nIn an exact calculation of higher-order derivative estimators, the dependence of a given reward on\nall previous actions leads to nested sums over previous timesteps. 
These terms tend to have high\nvariance when estimated from data, and become small in the vicinity of local optima, as noted by\nFurmston et al. [2016]. Rothfuss et al. [2018] use this observation to propose a simplified version of\nthe DiCE objective dropping these dependencies:\n\n$$J_{LVC} = \\sum_{t=0}^{T} \\Box(a_t) R_t. \\tag{5}$$\n\nThis estimator is biased for higher than first-order derivatives, and Rothfuss et al. [2018] do not derive\na correct unbiased estimator for all orders, make use of advantage estimation in this objective, or\nextend its applicability beyond meta-learning in the style of MAML [Finn et al., 2017].\nIn the next section, we introduce a new objective which may make use of the critic as well as baseline\nvalue functions, and thereby allows the bias and variance of any-order derivatives to be traded off\nthrough the choice of an advantage estimator. Furthermore, we introduce a discounting of past\ndependencies that allows a smooth trade-off of bias and variance due to the high-variance terms\nidentified by Furmston et al. [2016].\n\n3 Method\n\nThe DiCE objective is cast as a sum over rewards, with the dependencies of the reward node $r_t$ on its\nstochastic causes captured by $\\Box(a_{\\leq t})$. To use critic value functions, on the other hand, we must use\nforward-looking sums over returns.\nThis is possible if the graph maintains the Markov property defined above in Section 2.2 with respect\nto its objective, so as to permit a sequential decomposition of the cost nodes, i.e., rewards $r_t$, and\ntheir stochastic causes influenced by \u03b8, i.e., the actions $a_t$. We begin with the DiCE objective for a\ndiscounted sum of rewards given in (3), where our true objective is the expected discounted sum of\nrewards in trajectories drawn from a policy $\\pi_\\theta$.\nWe define, as is typical in RL, the return $R_t = \\sum_{t'=t}^{T} \\gamma^{t'-t} r_{t'}$. Now we have $r_t = R_t - \\gamma R_{t+1}$, so:\n\n$$J = \\sum_{t=0}^{T} \\gamma^t \\Box(a_{\\leq t}) (R_t - \\gamma R_{t+1}) = \\sum_{t=0}^{T} \\gamma^t \\Box(a_{\\leq t}) R_t - \\sum_{t=0}^{T} \\gamma^{t+1} \\Box(a_{\\leq t}) R_{t+1}.$$\n\nNow we simply take a change of variables $t' = t + 1$ in the second term, relabeling the dummy\nvariable immediately back to $t$:\n\n$$\\begin{aligned} J &= \\sum_{t=0}^{T} \\gamma^t \\Box(a_{\\leq t}) R_t - \\sum_{t=1}^{T+1} \\gamma^t \\Box(a_{\\leq t-1}) R_t \\\\ &= \\sum_{t=0}^{T} \\gamma^t \\Box(a_{\\leq t}) R_t - \\sum_{t=1}^{T+1} \\gamma^t \\Box(a_{< t}) R_t \\\\ &= \\sum_{t=0}^{T} \\gamma^t \\Box(a_{\\leq t}) R_t - \\sum_{t=0}^{T} \\gamma^t \\Box(a_{< t}) R_t + \\gamma^0 \\Box(a_{< 0}) R_0 - \\gamma^{T+1} \\Box(a_{< T+1}) R_{T+1} \\\\ &= R_0 + \\sum_{t=0}^{T} \\gamma^t \\big( \\Box(a_{\\leq t}) - \\Box(a_{< t}) \\big) R_t. \\end{aligned} \\tag{6}$$\n\nIn the last line we have used that $\\Box(a_{< 0}) = \\Box(\\emptyset) = 1$, and that $R_{T+1} = 0$.\nNow we have an objective formulated in terms of forward-looking returns that captures the dependencies\non the sampling distribution through $\\Box(a_{\\leq t}) - \\Box(a_{< t})$. Since this is just a re-expression of\nthe DiCE objective (applied to a restricted class of SCGs with the requisite Markov property), we are\nstill guaranteed that its derivatives will be unbiased estimators of the derivatives of our true objective,\nup to any order. The proof for the original DiCE objective is given by Foerster et al. [2018b]. Because\n$R_0$ carries no derivatives, we will omit it from the following estimators for clarity. Including it,\nhowever, ensures the convenient property that the objective still evaluates in expectation to the true\nreturn (as $\\Box(a_{\\leq t}) - \\Box(a_{< t})$ always evaluates to zero).\nWe can now introduce value functions. $R_t$ is conditionally independent of each of $\\Box(a_{\\leq t})$ and\n$\\Box(a_{< t})$ (as well as all their derivatives), conditioned on $a_{\\leq t}, s_{\\leq t}$. Because of the Markov property of\nour SCG, this is equivalent to conditional independence given $s_t, a_t$. If we consider the expectation of\nour new form of $J$, we can use this conditional independence to push the expectation over $a_{> t}, s_{> t}$\nonto $R_t$. For a complete derivation please see the Supplementary Material. This is simply a critic\nvalue function, defined by $Q(s_t, a_t) = \\mathbb{E}_\\pi[R_t \\mid s_t, a_t]$:\n\n$$\\begin{aligned} \\mathbb{E}_\\pi[J] &= \\mathbb{E}_\\pi\\Big[ \\sum_{t=0}^{T} \\gamma^t \\big( \\Box(a_{\\leq t}) - \\Box(a_{< t}) \\big) \\mathbb{E}[R_t \\mid s_t, a_t] \\Big] \\\\ &= \\mathbb{E}_\\pi\\Big[ \\sum_{t=0}^{T} \\gamma^t \\big( \\Box(a_{\\leq t}) - \\Box(a_{< t}) \\big) Q(s_t, a_t) \\Big]. \\end{aligned} \\tag{7}$$\n\nFurthermore, a baseline that does not depend on $a_{\\geq t}$ or $s_{> t}$ does not change the expectation of\nthe estimator, as shown by the standard derivation reproduced in Schulman et al. [2015a].\nIn reinforcement learning, it is common to use the expected state value $V(s_t) = \\mathbb{E}_{a_t}[Q(s_t, a_t)]$ as an\napproximation of the optimal baseline. The estimator may now use $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ in\nplace of $R_t$, further reducing its variance. We have now derived an estimator in terms of an advantage\n$A(s_t, a_t)$ that recovers unbiased estimates of derivatives of any order:\n\n$$J_{\\diamond} = \\sum_{t} \\gamma^t \\big( \\Box(a_{\\leq t}) - \\Box(a_{< t}) \\big) A(s_t, a_t). \\tag{8}$$\n\nIn practice, it is common to omit $\\gamma^t$, thus optimising the undiscounted returns, but still using\ndiscounted advantages as a variance-reduction tool. See, e.g., the discussion in Thomas [2014].\n\n3.1 Function approximation\n\nIn practice, an estimate of the advantage must be made from limited data. Inexact models of the critic\nvalue function (due to limited data, model class misspecification, or inefficient learning) introduce\nbias in the gradient estimators. As in the work of Schulman et al. 
[2015b], we may use combinations\nof sampled costs and estimated values to form advantage estimators that trade off bias and variance.\nHowever, thanks to our new estimator, which captures the full dependencies of the advantage on the\nsampling distribution, these trade-offs may be immediately applied to higher-order derivatives.\nApproximate baseline value functions only affect the estimator variance. Careful choice of this\nbaseline may nonetheless be of great significance (e.g., by exploiting the factorisation of the policy\n[Foerster et al., 2018c]). Our formulation of the objective extends such methods, as well as any future\nadvances in advantage estimation at first order, to higher order derivatives.\n\n3.2 Variance due to higher-order dependencies\n\nNow that we have the correct form for the unbiased estimator, which uses proper variance-reduction\nstrategies for computing the advantage, we may also trade off the bias and variance in estimates of\nhigher-order derivatives that arise due to the full history of causal dependencies.\nIn particular, we propose to set a discount factor \u03bb \u2208 [0, 1] on prior dependencies that limits the\nhorizon of the past actions that are accounted for in the estimates of higher-order derivatives. Similarly\nto the way the MDP discount factor \u03b3 reduces variance by constraining the horizon into the future\nthat must be considered, \u03bb constrains how far into the past we consider causal dependencies that\ninfluence higher order derivatives.\nFirst note that $\\Box$ acting on any set of stochastic nodes $W$ decomposes as a product:\n$\\Box(W) = \\prod_{w \\in W} \\Box(w)$. We now implement discounting by exponentially decaying past contributions:\n\n$$J_\\lambda = \\sum_{t=0}^{T} \\Big( \\prod_{t'=0}^{t} \\Box(a_{t'})^{\\lambda^{t-t'}} - \\prod_{t'=0}^{t-1} \\Box(a_{t'})^{\\lambda^{t-t'}} \\Big) A_t. \\tag{9}$$\n\nThis is our final objective, which we call \u201cLoaded DiCE\u201d. The products over $\\Box(\\cdot)$ may be computed\nin the log-space of the action probabilities, transforming them into convenient and numerically stable\nsums. Algorithm 1 shows how the objective may easily be computed from an episode.\n\nAlgorithm 1 Compute Loaded DiCE Objective\nRequire: trajectory of states s_t, actions a_t, t = 0 . . . T\n  J \u2190 0    \u25b7 J accumulates the final objective\n  w \u2190 0    \u25b7 w accumulates the \u03bb-weighted stochastic dependencies\n  for t \u2190 0 to T do\n    w \u2190 \u03bbw + log(\u03c0(a_t|s_t))    \u25b7 w has the dependencies including a_t\n    v \u2190 w \u2212 log(\u03c0(a_t|s_t))    \u25b7 v has the dependencies excluding a_t\n    deps \u2190 f(w) \u2212 f(v)    \u25b7 f applies the $\\Box$ operator on the log-probabilities\n    J \u2190 J + deps \u00b7 A(s_t, a_t)    \u25b7 The dependencies are weighted by the advantage A(s_t, a_t)\n  end for\n  return J\n\nfunction f(x)\n  return exp(x \u2212 stop_gradient(x))\nend function\n\nWhen \u03bb = 0, this estimator resembles $J_{LVC}$, although it makes use of advantages. It may be low\nvariance, but is biased regardless of the choice of advantage estimator. When \u03bb = 1, we recover the\nestimator in (8), which is unbiased if the advantage estimator is itself unbiased. Intermediate values\nof \u03bb should be able to trade off bias and variance, as we demonstrate empirically in Section 4. Our\nnew form of objective allows us to use \u03bb to reduce the impact of the high variance terms identified by\nFurmston et al. [2016] and Rothfuss et al. 
[2018] in a smooth way, rather than completely dropping\nthose terms.\n\nFigure 2: Convergence with increasing batch size of unbiased any-order estimators (DiCE, DiCE\nwith the baseline of Mao et al. [2019], and Loaded DiCE). Also, LVC [Rothfuss et al., 2018], a\nlow-variance but biased estimator.\n\n4 Experiments\n\nIn this section we empirically demonstrate the correctness of our estimator in the absence of function\napproximation, and show how the bias and variance of the estimator may be traded off (a) by the\nchoice of advantage estimator when only an approximate value function is available, and (b) by the\nuse of our novel discount factor \u03bb.\n\n4.1 Bias and variance in any-order derivatives\n\nTo make the initial analysis simple and interpretable, we use small random MDPs with five states,\nfour actions per state, and rewards that depend only on state. For these MDPs the discounted value\nmay be calculated analytically, as follows.\n$P^\\pi$ is the state transition matrix induced by the MDP\u2019s transition function $P(s, a, s')$ and the tabular\npolicy \u03c0, with elements given by\n\n$$P^\\pi_{ss'} = \\sum_{a} P(s, a, s') \\pi(a \\mid s). \\tag{10}$$\n\nLet $P_0$ be the initial state distribution as a vector. Then, the probability distribution over states at time\n$t$ is a vector $p_{s_t} = (P^\\pi)^t P_0$. The mean reward at time $t$ is $r_t = R^\\top p_{s_t}$, where $R$ is the vector of\nper-state rewards. Finally,\n\n$$V^\\pi = \\sum_{t=0}^{\\infty} \\gamma^t r_t = R^\\top \\sum_{t=0}^{\\infty} (\\gamma P^\\pi)^t P_0 = R^\\top (I - \\gamma P^\\pi)^{-1} P_0. \\tag{11}$$\n\nThis $V^\\pi$ is differentiable wrt \u03c0 and may be easily computed with automatic differentiation packages.\nMore details and code can be found in the Supplementary Material.\n\nA low-variance, unbiased, any-order estimator. Figure 2 shows how the correlation between\nestimated and true derivatives changes as a function of batch size, for up to third order. 
We compare\nthe original DiCE estimator to Loaded DiCE, and the objective proposed by Mao et al. [2019] which\nincorporates only a baseline. For Loaded DiCE, we use $A^{\\mathrm{GAE}}$ with \u03c4 = 0, the exact value function,\nand \u03bb = 1 (so as to remain unbiased). As these are all unbiased estimators, they will converge to\nthe true derivatives with a sufficiently large batch size. However, when using an advantage estimator\nwith the exact value function, the variance may be dramatically reduced and the estimates converge\nmuch more rapidly. We also show the performance of LVC [Rothfuss et al., 2018]. At first order it\nmatches exactly the estimator of Mao et al. [2019], but underperforms Loaded DiCE because it does\nnot use the advantage. At higher orders, it is low variance but biased, as expected.\n\nFigure 3: Trading off bias and variance with \u03c4 and \u03bb in a small MDP.\n(a) Low \u03c4 produces low variance estimates at the cost\nof high bias. The effect holds at all orders of derivatives.\n(b) High \u03bb considers the full past to produce low-bias\nhigh-variance estimators, low \u03bb discounts the past.\nFirst order gradients are unaffected.\n\nTrading off bias and variance with advantage estimation. Figure 3a shows the bias and standard\ndeviation of estimated derivatives using a range of \u03c4, and an inexact value function (we perturb the\ntrue value function with Gaussian noise for each state to emulate function approximation). The\nchoice of advantage estimator trades off bias and variance not only at the first order, but in\nany-order derivatives.\n\nTrading off bias and variance by discounting causes. Figure 3b shows the bias and standard\ndeviation of estimated derivatives using a range of \u03bb. To isolate the effect of \u03bb we use the exact\nvalue function and \u03c4 = 0, so the absolute bias and variance are lower than in Figure 3a. First-order\nderivatives are unaffected by \u03bb, as expected. 
However, in higher-order derivatives, \u03bb strongly affects\nthe bias and variance of the resulting estimator. There is an outlier at \u03bb = 0.75 for third order\nderivatives. There is no guarantee of monotonicity in the bias or variance, but we found such outliers\nto be rarer at second than at third order, and to appear as artefacts of particular MDPs.\n\n4.2 Meta reinforcement learning with MAML and Loaded DiCE\n\nWe now apply our new family of estimators to a pair of more challenging meta-reinforcement-learning\nproblems in continuous control, following the work of Finn et al. [2017]. The aim of their Model-Agnostic\nMeta-Learning (MAML) is to learn a good initialisation of a neural network policy, such\nthat with a single (or small number of) policy gradient updates on a batch of training episodes, the\npolicy achieves good performance on a task sampled from some distribution. Then, in meta-testing,\nthe policy should be able to adapt to a new task from the same distribution. MAML is theoretically\nsound, but the original implementation neglected the higher order dependencies induced by the RL\nsetting [Rothfuss et al., 2018, Stadie et al., 2018].\nThe approach is to sample a number of tasks and adapt the policy in an inner-loop policy-gradient\noptimisation step. Then, in the outer loop, the initial parameters are updated so as to maximise the\nreturns of the post-adaptation policies. The outer loop optimisation depends on the post-adaptation\nparameters, which depend on the gradients estimated in the inner loop. As a result, there are important\nhigher-order terms in the outer loop optimisation. Using the correct estimator for the inner loop\noptimisation can therefore impact the efficiency of the overall meta-training procedure as well as the\nquality of the final solution.\nFor the inner-loop optimisation, we use our novel objective, with a range of values for \u03c4 and \u03bb. 
We\nsweep a range of \u03c4 with fixed \u03bb = 0, and then sweep a range of \u03bb using the best value found for\n\u03c4. For the outer-loop optimisation, we use a vanilla policy gradient with a baseline. The outer loop\ncould use any other gradient-based policy optimisation algorithm, but we choose a simple version to\nisolate, to some extent, the impact of the inner loop estimator.\nFigure 4 shows our results. In the CheetahDir task, if \u03c4 is too high the estimator has too high a variance\nand performance is poor. \u03c4 is less impactful in the CheetahVel task. We note that in these tasks,\nepisodes are short, \u03b3 is low, and the value functions are simple linear functions fit to each batch of\ndata as in Finn et al. [2017]. These factors would all favor a high \u03c4. With higher variance\nreturns or better value functions, relying more heavily on the learned value function (by using a lower\n\u03c4) may be effective.\n\nFigure 4: Trading off bias and variance with \u03c4 and \u03bb in meta-reinforcement-learning. We report\nthe mean and standard error (over [#runs]) of the post-adaptation returns, smoothed with a moving\naverage over 10 outer-loop optimisations.\n\nIn both environments, \u03bb = 1.0 leads to excessively high variance. The unbiased (\u03bb = 1.0) version of our\nobjective may also be more valuable when value functions are better and can be used more effectively\nto mitigate variance. In CheetahVel, noticeably faster learning is achieved with a low but non-zero\n\u03bb. The analysis of Furmston et al. [2016] indicates that the magnitude of the higher-order terms\ndiscounted by \u03bb will in many cases become small as the policy approaches a local optimum. This is\nconsistent with our empirical finding here that non-zero \u03bb may learn faster but plateaus at a similar\nlevel. Figure 5 in the appendix shows results on the AntVel task, in which \u03bb is a more important factor\nthan \u03c4. 
We conclude that Loaded DiCE provides meaningful control of the higher-order estimator\nwith significant impact on a realistic use-case.\n\n5 Conclusion\n\nIn this work, we derived a theoretically sound objective which can apply general advantage functions\nto the estimation of any-order derivatives in reinforcement-learning type sequential problems. In the\ncontext of function approximation, this objective unlocks the ability to trade off bias and variance\nin higher order derivatives. Importantly, like the underlying DiCE objective, our single objective\ngenerates estimators for any-order derivatives under repeated automatic differentiation. Further, we\npropose a simple method for discounting the impact of more distant causal dependencies on the\nestimation of higher order derivatives, which allows another axis for the trade-off of bias and variance.\nEmpirically, we use small random MDPs to demonstrate the behaviour of the bias and variance of\nhigher-order derivative estimates, and further show its utility in meta-reinforcement-learning.\nWe are excited for other applications in meta-learning, multi-agent learning and higher-order optimisation\nwhich may be made possible using our new objective. In future work, we also wish to revisit\nour choice of \u03bb-discounting, which is a heuristic method to limit the impact of high-variance terms.\nFurther theoretical analysis may also help to identify contexts in which higher-order dependencies\nare important for optimisation. Finally, it may even be possible to meta-learn the hyperparameters \u03c4\nand \u03bb themselves.\n\nAcknowledgments\n\nWe thank Maruan Al-Shedivat and Minqi Jiang for valuable discussions. This work was supported by\nthe UK EPSRC CDT in Autonomous Intelligent Machines and Systems. 
This project has received\nfunding from the European Research Council (ERC) under the European Union\u2019s Horizon 2020\nresearch and innovation programme (grant agreement number 637713).\n\nReferences\nChelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep\nnetworks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages\n1126\u20131135. JMLR.org, 2017.\n\nJakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch.\nLearning with opponent-learning awareness. In Proceedings of the 17th International Conference on\nAutonomous Agents and MultiAgent Systems, pages 122\u2013130. International Foundation for Autonomous\nAgents and Multiagent Systems, 2018a.\n\nJakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rockt\u00e4schel, Eric Xing, and Shimon Whiteson.\nDiCE: The infinitely differentiable Monte Carlo estimator. In International Conference on Machine Learning,\npages 1524\u20131533, 2018b.\n\nJakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual\nmulti-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018c.\n\nVincent Fran\u00e7ois-Lavet, Peter Henderson, Riashat Islam, Marc G Bellemare, Joelle Pineau, et al. An introduction\nto deep reinforcement learning. Foundations and Trends\u00ae in Machine Learning, 11(3-4):219\u2013354, 2018.\n\nMichael C Fu. Gradient estimation. Handbooks in operations research and management science, 13:575\u2013616,\n2006.\n\nThomas Furmston, Guy Lever, and David Barber. Approximate Newton methods for policy search in Markov\ndecision processes. The Journal of Machine Learning Research, 17(1):8055\u20138105, 2016.\n\nHao Liu, Richard Socher, and Caiming Xiong. 
Taming MAML: Efficient unbiased meta-reinforcement learning.\nIn International Conference on Machine Learning, pages 4061\u20134071, 2019.\n\nJingkai Mao, Jakob Foerster, Tim Rockt\u00e4schel, Maruan Al-Shedivat, Gregory Farquhar, and Shimon Whiteson.\nA baseline for any order gradient estimation in stochastic computation graphs. In International Conference\non Machine Learning, pages 4343\u20134351, 2019.\n\nShakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in\nmachine learning. arXiv preprint arXiv:1906.10652, 2019.\n\nJonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy\nsearch. arXiv preprint arXiv:1810.06784, 2018.\n\nJohn Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic\ncomputation graphs. In Advances in Neural Information Processing Systems, pages 3528\u20133536, 2015a.\n\nJohn Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous\ncontrol using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.\n\nBradly C Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya Sutskever.\nSome considerations on learning to explore via meta-reinforcement learning. arXiv preprint arXiv:1803.01118,\n2018.\n\nPhilip Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pages\n441\u2013448, 2014.\n\nLex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning. In\nProceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 538\u2013545. Morgan\nKaufmann Publishers Inc., 2001.\n\nTh\u00e9ophane Weber, Nicolas Heess, Lars Buesing, and David Silver. Credit assignment techniques in stochastic\ncomputation graphs. 
arXiv preprint arXiv:1901.01761, 2019.\n\nLuisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. CAML: Fast context\nadaptation via meta-learning. International Conference on Machine Learning, 2019.\n\n6 Derivation of value function formulation\n\nWe start out with the $J_\\diamond$ objective:\n\n$$J_\\diamond = \\sum_{t=0}^{T} \\gamma^t \\left( \\Box(a_{\\le t}) - \\Box(a_{<t}) \\right) R_t. \\qquad (12)$$\n\nWe evaluate this objective by taking an expectation over the trajectories \u03c4 as induced by the policy \u03c0. Here \u03c4 is a\ncomplete sequence of states, actions and rewards, $\\tau = \\{s_0, a_0, r_1, \\dots, s_T, a_T\\}$. For convenience in the following\nderivation we have defined the reward, $r_{t+1}$, to be indexed by the next time step, after action $a_t$ was taken. This\nensures that partial trajectories (e.g. $\\tau_{>t}$) correctly keep rewards after the actions that cause them. Note that\n$R_t = \\sum_{k=0}^{T-t} \\gamma^k r_{t+k+1}$ depends only on $\\tau_{>t}$. The expectation of our objective is given by:\n\n$$\\mathbb{E}_\\pi[J_\\diamond] = \\sum_\\tau P(\\tau) J_\\diamond(\\tau) \\qquad (13)$$\n$$= \\sum_\\tau P(\\tau) \\left( \\sum_{t=0}^{T} \\gamma^t \\left( \\Box(a_{\\le t}) - \\Box(a_{<t}) \\right) R_t \\right) \\qquad (14)$$\n$$= \\sum_{t=0}^{T} \\gamma^t \\left( \\sum_\\tau P(\\tau) \\left( \\Box(a_{\\le t}) - \\Box(a_{<t}) \\right) R_t \\right) \\qquad (15)$$\n$$= \\sum_{t=0}^{T} \\gamma^t J_t \\qquad (16)$$\n\nWe note that for each time step the term $J_t$ is of the form:\n\n$$J_t = \\sum_\\tau P(\\tau) f(\\tau_{\\le t}) g(\\tau_{>t}), \\qquad (17)$$\n\nwhere $f(\\tau_{\\le t}) = \\Box(a_{\\le t}) - \\Box(a_{<t})$ and $g(\\tau_{>t}) = R_t$.\n\nNext we use:\n\n$$P(\\tau) = P(\\tau_{\\le t}) P(\\tau_{>t} \\mid \\tau_{\\le t}) \\qquad (18)$$\n$$= P(\\tau_{\\le t}) P(\\tau_{>t} \\mid s_t, a_t), \\qquad (19)$$\n\nwhere in the last step we have used the Markov property. 
Substituting we obtain:\n\n$$J_t = \\sum_\\tau P(\\tau_{\\le t}) P(\\tau_{>t} \\mid s_t, a_t) f(\\tau_{\\le t}) g(\\tau_{>t})$$\n$$= \\sum_{\\tau_{\\le t}} P(\\tau_{\\le t}) f(\\tau_{\\le t}) \\sum_{\\tau_{>t}} P(\\tau_{>t} \\mid s_t, a_t) g(\\tau_{>t})$$\n\nIf we substitute back for g and f we obtain:\n\n$$J_t = \\sum_{\\tau_{\\le t}} P(\\tau_{\\le t}) \\left( \\Box(a_{\\le t}) - \\Box(a_{<t}) \\right) \\sum_{\\tau_{>t}} P(\\tau_{>t} \\mid s_t, a_t) R_t$$\n$$= \\sum_{\\tau_{\\le t}} P(\\tau_{\\le t}) \\left( \\Box(a_{\\le t}) - \\Box(a_{<t}) \\right) \\mathbb{E}[R_t \\mid s_t, a_t]$$\n$$= \\sum_{\\tau_{\\le t}} P(\\tau_{\\le t}) \\left( \\Box(a_{\\le t}) - \\Box(a_{<t}) \\right) Q(s_t, a_t)$$\n\nPutting all together we obtain the final form:\n\n$$\\mathbb{E}_\\pi[J_\\diamond] = \\mathbb{E}_\\pi \\left[ \\sum_{t=0}^{T} \\gamma^t \\left( \\Box(a_{\\le t}) - \\Box(a_{<t}) \\right) Q(s_t, a_t) \\right]$$\n\n7 Experimental details\n\n7.1 Random MDPs\n\nWe use the mdptoolbox.example.rand() function from PyMDPToolbox to generate random MDP transition\nfunctions with five states and four actions per state.\nThe reward is a function only of state, and is sampled from $\\mathcal{N}(5, 10)$. We use \u03b3 = 0.95. When sampling for\nthe stochastic estimators, we use batches of 512 rollouts of length 50 steps unless the batch size is otherwise\nspecified.\nWe only compute higher order derivatives of the derivative of the first parameter at each order, to save\ncomputation.\nFor the sweeps over \u03bb and \u03c4 we use 200 batches for each value of \u03bb or \u03c4. 
To simulate function approximation\nerror in our analysis of the impact of \u03c4, we add Gaussian noise with standard deviation 10 to the true value\nfunction.\n\n7.2 MAML experiments\n\nWe use the following hyperparameters for our MAML experiments:\n\nParameter                 Value\n\u03b3                         0.97\nhidden layer size         100\nnumber of layers          2\ntask batch size           20 trajectories\nmeta batch size           40 tasks\ninner loop learning rate  0.1\nouter loop optimiser      Adam\nouter loop learning rate  0.0005\nouter loop \u03c4              1.0\nreward noise              Uniform(-0.01, 0.01) at each timestep\n\nWe also normalise all advantages in each batch (per task).\nFigure 5 shows some additional experiments on the AntVel MuJoCo task. In this domain, \u03bb is a more important\nfactor than \u03c4.\n\nFigure 5: Trading off bias and variance with \u03c4 and \u03bb in the ant-velocity task. We report the mean and\nstandard error (over [#runs]) of the post-adaptation returns, smoothed with a moving average over 10\nouter-loop optimisations.", "award": [], "sourceid": 4447, "authors": [{"given_name": "Gregory", "family_name": "Farquhar", "institution": "University of Oxford"}, {"given_name": "Shimon", "family_name": "Whiteson", "institution": "University of Oxford"}, {"given_name": "Jakob", "family_name": "Foerster", "institution": "Facebook AI Research"}]}