{"title": "Learning Continuous Control Policies by Stochastic Value Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 2944, "page_last": 2952, "abstract": "We present a unified framework for learning continuous control policies usingbackpropagation. It supports stochastic control by treating stochasticity in theBellman equation as a deterministic function of exogenous noise. The productis a spectrum of general policy gradient algorithms that range from model-freemethods with value functions to model-based methods without value functions.We use learned models but only require observations from the environment insteadof observations from model-predicted trajectories, minimizing the impactof compounded model errors. We apply these algorithms first to a toy stochasticcontrol problem and then to several physics-based control problems in simulation.One of these variants, SVG(1), shows the effectiveness of learning models, valuefunctions, and policies simultaneously in continuous domains.", "full_text": "Learning Continuous Control Policies by\n\nStochastic Value Gradients\n\nNicolas Heess\u21e4, Greg Wayne\u21e4, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez\n\nGoogle DeepMind\n\n{heess, gregwayne, davidsilver, countzero, tassa, etom}@google.com\n\n\u21e4These authors contributed equally.\n\nAbstract\n\nWe present a uni\ufb01ed framework for learning continuous control policies using\nbackpropagation.\nIt supports stochastic control by treating stochasticity in the\nBellman equation as a deterministic function of exogenous noise. The product\nis a spectrum of general policy gradient algorithms that range from model-free\nmethods with value functions to model-based methods without value functions.\nWe use learned models but only require observations from the environment in-\nstead of observations from model-predicted trajectories, minimizing the impact\nof compounded model errors. 
We apply these algorithms first to a toy stochastic control problem and then to several physics-based control problems in simulation. One of these variants, SVG(1), shows the effectiveness of learning models, value functions, and policies simultaneously in continuous domains.\n1 Introduction\nPolicy gradient algorithms maximize the expectation of cumulative reward by following the gradient of this expectation with respect to the policy parameters. Most existing algorithms estimate this gradient in a model-free manner by sampling returns from the real environment and rely on a likelihood ratio estimator [32, 26]. Such estimates tend to have high variance and require large numbers of samples or, conversely, low-dimensional policy parameterizations.\nA second approach to estimating a policy gradient relies on backpropagation instead of likelihood ratio methods. If a differentiable environment model is available, one can link together the policy, model, and reward function to compute an analytic policy gradient by backpropagation of reward along a trajectory [18, 11, 6, 9]. Instead of using entire trajectories, one can estimate future rewards using a learned value function (a critic) and compute policy gradients from subsequences of trajectories. It is also possible to backpropagate analytic action derivatives from a Q-function to compute the policy gradient without a model [31, 21, 23]. Following Fairbank [8], we refer to methods that compute the policy gradient through backpropagation as value gradient methods.\nIn this paper, we address two limitations of prior value gradient algorithms. The first is that, in contrast to likelihood ratio methods, value gradient algorithms are only suitable for training deterministic policies. 
Stochastic policies have several advantages: for example, they can be beneficial for partially observed problems [24]; they permit on-policy exploration; and because stochastic policies can assign probability mass to off-policy trajectories, we can train a stochastic policy on samples from an experience database in a principled manner. When an environment model is used, value gradient algorithms have also been critically limited to operation in deterministic environments. By exploiting a mathematical tool known as “re-parameterization” that has found recent use for generative models [20, 12], we extend the scope of value gradient algorithms to include the optimization of stochastic policies in stochastic environments. We thus describe our framework as Stochastic Value Gradient (SVG) methods. Secondly, we show that an environment dynamics model, value function, and policy can be learned jointly with neural networks based only on environment interaction. Learned dynamics models are often inaccurate, which we mitigate by computing value gradients along real system trajectories instead of planned ones, a feature shared by model-free methods [32, 26]. This substantially reduces the impact of model error because we only use models to compute policy gradients, not for prediction, combining advantages of model-based and model-free methods with fewer of their drawbacks.\nWe present several algorithms that range from model-based to model-free methods, flexibly combining models of environment dynamics with value functions to optimize policies in stochastic or deterministic environments. Experimentally, we demonstrate that SVG methods can be applied using generic neural networks with tens of thousands of parameters while making minimal assumptions about plants or environments. 
By examining a simple stochastic control problem, we show that SVG algorithms can optimize policies where model-based planning and likelihood ratio methods cannot. We provide evidence that value function approximation can compensate for degraded models, demonstrating the increased robustness of SVG methods over model-based planning. Finally, we use SVG algorithms to solve a variety of challenging, under-actuated, physical control problems, including swimming of snakes, reaching, tracking, and grabbing with a robot arm, fall-recovery for a monoped, and locomotion for a planar cheetah and biped.\n2 Background\nWe consider discrete-time Markov Decision Processes (MDPs) with continuous states and actions and denote the state and action at time step t by s^t ∈ ℝ^{N_S} and a^t ∈ ℝ^{N_A}, respectively. The MDP has an initial state distribution s^0 ∼ p^0(·), a transition distribution s^{t+1} ∼ p(·|s^t, a^t), and a (potentially time-varying) reward function r^t = r(s^t, a^t, t).¹ We consider time-invariant stochastic policies a ∼ p(·|s; θ), parameterized by θ. The goal of policy optimization is to find policy parameters θ that maximize the expected sum of future rewards. We optimize either finite-horizon or infinite-horizon sums, i.e., J(θ) = E[ Σ_{t=0..T} γ^t r^t | θ ] or J(θ) = E[ Σ_{t=0..∞} γ^t r^t | θ ], where γ ∈ [0, 1] is a discount factor.² When possible, we represent a variable at the next time step using the “tick” notation, e.g., s′ ≜ s^{t+1}.\nIn what follows, we make extensive use of the state-action-value Q-function and state-value V-function:\nQ^t(s, a) = E[ Σ_{τ≥t} γ^{τ−t} r^τ | s^t = s, a^t = a, θ ];  V^t(s) = E[ Σ_{τ≥t} γ^{τ−t} r^τ | s^t = s, θ ].  (1)\nFor finite-horizon problems, the value functions are time-dependent, e.g., V′ ≜ V^{t+1}(s′), and for infinite-horizon problems the value functions are stationary, V′ ≜ V(s′). 
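As a quick concrete check of the objective above, a Monte-Carlo estimate of J(θ) simply averages discounted returns over sampled episodes. The sketch below is a minimal illustration; the reward sequences are hypothetical stand-ins, not taken from the paper:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r^t for a single episode."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# J(theta) is estimated by averaging returns over episodes sampled from the policy.
episodes = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
gamma = 0.9
J_hat = np.mean([discounted_return(r, gamma) for r in episodes])  # 1.9875
```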
The relevant meaning should be clear from the context. The state-value function can be expressed recursively using the stochastic Bellman equation\nV^t(s) = ∫ [ r^t + γ ∫ V^{t+1}(s′) p(s′|s, a) ds′ ] p(a|s; θ) da.  (2)\nWe abbreviate partial differentiation using subscripts, g_x ≜ ∂g(x, y)/∂x.\n3 Deterministic value gradients\nThe deterministic Bellman equation takes the form V(s) = r(s, a) + γ V′(f(s, a)) for a deterministic model s′ = f(s, a) and deterministic policy a = π(s; θ). Differentiating the equation with respect to the state and policy yields an expression for the value gradient\nV_s = r_s + r_a π_s + γ V′_{s′} (f_s + f_a π_s),  (3)\nV_θ = r_a π_θ + γ (V′_{s′} f_a π_θ + V′_θ).  (4)\nIn eq. 4, the term V′_θ arises because the total derivative includes policy gradient contributions from subsequent time steps (full derivation in Appendix A). For a purely model-based formalism, these equations are used as a pair of coupled recursions that, starting from the termination of a trajectory, proceed backward in time to compute the gradient of the value function with respect to the state and policy parameters. V⁰_θ returns the total policy gradient. When a state-value function is used after one step in the recursion, r_a π_θ + γ V′_{s′} f_a π_θ directly expresses the contribution of the current time step to the policy gradient. Summing these gradients over the trajectory gives the total policy gradient. When a Q-function is used, the per-time-step contribution to the policy gradient takes the form Q_a π_θ.\n¹We make use of a time-varying reward function only in one problem to encode a terminal reward.\n²γ < 1 for the infinite-horizon case.\n4 Stochastic value gradients\nOne limitation of the gradient computation in eqs. 3 and 4 is that the model and policy must be deterministic. Additionally, the accuracy of the policy gradient V_θ is highly sensitive to modeling errors. 
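Read as plain code, the coupled recursions of eqs. 3 and 4 can be sketched for scalar states, scalar actions, and a single policy parameter. This is a minimal illustration with hypothetical derivative sequences, not the paper's implementation:

```python
def deterministic_value_gradient(r_s, r_a, f_s, f_a, pi_s, pi_th, gamma):
    """Backward recursion of eqs. 3-4: each argument is a list holding the
    corresponding partial derivative evaluated at steps t = 0..T-1."""
    v_s, v_th = 0.0, 0.0  # value gradients beyond the horizon are zero
    for t in reversed(range(len(r_s))):
        # eq. 4: V_theta = r_a pi_theta + gamma * (V'_s' f_a pi_theta + V'_theta)
        v_th = r_a[t] * pi_th[t] + gamma * (v_s * f_a[t] * pi_th[t] + v_th)
        # eq. 3: V_s = r_s + r_a pi_s + gamma * V'_s' * (f_s + f_a pi_s)
        v_s = r_s[t] + r_a[t] * pi_s[t] + gamma * v_s * (f_s[t] + f_a[t] * pi_s[t])
    return v_th

# Toy check: with r = s + a, s' = s + a, and a = theta (so pi_th = 1, pi_s = 0),
# the two-step return is 2*s0 + 3*theta, whose theta-gradient is 3.
g = deterministic_value_gradient([1.0, 1.0], [1.0, 1.0], [1.0, 1.0],
                                 [1.0, 1.0], [0.0, 0.0], [1.0, 1.0], 1.0)
```

Note that v_th is updated before v_s at each step, so eq. 4 sees the next step's state gradient V′_{s′} as required.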
We introduce two critical changes: First, in section 4.1, we transform the stochastic Bellman equation (eq. 2) to permit backpropagating value information in a stochastic setting. This also enables us to compute gradients along real trajectories, not ones sampled from a model, making the approach robust to model error, leading to our first algorithm, “SVG(∞),” described in section 4.2. Second, in section 4.3, we show how value function critics can be integrated into this framework, leading to the algorithms “SVG(1)” and “SVG(0)”, which expand the Bellman recursion for 1 and 0 steps, respectively. Value functions further increase robustness to model error and extend our framework to infinite-horizon control.\n4.1 Differentiating the stochastic Bellman equation\nRe-parameterization of distributions Our goal is to backpropagate through the stochastic Bellman equation. To do so, we make use of a concept called “re-parameterization”, which permits us to compute derivatives of deterministic and stochastic models in the same way. A very simple example of re-parameterization is to write a conditional Gaussian density p(y|x) = N(y|µ(x), σ²(x)) as the function y = µ(x) + σ(x)ξ, where ξ ∼ N(0, 1). From this point of view, one produces samples procedurally by first sampling ξ, then deterministically constructing y. Here, we consider conditional densities whose samples are generated by a deterministic function of an input noise variable and other conditioning variables: y = f(x, ξ), where ξ ∼ ρ(·), a fixed noise distribution. Rich density models can be expressed in this form [20, 12]. 
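The Gaussian example above admits a quick numerical sanity check. Here g(y) = y² is an arbitrary test function chosen for illustration, not one used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5

# Re-parameterize y ~ N(mu, sigma^2) as y = mu + sigma * xi with xi ~ N(0, 1).
xi = rng.standard_normal(200_000)
y = mu + sigma * xi

# For g(y) = y^2:  d/dmu E[g(y)] = E[g'(y) * dy/dmu] = E[2 * y].
grad_mc = float(np.mean(2.0 * y))

# Analytic check: E[y^2] = mu^2 + sigma^2, so the true gradient is 2 * mu = 3.
```

The same estimator would fail if y were drawn with an opaque sampler, since the path from mu to y would be severed; re-parameterization keeps that path differentiable.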
Expectations of a function g(y) become E_{p(y|x)} g(y) = ∫ g(f(x, ξ)) ρ(ξ) dξ.\nThe advantage of working with re-parameterized distributions is that we can now obtain a simple Monte-Carlo estimator of the derivative of an expectation with respect to x:\n∇_x E_{p(y|x)} g(y) = E_{ρ(ξ)}[ g_y f_x ] ≈ (1/M) Σ_{i=1..M} g_y f_x |_{ξ=ξ_i}.  (5)\nIn contrast to likelihood ratio-based Monte Carlo estimators, ∇_x log p(y|x) g(y), this formula makes direct use of the Jacobian of g.\nRe-parameterization of the Bellman equation We now re-parameterize the Bellman equation. When re-parameterized, the stochastic policy takes the form a = π(s, η; θ), and the stochastic environment the form s′ = f(s, a, ξ), for noise variables η ∼ ρ(η) and ξ ∼ ρ(ξ), respectively. Inserting these functions into eq. (2) yields\nV(s) = E_{ρ(η)}[ r(s, π(s, η; θ)) + γ E_{ρ(ξ)}[ V′(f(s, π(s, η; θ), ξ)) ] ].  (6)\nDifferentiating eq. 6 with respect to the current state s and policy parameters θ gives\nV_s = E_{ρ(η)}[ r_s + r_a π_s + γ E_{ρ(ξ)} V′_{s′}(f_s + f_a π_s) ],  (7)\nV_θ = E_{ρ(η)}[ r_a π_θ + γ E_{ρ(ξ)}[ V′_{s′} f_a π_θ + V′_θ ] ].  (8)\nWe are interested in controlling systems with a priori unknown dynamics. Consequently, in the following, we replace instances of f or its derivatives with a learned model f̂.\nGradient evaluation by planning A planning method to compute a gradient estimate is to compute a trajectory by running the policy in a loop with a model while sampling the associated noise variables, yielding a trajectory τ = (s¹, η¹, a¹, ξ¹, s², η², a², ξ², ...). 
On this sampled trajectory, a Monte-Carlo estimate of the policy gradient can be computed by the backward recursions\nv_s = [ r_s + r_a π_s + γ v′_{s′}(f̂_s + f̂_a π_s) ]|_{η,ξ},  (9)\nv_θ = [ r_a π_θ + γ(v′_{s′} f̂_a π_θ + v′_θ) ]|_{η,ξ},  (10)\nwhere we have written lower-case v to emphasize that the quantities are one-sample estimates³, and “|_x” means “evaluated at x”.\nGradient evaluation on real trajectories An important advantage of stochastic over deterministic models is that they can assign probability mass to observations produced by the real environment. In a deterministic formulation, there is no principled way to account for mismatch between model predictions and observed trajectories. In this case, the policy and environment noise (η, ξ) that produced the observed trajectory are considered unknown. By an application of Bayes' rule, which we explain in Appendix B, we can rewrite the expectations in equations 7 and 8 given the observations (s, a, s′) as\nV_s = E_{p(a|s)} E_{p(s′|s,a)} E_{p(η,ξ|s,a,s′)}[ r_s + r_a π_s + γ V′_{s′}(f̂_s + f̂_a π_s) ],  (11)\nV_θ = E_{p(a|s)} E_{p(s′|s,a)} E_{p(η,ξ|s,a,s′)}[ r_a π_θ + γ(V′_{s′} f̂_a π_θ + V′_θ) ],  (12)\nwhere we can now replace the two outer expectations with samples derived from interaction with the real environment. In the special case of additive noise, s′ = f̂(s, a) + ξ, it is possible to use a deterministic model to compute the derivatives (f̂_s, f̂_a). The noise's influence is restricted to the gradient of the value of the next state, V′_{s′}, and does not affect the model Jacobian. 
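For the additive-noise case just described, inferring the noise on a real transition is a subtraction. The sketch below uses a stand-in linear model, not a learned network:

```python
# Hypothetical learned model with additive noise: s' = fhat(s, a) + xi.
def fhat(s, a):
    return 0.9 * s + 0.1 * a  # stand-in for a learned deterministic network

# One observed transition from the *real* environment, not a model rollout:
s, a, s_next = 1.0, 0.5, 0.97

# The noise that explains the observation is inferred by subtraction,
xi = s_next - fhat(s, a)
# and the model Jacobians used in eqs. 11-12 are independent of xi:
fhat_s, fhat_a = 0.9, 0.1

# The backward pass then evaluates gradients at the observed s', so model
# error never compounds through multi-step predicted states.
```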
If we consider it desirable to capture more complicated environment noise, we can use a re-parameterized generative model and infer the missing noise variables, possibly by sampling from p(η, ξ|s, a, s′).\n4.2 SVG(∞)\nSVG(∞) computes value gradients by backward recursions on finite-horizon trajectories. After every episode, we train the model, f̂, followed by the policy, π. We provide pseudocode for this in Algorithm 1 but discuss further implementation details in section 5 and in the experiments.\nAlgorithm 1 SVG(∞)\n1: Given empty experience database D\n2: for trajectory = 0 to ∞ do\n3:   for t = 0 to T do\n4:     Apply control a = π(s, η; θ), η ∼ ρ(η)\n5:     Observe r, s′\n6:     Insert (s, a, r, s′) into D\n7:   end for\n8:   Train generative model f̂ using D\n9:   v′_s = 0, v′_θ = 0 (finite-horizon)\n10:  for t = T down to 0 do\n11:    Infer ξ|(s, a, s′) and η|(s, a)\n12:    v_θ = [ r_a π_θ + γ(v′_{s′} f̂_a π_θ + v′_θ) ]|_{η,ξ}\n13:    v_s = [ r_s + r_a π_s + γ v′_{s′}(f̂_s + f̂_a π_s) ]|_{η,ξ}\n14:  end for\n15:  Apply gradient-based update using v⁰_θ\n16: end for\n³In the finite-horizon formulation, the gradient calculation starts at the end of the trajectory, for which the only terms remaining in eq. (9) are v^T_s ≈ r^T_s + r^T_a π^T_s. After the recursion, the total derivative of the value function with respect to the policy parameters is given by v⁰_θ, which is a one-sample estimate of ∇_θ J.\nAlgorithm 2 SVG(1) with Replay\n1: Given empty experience database D\n2: for t = 0 to ∞ do\n3:   Apply control a = π(s, η; θ), η ∼ ρ(η)\n4:   Observe r, s′\n5:   Insert (s, a, r, s′) into D\n6:   // Model and critic updates\n7:   Train generative model f̂ using D\n8:   Train value function V̂ using D (Alg. 4)\n9:   // Policy update\n10:  Sample (s^k, a^k, r^k, s^{k+1}) from D (k ≤ t)\n11:  w = p(a^k|s^k; θ^t) / p(a^k|s^k; θ^k)\n12:  Infer ξ^k|(s^k, a^k, s^{k+1}) and η^k|(s^k, a^k)\n13:  v_θ = w (r_a + γ V̂′_{s′} f̂_a) π_θ |_{η^k, ξ^k}\n14:  Apply gradient-based update using v_θ\n15: end for\n4.3 SVG(1) and SVG(0)\nIn our framework, we may learn a parametric estimate of the expected value V̂(s; ν) (critic) with parameters ν. The derivative of the critic value with respect to the state, V̂_s, can be used in place of the sample gradient estimate given in eq. (9). The critic can reduce the variance of the gradient estimates because V̂ approximates the expectation of future rewards while eq. (9) provides only a single-trajectory estimate. Additionally, the value function can be used at the end of an episode to approximate the infinite-horizon policy gradient. Finally, eq. (9) involves the repeated multiplication of Jacobians of the approximate model, f̂_s, f̂_a. Just as model error can compound in forward planning, model gradient error can compound during backpropagation. Furthermore, SVG(∞) is on-policy. That is, after each episode, a single gradient-based update is made to the policy, and the policy optimization does not revisit those trajectory data again. To increase data-efficiency, we construct an off-policy, experience replay [15, 29] algorithm that uses models and value functions, SVG(1) with Experience Replay (SVG(1)-ER). 
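The importance weight that makes the replayed update off-policy can be sketched for a one-dimensional Gaussian policy; all parameter values below are hypothetical:

```python
import math

def gauss_logpdf(a, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at a."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (a - mu)**2 / (2 * sigma**2)

# A stored action a_k, the policy parameters in use when it was taken (theta_k),
# and the current parameters (theta_t):
a_k = 0.3
mu_k, sigma_k = 0.2, 0.4    # historical policy at step k
mu_t, sigma_t = 0.25, 0.35  # current policy

# w = p(a_k | s_k; theta_t) / p(a_k | s_k; theta_k), computed via log-densities
# for numerical stability before re-weighting the sampled gradient.
w = math.exp(gauss_logpdf(a_k, mu_t, sigma_t) - gauss_logpdf(a_k, mu_k, sigma_k))
```

The weight corrects only for the change in the action distribution; as noted below, the state marginal is left unweighted.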
This algorithm also has the advantage that it can perform an infinite-horizon computation.\nTo construct an off-policy estimator, we perform importance-weighting of the current policy distribution with respect to a proposal distribution, q(s, a):\nV̂_θ = E_{q(s,a)} E_{p(s′|s,a)} E_{p(η,ξ|s,a,s′)} [ p(a|s; θ) / q(a|s) ] [ r_a π_θ + γ V̂′_{s′} f̂_a π_θ ].  (13)\nSpecifically, we maintain a database with tuples of past state transitions (s^k, a^k, r^k, s^{k+1}). Each proposal drawn from q is a sample of a tuple from the database. At time t, the importance weight w ≜ p/q = p(a^k|s^k; θ^t) / p(a^k|s^k; θ^k), where θ^k comprise the policy parameters in use at the historical time step k. We do not importance-weight the marginal distribution over states q(s) generated by a policy; this is widely considered to be intractable.\nSimilarly, we use experience replay for value function learning. Details can be found in Appendix C. Pseudocode for the SVG(1) algorithm with Experience Replay is in Algorithm 2.\nWe also provide a model-free stochastic value gradient algorithm, SVG(0) (Algorithm 3 in the Appendix). This algorithm is very similar to SVG(1) and is the stochastic analogue of the recently introduced Deterministic Policy Gradient algorithm (DPG) [23, 14, 4]. Unlike DPG, instead of assuming a deterministic policy, SVG(0) estimates the derivative around the policy noise: E_{p(η)}[ Q_a π_θ |_η ].⁴ This, for example, permits learning the policy noise variance. The relative merit of SVG(1) versus SVG(0) depends on whether the model or value function is easier to learn and is task-dependent. We expect that model-based algorithms such as SVG(1) will show the strongest advantages in multitask settings where the system dynamics are fixed, but the reward function is variable. 
SVG(1) performed well across all experiments, including ones introducing capacity constraints on the value function and model. SVG(1)-ER demonstrated a significant advantage over all other tested algorithms.\n5 Model and value learning\nWe can use almost any kind of differentiable, generative model. In our work, we have parameterized the models as neural networks. Our framework supports nonlinear state- and action-dependent noise, notable properties of biological actuators. For example, this can be described by the parametric form f̂(s, a, ξ) = µ̂(s, a) + σ̂(s, a)ξ. Model learning amounts to a purely supervised problem based on observed state transitions. Our model and policy training occur jointly. There is no “motor-babbling” period used to identify the model. As new transitions are observed, the model is trained first, followed by the value function (for SVG(1)), followed by the policy. To ensure that the model does not forget information about state transitions, we maintain an experience database and cull batches of examples from the database for every model update. Additionally, we model the state-change by s′ = f̂(s, a, ξ) + s and have found that constructing models as separate sub-networks per predicted state dimension improved model quality significantly.\nOur framework also permits a variety of means to learn the value function models. We can use temporal difference learning [25] or regression to empirical episode returns. Since SVG(1) is model-based, we can also use Bellman residual minimization [3]. In practice, we used a version of “fitted” policy evaluation. Pseudocode is available in Appendix C, Algorithm 4.\n6 Experiments\nWe tested the SVG algorithms in two sets of experiments. 
In the first set of experiments (section 6.1), we test whether evaluating gradients on real environment trajectories and value function approximation can reduce the impact of model error. In our second set (section 6.2), we show that SVG(1) can be applied to several complicated, multidimensional physics environments involving contact dynamics (Figure 1) in the MuJoCo simulator [28]. Below we only briefly summarize the main properties of each environment; further details of the simulations can be found in Appendix D and the supplement. In all cases, we use generic, 2-hidden-layer neural networks with tanh activation functions to represent models, value functions, and policies. A video montage is available at https://youtu.be/PYdL7bcn_cM.\n⁴Note that π is a function of the state and noise variable.\nFigure 1: From left to right: 7-Link Swimmer; Reacher; Gripper; Monoped; Half-Cheetah; Walker\n6.1 Analyzing SVG\nGradient evaluation on real trajectories vs. planning To demonstrate the difficulty of planning with a stochastic model, we first present a very simple control problem for which SVG(∞) easily learns a control policy but for which an otherwise identical planner fails entirely. Our example is based on a problem due to [16]. The policy directly controls the velocity of a point-mass “hand” on a 2D plane. By means of a spring-coupling, the hand exerts a force on a ball mass; the ball additionally experiences a gravitational force and random forces (Gaussian noise). The goal is to bring hand and ball into one of two randomly chosen target configurations, with a relevant reward being provided only at the final time step.\nWith simulation time step 0.01s, this demands controlling and backpropagating the distal reward along a trajectory of 1,000 steps. 
Because this experiment has a non-stationary, time-dependent value function, this problem also favors model-based value gradients over methods using value functions. SVG(∞) easily learns this task, but the planner, which uses trajectories from the model, shows little improvement. The planner simulates trajectories using the learned stochastic model and backpropagates along those simulated trajectories (eqs. 9 and 10) [18]. The extremely long time-horizon lets prediction error accumulate and thus renders roll-outs highly inaccurate, leading to much worse final performance (c.f. Fig. 2, left).⁵\nRobustness to degraded models and value functions We investigated the sensitivity of SVG(∞) and SVG(1) to the quality of the learned model on Swimmer. Swimmer is a chain body with multiple links immersed in a fluid environment with drag forces that allow the body to propel itself [5, 27]. We build chains of 3, 5, or 7 links, corresponding to 10-, 14-, or 18-dimensional state spaces with 2-, 4-, or 6-dimensional action spaces. The body is initialized in random configurations with respect to a central goal location. Thus, to solve the task, the body must turn to re-orient and then produce an undulation to move to the goal.\nTo assess the impact of model quality, we learned to control a 3-link swimmer with SVG(∞) and SVG(1) while varying the capacity of the network used to model the environment (5, 10, or 20 hidden units for each state-dimension subnetwork; Appendix D); i.e., in this task we intentionally shrink the neural network model to investigate the sensitivity of our methods to model inaccuracy. While with a high-capacity model (20 hidden units per state dimension) both SVG(∞) and SVG(1) successfully learn to solve the task, the performance of SVG(∞) drops significantly as model capacity is reduced (c.f. Fig. 3, middle). 
SVG(1) still works well for models with only 5 hidden units, and it also scales up to 5- and 7-link versions of the swimmer (Figs. 3, right and 4, left). To compare SVG(1) to conventional model-free approaches, we also tested a state-of-the-art actor-critic algorithm that learns a V-function and updates the policy using the TD-error δ = r + γV′ − V as an estimate of the advantage, yielding the policy gradient v_θ = δ ∇_θ log π [30]. (SVG(1) and the AC algorithm used the same code for learning V.) SVG(1) outperformed the model-free approach in the 3-, 5-, and 7-link swimmer tasks (c.f. Fig. 3, left, right; Fig. 4, top left). In figure panels 2, middle; 3, right; and 4, left column, we show that experience replay for the policy can improve the data efficiency and performance of SVG(1).\n⁵We also tested REINFORCE on this problem but achieved very poor results due to the long horizon.\nFigure 2: Left: Backpropagation through a model along observed stochastic trajectories is able to optimize a stochastic policy in a stochastic environment, but an otherwise equivalent planning algorithm that simulates the transitions with a learned stochastic model makes little progress due to compounding model error. Middle: SVG and DPG algorithms on cart-pole. SVG(1)-ER learns the fastest. Right: When the value function capacity is reduced from 200 hidden units in the first layer to 100 and then again to 50, SVG(1) exhibits less performance degradation than the Q-function-based DPG, presumably because the dynamics model contains auxiliary information about the Q-function.\nFigure 3: Left: For a 3-link swimmer, with relatively simple dynamics, the compared methods yield similar results and possibly a slight advantage to the purely model-based SVG(∞). 
Middle: However, as the environment model's capacity is reduced from 20 to 10 and then to 5 hidden units per state-dimension subnetwork, SVG(∞) dramatically deteriorates, whereas SVG(1) shows undisturbed performance. Right: For a 5-link swimmer, SVG(1)-ER learns faster and asymptotes at higher performance than the other tested algorithms.\nSimilarly, we tested the impact of varying the capacity of the value function approximator (Fig. 2, right) on a cart-pole. The V-function-based SVG(1) degrades less severely than the Q-function-based DPG, presumably because it computes the policy gradient with the aid of the dynamics model.\n6.2 SVG in complex environments\nIn a second set of experiments, we demonstrated that SVG(1)-ER can be applied to several challenging physical control problems with stochastic, non-linear, and discontinuous dynamics due to contacts. Reacher is an arm stationed within a walled box with 6 state dimensions and 3 action dimensions and the (x, y) coordinates of a target site, giving 8 state dimensions in total. In 4-Target Reacher, the site was randomly placed at one of the four corners of the box, and the arm in a random configuration at the beginning of each trial. In Moving-Target Reacher, the site moved at a randomized speed and heading in the box with reflections at the walls. Solving this latter problem implies that the policy has generalized over the entire workspace. Gripper augments the reacher arm with a manipulator that can grab a ball in a randomized position and return it to a specified site. Monoped has 14 state dimensions, 4 action dimensions, and ground contact dynamics. The monoped begins falling from a height and must remain standing. Additionally, we apply Gaussian random noise to the torques controlling the joints with a standard deviation of 5% of the total possible actuator strength at all points in time, reducing the stability of upright postures. 
Half-Cheetah is a planar cat robot designed to run, based on [29], with 18 state dimensions and 6 action dimensions. Half-Cheetah has a version with springs to aid balanced standing and a version without them. Walker is a planar biped, based on the environment from [22].\nResults Figure 4 shows learning curves for several repeats for each of the tasks. We found that in all cases SVG(1) solved the problem well; we provide videos of the learned policies in the supplemental material. The 4-Target Reacher reliably finished at the target site, and in the tracking task followed the moving target successfully. SVG(1)-ER has a clear advantage on this task, as also borne out in the cart-pole and swimmer experiments. The cheetah gaits varied slightly from experiment to experiment but in all cases made good forward progress. For the monoped, the policies were able to balance well beyond the 200 time steps of training episodes and were able to resist significantly higher adversarial noise levels than used during training (up to 25% noise). We were able to learn gripping and walking behavior, although walking policies that achieved similar reward levels did not always exhibit equally good walking phenotypes.\nFigure 4: Across several different domains, SVG(1)-ER reliably optimizes policies, clearly settling into similar local optima. On the 4-Target Reacher, SVG(1)-ER shows a noticeable efficiency and performance gain relative to the other algorithms.\n7 Related work\nWriting the noise variables as exogenous inputs to the system to allow direct differentiation with respect to the system state (equation 7) is a known device in control theory [10, 7], where the model is given analytically. The idea of using a model to optimize a parametric policy around real trajectories is presented heuristically in [17] and [1] for deterministic policies and models. 
Also in the limit of deterministic policies and models, the recursions we have derived in Algorithm 1 reduce to those of [2]. Werbos defines an actor-critic algorithm called Heuristic Dynamic Programming that uses a deterministic model to roll forward one step to produce a state prediction that is evaluated by a value function [31]. Deisenroth et al. have used Gaussian process models to compute policy gradients that are sensitive to model uncertainty [6], and Levine et al. have optimized impressive policies with the aid of a non-parametric trajectory optimizer and locally-linear models [13]. Our work, in contrast, has focused on using global, neural network models conjoined to value function approximators.\n8 Discussion\nWe have shown that two potential problems with value gradient methods, their reliance on planning and restriction to deterministic models, can be exorcised, broadening their relevance to reinforcement learning. We have shown experimentally that the SVG framework can train neural network policies in a robust manner to solve interesting continuous control problems. The framework includes algorithm variants beyond the ones tested in this paper, for example, ones that combine a value function with k steps of back-propagation through a model (SVG(k)). Augmenting SVG(1) with experience replay led to the best results, and a similar extension could be applied to any SVG(k). Furthermore, we did not harness sophisticated generative models of stochastic dynamics, but one could readily do so, presenting great room for growth.\nAcknowledgements We thank Arthur Guez, Danilo Rezende, Hado van Hasselt, John Schulman, Jonathan Hunt, Nando de Freitas, Martin Riedmiller, Remi Munos, Shakir Mohamed, and Theophane Weber for helpful discussions and John Schulman for sharing his walker model.\nReferences\n[1] P. 
Abbeel, M. Quigley, and A. Y. Ng. Using inaccurate models in reinforcement learning. In ICML, 2006.
[2] C. G. Atkeson. Efficient robust policy optimization. In ACC, 2012.
[3] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In ICML, 1995.
[4] D. Balduzzi and M. Ghifary. Compatible value gradients for reinforcement learning of continuous deep policies. arXiv preprint arXiv:1509.03005, 2015.
[5] R. Coulom. Reinforcement learning using neural networks, with applications to motor control. PhD thesis, Institut National Polytechnique de Grenoble-INPG, 2002.
[6] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, 2011.
[7] M. Fairbank. Value-gradient learning. PhD thesis, City University London, 2014.
[8] M. Fairbank and E. Alonso. Value-gradient learning. In IJCNN, 2012.
[9] I. Grondman. Online model learning algorithms for actor-critic control. PhD thesis, Delft University of Technology, 2015.
[10] D. H. Jacobson and D. Q. Mayne. Differential dynamic programming. 1970.
[11] M. I. Jordan and D. E. Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3):307–354, 1992.
[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[13] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[15] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[16] R. Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7:771–791, 2006.
[17] K. S.
Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4–27, 1990.
[18] D. H. Nguyen and B. Widrow. Neural networks for self-learning control systems. IEEE Control Systems Magazine, 10(3):18–23, 1990.
[19] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[20] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[21] M. Riedmiller. Neural fitted Q iteration: First experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer, 2005.
[22] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015.
[23] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
[24] S. P. Singh. Learning without state-estimation in partially observable Markovian decision processes. In ICML, 1994.
[25] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
[26] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
[27] Y. Tassa, T. Erez, and W. D. Smart. Receding horizon differential dynamic programming. In NIPS, 2008.
[28] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IROS, 2012.
[29] P. Wawrzyński. A cat-like robot real-time learning to run. In Adaptive and Natural Computing Algorithms, pages 380–390. Springer, 2009.
[30] P. Wawrzyński.
Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22(10):1484–1497, 2009.
[31] P. J. Werbos. A menu of designs for reinforcement learning over time. In Neural Networks for Control, pages 67–95, 1990.
[32] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.