{"title": "Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion", "book": "Advances in Neural Information Processing Systems", "page_first": 8224, "page_last": 8234, "abstract": "There is growing interest in combining model-free and model-based approaches in reinforcement learning with the goal of achieving the high performance of model-free algorithms with low sample complexity. This is difficult because an imperfect dynamics model can degrade the performance of the learning algorithm, and in sufficiently complex environments, the dynamics model will always be imperfect. As a result, a key challenge is to combine model-based approaches with model-free learning in such a way that errors in the model do not degrade performance. We propose stochastic ensemble value expansion (STEVE), a novel model-based technique that addresses this issue. By dynamically interpolating between model rollouts of various horizon lengths, STEVE ensures that the model is only utilized when doing so does not introduce significant errors. Our approach outperforms model-free baselines on challenging continuous control benchmarks with an order-of-magnitude increase in sample efficiency.", "full_text": "Sample-Ef\ufb01cient Reinforcement Learning\nwith Stochastic Ensemble Value Expansion\n\nJacob Buckman\u2217 Danijar Hafner George Tucker Eugene Brevdo Honglak Lee\n\nGoogle Brain, Mountain View, CA, USA\n\njacobbuckman@gmail.com, mail@danijar.com,\n\n{gjt,ebrevdo,honglak}@google.com\n\nAbstract\n\nIntegrating model-free and model-based approaches in reinforcement learning has\nthe potential to achieve the high performance of model-free algorithms with low\nsample complexity. However, this is dif\ufb01cult because an imperfect dynamics model\ncan degrade the performance of the learning algorithm, and in suf\ufb01ciently complex\nenvironments, the dynamics model will almost always be imperfect. 
As a result,\na key challenge is to combine model-based approaches with model-free learning\nin such a way that errors in the model do not degrade performance. We propose\nstochastic ensemble value expansion (STEVE), a novel model-based technique\nthat addresses this issue. By dynamically interpolating between model rollouts\nof various horizon lengths for each individual example, STEVE ensures that the\nmodel is only utilized when doing so does not introduce signi\ufb01cant errors. Our\napproach outperforms model-free baselines on challenging continuous control\nbenchmarks with an order-of-magnitude increase in sample ef\ufb01ciency, and in\ncontrast to previous model-based approaches, performance does not degrade in\ncomplex environments.\n\n1\n\nIntroduction\n\nDeep model-free reinforcement learning has had great successes in recent years, notably in playing\nvideo games [23] and strategic board games [27]. However, training agents using these algorithms\nrequires tens to hundreds of millions of samples, which makes many practical applications infeasible,\nparticularly in real-world control problems (e.g., robotics) where data collection is expensive.\nModel-based approaches aim to reduce the number of samples required to learn a policy by modeling\nthe dynamics of the environment. A dynamics model can be used to increase sample ef\ufb01ciency in\nvarious ways, including training the policy on rollouts from the dynamics model [28], using rollouts\nto improve targets for temporal difference (TD) learning [7], and using information gained from\nrollouts as inputs to the policy [31]. Model-based algorithms such as PILCO [4] have shown that it is\npossible to learn from orders-of-magnitude fewer samples.\nThese successes have mostly been limited to environments where the dynamics are simple to model.\nIn noisy, complex environments, it is dif\ufb01cult to learn an accurate model of the environment. 
When\nthe model makes mistakes in this context, it can cause the wrong policy to be learned, hindering\nperformance. Recent work has begun to address this issue. Kalweit and Boedecker [17] train a\nmodel-free algorithm on a mix of real and imagined data, adjusting the proportion in favor of real\ndata as the Q-function becomes more con\ufb01dent. Kurutach et al. [20] train a model-free algorithm\non purely imaginary data, but use an ensemble of environment models to avoid over\ufb01tting to errors\nmade by any individual model.\n\n\u2217This work was completed as part of the Google AI Residency program.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Value error per update on a value-estimation task (\ufb01xed policy) in a toy environment. H is\nthe maximum rollout horizon (see Section 3). When given access to a perfect dynamics model, hybrid\nmodel-free model-based approaches (MVE and STEVE) solve this task with 5\u00d7 fewer samples\nthan model-free TD learning. However, when only given access to a noisy dynamics model, MVE\ndiverges due to model errors. In contrast, STEVE converges to the correct solution, and does so with\na 2\u00d7 speedup over TD learning. This is because STEVE dynamically adapts its rollout horizon to\naccommodate model error. See Appendix A for more details.\n\nWe propose stochastic ensemble value expansion (STEVE), an extension to model-based value\nexpansion (MVE) proposed by Feinberg et al. [7]. Both techniques use a dynamics model to compute\n\u201crollouts\u201d that are used to improve the targets for temporal difference learning. MVE rolls out a \ufb01xed\nlength into the future, potentially accumulating model errors or increasing value estimation error\nalong the way. In contrast, STEVE interpolates between many different horizon lengths, favoring\nthose whose estimates have lower uncertainty, and thus lower error. 
To compute the interpolated target, we replace both the model and Q-function with ensembles, approximating the uncertainty of an estimate by computing its variance under samples from the ensemble. Through these uncertainty estimates, STEVE dynamically utilizes the model rollouts only when they do not introduce significant errors. For illustration, Figure 1 compares the sample efficiency of various algorithms on a tabular toy environment, which shows that STEVE significantly outperforms MVE and TD-learning baselines when the dynamics model is noisy. We systematically evaluate STEVE on several challenging continuous control benchmarks and demonstrate that STEVE significantly outperforms model-free baselines with an order-of-magnitude increase in sample efficiency.

2 Background

Reinforcement learning aims to learn an agent policy that maximizes the expected (discounted) sum of rewards [29]. The agent starts at an initial state s_0 ∼ p(s_0), where p(s_0) is the distribution of initial states of the environment. Then, the agent deterministically chooses an action a_t according to its policy π_φ(s_t) with parameters φ, deterministically transitions to a subsequent state s_{t+1} according to the Markovian dynamics T(s_t, a_t) of the environment, and receives a reward r_t = r(s_t, a_t, s_{t+1}). This generates a trajectory of states, actions, and rewards τ = (s_0, a_0, r_0, s_1, a_1, . . .). If a trajectory reaches a terminal state, it concludes without further transitions or rewards; however, this is optional, and trajectories may instead be infinite in length. We abbreviate the trajectory by τ. The goal is to maximize the expected discounted sum of rewards along sampled trajectories, J(θ) = E_{s_0}[∑_{t=0}^{∞} γ^t r_t], where γ ∈ [0, 1) is a discount parameter.

2.1 Value Estimation with TD-learning

The action-value function Q^π(s_0, a_0) = ∑_{t=0}^{∞} γ^t r_t is a critical quantity to estimate for many learning algorithms. Using the fact that Q^π(s, a) satisfies the recursion relation

Q^π(s, a) = r(s, a) + γ(1 − d(s′)) Q^π(s′, π(s′)),

where s′ = T(s, a) and d(s′) is an indicator function which returns 1 when s′ is a terminal state and 0 otherwise, we can estimate Q^π(s, a) off-policy with collected transitions of the form (s, a, r, s′) sampled uniformly from a replay buffer [29]. We approximate Q^π(s, a) with a deep neural network, Q̂^π_θ(s, a). We learn parameters θ to minimize the mean squared error (MSE) between Q-value estimates of states and their corresponding TD targets:

T^TD(r, s′) = r + γ(1 − d(s′)) Q̂^π_{θ−}(s′, π(s′))   (1)

L_θ = E_{(s,a,r,s′)}[(Q̂^π_θ(s, a) − T^TD(r, s′))²]   (2)

This expectation is taken with respect to transitions sampled from our replay buffer. Note that we use an older copy of the parameters, θ−, when computing targets [23].

Since we evaluate our method in a continuous action space, it is not possible to compute a policy from our Q-function by simply taking max_a Q̂^π_θ(s, a).
Instead, we use a neural network to approximate this maximization function [21], by learning a parameterized function π_φ to minimize the negative Q-value:

L_φ = −Q̂^π_θ(s, π_φ(s)).   (3)

In this work, we use DDPG as the base learning algorithm, but our technique is generally applicable to other methods that use TD objectives.

2.2 Model-Based Value Expansion (MVE)

Recently, Feinberg et al. [7] showed that a learned dynamics model can be used to improve value estimation. MVE forms TD targets by combining a short-term value estimate formed by unrolling the model dynamics and a long-term value estimate using the learned Q̂^π_{θ−} function. When the model is accurate, this reduces the bias of the targets, leading to improved performance.

The learned dynamics model consists of three learned functions: the transition function T̂_ξ(s, a), which returns a successor state s′; a termination function d̂_ξ(s), which returns the probability that s is a terminal state; and the reward function r̂_ψ(s, a, s′), which returns a scalar reward. This model is trained to minimize

L_{ξ,ψ} = E_{(s,a,r,s′)}[ ‖T̂_ξ(s, a) − s′‖² + H(d(s′), d̂_ξ(T̂_ξ(s, a))) + (r̂_ψ(s, a, s′) − r)² ],   (4)

where the expectation is over collected transitions (s, a, r, s′), and H is the cross-entropy. In this work, we consider continuous environments; for discrete environments, the first term can be replaced by a cross-entropy loss term.

To incorporate the model into value estimation, Feinberg et al. [7] replace the standard Q-learning target with an improved target, T^MVE_H, computed by rolling the learned model out for H steps:

s′_0 = s′,   a′_i = π_φ(s′_i),   s′_i = T̂_ξ(s′_{i−1}, a′_{i−1}),   D_i = (1 − d(s′)) ∏_{j=1}^{i−1} (1 − d̂_ξ(s′_j)),   (5)

T^MVE_H(r, s′) = r + ∑_{i=1}^{H} D_i γ^i r̂_ψ(s′_{i−1}, a′_{i−1}, s′_i) + D_{H+1} γ^{H+1} Q̂^π_{θ−}(s′_H, a′_H).   (6)

To use this target, we substitute T^MVE_H in place of T^TD when training θ using Equation 2.² Note that when H = 0, MVE reduces to TD-learning (i.e., T^TD = T^MVE_0).

When the model is perfect and the learned Q-function has similar bias on all states and actions, Feinberg et al. [7] show that the MVE target with rollout horizon H will decrease the target error by a factor of γ^{2H}. Errors in the learned model can lead to worse targets, so in practice, we must tune H to balance between the errors in the model and the Q-function estimates. An additional challenge is that the bias in the learned Q-function is not uniform across states and actions [7]. In particular, they find that the bias in the Q-function on states sampled from the replay buffer is lower than when the Q-function is evaluated on states generated from model rollouts. They term this the distribution mismatch problem and propose the TD-k trick as a solution; see Appendix B for further discussion of this trick.

²This formulation is a minor generalization of the original MVE objective in that we additionally model the reward function and termination function; Feinberg et al. [7] consider "fully observable" environments in which the reward function and termination condition were known, deterministic functions of the observations. Because we use a function approximator for the termination condition, we compute the accumulated probability of termination, D_i, at every timestep, and use this value to discount future returns.

Figure 2: Visualization of how the set of possible values for each candidate target is computed, shown for a length-two rollout with M, N, L = 2. Colors correspond to ensemble members. Best viewed in color.

While the results of Feinberg et al. [7] are promising, they rely on task-specific tuning of the rollout horizon H. This sensitivity arises from the difficulty of modeling the transition dynamics and the Q-function, which are task-specific and may change throughout training as the policy explores different parts of the state space. Complex environments require a much smaller rollout horizon H, which limits the effectiveness of the approach (e.g., Feinberg et al. [7] used H = 10 for HalfCheetah-v1, but had to reduce to H = 3 on Walker2d-v1). Motivated by this limitation, we propose an approach that balances model error and Q-function error by dynamically adjusting the rollout horizon.

3 Stochastic Ensemble Value Expansion

From a single rollout of H timesteps, we can compute H + 1 distinct candidate targets by considering rollouts of various horizon lengths: T^MVE_0, T^MVE_1, T^MVE_2, ..., T^MVE_H.
Standard TD learning uses T^MVE_0 as the target, while MVE uses T^MVE_H as the target. We propose interpolating all of the candidate targets to produce a target which is better than any individual. Conventionally, one could average the candidate targets, or weight the candidate targets in an exponentially-decaying fashion, similar to TD(λ) [29]. However, we show that we can do still better by weighting the candidate targets in a way that balances errors in the learned Q-function and errors from longer model rollouts. STEVE provides a computationally-tractable and theoretically-motivated algorithm for choosing these weights. We describe the algorithm for STEVE in Section 3.1, and justify it in Section 3.2.

3.1 Algorithm

To estimate uncertainty in our learned estimators, we maintain an ensemble of parameters for our Q-function, reward function, and model: θ = {θ_1, ..., θ_L}, ψ = {ψ_1, ..., ψ_N}, and ξ = {ξ_1, ..., ξ_M}, respectively. Each parameterization is initialized independently and trained on different subsets of the data in each minibatch.

We roll out an H-step trajectory with each of the M models, τ^{ξ_1}, ..., τ^{ξ_M}. Each trajectory consists of H + 1 states, τ^{ξ_m}_0, ..., τ^{ξ_m}_H, which correspond to s′_0, ..., s′_H in Equation 5 with the transition function parameterized by ξ_m. Similarly, we use the N reward functions and L Q-functions to evaluate Equation 6 for each τ^{ξ_m} at every rollout length 0 ≤ i ≤ H. This gives us M · N · L different values of T^MVE_i for each rollout length i. See Figure 2 for a visualization of this process.

Using these values, we can compute the empirical mean T^μ_i and variance T^{σ²}_i for each partial rollout of length i. In order to form a single target, we use an inverse-variance weighting of the means:

T^STEVE_H(r, s′) = ∑_{i=0}^{H} (w̃_i / ∑_j w̃_j) T^μ_i,   where w̃_i^{−1} = T^{σ²}_i.   (7)

To learn a value function with STEVE, we substitute T^STEVE_H in place of T^TD when training θ using Equation 2.

3.2 Derivation

We wish to find weights w_i, where ∑_i w_i = 1, that minimize the mean-squared error between the weighted average of candidate targets T^MVE_0, T^MVE_1, T^MVE_2, ..., T^MVE_H and the true Q-value:

E[(∑_{i=0}^{H} w_i T^MVE_i − Q^π(s, a))²] = Bias(∑_i w_i T^MVE_i)² + Var(∑_i w_i T^MVE_i) ≈ Bias(∑_i w_i T^MVE_i)² + ∑_i w_i² Var(T^MVE_i),

where the expectation considers the candidate targets as random variables conditioned on the collected data and minibatch sampling noise, and the approximation is due to assuming the candidate targets are independent³.

Our goal is to minimize this with respect to w_i. We can estimate the variance terms using empirical variance estimates from the ensemble. Unfortunately, we could not devise a reliable estimator for the bias terms; this is a limitation of our approach and an area for future work. In this work, we ignore the bias terms and minimize the weighted sum of variances

∑_i w_i² Var(T^MVE_i).

With this approximation, which is equivalent to inverse-variance weighting [8], we achieve state-of-the-art results.
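As a concrete illustration, the following plain-numpy sketch generates the M·N·L candidate-target samples for each rollout length and combines them by the inverse-variance weighting of Equation 7. The tiny 1-D "ensembles" and all function names here are our own hypothetical stand-ins, not the released implementation; the sketch also assumes deterministic dynamics with no terminal states, so every D_i is 1.

```python
import numpy as np

# Hypothetical toy ensembles: M = 2 dynamics models, N = 2 reward heads,
# L = 2 Q-heads, operating on a 1-D state for illustration only.
H, gamma = 2, 0.99
models  = [lambda s, a, b=b: 0.9 * s + a + b for b in (0.0, 0.02)]
rew_fns = [lambda s, a, s2, w=w: w * s2 for w in (1.0, 1.05)]
q_fns   = [lambda s, a, w=w: w * s for w in (10.0, 10.5)]
policy  = lambda s: 0.1 * s

def candidate_samples(r, s_next):
    """M*N*L sampled values of T^MVE_i for each rollout length i = 0..H."""
    rows = [[] for _ in range(H + 1)]
    for T in models:                          # one rollout per dynamics model
        states, actions = [s_next], []
        for _ in range(H):                    # unroll: s'_i = T(s'_{i-1}, a'_{i-1})
            actions.append(policy(states[-1]))
            states.append(T(states[-1], actions[-1]))
        for rew in rew_fns:                   # evaluate with every reward head
            partial_return = r
            for i in range(H + 1):
                if i > 0:                     # add gamma^i * r_hat(s'_{i-1}, a'_{i-1}, s'_i)
                    partial_return += gamma ** i * rew(states[i - 1],
                                                       actions[i - 1], states[i])
                for q in q_fns:               # bootstrap with every Q head
                    rows[i].append(partial_return
                                   + gamma ** (i + 1) * q(states[i],
                                                          policy(states[i])))
    return np.array(rows)                     # shape (H + 1, M * N * L)

def steve_target(r, s_next):
    """Inverse-variance combination of the candidate targets (Equation 7)."""
    samples = candidate_samples(r, s_next)
    mu, var = samples.mean(axis=1), samples.var(axis=1)
    w = 1.0 / (var + 1e-8)                    # w_i proportional to 1/Var; eps avoids /0
    return float(np.dot(w / w.sum(), mu))
```

Because the weights are proportional to 1/Var, a rollout length whose ensemble members disagree contributes little probability mass to the final target.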
Setting each w_i equal to 1/Var(T^MVE_i) and normalizing yields the formula for T^STEVE_H given in Equation 7.

3.3 Note on ensembles

This technique for calculating uncertainty estimates is applicable to any family of models from which we can sample. For example, we could train a Bayesian neural network for each model [22], or use dropout as a Bayesian approximation by resampling the dropout masks each time we wish to sample a new model [10]. These options could potentially give better diversity of samples from the family, and thus better uncertainty estimates; exploring them further is a promising direction for future work. However, we found that these methods degraded the accuracy of the base models. An ensemble is far easier to train, and so we focus on that in this work. This is a common choice, as the use of ensembles for uncertainty estimation in deep reinforcement learning has seen wide adoption in the literature. It was first proposed by Osband et al. [25] as a technique to improve exploration, and subsequent work showed that this approach gives a good estimate of the uncertainty of both value functions [17] and models [20].

4 Experiments

4.1 Implementation

We use DDPG [21] as our baseline model-free algorithm. We train two deep feedforward neural networks, a Q-function network Q̂^π_θ(s, a) and a policy network π_φ(s), by minimizing the loss functions given in Equations 2 and 3. We also train another three deep feedforward networks to represent our world model, corresponding to function approximators for the transition T̂_ξ(s, a), termination d̂_ξ(t | s), and reward r̂_ψ(s, a, s′), and minimize the loss function given in Equation 4. When collecting rollouts for evaluation, we simply take the action selected by the policy, π_φ(s), at every state s.
(Note that only the policy is required at test-time, not the ensembles of Q-functions, dynamics models, or reward models.) Each run was evaluated after every 500 updates by computing the mean total episode reward (referred to as score) across many environment restarts. To produce the lines in Figures 3, 4, and 5, these evaluation results were downsampled by splitting the domain into non-overlapping regions and computing the mean score within each region across several runs. The shaded area shows one standard deviation of scores in the region as defined above.

When collecting rollouts for our replay buffer, we do ε-greedy exploration: with probability ε, we select a random action by adding Gaussian noise to the pre-tanh policy action.

All algorithms were implemented in TensorFlow [1]. We use a distributed implementation to parallelize computation. In the style of Ape-X [16], IMPALA [6], and D4PG [2], we use a centralized learner with several agents operating in parallel. Each agent periodically loads the most recent policy, interacts with the environment, and sends its observations to the central learner. The learner stores received frames in a replay buffer, and continuously loads batches of frames from this buffer to use as training data for a model update. In the algorithms with a model-based component, there are two learners: a policy-learner and a model-learner. In these cases, the policy-learner periodically reloads the latest copy of the model.

³Initial experiments suggested that omitting the covariance cross terms provided significant computational speedups at the cost of a slight performance degradation. As a result, we omitted the terms in the rest of the experiments.

Figure 3: Learning curves comparing sample efficiency of our method to both model-free and model-based baselines. Each experiment was run four times.

All baselines reported in this section were re-implementations of existing methods.
This allowed us to ensure that the various methods compared were consistent with one another, and that the differences reported are fully attributable to the independent variables in question. Our baselines are competitive with state-of-the-art implementations of these algorithms [7, 14]. All MVE experiments utilize the TD-k trick. For hyperparameters and additional implementation details, please see Appendix C.⁴

⁴Our code is available open-source at: https://github.com/tensorflow/models/tree/master/research/steve

4.2 Comparison of Performance

We evaluated STEVE on a variety of continuous control tasks [3, 19]; we plot learning curves in Figure 3. We found that STEVE yields significant improvements in both performance and sample efficiency across a wide range of environments. Importantly, the gains are most substantial in the complex environments. On the most challenging environments: Humanoid-v1, RoboschoolHumanoid-v1, RoboschoolHumanoidFlagrun-v1, and BipedalWalkerHardcore-v2, STEVE is the only algorithm to show significant learning within 5M frames.

Figure 5: Comparison of wall-clock time between our method and baselines. Each experiment was run three times.

4.3 Ablation Study

In order to verify that STEVE's gains in sample efficiency are due to the reweighting, and not simply due to the additional parameters of the ensembles of its components, we examine several ablations. Ensemble MVE is the regular MVE algorithm, but the model and Q-functions are replaced with ensembles. Mean-MVE uses the exact same architecture as STEVE, but uses a simple uniform weighting instead of the uncertainty-aware reweighting scheme. Similarly, TDL25 and TDL75 correspond to TD(λ) reweighting schemes with λ = 0.25 and λ = 0.75, respectively. COV-STEVE is a version of STEVE which includes the covariances between candidate targets when computing the weights (see Section 3.2).
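For reference, the fixed weighting schemes used by these ablations can be sketched as follows. This is only our reading of the ablations: the exact normalization is not spelled out here, so we show the natural choice of weights normalized to sum to one, with uniform weights for Mean-MVE and exponentially-decaying weights for the TDL variants.

```python
import numpy as np

def fixed_weights(H, lam=None):
    """Fixed weights over candidate targets T^MVE_0..T^MVE_H (a sketch):
    uniform averaging (Mean-MVE) when lam is None, otherwise TD(lambda)-style
    exponential decay (TDL25: lam=0.25, TDL75: lam=0.75), normalized to sum to 1.
    """
    w = np.ones(H + 1) if lam is None else lam ** np.arange(H + 1)
    return w / w.sum()
```

Under this sketch, fixed_weights(3, 0.25) concentrates most mass on the horizon-0 (pure TD) target, while fixed_weights(3, 0.75) decays more slowly toward the longer model rollouts; neither adapts to the per-example uncertainty that STEVE exploits.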
We also investigate the effect of the horizon parameter on the performance of both STEVE and MVE. These results are shown in Figure 4.

Figure 4: Ablation experiments on variations of the method. Each experiment was run twice.

All of these variants show the same trend: fast initial gains, which quickly taper off and are overtaken by the baseline. STEVE is the only variant to converge faster and higher than the baseline; this provides strong evidence that the gains come specifically from the uncertainty-aware reweighting of targets. Additionally, we find that increasing the rollout horizon increases the sample efficiency of STEVE, even though the dynamics model for Humanoid-v1 has high error.

4.4 Wall-Clock Comparison

In the previous experiments, we synchronized data collection, policy updates, and model updates. However, when we run these steps asynchronously, we can reduce the wall-clock time at the risk of instability. To evaluate this configuration, we compare DDPG, MVE-DDPG, and STEVE-DDPG on Humanoid-v1 and RoboschoolHumanoidFlagrun-v1. Both were trained on a P100 GPU and had 8 CPUs collecting data; STEVE-DDPG additionally used a second P100 to learn a model in parallel. We plot reward as a function of wall-clock time for these tasks in Figure 5. STEVE-DDPG learns more quickly on both tasks, and it achieves a higher reward than DDPG and MVE-DDPG on Humanoid-v1 and performs comparably to DDPG on RoboschoolHumanoidFlagrun-v1. Moreover, in future work, STEVE could be accelerated by parallelizing training of each component of the ensemble.

5 Discussion

Our primary experiments (Section 4.2) show that STEVE greatly increases sample efficiency relative to baselines, matching or outperforming both MVE-DDPG and DDPG baselines on every task.
STEVE also outperforms other recently-published results on these tasks in terms of sample efficiency [13, 14, 26]. Our ablation studies (Section 4.3) support the hypothesis that the increased performance is due to the uncertainty-dependent reweighting of targets, and demonstrate that the performance of STEVE consistently increases with longer horizon lengths, even in complex environments. Finally, our wall-clock experiments (Section 4.4) demonstrate that in spite of the additional computation per epoch, the gains in sample efficiency are enough that it is competitive with model-free algorithms in terms of wall-clock time. The advantage conferred by improved sample efficiency will only become more pronounced as samples become more expensive to collect, making STEVE a promising choice for applications involving real-world interaction.

Given that the improvements stem from the dynamic reweighting between horizon lengths, it may be interesting to examine the choices that the model makes about which candidate targets to favor most heavily. In Figure 6, we plot the average model usage over the course of training. Intriguingly, most of the lines seem to remain stable at around 50% usage, with two notable exceptions: Humanoid-v1, the most complex environment tested (with an observation-space of size 376); and Swimmer-v1, the least complex environment tested (with an observation-space of size 8). This supports the hypothesis that STEVE is trading off between Q-function bias and model bias; it chooses to ignore the model almost immediately when the environment is too complex to learn, and gradually ignores the model as the Q-function improves if an optimal environment model is learned quickly.

6 Related Work

Sutton and Barto [29] describe TD(λ), a family of Q-learning variants in which targets from multiple timesteps are merged via exponential decay.
STEVE is similar in that it also computes a weighted average between targets. However, our approach is significantly more powerful because it adapts the weights to the specific characteristics of each individual rollout, rather than being constant between examples and throughout training. Our approach can be thought of as a generalization of TD(λ), in that the two approaches are equivalent in the specific case where the overall uncertainty grows exponentially at rate λ at every timestep.

Munos et al. [24] propose Retrace(λ), a low-variance method for off-policy Q-learning. Retrace(λ) is an off-policy correction method, so it learns from n-step off-policy data by multiplying each term of the loss by a correction coefficient, the trace, in order to re-weight the data distribution to look more like the on-policy distribution. Specifically, at each timestep, Retrace(λ) updates the coefficient for that term by multiplying it by λ · min(1, π(a_s | x_s) / μ(a_s | x_s)). Similarly to TD(λ), the λ parameter corresponds to an exponential decay of the weighting of potential targets. STEVE approximates this weighting in a more complex way, and additionally learns a predictive model of the environment (under which on-policy rollouts are possible) instead of using off-policy correction terms to re-weight real off-policy rollouts.

Figure 6: Average model usage for STEVE on each environment. The y-axis represents the amount of probability mass assigned to weights that were not w_0, i.e.
the probability mass assigned to candidate targets that include at least one step of model rollout.

Heess et al. [15] describe stochastic value gradient (SVG) methods, which are a general family of hybrid model-based/model-free control algorithms. By re-parameterizing distributions to separate out the noise, SVG is able to learn stochastic continuous control policies in stochastic environments. STEVE currently operates only with deterministic policies and environments, but this is a promising direction for future work.

Kurutach et al. [20] propose model-ensemble trust-region policy optimization (ME-TRPO), which is motivated similarly to this work in that they also propose an algorithm which uses an ensemble of models to mitigate the deleterious effects of model bias. However, the algorithm is quite different. ME-TRPO is a purely model-based policy-gradient approach, and uses the ensemble to avoid overfitting to any one model. In contrast, STEVE interpolates between model-free and model-based estimates, uses a value-estimation approach, and uses the ensemble to explicitly estimate uncertainty.

Kalweit and Boedecker [17] train on a mix of real and imagined rollouts, and adjust the ratio over the course of training by tying it to the variance of the Q-function. Similarly to our work, this variance is computed via an ensemble. However, they do not adapt to the uncertainty of individual estimates, only the overall ratio of real to imagined data.
Additionally, they do not take into account model bias or uncertainty in model predictions.

Weber et al. [31] use rollouts generated by the dynamics model as inputs to the policy function, by "summarizing" the outputs of the rollouts with a deep neural network; they call this approach imagination-augmented agents (I2A). This second network allows the algorithm to implicitly calculate uncertainty over various parts of the rollout and use that information when making its decision. However, I2A has only been evaluated on discrete domains. Additionally, the lack of explicit model use likely tempers the sample-efficiency benefits gained relative to more traditional model-based learning.

Gal et al. [11] use a deep neural network in combination with the PILCO algorithm [4] to do sample-efficient reinforcement learning. They demonstrate good performance on the continuous-control task of cartpole swing-up. They model uncertainty in the learned neural dynamics function using dropout as a Bayesian approximation, and provide evidence that maintaining these uncertainty estimates is very important for model-based reinforcement learning.

Depeweg et al. [5] use a Bayesian neural network as the environment model in a policy search setting, learning a policy purely from imagined rollouts. This work also demonstrates that modeling uncertainty is important for model-based reinforcement learning with neural network models, and that uncertainty-aware models can escape many common pitfalls.

Gu et al. [12] propose a continuous variant of Q-learning known as normalized advantage functions (NAF), and show that learning using NAF can be accelerated by using a model-based component. They use a variant of Dyna-Q [28], augmenting the experience available to the model-free learner with imaginary on-policy data generated via environment rollouts.
They use an iLQG controller and a learned locally-linear model to plan over small, easily modeled regions of the environment, but find that using more complex neural network models of the environment can introduce errors.

Thomas et al. [30] define the Ω-return, an alternative to the λ-return that accounts for the variance of, and correlations between, predicted returns at multiple timesteps. Similarly to STEVE, the target used is an unbiased linear combination of returns with minimum variance. However, the Ω-return is not directly computable in non-tabular state spaces, and the method performs n-step off-policy learning rather than learning a predictive model of the environment. Drawing a theoretical connection between the STEVE algorithm and the Ω-return is an interesting potential direction for future work.

7 Conclusion

In this work, we demonstrated that STEVE, an uncertainty-aware approach for merging model-free and model-based reinforcement learning, outperforms model-free approaches while reducing sample complexity by an order of magnitude on several challenging tasks. We believe that this is a strong step towards enabling RL for practical, real-world applications. Since submitting this manuscript for publication, we have further explored the relationship between STEVE and recent work on overestimation bias [9], and found evidence that STEVE may help to reduce this bias. Other future directions include exploring more complex world models for various tasks, as well as comparing various techniques for calculating uncertainty and estimating bias.

Acknowledgments

The authors would like to thank the following individuals for their valuable insights and discussion: David Ha, Prajit Ramachandran, Tuomas Haarnoja, Dustin Tran, Matt Johnson, Matt Hoffman, Ishaan Gulrajani, and Sergey Levine.
Also, we would like to thank Jascha Sohl-Dickstein, Joseph Antognini, Shane Gu, and Samy Bengio for their feedback during the writing process, and Erwin Coumans for his help with the PyBullet environments. Finally, we would like to thank our anonymous reviewers for their insightful suggestions.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap. Distributional policy gradients. In International Conference on Learning Representations, 2018.

[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[4] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.

[5] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. 2016.

[6] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, 2018.

[7] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.

[8] J. Fleiss. Review papers: The statistical basis of meta-analysis. Statistical Methods in Medical Research, 2(2):121–145, 1993.

[9] S. Fujimoto, H. van Hoof, and D. Meger.
Addressing function approximation error in actor-critic methods. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/fujimoto18a.html.

[10] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[11] Y. Gal, R. McAllister, and C. E. Rasmussen. Improving PILCO with Bayesian neural network dynamics models.

[12] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.

[13] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. International Conference on Learning Representations, 2017.

[14] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.

[15] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.

[16] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.

[17] G. Kalweit and J. Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pages 195–206, 2017.

[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.
International Conference on Learning Representations, 2015.

[19] O. Klimov and J. Schulman. Roboschool. https://github.com/openai/roboschool.

[20] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

[21] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.

[22] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.

[24] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.

[25] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[28] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.

[29] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1.
MIT Press, Cambridge, 1998.

[30] P. S. Thomas, S. Niekum, G. Theocharous, and G. Konidaris. Policy evaluation using the Ω-return. In Advances in Neural Information Processing Systems, pages 334–342, 2015.

[31] T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. 31st Conference on Neural Information Processing Systems, 2017.