{"title": "When to Trust Your Model: Model-Based Policy Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 12519, "page_last": 12530, "abstract": "Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.", "full_text": "When to Trust Your Model: Model-Based Policy Optimization\n\nMichael Janner, Justin Fu, Marvin Zhang, Sergey Levine\nUniversity of California, Berkeley\n{janner, justinjfu, marvin, svlevine}@eecs.berkeley.edu\n\nAbstract\n\nDesigning effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically.
We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.\n\n1 Introduction\n\nReinforcement learning algorithms generally fall into one of two categories: model-based approaches, which build a predictive model of an environment and derive a controller from it, and model-free techniques, which learn a direct mapping from states to actions. Model-free methods have shown promise as a general-purpose tool for learning complex policies from raw state inputs (Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018), but their generality comes at the cost of efficiency. When dealing with real-world physical systems, for which data collection can be an arduous process, model-based approaches are appealing due to their comparatively fast learning.
However, model accuracy acts as a bottleneck to policy quality, often causing model-based approaches to perform worse asymptotically than their model-free counterparts.\n\nIn this paper, we study how to most effectively use a predictive model for policy optimization. We first formulate and analyze a class of model-based reinforcement learning algorithms with improvement guarantees. Although there has been recent interest in monotonic improvement of model-based reinforcement learning algorithms (Sun et al., 2018; Luo et al., 2019), most commonly used model-based approaches lack the improvement guarantees that underpin many model-free methods (Schulman et al., 2015). While it is possible to apply analogous techniques to the study of model-based methods to achieve similar guarantees, it is more difficult to use such analysis to justify model usage in the first place due to pessimistic bounds on model error. However, we show that more realistic model error rates derived empirically allow us to modify this analysis to provide a more reasonable tradeoff on model usage.\n\nOur main contribution is a practical algorithm built on these insights, which we call model-based policy optimization (MBPO), that makes limited use of a predictive model to achieve pronounced improvements in performance compared to other model-based approaches. More specifically, we disentangle the task horizon and model horizon by querying the model only for short rollouts. We empirically demonstrate that a large amount of these short model-generated rollouts can allow a policy optimization algorithm to learn substantially faster than recent model-based alternatives while retaining the asymptotic performance of the most competitive model-free algorithms.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
We also show that MBPO does not suffer from the same pitfalls as prior model-based approaches, avoiding model exploitation and failure on long-horizon tasks. Finally, we empirically investigate different strategies for model usage, supporting the conclusion that careful use of short model-based rollouts provides the most benefit to a reinforcement learning algorithm.\n\n2 Related work\n\nModel-based reinforcement learning methods are promising candidates for real-world sequential decision-making problems due to their data efficiency (Kaelbling et al., 1996). Gaussian processes and time-varying linear dynamical systems provide excellent performance in the low-data regime (Deisenroth & Rasmussen, 2011; Levine & Koltun, 2013; Kumar et al., 2016). Neural network predictive models (Draeger et al., 1995; Gal et al., 2016; Depeweg et al., 2016; Nagabandi et al., 2018) are appealing because they allow for algorithms that combine the sample efficiency of a model-based approach with the asymptotic performance of high-capacity function approximators, even in domains with high-dimensional observations (Oh et al., 2015; Ebert et al., 2018; Kaiser et al., 2019). Our work uses an ensemble of probabilistic networks, as in Chua et al. (2018), although our model is employed to learn a policy rather than in the context of a receding-horizon planning routine.\n\nLearned models may be incorporated into otherwise model-free methods for improvements in data efficiency. For example, a model-free policy can be used as an action proposal distribution within a model-based planner (Piché et al., 2019). Conversely, model rollouts may be used to provide extra training examples for a Q-function (Sutton, 1990), to improve the target value estimates of existing data points (Feinberg et al., 2018), or to provide additional context to a policy (Du & Narasimhan, 2019).
However, the performance of such approaches rapidly degrades with increasing model error (Gu et al., 2016), motivating work that interpolates between different rollout lengths (Buckman et al., 2018), tunes the ratio of real to model-generated data (Kalweit & Boedecker, 2017), or does not rely on model predictions (Heess et al., 2015). Our approach similarly tunes model usage during policy optimization, but we show that justifying non-negligible model usage during most points in training requires consideration of the model's ability to generalize outside of its training distribution.\n\nPrior methods have also explored incorporating computation that resembles model-based planning but without constraining the intermediate predictions of the planner to match plausible environment observations (Tamar et al., 2016; Racanière et al., 2017; Oh et al., 2017; Silver et al., 2017). While such methods can reach asymptotic performance on par with model-free approaches, they may not benefit from the sample efficiency of model-based methods, as they forgo the extra supervision used in standard model-based methods.\n\nThe bottleneck in scaling model-based approaches to complex tasks often lies in learning reliable predictive models of high-dimensional dynamics (Atkeson & Schaal, 1997). While ground-truth models are most effective when queried for long horizons (Holland et al., 2018), inaccuracies in learned models tend to make long rollouts unreliable. Ensembles have been shown to be effective in preventing a policy or planning procedure from exploiting such inaccuracies (Rajeswaran et al., 2017; Kurutach et al., 2018; Clavera et al., 2018; Chua et al., 2018). Alternatively, a model may also be trained on its own outputs to avoid compounding error from multi-step predictions (Talvitie, 2014, 2016) or to predict many timesteps into the future (Whitney & Fergus, 2018).
We demonstrate that a combination of model ensembles with short model rollouts is sufficient to prevent model exploitation.\n\nTheoretical analysis of model-based reinforcement learning algorithms has been considered by Sun et al. (2018) and Luo et al. (2019), who bound the discrepancy between returns under a model and those in the real environment of interest. Their approaches enforce a trust region around a reference policy, whereas we do not constrain the policy but instead consider rollout length based on estimated model generalization capacity. Alternate analyses have been carried out by incorporating the structure of the value function into the model learning (Farahmand et al., 2017) or by regularizing the model by controlling its Lipschitz constant (Asadi et al., 2018). Prior work has also constructed complexity bounds for model-based approaches in the tabular setting (Szita & Szepesvári, 2010) and for the linear quadratic regulator (Dean et al., 2017), whereas we consider general non-linear systems.\n\nAlgorithm 1 Monotonic Model-Based Policy Optimization\n1: Initialize policy π(a|s), predictive model p_θ(s', r | s, a), empty dataset D.\n2: for N epochs do\n3:   Collect data with π in real environment: D = D ∪ {(s_i, a_i, s'_i, r_i)}_i\n4:   Train model p_θ on dataset D via maximum likelihood: θ ← argmax_θ E_D[log p_θ(s', r | s, a)]\n5:   Optimize policy under predictive model: π ← argmax_{π'} η̂[π'] − C(ε_m, ε_π)\n\n3 Background\n\nWe consider a Markov decision process (MDP), defined by the tuple (S, A, p, r, γ, ρ_0). S and A are the state and action spaces, respectively, and γ ∈ (0, 1) is the discount factor. The dynamics or transition distribution are denoted as p(s'|s, a), the initial state distribution as ρ_0(s), and the reward function as r(s, a).
The goal of reinforcement learning is to find the optimal policy π* that maximizes the expected sum of discounted rewards, denoted by η:\n\nπ* = argmax_π η[π] = argmax_π E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ].\n\nThe dynamics p(s'|s, a) are assumed to be unknown. Model-based reinforcement learning methods aim to construct a model of the transition distribution, p_θ(s'|s, a), using data collected from interaction with the MDP, typically using supervised learning. We additionally assume that the reward function has unknown form, and predict r as a learned function of s and a.\n\n4 Monotonic improvement with model bias\n\nIn this section, we first lay out a general recipe for MBPO with monotonic improvement. This general recipe resembles or subsumes several prior algorithms and provides us with a concrete framework that is amenable to theoretical analysis. Described generically in Algorithm 1, MBPO optimizes a policy under a learned model, collects data under the updated policy, and uses that data to train a new model. While conceptually simple, the performance of MBPO can be difficult to understand; errors in the model can be exploited during policy optimization, resulting in large discrepancies between the predicted returns of the policy under the model and under the true dynamics.\n\n4.1 Monotonic model-based improvement\n\nOur goal is to outline a principled framework in which we can provide performance guarantees for model-based algorithms. To show monotonic improvement for a model-based method, we wish to construct a bound of the following form:\n\nη[π] ≥ η̂[π] − C.\n\nη[π] denotes the returns of the policy in the true MDP, whereas η̂[π] denotes the returns of the policy under our model.
Such a statement guarantees that, as long as we improve by at least C under the model, we can guarantee improvement on the true MDP.\n\nThe gap between true returns and model returns, C, can be expressed in terms of two error quantities of the model: generalization error due to sampling, and distribution shift due to the updated policy encountering states not seen during model training. As the model is trained with supervised learning, sample error can be quantified by standard PAC generalization bounds, which bound the difference in expected loss and empirical loss by a constant with high probability (Shalev-Shwartz & Ben-David, 2014). We denote this generalization error by ε_m = max_t E_{s∼π_{D,t}}[D_TV(p(s', r | s, a) || p_θ(s', r | s, a))], which can be estimated in practice by measuring the validation loss of the model on the time-dependent state distribution of the data-collecting policy π_D. For our analysis, we denote distribution shift by the maximum total-variation distance, max_s D_TV(π || π_D) ≤ ε_π, of the policy between iterations. In practice, we measure the KL divergence between policies, which we can relate to ε_π by Pinsker's inequality. With these two sources of error controlled (generalization by ε_m, and distribution shift by ε_π), we now present our bound:\n\nTheorem 4.1. Let the expected TV-distance between two transition distributions be bounded at each timestep by ε_m and the policy divergence be bounded by ε_π. Then the true returns and model returns of the policy are bounded as:\n\nη[π] ≥ η̂[π] − [ 2γ r_max (ε_m + 2ε_π) / (1 − γ)² + 4 r_max ε_π / (1 − γ) ],   (1)\n\nwhere the bracketed quantity is denoted C(ε_m, ε_π).\n\nProof.
See Appendix A, Theorem A.1.\n\nThis bound implies that as long as we improve the returns under the model η̂[π] by more than C(ε_m, ε_π), we can guarantee improvement under the true returns.\n\n4.2 Interpolating model-based and model-free updates\n\nTheorem 4.1 provides a useful relationship between model returns and true returns. However, it contains several issues regarding cases when the model error ε_m is high. First, there may not exist a policy such that η̂[π] − η[π_D] > C(ε_m, ε_π), in which case improvement is not guaranteed. Second, the analysis relies on running full rollouts through the model, allowing model errors to compound. This is reflected in the bound by a factor scaling quadratically with the effective horizon, 1/(1 − γ). In such cases, we can improve the algorithm by choosing to rely less on the model and instead more on real data collected from the true dynamics when the model is inaccurate.\n\nIn order to allow for dynamic adjustment between model-based and model-free rollouts, we introduce the notion of a branched rollout, in which we begin a rollout from a state under the previous policy's state distribution d_{π_D}(s) and run k steps according to π under the learned model p_θ. This branched rollout structure resembles the scheme proposed in the original Dyna algorithm (Sutton, 1990), which can be viewed as a special case of length-1 branched rollouts. Formally, we can view this as executing a nonstationary policy which begins a rollout by sampling actions from the previous policy π_D. Then, at some specified time, we switch to unrolling the trajectory under the model p_θ and current policy π for k steps. Under such a scheme, the returns can be bounded as follows:\n\nTheorem 4.2.
Given returns η^branch[π] from the k-branched rollout method,\n\nη[π] ≥ η^branch[π] − 2 r_max [ γ^(k+1) ε_π / (1 − γ)² + (γ^k + 2) ε_π / (1 − γ) + k (ε_m + 2ε_π) / (1 − γ) ].   (2)\n\nProof. See Appendix A, Theorem A.3.\n\n4.3 Model generalization in practice\n\nTheorem 4.2 would be most useful for guiding algorithm design if it could be used to determine an optimal model rollout length k. While this bound does include two competing factors, one exponentially decreasing in k and another scaling linearly with k, the values of the associated constants prevent an actual tradeoff; taken literally, this lower bound is maximized when k = 0, corresponding to not using the model at all. One limitation of the analysis is pessimistic scaling of model error ε_m with respect to policy shift ε_π, as we do not make any assumptions about the generalization capacity or smoothness properties of the model (Asadi et al., 2018).\n\nTo better determine how well we can expect our model to generalize in practice, we empirically measure how the model error under new policies increases with policy change ε_π. We train a model on the state distribution of a data-collecting policy π_D and then continue policy optimization while measuring the model's loss on all intermediate policies π during this optimization. Figure 1a shows that, as expected, the model error increases with the divergence between the current policy π and the data-collecting policy π_D. However, the rate of this increase depends on the amount of data collected by π_D. We plot the local change in model error with respect to policy change, dε_m'/dε_π, in Figure 1b. The decreasing dependence on policy shift shows that not only do models trained with more data perform better on their training distribution, but they also generalize better to nearby distributions.\n\nFigure 1: (a) We train a predictive model on the state distribution of π_D and evaluate it on policies π of varying KL-divergence from π_D without retraining. The color of each curve denotes the amount of data from π_D used to train the model corresponding to that curve. The offsets of the curves depict the expected trend of increasing training data leading to decreasing model error on the training distribution. However, we also see a decreasing influence of state distribution shift on model error with increasing training data, signifying that the model is generalizing better. (b) We measure the local change in model error versus KL-divergence of the policies at ε_π = 0 as a proxy for model generalization.\n\nThe clear trend in model error growth rate suggests a way to modify the pessimistic bounds. In the previous analysis, we assumed access only to the model error ε_m on the distribution of the most recent data-collecting policy π_D and approximated the error on the current distribution as ε_m + 2ε_π. If we can instead approximate the model error on the distribution of the current policy π, which we denote as ε_m', we may use this directly. For example, approximating ε_m' with a linear function of the policy divergence yields:\n\nε̂_m'(ε_π) ≈ ε_m + ε_π (dε_m'/dε_π),\n\nwhere dε_m'/dε_π is empirically estimated as in Figure 1. Equipped with an approximation of ε_m', the model's error on the distribution of the current policy π, we arrive at the following bound:\n\nTheorem 4.3.
Under the k-branched rollout method, using model error under the updated policy ε_m' ≥ max_t E_{s∼π_{D,t}}[D_TV(p(s'|s, a) || p_θ(s'|s, a))], we have\n\nη[π] ≥ η^branch[π] − 2 r_max [ γ^(k+1) ε_π / (1 − γ)² + γ^k ε_π / (1 − γ) + k ε_m' / (1 − γ) ].   (3)\n\nProof. See Appendix A, Theorem A.2.\n\nWhile this bound appears similar to Theorem 4.2, the important difference is that this version actually motivates model usage. More specifically, k* = argmin_k [ γ^(k+1) ε_π / (1 − γ)² + γ^k ε_π / (1 − γ) + k ε_m' / (1 − γ) ] > 0 for sufficiently low ε_m'. While this insight does not immediately suggest an algorithm design by itself, we can build on this idea to develop a method that makes limited use of truncated, but nonzero-length, model rollouts.\n\nAlgorithm 2 Model-Based Policy Optimization with Deep Reinforcement Learning\n1: Initialize policy π (with parameters φ), predictive model p_θ, environment dataset D_env, model dataset D_model\n2: for N epochs do\n3:   Train model p_θ on D_env via maximum likelihood\n4:   for E steps do\n5:     Take action in environment according to π; add to D_env\n6:     for M model rollouts do\n7:       Sample s_t uniformly from D_env\n8:       Perform k-step model rollout starting from s_t using policy π; add to D_model\n9:     for G gradient updates do\n10:      Update policy parameters on model data: φ ← φ − λ_π ∇̂_φ J_π(φ, D_model)\n\n5 Model-based policy optimization with deep reinforcement learning\n\nWe now present a practical model-based reinforcement learning algorithm based on the derivation in the previous section.
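The rollout-length tradeoff behind Theorem 4.3 can be checked with a small numerical sketch. The constants below (ε_π, ε_m', γ, r_max) are illustrative stand-ins, not values from the paper, and the helper names are hypothetical:\n\n```python\ndef branch_bound(k, eps_pi, eps_m_prime, gamma, r_max):\n    \"\"\"Gap term of the Theorem 4.3 bound as a function of rollout length k.\"\"\"\n    return 2 * r_max * (\n        gamma ** (k + 1) * eps_pi / (1 - gamma) ** 2  # compounding term, shrinks with k\n        + gamma ** k * eps_pi / (1 - gamma)           # policy-shift term, shrinks with k\n        + k * eps_m_prime / (1 - gamma)               # model-error term, grows with k\n    )\n\ndef optimal_rollout_length(eps_pi, eps_m_prime, gamma=0.99, r_max=1.0, k_max=300):\n    \"\"\"k* = argmin_k of the gap; positive only when eps_m_prime is small enough.\"\"\"\n    return min(range(k_max + 1),\n               key=lambda k: branch_bound(k, eps_pi, eps_m_prime, gamma, r_max))\n\n# With low model error the bound favors a nonzero rollout length...\nk_low = optimal_rollout_length(eps_pi=0.1, eps_m_prime=0.05)\n# ...while high model error pushes the optimum back toward k = 0.\nk_high = optimal_rollout_length(eps_pi=0.1, eps_m_prime=0.5)\n```\n\nUnder these stand-in constants, k_low is a small positive horizon while k_high collapses to zero, mirroring the claim that k* > 0 holds only for sufficiently low ε_m'.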
Instantiating Algorithm 1 amounts to specifying three design decisions: (1) the parametrization of the model p_θ, (2) how the policy π is optimized given model samples, and (3) how to query the model for samples for policy optimization.\n\nPredictive model. In our work, we use a bootstrap ensemble of dynamics models {p_θ^1, ..., p_θ^B}. Each member of the ensemble is a probabilistic neural network whose outputs parametrize a Gaussian distribution with diagonal covariance: p_θ^i(s_{t+1}, r | s_t, a_t) = N(μ_θ^i(s_t, a_t), Σ_θ^i(s_t, a_t)). Individual probabilistic models capture aleatoric uncertainty, or the noise in the outputs with respect to the inputs. The bootstrapping procedure accounts for epistemic uncertainty, or uncertainty in the model parameters, which is crucial in regions where data is scarce and the model can be exploited by policy optimization. Chua et al. (2018) demonstrate that a proper handling of both of these uncertainties allows for asymptotically competitive model-based learning. To generate a prediction from the ensemble, we simply select a model uniformly at random, allowing for different transitions along a single model rollout to be sampled from different dynamics models.\n\nPolicy optimization. We adopt soft actor-critic (SAC) (Haarnoja et al., 2018) as our policy optimization algorithm. SAC alternates between a policy evaluation step, which estimates Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a ] using the Bellman backup operator, and a policy improvement step, which trains an actor π by minimizing the expected KL-divergence J_π(φ, D) = E_{s_t∼D}[ D_KL(π || exp{Q^π − V^π}) ].\n\nModel usage. Many recent model-based algorithms have focused on the setting in which model rollouts begin from the initial state distribution (Kurutach et al., 2018; Clavera et al., 2018).
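The per-step random ensemble selection described above, combined with branching from replay-buffer states (Section 4.2), can be sketched as follows. The dynamics members, policy, and buffer here are toy 1-D stand-ins, not the networks used in MBPO:\n\n```python\nimport random\n\nrng = random.Random(0)\n\ndef branched_rollouts(buffer_states, ensemble, policy, k, m):\n    \"\"\"Generate m k-step rollouts, each branched from a replay-buffer state.\n    Every step queries one ensemble member chosen uniformly at random, so a\n    single rollout mixes predictions from different dynamics models.\"\"\"\n    transitions = []\n    for _ in range(m):\n        s = rng.choice(buffer_states)        # branch point from the buffer\n        for _ in range(k):\n            a = policy(s)\n            step = rng.choice(ensemble)      # random ensemble member per step\n            s_next = step(s, a)\n            transitions.append((s, a, s_next))\n            s = s_next\n    return transitions\n\n# Toy stand-ins: three 1-D linear \"dynamics models\" with perturbed coefficients.\nensemble = [lambda s, a, c=c: c * s + a for c in (0.9, 1.0, 1.1)]\npolicy = lambda s: -0.5 * s                      # deterministic stand-in policy\nbuffer_states = [i / 10 for i in range(-10, 11)] # stand-in replay buffer\nmodel_data = branched_rollouts(buffer_states, ensemble, policy, k=3, m=5)\n```\n\nThe resulting transitions would populate D_model for policy updates; in MBPO the stand-in functions are replaced by the learned probabilistic networks and the SAC actor.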
While this may be a more faithful interpretation of Algorithm 1, as it is optimizing a policy purely under the state distribution of the model, this approach entangles the model rollout length with the task horizon. Because compounding model errors make extended rollouts difficult, these works evaluate on truncated versions of benchmarks. The branching strategy described in Section 4.2, in which model rollouts begin from the state distribution of a different policy under the true environment dynamics, effectively relieves this limitation. In practice, branching replaces few long rollouts from the initial state distribution with many short rollouts starting from replay buffer states.\n\nA practical implementation of MBPO is described in Algorithm 2.¹ The primary differences from the general formulation in Algorithm 1 are k-length rollouts from replay buffer states in the place of optimization under the model's state distribution and a fixed number of policy update steps in the place of an intractable argmax. Even when the horizon length k is short, we can perform many such short rollouts to yield a large set of model samples for policy optimization. This large set allows us to take many more policy gradient steps per environment sample (between 20 and 40) than is typically stable in model-free algorithms. A full listing of the hyperparameters included in Algorithm 2 for all evaluation environments is given in Appendix C.\n\n¹When SAC is used as the policy optimization algorithm, we must also perform gradient updates on the parameters of the Q-functions, but we omit these updates for clarity.\n\nFigure 2: Training curves of MBPO and five baselines on continuous control benchmarks. Solid curves depict the mean of five trials and shaded regions correspond to standard deviation among trials. MBPO has asymptotic performance similar to the best model-free algorithms while being faster than the model-based baselines.
For example, MBPO's performance on the Ant task at 300 thousand steps matches that of SAC at 3 million steps. We evaluated all algorithms on the standard 1000-step versions of the benchmarks.\n\n6 Experiments\n\nOur experimental evaluation aims to study two primary questions: (1) How well does MBPO perform on benchmark reinforcement learning tasks, compared to state-of-the-art model-based and model-free algorithms? (2) What conclusions can we draw about appropriate model usage?\n\n6.1 Comparative evaluation\n\nIn our comparisons, we aim to understand both how well our method compares to state-of-the-art model-based and model-free methods and how our design choices affect performance. We compare to two state-of-the-art model-free methods, SAC (Haarnoja et al., 2018) and PPO (Schulman et al., 2017), both to establish a baseline and, in the case of SAC, to measure the benefit of incorporating a model, as our model-based method uses SAC for policy learning as well. For model-based methods, we compare to PETS (Chua et al., 2018), which does not perform explicit policy learning, but directly uses the model for planning; STEVE (Buckman et al., 2018), which also uses short-horizon model-based rollouts, but incorporates data from these rollouts into value estimation rather than policy learning; and SLBO (Luo et al., 2019), a model-based algorithm with performance guarantees that performs model rollouts from the initial state distribution. These comparisons represent the state-of-the-art in both model-free and model-based reinforcement learning.\n\nWe evaluate MBPO and these baselines on a set of MuJoCo continuous control tasks (Todorov et al., 2012) commonly used to evaluate model-free algorithms. Note that some recent works in model-based reinforcement learning have used modified versions of these benchmarks, where the task horizon is chosen to be shorter so as to simplify the modeling problem (Kurutach et al., 2018; Clavera et al., 2018).
We use the standard full-length version of these tasks. MBPO also does not assume access to privileged information in the form of fully observable states or the reward function for offline evaluation.\n\nFigure 3: No model: SAC run without model data but with the same range of gradient updates per environment step (G) as MBPO on the Hopper task. Rollout length: While we find that increasing rollout length k over time yields the best performance for MBPO (Appendix C), single-step rollouts provide a baseline that is difficult to beat. Value expansion: We implement H-step model value expansion from Feinberg et al. (2018) on top of SAC for a more informative comparison. We also find in the context of value expansion that single-step model rollouts are surprisingly competitive.\n\nFigure 2 shows the learning curves for all methods, along with the asymptotic performance of algorithms which do not converge in the region shown. These results show that MBPO learns substantially faster, an order of magnitude faster on some tasks, than prior model-free methods, while attaining comparable final performance. For example, MBPO's performance on the Ant task at 300 thousand steps is the same as that of SAC at 3 million steps. On Hopper and Walker2d, MBPO requires the equivalent of 14 and 40 minutes, respectively, of simulation time if the simulator were running in real time. More crucially, MBPO learns on some of the higher-dimensional tasks, such as Ant, which pose problems for purely model-based approaches such as PETS.\n\n6.2 Design evaluation\n\nWe next make ablations and modifications to our method to better understand why MBPO outperforms prior approaches. Results for the following experiments are shown in Figure 3.\n\nNo model.
The ratio between the number of gradient updates and environment samples, G, is much higher in MBPO than in comparable model-free algorithms because the model-generated data reduces the risk of overfitting. We run standard SAC with similarly high ratios, but without model data, to ensure that the model is actually helpful. While increasing the number of gradient updates per sample taken in SAC does marginally speed up learning, we cannot match the sample efficiency of our method without using the model. For hyperparameter settings of MBPO, see Appendix C.\n\nRollout horizon. While the best-performing rollout length schedule on the Hopper task linearly increases from k = 1 to 15 (Appendix C), we find that fixing the rollout length at 1 for the duration of training retains much of the benefit of our model-based method. We also find that our model is accurate enough for 200-step rollouts, although this performs worse than shorter values when used for policy optimization. 500-step rollouts are too inaccurate for effective learning. While more precise fine-tuning is always possible, augmenting policy training data with single-step model rollouts provides a baseline that is surprisingly difficult to beat and outperforms recent methods which perform longer rollouts from the initial state distribution. This result agrees with our theoretical analysis, which prescribes short model-based rollouts to mitigate compounding modeling errors.\n\nValue expansion. An alternative to using model rollouts for direct training of a policy is to improve the quality of target values of samples collected from the real environment. This technique is used in model-based value expansion (MVE) (Feinberg et al., 2018) and STEVE (Buckman et al., 2018). While MBPO outperforms both of these approaches, there are other confounding factors making a head-to-head comparison difficult, such as the choice of policy learning algorithm.
To better determine the relationship between training on model-generated data and using model predictions to improve target values, we augment SAC with the H-step Q-target objective:\n\n(1/H) Σ_{t=1}^{H−1} ( Q(ŝ_t, â_t) − ( Σ_{k=t}^{H−1} γ^{k−t} r̂_k + γ^{H−t} Q(ŝ_H, â_H) ) )²,\n\nin which ŝ_t and r̂_t are model predictions and â_t ∼ π(a_t | ŝ_t). We refer the reader to Feinberg et al. (2018) for further discussion of this approach. We verify that SAC also benefits from improved target values, and similar to our conclusions from MBPO, single-step model rollouts (H = 1) provide a surprisingly effective baseline. While model-generated data augmentation and value expansion are in principle complementary approaches, preliminary experiments did not show improvements to MBPO by using improved target value estimates.\n\nFigure 4: a) A 450-step hopping sequence performed in the real environment, with the trajectory of the body's joints traced through space. b) The same action sequence rolled out under the model 1000 times, with shaded regions corresponding to one standard deviation away from the mean prediction. The growing uncertainty and deterioration of a recognizable sinusoidal motion underscore the accumulation of model errors. c) Cumulative returns of the same policy under the model and actual environment dynamics reveal that the policy is not exploiting the learned model. Thin blue lines reflect individual model rollouts and the thick blue line is their mean.\n\nModel exploitation. We analyze the problem of "model exploitation," which a number of recent works have raised as a primary challenge in model-based reinforcement learning (Rajeswaran et al., 2017; Clavera et al., 2018; Kurutach et al., 2018).
We plot empirical returns of a trained policy on the\nHopper task under both the real environment and the model in Figure 4 (c) and \ufb01nd, surprisingly, that\nthey are highly correlated, indicating that a policy trained on model-predicted transitions may not\nexploit the model at all if the rollouts are suf\ufb01ciently short. This is likely because short rollouts are\nmore likely to re\ufb02ect the real dynamics (Figure 4 a-b), reducing the opportunities for policies to rely\non inaccuracies of model predictions. While the models for other environments are not necessarily as\naccurate as that for Hopper, we \ufb01nd across the board that model returns tend to underestimate real\nenvironment returns in MBPO.\n\n7 Discussion\nWe have investigated the role of model usage in policy optimization procedures through both a\ntheoretical and empirical lens. We have shown that, while it is possible to formulate model-based\nreinforcement learning algorithms with monotonic improvement guarantees, such an analysis cannot\nnecessarily be used to motivate using a model in the \ufb01rst place. However, an empirical study of\nmodel generalization shows that predictive models can indeed perform well outside of their training\ndistribution. Incorporating a linear approximation of model generalization into the analysis gives\nrise to a more reasonable tradeoff that does in fact justify using the model for truncated rollouts.\nThe algorithm stemming from this insight, MBPO, has asymptotic performance rivaling the best\nmodel-free algorithms, learns substantially faster than prior model-free or model-based methods,\nand scales to long horizon tasks that often cause model-based methods to fail. 
We experimentally investigate the tradeoffs associated with our design decisions, and find that model rollouts as short as a single step can provide pronounced benefits to policy optimization procedures.

Acknowledgements
We thank Anusha Nagabandi, Michael Chang, Chelsea Finn, Pulkit Agrawal, and Jacob Steinhardt for insightful discussions; Vitchyr Pong, Alex Lee, Kyle Hsu, and Aviral Kumar for feedback on an early draft of the paper; and Kristian Hartikainen for help with the SAC baseline. This research was partly supported by the NSF via IIS-1651843, IIS-1700697, and IIS-1700696, the Office of Naval Research, ARL DCIST CRA W911NF-17-2-0181, and computational resource donations from Google. M.J. is supported by fellowships from the National Science Foundation and the Open Philanthropy Project. M.Z. is supported by an NDSEG fellowship.

References
Asadi, K., Misra, D., and Littman, M. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, 2018.

Atkeson, C. G. and Schaal, S. Learning tasks from a single demonstration. In International Conference on Robotics and Automation, 1997.

Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, 2018.

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.

Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, 2018.

Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

Deisenroth, M. and Rasmussen, C.
E. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.

Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. In International Conference on Learning Representations, 2016.

Draeger, A., Engell, S., and Ranke, H. Model predictive control using neural networks. IEEE Control Systems Magazine, 1995.

Du, Y. and Narasimhan, K. Task-agnostic dynamics priors for deep reinforcement learning. In International Conference on Machine Learning, 2019.

Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A. X., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.

Farahmand, A.-M., Barreto, A., and Nikovski, D. Value-aware loss function for model-based reinforcement learning. In International Conference on Artificial Intelligence and Statistics, 2017.

Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. Model-based value estimation for efficient model-free reinforcement learning. In International Conference on Machine Learning, 2018.

Gal, Y., McAllister, R., and Rasmussen, C. E. Improving PILCO with Bayesian neural network dynamics models. In ICML Workshop on Data-Efficient Machine Learning, 2016.

Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, 2016.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T. Learning continuous control policies by stochastic value gradients.
In Advances in Neural Information Processing Systems, 2015.

Holland, G. Z., Talvitie, E. J., and Bowling, M. The effect of planning shape on dyna-style planning in high-dimensional state spaces. arXiv preprint arXiv:1806.01825, 2018.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Sepassi, R., Tucker, G., and Michalewski, H. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.

Kalweit, G. and Boedecker, J. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, 2017.

Kumar, V., Todorov, E., and Levine, S. Optimal control with learned local models: Application to dexterous manipulation. In International Conference on Robotics and Automation, 2016.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning, 2013.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations, 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A.
K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 2015.

Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation, 2018.

Oh, J., Guo, X., Lee, H., Lewis, R., and Singh, S. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, 2015.

Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, 2017.

Piché, A., Thomas, V., Ibrahim, C., Bengio, Y., and Pal, C. Probabilistic planning with sequential Monte Carlo methods. In International Conference on Learning Representations, 2019.

Racanière, S., Weber, T., Reichert, D., Buesing, L., Guez, A., Jimenez Rezende, D., Puigdomènech Badia, A., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Hassabis, D., Silver, D., and Wierstra, D. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2017.

Rajeswaran, A., Ghotra, S., Levine, S., and Ravindran, B. EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations, 2017.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shalev-Shwartz, S. and Ben-David, S.
Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. The predictron: End-to-end learning and planning. In International Conference on Machine Learning, 2017.

Sun, W., Gordon, G. J., Boots, B., and Bagnell, J. Dual policy iteration. In Advances in Neural Information Processing Systems, 2018.

Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, 1990.

Szita, I. and Szepesvari, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In International Conference on Machine Learning, 2010.

Talvitie, E. Model regularization for stable sample rollouts. In Conference on Uncertainty in Artificial Intelligence, 2014.

Talvitie, E. Self-correcting models for model-based reinforcement learning. In AAAI Conference on Artificial Intelligence, 2016.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems, 2016.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.

Whitney, W. and Fergus, R. Understanding the asymptotic performance of model-based RL methods. 2018.