{"title": "Differentiable MPC for End-to-end Planning and Control", "book": "Advances in Neural Information Processing Systems", "page_first": 8289, "page_last": 8300, "abstract": "We present foundations for using Model Predictive Control (MPC) as a differentiable policy class for reinforcement learning. This provides one way of leveraging and combining the advantages of model-free and model-based approaches. Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller. Using this strategy, we are able to learn the cost and dynamics of a controller via end-to-end learning. Our experiments focus on imitation learning in the pendulum and cartpole domains, where we learn the cost and dynamics terms of an MPC policy class. We show that our MPC policies are significantly more data-efficient than a generic neural network and that our method is superior to traditional system identification in a setting where the expert is unrealizable.", "full_text": "Differentiable MPC for End-to-end Planning and Control\n\nBrandon Amos1\n\nIvan Dario Jimenez Rodriguez2\n\nJacob Sacks2\n\nByron Boots2\n\nJ. Zico Kolter13\n\n1Carnegie Mellon University\n\n2Georgia Tech\n\n3Bosch Center for AI\n\nAbstract\n\nWe present foundations for using Model Predictive Control (MPC) as a differen-\ntiable policy class for reinforcement learning in continuous state and action spaces.\nThis provides one way of leveraging and combining the advantages of model-free\nand model-based approaches. Speci\ufb01cally, we differentiate through MPC by using\nthe KKT conditions of the convex approximation at a \ufb01xed point of the controller.\nUsing this strategy, we are able to learn the cost and dynamics of a controller via\nend-to-end learning. Our experiments focus on imitation learning in the pendulum\nand cartpole domains, where we learn the cost and dynamics terms of an MPC\npolicy class. 
We show that our MPC policies are significantly more data-efficient than a generic neural network and that our method is superior to traditional system identification in a setting where the expert is unrealizable.

1 Introduction

Model-free reinforcement learning has achieved state-of-the-art results in many challenging domains. However, these methods learn black-box control policies and typically suffer from poor sample complexity and generalization. Alternatively, model-based approaches seek to model the environment the agent is interacting in. Many model-based approaches utilize Model Predictive Control (MPC) to perform complex control tasks [González et al., 2011, Lenz et al., 2015, Liniger et al., 2014, Kamel et al., 2015, Erez et al., 2012, Alexis et al., 2011, Bouffard et al., 2012, Neunert et al., 2016]. MPC leverages a predictive model of the controlled system and solves an optimization problem online in a receding horizon fashion to produce a sequence of control actions. Usually the first control action is applied to the system, after which the optimization problem is solved again for the next time step. Formally, MPC requires that at each time step we solve the optimization problem:

    argmin_{x_{1:T} ∈ X, u_{1:T} ∈ U} Σ_{t=1}^T C_t(x_t, u_t)   subject to   x_{t+1} = f(x_t, u_t), x_1 = x_init,    (1)

where x_t, u_t are the state and control at time t, X and U are constraints on valid states and controls, C_t : X × U → R is a (potentially time-varying) cost function, f : X × U → X is a dynamics model, and x_init is the initial state of the system. The optimization problem in Equation (1) can be efficiently solved in many ways, for example with the finite-horizon iterative Linear Quadratic Regulator (iLQR) algorithm [Li and Todorov, 2004].
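For intuition, the receding-horizon loop around Equation (1) can be sketched for the linear-quadratic special case, where each subproblem is solved exactly by a backward Riccati recursion; the double-integrator system and all constants below are illustrative choices, not from the paper:

```python
import numpy as np

def lqr_gains(A, B, Q, R, T):
    """Finite-horizon discrete-time LQR via the backward Riccati recursion.
    Returns feedback gains K_0..K_{T-1} with u_t = -K_t x_t."""
    P, gains = Q.copy(), []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

def mpc_step(x, A, B, Q, R, T):
    """One receding-horizon step: solve the T-step problem from the current
    state and apply only the first control, as described in the text."""
    K0 = lqr_gains(A, B, Q, R, T)[0]
    u = -K0 @ x
    return u, A @ x + B @ u

# Hypothetical double-integrator system; all values are illustrative.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), 0.1 * np.eye(1)

x = np.array([1.0, 0.0])
for _ in range(50):
    u, x = mpc_step(x, A, B, Q, R, T=10)
```

Only the first control of each re-solved T-step plan is applied before re-planning from the new state, mirroring the receding-horizon scheme described above.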
Although these techniques are widely used in control domains, much work in deep reinforcement learning or imitation learning opts instead to use a much simpler policy class such as a linear function or neural network. The advantage of these policy classes is that they are differentiable and the loss can be directly optimized with respect to them, while it is typically not possible to do full end-to-end learning with model-based approaches.

In this paper, we consider the task of learning MPC-based policies in an end-to-end fashion, illustrated in Figure 1. That is, we treat MPC as a generic policy class u = π(x_init; C, f) parameterized by some representations of the cost C and dynamics model f. By differentiating through the optimization problem, we can learn the costs and dynamics model to perform a desired task. This is in contrast to regressing on collected dynamics or trajectory rollout data and learning each component in isolation, and comes with the typical advantages of end-to-end learning (the ability to train directly based upon the task loss of interest, the ability to "specialize" parameters for a given task, etc.).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Illustration of our contribution: A learnable MPC module that can be integrated into a larger end-to-end reinforcement learning pipeline. Our method allows the controller to be updated with gradient information directly from the task loss. (Diagram: states flow through a learnable MPC module, with cost and dynamics submodules, to produce actions; the task loss is backpropagated through the policy.)

Still, efficiently differentiating through a complex policy class like MPC is challenging.
Previous work with similar aims has either simply unrolled and differentiated through a simple optimization procedure [Tamar et al., 2017] or has considered generic optimization solvers that do not scale to the size of MPC problems [Amos and Kolter, 2017]. This paper makes the following two contributions to this space. First, we provide an efficient method for analytically differentiating through an iterative non-convex optimization procedure based upon a box-constrained iterative LQR solver [Tassa et al., 2014]; in particular, we show that the analytical derivative can be computed using one additional backward pass of a modified iterative LQR solver. Second, we empirically show that in imitation learning scenarios we can recover the cost and dynamics from an MPC expert with a loss based only on the actions (and not states). In one notable experiment, we show that directly optimizing the imitation loss results in better performance than vanilla system identification.

2 Background and Related Work

Pure model-free techniques for policy search have demonstrated promising results in many domains by learning reactive policies which directly map observations to actions [Mnih et al., 2013, Oh et al., 2016, Gu et al., 2016b, Lillicrap et al., 2015, Schulman et al., 2015, 2016, Gu et al., 2016a]. Despite their success, model-free methods have many drawbacks and limitations, including a lack of interpretability, poor generalization, and a high sample complexity. Model-based methods are known to be more sample-efficient than their model-free counterparts. These methods generally rely on learning a dynamics model directly from interactions with the real system and then integrating the learned model into the control policy [Schneider, 1997, Abbeel et al., 2006, Deisenroth and Rasmussen, 2011, Heess et al., 2015, Boedecker et al., 2014].
More recent approaches use a deep network to learn low-dimensional latent state representations and associated dynamics models in this learned representation. They then apply standard trajectory optimization methods on these learned embeddings [Lenz et al., 2015, Watter et al., 2015, Levine et al., 2016]. However, these methods still require a manually specified and hand-tuned cost function, which can become even more difficult to design in a latent representation. Moreover, there is no guarantee that the learned dynamics model can accurately capture portions of the state space relevant for the task at hand.

To leverage the benefits of both approaches, there has been significant interest in combining the model-based and model-free paradigms. In particular, much attention has been dedicated to utilizing model-based priors to accelerate the model-free learning process. For instance, synthetic training data can be generated by model-based control algorithms to guide the policy search or prime a model-free policy [Sutton, 1990, Theodorou et al., 2010, Levine and Abbeel, 2014, Gu et al., 2016b, Venkatraman et al., 2016, Levine et al., 2016, Chebotar et al., 2017, Nagabandi et al., 2017, Sun et al., 2017]. Bansal et al. [2017] learn a controller and then distill it to a neural network policy which is then fine-tuned with model-free policy learning. However, this line of work usually keeps the model separate from the learned policy.

Alternatively, the policy can include an explicit planning module which leverages learned models of the system or environment, both of which are learned through model-free techniques. For example, the classic Dyna-Q algorithm [Sutton, 1990] simultaneously learns a model of the environment and uses it to plan. More recent work has explored incorporating such structure into deep networks and learning the policies in an end-to-end fashion. Tamar et al.
[2016] uses a recurrent network to predict the value function by approximating the value iteration algorithm with convolutional layers. Karkus et al. [2017] connects a dynamics model to a planning algorithm and formulates the policy as a structured recurrent network. Silver et al. [2016] and Oh et al. [2017] perform multiple rollouts using an abstract dynamics model to predict the value function. A similar approach is taken by Weber et al. [2017], but it directly predicts the next action and reward from rollouts of an explicit environment model. Farquhar et al. [2017] extends model-free approaches, such as DQN [Mnih et al., 2015] and A3C [Mnih et al., 2016], by planning with a tree-structured neural network to predict the cost-to-go. While these approaches have demonstrated impressive results in discrete state and action spaces, they are not applicable to continuous control problems.

To tackle continuous state and action spaces, Pascanu et al. [2017] propose a neural architecture which uses an abstract environmental model to plan and is trained directly from an external task loss. Pong et al. [2018] learn goal-conditioned value functions and use them to plan single or multiple steps of actions in an MPC fashion. Similarly, Pathak et al. [2018] train a goal-conditioned policy to perform rollouts in an abstract feature space but ground the policy with a loss term which corresponds to true dynamics data. The aforementioned approaches can be interpreted as a distilled optimal controller which does not separate components for the cost and dynamics. Taking this analogy further, another strategy is to differentiate through an optimal control algorithm itself. Okada et al. [2017] and Pereira et al. [2018] present a way to differentiate through path integral optimal control [Williams et al., 2016, 2017] and learn a planning policy end-to-end. Srinivas et al.
[2018] shows how to embed\ndifferentiable planning (unrolled gradient descent over actions) within a goal-directed policy. In\na similar vein, Tamar et al. [2017] differentiates through an iterative LQR (iLQR) solver [Li and\nTodorov, 2004, Xie et al., 2017, Tassa et al., 2014] to learn a cost-shaping term of\ufb02ine. This shaping\nterm enables a shorter horizon controller to approximate the behavior of a solver with a longer horizon\nto save computation during runtime.\nContributions of our paper. All of these methods require differentiating through planning proce-\ndures by explicitly \u201cunrolling\u201d the optimization algorithm itself. While this is a reasonable strategy,\nit is both memory- and computationally-expensive and challenging when unrolling through many\niterations because the time- and space-complexity of the backward pass grows linearly with the\nforward pass. In contrast, we address this issue by showing how to analytically differentiate through\nthe \ufb01xed point of a nonlinear MPC solver. Speci\ufb01cally, we compute the derivatives of an iLQR solver\nwith a single LQR step in the backward pass. This makes the learning process more computationally\ntractable while still allowing us to plan in continuous state and action spaces. 
Unlike model-free approaches, explicit cost and dynamics components can be extracted and analyzed on their own. Moreover, in contrast to pure model-based approaches, the dynamics model and cost function can be learned entirely end-to-end.

3 Differentiable LQR

Discrete-time finite-horizon LQR is a well-studied control method that optimizes a convex quadratic objective function with respect to affine state-transition dynamics from an initial system state x_init. Specifically, LQR finds the optimal nominal trajectory τ*_{1:T} = {x_t, u_t}_{1:T} by solving the optimization problem

    τ*_{1:T} = argmin_{τ_{1:T}} Σ_t ½ τ_t^T C_t τ_t + c_t^T τ_t   subject to   x_1 = x_init, x_{t+1} = F_t τ_t + f_t.    (2)

From a policy learning perspective, this can be interpreted as a module with unknown parameters θ = {C, c, F, f}, which can be integrated into a larger end-to-end learning system. The learning process involves taking derivatives of some loss function ℓ, which are then used to update the parameters. Instead of directly computing each of the individual gradients, we present an efficient way of computing the derivatives of the loss function with respect to the parameters

    ∂ℓ/∂θ = (∂ℓ/∂τ*_{1:T}) (∂τ*_{1:T}/∂θ).    (3)

By interpreting LQR from an optimization perspective [Boyd, 2008], we associate dual variables λ_{1:T} with the state constraints. The Lagrangian of the optimization problem is then given by

    L(τ, λ) = Σ_t ½ τ_t^T C_t τ_t + c_t^T τ_t + Σ_{t=0}^{T-1} λ_t^T (F_t τ_t + f_t − x_{t+1}),    (4)

where the initial constraint x_1 = x_init is represented by setting F_0 = 0 and f_0 = x_init.

Module 1 Differentiable LQR
  Input: Initial state x_init
  Parameters: θ = {C, c, F, f}    (The LQR algorithm is defined in Appendix A)
  Forward Pass:
    1: τ*_{1:T} = LQR_T(x_init; C, c, F, f)    ▷ Solve (2)
    2: Compute λ*_{1:T} with (7)
  Backward Pass:
    1: d*_{τ_{1:T}} = LQR_T(0; C, ∇_{τ*} ℓ, F, 0)    ▷ Solve (9), ideally reusing the factorizations from the forward pass
    2: Compute d*_{λ_{1:T}} with (7)
    3: Compute the derivatives of ℓ with respect to C, c, F, f, and x_init with (8)

Differentiating Equation (4) with respect to τ*_t yields

    ∇_{τ_t} L(τ*, λ*) = C_t τ*_t + c_t + F_t^T λ*_t − [λ*_{t−1}; 0] = 0.    (5)

Thus, the normal approach to solving LQR problems with dynamic Riccati recursion can be viewed as an efficient way of solving the KKT system

    K z* = −k,    (6)

where z* = (…, τ*_t, λ*_t, τ*_{t+1}, λ*_{t+1}, …), k = (…, c_t, f_t, c_{t+1}, f_{t+1}, …), and K is the symmetric block-tridiagonal KKT matrix whose diagonal blocks at time t are [C_t, F_t^T; F_t, 0] and whose off-diagonal blocks [−I, 0] couple the dynamics rows at time t to τ_{t+1}.

Given an optimal nominal trajectory τ*_{1:T}, Equation (5) shows how to compute the optimal dual variables λ* with the backward recursion

    λ*_T = C_{T,x} τ*_T + c_{T,x},    λ*_t = F_{t,x}^T λ*_{t+1} + C_{t,x} τ*_t + c_{t,x},    (7)

where C_{t,x}, c_{t,x}, and F_{t,x} are the first block-rows of C_t, c_t, and F_t, respectively. Now that we have the optimal trajectory and dual variables, we can compute the gradients of the loss with respect to the parameters. Since LQR is a constrained convex quadratic argmin, the derivatives of the loss with respect to the LQR parameters can be obtained by implicitly differentiating the KKT conditions. Applying the approach from Section 3 of Amos and Kolter [2017], the derivatives are

    ∇_{C_t} ℓ = ½ (d*_{τ_t} ⊗ τ*_t + τ*_t ⊗ d*_{τ_t})    ∇_{c_t} ℓ = d*_{τ_t}
    ∇_{F_t} ℓ = d*_{λ_{t+1}} ⊗ τ*_t + λ*_{t+1} ⊗ d*_{τ_t}    ∇_{f_t} ℓ = d*_{λ_t}    ∇_{x_init} ℓ = d*_{λ_0},    (8)

where ⊗ is the outer product operator, and d*_τ and d*_λ are obtained by solving the linear system

    K [… d*_{τ_t} d*_{λ_t} …]^T = −[… ∇_{τ*_t} ℓ 0 …]^T.    (9)

We observe that Equation (9) is of the same form as the linear system in Equation (6) for the LQR problem.
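To make the backward pass concrete, the recipe of Equations (8)-(9) can be checked on a generic equality-constrained QP standing in for the LQR KKT system in Equation (6); the random problem data, sizes, and loss below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                      # decision variables, equality constraints

# Random convex QP: min 1/2 z'Hz + c'z  s.t.  Az = b.
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)      # positive definite
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

K = np.block([[H, A.T], [A, np.zeros((m, m))]])   # KKT matrix

def solve_qp(c, b):
    sol = np.linalg.solve(K, np.concatenate([-c, b]))
    return sol[:n], sol[n:]      # primal z*, dual lambda*

z_star, lam_star = solve_qp(c, b)
z_target = rng.standard_normal(n)
grad_z = z_star - z_target       # gradient of loss 1/2 ||z* - z_target||^2

# Backward pass: one more solve of the *same* KKT system, as in (9).
d = np.linalg.solve(K, np.concatenate([-grad_z, np.zeros(m)]))
d_z, d_lam = d[:n], d[n:]
grad_c = d_z                     # analogue of grad_{c_t} ell = d*_tau in (8)
grad_b = -d_lam

# Finite-difference check on grad_c.
eps = 1e-6
fd = np.zeros(n)
for i in range(n):
    cp = c.copy(); cp[i] += eps
    cm = c.copy(); cm[i] -= eps
    zp, _ = solve_qp(cp, b)
    zm, _ = solve_qp(cm, b)
    fd[i] = (0.5 * np.sum((zp - z_target)**2)
             - 0.5 * np.sum((zm - z_target)**2)) / (2 * eps)
```

Because K is symmetric, the extra backward solve can reuse a factorization of K from the forward pass, which is what makes the backward pass inexpensive.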
Therefore, we can leverage this insight and solve Equation (9) efficiently by solving another LQR problem that replaces c_t with ∇_{τ*_t} ℓ and f_t with 0. Moreover, this approach enables us to re-use the factorization of K from the forward pass instead of recomputing it. Module 1 summarizes the forward and backward passes for a differentiable LQR module.

4 Differentiable MPC

While LQR is a powerful tool, it does not cover realistic control problems with non-linear dynamics and cost. Furthermore, most control problems have natural bounds on the control space that can often be expressed as box constraints. These highly non-convex problems, which we will refer to as model predictive control (MPC), are well-studied in the control literature and can be expressed in the general form

    τ*_{1:T} = argmin_{τ_{1:T}} Σ_t C_{θ,t}(τ_t)   subject to   x_1 = x_init, x_{t+1} = f_θ(τ_t), u̲ ≤ u ≤ ū,    (10)

where the non-convex cost function C_θ and non-convex dynamics function f_θ are (potentially) parameterized by some θ. We note that more generic constraints on the control and state space can be represented as penalties and barriers in the cost function. The standard way of solving the control problem in Equation (10) is by iteratively forming and optimizing a convex approximation

    τ^i_{1:T} = argmin_{τ_{1:T}} Σ_t C̃^i_{θ,t}(τ_t)   subject to   x_1 = x_init, x_{t+1} = f̃^i_θ(τ_t), u̲ ≤ u ≤ ū,    (11)

where we have defined the second-order Taylor approximation of the cost around τ^i as

    C̃^i_{θ,t}(τ_t) = C_{θ,t}(τ^i_t) + (p^i_t)^T (τ_t − τ^i_t) + ½ (τ_t − τ^i_t)^T H^i_t (τ_t − τ^i_t)    (12)

with p^i_t = ∇_{τ^i_t} C_{θ,t} and H^i_t = ∇²_{τ^i_t} C_{θ,t}, and a first-order Taylor approximation of the dynamics around τ^i as

    f̃^i_{θ,t}(τ_t) = f_{θ,t}(τ^i_t) + F^i_t (τ_t − τ^i_t)    (13)

with F^i_t = ∇_{τ^i_t} f_{θ,t}. In practice, a fixed point of Equation (11) is often reached, especially when the dynamics are smooth. As such, differentiating the non-convex problem in Equation (10) can be done exactly by using the final convex approximation. Without the box constraints, the fixed point in Equation (11) could be differentiated with LQR as we show in Section 3. In the next section, we will show how to extend this to the case where we have box constraints on the controls as well.

4.1 Differentiating Box-Constrained QPs

First, we consider how to differentiate a more generic box-constrained convex QP of the form

    x* = argmin_x ½ x^T Q x + p^T x   subject to   Ax = b, x̲ ≤ x ≤ x̄.    (14)

Given active inequality constraints at the solution in the form G̃ x = h̃, this problem turns into an equality-constrained optimization problem with the solution given by the linear system

    [Q A^T G̃^T; A 0 0; G̃ 0 0] [x*; λ*; ν̃*] = −[p; b; h̃].    (15)

With some loss function ℓ that depends on x*, we can use the approach in Amos and Kolter [2017] to obtain the derivatives of ℓ with respect to Q, p, A, and b as

    ∇_Q ℓ = ½ (d*_x ⊗ x* + x* ⊗ d*_x)    ∇_p ℓ = d*_x    ∇_A ℓ = d*_λ ⊗ x* + λ* ⊗ d*_x    ∇_b ℓ = −d*_λ,    (16)

where d*_x and d*_λ are obtained by solving the linear system

    [Q A^T G̃^T; A 0 0; G̃ 0 0] [d*_x; d*_λ; d*_ν̃] = −[∇_{x*} ℓ; 0; 0].    (17)

The constraint G̃ d*_x = 0 is equivalent to the constraint d*_{x_i} = 0 if x*_i ∈ {x̲_i, x̄_i}. Thus solving the system in Equation (17) is equivalent to solving the optimization problem

    d*_x = argmin_{d_x} ½ d_x^T Q d_x + (∇_{x*} ℓ)^T d_x   subject to   A d_x = 0, d_{x_i} = 0 if x*_i ∈ {x̲_i, x̄_i}.    (18)

Module 2 Differentiable MPC
  Given: Initial state x_init and initial control sequence u_init
  Parameters: θ of the objective C_θ(τ) and dynamics f_θ(τ)    (The MPC algorithm is defined in Appendix A)
  Forward Pass:
    1: τ*_{1:T} = MPC_{T,u̲,ū}(x_init, u_init; C_θ, f_θ)    ▷ Solve (10)
    2: The solver should reach the fixed point in (11) to obtain approximations to the cost H^n_θ and dynamics F^n_θ
    3: Compute λ*_{1:T} with (7)
  Backward Pass:
    1: F̃^n_θ is F^n_θ with the rows corresponding to the tight control constraints zeroed
    2: d*_{τ_{1:T}} = LQR_T(0; H^n_θ, ∇_{τ*} ℓ, F̃^n_θ, 0)    ▷ Solve (19), ideally reusing the factorizations from the forward pass
    3: Compute d*_{λ_{1:T}} with (7)
    4: Differentiate ℓ with respect to the approximations H^n_θ and F^n_θ with (8)
    5: Differentiate these approximations with respect to θ and use the chain rule to obtain ∂ℓ/∂θ

4.2 Differentiating MPC with Box Constraints

At a fixed point, we can use Equation (16) to compute the derivatives of the MPC problem, where d*_τ and d*_λ are found by solving the linear system in Equation (9) with the additional constraint that d_{u_{t,i}} = 0 if u*_{t,i} ∈ {u̲_{t,i}, ū_{t,i}}. Solving this system can be equivalently written as a zero-constrained LQR problem of the form

    d*_{τ_{1:T}} = argmin_{d_{τ_{1:T}}} Σ_t ½ d_{τ_t}^T H^n_t d_{τ_t} + (∇_{τ*_t} ℓ)^T d_{τ_t}
                   subject to   d_{x_1} = 0, d_{x_{t+1}} = F^n_t d_{τ_t}, d_{u_{t,i}} = 0 if u*_{t,i} ∈ {u̲_{t,i}, ū_{t,i}},    (19)

where n is the iteration at which Equation (11) reaches a fixed point, and H^n and F^n are the corresponding approximations to the objective and dynamics defined earlier. Module 2 summarizes the proposed differentiable MPC module. To solve the MPC problem in Equation (10) and reach the fixed point in Equation (11), we use the box-DDP heuristic [Tassa et al., 2014]. For the zero-constrained LQR problem in Equation (19) that computes the derivatives, we use an LQR solver that zeros the appropriate controls.

4.3 Drawbacks of Our Approach

Sometimes the controller does not run for long enough to reach a fixed point of Equation (11), or a fixed point doesn't exist, which often happens when using neural networks to approximate the dynamics. When this happens, Equation (19) cannot be used to differentiate through the controller, because it assumes a fixed point. Differentiating through the final iLQR iterate that is not a fixed point will usually give the wrong gradients. Treating the iLQR procedure as a compute graph and differentiating through the unrolled operations is a reasonable alternative in this scenario that obtains surrogate gradients to the control problem.
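The active-set logic of Equations (17)-(18), and hence of the zero-constrained LQR problem in Equation (19), can be illustrated on a tiny box-constrained QP without the equality constraints; the problem data, the projected-gradient solver, and the loss are illustrative stand-ins, not the paper's box-DDP solver:

```python
import numpy as np

# Box-constrained QP: min 1/2 x'Qx + p'x  s.t.  -1 <= x <= 1.
# Hand-made instance whose unconstrained minimizer violates both bounds,
# so the first and third constraints are strictly active at the solution.
Q = np.array([[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.5],
              [0.0, 0.5, 2.0]])
p = np.array([-5.0, 1.0, 5.0])
lo, hi = -1.0, 1.0

# Forward pass: projected gradient descent (a simple stand-in solver).
x = np.zeros(3)
alpha = 1.0 / np.linalg.eigvalsh(Q).max()
for _ in range(500):
    x = np.clip(x - alpha * (Q @ x + p), lo, hi)

# Backward pass via the active set: entries of d at tight constraints are
# zeroed and the free entries solve a reduced linear system, as in (17)-(18).
grad_x = x - np.zeros(3)          # gradient of loss 1/2 ||x - 0||^2
free = (x > lo + 1e-6) & (x < hi - 1e-6)
d = np.zeros(3)
d[free] = -np.linalg.solve(Q[np.ix_(free, free)], grad_x[free])
grad_p = d                        # analogue of grad_p ell = d*_x in (16)
```

The backward pass touches only the coordinates off their bounds; perturbing p_1 or p_3 leaves the clipped solution, and hence the loss, unchanged, which is why those gradient entries are exactly zero.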
However, as we empirically show in Section 5.1, the backward pass of this method scales linearly with the number of iLQR iterations used in the forward pass. Instead, fixed-point differentiation takes constant time and requires only a single iLQR solve.

5 Experimental Results

In this section, we present several results that highlight the performance and capabilities of differentiable MPC in comparison to neural network policies and vanilla system identification (SysId). We show 1) superior runtime performance compared to an unrolled solver, 2) the ability of our method to recover the cost and dynamics of a controller with imitation, and 3) the benefit of directly optimizing the task loss over vanilla SysId.

We have released our differentiable MPC solver as a standalone open source package that is available at https://github.com/locuslab/mpc.pytorch and our experimental code for this paper is also openly available at https://github.com/locuslab/differentiable-mpc. Our experiments are implemented with PyTorch [Paszke et al., 2017].

Figure 2: Runtime comparison of fixed point differentiation (FP) to unrolling the iLQR solver (Unroll), averaged over 10 trials.

Figure 3: Model and imitation losses for the LQR imitation learning experiments.

5.1 MPC Solver Performance

Figure 2 highlights the performance of our differentiable MPC solver. We compare to an alternative version where each box-constrained iLQR iteration is individually unrolled, and gradients are computed by differentiating through the entire unrolled chain. As illustrated in the figure, these unrolled operations incur a substantial extra cost.
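The scaling difference between unrolling and fixed-point differentiation is easy to see on a one-dimensional toy problem; the contraction map below is illustrative and unrelated to iLQR, but the two derivatives it compares are computed exactly as in the two strategies above:

```python
import numpy as np

# Toy contraction x_{k+1} = g(x_k, theta); its fixed point x*(theta) is
# differentiated two ways. The map itself is illustrative only.
def g(x, t):       return 0.5 * np.cos(t * x)
def g_x(x, t):     return -0.5 * t * np.sin(t * x)   # d g / d x
def g_theta(x, t): return -0.5 * x * np.sin(t * x)   # d g / d theta

theta = 1.3
x, s = 0.0, 0.0
for _ in range(100):
    # "Unrolled" derivative: propagate the sensitivity through every
    # iteration, which is what differentiating the compute graph does.
    s = g_x(x, theta) * s + g_theta(x, theta)
    x = g(x, theta)

# Fixed-point (implicit) derivative: one linear solve at the solution.
dx_implicit = g_theta(x, theta) / (1.0 - g_x(x, theta))
```

Both derivatives agree at convergence, but the unrolled sensitivity `s` must be propagated through every iteration (and, in reverse mode, every iterate must be stored), whereas the implicit derivative is a single solve at the fixed point, independent of how many iterations the solver took.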
Our differentiable MPC solver 1) is slightly more computationally efficient even in the forward pass, as it does not need to create and maintain the backward pass variables; 2) is more memory efficient in the forward pass for this same reason (by a factor of the number of iLQR iterations); and 3) is significantly more efficient in the backward pass, especially when a large number of iLQR iterations are needed. The backward pass is essentially free, as it can reuse all the factorizations from the forward pass and does not require multiple iterations.

5.2 Imitation Learning: Linear-Dynamics Quadratic-Cost (LQR)

In this section, we show results to validate the MPC solver and gradient-based learning approach for an imitation learning problem. The expert and learner are LQR controllers that share all information except for the linear system dynamics f(x_t, u_t) = Ax_t + Bu_t. The controllers have the same quadratic cost (the identity), control bounds [−1, 1], horizon (5 timesteps), and 3-dimensional state and control spaces. Though the dynamics can also be recovered by fitting next-state transitions, we show that we can alternatively use imitation learning to recover the dynamics using only controls. Given an initial state x, we can obtain nominal actions from the controllers as u_{1:T}(x; θ), where θ = {A, B}. We randomly initialize the learner's dynamics with θ̂ and minimize the imitation loss

    L = E_x[ ||τ_{1:T}(x; θ) − τ_{1:T}(x; θ̂)||²₂ ].

We do learning by differentiating L with respect to θ̂ (using mini-batches with 32 examples) and taking gradient steps with RMSprop [Tieleman and Hinton, 2012]. Figure 3 shows the model and imitation loss of eight randomly sampled initial dynamics, where the model loss is MSE(θ, θ̂). The model converges to the true parameters in half of the trials and achieves a perfect imitation loss.
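A stripped-down version of this experiment, using a horizon-1 controller with a closed-form solution and finite-difference gradients in place of the analytic derivative from Section 3, can be sketched as follows (all matrices and hyperparameters are illustrative):

```python
import numpy as np

# Imitation of an LQR-style controller from actions only. Horizon 1 so the
# optimal control has a closed form; the system here is hypothetical.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

def policy(x, B):
    # argmin_u ||Ax + Bu||^2 + ||u||^2  =>  u = -(I + B'B)^{-1} B'Ax
    return -np.linalg.solve(np.eye(1) + B.T @ B, B.T @ A @ x)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 2))          # mini-batch of initial states

def imitation_loss(B_hat):
    return np.mean([(policy(x, B_hat) - policy(x, B_true))**2 for x in X])

# Gradient descent with finite-difference gradients, standing in for the
# analytic derivative through the controller.
B_hat = np.array([[0.05], [0.02]])
loss0 = imitation_loss(B_hat)
eps, lr = 1e-5, 0.3
for _ in range(300):
    grad = np.zeros_like(B_hat)
    for i in range(2):
        Bp = B_hat.copy(); Bp[i, 0] += eps
        Bm = B_hat.copy(); Bm[i, 0] -= eps
        grad[i, 0] = (imitation_loss(Bp) - imitation_loss(Bm)) / (2 * eps)
    B_hat -= lr * grad
loss1 = imitation_loss(B_hat)
```

Even in this tiny instance, the imitation loss is driven down using actions alone, without ever fitting next-state transitions.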
The other trials get stuck in a local minimum of the imitation loss, causing the approximate model to significantly diverge from the true model. These faulty trials highlight that despite the LQR problem being convex, the optimization problem of some loss function with respect to the controller's parameters is a (potentially difficult) non-convex optimization problem that typically does not have convergence guarantees.

Figure 4: Learning results on the (simple) pendulum and cartpole environments for #Train ∈ {10, 50, 100}. We select the best validation loss observed during the training run and report the best test loss.

5.3 Imitation Learning: Non-Convex Continuous Control

We next demonstrate the ability of our method to do imitation learning in the pendulum and cartpole benchmark domains. Despite being simple tasks, they are relatively challenging for a generic policy to learn quickly in the imitation learning setting. In our experiments we use MPC experts and learners that produce a nominal action sequence u_{1:T}(x; θ), where θ parameterizes the model that is being optimized. The goal of these experiments is to optimize the imitation loss L = E_x[ ||u_{1:T}(x; θ) − u_{1:T}(x; θ̂)||²₂ ], which, again, we can do using only observed controls and no state observations. We consider the following methods:

Baselines: nn is an LSTM that takes the state x as input and predicts the nominal action sequence. In this setting we optimize the imitation loss directly.
sysid assumes the cost of the controller is known and approximates the parameters of the dynamics by optimizing the next-state transitions.

Our Methods: mpc.dx assumes the cost of the controller is known and approximates the parameters of the dynamics by directly optimizing the imitation loss. mpc.cost assumes the dynamics of the controller are known and approximates the cost by directly optimizing the imitation loss. mpc.cost.dx approximates both the cost and the parameters of the dynamics of the controller by directly optimizing the imitation loss.

In all settings that involve learning the dynamics (sysid, mpc.dx, and mpc.cost.dx) we use a parameterized version of the true dynamics. In the pendulum domain, the parameters are the mass, length, and gravity; in the cartpole domain, the parameters are the cart's mass, the pole's mass, gravity, and length. For cost learning in mpc.cost and mpc.cost.dx we parameterize the cost of the controller as the weighted distance to a goal state C(τ) = ||w_g ∘ (τ − τ_g)||²₂. We have found that simultaneously learning the weights w_g and goal state τ_g is unstable, so in our experiments we alternate learning of w_g and τ_g independently every 10 epochs. We collected a dataset of trajectories from an expert controller and vary the number of trajectories our models are trained on. A single trial of our experiments takes 1-2 hours on a modern CPU. We optimize the nn setting with Adam [Kingma and Ba, 2014] with a learning rate of 10⁻⁴; all other settings are optimized with RMSprop [Tieleman and Hinton, 2012] with a learning rate of 10⁻² and a decay term of 0.5.

Figure 4 shows that in nearly every case we are able to directly optimize the imitation loss with respect to the controller and we significantly outperform a general neural network policy trained on the same information.
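The alternating schedule for w_g and τ_g can be sketched on synthetic cost-regression data; everything below (the data, the schedule length, the finite-difference gradients, and the learning rate) is illustrative rather than the paper's actual setup:

```python
import numpy as np

# Toy version of the alternating w_g / tau_g updates for the cost
# C(tau) = ||w o (tau - tau_g)||^2, fit to synthetic expert cost values.
rng = np.random.default_rng(1)
w_true, g_true = np.array([1.0, 2.0]), np.array([0.5, -0.5])
taus = rng.standard_normal((64, 2))
y = np.sum((w_true * (taus - g_true))**2, axis=1)   # expert cost values

def loss(w, g):
    pred = np.sum((w * (taus - g))**2, axis=1)
    return np.mean((pred - y)**2)

w_hat, g_hat = np.ones(2), np.zeros(2)
loss0, lr, eps = loss(w_hat, g_hat), 0.002, 1e-5
for epoch in range(200):
    update_w = (epoch // 10) % 2 == 0       # switch targets every 10 epochs
    grad = np.zeros(2)
    for i in range(2):
        d = np.zeros(2); d[i] = eps
        if update_w:
            grad[i] = (loss(w_hat + d, g_hat) - loss(w_hat - d, g_hat)) / (2 * eps)
        else:
            grad[i] = (loss(w_hat, g_hat + d) - loss(w_hat, g_hat - d)) / (2 * eps)
    if update_w:
        w_hat -= lr * grad
    else:
        g_hat -= lr * grad
loss1 = loss(w_hat, g_hat)
```

Updating only one of the two parameter groups at a time mirrors the alternating scheme described above, which we found more stable than joint updates.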
In many cases we are able to recover the true cost function and dynamics of the expert. More information about the training and validation losses is in Appendix B. The comparison between our approach mpc.dx and SysId is notable, as we are able to recover equivalent performance to SysId with our models using only the control information and without using state information.
Again, while we emphasize that these are simple tasks, there are stark differences between the approaches. Unlike the generic network-based imitation learning, the MPC policy can exploit its inherent structure. Specifically, because the policy contains a well-defined notion of the dynamics and cost, it is able to learn with much lower sample complexity than a typical network. But unlike pure system identification (which would be reasonable only for the case where the physical parameters are unknown but all other costs are known), the differentiable MPC policy can naturally be adapted to objectives besides simple state prediction, such as incorporating the additional cost learning portion.

Figure 5: Convergence results in the non-realizable pendulum task. Left panel: the vanilla SysId baseline; right panel: (ours) directly optimizing the imitation loss.

5.4 Imitation Learning: SysId with a non-realizable expert

All of our previous experiments that involve SysId and learning the dynamics consider the unrealistic case where the expert's dynamics lie within the model class being learned. In this experiment we study a case where the expert's dynamics are outside of the model class being learned.
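To make the model-class mismatch concrete, below is a minimal Euler-integration sketch of a torque-controlled pendulum: the parameters (m, l, g) form the learnable model class from the previous section, while the damping and 'wind' terms are only present in the expert. The names, the integrator, and the sign conventions here are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# One Euler step of a torque-controlled pendulum with angle th (0 = hanging
# down), angular velocity dth, and control torque u. The learnable model
# class only has (m, l, g); the non-realizable expert additionally has a
# damping coefficient b and a horizontal 'wind' force on the point mass.
def pendulum_step(th, dth, u, m=1.0, l=1.0, g=9.8, b=0.0, wind=0.0, dt=0.05):
    torque = u - b * dth + wind * l * np.cos(th) - m * g * l * np.sin(th)
    ddth = torque / (m * l ** 2)
    return th + dt * dth, dth + dt * ddth
```

With b = wind = 0 this reduces to the nominal model class, so no setting of (m, l, g) can match the expert's transitions exactly; this is precisely the regime in which next-state SysId and the imitation loss can disagree.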
In this setting we do imitation learning for the parameters of a dynamics function with vanilla SysId and by directly optimizing the imitation loss (sysid and mpc.dx from the previous section, respectively).
SysId often fits observations from a noisy environment to a simpler model. In our setting, we collect optimal trajectories from an expert in a pendulum environment that has an additional damping term and another force acting on the point mass at the end, which can be interpreted as a “wind” force. We learn with dynamics models that do not have these additional terms and therefore cannot recover the expert's parameters. Figure 5 shows that even though vanilla SysId is slightly better at optimizing the next-state transitions, it finds an inferior model for imitation compared to our approach that directly optimizes the imitation loss.
We argue that SysId is rarely a goal in isolation; it almost always serves a more sophisticated task such as imitation or policy learning. Typically SysId is merely a surrogate for optimizing the task, and we claim that the task's loss signal provides useful information to guide the dynamics learning. Our method provides one way of doing this by allowing the task's loss function to be directly differentiated with respect to the dynamics function being learned.

6 Conclusion

This paper lays the foundations for differentiating and learning MPC-based controllers within reinforcement learning and imitation learning. Our approach, in contrast to the more traditional strategy of “unrolling” a policy, has the benefit that it is much less computationally and memory intensive, with a backward pass that is essentially free given the number of iterations required for the iLQR optimizer to converge to a fixed point.
We have demonstrated our approach in the context of imitation learning, and have highlighted the potential advantages that the approach brings over generic imitation learning and system identification.
We also emphasize that one of the primary contributions of this paper is to define and set up the framework for differentiating through MPC in general. Given the recent prominence of attempts to incorporate planning and control methods into the loop of deep network architectures, the techniques here offer a method for efficiently integrating MPC policies into such settings, allowing these architectures to make use of a very powerful function class that has proven extremely effective in practice. Future applications of our differentiable MPC method include tuning model parameters to task-specific goals and incorporating joint model-based and policy-based loss functions; our method can also be extended to stochastic control.

Acknowledgments

BA is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1252522. We thank Alfredo Canziani, Shane Gu, and Yuval Tassa for insightful discussions.

References

Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 1–8. ACM, 2006.

Kostas Alexis, Christos Papachristos, George Nikolakopoulos, and Anthony Tzes. Model predictive quadrotor indoor position control. In 19th Mediterranean Conference on Control & Automation (MED), pages 1247–1252. IEEE, 2011.

Brandon Amos and J Zico Kolter. OptNet: Differentiable Optimization as a Layer in Neural Networks. In Proceedings of the International Conference on Machine Learning, 2017.

Somil Bansal, Roberto Calandra, Sergey Levine, and Claire Tomlin.
MBMF: Model-based priors for model-free reinforcement learning. arXiv preprint arXiv:1709.03153, 2017.

Joschka Boedecker, Jost Tobias Springenberg, Jan Wülfing, and Martin Riedmiller. Approximate real-time optimal control based on sparse Gaussian process models. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.

P. Bouffard, A. Aswani, and C. Tomlin. Learning-based model predictive control on a quadrotor: Onboard implementation and experimental results. In IEEE International Conference on Robotics and Automation, 2012.

Stephen Boyd. LQR via Lagrange multipliers. Stanford EE 363: Linear Dynamical Systems, 2008. URL http://stanford.edu/class/ee363/lectures/lqr-lagrange.pdf.

Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078, 2017.

Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.

T. Erez, Y. Tassa, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In International Conference on Intelligent Robots and Systems, 2012.

Gregory Farquhar, Tim Rocktäschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. arXiv preprint arXiv:1710.11417, 2017.

Ramón González, Mirko Fiacchini, José Luis Guzmán, Teodoro Álamo, and Francisco Rodríguez. Robust tube-based predictive control for mobile robots in off-road conditions. Robotics and Autonomous Systems, 59(10):711–726, 2011.

Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine.
Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016a.

Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In Proceedings of the International Conference on Machine Learning, 2016b.

Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.

Mina Kamel, Kostas Alexis, Markus Achtelik, and Roland Siegwart. Fast nonlinear model predictive control for multicopter attitude tracking on SO(3). In 2015 IEEE Conference on Control Applications (CCA), pages 1160–1166. IEEE, 2015.

Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pages 4697–4707, 2017.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Ian Lenz, Ross A Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.

Sergey Levine. Optimal control and planning. Berkeley CS 294-112: Deep Reinforcement Learning, 2017. URL http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_8_model_based_planning.pdf.

Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

Weiwei Li and Emanuel Todorov.
Iterative linear quadratic regulator design for nonlinear biological movement systems. 2004.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Alexander Liniger, Alexander Domahidi, and Manfred Morari. Optimization-based autonomous racing of 1:43 scale RC cars. Optimal Control Applications and Methods, pages 628–647, 2014.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.

Michael Neunert, Cedric de Crousaz, Fadri Furrer, Mina Kamel, Farbod Farshidian, Roland Siegwart, and Jonas Buchli. Fast Nonlinear Model Predictive Control for Unified Trajectory Optimization and Tracking. In ICRA, 2016.

Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and Honglak Lee. Control of memory, active perception, and action in Minecraft. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

Junhyuk Oh, Satinder Singh, and Honglak Lee.
Value prediction network. In Advances in Neural Information Processing Systems, pages 6120–6130, 2017.

Masashi Okada, Luca Rigazio, and Takenobu Aoshima. Path integral networks: End-to-end differentiable optimal control. arXiv preprint arXiv:1706.09597, 2017.

Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sébastien Racanière, David Reichert, Théophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint arXiv:1707.06170, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. arXiv preprint arXiv:1804.08606, 2018.

Marcus Pereira, David D. Fan, Gabriel Nakajima An, and Evangelos Theodorou. MPC-inspired neural network policies for sequential decision making. arXiv preprint arXiv:1802.05803, 2018.

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.

Jeff G Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems, pages 1047–1053, 1997.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation.
International Conference on Learning Representations, 2016.

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2016.

Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.

Liting Sun, Cheng Peng, Wei Zhan, and Masayoshi Tomizuka. A fast integrated planning and control framework for autonomous driving via imitation learning. arXiv preprint arXiv:1707.02515, 2017.

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990.

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.

Aviv Tamar, Garrett Thomas, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. Learning from the hindsight plan: episodic MPC improvement. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 336–343. IEEE, 2017.

Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1168–1175. IEEE, 2014.

Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude.
COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Arun Venkatraman, Roberto Capobianco, Lerrel Pinto, Martial Hebert, Daniele Nardi, and J Andrew Bagnell. Improved learning of dynamics models for control. In International Symposium on Experimental Robotics, pages 703–713. Springer, 2016.

Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.

Théophane Weber, Sébastien Racanière, David P Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.

Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433–1440. IEEE, 2016.

Grady Williams, Andrew Aldrich, and Evangelos A Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.

Zhaoming Xie, C. Karen Liu, and Kris Hauser. Differential Dynamic Programming with Nonlinear Constraints. In International Conference on Robotics and Automation (ICRA), 2017.
", "award": [], "sourceid": 5042, "authors": [{"given_name": "Brandon", "family_name": "Amos", "institution": "Carnegie Mellon University"}, {"given_name": "Ivan", "family_name": "Jimenez", "institution": "Georgia Tech"}, {"given_name": "Jacob", "family_name": "Sacks", "institution": "Georgia Institute of Technology"}, {"given_name": "Byron", "family_name": "Boots", "institution": "Georgia Tech / Google Brain"}, {"given_name": "J. 
Zico", "family_name": "Kolter", "institution": "Carnegie Mellon University / Bosch Center for AI"}]}