{"title": "Sample Efficient Path Integral Control under Uncertainty", "book": "Advances in Neural Information Processing Systems", "page_first": 2314, "page_last": 2322, "abstract": "We present a data-driven stochastic optimal control framework that is derived using the path integral (PI) control approach. We find iterative control laws analytically without a priori policy parameterization based on probabilistic representation of the learned dynamics model. The proposed algorithm operates in a forward-backward sweep manner which differentiate it from other PI-related methods that perform forward sampling to find open-loop optimal controls. Our method uses significantly less sampled data to find analytic control laws compared to other approaches within the PI control family that rely on extensive sampling from given dynamics models or trials on physical systems in a model-free fashion. In addition, the learned controllers can be generalized to new tasks without re-sampling based on the compositionality theory for the linearly-solvable optimal control framework.We provide experimental results on three different systems and comparisons with state-of-the-art model-based methods to demonstrate the efficiency and generalizability of the proposed framework.", "full_text": "Sample Ef\ufb01cient Path Integral Control under\n\nUncertainty\n\nYunpeng Pan, Evangelos A. Theodorou, and Michail Kontitsis\n\nAutonomous Control and Decision Systems Laboratory\n\nInstitute for Robotics and Intelligent Machines\n\nSchool of Aerospace Engineering\n\nGeorgia Institute of Technology, Atlanta, GA 30332\n{ypan37,evangelos.theodorou,kontitsis}@gatech.edu\n\nAbstract\n\nWe present a data-driven optimal control framework that is derived using the path\nintegral (PI) control approach. We \ufb01nd iterative control laws analytically without a\npriori policy parameterization based on probabilistic representation of the learned\ndynamics model. 
The proposed algorithm operates in a forward-backward manner, which differentiates it from other PI-related methods that perform forward sampling to find optimal controls. Our method uses significantly fewer samples to find analytic control laws than other approaches within the PI control family, which rely on extensive sampling from given dynamics models or on trials on physical systems in a model-free fashion. In addition, the learned controllers can be generalized to new tasks without re-sampling, based on the compositionality theory for the linearly-solvable optimal control framework. We provide experimental results on three different tasks and comparisons with state-of-the-art model-based methods to demonstrate the efficiency and generalizability of the proposed framework.

1 Introduction

Stochastic optimal control (SOC) is a general and powerful framework with applications in many areas of science and engineering. However, despite its broad applicability, solving SOC problems remains challenging for systems in high-dimensional continuous state-action spaces. Various function approximation approaches to optimal control are available [1, 2], but they are usually sensitive to model uncertainty. Over the last decade, SOC based on the exponential transformation of the value function has demonstrated remarkable applicability in solving real-world control and planning problems. In control theory the exponential transformation of the value function was introduced in [3, 4].
In the past decade it has been explored in terms of path integral interpretations and theoretical generalizations [5, 6, 7, 8], discrete-time formulations [9], and scalable RL/control algorithms [10, 11, 12, 13, 14]. The resulting stochastic optimal control frameworks are known as Path Integral (PI) control in continuous time, Kullback-Leibler (KL) control in discrete time, or, more generally, Linearly Solvable Optimal Control [9, 15].

One of the most attractive characteristics of PI control is that optimal control problems can be solved by forward sampling of Stochastic Differential Equations (SDEs). While sampling SDEs is more scalable than numerically solving partial differential equations, it still suffers from the curse of dimensionality when performed in a naive fashion. One way to circumvent this problem is to parameterize policies [10, 11, 14] and then perform optimization with sampling. However, in this case one has to impose the structure of the policy a priori, which restricts the possible optimal control solutions to the assumed parameterization. In addition, the optimized policy parameters cannot be generalized to new tasks. In general, model-free PI policy search approaches require a large number of samples from trials performed on real physical systems. This sample inefficiency further restricts the applicability of PI control methods to physical systems with unknown or partially known dynamics.

Motivated by the aforementioned limitations, in this paper we introduce a sample efficient, model-based approach to PI control. Different from existing PI control approaches, our method combines the benefits of PI control theory [5, 6, 7] and probabilistic model-based reinforcement learning methodologies [16, 17].
The main characteristics of our approach are summarized as follows:

• It extends the PI control theory [5, 6, 7] to the case of uncertain systems. The structural constraint is enforced between the control cost and the uncertainty of the learned dynamics, which can be viewed as a generalization of previous work [5, 6, 7].
• Different from parameterized PI controllers [10, 11, 14, 8], we find analytic control laws without any policy parameterization.
• Rather than keeping a fixed control cost weight [5, 6, 7, 10, 18], or ignoring the constraint between control authority and noise level [11], in this work the control cost weight is adapted based on the explicit uncertainty of the learned dynamics model.
• The algorithm operates in a different manner than existing PI-related methods, which perform forward sampling [5, 6, 7, 10, 18, 11, 12, 14, 8]. More precisely, our method performs successive deterministic approximate inference and backward computation of the optimal control law.
• The proposed model-based approach is significantly more sample efficient than sampling-based PI control [5, 6, 7, 18].
In the RL setting our method is comparable to the state-of-the-art RL methods [17, 19] in terms of sample and computational efficiency.
• Thanks to the linearity of the backward Chapman-Kolmogorov PDE, the learned controllers can be generalized to new tasks without re-sampling by constructing composite controllers. In contrast, most policy search and trajectory optimization methods [10, 11, 14, 17, 19, 20, 21, 22] find policy parameters that cannot be generalized.

2 Iterative Path Integral Control for a Class of Uncertain Systems

2.1 Problem formulation

We consider a nonlinear stochastic system described by the following differential equation

dx = (f(x) + G(x)u)dt + B dω,   (1)

with state x ∈ Rⁿ, control u ∈ Rᵐ, and standard Brownian motion noise ω ∈ Rᵖ with variance Σω. f(x) is the unknown drift term (passive dynamics), G(x) ∈ Rⁿˣᵐ is the control matrix and B ∈ Rⁿˣᵖ is the diffusion matrix. Given some previous control uold, we seek the optimal control correction term δu such that the total control is u = uold + δu. The original system becomes

dx = (f(x) + G(x)(uold + δu))dt + B dω = (f(x) + G(x)uold)dt + G(x)δu dt + B dω,

where we denote the biased drift term f̃(x, uold) = f(x) + G(x)uold. In this work we assume the dynamics under the previous control can be represented by a Gaussian process (GP) such that

fGP(x) = f̃(x, uold)dt + B dω,   (2)

where fGP is the GP representation of the biased drift term f̃ under the previous control. Now the original dynamical system (1) can be represented as

dx = fGP + G δu dt,   fGP ∼ GP(μf, Σf),   (3)

where μf, Σf are the predictive mean and covariance functions, respectively.
For the GP model we use a prior with zero mean and covariance function K(xi, xj) = σs² exp(−½(xi − xj)ᵀW(xi − xj)) + δij σω², with σs, σω, W the hyper-parameters. δij is the Kronecker symbol, which is one iff i = j and zero otherwise. Samples of fGP can be drawn using a vector of i.i.d. Gaussian variables Ω:

f̃GP = μf + Lf Ω,   (4)

where Lf is obtained by Cholesky factorization such that Σf = Lf Lfᵀ. Note that in general Ω is an infinite-dimensional vector, and we can use the same sample to represent uncertainty during learning [23]. Without loss of generality we assume Ω to be standard zero-mean Brownian motion. For the rest of the paper we use simplified notation with subscripts indicating the time step. The discrete-time representation of the system is x_{t+dt} = xt + μft + Gt δut dt + Lft Ωt √dt, and the conditional probability of x_{t+dt} given xt and δut is a Gaussian p(x_{t+dt}|xt, δut) = N(μ_{t+dt}, Σ_{t+dt}), where μ_{t+dt} = xt + μft + Gt δut dt and Σ_{t+dt} = Σft. In this paper we consider the finite-horizon stochastic optimal control problem

J(x0) = E[ q(xT) + ∫₀ᵀ L(xt, δut)dt ],

where the immediate cost is defined as L(xt, δut) = q(xt) + ½ δutᵀ Rt δut, and q(xt) = (xt − x^d_t)ᵀQ(xt − x^d_t) is a quadratic cost function where x^d_t is the desired state. Rt = R(xt) is a state-dependent positive definite weight matrix.
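As a concrete illustration of the pieces defined above, the squared-exponential covariance function and the Cholesky-based sampling in (4) can be sketched as follows. This is a minimal sketch with our own function and variable names, not code from the paper:

```python
import numpy as np

def se_kernel(xi, xj, sigma_s, W, sigma_w, same_point=False):
    """Covariance K(xi, xj) = sigma_s^2 exp(-0.5 (xi-xj)^T W (xi-xj)),
    plus the Kronecker noise term delta_ij * sigma_w^2 when xi == xj
    are the same training point."""
    d = xi - xj
    k = sigma_s**2 * np.exp(-0.5 * d @ W @ d)
    if same_point:
        k += sigma_w**2
    return k

def sample_fgp(mu_f, Sigma_f, rng):
    """Draw a sample of the GP drift as in (4): f = mu_f + L_f * Omega,
    where Sigma_f = L_f L_f^T and Omega is an i.i.d. standard Gaussian vector."""
    L = np.linalg.cholesky(Sigma_f)
    omega = rng.standard_normal(mu_f.shape[0])
    return mu_f + L @ omega
```

For example, evaluating the kernel at zero distance returns σs² + σω², and repeated calls to `sample_fgp` reproduce the predictive mean and covariance empirically.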
Next we derive the linearized Hamilton-Jacobi-Bellman equation for this class of optimal control problems.

2.2 Linearized Hamilton-Jacobi-Bellman equation for uncertain dynamics

At each iteration the goal is to find the optimal control update δut that minimizes the value function

V(xt, t) = min_{δut} E[ ∫ₜ^{t+dt} L(xt, δut)dt + V(xt + dxt, t + dt) | xt ].   (5)

(5) is the Bellman equation. By approximating the integral for a small dt and applying Itô's rule we obtain the Hamilton-Jacobi-Bellman (HJB) equation (the detailed derivation is omitted):

−∂tVt = min_{δut} ( qt + ½ δutᵀ Rt δut + (μft + Gt δut)ᵀ ∇xVt + ½ Tr(Σft ∇xxVt) ).

To find the optimal control update, we take the gradient of the expression inside the parentheses with respect to δut and set it to zero. This yields δut = −Rt⁻¹Gtᵀ∇xVt. Inserting this expression into the HJB equation yields the following nonlinear, second-order PDE:

−∂tVt = qt + (∇xVt)ᵀμft − ½ (∇xVt)ᵀ Gt Rt⁻¹ Gtᵀ ∇xVt + ½ Tr(Σft ∇xxVt).   (6)

In order to solve the above PDE we use the exponential transformation of the value function Vt = −λ log Ψt, where Ψt = Ψ(xt) is called the desirability of xt. The corresponding partial derivatives are ∂tVt = −(λ/Ψt)∂tΨt, ∇xVt = −(λ/Ψt)∇xΨt, and ∇xxVt = (λ/Ψt²)∇xΨt∇xΨtᵀ − (λ/Ψt)∇xxΨt.
Inserting these terms into (6) results in

(λ/Ψt)∂tΨt = qt − (λ/Ψt)(∇xΨt)ᵀμft − (λ²/(2Ψt²))(∇xΨt)ᵀ Gt Rt⁻¹ Gtᵀ ∇xΨt + (λ/(2Ψt²))(∇xΨt)ᵀ Σft ∇xΨt − (λ/(2Ψt)) Tr(∇xxΨt Σft).

The quadratic terms in ∇xΨt cancel out under the assumption λGtRt⁻¹Gtᵀ = Σft. This constraint is different from existing works in path integral control [5, 6, 7, 10, 18, 8], where the constraint is enforced between the additive noise covariance and the control authority, more precisely λGtRt⁻¹Gtᵀ = BΣωBᵀ. The new constraint enables an adaptive update of the control cost weight based on the explicit uncertainty of the learned dynamics. In contrast, most existing works use a fixed control cost weight [5, 6, 7, 10, 18, 12, 14, 8]. This condition also leads to more exploration (more aggressive control) under high uncertainty and less exploration with more certain dynamics. Given the aforementioned assumption, the above PDE simplifies to

∂tΨt = (1/λ) qt Ψt − μftᵀ ∇xΨt − ½ Tr(∇xxΨt Σft),   (7)

subject to the terminal condition ΨT = exp(−(1/λ)qT). The resulting Chapman-Kolmogorov PDE (7) is linear. In general, solving (7) analytically is intractable for nonlinear systems and cost functions. We apply the Feynman-Kac formula, which gives a probabilistic representation of the solution of the linear PDE (7):

Ψt = lim_{dt→0} ∫ p(τt|xt) exp(−(1/λ) Σ_{j=t}^{T−dt} qj dt) ΨT dτt,   (8)
The optimal control is obtained as\n\nGt\u03b4 \u02c6ut = \u2212GtR\u22121\n\nt GT\n\n=\u21d2\u02c6ut = uold\n\nt + \u03b4 \u02c6ut = uold\n\nt (\u2207xVt) = \u03bbGtR\u22121\nt + G\u22121\n\n(cid:16)\u2207x\u03a8t\n\nt \u03a3f t\n\n(cid:17)\n\nt GT\nt\n\n.\n\n\u03a8t\n\n(cid:16)\u2207x\u03a8t\n\n(cid:17)\n\n\u03a8t\n\n(cid:16)\u2207x\u03a8t\n\n(cid:17)\n\n= \u03a3f t\n\n\u03a8t\n\n(9)\n\nRather than computing \u2207x\u03a8t and \u03a8t, the optimal control \u02c6ut can be approximated based on path\ncosts of sampled trajectories. Next we brie\ufb02y review some of the existing approaches.\n\n2.3 Related works\n\n(cid:2)d\u03c9t\n\nAccording to the path integral control theory [5, 6, 7, 10, 18, 8], the stochastic optimal control\nproblem becomes an approximation problem of a path integral (8). This problem can be solved by\nforward sampling of the uncontrolled (u = 0) SDE (1). The optimal control \u02c6ut is approximated\n(cid:3), where the\nbased on path costs of sampled trajectories. Therefore the computation of optimal controls becomes\na forward process. More precisely, when the control and noise act in the same subspace, the optimal\ncontrol can be evaluated as the weighted average of the noise \u02c6ut = Ep(\u03c4t|xt)\n(cid:82) exp(\u2212 1\nprobability of a trajectory is p(\u03c4t|xt) = exp(\u2212 1\n\u03bb S(\u03c4t|xt))\n\u03bb S(\u03c4t|xt))d\u03c4 , and S(\u03c4t|xt) is de\ufb01ned as the path\ncost computed by performing forward sampling. However, these approaches require a large amount\nof samples from a given dynamics model, or extensive trials on physical systems when applied in\nmodel-free reinforcement learning settings. In order to improve sample ef\ufb01ciency, a nonparametric\napproach was developed by representing the desirability \u03a8t in terms of linear operators in a repro-\nducing kernel Hilbert space (RKHS) [12]. 
As a model-free approach, it allows sample re-use but relies on numerical methods to estimate the gradient of the desirability, i.e., ∇xΨt, which can be computationally expensive. On the other hand, computing the analytic expressions of the path integral embedding is intractable and requires exact knowledge of the system dynamics. Furthermore, the control approximation is based on samples from the uncontrolled dynamics, which is usually not sufficient for highly nonlinear or underactuated systems.

Another class of PI-related methods is based on policy parameterization. Notable approaches include PI2 [10], PI2-CMA [11], PI-REPS [14] and the recently developed state-dependent PI [8]. The limitations of these methods are: 1) they do not take into account model uncertainty in the passive dynamics f(x); 2) the imposed policy parameterizations restrict the optimal control solutions; 3) the optimized policy parameters cannot be generalized to new tasks. A brief comparison of some of these methods is given in Table 1. Motivated by the challenge of combining sample efficiency and generalizability, next we introduce a probabilistic model-based approach to compute the optimal control (9) analytically.

Table 1: Comparison with some notable and recent path integral-related approaches.

Method                          | Structural constraint       | Dynamics model | Policy parameterization
PI [5, 6, 7], iterative PI [18] | λGtRt⁻¹Gtᵀ = BΣωBᵀ          | model-based    | No
PI2 [10], PI2-CMA [11]          | same as PI                  | model-free     | Yes
PI-REPS [14]                    | same as PI                  | model-based    | Yes
State feedback PI [8]           | same as PI                  | model-based    | Yes
Our method                      | λGR⁻¹Gᵀ = Σf                | GP model-based | No

3 Proposed Approach

3.1 Analytic path integral control: a forward-backward scheme

In order to derive the proposed framework, we first learn the function fGP(xt) = f̃(x, uold)dt + B dω from sampled data.
Learning the continuous mapping from state to state transition can be viewed as inferring the state transition dx̃t = fGP(xt). The kernel function was defined in Sec. 2.1 and can be interpreted as a similarity measure between random variables. More specifically, if the training inputs xi and xj are close to each other in the kernel space, their outputs dxi and dxj are highly correlated. Given a sequence of states {x0, . . . , xT} and the corresponding state transitions {dx̃0, . . . , dx̃T}, the posterior distribution is obtained by conditioning the joint prior distribution on the observations. In this work we make the standard assumption of independent outputs (no correlation between output dimensions).

To propagate the GP-based dynamics over a trajectory of time horizon T we employ the moment matching approach [24, 17] to compute the predictive distribution. Given an input distribution over the state N(μt, Σt), the predictive distribution over the state at t + dt can be approximated as a Gaussian p(x_{t+dt}) ≈ N(μ_{t+dt}, Σ_{t+dt}) such that

μ_{t+dt} = μt + μft,   Σ_{t+dt} = Σt + Σft + COV[xt, dx̃t] + COV[dx̃t, xt].   (10)

The above formulation is used to approximate the one-step transition probabilities over the trajectory. Details regarding the moment matching method can be found in [24, 17]. All mean and covariance terms can be computed analytically. The hyper-parameters σs, σω, W are learned by maximizing the log-likelihood of the training outputs given the inputs [25]. Given the approximation of the transition probability (10), we now introduce a Bayesian nonparametric formulation of path integral control based on a probabilistic representation of the dynamics.
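The conditioning step described above can be sketched for a single output dimension as follows. This is a minimal illustration with our own names; the full model additionally learns the hyper-parameters by likelihood maximization and propagates full Gaussian beliefs via moment matching, both omitted here:

```python
import numpy as np

def gp_predict(X, dX, x_star, sigma_s, W, sigma_w):
    """Posterior mean and variance of one output dimension of the state
    transition at a test state x_star, obtained by conditioning the joint
    Gaussian prior on the observed pairs (X, dX).
    X: (N, n) training states; dX: (N,) observed transitions for one output."""
    def k(a, b):
        d = a - b
        return sigma_s**2 * np.exp(-0.5 * d @ W @ d)
    N = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])
    K += sigma_w**2 * np.eye(N)                      # noise on the diagonal
    k_star = np.array([k(x_star, X[i]) for i in range(N)])
    alpha = np.linalg.solve(K, dX)                   # K^{-1} dX
    mu = k_star @ alpha                              # posterior mean
    var = k(x_star, x_star) + sigma_w**2 - k_star @ np.linalg.solve(K, k_star)
    return mu, var
```

Near the training data the posterior mean interpolates the observed transitions with small variance; far from the data it reverts to the zero-mean prior with variance close to σs², which is the behavior the adaptive control-cost constraint exploits.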
Firstly we perform approximate inference (forward propagation) to obtain the Gaussian belief (predictive mean and covariance of the state) over the trajectory. Since the exponential transformation of the state cost exp(−(1/λ)q(x)dt) is an unnormalized Gaussian N(x^d, (2λ/dt)Q⁻¹), we can evaluate the following integral analytically:

∫ N(μj, Σj) exp(−(1/λ)qj dt) dxj = |I + (dt/(2λ))ΣjQ|^{−1/2} exp(−(dt/(2λ))(μj − x^d_j)ᵀ Q (I + (dt/(2λ))ΣjQ)^{−1} (μj − x^d_j)),   (11)

for j = t + dt, ..., T. Thus, given the boundary condition ΨT = exp(−(1/λ)qT) and the predictive distribution at the final step N(μT, ΣT), we can evaluate the one-step backward desirability Ψ_{T−dt} analytically using expression (11). More generally, we use the recursive rule

Ψ_{j−dt} = Φ(xj, Ψj) = ∫ N(μj, Σj) exp(−(1/λ)qj dt) Ψj dxj,   (12)

for j = t + dt, ..., T − dt. Since we use deterministic approximate inference based on (10) instead of explicitly sampling from the corresponding SDE, we approximate the conditional distribution p(xj|x_{j−dt}) by the Gaussian predictive distribution N(μj, Σj).
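The closed form in (11) is an instance of the standard identity for the expectation of an exponentiated quadratic under a Gaussian, E_{x∼N(μ,Σ)}[exp(−½(x−m)ᵀA(x−m))] = |I + ΣA|^{−1/2} exp(−½(μ−m)ᵀA(I + ΣA)^{−1}(μ−m)). A quick Monte-Carlo sanity check of that identity, with a generic weight matrix A standing in for the scaled cost matrix (all constants here are illustrative, not from the paper):

```python
import numpy as np

def analytic_integral(mu, Sigma, m, A):
    """Closed form of E_{x ~ N(mu, Sigma)}[exp(-0.5 (x-m)^T A (x-m))]."""
    n = len(mu)
    M = np.eye(n) + Sigma @ A
    d = mu - m
    return np.linalg.det(M) ** -0.5 * np.exp(-0.5 * d @ A @ np.linalg.solve(M, d))

rng = np.random.default_rng(1)
mu = np.array([0.3, -0.2])
Sigma = np.array([[0.4, 0.1], [0.1, 0.2]])
m = np.array([0.5, 0.0])
A = np.diag([2.0, 1.0])   # plays the role of the scaled cost-weight matrix
xs = rng.multivariate_normal(mu, Sigma, size=200_000)
dmat = xs - m
mc = np.mean(np.exp(-0.5 * np.einsum('ni,ij,nj->n', dmat, A, dmat)))
```

The Monte-Carlo average `mc` agrees with `analytic_integral(mu, Sigma, m, A)` to within sampling error, which is what makes the fully analytic backward recursion (12) possible without any trajectory sampling.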
Therefore the path integral

Ψt = ∫ p(τt|xt) exp(−(1/λ) Σ_{j=t}^{T−dt} qj dt) ΨT dτt
   ≈ ∫ N(μ_{t+dt}, Σ_{t+dt}) exp(−(1/λ)q_{t+dt}dt) ··· [ ∫ N(μ_{T−dt}, Σ_{T−dt}) exp(−(1/λ)q_{T−dt}dt) [ ∫ N(μT, ΣT) exp(−(1/λ)qT) dxT ] dx_{T−dt} ] ··· dx_{t+dt}
   = ∫ N(μ_{t+dt}, Σ_{t+dt}) exp(−(1/λ)q_{t+dt}dt) Ψ_{t+dt} dx_{t+dt} = Φ(x_{t+dt}, Ψ_{t+dt}),   (13)

where the innermost factor exp(−(1/λ)qT) is ΨT, the innermost integral over xT yields Ψ_{T−dt}, the next integral yields Ψ_{T−2dt}, and so on.

We evaluate the desirability Ψt backward in time by successive application of this recursive expression. The optimal control law ût (9) requires the gradient of the desirability function with respect to the state, which can be computed backward in time as well. For simplicity we denote the function Φ(xj, Ψj) by Φj. Thus we compute the gradient of the recursive expression (13):

∇xΨ_{j−dt} = Ψj ∇xΦj + Φj ∇xΨj,   (14)

where j = t + dt, ..., T − dt.
Given the expression in (11), the gradient terms in (14) are computed via the chain rule:

∇xΦj = dΦj/dp(xj) · dp(xj)/dxt = (∂Φj/∂μj)(dμj/dxt) + (∂Φj/∂Σj)(dΣj/dxt),

where, writing Sj = (dt/(2λ)) Q (I + (dt/(2λ))ΣjQ)⁻¹ and dj = μj − x^d_j,

∂Φj/∂μj = −2 Φj Sj dj,   ∂Φj/∂Σj = Φj (Sj dj djᵀ Sj − ½ Sj),

and the derivatives of the predictive moments follow the recursion

d{μj, Σj}/dxt = { (∂μj/∂μ_{j−dt}) dμ_{j−dt}/dxt + (∂μj/∂Σ_{j−dt}) dΣ_{j−dt}/dxt,  (∂Σj/∂μ_{j−dt}) dμ_{j−dt}/dxt + (∂Σj/∂Σ_{j−dt}) dΣ_{j−dt}/dxt }.

The term ∇xΨ_{T−dt} is computed similarly. The partial derivatives ∂μj/∂μ_{j−dt}, ∂μj/∂Σ_{j−dt}, ∂Σj/∂μ_{j−dt}, ∂Σj/∂Σ_{j−dt} can be computed analytically as in [17]. We compute all gradients with this scheme, without any numerical method (finite differences, etc.). Given Ψt and ∇xΨt, the optimal control takes the analytic form in eq. (9). Since Ψt and ∇xΨt are explicit functions of xt, the resulting control law is essentially different from the feedforward controls in sampling-based path integral control frameworks [5, 6, 7, 10, 18] as well as from the parameterized state-feedback PI control policies [14, 8]. Notice that at the current time step t, we update the control sequence û_{t,...,T} using the presented forward-backward scheme.
Only ût is applied to the system to move to the next step, while the controls û_{t+dt,...,T} are used for control updates at future steps. The transition sample recorded at each time step is incorporated to update the GP model of the dynamics. A summary of the proposed algorithm is shown in Algorithm 1.

Algorithm 1 Sample efficient path integral control under uncertain dynamics
1: Initialization: Apply random controls û_{0,..,T} to the physical system (1), record data.
2: repeat
3:   Incorporate transition samples to learn the GP dynamics model.
4:   for t = 0 : T do
5:     repeat
6:       Approximate inference for predictive distributions using uold_{t,..,T} = û_{t,..,T}, see (10).
7:       Backward computation of optimal control updates δû_{t,..,T}, see (13)(14)(9).
8:       Update optimal controls û_{t,..,T} = uold_{t,..,T} + δû_{t,..,T}.
9:     until Convergence.
10:    Apply optimal control ût to the system. Move one step forward and record data.
11:  end for
12: until Task learned.

3.2 Generalization to unlearned tasks without sampling

In this section we describe how to generalize the learned controllers to new (unlearned) tasks without any interaction with the real system. The proposed approach is based on the compositionality theory [26] of linearly solvable optimal control (LSOC). We use superscripts to denote the indexes of previously learned tasks. Firstly we define a distance measure between the new target x̄^d and the old targets x^{dk}, k = 1, .., K, i.e., a Gaussian kernel

ωk = exp(−½ (x̄^d − x^{dk})ᵀ P (x̄^d − x^{dk})),   (15)

where P is a diagonal matrix (kernel width). The composite terminal cost q̄(xT) for the new task becomes

q̄(xT) = −λ log( Σ_{k=1}^K ωk exp(−(1/λ) q^k(xT)) / Σ_{k=1}^K ωk ),   (16)

where q^k(xT) is the terminal cost of the old tasks.
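The task-similarity weights (15) and the composite terminal cost (16) are straightforward to compute once the old targets and terminal costs are available; a minimal sketch (function names are ours, and the terminal costs are passed in as precomputed scalars):

```python
import numpy as np

def task_weights(x_new, x_olds, P):
    """Gaussian-kernel similarity w_k between a new target and the K old
    targets, as in (15)."""
    return np.array([np.exp(-0.5 * (x_new - xk) @ P @ (x_new - xk))
                     for xk in x_olds])

def composite_terminal_cost(x_new, x_olds, P, q_olds, lam):
    """Composite terminal cost (16): -lambda * log of the similarity-weighted
    average of exp(-q_k / lambda) over the old tasks."""
    w = task_weights(x_new, x_olds, P)
    q = np.asarray(q_olds, dtype=float)
    return -lam * np.log(np.sum(w * np.exp(-q / lam)) / np.sum(w))
```

When the new target coincides with one old target and the others are far away in the kernel metric, the weights concentrate and the composite cost reduces to that task's terminal cost; in intermediate cases it behaves as a soft minimum bounded by the old tasks' costs.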
For conciseness we define the normalized distance measure ω̃k = ωk / Σ_{k=1}^K ωk, which can be interpreted as a probability weight. Based on (16), the composite terminal desirability for the new task is a linear combination of the Ψ^k_T:

Ψ̄T = exp(−(1/λ) q̄(xT)) = Σ_{k=1}^K ω̃k Ψ^k_T.   (17)

Since Ψ^k_t is a solution of the linear Chapman-Kolmogorov PDE (7), the linear combination of desirabilities (17) holds everywhere from t to T as long as it holds on the boundary (terminal time step). Therefore we obtain the composite control

ūt = Σ_{k=1}^K ( ω̃k Ψ^k_t / Σ_{k=1}^K ω̃k Ψ^k_t ) û^k_t.   (18)

The composite control law (18) is essentially different from an interpolating control law [26]. It enables sample-free controllers constructed from controllers learned on different tasks. This scheme cannot be adopted in policy search or trajectory optimization methods such as [10, 11, 14, 17, 19, 20, 21, 22]. Alternatively, generalization can be achieved by imposing task-dependent policies [27]. However, this approach might restrict the choice of optimal controls given the assumed structure of the control policy.

4 Experiments and Analysis

We consider 3 simulated RL tasks: cart-pole (CP) swing-up, double pendulum on a cart (DPC) swing-up, and PUMA-560 robotic arm reaching. The CP and DPC systems consist of a cart and a single/double-link pendulum. The tasks are to swing up the single/double-link pendulum from the initial position (pointing down). Both CP and DPC are under-actuated systems with only one control acting on the cart. PUMA-560 is a 3D robotic arm that has 12 state dimensions and 6 degrees of freedom, with 6 actuators on the joints.
The task is to steer the end-effector to the desired position and orientation.

In order to demonstrate the performance, we compare the proposed control framework with three related methods: iterative path integral control [18] with a known dynamics model, PILCO [17], and PDDP [19]. Iterative PI control is a sampling-based stochastic optimal control method; it is based on importance sampling with a controlled diffusion process, rather than the passive dynamics used in standard path integral control [5, 6, 7], and is used here as a baseline with a given dynamics model. PILCO is a model-based policy search method that features state-of-the-art data efficiency in terms of the number of trials required to learn a task. PILCO requires an extra optimizer (such as BFGS) for policy improvement. PDDP is a Gaussian belief space trajectory optimization approach; it performs dynamic programming based on local approximations of the learned dynamics and value function. Both PILCO and PDDP are applied with unknown dynamics. In this work we do not compare our method with model-free PI-related approaches such as [10, 11, 12, 14], since these methods would certainly cost more samples than model-based methods such as PILCO and PDDP. The reason for choosing these two methods for comparison is that our method adopts a similar model learning scheme, while other state-of-the-art methods, such as [20], are based on a different model.

In experiment 1 we demonstrate the sample efficiency of our method on the CP and DPC tasks. For both tasks we choose T = 1.2 and dt = 0.02 (60 time steps per rollout). Iterative PI [18] with a given dynamics model uses 10^3/10^4 (CP/DPC) sample rollouts per iteration and 500 iterations at each time step. We initialize PILCO and the proposed method by collecting 2/6 sample rollouts (corresponding to 120/360 transition samples) for the CP/DPC tasks, respectively. At each trial (on the true dynamics model), we use 1 sample rollout for PILCO and our method.
PDDP uses 4/5 rollouts (corresponding to 240/300 transition samples) for initialization as well as at each trial for the CP/DPC tasks. Fig. 1 shows the results in terms of ΨT and computational time. For both tasks our method shows higher desirability (lower terminal state cost) at each trial, which indicates higher sample efficiency for task learning. This is mainly because our method performs online re-optimization at each time step, while the other two methods do not use this scheme. However, we assume partial information about the dynamics (the G matrix) is given, whereas PILCO and PDDP perform optimization on entirely unknown dynamics. In many robotic systems G corresponds to the inverse of the inertia matrix, which can also be identified from data. In terms of computational efficiency, our method outperforms PILCO since we compute the optimal control update analytically, while PILCO solves large-scale nonlinear optimization problems to obtain policy parameters. Our method is more computationally expensive than PDDP because PDDP seeks locally optimal controls that rely on linear approximations, while our method is a global optimal control approach. Despite the relatively higher computational burden than PDDP, our method offers reasonable efficiency in terms of the time required to reach the baseline performance.

In experiment 2 we demonstrate the generalizability of the learned controllers to new tasks using the composite control law (18) on the PUMA-560 system. We use T = 2 and dt = 0.02 (100 time steps per rollout). First we learn 8 independent controllers using Algorithm 1. The target postures are shown in Fig. 2. For all tasks we initialize with 3 sample rollouts and use 1 sample at each trial. The blue bars in Fig. 2b show the desirabilities ΨT after 3 trials. Next we use the composite law (18) to construct each controller without re-sampling from the 7 other controllers learned using Algorithm 1.
For instance, the composite controller for task #1 is found as

\bar{u}^1_t = \sum_{k=2}^{8} \frac{\tilde{\omega}_k \Psi^k_t}{\sum_{j=2}^{8} \tilde{\omega}_j \Psi^j_t} \, \hat{u}^k_t.

The performance comparison of the composite controllers with controllers learned from trials is shown in Fig. 2. The composite controllers achieve performance close to that of the independently learned controllers. The compositionality theory [26] generally does not apply to policy search methods and trajectory optimizers such as PILCO, PDDP, and other recent methods [20, 21, 22]. Our method benefits from the compositionality of control laws, which can be applied for multi-task control without re-sampling.

Figure 1: Comparison in terms of sample efficiency and computational efficiency for (a) cart-pole and (b) double-pendulum-on-a-cart swing-up tasks. Left subfigures show the terminal desirability Ψ_T (for PILCO and PDDP, Ψ_T is computed using terminal state costs) at each trial. Right subfigures show computational time (in minutes) at each trial.

Figure 2: Results for the PUMA-560 tasks. (a) The 8 tasks tested in this experiment; each number indicates a corresponding target posture. (b) Comparison of the controllers learned independently from trials and the composite controllers obtained without sampling. Each composite controller is obtained via (18) from the 7 other independent controllers learned from trials.

5 Conclusion and Discussion

We presented an iterative learning control framework that can find optimal controllers under uncertain dynamics using a very small number of samples. This approach is closely related to the family of path integral (PI) control algorithms. Our method is based on a forward-backward optimization scheme, which differs significantly from current PI-related approaches.
Moreover, it combines the attractive characteristics of probabilistic model-based reinforcement learning and linearly solvable optimal control theory, namely sample efficiency, optimality, and generalizability. By iteratively updating the control laws based on a probabilistic representation of the learned dynamics, our method demonstrated encouraging performance compared to state-of-the-art model-based methods. In addition, our method showed promising potential for multi-task control based on the compositionality of learned controllers. Besides the assumed structural constraint between the control cost weight and the uncertainty of the passive dynamics, the major limitation is that we have not taken into account the uncertainty in the control matrix G. Future work will focus on further generalization of this framework and applications to real systems.

Acknowledgments

This research is supported by NSF NRI-1426945.

References

[1] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming (Optimization and Neural Computation Series, 3). Athena Scientific, 7:15-23, 1996.

[2] A.G. Barto, W. Powell, J. Si, and D.C. Wunsch. Handbook of Learning and Approximate Dynamic Programming. 2004.

[3] W.H. Fleming. Exit probabilities and optimal stochastic control. Applied Math. Optim., 9:329-346, 1971.

[4] W.H. Fleming and H.M. Soner.
Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics. Springer, New York, 1st edition, 1993.

[5] H.J. Kappen. Linear theory for control of nonlinear stochastic systems. Physical Review Letters, 95:200201, 2005.

[6] H.J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 11:P11011, 2005.

[7] H.J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. AIP Conference Proceedings, 887(1), 2007.

[8] S. Thijssen and H.J. Kappen. Path integral control and state-dependent feedback. Physical Review E, 91:032104, March 2015.

[9] E. Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478-11483, 2009.

[10] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137-3181, 2010.

[11] F. Stulp and O. Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 281-288. ACM, 2012.

[12] K. Rawlik, M. Toussaint, and S. Vijayakumar. Path integral control by reproducing kernel Hilbert space embedding. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI'13, pages 1628-1634, 2013.

[13] Y. Pan and E. Theodorou. Nonparametric infinite horizon Kullback-Leibler stochastic control. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1-8. IEEE, 2014.

[14] V. Gómez, H.J. Kappen, J. Peters, and G. Neumann. Policy search for path integral control. In Machine Learning and Knowledge Discovery in Databases, pages 482-497. Springer, 2014.

[15] K. Dvijotham and E. Todorov.
Linearly solvable optimal control. Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, pages 119-141, 2012.

[16] M.P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.

[17] M. Deisenroth, D. Fox, and C. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:75-90, 2015.

[18] E. Theodorou and E. Todorov. Relative entropy and free energy dualities: Connections to path integral and KL control. In 51st IEEE Conference on Decision and Control, pages 1466-1473, 2012.

[19] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems (NIPS), pages 1907-1915, 2014.

[20] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), pages 1071-1079, 2014.

[21] S. Levine and V. Koltun. Learning complex neural network policies with trajectory optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 829-837, 2014.

[22] J. Schulman, S. Levine, P. Moritz, M.I. Jordan, and P. Abbeel. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015.

[23] P. Hennig. Optimal reinforcement learning for Gaussian systems. In Advances in Neural Information Processing Systems (NIPS), pages 325-333, 2011.

[24] J. Quiñonero-Candela, A. Girard, J. Larsen, and C.E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.

[25] C.K.I. Williams and C.E. Rasmussen. Gaussian Processes for Machine Learning.
MIT Press, 2006.

[26] E. Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems (NIPS), pages 1856-1864, 2009.

[27] M.P. Deisenroth, P. Englert, J. Peters, and D. Fox. Multi-task policy search for robotics. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014.