{"title": "An Approximate Inference Approach to Temporal Optimization in Optimal Control", "book": "Advances in Neural Information Processing Systems", "page_first": 2011, "page_last": 2019, "abstract": "Algorithms based on iterative local approximations present a practical approach to optimal control in robotic systems. However, they generally require the temporal parameters (for e.g. the movement duration or the time point of reaching an intermediate goal) to be specified \\textit{a priori}. Here, we present a methodology that is capable of jointly optimising the temporal parameters in addition to the control command profiles. The presented approach is based on a Bayesian canonical time formulation of the optimal control problem, with the temporal mapping from canonical to real time parametrised by an additional control variable. An approximate EM algorithm is derived that efficiently optimises both the movement duration and control commands offering, for the first time, a practical approach to tackling generic via point problems in a systematic way under the optimal control framework. The proposed approach is evaluated on simulations of a redundant robotic plant.", "full_text": "An Approximate Inference Approach to Temporal\n\nOptimization in Optimal Control\n\nKonrad C. Rawlik\nSchool of Informatics\nUniversity of Edinburgh\n\nEdinburgh, UK\n\nMarc Toussaint\n\nTU Berlin\n\nBerlin, Germany\n\nSethu Vijayakumar\nSchool of Informatics\nUniversity of Edinburgh\n\nEdinburgh, UK\n\nAbstract\n\nAlgorithms based on iterative local approximations present a practical approach\nto optimal control in robotic systems. However, they generally require the tem-\nporal parameters (for e.g. the movement duration or the time point of reaching\nan intermediate goal) to be speci\ufb01ed a priori. Here, we present a methodology\nthat is capable of jointly optimizing the temporal parameters in addition to the\ncontrol command pro\ufb01les. 
The presented approach is based on a Bayesian canonical time formulation of the optimal control problem, with the temporal mapping from canonical to real time parametrised by an additional control variable. An approximate EM algorithm is derived that efficiently optimizes both the movement duration and control commands, offering, for the first time, a practical approach to tackling generic via point problems in a systematic way under the optimal control framework. The proposed approach, which is applicable to plants with non-linear dynamics as well as arbitrary state dependent and quadratic control costs, is evaluated on realistic simulations of a redundant robotic plant.

1 Introduction

Control of sensorimotor systems, artificial or biological, is inherently both a spatial and temporal process. Not only do we have to specify where the plant has to move to, but also when it reaches that position. In some control schemes, the temporal component is implicit; for example, with a PID controller, movement duration results from the application of the feedback loop, while in other cases it is explicit, as in finite or receding horizon optimal control approaches where the time horizon is set explicitly as a parameter of the problem [8, 13].

Although control based on an optimality criterion is certainly attractive, practical approaches for stochastic systems are currently limited to the finite horizon [9, 16] or first exit time formulation [14, 17]. The former does not optimize temporal aspects of the movement, i.e., the duration or the times when costs for specific sub-goals of the problem are incurred, assuming them as given a priori. However, how should one choose these temporal parameters? This question is non-trivial and important even for a simple reaching problem. The solution generally employed in practice is to use an a priori fixed duration, chosen experimentally.
This can result in not reaching the goal, having to use an unrealistic range of control commands, or excessive (wasteful) durations for short distance tasks. The alternative first exit time formulation, on the other hand, either assumes specific exit states in the cost function and computes the shortest duration trajectory which fulfils the task, or assumes a time stationary task cost function and computes the control which minimizes the joint cost of movement duration and task cost [17, 1, 14]. This formalism is thus only directly applicable to tasks which do not require sequential achievement of multiple goals. Although this limitation could be overcome by chaining together individual time optimal single goal controllers, such a sequential approach has several drawbacks. First, if we are interested in placing a cost on overall movement duration, we are restricted to linear costs if we wish to remain time optimal. A second, more important flaw is that future goals should influence our control even before we have achieved the previous goal, a problem which we highlight during our comparative simulation studies.

A wide variety of successful approaches to address stochastic optimal control problems have been described in the literature [6, 2, 7]. The approach we present here builds on a class of approximate stochastic optimal control methods which have been successfully used in the domain of robotic manipulators, in particular the iLQG algorithm [9] used by [10], and the Approximate Inference Control (AICO) algorithm [16]. These approaches, as alluded to earlier, are finite horizon formulations and consequently require the temporal structure of the problem to be fixed a priori.
This requirement is a direct consequence of a fixed length discretization of the continuous problem and of the structure of the temporally non-stationary cost function used, which binds incurrence of goal related costs to specific (discretised) time points. The fundamental idea proposed here is to reformulate the problem in canonical time and to alternately optimize the temporal and spatial trajectories. We implement this general approach in the context of the approximate inference formulation of AICO, leading to an Expectation Maximisation (EM) algorithm where the E-Step reduces to the standard inference control problem. It is worth noting that due to the similarities between AICO, iLQG and other algorithms, e.g., DDP [6], the same principle and approach should be applicable more generally. The proposed approach provides an extension to the time scaling approach [12, 3] by considering the scaling for a complete controlled system, rather than a single trajectory. Additionally, it also extends previous applications of Expectation Maximisation algorithms for system identification of dynamical systems, e.g., [4, 5], which did not consider the temporal aspects.

2 Preliminaries

Let us consider a process with state $x \in \mathbb{R}^{D_x}$ and controls $u \in \mathbb{R}^{D_u}$ which is of the form

$$dx = (F(x) + Bu)\,dt + d\xi\,, \qquad \langle d\xi\, d\xi^\top \rangle = Q\,, \qquad (1)$$

with non-linear state dependent dynamics $F$, control matrix $B$ and Brownian motion $\xi$, and define a cost of the form

$$L(x(\cdot), u(\cdot)) = \int_0^T \left[ C(x(t), t) + u(t)^\top H u(t) \right] dt\,, \qquad (2)$$

with arbitrary state dependent cost $C$ and quadratic control cost. Note in particular that $T$, the trajectory length, is assumed to be known.
The closed loop stochastic optimal control problem is to find the policy $\pi : x(t) \rightarrow u(t)$ given by

$$\pi^* = \operatorname*{argmin}_{\pi} \; \mathbb{E}_{x,u|\pi,x(0)} \{ L(x(\cdot), u(\cdot)) \}\,. \qquad (3)$$

In practice, the continuous time problem is discretized into a fixed number of $K$ steps of length $\Delta t$, leading to the discrete problem with dynamics

$$P(x_{k+1}|x_k, u_k) = \mathcal{N}(x_{k+1} \,|\, x_k + (F(x_k) + Bu_k)\Delta t,\; Q\Delta t)\,, \qquad (4)$$

where we use $\mathcal{N}(\cdot|a, A)$ to denote a Gaussian distribution with mean $a$ and covariance $A$, and cost

$$L(x_{1:K}, u_{1:K}) = C_K(x_K) + \sum_{k=0}^{K-1} \left[ \Delta t\, C_k(x_k) + u_k^\top (H\Delta t)\, u_k \right]\,. \qquad (5)$$

Note that here we used the Euler Forward Method as the discretization scheme, which will prove advantageous if a linear cost on the movement duration is chosen, leading to a closed form solution for certain optimization problems. However, in other cases, alternative discretisation methods could be used and may indeed be preferable.

2.1 Approximate Inference Control

Recently, it has been suggested to consider a Bayesian inference approach [16] to the (discrete) optimal control problems formalised in Section 2. With the probabilistic trajectory model in (4) as a prior, an auxiliary (binary) dynamic random task variable $r_k$, with the associated likelihood

$$P(r_k = 1 | x_k, u_k) = \exp\{ -(\Delta t\, C_k(x_k) + u_k^\top (H\Delta t)\, u_k) \}\,, \qquad (6)$$

[Figure 1: The graphical models for (a) standard inference control and (b) the AICO-T model with canonical time. Circle and square nodes indicate continuous and discrete variables respectively. Shaded nodes are observed.]

is introduced, i.e., we interpret the cost as a negative log likelihood of task fulfilment.
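As a concrete illustration of the discretisation in (4) and (5), the following sketch (our own, not from the paper; the function name and the quadratic-cost arguments are assumptions) rolls out the Euler-discretised dynamics and accumulates the discrete cost:

```python
import numpy as np

def rollout_discrete(F, B, Q, H, C, x0, u_seq, dt, rng=None):
    """Sample the Euler-discretised dynamics (4) and accumulate cost (5):
    x_{k+1} ~ N(x_k + (F(x_k) + B u_k) dt, Q dt),
    L = C(x_K, K) + sum_k [ dt*C(x_k, k) + u_k^T (H dt) u_k ]."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    xs, cost = [x.copy()], 0.0
    noise_chol = np.linalg.cholesky(Q * dt)  # Brownian increment covariance is Q*dt
    for k, u in enumerate(u_seq):
        cost += dt * C(x, k) + u @ (H * dt) @ u
        x = x + (F(x) + B @ u) * dt + noise_chol @ rng.standard_normal(x.shape)
        xs.append(x.copy())
    cost += C(x, len(u_seq))  # terminal cost C_K(x_K)
    return np.array(xs), cost
```

Here the terminal cost $C_K$ is folded into the same callable `C(x, k)` for brevity.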
Inference control consists of computing the posterior conditioned on the observation $r_{0:K} = 1$ within the resulting model (illustrated as a graphical model in Fig. 1(a)), and from it obtaining the maximum a posteriori (MAP) controls. For cases where the process and cost are linear and quadratic in $u$ respectively, the controls can be marginalised in closed form and one is left with the problem of computing the posterior

$$P(x_{0:K} | r_{0:K} = 1) = \prod_k \mathcal{N}(x_{k+1} \,|\, x_k + F(x_k)\Delta t,\; W\Delta t)\, \exp(-\Delta t\, C_k(x_k))\,, \qquad (7)$$

with $W := Q + BH^{-1}B^\top$. As this posterior is in general not tractable, the AICO [16] algorithm computes a Gaussian approximation to the true posterior using an approximate message passing approach similar in nature to EP (details are given in the supplementary material). The algorithm has been shown to have competitive performance when compared to iLQG [16].

3 Temporal Optimization for Optimal Control

Often the state dependent cost term $C(x, t)$ in (2) can be split into a set of costs which are incurred only at specific times, also referred to as goals, and others which are independent of time, that is

$$C(x, t) = J(x) + \sum_{n=1}^{N} \delta_{t=\hat{t}_n}\, V_n(x)\,. \qquad (8)$$

Classically, the $\hat{t}_n$ refer to real time and are fixed. For instance, in a reaching movement, a cost that is a function of the distance to the target is generally incurred only at the final time $T$, while collision costs are independent of time and incurred throughout the movement.
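The split in (8) can be written out directly. This tiny sketch (ours, not from the paper; the argument names are assumptions) evaluates $C(x, t)$ given a time-independent term and a list of (time, goal-cost) pairs:

```python
def state_cost(x, t, J, goal_costs):
    """C(x, t) = J(x) + sum_n [t == t_hat_n] * V_n(x), as in (8)."""
    c = J(x)
    for t_hat, V in goal_costs:
        if t == t_hat:  # delta_{t = t_hat_n}: goal cost bound to a fixed real time
            c += V(x)
    return c
```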
In order to allow the time point at which the goals are achieved to be influenced by the optimization, we will re-frame the goal driven part of the problem in a canonical time and, in addition to optimizing the controls, also optimize the mapping from canonical to real time.

Specifically, we introduce into the problem defined by (1) & (2) the canonical time variable $\tau$ with the associated mapping

$$\tau = \beta(t) = \int_0^t \frac{1}{\theta(s)}\, ds\,, \qquad \theta(\cdot) > 0\,, \qquad (9)$$

with $\theta$ as an additional control. We also reformulate the cost in terms of the time $\tau$ as¹

$$L(x(\cdot), u(\cdot), \theta(\cdot)) = \sum_{n=1}^{N} V_n(x(\beta^{-1}(\hat\tau_n))) + \int_0^{\hat\tau_N} \mathcal{T}(\theta(s))\, ds + \int_0^{\beta^{-1}(\hat\tau_N)} \left[ J(x(t)) + u(t)^\top H u(t) \right] dt\,, \qquad (10)$$

with $\mathcal{T}$ an additional cost term over the controls $\theta$, and the $\hat\tau_{1:N} \in \mathbb{R}$ assumed as given. Based on the last assumption, we are still required to choose the time points at which individual goals are achieved and how long the movement lasts; however, this is now done in terms of the canonical time and, since by controlling $\theta$ we can change the real time point at which the cost is incurred, the exact choices for $\hat\tau_{1:N}$ are relatively unimportant. The real time behaviour is mainly specified by the additional cost term $\mathcal{T}$ over the new controls $\theta$ which we have introduced. Note that in the special case where $\mathcal{T}$ is linear, we have $\int_0^{\hat\tau_N} \mathcal{T}(\theta(s))\, ds = \mathcal{T}(T)$, i.e., $\mathcal{T}$ is equivalent to a cost on the total movement duration. Although here we will stick to the linear case, the proposed approach is also applicable to non-linear duration costs. We briefly note the similarity of the formulation to the canonical time formulation of [11] used in an imitation learning setting.

¹Note that as $\beta$ is strictly monotonic and increasing, the inverse function $\beta^{-1}$ exists.

We now discretize the augmented system in canonical time with a fixed number of steps $K$. Making the arbitrary choice of a step length of 1 in $\tau$ induces, by (9), a sequence of steps in $t$ with length² $\Delta_k = \theta_k$. Using this time step sequence and (4), we can now obtain a discrete process in terms of the canonical time with an explicit dependence on $\theta_{0:K-1}$. Discretization of the cost in (10) gives

$$L(x_{1:K}, u_{1:K}, \theta_{0:K-1}) = \sum_{n=1}^{N} V_n(x_{\hat k_n}) + \sum_{k=0}^{K-1} \left[ \mathcal{T}(\theta_k) + J(x_k)\theta_k + u_k^\top H \theta_k u_k \right]\,, \qquad (11)$$

²Under the assumption of constant $\theta(\cdot)$ during each step.

for some given $\hat k_{1:N}$. We now have a new formulation of the optimal control problem that is no longer of the form of equations (4) & (5); e.g., (11) is no longer quadratic in the controls, as $\theta$ is a control.

Proceeding as for standard inference control and treating the cost (11) as the negative log likelihood of an auxiliary binary dynamic random variable, we obtain the inference problem illustrated by the Bayesian network in Figure 1(b). With the controls $u$ marginalised, our aim is now to find the posterior $P(x_{0:K}, \theta_{0:K-1} | r_{0:K} = 1)$. Unfortunately, this problem is intractable even for the simplest case, e.g., LQG with a linear duration cost. However, observing that for given $\theta_k$'s the problem reduces to the standard case of Section 2.1 suggests restricting ourselves to finding the MAP estimate for $\theta_{0:K-1}$ and the associated posterior $P(x_{0:K} | \theta^{\mathrm{MAP}}_{0:K-1}, r_{0:K} = 1)$ using an EM algorithm.
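The canonical-time discretisation can be made concrete with a small sketch (ours, not from the paper; function and argument names are assumptions): unit canonical steps induce real-time step lengths $\Delta_k = \theta_k$, and the cost (11) is a plain sum over canonical steps.

```python
import numpy as np

def real_times(thetas):
    """Real time grid induced by unit canonical steps: Delta_k = theta_k,
    so t_k = sum_{j<k} theta_j and the total duration is T = sum_k theta_k."""
    return np.concatenate(([0.0], np.cumsum(thetas)))

def canonical_cost(xs, us, thetas, goal_costs, J, H, T_cost):
    """Discretised canonical-time cost (11); goal_costs maps canonical step
    indices k_hat_n to goal cost functions V_n."""
    cost = sum(V(xs[k]) for k, V in goal_costs.items())
    for k, th in enumerate(thetas):
        cost += T_cost(th) + J(xs[k]) * th + us[k] @ (H * th) @ us[k]
    return cost
```

Note that changing the $\theta_k$'s moves the real time at which a goal cost is incurred without changing its canonical index.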
The solution is obtained by iterating the E- and M-Steps (see below) until the $\theta$'s have converged; we call this algorithm AICO-T to reflect the temporal aspect of the optimization.

3.1 E-Step

In general, the aim of the E-Step is to calculate the posterior over the unobserved variables, i.e. the trajectories, given the current parameter values, i.e. the $\theta^i$'s:

$$q^i(x_{0:K}) = P(x_{0:K} | r_{0:K} = 1, \theta^i_{0:K-1})\,. \qquad (12)$$

However, as will be shown below, we actually only require the expectations $\langle x_k x_k^\top \rangle$ and $\langle x_k x_{k+1}^\top \rangle$ during the M-Step. As these are in general not tractable, we compute a Gaussian approximation to the posterior, following an approximate message passing approach with linear and quadratic approximations to the dynamics and cost respectively [16] (for details, refer to the supplementary material).

3.2 M-Step

In the M-Step, we solve

$$\theta^{i+1}_{0:K-1} = \operatorname*{argmax}_{\theta_{0:K-1}} Q(\theta_{0:K-1} | \theta^i_{0:K-1})\,, \qquad (13)$$

with

$$Q(\theta_{0:K-1} | \theta^i_{0:K-1}) = \langle \log P(x_{0:K}, r_{0:K} = 1 | \theta_{0:K-1}) \rangle = \sum_{k=0}^{K-1} \langle \log P(x_{k+1} | x_k, \theta_k) \rangle - \sum_{k=1}^{K-1} \left[ \mathcal{T}(\theta_k) + \theta_k \langle J(x_k) \rangle \right] + \text{constant}\,, \qquad (14)$$

where $\langle \cdot \rangle$ denotes the expectation with respect to the distribution calculated in the E-Step, i.e., the posterior $q^i(x_{0:K})$ over trajectories given the previous parameter values. The required expectations, $\langle J(x_k) \rangle$ and

$$\langle \log P(x_{k+1} | x_k, \theta_k) \rangle = -\frac{D_x}{2} \log |\widetilde W_k| - \frac{1}{2} \left\langle (x_{k+1} - \widetilde F(x_k))^\top \widetilde W_k^{-1} (x_{k+1} - \widetilde F(x_k)) \right\rangle\,, \qquad (15)$$

with $\widetilde F(x_k) = x_k + F(x_k)\theta_k$ and $\widetilde W_k = \theta_k W$, are in general not tractable.
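Before detailing the two steps further, the overall iteration can be sketched as a generic EM loop (our own skeleton, not from the paper; `e_step` and `m_step` stand in for the message-passing E-Step and the $\theta$ update of Section 3.2):

```python
def aico_t(e_step, m_step, theta0, max_iter=100, tol=1e-6):
    """EM skeleton for AICO-T: the E-step solves the standard inference control
    problem for fixed thetas and returns posterior moments; the M-step
    re-optimises the canonical-to-real time map theta_{0:K-1}."""
    thetas = list(theta0)
    moments = None
    for _ in range(max_iter):
        moments = e_step(thetas)              # approximate posterior q^i
        new_thetas = m_step(moments, thetas)  # argmax_theta Q(theta | theta^i)
        converged = max(abs(a - b) for a, b in zip(new_thetas, thetas)) < tol
        thetas = new_thetas
        if converged:
            break
    return thetas, moments
```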
Therefore, we take the approximations

$$F(x_k) \approx a_k + A_k x_k \qquad \text{and} \qquad J(x_k) \approx \frac{1}{2} x_k^\top J_k x_k - j_k^\top x_k\,, \qquad (16)$$

choosing the mean of $q^i(x_k)$ as the point of approximation, consistent with the equivalent approximations made in the E-Step. Under these approximations, it can be shown that, up to additive terms independent of $\theta$,

$$Q(\theta_{0:K-1} | \theta^i_{0:K-1}) = -\sum_{k=0}^{K-1} \Big[ \frac{D_x}{2} \log |\widetilde W_k| + \mathcal{T}(\theta_k) + \frac{1}{2} \operatorname{Tr}(\widetilde W_k^{-1} \langle x_{k+1} x_{k+1}^\top \rangle) - \operatorname{Tr}(\widetilde A_k^\top \widetilde W_k^{-1} \langle x_{k+1} x_k^\top \rangle) + \frac{1}{2} \operatorname{Tr}(\widetilde A_k^\top \widetilde W_k^{-1} \widetilde A_k \langle x_k x_k^\top \rangle) - \tilde a_k^\top \widetilde W_k^{-1} \langle x_{k+1} \rangle + \tilde a_k^\top \widetilde W_k^{-1} \widetilde A_k \langle x_k \rangle + \frac{1}{2} \tilde a_k^\top \widetilde W_k^{-1} \tilde a_k + \theta_k \Big( \frac{1}{2} \operatorname{Tr}(J_k \langle x_k x_k^\top \rangle) - j_k^\top \langle x_k \rangle \Big) \Big]\,,$$

with $\tilde a_k = \theta_k a_k$ and $\widetilde A_k = I + \theta_k A_k$, and taking partial derivatives leads to

$$\frac{\partial Q}{\partial \theta_k} = \frac{1}{2} \theta_k^{-2} \operatorname{Tr}\!\big( W^{-1} (\langle x_{k+1} x_{k+1}^\top \rangle + \langle x_k x_k^\top \rangle - 2 \langle x_{k+1} x_k^\top \rangle) \big) - \frac{D_x}{2} \theta_k^{-1} - \frac{d\mathcal{T}}{d\theta}\Big|_{\theta_k} - \frac{1}{2} \Big[ \operatorname{Tr}(A_k^\top W^{-1} A_k \langle x_k x_k^\top \rangle) + a_k^\top W^{-1} a_k + 2 a_k^\top W^{-1} A_k \langle x_k \rangle + \operatorname{Tr}(J_k \langle x_k x_k^\top \rangle) - 2 j_k^\top \langle x_k \rangle \Big]\,. \qquad (17)$$

In the general case, we can now use gradient ascent to improve the $\theta$'s. However, in the specific case where $\mathcal{T}$ is a linear function of $\theta$, we note that $0 = \frac{\partial Q}{\partial \theta_k}$ is a quadratic in $\theta_k^{-1}$ and the unique extremum under the constraint $\theta_k > 0$ can be found analytically.

3.3 Practical Remarks

The performance of the algorithm can be greatly enhanced by using the result of the previous E-Step as initialisation for the next one.
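Returning to the M-Step of Section 3.2: for a linear duration cost $\mathcal{T}(\theta) = \alpha\theta$, setting (17) to zero and multiplying through by $\theta_k^2$ gives a scalar quadratic with a single positive root. A sketch of this update (our own rearrangement; the scalars `c2` and `c0` are our shorthand for the moment-dependent quantities in (17): `c2` for the trace term multiplying $\theta_k^{-2}$, `c0` for the bracketed sum):

```python
import math

def theta_update_linear(c2, c0, alpha, Dx):
    """Solve 0 = 0.5*c2/th^2 - 0.5*Dx/th - alpha - 0.5*c0 for th > 0,
    i.e. the positive root of (alpha + 0.5*c0)*th^2 + 0.5*Dx*th - 0.5*c2 = 0."""
    a = alpha + 0.5 * c0
    b = 0.5 * Dx
    assert a > 0 and c2 > 0, "a unique positive root needs a > 0 and c2 > 0"
    return (-b + math.sqrt(b * b + 2.0 * a * c2)) / (2.0 * a)
```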
As this initialisation is likely to be near the optimum for the new temporal trajectory, AICO converges within only a few iterations. Additionally, in practice it is often sufficient to restrict the $\theta_k$'s between goals to be constant, which is easily achieved as $Q$ is a sum over the $\theta$'s.

The proposed algorithm leads to a variation of the discretization step length, which can be a problem. For one, the approximation error increases with the step length, which may lead to wrong results. On the other hand, the algorithm may lead to control frequencies which are not achievable in practice. In general, a fixed control signal frequency may be prescribed by the hardware system. In practice, the $\theta$'s can be kept in a prescribed range by adjusting the number of discretization steps $K$ after an M-Step.

Finally, although we have chosen to express the time cost in terms of a function of the $\theta$'s, it may often be desirable to consider a cost directly over the duration $T$. Noting that $T = \sum_k \theta_k$, all that is required is to replace $\frac{d\mathcal{T}}{d\theta}$ with $\frac{\partial \mathcal{T}(\sum \theta)}{\partial \theta_k}$ in (17).

4 Experiments

The proposed algorithm was evaluated in simulation. As a basic plant, we used a kinematic simulation of a 2 degrees of freedom (DOF) planar arm, consisting of two links of equal length.
The state of the plant is given by $x = (q, \dot q)$, with $q \in \mathbb{R}^2$ the joint angles and $\dot q \in \mathbb{R}^2$ the associated angular velocities. The controls $u \in \mathbb{R}^2$ are the joint space accelerations. We also added some i.i.d. noise with small diagonal covariance.

[Figure 2: Temporal scaling behaviour using AICO-T. (a & b) Effect of changing the time-cost weight $\alpha$ (effectively the ratio between reaching cost and duration cost) on (a) duration and (b) reaching cost (control + state cost). (c) Comparison of reaching costs (control + error cost) for AICO-T and a fixed duration approach, i.e. AICO.]

For all experiments, we used a quadratic control cost and the state dependent cost term

$$V(x_k) = \sum_i \delta_{k=\hat k_i}\, (\phi_i(x_k) - y^*_i)^\top \Lambda_i (\phi_i(x_k) - y^*_i)\,, \qquad (18)$$

for some given $\hat k_i$, and employed a diagonal weight matrix $\Lambda_i$, while $y^*_i$ represented the desired state in task space. For point targets, the task space mapping is $\phi(x) = (x, y, \dot x, \dot y)^\top$, i.e., the map from $x$ to the vector of end point positions and velocities in task space coordinates.
The time cost was linear, that is, $\mathcal{T}(\theta) = \alpha\theta$.

4.1 Variable Distance Reaching Task

In order to evaluate the behaviour of AICO-T, we applied it to a reaching task with varying start-target distance. Specifically, for a fixed start point we considered a series of targets lying equally spaced along a line in task space. It should be noted that although the targets are equally spaced in task space, and results are shown with respect to movement distance in task space, the distances in joint space scale non-linearly. The state cost (18) contained a single term incurred at the final discrete step with $\Lambda = 10^6 \cdot I$, and the control costs were given by $H = 10^4 \cdot I$. Fig. 2(a & b) shows the movement duration ($= \sum_k \theta_k$) and standard reaching cost³ for different temporal-cost parameters $\alpha$ (we used $\alpha_0 = 2 \cdot 10^7$), demonstrating that AICO-T successfully trades off the movement duration and standard reaching cost for varying movement distances. In Fig. 2(c), we compare the reaching costs of AICO-T with those obtained with a fixed duration approach, in this case AICO. Note that although with a fixed, long duration (e.g., AICO with duration T = 0.41) the control and error costs are reduced for short movements, these movements necessarily have up to 4× longer durations than those obtained with AICO-T. For example, for a movement distance of 0.2, application of AICO-T results in an optimised movement duration of 0.07 (cf. Fig. 2(a)), making the fixed time approach impractical when temporal costs are considered. Choosing a short duration, on the other hand (AICO (T = 0.07)), leads to significantly worse costs for long movements. We further emphasise that the fixed durations used in this comparison were chosen post hoc by exploiting the durations suggested by AICO-T; in the absence of this, there would have been no practical way of choosing them apart from experimentation.
Furthermore, we would like to highlight that, although the results suggest a simple scaling of duration with movement distance, in cluttered environments and for plants with more complex forward kinematics, an efficient decision on the movement duration cannot be based only on task space distance.

4.2 Via Point Reaching Tasks

We also evaluated the proposed algorithm in a more complex via point task. The task requires the end-effector to reach to a target, having passed at some point through a given second target, the via point.

³N.b. the standard reaching cost is the sum of control costs and the cost on the endpoint error, without taking duration into account, i.e., (11) without the $\mathcal{T}(\theta)$ term.

[Figure 3: Comparison of AICO-T (solid) to the common modelling approach using AICO (dashed) with fixed times on a via point task. (a) End point task space trajectories for two different via points (circles) obtained for a fixed start point (triangle). (b) The corresponding joint space trajectories. (c) Movement durations and reaching costs (control + error costs) from 10 random start points. The proportion of the movement duration spent before the via point is shown in light gray (mean in the AICO-T case).]

This task is of interest as it can be seen as an abstraction of a diverse range of complex sequential tasks that require one to achieve a series of sub-tasks in order to reach a final goal.
This task has also seen some interest in the literature on modelling of human movement using the optimal control framework, e.g., [15]. Here the common approach is to choose the time point at which one passes the via point so as to divide the movement duration in the same ratio as the distances between the start point, via point and end target. This requires, on the one hand, prior knowledge of these movement distances and, on the other, makes the implicit assumption that the two movements are in some sense independent.

In a first experiment, we demonstrate the ability of our approach to solve such sequential problems, adjusting movement durations between sub-goals in a principled manner, and show that it improves upon the standard modelling approach. Specifically, we apply AICO-T to the two via point problems illustrated in Fig. 3(a) with randomised start states⁴. For comparison, we follow the standard modelling approach and apply AICO to compute the controller. We again choose the movement duration for the standard case post hoc, to coincide with the mean movement duration obtained with AICO-T for each of the individual via point tasks. Each task is expressed using a cost function consisting of two point target cost terms. Specifically, (18) takes the form

$$V(x_k) = \delta_{k=\frac{K}{2}}\, (\phi(x_k) - y^*_v)^\top \Lambda_v (\phi(x_k) - y^*_v) + \delta_{k=K}\, (\phi(x_k) - y^*_e)^\top \Lambda_e (\phi(x_k) - y^*_e)\,, \qquad (19)$$

with $K$ the number of discrete steps, diagonal matrices $\Lambda_v = \operatorname{diag}(\lambda_{pos}, \lambda_{pos}, 0, 0)$ and $\Lambda_e = \operatorname{diag}(\lambda_{pos}, \lambda_{pos}, \lambda_{vel}, \lambda_{vel})$, where $\lambda_{pos} = 10^5$ and $\lambda_{vel} = 10^7$, and vectors $y^*_v = (\cdot, \cdot, 0, 0)^\top$, $y^*_e = (\cdot, \cdot, 0, 0)^\top$ the desired states for the individual via point and target, respectively.
Note that the cost function does not penalise velocity at the via point, but does encourage stopping at the target. While admittedly the choice of incurring the via point cost at the middle of the movement ($\frac{K}{2}$) is likely to be a sub-optimal choice for the standard approach, one has to consider that in more complex task spaces the relative ratio of movement distances may not be easily accessible, and one may have to resort to the most intuitive choice for the uninformed case, as we have done here. Note that although for AICO-T this cost is incurred at the same discrete step, we allow $\theta$ before and after the via point to differ, but constrain them to be constant throughout each part of the movement, hence allowing the cost to be incurred at an arbitrary point in real time. We sampled the initial position of each joint independently from a Gaussian distribution with a variance of 3°. In Fig. 3(a&b), we show maximum a posteriori (MAP) trajectories in task space and joint space for controllers computed for the mean initial state. Interestingly, although the end point trajectory for the near via point produced by AICO-T may look less optimal than that produced by the standard AICO algorithm, closer examination of the joint space trajectories reveals that our approach results in more efficient actuation trajectories. In Fig. 3(c), we illustrate the resulting average movement durations and costs of the mean trajectories. As can be seen, AICO-T results in the expected passing times for the two via points, i.e. early vs. late in the movement for the near and far via point, respectively. This directly leads to a lower incurred cost compared to un-optimised movement durations.

⁴For the sake of clarity, Fig.
3(a&b) show MAP trajectories of controllers computed for the mean start state.

[Figure 4: Joint (solid) vs. sequential (dashed) optimisation using AICO-T for a sequential (via point) task. (a) Task space trajectories for a fixed start point (triangle). Via point and target are indicated by the circle and square, respectively. (b) The corresponding joint space trajectories. (c) The movement durations and reaching costs (control + error cost) for 10 random start points. The mean proportion of the movement duration spent before the via point is shown in light gray.]

In order to highlight the shortcomings of sequential time optimal control, we next compare planning a complete movement over sequential goals to planning a sequence of individual movements. Specifically, using AICO-T, we compare planning the whole via point movement (joint planner) to planning a movement from the start to the via point, followed by a second movement from the end point of the first movement (n.b. not from the via point) to the end target (sequential planner). The joint planner used the same cost function as in the previous experiment.
For the sequential planner, each of the two sub-trajectories had half the number of discrete time steps of the joint planner, and the cost functions were given by appropriately splitting (19), i.e.,

$$V^1(x_k) = \delta_{k=\frac{K}{2}}\, (\phi(x_k) - y^*_v)^\top \Lambda_v (\phi(x_k) - y^*_v) \quad \text{and} \quad V^2(x_k) = \delta_{k=\frac{K}{2}}\, (\phi(x_k) - y^*_e)^\top \Lambda_e (\phi(x_k) - y^*_e)\,,$$

with $\Lambda_v$, $\Lambda_e$, $y^*_v$, $y^*_e$ as for (19). The start states were sampled according to the distribution used in the last experiment, and in Fig. 4(a&b) we plot the MAP trajectories for the mean start state, in task as well as joint space. The results illustrate that sequential planning leads to sub-optimal results, as it does not take future goals into consideration. This leads directly to a higher cost (cf. Fig. 4(c)), calculated from trials with randomised start states. One should however note that this effect would be less pronounced if the cost required stopping at the via point, as it is the velocity away from the end target which is the main problem for the sequential planner.

5 Conclusion

The contribution of this paper is a novel method for jointly optimizing a movement trajectory and its time evolution (temporal scale and duration) in the stochastic optimal control framework. As a special case, this solves the problem of an unknown goal horizon and the problem of trajectory optimization through via points when the timing of intermediate constraints is unknown and subject to optimization. Both cases are of high relevance in practical robotic applications, where pre-specifying a goal horizon by hand is common practice but typically lacks justification.

The method was derived in the form of an Expectation-Maximization algorithm where the E-step addresses the stochastic optimal control problem, reformulated as an inference problem, and the M-step re-adapts the time evolution of the trajectory.
In principle, the proposed framework can be applied to extend any algorithm that, directly or indirectly, provides us with an approximate trajectory posterior in each iteration. AICO [16] does so directly in terms of a Gaussian approximation; similarly, the local LQG solution implicit in iLQG [9] can, with little extra computational cost, be used to compute a Gaussian posterior over trajectories. For algorithms like DDP [6], which do not lead to an LQG approximation, we can employ the Laplace method to obtain Gaussian posteriors or adjust the M-step for the non-Gaussian posterior. We demonstrated the algorithm on a standard reaching task with and without via points. In particular, in the via point case, it becomes obvious that fixed horizon methods and sequenced first exit time methods cannot find motions as efficient as those of the proposed method.

References

[1] David Barber and Tom Furmston. Solving deterministic policy (PO)MDPs using expectation-maximisation and antifreeze. In European Conference on Machine Learning (LEMIR workshop), 2009.
[2] Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters. Gaussian process dynamic programming. Neurocomputing, 72(7-9):1508–1524, 2009.
[3] Yu-Yi Fu, Chia-Ju Wu, Kuo-Lan Su, and Chia-Nan Ko. A time-scaling method for near-time-optimal control of an omni-directional robot along specified paths. Artificial Life and Robotics, 13(1):350–354, 2008.
[4] Z. Ghahramani and G. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto, 1996.
[5] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems, volume 11, 1999.
[6] D. Jacobson and D. Mayne. Differential Dynamic Programming. Elsevier, 1970.
[7] Hilbert J. Kappen. A linear theory for control of non-linear stochastic systems.
Physical Review Letters, 95(20):200201, 2005.
[8] Donald E. Kirk. Optimal Control Theory: An Introduction. Prentice-Hall, 1970.
[9] Weiwei Li and Emanuel Todorov. An iterative optimal control and estimation design for nonlinear stochastic systems. In Proc. of the 45th IEEE Conference on Decision and Control, 2006.
[10] Djordje Mitrovic, Sho Nagashima, Stefan Klanke, Takamitsu Matsubara, and Sethu Vijayakumar. Optimal feedback control for anthropomorphic manipulators. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2010), 2010.
[11] Peter Pastor, Heiko Hoffmann, Tamim Asfour, and Stefan Schaal. Learning and generalization of motor skills by learning from demonstration. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2010), 2010.
[12] Gideon Sahar and John M. Hollerbach. Planning of minimum-time trajectories for robot arms. The International Journal of Robotics Research, 5(3):90–100, 1986.
[13] Robert F. Stengel. Optimal Control and Estimation. Dover Publications, 1986.
[14] Emanuel Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems, volume 22, 2009.
[15] Emanuel Todorov and Michael Jordan. Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11):1226–1235, 2002.
[16] Marc Toussaint. Robot trajectory optimization using approximate inference. In Proc. of the 26th International Conference on Machine Learning (ICML 2009), 2009.
[17] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In Proc. of the 23rd International Conference on Machine Learning (ICML 2006), pages 945–952, 2006.