{"title": "Sample Complexity Bounds for Iterative Stochastic Policy Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3114, "page_last": 3122, "abstract": "This paper is concerned with robustness analysis of decision making under uncertainty. We consider a class of iterative stochastic policy optimization problems and analyze the resulting expected performance for each newly updated policy at each iteration. In particular, we employ concentration-of-measure inequalities to compute future expected cost and probability of constraint violation using empirical runs. A novel inequality bound is derived that accounts for the possibly unbounded change-of-measure likelihood ratio resulting from iterative policy adaptation. The bound serves as a high-confidence certificate for providing future performance or safety guarantees. The approach is illustrated with a simple robot control scenario and initial steps towards applications to challenging aerial vehicle navigation problems are presented.", "full_text": "Sample Complexity Bounds for Iterative Stochastic\n\nPolicy Optimization\n\nMarin Kobilarov\n\nDepartment of Mechanical Engineering\n\nJohns Hopkins University\n\nBaltimore, MD 21218\nmarin@jhu.edu\n\nAbstract\n\nThis paper is concerned with robustness analysis of decision making under un-\ncertainty. We consider a class of iterative stochastic policy optimization problems\nand analyze the resulting expected performance for each newly updated policy\nat each iteration.\nIn particular, we employ concentration-of-measure inequali-\nties to compute future expected cost and probability of constraint violation using\nempirical runs. A novel inequality bound is derived that accounts for the possi-\nbly unbounded change-of-measure likelihood ratio resulting from iterative policy\nadaptation. The bound serves as a high-con\ufb01dence certi\ufb01cate for providing future\nperformance or safety guarantees. 
The approach is illustrated with a simple robot control scenario and initial steps towards applications to challenging aerial vehicle navigation problems are presented.\n\n1 Introduction\n\nWe consider a general class of stochastic optimization problems formulated as\n\nξ∗ = arg min_ξ E_{τ∼p(·|ξ)}[J(τ)],    (1)\n\nwhere ξ defines a vector of decision variables, τ represents the system response defined through the density p(τ|ξ), and J(τ) defines a positive cost function which can be non-smooth and non-convex. It is assumed that p(τ|ξ) is either known or can be sampled from, e.g. in a black-box manner. The objective is to obtain high-confidence sample complexity bounds on the expected cost for a given decision strategy by observing past realizations of possibly different strategies. Such bounds are useful for two reasons: 1) for providing robustness guarantees for future executions, and 2) for designing new algorithms that directly minimize the bound and therefore are expected to have built-in robustness.\nOur primary motivation arises from applications in robotics, for instance when a robot executes control policies to achieve a given task such as navigating to a desired state while perceiving the environment and avoiding obstacles. Such problems are traditionally considered in the framework of reinforcement learning and addressed using policy search algorithms, e.g. [1, 2] (see also [3] for a comprehensive overview with a focus on robotic applications [4]). When an uncertain system model is available the problem is equivalent to robust model-predictive control (MPC) [5].\nOur specific focus is on providing formal guarantees on future executions of control algorithms in terms of maximum expected cost (quantifying performance) and maximum probability of constraint violation (quantifying safety). 
Such bounds determine the reliability of control in the presence of process, measurement and parameter uncertainties, and contextual changes in the task. In this work we make no assumptions about the nature of the system structure, such as linearity, convexity, or Gaussianity. In addition, the proposed approach applies either to a physical system without an available model, to an analytical stochastic model, or to a white-box model (e.g. from a high-fidelity open-source physics engine). In this context, PAC bounds have rarely been considered but could prove essential for system certification, by providing high-confidence guarantees for future performance and safety, for instance “with 99% chance the robot will reach the goal within 5 minutes”, or “with 99% chance the robot will not collide with obstacles”.\nApproach. To cope with such general conditions, we study robustness through a statistical learning viewpoint [6, 7, 8] using finite-time sample complexity bounds on performance based on empirical runs. This is accomplished using concentration-of-measure inequalities [9] which provide probabilistic bounds, i.e. they certify the algorithm execution in terms of statements such as: “in future executions, with 99% chance the expected cost will be less than X and the probability of collision will be less than Y”. While such bounds are generally applicable to any stochastic decision making process, our focus and initial evaluation is on stochastic control problems.\nRandomized methods in control analysis. Our approach is also inspired by existing work on randomized algorithms in control theory, originally motivated by robust linear control design [10]. For example, early work focused on probabilistic root-locus design [11], later applied to constraint satisfaction [12] and general cost functions [13]. 
High-confidence bounds for decidability of linear stability were refined in [14]. These are closely related to the concepts of randomized stability robustness analysis (RSRA) and randomized performance robustness analysis (RPRA) [13]. Finite-time probabilistic bounds for system identification problems have also been obtained through a statistical learning viewpoint [15].\n\n2 Iterative Stochastic Policy Optimization\n\nInstead of directly searching for the optimal ξ to solve (1), a common strategy in direct policy search and global optimization [16, 17, 18, 19, 20, 21] is to iteratively construct a surrogate stochastic model π(ξ|ν) with hyper-parameters ν ∈ V, such as a Gaussian Mixture Model (GMM), where V is a vector space. The model induces a joint density p(τ, ξ|ν) = p(τ|ξ)π(ξ|ν) encoding the natural stochasticity p(τ|ξ) and the artificial control-exploration stochasticity π(ξ|ν). The problem is then to find ν to minimize the expected cost\n\nJ(ν) ≜ E_{τ,ξ∼p(·|ν)}[J(τ)],\n\niteratively until convergence, which in many cases also corresponds to π(·|ν) shrinking close to a delta function around the optimal ξ∗ (or to multiple peaks when multiple disparate optima exist, as long as π is multi-modal).\nThe typical flow of the iterative policy optimization algorithms considered in this work is:\n\nIterative Stochastic Policy Optimization (ISPO)\n\n0. Start with initial hyper-parameters ν0 (i.e. a prior), set i = 0\n1. Sample M trajectories (ξj, τj) ∼ p(·|νi) for j = 1, . . . , M\n2. Compute new policy νi+1 using observed costs J(τj)\n3. 
Compute bound on expected cost and Stop if below threshold, else set i = i + 1 and Goto 1\n\nThe purpose of computing probably-approximate bounds is two-fold: a) to analyze the performance of such standard policy search algorithms; b) to design new algorithms by not directly minimizing an estimate of the expected cost, but by minimizing an upper confidence bound on the expected cost instead. The computed policy will thus have “built-in” robustness in the sense that, with high probability, the resulting cost will not exceed an a-priori known value. The present paper develops bounds applicable to both (a) and (b), but only explores their application to (a), i.e. to the analysis of existing iterative policy search methods.\nCost functions. We consider two classes of cost functions J. The first class encodes system performance and is defined as a bounded real-valued function such that 0 ≤ J(τ) ≤ b for any τ. The second class consists of binary-valued indicator functions representing constraint violation. Assume that the variable τ must satisfy the condition g(τ) ≤ 0. The cost is then defined as J(τ) = I{g(τ)>0} and its expectation can be regarded as the probability of constraint violation, i.e.\n\nP(g(τ) > 0) = E_{τ∼p(·|ξ)} I{g(τ)>0}.\n\nIn this work, we will obtain bounds for both classes of cost functions.\n\n3 A Specific Application: Discrete-time Stochastic Control\n\nWe next illustrate the general stochastic optimization setting using a classical discrete-time nonlinear optimal control problem. Specific instances of such control problems will later be used for numerical evaluation. Consider a discrete-time dynamical model with state xk ∈ X, where X is an n-dimensional manifold, and control inputs uk ∈ R^m at time tk ∈ [0, T ], where k = 0, . . . , N denotes the time stage. 
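Before specializing to control, the generic ISPO loop of §2 can be sketched in code. This is a minimal hypothetical sketch, not the paper's implementation: it assumes a Gaussian surrogate policy π(ξ|ν) = N(μ, Σ), and the callbacks `rollout`, `update`, and `bound` are stand-in names for the simulator, the policy-update rule (Step 2), and the high-confidence bound computation (Step 3).

```python
# Hypothetical sketch of the ISPO loop (steps 0-3 of Section 2), assuming a
# Gaussian surrogate policy pi(xi|nu) = N(mu, Sigma) over decision variables xi.
# `rollout`, `update`, and `bound` are user-supplied stand-ins, not paper APIs.
import numpy as np

def ispo(mu, Sigma, rollout, update, bound, M=200, threshold=0.05, max_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    for i in range(max_iters):
        # 1. Sample M decision vectors xi_j ~ pi(.|nu_i) and observe costs J(tau_j)
        xis = rng.multivariate_normal(mu, Sigma, size=M)
        costs = np.array([rollout(xi) for xi in xis])
        # 2. Compute the new hyper-parameters nu_{i+1} from the observed costs
        mu, Sigma = update(xis, costs)
        # 3. Stop once the bound on the expected cost drops below the threshold
        if bound(xis, costs) < threshold:
            break
    return mu, Sigma
```

Any update rule fits in Step 2; the test below plugs in a simple elite-fraction (cross-entropy-style [16]) update on a toy quadratic cost.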
Assume that the system dynamics are given by\n\nx_{k+1} = fk(xk, uk, wk),  subject to  gk(xk, uk) ≤ 0, gN(xN) ≤ 0,\n\nwhere fk and gk correspond either to the physical plant, to an analytical model, or to a “white-box” high-fidelity physics-engine update step. The term wk denotes process noise. Equivalently, such a formulation induces the process model density p(x_{k+1}|xk, uk). In addition, consider the cost\n\nJ(x_{0:N}, u_{0:N−1}) ≜ Σ_{k=0}^{N−1} Lk(xk, uk) + LN(xN),\n\nwhere x_{0:N} ≜ {x0, . . . , xN} denotes the complete trajectory and Lk are given nonlinear functions. Our goal is to design feedback control policies to optimize the expected value of J. For simplicity, we will assume perfect measurements, although this does not impose a limitation on the approach.\nAssume that any decision variables in the problem (such as feedforward or feedback gains, obstacle avoidance terms, mode switching variables) are encoded using a finite-dimensional vector ξ ∈ R^{n_ξ} and define the control law uk = Φk(xk)ξ using basis functions Φk(x) ∈ R^{m×n_ξ} for all k = 0, . . . , N − 1. This representation captures both static feedback control laws and time-varying optimal control laws of the form uk = u∗_k + K^{LQR}_k (xk − x∗_k), where u∗_k = B(tk)ξ is an optimized feedforward control (parametrized using basis functions B(t) ∈ R^{m×z} such as B-splines) and K^{LQR}_k is the optimal feedback gain matrix of the LQR problem based on the linearized dynamics and second-order cost expansion around the optimized nominal reference trajectory x∗, i.e. such that x∗_{k+1} = fk(x∗_k, u∗_k, 0).\nThe complete trajectory of the system is denoted by the random variable τ = (x_{0:N}, u_{0:N−1}) and has density p(τ|ξ) = p(x0) Π_{k=0}^{N−1} p(x_{k+1}|xk, uk) δ(uk − Φk(xk)ξ), where δ(·) is the Dirac delta. The trajectory constraint takes the form {g(τ) ≤ 0} ≜ ∧_{k=0}^{N−1} {gk(xk, uk) ≤ 0} ∧ {gN(xN) ≤ 0}.\n\nA simple example. Consider a point-mass robot modeled as a double-integrator system with state x = (p, v), where p ∈ R^d denotes position and v ∈ R^d denotes velocity, with d = 2 for planar workspaces and d = 3 for 3-D workspaces. The dynamics is given, for Δt = T/N, by\n\np_{k+1} = pk + Δt vk + ½ Δt² (uk + wk),    v_{k+1} = vk + Δt (uk + wk),\n\nwhere uk are the applied controls and wk is zero-mean white noise. Imagine that the constraint gk(x, u) ≤ 0 defines circular obstacles O ⊂ R^d and control norm bounds, i.e.\n\nr_o − ‖p − p_o‖ ≤ 0,    ‖u‖ ≤ u_max,\n\nwhere r_o is the radius of an obstacle at position p_o ∈ R^d. The cost J could be arbitrary, but a typical choice is L(x, u) = ½‖u‖²_R + q(x), where R > 0 is a given matrix and q(x) is a nonlinear function defining a task. The final cost could force the system towards a goal state x_f ∈ R^n (or a region X_f ⊂ R^n) and could be defined according to LN(x) = ½‖x − x_f‖²_{Q_f} for some given matrix Q_f ≥ 0. For such simple systems one can choose a smooth feedback control law uk = Φk(x)ξ with static positive gains ξ = (kp, kd, ko) ∈ R³ and basis function\n\nΦ(x) = [ p_f − p    v_f − v    ϕ(x, O) ],\n\nwhere ϕ(x, O) is an obstacle-avoidance force, e.g. 
defined as the gradient of a potential field or as a gyroscopic “steering” force ϕ(x, O) = s(x, O) × v that effectively rotates the velocity vector [22]. Alternatively, one could employ a time-varying optimal control law as described in §3.\n\n4 PAC Bounds for Iterative Policy Adaptation\nWe next compute probabilistic bounds on the expected cost J(ν) resulting from the execution of a new stochastic policy with hyper-parameters ν using observed samples from previous policies ν0, ν1, . . . . The bound is agnostic to how the policy is updated (i.e. Step 2 in the ISPO algorithm).\n\n4.1 A concentration-of-measure inequality for policy adaptation\nThe stochastic optimization setting naturally allows the use of a prior belief ξ ∼ π(·|ν0) on what good control laws could be, for some known ν0 ∈ V. After observing M executions based on such a prior, we wish to find a new improved policy π(·|ν) which optimizes the cost\n\nJ(ν) ≜ E_{τ,ξ∼p(·|ν)}[J(τ)] = E_{τ,ξ∼p(·|ν0)}[ J(τ) π(ξ|ν)/π(ξ|ν0) ],    (2)\n\nwhich can be approximated, using samples ξj ∼ π(ξ|ν0) and τj ∼ p(τ|ξj), by the empirical cost\n\n(1/M) Σ_{j=1}^{M} J(τj) π(ξj|ν)/π(ξj|ν0).    (3)\n\nThe goal is to compute the parameters ν using the sampled decision variables ξj and the corresponding observed costs J(τj). Obtaining practical bounds for J(ν) becomes challenging since the change-of-measure likelihood ratio π(ξ|ν)/π(ξ|ν0) can be unbounded (or have very large values) [23], and a standard bound, e.g. 
such as Hoeffding’s or Bernstein’s, becomes impractical or impossible to apply. To cope with this we will employ a recently proposed robust estimation technique [24] stipulating that, instead of estimating the expectation m = E[X] of a random variable X ∈ [0, ∞) using its empirical mean m̂ = (1/M) Σ_{j=1}^{M} Xj, a more robust estimate can be obtained by truncating its higher moments, i.e. using m̂_α ≜ (1/(αM)) Σ_{j=1}^{M} ψ(αXj) for some α > 0, where\n\nψ(x) = log(1 + x + ½x²).    (4)\n\nWhat makes this possible is the key assumption that the (possibly unbounded) random variable must have a bounded second moment. We employ this idea to deal with the unboundedness of the policy adaptation ratio by showing that in fact its second moment can be bounded and corresponds to an information distance between the current and previous stochastic policies.\nTo obtain sharp bounds, though, it is useful to employ samples over multiple iterations of the ISPO algorithm, i.e. from the policies ν0, ν1, . . . , ν_{L−1} computed in previous iterations. To simplify notation, let z = (τ, ξ) and define ℓ_i(z, ν) ≜ J(τ) π(ξ|ν)/π(ξ|ν_i). The cost (2) of executing ν can now be equivalently expressed as\n\nJ(ν) ≡ (1/L) Σ_{i=0}^{L−1} E_{z∼p(·|ν_i)} ℓ_i(z, ν),\n\nusing the computed policies of previous iterations i = 0, . . . , L − 1. We next state the main result:\nProposition 1. 
With probability 1 − δ the expected cost of executing a stochastic policy with parameters ξ ∼ π(·|ν) is bounded according to:\n\nJ(ν) ≤ inf_{α>0} { Ĵ_α(ν) + (α/(2L)) Σ_{i=0}^{L−1} b_i² e^{D₂(π(·|ν)||π(·|ν_i))} + (1/(αLM)) log(1/δ) },    (5)\n\nwhere Ĵ_α(ν) denotes a robust estimator defined by\n\nĴ_α(ν) ≜ (1/(αLM)) Σ_{i=0}^{L−1} Σ_{j=1}^{M} ψ(α ℓ_i(z_{ij}, ν)),\n\ncomputed after L iterations, with M samples z_{i1}, . . . , z_{iM} ∼ p(·|ν_i) obtained at iterations i = 0, . . . , L − 1, and where D_β(p||q) denotes the Rényi divergence between p and q, defined by\n\nD_β(p||q) = (1/(β − 1)) log ∫ p^β(x)/q^{β−1}(x) dx.\n\nThe constants b_i are such that 0 ≤ J(τ) ≤ b_i at each iteration i = 0, . . . , L − 1.\n\nProof. 
The bound is obtained by relating the mean to its robust estimate according to\n\nP( LM(J(ν) − Ĵ_α(ν)) ≥ t ) = P( e^{αLM(J(ν) − Ĵ_α(ν))} ≥ e^{αt} )\n≤ e^{−αt} E[ e^{αLM(J(ν) − Ĵ_α(ν))} ]    (6)\n= e^{−αt + αLM J(ν)} E[ e^{Σ_{i=0}^{L−1} Σ_{j=1}^{M} −ψ(α ℓ_i(z_{ij}, ν))} ]\n= e^{−αt + αLM J(ν)} Π_{i=0}^{L−1} Π_{j=1}^{M} E[ e^{−ψ(α ℓ_i(z_{ij}, ν))} ]\n≤ e^{−αt + αLM J(ν)} Π_{i=0}^{L−1} Π_{j=1}^{M} E_{z∼p(·|ν_i)}[ 1 − α ℓ_i(z, ν) + (α²/2) ℓ_i(z, ν)² ]    (7)\n= e^{−αt + αLM J(ν)} Π_{i=0}^{L−1} Π_{j=1}^{M} ( 1 − α J(ν) + (α²/2) E_{z∼p(·|ν_i)}[ℓ_i(z, ν)²] )\n≤ e^{−αt + αLM J(ν)} Π_{i=0}^{L−1} Π_{j=1}^{M} e^{−α J(ν) + (α²/2) E_{z∼p(·|ν_i)}[ℓ_i(z, ν)²]}    (8)\n= e^{−αt + (Mα²/2) Σ_{i=0}^{L−1} E_{z∼p(·|ν_i)}[ℓ_i(z, ν)²]},\n\nusing Markov’s inequality to obtain (6), the inequality ψ(x) ≥ −log(1 − x + ½x²) in (7), and 1 + x ≤ e^x in (8); setting the final right-hand side equal to δ and solving for t yields (5). Here, we adapted the moment-truncation technique proposed by Catoni [24] for general unbounded losses to our policy adaptation setting in order to handle the possibly unbounded likelihood ratio. 
These results are then combined with\n\nE[ℓ_i(z, ν)²] ≤ b_i² E_{ξ∼π(·|ν_i)}[ π(ξ|ν)²/π(ξ|ν_i)² ] = b_i² e^{D₂(π(·|ν)||π(·|ν_i))},\n\nwhere the relationship between the likelihood-ratio variance and the Rényi divergence was established in [23].\n\nNote that the Rényi divergence can be regarded as a distance between two distributions and can be computed in closed form for various distributions such as the exponential families; it is also closely related to the Kullback-Leibler (KL) divergence, i.e. D₁(p||q) = KL(p||q).\n\n4.2 Illustration using simple robot navigation\nWe next illustrate the application of these bounds using the simple scenario introduced in §3. The stochasticity is modeled using a Gaussian density on the initial state p(x0), on the disturbances wk, and on the goal state x_f. Iterative policy optimization is performed using a stochastic model π(ξ|ν) encoding a multivariate Gaussian, i.e.\n\nπ(ξ|ν) = N(ξ|μ, Σ),\n\nwhich is updated through reward-weighted regression (RWR) [3], i.e. in Step 2 of the ISPO algorithm we take M samples, observe their costs, and update the parameters according to\n\nμ = Σ_{j=1}^{M} w̄(τj) ξj,    Σ = Σ_{j=1}^{M} w̄(τj)(ξj − μ)(ξj − μ)ᵀ,    (9)\n\nusing the tilting weights w(τ) = e^{−βJ(τ)} for some adaptively chosen β > 0, where w̄(τj) ≜ w(τj)/Σ_{ℓ=1}^{M} w(τℓ) are the normalized weights.\nAt each iteration i one can compute a bound on the expected cost using the previously computed ν0, . . . , ν_{i−1}. 
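The bound computation of Proposition 1 can be sketched for the Gaussian-policy case, using the closed-form order-2 Rényi divergence from §4.2. This is a minimal illustration under our own helper names (`psi`, `renyi2_gaussian`, `pac_bound` are not from the paper), with the infimum over α approximated by a grid search.

```python
# Sketch of the bound of Proposition 1, eq. (5), for Gaussian policies.
# Helper names are ours; the inf over alpha is approximated on a log grid.
import numpy as np

def psi(x):
    # Catoni-style truncation psi(x) = log(1 + x + x^2/2), eq. (4)
    return np.log1p(x + 0.5 * x * x)

def renyi2_gaussian(mu0, S0, mu1, S1):
    # D_2(N(mu0,S0) || N(mu1,S1)); with Sigma_beta = (1-beta) S0 + beta S1 and
    # beta = 2 this requires 2*S1 - S0 to be positive definite.
    Sb = 2.0 * S1 - S0
    d = mu1 - mu0
    ld = lambda S: np.linalg.slogdet(S)[1]
    return d @ np.linalg.solve(Sb, d) - 0.5 * (ld(Sb) + ld(S0) - 2.0 * ld(S1))

def pac_bound(cost_batches, ratio_batches, bs, D2s, delta=0.05):
    # cost_batches[i][j] = J(tau_ij); ratio_batches[i][j] = pi(xi_ij|nu)/pi(xi_ij|nu_i)
    L, M = len(cost_batches), len(cost_batches[0])
    ell = np.concatenate([c * r for c, r in zip(cost_batches, ratio_batches)])
    penalty = sum(b * b * np.exp(D2) for b, D2 in zip(bs, D2s))
    best = np.inf
    for alpha in np.logspace(-3.0, 1.0, 200):
        J_hat = psi(alpha * ell).sum() / (alpha * L * M)       # robust estimator
        best = min(best, J_hat + alpha * penalty / (2 * L)      # divergence term
                   + np.log(1 / delta) / (alpha * L * M))       # confidence term
    return float(best)
```

With ν = ν0 (unit likelihood ratios, D₂ = 0) the bound reduces to the empirical mean plus an O(1/√(LM)) confidence margin, which is the behavior observed empirically in Figure 1.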
We have computed such bounds using (5) for both the expected cost and the probability of collision, denoted respectively by J+ and P+, using M = 200 samples at each iteration (Figure 1).\n\nFigure 1: Robot navigation scenario based on iterative policy improvement and resulting predicted performance (snapshots at iterations #1, #4, #9, #28): a) evolution of the density p(ξ|ν) over the decision variables (in this case the control gains); b) cost function J and its computed upper bound J+ for future executions; c) analogous bounds on probability-of-collision P; snapshots of sampled trajectories (top). Note that the initial policy results in ≈ 30% collisions. Surprisingly, the standard empirical and robust estimates are nearly identical.\n\nWe used a window of at most L = 10 previous iterations to compute the bounds, i.e. to compute νi+1 all samples from the densities ν_{i−L+1}, ν_{i−L+2}, . . . , νi were used. Remarkably, using our robust statistics approach the resulting bound eventually becomes close to the standard empirical estimate Ĵ. The collision probability bound P+ decreases to less than 10%, which could be further improved by employing more samples and more iterations. The significance of these bounds is that one can stop the optimization (regarded as training) at any time and predict expected performance in future executions of the newly updated policy before actually executing it, i.e. 
using the samples from the previous iteration.\nFinally, the Rényi divergence term used in these computations takes the simple form\n\nD_β(N(·|μ0, Σ0) ‖ N(·|μ1, Σ1)) = (β/2) ‖μ1 − μ0‖²_{Σ_β^{−1}} + (1/(2(1 − β))) log( |Σ_β| / (|Σ0|^{1−β}|Σ1|^β) ),\n\nwhere Σ_β = (1 − β)Σ0 + βΣ1.\n\n4.3 Policy Optimization Methods\nWe do not impose any restrictions on the specific method used for optimizing the policy π(ξ|ν). When complex constraints are present, such computation will involve a global motion planning step combined with local feedback control laws (we show such an example in §5). The approach can be used either to analyze such policies computed using any method of choice or to derive new algorithms based on minimizing the right-hand side of the bound. The method also applies to model-free learning. For instance, related to recent methods in robotics, one could use reward-weighted regression (RWR) or policy learning by weighting exploration with the returns (PoWER) [3], stochastic optimization methods such as [25, 26], or the related cross-entropy optimization [16, 27].\n\n5 Application to Aerial Vehicle Navigation\nConsider an aerial vehicle such as a quadrotor navigating at high speed through a cluttered environment. We are interested in minimizing a cost metric related to the total time taken and the control effort required to reach a desired goal state, while maintaining a low probability of collision. 
We employ an experimentally identified model of an AscTec quadrotor (Figure 2) with 12-dimensional state space X = SE(3) × R⁶ with state x = (p, R, ṗ, ω), where p ∈ R³ is the position, R ∈ SO(3) is the rotation matrix, and ω ∈ R³ is the body-fixed angular velocity. The vehicle is controlled with inputs u = (F, M) ∈ R⁴ including the lift force F ≥ 0 and torque moments M ∈ R³. The dynamics is\n\nṘ = R ω̂,    (10)\nJ ω̇ = Jω × ω + M,    (11)\nm p̈ = R e₃ F + m g + δ(p, ṗ),    (12)\n\nwhere m is the mass, J the inertia tensor, e₃ = (0, 0, 1), and the matrix ω̂ is such that ω̂η = ω × η for any η ∈ R³. The system is subject to initial localization errors and also to random disturbances, e.g. due to wind gusts and wall effects, defined as stochastic forces δ(p, ṗ) ∈ R³. Each component of δ is zero-mean and has a standard deviation of 3 Newtons, for a vehicle with mass m ≈ 1 kg.\nThe objective is to navigate through a given urban environment at high speed to a desired goal state. We employ a two-stage approach consisting of an A*-based global planner which produces a sequence of local sub-goals that the vehicle must pass through. A standard nonlinear feedback backstepping controller based on a “slow” position control loop and a “fast” attitude control loop is employed [28, 29] for local control. In addition, an obstacle avoidance controller is added to avoid collisions, since the vehicle is not expected to exactly follow the A* path. At each iteration M = 200 samples are taken with 1 − δ = 0.95 confidence level. A window of L = 5 past iterations was used for the bounds. The control density π(ξ|ν) is a single Gaussian as specified in §4.2. 
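The per-iteration Gaussian policy update used here, reward-weighted regression (eq. (9) in §4.2), can be sketched directly. This is a minimal sketch with a fixed tilting temperature β, whereas the paper chooses β adaptively; the helper name `rwr_update` is ours.

```python
# Sketch of the reward-weighted regression update of eq. (9) for the single
# Gaussian policy pi(xi|nu) = N(mu, Sigma); beta is the tilting temperature
# (fixed here for illustration, chosen adaptively in the paper).
import numpy as np

def rwr_update(xis, costs, beta=1.0, jitter=1e-9):
    # Tilting weights w(tau_j) = exp(-beta J(tau_j)); subtracting the minimum
    # cost only improves numerical stability (normalization cancels the shift).
    w = np.exp(-beta * (costs - costs.min()))
    w /= w.sum()                                   # normalized weights bar-w
    mu = w @ xis                                   # mu = sum_j bar-w_j xi_j
    diff = xis - mu
    Sigma = (w[:, None] * diff).T @ diff           # weighted scatter matrix
    return mu, Sigma + jitter * np.eye(xis.shape[1])
```

Low-cost samples receive exponentially larger weight, so the updated mean μ is pulled toward the better-performing gain vectors while Σ contracts around them.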
The most sensitive gains in the controller are the position proportional and derivative terms, and the obstacle gains, denoted by kp, kd, and ko, which we examine in the following scenarios:\na) fixed goal, wind gust disturbances, virtual environment: the system is first tested in a cluttered simulated environment (Figure 2). The simulated vehicle travels at an average velocity of 20 m/s (see video in Supplement) and initially experiences more than 50% collisions. After a few iterations the total cost stabilizes and the probability of collision reduces to around 15%. The bound is close to the empirical estimate, which indicates that it can be tight if more samples are taken. The collision probability bound is still too high to be practical, but our goal was only to illustrate the bound behavior. It is also likely that our chosen control strategy is in fact not suitable for high-speed traversal of such tight environments.\nb) sparser campus-like environment, randomly sampled goals: a more general evaluation was performed by adding the goal location to the stochastic problem parameters so that the bound will apply to any future desired goal in that environment (Figure 3). The algorithm converges to similar values as before, but this time the collision probability is smaller due to the more expansive environment. In both cases, the bounds could be reduced further by employing more than M = 200 samples or by reusing more samples from previous runs according to Proposition 1.\n\n6 Conclusion\nThis paper considered stochastic decision problems and focused on probably-approximate bounds on the robustness of the computed decision variables. We showed how to derive bounds for fixed policies in order to predict future performance and/or constraint violation. These results could then be employed for obtaining generalization PAC bounds, e.g. 
through a PAC-Bayesian approach which could be consistent with the proposed notion of policy priors and policy adaptation. Future work will develop concrete algorithms by directly optimizing such PAC bounds, which are expected to have built-in robustness properties.\n\nFigure 2: Aerial vehicle navigation using a simulated nonlinear quadrotor model (top: AscTec Pelican, A* waypoint path, simulated quadrotor motion; snapshots at iterations #1, #5, #17). Iterative stochastic policy optimization iterations (a, b, c) analogous to those given in Figure 1. Note that the initial policy results in over 50% collisions, which is reduced to less than 10% after a few policy iterations.\n\nFigure 3: Analogous plot to Figure 2 but for a typical campus environment using uniformly-at-random sampled goal states along the northern boundary (snapshots at iterations #1, #4, #10). The vehicle must fly below 100 feet and is not allowed to fly above buildings. This is a larger, less constrained environment, resulting in fewer collisions.\n\nReferences\n[1] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 1999.\n[2] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.\n[3] M. P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. pages 388–403, 2013.\n[4] S. Schaal and C. Atkeson. Learning control in robotics. 
Robotics Automation Magazine, IEEE, 17(2):20–29, June 2010.\n[5] Alberto Bemporad and Manfred Morari. Robust model predictive control: A survey. In A. Garulli and A. Tesi, editors, Robustness in Identification and Control, volume 245 of Lecture Notes in Control and Information Sciences, pages 207–226. Springer London, 1999.\n[6] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.\n[7] David A. McAllester. PAC-Bayesian stochastic model selection. Mach. Learn., 51:5–21, April 2003.\n[8] J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(1):273–306, 2005.\n[9] Stéphane Boucheron, Gábor Lugosi, Pascal Massart, and Michel Ledoux. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, 2013.\n[10] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, October 2001.\n[11] Laura Ryan Ray and Robert F. Stengel. A Monte Carlo approach to the analysis of control system robustness. Automatica, 29(1):229–236, January 1993.\n[12] Qian Wang and Robert F. Stengel. Probabilistic control of nonlinear uncertain systems. In Giuseppe Calafiore and Fabrizio Dabbene, editors, Probabilistic and Randomized Methods for Design under Uncertainty, pages 381–414. Springer London, 2006.\n[13] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer, 2004.\n[14] V. Koltchinskii, C. T. Abdallah, M. Ariola, and P. Dorato. Statistical learning control of uncertain systems: theory and algorithms. Applied Mathematics and Computation, 120(1–3):31–43, 2001. Special issue: The Bellman Continuum.\n[15] M. Vidyasagar and Rajeeva L. Karandikar. 
A learning theory approach to system identification and stochastic adaptive control. Journal of Process Control, 18(3–4):421–430, 2008. Festschrift honouring Professor Dale Seborg.\n[16] Reuven Y. Rubinstein and Dirk P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization. Springer, 2004.\n[17] Anatoly Zhigljavsky and Antanas Žilinskas. Stochastic Global Optimization. Springer, 2008.\n[18] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. J. Mach. Learn. Res., 13:1809–1837, June 2012.\n[19] Christian Igel, Nikolaus Hansen, and Stefan Roth. Covariance matrix adaptation for multi-objective optimization. Evol. Comput., 15(1):1–28, March 2007.\n[20] Pedro Larrañaga and Jose A. Lozano, editors. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, 2002.\n[21] Martin Pelikan, David E. Goldberg, and Fernando G. Lobo. A survey of optimization by building and using probabilistic models. Comput. Optim. Appl., 21:5–20, January 2002.\n[22] Howie Choset, Kevin M. Lynch, Seth Hutchinson, George A. Kantor, Wolfram Burgard, Lydia E. Kavraki, and Sebastian Thrun. Principles of Robot Motion: Theory, Algorithms, and Implementations. MIT Press, June 2005.\n[23] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems 23, 2010.\n[24] Olivier Catoni. Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. H. Poincaré Probab. Statist., 48(4):1148–1185, 2012.\n[25] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137–3181, 2010.\n[26] Sergey Levine and Pieter Abbeel. 
Learning neural network policies with guided policy search under\n\nunknown dynamics. In Neural Information Processing Systems (NIPS), 2014.\n\n[27] M. Kobilarov. Cross-entropy motion planning. International Journal of Robotics Research, 31(7):855\u2013\n\n871, 2012.\n\n[28] Robert Mahony and Tarek Hamel. Robust trajectory tracking for a scale model autonomous helicopter.\n\nInternational Journal of Robust and Nonlinear Control, 14(12):1035\u20131059, 2004.\n\n[29] Marin Kobilarov. Trajectory tracking of a class of underactuated systems with external disturbances. In\n\nAmerican Control Conference, pages 1044\u20131049, 2013.\n\n9\n\n\f", "award": [], "sourceid": 1745, "authors": [{"given_name": "Marin", "family_name": "Kobilarov", "institution": "Johns Hopkins University"}]}