{"title": "Constrained Cross-Entropy Method for Safe Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7450, "page_last": 7460, "abstract": "We study a safe reinforcement learning problem in which the constraints are defined as the expected cost over finite-length trajectories. We propose a constrained cross-entropy-based method to solve this problem. The method explicitly tracks its performance with respect to constraint satisfaction and thus is well-suited for safety-critical applications. We show that the asymptotic behavior of the proposed algorithm can be almost-surely described by that of an ordinary differential equation. Then we give sufficient conditions on the properties of this differential equation to guarantee the convergence of the proposed algorithm. Finally, we show with simulation experiments that the proposed algorithm can effectively learn feasible policies without assumptions on the feasibility of initial policies, even with non-Markovian objective functions and constraint functions.", "full_text": "Constrained Cross-Entropy Method for Safe Reinforcement Learning

Min Wen
Department of Electrical and Systems Engineering
University of Pennsylvania
wenm@seas.upenn.edu

Ufuk Topcu
Department of Aerospace Engineering and Engineering Mechanics
University of Texas at Austin
utopcu@utexas.edu

Abstract

We study a safe reinforcement learning problem in which the constraints are defined as the expected cost over finite-length trajectories. We propose a constrained cross-entropy-based method to solve this problem. The method explicitly tracks its performance with respect to constraint satisfaction and thus is well-suited for safety-critical applications. We show that the asymptotic behavior of the proposed algorithm can be almost-surely described by that of an ordinary differential equation. 
Then we give sufficient conditions on the properties of this differential equation to guarantee the convergence of the proposed algorithm. Finally, we show with simulation experiments that the proposed algorithm can effectively learn feasible policies without assumptions on the feasibility of initial policies, even with non-Markovian objective functions and constraint functions.

1 Introduction

This paper studies the following constrained optimal control problem: given a dynamical system model with continuous states and actions, an objective function and a constraint function, find a controller that maximizes the objective function while satisfying the constraint. Although this topic has been studied for decades within the control community [3], it is still challenging for practical problems. To illustrate some major difficulties, consider the synthesis of a policy for a nonholonomic mobile robot to reach a goal while avoiding obstacles (which introduces constraints) in a cost-efficient way (which induces an objective). The obstacle-free state space is usually nonconvex. The equations of the dynamical system model are typically highly nonlinear. Constraint functions and cost functions may not be convex or differentiable in the state and action variables. There may even be hidden variables that are not observable and make transitions and costs non-Markovian. Given all these difficulties, we still need to compute a policy that is at least feasible and improves the cost objective as much as possible.

Reinforcement learning (RL) methods have been widely used to learn optimal policies for agents with complicated or even unknown dynamics. For problems with continuous state and action spaces, the agent's policy is usually modeled as a parameterized function of states, such as a deep neural network, and trained using policy gradient methods [35; 30; 27; 28; 21; 8; 29]. 
By encoding control tasks as reward or cost functions, RL has successfully solved a wide range of tasks such as Atari games [19; 20], the game of Go [31; 32], and controlling simulated robots [36; 24] and real robots [15; 37; 22].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Most of the existing methods for RL solve only unconstrained problems. However, it is generally non-trivial to transform a constrained optimal control problem into an unconstrained one, due to the asymmetry between the goals of objective optimization and constraint satisfaction. On the one hand, it is usually acceptable to output a policy that is only locally optimal with respect to the optimization objective. On the other hand, in many application scenarios where constraints encode safety requirements or the amount of available resources, violating the constraint even by a small amount may have significant consequences.

Existing methods for safe reinforcement learning that are based on policy gradient methods cannot guarantee strict feasibility of the policies they output, even when initialized with feasible initial policies. When initialized with an infeasible policy, they are usually not able to find even a single feasible policy until convergence (see an example in Section 5). These limitations motivate the following question: Can we develop a reinforcement learning algorithm that explicitly addresses the priority of constraint satisfaction? 
Rather than assuming that the initial policy is feasible and that one can always find a feasible policy in the estimated gradient direction, we need to deal with cases in which the initial policy is not feasible, or we have never seen a feasible policy before.

Inspired by stochastic optimization methods based on the cross-entropy (CE) concept [11], we propose a new safe reinforcement learning algorithm, which we call the constrained cross-entropy (CCE) method. The basic framework is the same as in standard CE methods: in each iteration, we sample from a distribution of policies, select a set of elite sample policies and use them to update the policy distribution. Rather than treating the constraints as an extra term in the objective function, as policy gradient methods do, we use constraint values to sort sample policies. If there are not enough feasible sample policies, we select only those with the best constraint performance as elite sample policies. If a given proportion of the sample policies is feasible, we select the feasible sample policies with the best objective values as elite sample policies. Instead of initializing the optimization with a feasible policy, the method improves both the objective function and the constraint function, with the constraint as the prioritized concern.

Our algorithm can be used as a black-box optimizer. It does not even assume that there is an underlying reward or cost function encoding the optimization objective and constraint functions. In fact, the algorithm can be applied to any finite-horizon problem (say, with horizon N) whose objective and constraint functions are defined as the average performance over some distribution of trajectories. 
For example, a constraint function can be the probability that the agent satisfies a given task specification (which may be Markovian or non-Markovian) with policy π_θ, if the satisfaction of the given task can be decided from any N-step trajectory. An optimization objective may be the expected number of steps before the agent reaches a goal state, the expected maximum distance the agent has traveled from its origin, or the expected minimum distance between the agent and any obstacle over the whole trajectory.

Our contributions are as follows. First, we present a model-free constrained RL algorithm that works with continuous state and action spaces. Second, we prove that the asymptotic behavior of our algorithm can be almost-surely described by that of an ordinary differential equation (ODE), which is easily interpretable with respect to the objectives. Third, we give sufficient conditions on the properties of this ODE to guarantee the convergence of our algorithm. We show with numerical experiments that our algorithm converges to feasible policies in all our experiments, with all combinations of feasible or infeasible initial policies and Markovian or non-Markovian objectives and constraints, while other policy-gradient-based algorithms fail to find strictly feasible solutions.

2 Related Work

Safety has long been a concern in the RL literature and has been formalized through various criteria [7]. We take the so-called constrained criterion [7] to encode our safety requirement, which is the same as in the literature on constrained Markov decision processes (CMDPs) [2]. Approaches for safe RL with continuous state and action spaces are still limited. Uchibe and Doya [34] proposed a constrained policy gradient reinforcement learning algorithm, which relies on projected gradients to maintain feasibility. 
The projection computation restricts the types of constraints it can deal with, and there is no known guarantee on convergence. Chow et al. [4] proposed a trajectory-based primal-dual subgradient algorithm for a risk-constrained RL problem with finite state and action spaces. The algorithm is proved to converge almost-surely to a local saddle point. However, the constraints are only implicitly considered by updating dual variables, and the output policy may not actually satisfy the constraints. Recently, Achiam et al. [1] proposed a trust region method for CMDPs called constrained policy optimization (CPO), which can deal with high-dimensional policy classes such as neural networks and claims to maintain feasibility if started with a feasible solution. However, we found in Section 5 that feasibility is rarely guaranteed during learning in practice, possibly due to errors in gradient and Hessian matrix estimation.

Cross-entropy-based stochastic optimization techniques have been applied to a series of RL and optimal control problems. Mannor et al. [18] used cross-entropy methods to solve a stochastic shortest-path problem on finite Markov decision processes, which is essentially an unconstrained problem. Szita and Lörincz [33] used a noisy variant to learn how to play Tetris. Kobilarov [14] introduced a similar technique for motion planning in constrained continuous-state environments by considering distributions over collision-free trajectories. Livingston et al. [17] generalized this method to deal with a broader class of trajectory-based constraints called linear temporal logic specifications. Both methods simply discard all sample trajectories that violate the given constraints, and thus can be considered as special cases of our work in which the constraint function has binary outputs. 
Similar applications in approximate optimal control with constraints can be found in [23; 6; 16].

3 Preliminaries

For a set B, let D(B) be the set of all probability distributions over B, int(B) be the interior of B, and B^k := {s_0, s_1, ..., s_{k−1} | s_t ∈ B, ∀t = 0, ..., k − 1} be the set of all length-k sequences of elements of B, for any k ∈ ℕ₊.

A (reward-free) Markov decision process (MDP) is defined as a tuple (S, A, T, P_0), where S is a set of states, A is a set of actions, T : S × A → D(S) is a transition distribution function and P_0 ∈ D(S) is an initial state distribution. Let Π : S → D(A) be the set of all stationary policies. Given a finite horizon N, an N-step trajectory is a sequence of N state-action pairs. Each stationary policy π ∈ Π induces a distribution over N-step trajectories such that the probability of drawing a trajectory τ = s_0, a_0, ..., s_{N−1}, a_{N−1} is

P_{π,N}(τ) = P_0(s_0) ∏_{t=0}^{N−2} T(s_{t+1} | s_t, a_t) ∏_{t=0}^{N−1} π(a_t | s_t).

Without loss of generality, we assume that N is fixed and use P_π to represent P_{π,N}.

An objective function J : (S × A)^N → ℝ is a mapping from each N-step trajectory to a scalar value. For each π ∈ Π, let

G_J(π) := E_{τ∼P_π}[J(τ)]

be the expected value of J under the N-step trajectory distribution induced by π. A policy π ∈ Π is an optimal policy in Π with respect to J if G_J(π) = max_{π′∈Π} G_J(π′).

A cost function Z : (S × A)^N → ℝ is also a function defined on N-step trajectories. Let

H_Z(π) := E_{τ∼P_π}[Z(τ)]

be the expected cost over the trajectory distribution P_π. 
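In a sampling-based implementation, the expectations G(π) and H(π) are estimated by averaging the trajectory functionals J and Z over rollouts. The following is a minimal sketch; the environment interface (`step`, `init_state`), the policy and the functionals below are illustrative placeholders, not part of the paper:

```python
def rollout(policy, step, init_state, N):
    """Generate one N-step trajectory [(s_0, a_0), ..., (s_{N-1}, a_{N-1})]."""
    s, traj = init_state(), []
    for _ in range(N):
        a = policy(s)
        traj.append((s, a))
        s = step(s, a)
    return traj

def estimate_G_H(policy, step, init_state, J, Z, N, n_rollouts):
    """Monte Carlo estimates of G(pi) = E[J(tau)] and H(pi) = E[Z(tau)]."""
    G_hat = H_hat = 0.0
    for _ in range(n_rollouts):
        tau = rollout(policy, step, init_state, N)
        G_hat += J(tau) / n_rollouts
        H_hat += Z(tau) / n_rollouts
    return G_hat, H_hat
```

Note that J and Z take the whole trajectory as input, so non-Markovian functionals (e.g., "the agent visited the goal at some point") fit this interface directly.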
A policy π ∈ Π is feasible for a constrained optimization problem with cost function Z and constraint upper bound d if H_Z(π) ≤ d. Let Π_{Z,d} be the set of all feasible policies.

For notational simplicity, we omit J and Z in G_J and H_Z whenever there is no ambiguity. For any policy π ∈ Π, we refer to G(π) and H(π) as the G-value and H-value of π.

4 Constrained Cross-Entropy Framework

4.1 Problem Formulation

In this paper, we consider a finite-horizon RL problem with a strictly positive objective function J : (S × A)^N → ℝ₊, a cost function Z : (S × A)^N → ℝ and a constraint upper bound d. For MDPs with continuous state and action spaces, it is usually intractable to compute an optimal stationary policy exactly due to the curse of dimensionality. An alternative is to use function approximators, such as neural networks, to parameterize a subset of policies. Given a parameterized class of policies Π_Θ with a parameter space Θ ⊆ ℝ^{d_θ}, we aim to solve the following problem:

π* = arg max_{π ∈ Π_Θ ∩ Π_{Z,d}} G_J(π).

The proposed algorithm, which we call the constrained cross-entropy method, generalizes the well-known cross-entropy method [18] for unconstrained optimization. The basic idea is to generate a sequence of policy distributions that eventually concentrates on a feasible (locally) optimal policy. Given a distribution over Π_Θ, we randomly generate a set of sample policies, sort them with a ranking function that depends on their G-values and H-values, and then update the policy distribution with a subset of high-ranking sample policies.

Given the policy parameterization Π_Θ, we use distributions over the parameter space Θ to represent distributions over the policy space Π_Θ. 
We focus on a specific family of distributions over Θ called the natural exponential family (NEF), which includes many useful distributions such as the Gaussian and Gamma distributions. A formal definition of NEF is as follows.

Definition 4.1. A parameterized family F_V = {f_v ∈ D(Θ), v ∈ V ⊆ ℝ^{d_v}} is called a natural exponential family if there exist continuous mappings Γ : ℝ^{d_θ} → ℝ^{d_v} and K : ℝ^{d_v} → ℝ such that

f_v(θ) = exp(v^⊤ Γ(θ) − K(v)),

where V ⊆ {v ∈ ℝ^{d_v} : |K(v)| < ∞} is the natural parameter space and K(v) = log ∫_Θ exp(v^⊤ Γ(θ)) dθ.

As with other CE-based algorithms, we replace the original objective G(π_θ) = E_{τ∼P_{π_θ}}[J(τ)] with a surrogate function. For the unconstrained CE method, the surrogate function is the conditional expectation over policies whose G-values are highly ranked under the current sampling distribution f_v. The ranking function is defined using the concept of ρ-quantiles of random variables, formally defined as below.

Definition 4.2. [10] Given a distribution P ∈ D(ℝ), ρ ∈ (0, 1) and a random variable X ∼ P, the ρ-quantile of X is defined as a scalar γ such that Pr(X ≤ γ) ≥ ρ and Pr(X ≥ γ) ≥ 1 − ρ.

For ρ ∈ (0, 1), v ∈ V and any function X : Θ → ℝ, we denote the ρ-quantile of X for θ ∼ f_v by ξ_X(ρ, v). We also define δ : ℝ × {≥, ≤, >, <, =} × ℝ → {0, 1} as an indicator function such that for ◦ ∈ {≥, ≤, >, <, =}, δ(x ◦ y) = 1 if and only if x ◦ y holds. 
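As a concrete instance (our illustration, not from the paper): a univariate Gaussian N(μ, σ²) is an NEF with sufficient statistic Γ(θ) = (θ, θ²) and natural parameter v = (μ/σ², −1/(2σ²)), so the mean map m(v) = E_v[Γ(θ)] = (μ, μ² + σ²) used later by the algorithm is explicitly invertible:

```python
def natural_params(mu, sigma2):
    """v for N(mu, sigma2) with sufficient statistic Gamma(theta) = (theta, theta^2)."""
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def mean_map(v):
    """m(v) = E_v[Gamma(theta)] = (E[theta], E[theta^2])."""
    v1, v2 = v
    sigma2 = -1.0 / (2.0 * v2)
    mu = v1 * sigma2
    return (mu, mu ** 2 + sigma2)

def mean_map_inv(eta):
    """m^{-1}(eta): recover (mu, sigma2), then the natural parameters."""
    mu, e2 = eta
    return natural_params(mu, e2 - mu ** 2)
```

For a Gaussian family, the inversion v ← m^{−1}(η) used by the algorithm (Step 12 of Algorithm 1) amounts to moment matching: read off the mean and variance from the mean parameters.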
The surrogate objective function for the unconstrained CE method is E_{θ∼f_v}[G(π_θ) δ(G(π_θ) ≥ ξ_G(1 − ρ, v))]. In other words, a policy π_θ is considered highly ranked if G(π_θ) ≥ ξ_G(1 − ρ, v). When there is a constraint H(π) ≤ d, we define U : Π_Θ → ℝ such that U(π_θ) := G(π_θ) δ(H(π_θ) ≤ d) for any θ ∈ Θ and extend the surrogate function as follows:

L(v; ρ) := E_{θ∼f_v}[G(π_θ) δ(H(π_θ) ≤ ξ_H(ρ, v))], if ξ_H(ρ, v) > d;
L(v; ρ) := E_{θ∼f_v}[U(π_θ) δ(U(π_θ) ≥ ξ_U(1 − ρ, v))], otherwise.    (1)

We can combine the two cases. Define S : Π_Θ × V × (0, 1) → {0, 1} such that

S(π_θ, v, ρ) := δ(ξ_H(ρ, v) > d) δ(H(π_θ) ≤ ξ_H(ρ, v)) + δ(ξ_H(ρ, v) ≤ d) δ(H(π_θ) ≤ d) δ(U(π_θ) ≥ ξ_U(1 − ρ, v));

then (1) can be rewritten as

L(v; ρ) = E_{θ∼f_v}[G(π_θ) S(π_θ, v, ρ)].    (2)

The interpretation of L is as follows: If the ρ-quantile of H under the current policy distribution f_v is greater than the constraint threshold d, we select policies in Π_Θ by their H-values in order to increase the probability of drawing feasible policies. Consequently, π_θ is highly ranked if H(π_θ) ≤ ξ_H(ρ, v). If the proportion of feasible policies is higher than ρ, we select policies that are both feasible and have large objective values, i.e., π_θ is highly ranked if H(π_θ) ≤ d and U(π_θ) ≥ ξ_U(1 − ρ, v). 
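On a finite batch of sampled policies, the two branches of this ranking rule correspond to a simple empirical-quantile selection. A sketch (array-based; the function and variable names are ours):

```python
import math

def select_elites(G_vals, H_vals, d, rho):
    """Indices of elite samples: if the empirical rho-quantile of H exceeds d,
    rank by H (ascending); otherwise rank the feasible samples by G (descending)."""
    n = len(G_vals)
    k = math.ceil(rho * n)
    by_H = sorted(range(n), key=lambda i: H_vals[i])
    if H_vals[by_H[k - 1]] > d:      # too few feasible samples: improve feasibility
        return by_H[:k]
    feasible = [i for i in range(n) if H_vals[i] <= d]
    feasible.sort(key=lambda i: -G_vals[i])
    return feasible[:k]              # feasible samples with the best objectives
```

In the first branch the elites are the samples with the best constraint performance regardless of their objective values; in the second branch only feasible samples can be elite, which is what makes feasibility the prioritized concern.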
Intuitively, S can be considered as the indicator function of the highly-ranked or elite samples.

Remark 1. By maximizing U, we implicitly prioritize feasibility over the G objective: for any feasible policy π and infeasible policy π′, it can be easily verified that U(π) ≥ U(π′), as G and U are non-negative by definition.

Remark 2. If ξ_H(ρ, v) ≤ d, then E_v[G(π_θ) δ(G(π_θ) ≥ ξ_G(1 − ρ, v))] ≥ E_v[U(π_θ) δ(U(π_θ) ≥ ξ_U(1 − ρ, v))] ≥ E_v[G(π_θ) δ(H(π_θ) ≤ ξ_H(ρ, v))]. Intuitively, if at least 100ρ% of all policies are feasible, L(v; ρ) is at most the objective value of the unconstrained CE method and at least the expected G-value over the 100ρ% of policies with the lowest H-values.

The main problem we solve in this paper can then be stated as follows.

Problem 1. Given a set Π = {π_θ : θ ∈ Θ} of policies with parameter space Θ, an NEF F_V = {f_v ∈ D(Θ) : v ∈ V} of distributions over Θ, two functions G : Π → ℝ₊ and H : Π → ℝ, a constraint upper bound d and ρ ∈ (0, 1), compute v* ∈ V such that

v* = arg max_{v ∈ V} L(v; ρ),

where L : V × (0, 1) → ℝ is defined in (1) or (2).

4.2 The Constrained Cross-Entropy Algorithm

Algorithm 1 Constrained Cross-Entropy Method
Require: An objective function G, a constraint function H, a constraint upper bound d, a class of parameterized policies Π_Θ, an NEF family F_V.
1: l ← 1. Initialize n_l, v_l, ρ, λ_l, α_l. k_l ← ⌈ρ n_l⌉. η̂_l ← 0.
2: repeat
3:   Sample θ_1, ..., θ_{n_l} ∼ f_{v_l} i.i.d.
4:   for i = 1, ..., n_l do
5:     Simulate π_{θ_i} and estimate G(π_{θ_i}), H(π_{θ_i}).
6:   end for
7:   Sort {θ_i}_{i=1}^{n_l} in ascending order of H. Let Λ_l be the first k_l elements.
8:   if H(π_{θ_{k_l}}) ≤ d then
9:     Sort {θ_i | H(π_{θ_i}) ≤ d} in descending order of G. Let Λ_l be the first k_l elements.
10:  end if
11:  η̂_{l+1} ← α_l (Σ_{θ∈Λ_l} G(π_θ) Γ(θ)) / (Σ_{θ∈Λ_l} G(π_θ)) + (1 − α_l) ((λ_l / n_l) Σ_{i=1}^{n_l} Γ(θ_i) + (1 − λ_l) η̂_l).
12:  v_{l+1} ← m^{−1}(η̂_{l+1}).
13:  Update n_l, λ_l, α_l. l ← l + 1. k_l ← ⌈ρ n_l⌉.
14: until stopping rule is satisfied.

The pseudocode of the constrained cross-entropy algorithm is given in Algorithm 1. We first explain the basic ideas behind the updates in Algorithm 1, and provide a proof of convergence in Section 4.3.

We now describe the key idea behind the (idealized) CE-based stochastic optimization method as in [12]. For notational simplicity, we use E_v[·] to represent E_{θ∼f_v}[·] in the rest of this paper. Define m(v) := E_v[Γ(θ)] ∈ ℝ^{d_v} for v ∈ V, which is continuously differentiable in v with ∂m(v)/∂v = Cov_v[Γ(θ)], where Cov_v[Γ(θ)] denotes the covariance matrix of Γ(θ) with θ ∼ f_v. We adopt Assumption 1 to guarantee that m^{−1} exists and is continuously differentiable over {η : ∃ v ∈ int(V) s.t. η = m(v)} (see Lemma 1 in the supplemental material).

Assumption 1. Cov_v[Γ(θ)] is positive definite for any v ∈ V ⊆ int({v ∈ ℝ^{d_v} : |K(v)| < ∞}).

By the definition of ρ-quantiles, sampling the highly ranked policies is a rare event for small ρ. 
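To make the whole loop concrete, here is a self-contained toy instance of the constrained cross-entropy loop (our simplification, not the paper's implementation): a one-dimensional "policy parameter" θ with G(θ) = exp(−(θ − 2)²), constraint H(θ) = θ ≤ d = 1 (constrained optimum θ = 1), and a Gaussian sampling distribution with sufficient statistic Γ(θ) = (θ, θ²). The G-weighted elite average with smoothing α plays the role of the Step-11 update; the constants n, ρ, α and the iteration count are illustrative:

```python
import math
import random

def cce_toy(G, H, d, mu, sigma, n=200, rho=0.2, alpha=0.7, iters=60, seed=0):
    """Constrained cross-entropy on a 1-D parameter with a Gaussian family."""
    rng = random.Random(seed)
    k = math.ceil(rho * n)
    eta = (mu, sigma ** 2 + mu ** 2)              # mean parameters m(v)
    for _ in range(iters):
        thetas = [rng.gauss(mu, sigma) for _ in range(n)]
        by_H = sorted(thetas, key=H)
        if H(by_H[k - 1]) > d:                    # constraint-first branch
            elites = by_H[:k]
        else:                                     # objective branch, feasible only
            feas = [t for t in thetas if H(t) <= d]
            elites = sorted(feas, key=lambda t: -G(t))[:k]
        w = [G(t) for t in elites]                # G-weighted elite statistics
        m1 = sum(wi * t for wi, t in zip(w, elites)) / sum(w)
        m2 = sum(wi * t * t for wi, t in zip(w, elites)) / sum(w)
        eta = (alpha * m1 + (1 - alpha) * eta[0],
               alpha * m2 + (1 - alpha) * eta[1])
        mu = eta[0]                               # v <- m^{-1}(eta): moment matching
        sigma = math.sqrt(max(eta[1] - eta[0] ** 2, 1e-12))
    return mu, sigma
```

Starting from an infeasible mean μ = 5, the sampling distribution first moves toward the feasible set through the H-ranking branch, then concentrates on the best feasible parameters near θ = 1.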
Thus we apply importance sampling to estimate L(v; ρ) using any sampling distribution g that shares the same support Θ with f_v, among which the optimal distribution g*_v [25], with minimal variance, is

g*_v(θ) = G(π_θ) S(π_θ, v, ρ) f_v(θ) / L(v; ρ).    (3)

In practice, we smoothen the updates by including a learning rate α ∈ (0, 1), so the goal distribution is g̃_v = α g*_v + (1 − α) f_v. We can project g̃_v onto f_{v′} ∈ F_V by minimizing the Kullback-Leibler (KL) divergence of f_{v″} ∈ F_V from g̃_v, which is equivalent to minimizing the cross entropy between g̃_v and f_{v″}. If F_V is an NEF, log f_{v″}(θ) = (v″)^⊤ Γ(θ) − K(v″) is concave in v″; thus −∫_Θ g̃_v(θ) log f_{v″}(θ) dθ is convex in v″. As a result, v′ can be found by setting ∂(−∫_Θ g̃_v(θ) log f_{v″}(θ) dθ)/∂v″ = 0, which induces

m(v′) − m(v) = α (E_{g*_v}[Γ(θ)] − m(v)).    (4)

As a property of NEF, the KL-divergence of f_v from g satisfies ∂D_KL(g, f_v)/∂v = −E_g[Γ(θ)] + m(v). Therefore

m(v′) − m(v) = −α (∂D_KL(g*_v, f_{v″})/∂v″)|_{v″=v},    (5)

which confirms that m(v) is always updated in the negative gradient direction of the objective function D_KL(g*_v, f_v), where g*_v is the optimal sampling distribution from importance sampling.

Remark 3. The equality in (5) holds not just for the optimal distribution g*_v but for any reference distribution.

Define L̃(v; ρ) := E_{g*_v}[Γ(θ)] − m(v). If G has a strictly positive lower bound and is bounded, then

L̃(v; ρ) = E_v[G(π_θ) S(π_θ, v, ρ) Γ(θ)] / L(v; ρ) − m(v)
= (1 / L(v; ρ)) ∫_Θ G(π_θ) S(π_θ, v, ρ) f_v(θ) (Γ(θ) − m(v)) dθ
(∗)= (1 / L(v; ρ)) ∫_Θ G(π_θ) S(π_θ, v, ρ) (∂f_{v″}(θ)/∂v″)|_{v″=v} dθ
(∗∗)= (1 / L(v; ρ)) (∂E_{v″}[G(π_θ) S(π_θ, v, ρ)]/∂v″)|_{v″=v}
= (∂ log E_{v″}[G(π_θ) S(π_θ, v, ρ)]/∂v″)|_{v″=v},    (6)

where the (∗) step holds by noticing ∂f_v(θ)/∂v = f_v(θ)(Γ(θ) − m(v)) and the (∗∗) step holds by the dominated convergence theorem. Combining (4) and (6), we get

m(v′) − m(v) = α (∂ log E_{v″}[G(π_θ) S(π_θ, v, ρ)]/∂v″)|_{v″=v} = α L̃(v; ρ),    (7)

which leads to a second interpretation of the updates: the update from v to v′ approximately follows the gradient direction of log L(v″; ρ), with the quantiles estimated using the previous distribution f_v.

Algorithm 1 essentially performs the updates (3) and (4) in each iteration, with all expectations and quantiles estimated by Monte Carlo simulation. 
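For an NEF, the projection in (4) is just a smoothed shift of the mean parameter toward the elite-weighted statistic E_{g*_v}[Γ(θ)]. A small numeric sketch with Γ(θ) = (θ, θ²) and a list of weighted samples standing in for g*_v (the weights, proportional to G(π_θ)S(π_θ, v, ρ), and all numbers are made up):

```python
def smoothed_projection(eta, weighted_thetas, alpha):
    """m(v') = m(v) + alpha * (E_g[Gamma(theta)] - m(v)) with Gamma(theta) = (theta, theta^2).
    `weighted_thetas` is a list of (weight, theta) pairs; weights need not be normalized."""
    total = sum(w for w, _ in weighted_thetas)
    g1 = sum(w * t for w, t in weighted_thetas) / total
    g2 = sum(w * t * t for w, t in weighted_thetas) / total
    return (eta[0] + alpha * (g1 - eta[0]),
            eta[1] + alpha * (g2 - eta[1]))
```

With α = 1 this is plain moment matching to the elite-weighted distribution; α < 1 gives the smoothed update that appears in Step 11 of Algorithm 1.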
Given f_{v_l} ∈ D(Θ) in the l-th iteration, we sample policies (Step 3), evaluate their G-values and H-values (Step 5), estimate S(·, v, ρ) (Steps 7 to 10), estimate m(v_{l+1}) with η̂_{l+1} (Step 11), and finally update the sampling distribution to v_{l+1} (Step 12).

4.3 Convergence Analysis

We prove the convergence of Algorithm 1 by comparing the asymptotic behavior of {η̂_l}_{l≥0} with the flow induced by the following ordinary differential equation (ODE):

∂η(t)/∂t = L̃(m^{−1}(η(t)); ρ),    (8)

where we define η := m(v) or, equivalently, v = m^{−1}(η).

We need a series of assumptions for technical reasons.

Assumption 2.

(2a) L̃(v; ρ) is continuous in v ∈ int(V) and (8) has a unique integral curve for any given initial condition.

(2b) The number of samples in the l-th iteration is n_l = Θ(l^β) for some β > 0. The gain sequence {α_l} is positive and decreasing, with lim_{l→∞} α_l = 0 and Σ_{l=1}^∞ α_l = ∞. The sequence {λ_l} satisfies λ_l = O(1/l^λ) for some λ > 0 such that β + 2λ > 1.

(2c) For any ρ ∈ (0, 1) and any v ∈ V, the ρ-quantile of {H(π_θ) : θ ∼ f_v} and the (1 − ρ)-quantile of {U(π_θ) : θ ∼ f_v} are both unique.

(2d) Both Θ and V are compact.

(2e) The function G defined in Problem 1 is bounded and has a positive lower bound: inf_{π∈Π} G(π) > 0. The function H in Problem 1 is bounded.

(2f) v_l ∈ int(V) for every iteration l.

Assumption (2a) ensures that (8) is well-posed and has a unique solution. Assumption (2b) places requirements on the number of sampled policies in each iteration and the other hyperparameters in Algorithm 1. Assumptions (2c) to (2e) are used in the proof of the convergence of Algorithm 1. Assumption (2c) is required to show that (1/n_l) Σ_{θ∈Λ_l} G(π_θ) in Step 11 of Algorithm 1 is an unbiased estimate of E_{v_l}[G(π_θ) S(π_θ, v_l, ρ)]. Assumptions (2d) and (2e) are compactness and boundedness conditions on the sets and functions involved in Algorithm 1, which are unlikely to be restrictive in practice. Assumption (2f) states that V is large enough that the learned v always lies in its interior.

The main result, which connects the asymptotic behavior of Algorithm 1 with that of an ODE, is stated in Theorem 4.1. The main idea behind the proof of Theorem 4.1 is similar to that of Theorem 3.1 in [12], although the details are tailored to our problem. There are two major parts in the convergence proof: the first shows that all the sampling-based estimates, including sample quantiles and sample estimates of G, H and L, converge to the true values almost surely. The second shows that the asymptotic behavior of the idealized updates in (4) can be described by the ODE (8). A detailed proof of Theorem 4.1 is given in the supplemental material.

Theorem 4.1. If Assumptions 1 and 2 hold, the sequence {η̂_l}_{l≥0} in Step 11 of Algorithm 1 converges to a connected internally chain recurrent set of (8) as l → ∞ with probability 1.

By the definition of η in (8), we know that ∂η(t)/∂t = (∂v/∂t) · Cov_v[Γ(θ)]. 
Since Cov_v[Γ(θ)] is invertible by Assumption 1, (8) can be rewritten in the variable v as

∂v/∂t = (L̃(v; ρ))^⊤ (Cov_v[Γ(θ)])^{−1}.    (9)

The conclusion of Theorem 4.1 can be equivalently stated in terms of the variable v: the sequence {v_l}_{l≥0} of Algorithm 1 converges to a connected internally chain recurrent set of (9) as l → ∞ with probability 1.

Intuitively, a point v_0 ∈ V is chain recurrent for (9) if the solution v(t) of (9) with initial condition v(0) = v_0 can return to v_0 within some finite time t′ > 0, either by itself or with finitely many arbitrarily small perturbations. An internally chain recurrent set is a nonempty compact invariant set of chain-recurrent points, i.e., v can never leave an internally chain recurrent set if v_0 belongs to it. Theorem 4.1 implies that, with probability 1, the set of points that occur infinitely often in {v_l}_{l≥0} is internally chain recurrent for (9). Since f_v belongs to an NEF, Cov_v[Γ(θ)] is the Fisher information matrix at v, and the right-hand side of (9) is an estimate of the natural gradient of log L(v; ρ) with a fixed indicator function S. This suggests that v evolves to increase L(v; ρ), which is consistent with the optimization problem (1) and our motivation to solve a constrained RL problem. Note that internally chain-recurrent sets are generally not unique, and our algorithm can still converge to a local optimum.

To further interpret Theorem 4.1, we first note that any equilibrium of (8) forms an internally chain recurrent set by itself. The following result gives a sufficient condition for an equilibrium point v̄* of (8) to be locally asymptotically stable, i.e., there is a small neighborhood of v̄* such that once entered, (9) converges to v̄*.

Theorem 4.2. Let φ : V → ℝ be any function such that ∂φ(v)/∂v = L̃(v; ρ). Any equilibrium v̄* ∈ int(V) of (9) that is an isolated local maximum of φ(v) is locally asymptotically stable.

The proof of Theorem 4.2 is done by constructing a local Lyapunov function and is given in the supplemental material. It also shows that φ(v) always decreases in the interior of V unless it hits a stationary point of (9), which suggests a stronger property of our algorithm, stated in Theorem 4.3.

Theorem 4.3. If all equilibria of (9) are isolated, the sequence {v_l}_{l≥0} generated by Algorithm 1 converges to an equilibrium of (9) as l → ∞ with probability 1.

5 Experimental Results

We consider a mobile robot navigation task with only local sensors. There is a compact goal region G and a non-overlapping compact bad region B in the robot's environment. The transition function is deterministic. The robot uses a local sensing model to observe whether B or G is in its neighborhood, together with the direction of the center of G in its local coordinates. Details of this experiment and the local sensing model can be found in the supplemental material.

We compare the performance of CCE with trust region policy optimization (TRPO) [27], a state-of-the-art unconstrained RL algorithm, and its variant for constrained problems, CPO [1]. For all

Figure 1: Learning curves of CCE, CPO and TRPO with different objectives G_{J_i} and constraints H_{Z_i}. Top row: objective values G_{J_i}(π_θ); bottom row: constraint values H_{Z_i}(π_θ) (feasible regions are below the dashed lines). Panels: (a) i = 1, (b) i = 2, (c) i = 3, (d) i = 4, (e) i = 1 with feasible initial policies. The x-axes show the total number of sample trajectories for CCE and the total number of equivalent sample trajectories for TRPO and CPO. 
The y-axes show the sample mean of the objective and constraint values of the learned policy (for TRPO and CPO) or the learned policy distribution (for CCE). Each experiment is repeated 5 times. More details can be found in the supplemental material.

Table 1: Ji(τ), Zi(τ) and constraint upper bound di for i = 1, 2, 3, 4, τ ∈ (S × A)^N.

i = 1. Ji(τ): 1 for each state in G; 2|y| for each state with y ∈ [−2, −0.2]; 0 otherwise.
      Zi(τ): −1 if the robot arrives at G, which is absorbing; 0 otherwise.
      di = −0.5. Ji Markovian: Yes. Zi Markovian: Yes.
i = 2. Ji(τ): 30 times the minimum signed distance from any state in τ to B.
      Zi(τ): −1 if the robot visited G in τ; 0 otherwise.
      di = −0.5. Ji Markovian: No. Zi Markovian: No.
i = 3. Ji(τ): Same as J2(τ).
      Zi(τ): −1 for each state in G; 0 otherwise.
      di = −5. Ji Markovian: No. Zi Markovian: Yes.
i = 4. Ji(τ): Same as J1(τ).
      Zi(τ): −1 if the robot visits G and never visits B; 0 otherwise.
      di = −0.5. Ji Markovian: Yes. Zi Markovian: No.

For all experiments, the agent's policy is modeled as a fully connected neural network with two hidden layers and 30 nodes in each layer. The trajectory length in all experiments is set to N = 30. All experiments are implemented in rllab [5].
Figure 1 shows the learning curves of CCE, TRPO and CPO for four different objectives and constraints (i = 1, 2, 3, 4). The objective functions and constraint functions used in each experiment are described in Table 1. For experiments in which Ji is not strictly positive, we use exp(Ji) instead of Ji for CCE. The TRPO results show that the constraints cannot be satisfied by merely optimizing the corresponding objectives.
We first initialize each experiment with a randomly generated infeasible policy.
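The ranking rule that lets CCE make progress from infeasible initial policies can be illustrated on a toy optimization problem. The sketch below is our own illustrative numpy code, not the paper's Algorithm 1 (which maintains a natural exponential family, an adaptive quantile threshold and a smoothed stochastic-approximation update); all names and hyperparameters here are assumptions for the example. The key idea it reproduces: when too few samples are feasible, elites are chosen by constraint value alone; once enough feasible samples exist, elites are the feasible samples with the best objective.

```python
import numpy as np

def cce_toy(objective, constraint, d, mean, cov,
            n_samples=200, elite_frac=0.1, n_iters=60,
            smoothing=0.7, seed=0):
    """Constraint-first cross-entropy loop on a toy problem (sketch)."""
    rng = np.random.default_rng(seed)
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        x = rng.multivariate_normal(mean, cov, size=n_samples)
        g = np.array([objective(xi) for xi in x])   # maximize g
        h = np.array([constraint(xi) for xi in x])  # require h <= d
        feasible = h <= d
        if feasible.sum() < n_elite:
            # too few feasible samples: rank by constraint value only,
            # driving the sampling distribution toward feasibility
            elite_idx = np.argsort(h)[:n_elite]
        else:
            # enough feasible samples: best objective among feasible ones
            feas_idx = np.flatnonzero(feasible)
            elite_idx = feas_idx[np.argsort(-g[feas_idx])[:n_elite]]
        elites = x[elite_idx]
        # smoothed distribution update, as in standard CE methods
        mean = smoothing * elites.mean(axis=0) + (1 - smoothing) * mean
        cov = (smoothing * np.cov(elites.T) + (1 - smoothing) * cov
               + 1e-6 * np.eye(len(mean)))
    return mean

# Toy problem: maximize -||x - (2, 0)||^2 subject to x_1 <= 1,
# starting from a distribution centered far from the optimum.
best = cce_toy(
    objective=lambda x: -np.sum((x - np.array([2.0, 0.0]))**2),
    constraint=lambda x: x[0],
    d=1.0,
    mean=np.array([-3.0, 3.0]),
    cov=4.0 * np.eye(2),
)
print(np.round(best, 2))
```

On this toy problem the returned mean approaches the constrained optimum near (1, 0), even though the sampling distribution is initialized far from the high-objective region.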
We \ufb01nd that\nCCE successfully outputs feasible policies in all\n\nFigure 2: Average performance of CCE, CPO and\nTRPO for Experiment 4 with initial feasible policy.\n\n(b) HZ4 (\u03c0\u03b8).\n\n(a) GJ4 (\u03c0\u03b8).\n\n8\n\n\fexperiments. On the other hand, CPO needs signi\ufb01cantly more samples to \ufb01nd a single feasible\npolicy, or simply converges to an infeasible policy especially if the constraint is non-Markovian.\nWe repeat the \ufb01rst experiment (i = 1) with feasible initial policies and obtain the result in the last\ncolumn of Figure 1. In this case, CPO leaves the feasible region rapidly and then follows generally the\nsame path as if it is initialized with an infeasible policy. This behavior suggests that its incapability to\nenforce constraint satisfaction is not due to the lack of initial feasibility. Although CCE also leaves\nthe feasible region at an early stage of iterations, it regains feasibility much faster than the previous\ncase with infeasible initial polices. These results suggest that CCE is more reliable than CPO for\napplications where the strict constraint satisfaction is critical.\nIn Figure 2, we compare the performance of CPO and CCE in Experiment 4 to that of TRPO with\nobjective GJ4 \u2212 100HZ4. Due to the non-Markovian nature of Z4, HZ4(\u03c0) is not sensitive to local\nchanges in \u03c0(s) at any state s. It therefore makes it more dif\ufb01cult for standard RL algorithms to\nimprove its HZ4-value. The \ufb01xed penalty coef\ufb01cient 100 is chosen to be neither too large nor too\nsmall so it can show a large variety of locally optimal behaviors with very different GJ4-values and\nHZ4-values. Figure 2 clearly shows the trade-off between GJ4-values and HZ4-values, which partially\nexplains the gap between GJ4-value outputs of CCE and CPO. With a \ufb01xed penalty coef\ufb01cient, the\npolicies learned by TRPO are either infeasible or with very small constraint values. 
The policy output by CCE has a higher GJ4-value than all feasible policies found by TRPO and CPO.

6 Conclusions and Future Work

In this work, we studied a safe reinforcement learning problem in which the constraints are defined as the expected cost over finite-length trajectories. We proposed a constrained cross-entropy-based method to solve this problem, analyzed its asymptotic performance using an ODE and proved its convergence. We showed with simulation experiments that our method can effectively learn feasible policies without assumptions on the feasibility of initial policies, with both Markovian and non-Markovian objective functions and constraint functions.
CCE is expected to be less sample-efficient than gradient-based methods, especially for high-dimensional systems. Unlike gradient-based methods such as TRPO, CCE does not infer the performance of unseen policies from previous experience. As a result, it has to repetitively sample good policies in order to make steady improvement. Meanwhile, CCE can be easily parallelized, as each sampled policy is evaluated independently. This may mitigate the problem of high sample complexity, as is the case for other evolutionary methods [26].
Despite these limitations, we find the CCE method to be particularly useful for learning hierarchical policies. With a high-level policy that specifies intermediate goals and thus reduces the state space for low-level policies, we can use CCE to train a (locally) optimal low-level policy while satisfying local constraints. As shown in the experiments of this paper, CCE converges with reasonable sample complexity and outperforms CPO on constraint performance. Since the satisfaction of low-level constraints is of critical significance to the performance of the overall policy, CCE seems to be especially well-suited for this application.
In future work, we will combine this method with off-policy policy evaluation techniques such as [13; 9] to improve sample complexity.

Acknowledgement

This work was supported in part by ONR N000141712623, ARO W911NF-15-1-0592 and DARPA W911NF-16-1-0001.

References

[1] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22-31, 2017.

[2] E. Altman. Asymptotic properties of constrained Markov decision processes. Mathematical Methods of Operations Research, 37(2):151-170, 1993.

[3] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 3rd edition, 2007. ISBN 1886529302, 9781886529304.

[4] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. arXiv preprint arXiv:1512.01629, 2015.

[5] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329-1338, 2016.

[6] J. Fu, I. Papusha, and U. Topcu. Sampling-based approximate optimal control under temporal logic constraints. In Proceedings of the 20th International Conference on Hybrid Systems: Computation and Control, pages 227-235. ACM, 2017.

[7] J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.

[8] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2829-2838, New York, NY, USA, 2016. PMLR. URL http://proceedings.mlr.press/v48/gu16.html.

[9] J. P. Hanna, P. Stone, and S.
Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 538-546. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

[10] T. Homem-de-Mello. A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing, 19(3):381-394, 2007.

[11] J. Hu, M. C. Fu, S. I. Marcus, et al. A model reference adaptive search method for stochastic global optimization. Communications in Information and Systems, 8(3):245-276, 2008.

[12] J. Hu, P. Hu, and H. S. Chang. A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Transactions on Automatic Control, 57(1):165-178, 2012.

[13] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652-661, 2016.

[14] M. Kobilarov. Cross-entropy randomized motion planning. In Robotics: Science and Systems, 2011.

[15] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

[16] L. Li and J. Fu. Sampling-based approximate optimal temporal logic planning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1328-1335. IEEE, 2017.

[17] S. C. Livingston, E. M. Wolff, and R. M. Murray. Cross-entropy temporal logic motion planning. In Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control, pages 269-278. ACM, 2015.

[18] S. Mannor, R. Rubinstein, and Y. Gat. The cross entropy method for fast policy search. In International Conference on Machine Learning, pages 512-519. Morgan Kaufmann, 2003.

[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M.
Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

[21] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.

[22] W. Montgomery, A. Ajay, C. Finn, P. Abbeel, and S. Levine. Reset-free guided policy search: Efficient deep reinforcement learning with stochastic initial states. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3373-3380. IEEE, 2017.

[23] I. Papusha, J. Fu, U. Topcu, and R. M. Murray. Automata theory meets approximate dynamic programming: Optimal control with temporal logic constraints. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 434-440. IEEE, 2016.

[24] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.

[25] R. Y. Rubinstein and B. Melamed. Modern Simulation and Modeling, volume 7. Wiley, New York, 1998.

[26] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[27] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889-1897, 2015.

[28] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel.
High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[30] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 387-395, 2014.

[31] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[32] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[33] I. Szita and A. Lőrincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12), 2006.

[34] E. Uchibe and K. Doya. Constrained reinforcement learning from intrinsic and extrinsic rewards. In Development and Learning (ICDL), 2007 IEEE 6th International Conference on, pages 163-168. IEEE, 2007.

[35] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5-32. Springer, 1992.

[36] I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero. Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. arXiv preprint arXiv:1608.05742, 2016.

[37] T. Zhang, G. Kahn, S. Levine, and P. Abbeel. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 528-535.
IEEE, 2016.