{"title": "Convergent Policy Optimization for Safe Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3127, "page_last": 3139, "abstract": "We study the safe reinforcement learning problem with nonlinear function approximation, where policy optimization is formulated as a constrained optimization problem with both the objective and the constraint being nonconvex functions. For such a problem, we construct a sequence of surrogate convex constrained optimization problems by replacing the nonconvex functions locally with convex quadratic functions obtained from policy gradient estimators.  We prove that the solutions to these surrogate problems converge to a stationary point of the original nonconvex problem. Furthermore, to extend our theoretical results, we apply our algorithm to examples of optimal control and multi-agent reinforcement learning with safety constraints.", "full_text": "Convergent Policy Optimization for Safe\n\nReinforcement Learning\n\nMing Yu \u21e4\n\nZhuoran Yang \u2020\n\nMladen Kolar \u2021\n\nZhaoran Wang \u00a7\n\nAbstract\n\nWe study the safe reinforcement learning problem with nonlinear function approx-\nimation, where policy optimization is formulated as a constrained optimization\nproblem with both the objective and the constraint being nonconvex functions.\nFor such a problem, we construct a sequence of surrogate convex constrained\noptimization problems by replacing the nonconvex functions locally with convex\nquadratic functions obtained from policy gradient estimators. We prove that the\nsolutions to these surrogate problems converge to a stationary point of the original\nnonconvex problem. 
Furthermore, to extend our theoretical results, we apply our\nalgorithm to examples of optimal control and multi-agent reinforcement learning\nwith safety constraints.\n\n1\n\nIntroduction\n\nReinforcement learning [58] has achieved tremendous success in video games [41, 44, 56, 36, 66]\nand board games, such as chess and Go [53, 55, 54], in part due to powerful simulators [8, 62]. In\ncontrast, due to physical limitations, real-world applications of reinforcement learning methods often\nneed to take into consideration the safety of the agent [5, 26]. For instance, in expensive robotic\nand autonomous driving platforms, it is pivotal to avoid damages and collisions [25, 9]. In medical\napplications, we need to consider the switching cost [7].\nA popular model of safe reinforcement learning is the constrained Markov decision process (CMDP),\nwhich generalizes the Markov decision process by allowing for inclusion of constraints that model\nthe concept of safety [3]. In a CMDP, the cost is associated with each state and action experienced\nby the agent, and safety is ensured only if the expected cumulative cost is below a certain threshold.\nIntuitively, if the agent takes an unsafe action at some state, it will receive a huge cost that punishes\nrisky attempts. Moreover, by considering the cumulative cost, the notion of safety is de\ufb01ned for the\nwhole trajectory enabling us to examine the long-term safety of the agent, instead of focusing on\nindividual state-action pairs. For a CMDP, the goal is to take sequential decisions to achieve the\nexpected cumulative reward under the safety constraint.\nSolving a CMDP can be written as a linear program [3], with the number of variables being the\nsame as the size of the state and action spaces. Therefore, such an approach is only feasible for the\ntabular setting, where we can enumerate all the state-action pairs. 
For large-scale reinforcement\nlearning problems, where function approximation is applied, both the objective and constraint of the\nCMDP are nonconvex functions of the policy parameter. One common method for solving CMDP\nis to formulate an unconstrained saddle-point optimization problem via Lagrangian multipliers and\nsolve it using policy optimization algorithms [18, 60]. Such an approach suffers the following two\ndrawbacks:\n\n\u21e4The University of Chicago Booth School of Business, Chicago, IL. Email: ming93@uchicago.edu.\n\u2020Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ.\n\u2021The University of Chicago Booth School of Business, Chicago, IL.\n\u00a7Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFirst, for each \ufb01xed Lagrangian multiplier, the inner minimization problem itself can be viewed\nas solving a new reinforcement learning problem. From the computational point of view, solving\nthe saddle-point optimization problem requires solving a sequence of MDPs with different reward\nfunctions. For a large scale problem, even solving a single MDP requires huge computational\nresources, making such an approach computationally infeasible.\nSecond, from a theoretical perspective, the performance of the saddle-point approach hinges on\nsolving the inner problem optimally. Existing theory only provides convergence to a stationary point\nwhere the gradient with respect to the policy parameter is zero [28, 37]. Moreover, the objective, as a\nbivariate function of the Lagrangian multiplier and the policy parameter, is not convex-concave and,\ntherefore, \ufb01rst-order iterative algorithms can be unstable [27].\nIn contrast, we tackle the nonconvex constrained optimization problem of the CMDP directly. 
We\npropose a novel policy optimization algorithm, inspired by [38]. Speci\ufb01cally, in each iteration, we\nreplace both the objective and constraint by quadratic surrogate functions and update the policy\nparameter by solving the new constrained optimization problem. The two surrogate functions can\nbe viewed as \ufb01rst-order Taylor-expansions of the expected reward and cost functions where the\ngradients are estimated using policy gradient methods [59]. Additionally, they can be viewed as\nconvex relaxations of the original nonconvex reward and cost functions. In \u00a74 we show that, as the\nalgorithm proceeds, we obtain a sequence of convex relaxations that gradually converge to a smooth\nfunction. More importantly, the sequence of policy parameters converges almost surely to a stationary\npoint of the nonconvex constrained optimization problem.\n\nRelated work. Our work is pertinent to the line of research on CMDP [3]. For CMDPs with\nlarge state and action spaces, [19] proposed an iterative algorithm based on a novel construction of\nLyapunov functions. However, their theory only holds for the tabular setting. Using Lagrangian\nmultipliers, [46, 18, 1, 60] proposed policy gradient [59], actor-critic [33], or trust region policy\noptimization [51] methods for CMDP or constrained risk-sensitive reinforcement learning [26]. These\nalgorithms either do not have convergence guarantees or are shown to converge to saddle-points of the\nLagrangian using two-time-scale stochastic approximations [10]. However, due to the projection on\nthe Lagrangian multiplier, the saddle-point achieved by these approaches might not be the stationary\npoint of the original CMDP problem. In addition, [65] proposed a cross-entropy-based stochastic\noptimization algorithm, and proved the asymptotic behavior using ordinary differential equations.\nIn contrast, our algorithm and the theoretical analysis focus on the discrete time CMDP. 
Outside\nof the CMDP setting, [31, 35] studied safe reinforcement learning with demonstration data, [61]\nstudied the safe exploration problem with different safety constraints, and [4] studied multi-task safe\nreinforcement learning.\n\nOur contribution. Our contribution is three-fold. First, for the CMDP policy optimization problem\nwhere both the objective and constraint function are nonconvex, we propose to optimize a sequence\nof convex relaxation problems using convex quadratic functions. Solving these surrogate problems\nyields a sequence of policy parameters that converge almost surely to a stationary point of the original\npolicy optimization problem. Second, to reduce the variance in the gradient estimator that is used to\nconstruct the surrogate functions, we propose an online actor-critic algorithm. Finally, as concrete\napplications, our algorithms are also applied to optimal control (in \u00a75) and parallel and multi-agent\nreinforcement learning problems with safety constraints (in supplementary material).\n\n2 Background\nA Markov decision process is denoted by (S,A, P,, r, \u00b5 ), where S is the state space, A is the action\nspace, P is the transition probability distribution,  2 (0, 1) is the discount factor, r : S\u21e5A! R is\nthe reward function, and \u00b5 2P (S) is the distribution of the initial state s0 2S , where we denote\nP(X ) as the set of probability distributions over X for any X . A policy is a mapping \u21e1 : S!P (A)\nthat speci\ufb01es the action that an agent will take when it is at state s.\nPolicy gradient method. Let {\u21e1\u2713 : S!P (A)} be a parametrized policy class, where \u2713 2 \u21e5\nis the parameter de\ufb01ned on a compact set \u21e5. This parameterization transfers the original in\ufb01nite\ndimensional policy class to a \ufb01nite dimensional vector space and enables gradient based methods\nto be used to maximize (1). 
For example, the most popular Gaussian policy can be written as\n\u21e1(\u00b7|s, \u2713) = N\u00b5(s, \u2713), (s, \u2713), where the state dependent mean \u00b5(s, \u2713) and standard deviation\n\n2\n\n\f(s, \u2713) can be further parameterized as \u00b5(s, \u2713) = \u2713>\u00b5 \u00b7 x(s) and (s, \u2713) = exp\u2713> \u00b7 x(s) with x(s)\n\nbeing a state feature vector. The goal of an agent is to maximize the expected cumulative reward\n\nwhere s0 \u21e0 \u00b5, and for all t  0, we have st+1 \u21e0 P (\u00b7| st, at) and at \u21e0 \u21e1(\u00b7| st). Given a policy \u21e1(\u2713),\nwe de\ufb01ne the state- and action-value functions of \u21e1\u2713, respectively, as\n\nR(\u2713) = E\u21e1\uf8ffXt0\n\nt \u00b7 r(st, at),\ntr(st, at) s0 = s, Q\u2713(s, a) = E\u21e1\u2713\uf8ffXt0\n\u2713k+1 = \u2713k + \u2318 \u00b7br\u2713R(\u2713k),\n\nV \u2713(s) = E\u21e1\u2713\uf8ffXt0\n\nThe policy gradient method updates the parameter \u2713 through gradient ascent\n\ntr(st, at) s0 = s, a0 = a. (2)\nwhere br\u2713R(\u2713k) is a stochastic estimate of the gradient r\u2713R(\u2713k) at k-th iteration. Policy gradient\nmethod, as well as its variants (e.g. policy gradient with baseline [58], neural policy gradient\n[64, 39, 16]) is widely used in reinforcement learning. The gradient r\u2713R(\u2713) can be estimated\naccording to the policy gradient theorem [59],\n\n(1)\n\nr\u2713R(\u2713) = Ehr\u2713 log \u21e1\u2713(s, a) \u00b7 Q\u2713(s, a)i.\n\nActor-critic method. To further reduce the variance of the policy gradient method, we could\nestimate both the policy parameter and value function simultaneously. This kind of method is called\nactor-critic algorithm [33], which is widely used in reinforcement learning. Speci\ufb01cally, in the value\nfunction evaluation (critic) step we estimate the action-value function Q\u2713(s, a) using, for example, the\ntemporal difference method TD(0) [20]. 
The policy parameter update (actor) step is implemented as\nbefore by the Monte-Carlo method according to the policy gradient theorem (3) with the action-value\nQ\u2713(s, a) replaced by the estimated value in the policy evaluation step.\n\nConstrained MDP.\nIn this work, we consider an MDP problem with an additional constraint\non the model parameter \u2713. Speci\ufb01cally, when taking action at some state we incur some cost\nvalue. The constraint is such that the expected cumulative cost cannot exceed some pre-de\ufb01ned\nconstant. A constrained Markov decision process (CMDP) is denoted by (S,A, P,, r, d, \u00b5 ), where\nd : S\u21e5A! R is the cost function and the other parameters are as before. The goal of an the agent\nin CMDP is to solve the following constrained problem\n\n(3)\n\n(4)\n\n\u27132\u21e5\n\nminimize\n\nJ(\u2713) = E\u21e1\u2713\uf8ffXt0\nsubject to D(\u2713) = E\u21e1\u2713\uf8ffXt0\n\nt \u00b7 r(st, at),\nt \u00b7 d(st, at) \uf8ff D0,\n\nwhere D0 is a \ufb01xed constant. We consider only one constraint D(\u2713) \uf8ff D0, noting that it is\nstraightforward to generalize to multiple constraints. Throughout this paper, we assume that both the\n\nreward and cost value functions are bounded:r(st, at) \uf8ff rmax andd(st, at) \uf8ff dmax. Also, the\n\nparameter space \u21e5 is assumed to be compact.\n\n3 Algorithm\n\nIn this section, we develop an algorithm to solve the optimization problem (4). Note that both the\nobjective function and the constraint in (4) are nonconvex and involve expectation without closed-\nform expression. As a constrained problem, a straightforward approach to solve (4) is to de\ufb01ne the\nfollowing Lagrangian function\n\nand solve the dual problem\n\nL(\u2713, ) = J(\u2713) +  \u00b7\u21e5D(\u2713)  D0\u21e4,\n\ninf\n0\n\nsup\n\nL(\u2713, ).\n\n\u2713\n\n3\n\n\fHowever, this problem is a nonconvex minimax problem and, therefore, is hard to solve and establish\ntheoretical guarantees for solutions [2]. 
Another approach to solve (4) is to replace J(\u2713) and D(\u2713)\nby surrogate functions with nice properties. For example, one can iteratively construct local quadratic\napproximations that are strongly convex [52], or are an upper bound for the original function [57].\nHowever, an immediate problem of this naive approach is that, even if the original problem (4) is\nfeasible, the convex relaxation problem need not be. Also, these methods only deal with deterministic\nand/or convex constraints.\nIn this work, we propose an iterative algorithm that approximately solves (4) by constructing a\nsequence of convex relaxations, inspired by [38]. Our method is able to handle the possible infeasible\nsituation due to the convex relaxation as mentioned above, and handle stochastic and nonconvex\nconstraint. Since we do not have access to J(\u2713) or D(\u2713), we \ufb01rst de\ufb01ne the sample negative\ncumulative reward and cost functions as\n\nJ\u21e4(\u2713) = Xt0\n\nt \u00b7 r(st, at)\n\nand\n\nD\u21e4(\u2713) =Xt0\n\nt \u00b7 d(st, at).\n\nGiven \u2713, J\u21e4(\u2713) and D\u21e4(\u2713) are the sample negative cumulative reward and cost value of a realization\n(i.e., a trajectory) following policy \u21e1\u2713. Note that both J\u21e4(\u2713) and D\u21e4(\u2713) are stochastic due to the\nrandomness in the policy, state transition distribution, etc. With some abuse of notation, we use\nJ\u21e4(\u2713) and D\u21e4(\u2713) to denote both a function of \u2713 and a value obtained by the realization of a trajectory.\n\nClearly we have J(\u2713) = E\u21e5J\u21e4(\u2713)\u21e4 and D(\u2713) = E\u21e5D\u21e4(\u2713)\u21e4.\nWe start from some (possibly infeasible) \u27130. Let \u2713k denote the estimate of the policy parameter in\nthe k-th iteration. As mentioned above, we do not have access to the expected cumulative reward\nJ(\u2713). 
Instead, we sample a trajectory following the current policy πθk and obtain a realization of the negative cumulative reward value and of its gradient, J*(θk) and ∇θJ*(θk), respectively. The cumulative reward value is obtained by Monte-Carlo estimation, and the gradient is also obtained by Monte-Carlo estimation according to the policy gradient theorem in (3). We provide more details on the realization step later in this section. Similarly, we use the same procedure for the cost function and obtain realizations D*(θk) and ∇θD*(θk).

We approximate J(θ) and D(θ) at θk by the quadratic surrogate functions

J̃(θ, θk, τ) = J*(θk) + ⟨∇θJ*(θk), θ − θk⟩ + τ‖θ − θk‖²₂,   (5)
D̃(θ, θk, τ) = D*(θk) + ⟨∇θD*(θk), θ − θk⟩ + τ‖θ − θk‖²₂,   (6)

where τ > 0 is any fixed constant. In each iteration, we solve the optimization problem

θ̄k = argmin_θ J̄^(k)(θ)   subject to   D̄^(k)(θ) ≤ D0,   (7)

where we define

J̄^(k)(θ) = (1 − ρk) · J̄^(k−1)(θ) + ρk · J̃(θ, θk, τ),
D̄^(k)(θ) = (1 − ρk) · D̄^(k−1)(θ) + ρk · D̃(θ, θk, τ),   (8)

with the initial values J̄^(0)(θ) = D̄^(0)(θ) = 0. Here ρk is the weight parameter to be specified later. According to the definitions (5) and (6), problem (7) is a convex quadratically constrained quadratic program (QCQP). Therefore, it can be efficiently solved by, for example, the interior point method. However, as mentioned before, even if the original problem (4) is feasible, the convex relaxation problem (7) could be infeasible.
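Because every averaged surrogate produced by (8) is a spherical quadratic, problem (7) can be written in the generic form min a0‖x‖² + b0ᵀx subject to a1‖x‖² + b1ᵀx ≤ D0 (constants dropped, since they do not affect the minimizer), and solved by bisection on the KKT multiplier without a general-purpose solver. The following is a minimal sketch under that assumption, not the authors' implementation:

```python
import numpy as np

def solve_surrogate_qcqp(a0, b0, a1, b1, D0, iters=200):
    """One-constraint convex QCQP with spherical quadratics (a0, a1 > 0).

    The KKT point is x(lam) = -(b0 + lam*b1) / (2*(a0 + lam*a1)) for a
    multiplier lam >= 0; the constraint value decreases monotonically in lam,
    so lam is found by bisection.  Returns (x, feasible)."""
    x_of = lambda lam: -(b0 + lam * b1) / (2.0 * (a0 + lam * a1))
    g = lambda x: a1 * (x @ x) + b1 @ x - D0   # constraint violation
    x_c = -b1 / (2.0 * a1)                     # minimizer of the constraint
    if g(x_c) > 0:
        return x_c, False                      # surrogate problem infeasible
    x0 = x_of(0.0)
    if g(x0) <= 0:
        return x0, True                        # constraint inactive at optimum
    lo, hi = 0.0, 1.0
    while g(x_of(hi)) > 0 and hi < 1e12:       # bracket the multiplier
        hi *= 2.0
    for _ in range(iters):                     # bisection on lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(x_of(mid)) > 0 else (lo, mid)
    return x_of(hi), True
```

When even the constraint's own minimizer violates the bound, the routine reports infeasibility, which is exactly the situation handled next.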
In this case, we instead solve the following feasibility problem

θ̄k = argmin_{θ, α} α   subject to   D̄^(k)(θ) ≤ D0 + α.   (9)

In particular, we relax the infeasible constraint and find θ̄k as the solution that gives the minimum relaxation. Due to the specific form of (6), D̄^(k)(θ) is decomposable into quadratic forms of each component of θ, with no terms involving θi · θj. Therefore, the solution to problem (9) can be written in a closed form. Given θ̄k from either (7) or (9), we update θk by

θk+1 = (1 − ηk) · θk + ηk · θ̄k,   (10)

where ηk is the learning rate to be specified later. Note that although we consider only one constraint in the algorithm, both the algorithm and the theoretical result in Section 4 can be directly generalized to the multiple-constraint setting. The whole procedure is summarized in Algorithm 1.

Algorithm 1 Successive convex relaxation algorithm for constrained MDP
1: Input: Initial value θ0, τ, {ρk}, {ηk}.
2: for k = 1, 2, 3, . . . do
3:   Obtain samples J*(θk) and D*(θk) by Monte-Carlo sampling.
4:   Obtain samples ∇θJ*(θk) and ∇θD*(θk) by the policy gradient theorem.
5:   if problem (7) is feasible then
6:     Obtain θ̄k by solving (7).
7:   else
8:     Obtain θ̄k by solving (9).
9:   end if
10:  Update θk+1 by (10).
11: end for

Obtaining realizations J*(θk) and ∇θJ*(θk). We detail how to obtain the realizations J*(θk) and ∇θJ*(θk) corresponding to lines 3 and 4 of Algorithm 1.
The realizations of D*(θk) and ∇θD*(θk) can be obtained similarly.

First, we discuss the finite horizon setting, where we can sample a full trajectory according to the policy πθ. In particular, for any θk, we use the policy πθk to sample a trajectory and obtain J*(θk) by the Monte-Carlo method. The gradient ∇θJ(θ) can be estimated by the policy gradient theorem [59],

∇θJ(θ) = E_{πθ}[∇θ log πθ(s, a) · Qθ(s, a)].   (11)

Again, we can sample a trajectory and obtain the policy gradient realization ∇θJ*(θk) by the Monte-Carlo method.

In the infinite horizon setting, we cannot sample a trajectory of infinite length. In this case, we utilize the truncation method introduced in [48], which truncates the trajectory at some stage T and scales the undiscounted cumulative reward to obtain an unbiased estimate. Intuitively, if the discount factor γ is close to 0, then future rewards are discounted heavily and, therefore, we can obtain an accurate estimate with a relatively small number of stages. On the other hand, if γ is close to 1, then future rewards matter more than in the small-γ case and we have to sample a long trajectory. Taking this intuition into consideration, we define T to be a geometric random variable with parameter 1 − γ: Pr(T = t) = (1 − γ)γ^t. Then, we simulate the trajectory until stage T and use the estimator J_truncate(θ) = (1 − γ) · Σ_{t=0}^{T} r(s_t, a_t), which is an unbiased estimator of the expected negative cumulative reward J(θ), as proved in Proposition 5 of [43]. We can apply the same truncation procedure to estimate the policy gradient ∇θJ(θ).

Variance reduction. Using the naive sampling method described above, we may suffer from a high variance problem.
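The geometric-truncation estimator can be sketched as follows. This is illustrative only: `sample_reward` is a hypothetical callback that steps the environment under the current policy and returns r(s_t, a_t); the same routine applies to the cost d and to the score-function terms of the gradient.

```python
import random

def truncated_return_estimate(sample_reward, gamma, rng=random):
    """One-sample truncation estimator: draw T ~ Geometric(1 - gamma),
    i.e. Pr(T = t) = (1 - gamma) * gamma**t, roll out t = 0, ..., T, and
    return (1 - gamma) times the *undiscounted* truncated sum.

    Since Pr(T >= t) = gamma**t, the truncated sum satisfies
    E[sum_{t<=T} r_t] = sum_t gamma**t * r_t, so the scaled value estimates
    the (1 - gamma)-normalized discounted return."""
    total, t = 0.0, 0
    while True:
        total += sample_reward(t)
        if rng.random() < 1.0 - gamma:  # geometric stopping: T = t
            return (1.0 - gamma) * total
        t += 1
```

Note the randomness of T itself contributes variance, which is one motivation for the averaging and baseline techniques discussed next.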
To reduce the variance, we can modify the above procedure in the following ways.\nFirst, instead of sampling only one trajectory in each iteration, a more practical and stable way is to\nsample several trajectories and take average to obtain the realizations. As another approach, we can\nsubtract a baseline function from the action-value function Q\u2713(s, a) in the policy gradient estimation\nstep (11) to reduce the variance without changing the expectation. A popular choice of the baseline\nfunction is the state-value function V \u2713(s) as de\ufb01ned in (2). In this way, we can replace Q\u2713(s, a) in\n(11) by the advantage function A\u2713(s, a) de\ufb01ned as\n\nA\u2713(s, a) = Q\u2713(s, a)  V \u2713(s).\n\nThis modi\ufb01cation corresponds to the standard REINFORCE with Baseline algorithm [58] and can\nsigni\ufb01cantly reduce the variance of policy gradient.\nActor-critic method. Finally, we can use an actor-critic update to improve the performance further.\nIn this case, since we need unbiased estimators for both the gradient and the reward value in (5) and\n(6) in online fashion, we modify our original problem (4) to average reward setting as\n\nminimize\n\n\u27132\u21e5\n\nJ(\u2713) = lim\nT!1\n\nsubject to D(\u2713) = lim\nT!1\n\nE\u21e1\u2713\uf8ff\nE\u21e1\u2713\uf8ff 1\n\nT\n\n1\nT\n\nr(st, at),\nTXt=0\nd(st, at) \uf8ff D0.\nTXt=0\n\n5\n\n\fTake action a, observe reward r, cost d, and new state s0.\nCritic step:\n\nAlgorithm 2 Actor-Critic update for constrained MDP\n1: for k = 1, 2, 3, . . . 
do\n2:\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n\nw w + w \u00b7 JrwV J\nv v + v \u00b7 DrvV J\nCalculate TD error:\nJ = r  J + V J\nD = d  D + V D\nSolve \u2713k by (7) or (9) with\n\nw (s), J J + w \u00b7r  J.\nv (s), D D + v \u00b7d  D.\nw (s).\nw (s0)  V J\nv (s).\nv (s0)  V D\n\nActor step:\n\ns s0.\n11:\n12: end for\n\nJ\u21e4(\u2713k), r\u2713J\u21e4(\u2713k) in (5) replaced by J and J \u00b7r \u2713 log \u21e1\u2713(s, a);\nD\u21e4(\u2713k), r\u2713D\u21e4(\u2713k) in (6) replaced by D and D \u00b7r \u2713 log \u21e1\u2713(s, a).\n\n\u2713 (s) and V D\n\nLet V J\n\u2713 (s) denote the value and cost functions corresponding to (2). We use possibly\nnonlinear approximation with parameter w for the value function: V J\nw (s) and v for the cost function:\nv (s). In the critic step, we update w and v by TD(0) with step size w and v; in the actor step, we\nV D\nsolve our proposed convex relaxation problem to update \u2713. The actor-critic procedure is summarized\nin Algorithm 2. Here J and D are estimators of J(\u2713k) and D(\u2713k). Both of J and D, and the TD\nerror J, D can be initialized as 0.\nThe usage of the actor-critic method helps reduce variance by using a value function instead of\nMonte-Carlo sampling. Speci\ufb01cally, in Algorithm 1 we need to obtain a sample trajectory and\ncalculate J\u21e4(\u2713) and r\u2713J\u21e4(\u2713) by Monte-Carlo sampling. This step has a high variance since we need\nto sample a potentially long trajectory and sum up a lot of random rewards. In contrast, in Algorithm\n2, this step is replaced by a value function V J\n\nw (s), which reduces the variance.\n\n4 Theoretical Result\n\nIn this section, we show almost sure convergence of the iterates obtained by our algorithm to a\nstationary point. 
We start by stating some mild assumptions on the original problem (4) and on the choice of parameters in Algorithm 1.

Assumption 1 The choices of {ηk} and {ρk} satisfy Σ_k ηk = ∞, Σ_k ρk = ∞, and Σ_k (ηk² + ρk²) < ∞. Furthermore, lim_{k→∞} ηk/ρk = 0 and ηk is decreasing.

Assumption 2 For any realization, J*(θ) and D*(θ) are continuously differentiable as functions of θ. Moreover, J*(θ), D*(θ), and their derivatives are uniformly Lipschitz continuous.

Assumption 1 allows us to specify the learning rates. A practical choice would be ηk = k^{−c1} and ρk = k^{−c2} with 0.5 < c2 < c1 < 1. This assumption is standard for gradient-based algorithms. Assumption 2 is also standard and is known to hold for a number of models. It ensures that the reward and cost functions are sufficiently regular. In fact, it can be relaxed so that each realization is Lipschitz (not uniformly), and the event that we keep generating realizations with monotonically increasing Lipschitz constants has probability 0; see condition iv) in [67] and the discussion thereafter. See also [45] for sufficient conditions under which both the expected cumulative reward function and its gradient are Lipschitz.

The following Assumption 3 is needed only when we initialize at an infeasible point. We state it here and discuss it after the statement of the main theorem.

Assumption 3 Suppose (θS, αS) is a stationary point of the optimization problem

minimize_{θ, α} α   subject to   D(θ) ≤ D0 + α.   (12)

Then θS is a feasible point of the original problem (4), i.e.
D(θS) ≤ D0.

We are now ready to state the main theorem.

Theorem 4 Suppose Assumptions 1 and 2 are satisfied with a small enough initial step size η0. Suppose also that either θ0 is a feasible point or Assumption 3 is satisfied. If there is a subsequence {θkj} of {θk} that converges to some θ̃, then there exist uniformly continuous functions Ĵ(θ) and D̂(θ) satisfying

lim_{j→∞} J̄^(kj)(θ) = Ĵ(θ)   and   lim_{j→∞} D̄^(kj)(θ) = D̂(θ).

Furthermore, suppose there exists θ such that D̂(θ) < D0 (i.e., Slater's condition holds); then θ̃ is a stationary point of the original problem (4) almost surely.

The proof of Theorem 4 is provided in the supplementary material.

Note that Assumption 3 is not necessary if we start from a feasible point, or if we reach a feasible point in the iterates, which could be viewed as an initializer. Assumption 3 makes sure that the iterates in Algorithm 1 keep making progress without getting stuck at any infeasible stationary point. A similar condition is assumed in [38] for an infeasible initializer. If it turns out that θ0 is infeasible and Assumption 3 is violated, then the convergent point may be an infeasible stationary point of (12). In practice, if we can find a feasible point of the original problem, then we proceed with that point. Alternatively, we could generate multiple initializers and obtain iterates for all of them. As long as one of the resulting iterate sequences contains a feasible point, we can view that feasible point as the initializer and Theorem 4 follows without Assumption 3. In our later experiments, every single replicate reached a feasible point, so Assumption 3 was not necessary.

Our algorithm does not guarantee safe exploration during the training phase. Ensuring safety during learning is a more challenging problem.
Indeed, sometimes even finding a feasible point is not straightforward; otherwise, Assumption 3 would not be needed.

Our proposed algorithm is inspired by [38]. Compared to [38], which deals with a generic optimization problem, solving the safe reinforcement learning problem is more challenging: we need to verify that the Lipschitz condition is satisfied, and the policy gradient has to be estimated (instead of evaluated directly, as in a standard optimization problem). The use of the actor-critic algorithm to reduce the variance of the sampling is also unique to reinforcement learning.

5 Application to Constrained Linear-Quadratic Regulator

We apply our algorithm to the linear-quadratic regulator (LQR), which is one of the most fundamental problems in control theory. In the LQR setting, the state dynamic equation is linear, the cost function is quadratic, and optimal control theory tells us that the optimal control for LQR is a linear function of the state [23, 6]. LQR can be viewed as an MDP problem, and it has attracted a lot of attention in the reinforcement learning literature [12, 13, 21, 47].

We consider the infinite-horizon, discrete-time LQR problem. Denote by xt the state variable and by ut the control variable. The state transition and the control sequence are given by

xt+1 = A xt + B ut + vt,   ut = −F xt + wt,   (13)

where vt and wt represent possible Gaussian white noise, and the initial state is given by x0. The goal is to find the control parameter matrix F such that the expected total cost is minimized. The usual cost function of LQR corresponds to the negative reward in our setting, and we impose an additional quadratic constraint on the system. The overall optimization problem is given by

minimize_F   J(F) = E[ Σ_{t≥0} xtᵀ Q1 xt + utᵀ R1 ut ],
subject to   D(F) = E[ Σ_{t≥0} xtᵀ Q2 xt + utᵀ R2 ut ] ≤ D0,

where Q1, Q2, R1, and R2 are positive definite matrices.
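For a fixed F in the noise-free case, both the objective and the constraint can be evaluated by simply rolling out the closed-loop system. The sketch below is a finite-horizon truncation for intuition only (the horizon `T` is an illustrative parameter; the truncation is accurate when A − BF is stable):

```python
import numpy as np

def lqr_values(A, B, F, Q1, R1, Q2, R2, x0, T=2000):
    """Roll out x_{t+1} = (A - B F) x_t with u_t = -F x_t from x0 and
    accumulate the quadratic objective and constraint costs
    (finite-horizon truncation of the infinite sums)."""
    x, J, D = x0, 0.0, 0.0
    for _ in range(T):
        u = -F @ x
        J += x @ Q1 @ x + u @ R1 @ u
        D += x @ Q2 @ x + u @ R2 @ u
        x = (A - B @ F) @ x
    return J, D
```

Averaging such rollouts over draws of x0 approximates the expectations in the problem above.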
Note that even though the matrices are positive definite, both the objective function J and the constraint D are nonconvex with respect to the parameter F. Furthermore, with the additional constraint, the optimal control sequence may no longer be linear in the state xt. Nevertheless, in this work, we still consider the linear control given by (13), and the goal is to find the best linear control for this constrained LQR problem. We assume that the choices of A and B are such that the optimal cost is finite.

Random initial state. We first consider the setting where the initial state x0 ∼ D follows a random distribution D, while both the state transition and the control sequence (13) are deterministic (i.e., vt = wt = 0). In this random initial state setting, [24] showed that without the constraint, the policy gradient method converges efficiently to the global optimum in polynomial time. In the constrained case, we can explicitly write down the objective and constraint functions, since the only randomness comes from the initial state. Therefore, we have the state dynamic xt+1 = (A − BF)xt, and the objective function has the following expression ([24], Lemma 1)

J(F) = E_{x0∼D}[ x0ᵀ P_F x0 ],   (14)

where P_F is the solution to the following equation

P_F = Q1 + Fᵀ R1 F + (A − BF)ᵀ P_F (A − BF).   (15)

The gradient is given by

∇_F J(F) = 2( (R1 + Bᵀ P_F B) F − Bᵀ P_F A ) · E_{x0∼D}[ Σ_{t=0}^{∞} xt xtᵀ ].   (16)

Let S_F = Σ_{t=0}^{∞} xt xtᵀ and observe that

S_F = x0 x0ᵀ + (A − BF) S_F (A − BF)ᵀ.   (17)

We start from some F0 and apply our Algorithm 1 to solve the constrained LQR problem. In iteration k, with the current estimator denoted by Fk, we first obtain an estimator of P_{Fk} by starting from Q1 and iteratively applying the recursion P_{Fk} ← Q1 + Fkᵀ R1 Fk + (A − BFk)ᵀ P_{Fk} (A − BFk) until convergence.
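The fixed-point recursion for P_{Fk} can be sketched as follows (a minimal illustration; the iteration converges when the spectral radius of the closed-loop matrix A − BF is below one):

```python
import numpy as np

def solve_PF(A, B, F, Q, R, tol=1e-10, max_iter=10000):
    """Fixed-point iteration for the cost-to-go matrix P_F of u = -F x:
        P_F = Q + F' R F + (A - B F)' P_F (A - B F),
    started from P = Q, as in the recursion described above."""
    Acl = A - B @ F
    P = Q.copy()
    for _ in range(max_iter):
        P_new = Q + F.T @ R @ F + Acl.T @ P @ Acl
        if np.max(np.abs(P_new - P)) < tol:
            return P_new
        P = P_new
    return P
```

The same routine with Q2, R2 in place of Q1, R1 yields the matrix used to evaluate the constraint.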
Next, we sample an x\u21e40 from the distribution D and follow a similar recursion\ngiven by (17) to obtain an estimate of SFk. Plugging the sample x\u21e40 and the estimates of PFk and\nSFk into (14) and (16), we obtain the sample reward value J\u21e4(Fk) and rF J\u21e4(Fk), respectively.\n(F ). We apply the same procedure to\nWith these two values, we follow (5) and (8) and obtain J\nthe cost function D(F ) with Q1, R1 replaced by Q2, R2 to obtain D\n(F ). Finally we solve the\noptimization problem (7) (or (9) if (7) is infeasible) and obtain Fk+1 by (10).\n\n(k)\n\n(k)\n\nRandom state transition and control. We then consider the setting where both vt and wt are\nindependent standard Gaussian white noise.\nIn this case, the state dynamic can be written as\nxt+1 = (A  BF )xt + \u270ft where \u270ft \u21e0N (0, I + BB>). Let PF be de\ufb01ned as in (15) and SF be the\nsolution to the following Lyapunov equation\n\nSF = I + BB> + (A  BF )SF (A  BF )>.\nThe objective function has the following expression ([68], Proposition 3.1)\nJ(F ) = Ex\u21e0N (0,SF )hx>(Q1 + F >R1F )xi + tr(R1),\n\nand the gradient is given by\n\n(18)\n\n(19)\n\nrF J(F ) = 2\u21e3R1 + B>PF BF  B>PF A\u2318 \u00b7 Ex\u21e0N (0,SF )hxx>i.\n\nAlthough in this setting it is straightforward to calculate the expectation in a closed form, we keep the\ncurrent expectation form to be in line with our algorithm. Moreover, when the error distribution is\nmore complicated or unknown, we can no longer calculate the closed form expression and have to\nsample in each iteration. With the formulas given by (18) and (19), we again apply our Algorithm 1.\nWe sample x \u21e0N (0, SF ) in each iteration and solve the optimization problem (7) or (9). The whole\nprocedure is similar to the random initial state case described above.\n\nOther applications. Our algorithm can also be applied to constrained parallel MDP and constrained\nmulti-agent MDP problem. 
Due to the space limit, we relegate them to the supplementary material.

Figure 1: An experiment on the constrained LQR problem. (a) Constraint value D(\theta_k) in each iteration; (b) objective value J(\theta_k) in each iteration. The iterate starts from an infeasible point, then becomes feasible, and eventually converges.

               min value        # iterations   approx. min value   approx. # iterations
Our method     30.689 ± 0.114   2001 ± 1172    30.694 ± 0.114      604.3 ± 722.4
Lagrangian     30.693 ± 0.113   7492 ± 1780    30.699 ± 0.113      5464 ± 2116

Table 1: Comparison of our method with the Lagrangian method.

6 Experiment

We verify the effectiveness of the proposed algorithm through experiments. We focus on the LQR setting with a random initial state, as discussed in Section 5. In this experiment we set x \in \mathbb{R}^{15} and u \in \mathbb{R}^8. The initial state distribution is uniform on the unit cube: x_0 \sim \mathcal{D} = Uniform[-1, 1]^{15}. Each element of A and B is sampled independently from the standard normal distribution, and A is scaled such that its eigenvalues are within the range (-1, 1). We initialize F_0 as an all-zero matrix, and the constraint function and the value D_0 are chosen such that (1) the constrained problem is feasible; (2) the solution of the unconstrained problem does not satisfy the constraint, i.e., the problem is not trivial; and (3) the initial value F_0 is not feasible. The learning rates are set as \eta_k = (2/3) k^{-3/4} and \rho_k = (2/3) k^{-2/3}. This conservative choice of step size avoids the situation where an eigenvalue of A - BF leaves the range (-1, 1), so that the system remains stable.
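The problem-instance generation above can be sketched as follows (assuming NumPy; the exact rescaling of A and the construction of the constraint function and D_0 are not fully specified in the text, so the spectral-radius target of 0.9 here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 15, 8                       # state and control dimensions from Section 6

# Sample A, B entrywise from the standard normal distribution, then
# rescale A so all of its eigenvalues lie inside (-1, 1).
A = rng.standard_normal((n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius 0.9 (assumed)
B = rng.standard_normal((n, d))

F0 = np.zeros((d, n))              # all-zero initialization of the policy
x0 = rng.uniform(-1.0, 1.0, n)     # one draw of x0 ~ Uniform[-1, 1]^15

# Diminishing step sizes from Section 6.
eta = lambda k: (2.0 / 3.0) * k ** (-3.0 / 4.0)
rho = lambda k: (2.0 / 3.0) * k ** (-2.0 / 3.0)
```

The constraint function and D_0 would then be chosen, as stated above, so that the constrained problem is feasible but neither the unconstrained solution nor F_0 satisfies the constraint.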
Figure 1(a) and 1(b) show the constraint and objective values in each iteration, respectively. The red horizontal line in Figure 1(a) marks D_0, while the horizontal line in Figure 1(b) marks the unconstrained minimum objective value. We can see from Figure 1(a) that we start from an infeasible point, and the problem becomes feasible after about 100 iterations. The objective value is in general decreasing after becoming feasible, but it never falls below the unconstrained minimum, as shown in Figure 1(b).

Comparison with the Lagrangian method. We compare our proposed method with the usual Lagrangian method. For the Lagrangian method, we follow the algorithm proposed in [18] for safe reinforcement learning, which iteratively applies gradient descent on the parameter F and gradient ascent on the Lagrange multiplier \lambda of the Lagrangian function until convergence.

Table 1 reports the comparison results, with means and standard deviations based on 50 replicates. In the second and third columns, we compare the minimum objective value and the number of iterations needed to achieve it. We also consider an approximate version, where we are satisfied with the result if the objective value exceeds the minimum value by less than 0.02%. The fourth and fifth columns show the comparison results for this approximate version. Both methods achieve similar minimum objective values, but ours requires fewer policy updates, in both the exact and approximate versions.

The code is available at https://github.com/ming93/Safe_reinforcement_learning.

References

[1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.

[2] Leonard Adolphs. Non convex-concave saddle point optimization. Master's thesis, ETH Zurich, 2018.

[3] Eitan Altman. Constrained Markov decision processes, volume 7.
CRC Press, 1999.

[4] Haitham Bou Ammar, Rasul Tutunov, and Eric Eaton. Safe policy search for lifelong reinforcement learning with sublinear regret. In International Conference on Machine Learning, pages 2361–2369, 2015.

[5] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[6] Brian DO Anderson and John B Moore. Optimal control: linear quadratic methods. Courier Corporation, 2007.

[7] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient Q-learning with low switching cost. arXiv preprint arXiv:1905.12849, 2019.

[8] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[9] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.

[10] Vivek S Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.

[11] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pages 195–210. Morgan Kaufmann Publishers Inc., 1996.

[12] Steven J Bradtke. Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems, pages 295–302, 1993.

[13] Steven J Bradtke, B Erik Ydstie, and Andrew G Barto. Adaptive linear quadratic control using policy iteration. In Proceedings of the American Control Conference, volume 3, pages 3475–3475. Citeseer, 1994.

[14] Leo Breiman. Random forests.
Machine Learning, 45(1):5–32, 2001.

[15] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.

[16] Qi Cai, Zhuoran Yang, Jason D Lee, and Zhaoran Wang. Neural temporal-difference learning converges to global optima. arXiv preprint arXiv:1905.10027, 2019.

[17] Tianyi Chen, Kaiqing Zhang, Georgios B Giannakis, and Tamer Başar. Communication-efficient distributed reinforcement learning. arXiv preprint arXiv:1812.03239, 2018.

[18] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–167, 2017.

[19] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.

[20] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.

[21] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

[22] Nelson Dunford and Jacob T Schwartz. Linear Operators Part I: General Theory, volume 7. Interscience Publishers, New York, 1958.

[23] Lawrence C Evans. An introduction to mathematical optimal control theory. Lecture Notes, University of California, Department of Mathematics, Berkeley, 2005.

[24] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator.
In International Conference on Machine Learning, pages 1466–1475, 2018.

[25] Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 2018.

[26] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[27] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[28] Ivo Grondman, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6):1291–1307, 2012.

[29] Jingyu He, Saar Yalov, and P Richard Hahn. XBART: Accelerated Bayesian additive regression trees. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1130–1138, 2019.

[30] Jingyu He, Saar Yalov, Jared Murray, and P Richard Hahn. Stochastic tree ensembles for regularized supervised learning. Technical report, 2019.

[31] Jessie Huang, Fa Wu, Doina Precup, and Yang Cai. Learning safe policies with expert guidance. arXiv preprint arXiv:1805.08313, 2018.

[32] John L Kelley. General Topology. Courier Dover Publications, 2017.

[33] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.

[34] R Matthew Kretchmar. Parallel reinforcement learning. In The 6th World Conference on Systemics, Cybernetics, and Informatics.
Citeseer, 2002.

[35] Jonathan Lacotte, Yinlam Chow, Mohammad Ghavamzadeh, and Marco Pavone. Risk-sensitive generative adversarial imitation learning. arXiv preprint arXiv:1808.04468, 2018.

[36] Dennis Lee, Haoran Tang, Jeffrey O Zhang, Huazhe Xu, Trevor Darrell, and Pieter Abbeel. Modular architecture for StarCraft II with deep reinforcement learning. In Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2018.

[37] Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

[38] An Liu, Vincent Lau, and Borna Kananian. Stochastic successive convex approximation for non-convex constrained stochastic optimization. arXiv preprint arXiv:1801.08266, 2018.

[39] Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.

[40] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[41] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[42] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

[43] Santiago Paternain. Stochastic Control Foundations of Autonomous Behavior.
PhD thesis, University of Pennsylvania, 2018.

[44] Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

[45] Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.

[46] LA Prashanth and Mohammad Ghavamzadeh. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning, 105(3):367–417, 2016.

[47] Benjamin Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2018.

[48] Chang-han Rhee and Peter W Glynn. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.

[49] Andrzej Ruszczyński. Feasible direction methods for stochastic programming problems. Mathematical Programming, 19(1):220–229, 1980.

[50] Joris Scharpff, Diederik M Roijers, Frans A Oliehoek, Matthijs TJ Spaan, and Mathijs Michiel de Weerdt. Solving transition-independent multi-agent MDPs with sparse interactions. In AAAI, pages 3174–3180, 2016.

[51] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[52] Gesualdo Scutari, Francisco Facchinei, Peiran Song, Daniel P Palomar, and Jong-Shi Pang. Decomposition by partial linearization: Parallel optimization of multi-agent systems.
IEEE Transactions on Signal Processing, 62(3):641–656, 2013.

[53] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[54] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

[55] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[56] Peng Sun, Xinghai Sun, Lei Han, Jiechao Xiong, Qing Wang, Bo Li, Yang Zheng, Ji Liu, Yongsheng Liu, Han Liu, et al. TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193, 2018.

[57] Ying Sun, Prabhu Babu, and Daniel P Palomar. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing, 65(3):794–816, 2016.

[58] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[59] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[60] Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.

[61] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause.
Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.

[62] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.

[63] Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, and Mingyi Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. arXiv preprint arXiv:1806.00877, 2018.

[64] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.

[65] Min Wen and Ufuk Topcu. Constrained cross-entropy method for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 7461–7471, 2018.

[66] Sijia Xu, Hongyu Kuang, Zhi Zhuang, Renjie Hu, Yang Liu, and Huyang Sun. Macro action selection with deep reinforcement learning in StarCraft. arXiv preprint arXiv:1812.00336, 2018.

[67] Yang Yang, Gesualdo Scutari, Daniel P Palomar, and Marius Pesavento. A parallel decomposition method for nonconvex stochastic multi-agent optimization problems. IEEE Transactions on Signal Processing, 64(11):2949–2964, 2016.

[68] Zhuoran Yang, Yongxin Chen, Mingyi Hong, and Zhaoran Wang. On the global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. arXiv preprint arXiv:1907.06246, 2019.

[69] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. Finite-sample analyses for fully decentralized multi-agent reinforcement learning.
arXiv preprint arXiv:1812.02783, 2018.