{"title": "Algorithms for CVaR Optimization in MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 3509, "page_last": 3517, "abstract": "In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in costs in addition to minimizing a standard criterion. Conditional value-at-risk (CVaR) is a relatively new risk measure that addresses some of the shortcomings of the well-known variance-related risk measures, and because of its computational efficiencies has gained popularity in finance and operations research. In this paper, we consider the mean-CVaR optimization problem in MDPs. We first derive a formula for computing the gradient of this risk-sensitive objective function. We then devise policy gradient and actor-critic algorithms that each uses a specific method to estimate this gradient and updates the policy parameters in the descent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem.", "full_text": "Algorithms for CVaR Optimization in MDPs\n\nInstitute of Computational & Mathematical Engineering, Stanford University\n\nYinlam Chow\u2217\n\nMohammad Ghavamzadeh\u2020\n\nAdobe Research & INRIA Lille - Team SequeL\n\nAbstract\n\nIn many sequential decision-making problems we may want to manage risk by\nminimizing some measure of variability in costs in addition to minimizing a stan-\ndard criterion. Conditional value-at-risk (CVaR) is a relatively new risk measure\nthat addresses some of the shortcomings of the well-known variance-related risk\nmeasures, and because of its computational ef\ufb01ciencies has gained popularity in\n\ufb01nance and operations research. In this paper, we consider the mean-CVaR op-\ntimization problem in MDPs. We \ufb01rst derive a formula for computing the gradi-\nent of this risk-sensitive objective function. 
We then devise policy gradient and actor-critic algorithms, each of which uses a specific method to estimate this gradient and updates the policy parameters in the descent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem.

1 Introduction

A standard optimization criterion for an infinite horizon Markov decision process (MDP) is the expected sum of (discounted) costs (i.e., finding a policy that minimizes the value function of the initial state of the system). However, in many applications we may prefer to minimize some measure of risk in addition to this standard optimization criterion. In such cases, we would like to use a criterion that incorporates a penalty for the variability (due to the stochastic nature of the system) induced by a given policy. In risk-sensitive MDPs [16], the objective is to minimize a risk-sensitive criterion such as the expected exponential utility [16], a variance-related measure [24, 14], or the percentile performance [15]. The question of how to construct such criteria in a manner that is both conceptually meaningful and mathematically tractable remains open.

Although most losses (returns) are not normally distributed, the typical Markowitz mean-variance optimization [18], which relies on the first two moments of the loss (return) distribution, has dominated risk management for over 50 years. Numerous alternatives to mean-variance optimization have emerged in the literature, but there is no clear leader among these alternative risk-sensitive objective functions. Value-at-risk (VaR) and conditional value-at-risk (CVaR) are two promising such alternatives that quantify the losses that might be encountered in the tail of the loss distribution, and thus have received high status in risk management. For (continuous) loss distributions, while VaRα measures risk as the maximum loss that might be incurred w.r.t. a given confidence level α, CVaRα measures it as the expected loss given that the loss is greater than or equal to VaRα. Although VaR is a popular risk measure, CVaR's computational advantages over VaR have boosted the development of CVaR optimization techniques. We provide the exact definitions of these two risk measures and briefly discuss some of VaR's shortcomings in Section 2. CVaR minimization was first developed by Rockafellar and Uryasev [23], and its numerical effectiveness was demonstrated in portfolio optimization and option hedging problems. Their work was then extended to objective functions consisting of different combinations of the expected loss and the CVaR, such as the minimization of the expected loss subject to a constraint on CVaR. This is the objective function that we study in this paper, although we believe that our proposed algorithms can be easily extended to several other CVaR-related objective functions. Boda and Filar [9] and Bäuerle and Ott [20, 3] extended the results of [23] to MDPs (sequential decision-making). While the former proposed to use dynamic programming (DP) to optimize CVaR, an approach that is limited to small problems, the latter showed that in both finite and infinite horizon MDPs, there exists a deterministic history-dependent optimal policy for CVaR optimization (see Section 3 for more details).

*Part of this work was completed during Yinlam Chow's internship at Adobe Research.
†Mohammad Ghavamzadeh is at Adobe Research, on leave of absence from INRIA Lille - Team SequeL.

Most of the work in risk-sensitive sequential decision-making has been in the context of MDPs (when the model is known) and much less work has been done within the reinforcement learning (RL) framework. In risk-sensitive RL, we can mention the work by Borkar [10, 11], who considered the expected exponential utility, and those by Tamar et al. [26] and Prashanth and Ghavamzadeh [17] on several variance-related risk measures. CVaR optimization in RL is a rather novel subject. Morimura et al. [19] estimate the return distribution while exploring using a CVaR-based risk-sensitive policy; their algorithm does not scale to large problems. Petrik and Subramanian [22] propose a method based on stochastic dual DP to optimize CVaR in large-scale MDPs, but their method is limited to linearly controllable problems. Borkar and Jain [12] consider a finite-horizon MDP with a CVaR constraint and sketch a stochastic approximation algorithm to solve it. Finally, Tamar et al. [27] have recently proposed a policy gradient algorithm for CVaR optimization.

In this paper, we develop policy gradient (PG) and actor-critic (AC) algorithms for mean-CVaR optimization in MDPs. We first derive a formula for computing the gradient of this risk-sensitive objective function. We then propose several methods to estimate this gradient both incrementally and using system trajectories (update at each time-step vs. update after observing one or more trajectories). We then use these gradient estimates to devise PG and AC algorithms that update the policy parameters in the descent direction. Using the ordinary differential equations (ODE) approach, we establish the asymptotic convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem.
In comparison to [27]: they develop a PG algorithm for CVaR optimization in stochastic shortest path problems that considers only continuous loss distributions, uses a biased estimator for VaR, is not incremental, and has no comprehensive convergence proof; here we study mean-CVaR optimization, consider both discrete and continuous loss distributions, devise both PG and (several) AC algorithms (trajectory-based and incremental – plus AC helps in reducing the variance of PG algorithms), and establish convergence proofs for our algorithms.

2 Preliminaries

We consider problems in which the agent's interaction with the environment is modeled as an MDP. An MDP is a tuple M = (X, A, C, P, P0), where X = {1, . . . , n} and A = {1, . . . , m} are the state and action spaces; C(x, a) ∈ [−Cmax, Cmax] is the bounded cost random variable whose expectation is denoted by c(x, a) = E[C(x, a)]; P(·|x, a) is the transition probability distribution; and P0(·) is the initial state distribution. For simplicity, we assume that the system has a single initial state x0, i.e., P0(x) = 1{x = x0}. All the results of the paper can be easily extended to the case where the system has more than one initial state. We also need to specify the rule according to which the agent selects actions at each state. A stationary policy µ(·|x) is a probability distribution over actions, conditioned on the current state. In policy gradient and actor-critic methods, we define a class of parameterized stochastic policies {µ(·|x; θ), x ∈ X, θ ∈ Θ ⊆ R^κ1}, estimate the gradient of a performance measure w.r.t. the policy parameters θ from the observed system trajectories, and then improve the policy by adjusting its parameters in the direction of the gradient. Since in this setting a policy µ is represented by its κ1-dimensional parameter vector θ, policy-dependent functions can be written as a function of θ in place of µ; we therefore use µ and θ interchangeably in the paper. We denote by d^µ_γ(x|x0) = (1 − γ) Σ_{k=0}^∞ γ^k P(x_k = x | x_0 = x0; µ) and π^µ_γ(x, a|x0) = d^µ_γ(x|x0) µ(a|x) the γ-discounted visiting distributions of state x and of state-action pair (x, a) under policy µ, respectively.

Let Z be a bounded-mean random variable, i.e., E[|Z|] < ∞, with cumulative distribution function F(z) = P(Z ≤ z) (e.g., one may think of Z as the loss of an investment strategy µ). We define the value-at-risk at confidence level α ∈ (0, 1) as

VaR_α(Z) = min{z | F(z) ≥ α}.

Here the minimum is attained because F is non-decreasing and right-continuous in z. When F is continuous and strictly increasing, VaR_α(Z) is the unique z satisfying F(z) = α; otherwise, the VaR equation can have no solution or a whole range of solutions. Although VaR is a popular risk measure, it suffers from being unstable and difficult to work with numerically when Z is not normally distributed, which is often the case, as loss distributions tend to exhibit fat tails or empirical discreteness. Moreover, VaR is not a coherent risk measure [1] and, more importantly, does not quantify the losses that might be suffered beyond its value at the α-tail of the distribution [23]. An alternative measure that addresses most of VaR's shortcomings is conditional value-at-risk, CVaR_α(Z), which is the mean of the α-tail distribution of Z. 
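Both risk measures are easy to estimate from a sample of losses. The sketch below is ours, not part of the paper: the helper names are hypothetical, it uses the empirical CDF, and `empirical_cvar` uses the tail-mean form of CVaR, which is exact only when no probability atom sits at the VaR.

```python
import numpy as np

def empirical_var(z, alpha):
    """VaR_alpha(Z) = min{z : F(z) >= alpha} under the empirical CDF."""
    zs = np.sort(np.asarray(z, dtype=float))
    # smallest index k with empirical CDF (k+1)/n >= alpha
    k = max(int(np.ceil(alpha * len(zs))) - 1, 0)
    return zs[k]

def empirical_cvar(z, alpha):
    """CVaR_alpha(Z): mean of the losses at or above VaR_alpha(Z)."""
    z = np.asarray(z, dtype=float)
    var = empirical_var(z, alpha)
    return z[z >= var].mean()

losses = np.arange(1.0, 11.0)  # losses 1, 2, ..., 10
print(empirical_var(losses, 0.8))   # 8.0: smallest loss with F(z) >= 0.8
print(empirical_cvar(losses, 0.8))  # 9.0: mean of the tail {8, 9, 10}
```

Note that CVaR is always at least as large as VaR, which matches its role as a more conservative tail measure.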
If there is no probability atom at VaR_α(Z), CVaR_α(Z) has a unique value that is defined as CVaR_α(Z) = E[Z | Z ≥ VaR_α(Z)]. Rockafellar and Uryasev [23] showed that

CVaR_α(Z) = min_{ν∈R} H_α(Z, ν) ≜ min_{ν∈R} { ν + (1/(1 − α)) E[(Z − ν)^+] },   (1)

where (x)^+ = max(x, 0) represents the positive part of x. Note that as a function of ν, H_α(·, ν) is finite and convex (hence continuous).

3 CVaR Optimization in MDPs

For a policy µ, we define the loss of a state x (state-action pair (x, a)) as the sum of (discounted) costs encountered by the agent when it starts at state x (state-action pair (x, a)) and then follows policy µ, i.e., D^θ(x) = Σ_{k=0}^∞ γ^k C(x_k, a_k) | x_0 = x, µ, and D^θ(x, a) = Σ_{k=0}^∞ γ^k C(x_k, a_k) | x_0 = x, a_0 = a, µ. The expected values of these two random variables are the value and action-value functions of policy µ, i.e., V^θ(x) = E[D^θ(x)] and Q^θ(x, a) = E[D^θ(x, a)]. The goal in the standard discounted formulation is to find an optimal policy θ* = argmin_θ V^θ(x0).

For CVaR optimization in MDPs, we consider the following optimization problem: for a given confidence level α ∈ (0, 1) and loss tolerance β ∈ R,

min_θ V^θ(x0)   subject to   CVaR_α(D^θ(x0)) ≤ β.   (2)

By Theorem 16 in [23], the optimization problem (2) is equivalent to (H_α is defined by (1))

min_{θ,ν} V^θ(x0)   subject to   H_α(D^θ(x0), ν) ≤ β.   (3)

To solve (3), we employ the Lagrangian relaxation procedure [4] to convert it to the following unconstrained problem:

max_{λ≥0} min_{θ,ν} ( L(θ, ν, λ) ≜ V^θ(x0) + λ ( H_α(D^θ(x0), ν) − β ) ),   (4)

where λ is the Lagrange multiplier. The goal here is to find the saddle point of L(θ, ν, λ), i.e., a point (θ*, ν*, λ*) that satisfies L(θ, ν, λ*) ≥ L(θ*, ν*, λ*) ≥ L(θ*, ν*, λ), ∀θ, ν, ∀λ ≥ 0. This is achieved by descending in (θ, ν) and ascending in λ using the gradients of L(θ, ν, λ) w.r.t. θ, ν, and λ, i.e.,^1

∇_θ L(θ, ν, λ) = ∇_θ V^θ(x0) + (λ/(1 − α)) ∇_θ E[(D^θ(x0) − ν)^+],   (5)

∂_ν L(θ, ν, λ) = λ ( 1 + (1/(1 − α)) ∂_ν E[(D^θ(x0) − ν)^+] ) ∋ λ ( 1 − (1/(1 − α)) P(D^θ(x0) ≥ ν) ),   (6)

∇_λ L(θ, ν, λ) = ν − β + (1/(1 − α)) E[(D^θ(x0) − ν)^+].   (7)

We assume that there exists a policy µ(·|·; θ) such that CVaR_α(D^θ(x0)) ≤ β (feasibility assumption). As discussed in Section 1, Bäuerle and Ott [20, 3] showed that there exists a deterministic history-dependent optimal policy for CVaR optimization. The important point is that this policy does not depend on the complete history, but only on the current time step k, the current state of the system x_k, and the accumulated discounted cost Σ_{i=0}^k γ^i C(x_i, a_i).

In the following, we present a policy gradient (PG) algorithm (Sec. 4) and several actor-critic (AC) algorithms (Sec. 5) to optimize (4). While the PG algorithm updates its parameters after observing several trajectories, the AC algorithms are incremental and update their parameters at each time-step.

^1 The notation ∋ in (6) means that the right-most term is a member of the sub-gradient set ∂_ν L(θ, ν, λ).

4 A Trajectory-based Policy Gradient Algorithm

In this section, we present a policy gradient algorithm to solve the optimization problem (4). 
The unit of observation in this algorithm is a system trajectory generated by following the current policy. At each iteration, the algorithm generates N trajectories by following the current policy, uses them to estimate the gradients in Eqs. 5-7, and then uses these estimates to update the parameters θ, ν, λ.

Let ξ = {x_0, a_0, x_1, a_1, . . . , x_{T−1}, a_{T−1}, x_T} be a trajectory generated by following the policy θ, where x_0 = x0 and x_T is usually a terminal state of the system. After x_k visits the terminal state, it enters a recurring sink state x_S at the next time step, incurring zero cost, i.e., C(x_S, a) = 0, ∀a ∈ A. The time index T is referred to as the stopping time of the MDP. Since the transitions are stochastic, T is a non-deterministic quantity. Here we assume that the policy µ is proper, i.e., Σ_{k=0}^∞ P(x_k = x | x_0 = x0, µ) < ∞ for every x ∉ {x_S}. This further means that with probability 1, the MDP exits the transient states and hits x_S (and stays in x_S) in finite time T. For simplicity, we assume that the agent incurs zero cost at the terminal state; analogous results for the general case with a non-zero terminal cost can be derived using identical arguments. The loss and probability of ξ are defined as D(ξ) = Σ_{k=0}^{T−1} γ^k c(x_k, a_k) and P_θ(ξ) = P0(x0) Π_{k=0}^{T−1} µ(a_k|x_k; θ) P(x_{k+1}|x_k, a_k), respectively. It can be easily shown that ∇_θ log P_θ(ξ) = Σ_{k=0}^{T−1} ∇_θ log µ(a_k|x_k; θ).

Algorithm 1 contains the pseudo-code of our proposed policy gradient algorithm. What appears inside the parentheses on the right-hand side of the update equations are the estimates of the gradients of L(θ, ν, λ) w.r.t. θ, ν, λ (estimates of Eqs. 5-7) (see Appendix A.2 of [13]). 
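As a concrete illustration, the sample-based estimates of the gradients in Eqs. 5-7 can be computed from N trajectory losses and score functions as follows. This is our own sketch with hypothetical array names, not the paper's implementation; a real implementation would obtain the score functions ∇_θ log P_θ(ξ_j) from the policy's log-likelihoods.

```python
import numpy as np

def cvar_pg_gradients(D, scores, nu, lam, alpha, beta):
    """Monte Carlo estimates of the gradients of L(theta, nu, lambda).

    D      : shape (N,), losses D(xi_j) of the sampled trajectories
    scores : shape (N, k1), score functions grad_theta log P_theta(xi_j)
    """
    tail = (D >= nu).astype(float)   # 1{D(xi_j) >= nu}
    excess = (D - nu) * tail         # (D(xi_j) - nu)^+
    # Eq. 5: likelihood-ratio estimate of the gradient w.r.t. theta
    g_theta = (scores * D[:, None]).mean(axis=0) \
        + lam / (1 - alpha) * (scores * excess[:, None]).mean(axis=0)
    # Eq. 6: sub-gradient estimate w.r.t. nu
    g_nu = lam * (1 - tail.mean() / (1 - alpha))
    # Eq. 7: gradient estimate w.r.t. lambda
    g_lam = nu - beta + excess.mean() / (1 - alpha)
    return g_theta, g_nu, g_lam
```

The parameters would then be updated in the descent/ascent directions, e.g., θ ← Γ_θ(θ − ζ2 · g_theta), ν ← Γ_ν(ν − ζ3 · g_nu), and λ ← Γ_λ(λ + ζ1 · g_lam), mirroring the structure of Algorithm 1.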
Γ_θ is an operator that projects a vector θ ∈ R^κ1 to the closest point in a compact and convex set Θ ⊂ R^κ1, and Γ_ν and Γ_λ are projection operators to [−Cmax/(1 − γ), Cmax/(1 − γ)] and [0, λmax], respectively. These projection operators are necessary to ensure the convergence of the algorithm. The step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the VaR parameter ν update is on the fastest time-scale {ζ3(i)}, the policy parameter θ update is on the intermediate time-scale {ζ2(i)}, and the Lagrange multiplier λ update is on the slowest time-scale {ζ1(i)} (see Appendix A.1 of [13] for the conditions on the step-size schedules). This results in a three time-scale stochastic approximation algorithm. We prove that our policy gradient algorithm converges to a (local) saddle point of the risk-sensitive objective function L(θ, ν, λ) (see Appendix A.3 of [13]).

Algorithm 1 Trajectory-based Policy Gradient Algorithm for CVaR Optimization

Input: parameterized policy µ(·|·; θ), confidence level α, and loss tolerance β
Initialization: policy parameter θ = θ_0, VaR parameter ν = ν_0, and the Lagrangian parameter λ = λ_0
for i = 0, 1, 2, . . . do
  Generate N trajectories {ξ_{j,i}}_{j=1}^N by starting at x_0 = x0 and following the current policy θ_i.
  ν Update: ν_{i+1} = Γ_ν[ ν_i − ζ3(i) ( λ_i − (λ_i/((1 − α)N)) Σ_{j=1}^N 1{D(ξ_{j,i}) ≥ ν_i} ) ]
  θ Update: θ_{i+1} = Γ_θ[ θ_i − ζ2(i) ( (1/N) Σ_{j=1}^N ∇_θ log P_θ(ξ_{j,i})|_{θ=θ_i} D(ξ_{j,i}) + (λ_i/((1 − α)N)) Σ_{j=1}^N ∇_θ log P_θ(ξ_{j,i})|_{θ=θ_i} (D(ξ_{j,i}) − ν_i) 1{D(ξ_{j,i}) ≥ ν_i} ) ]
  λ Update: λ_{i+1} = Γ_λ[ λ_i + ζ1(i) ( ν_i − β + (1/((1 − α)N)) Σ_{j=1}^N (D(ξ_{j,i}) − ν_i) 1{D(ξ_{j,i}) ≥ ν_i} ) ]
end for
return parameters ν, θ, λ

5 Incremental Actor-Critic Algorithms

As mentioned in Section 4, the unit of observation in our policy gradient algorithm (Algorithm 1) is a system trajectory. This may result in high variance for the gradient estimates, especially when the length of the trajectories is long. To address this issue, in this section we propose two actor-critic algorithms that use linear approximation for some quantities in the gradient estimates and update the parameters incrementally (after each state-action transition). These algorithms optimize the risk-sensitive measure (4) and are based on the gradient estimates of Sections 5.1-5.3. 
While the first algorithm (SPSA-based) is fully incremental and updates all the parameters θ, ν, λ at each time-step, the second one updates θ at each time-step but updates ν and λ only at the end of each trajectory, and is thus called semi trajectory-based. Algorithm 2 contains the pseudo-code of these algorithms. The projection operators Γ_θ, Γ_ν, and Γ_λ are defined as in Section 4 and are necessary to ensure the convergence of the algorithms. The step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the critic update is on the fastest time-scale {ζ4(i)}, the policy and VaR parameter updates are on the intermediate time-scales, with the ν-update {ζ3(i)} being faster than the θ-update {ζ2(i)}, and finally the Lagrange multiplier update is on the slowest time-scale {ζ1(i)} (see Appendix B.1 of [13] for the conditions on these step-size schedules). This results in four time-scale stochastic approximation algorithms. We prove that these actor-critic algorithms converge to a (local) saddle point of the risk-sensitive objective function L(θ, ν, λ) (see Appendix B.4 of [13]).

5.1 Gradient w.r.t. the Policy Parameters θ

The gradient of our objective function w.r.t. 
the policy parameters θ in (5) may be rewritten as

∇_θ L(θ, ν, λ) = ∇_θ ( E[D^θ(x0)] + (λ/(1 − α)) E[(D^θ(x0) − ν)^+] ).   (8)

Given the original MDP M = (X, A, C, P, P0) and the parameter λ, we define the augmented MDP M̄ = (X̄, Ā, C̄, P̄, P̄0) as X̄ = X × R, Ā = A, P̄0(x, s) = P0(x)1{s0 = s}, and

C̄(x, s, a) = λ(−s)^+/(1 − α) if x = x_T, and C̄(x, s, a) = C(x, a) otherwise;
P̄(x', s'|x, s, a) = P(x'|x, a) if s' = (s − C(x, a))/γ, and P̄(x', s'|x, s, a) = 0 otherwise,

where x_T is any terminal state of the original MDP M and s_T is the value of the s part of the state when a policy θ reaches a terminal state x_T after T steps, i.e., s_T = (1/γ^T)(ν − Σ_{k=0}^{T−1} γ^k C(x_k, a_k)). We define a class of parameterized stochastic policies {µ(·|x, s; θ), (x, s) ∈ X̄, θ ∈ Θ ⊆ R^κ1} for this augmented MDP. Thus, the total (discounted) loss of this trajectory can be written as

Σ_{k=0}^{T−1} γ^k C(x_k, a_k) + γ^T C̄(x_T, s_T, a) = D^θ(x0) + (λ/(1 − α)) (D^θ(x0) − ν)^+.   (9)

From (9), it is clear that the quantity in the parenthesis of (8) is the value function of the policy θ at state (x0, ν) in the augmented MDP M̄, i.e., V^θ(x0, ν). Thus, it is easy to show that (the second equality in Eq. 10 is the result of the policy gradient theorem [21])

∇_θ L(θ, ν, λ) = ∇_θ V^θ(x0, ν) = (1/(1 − γ)) Σ_{x,s,a} π^θ_γ(x, s, a|x0, ν) ∇ log µ(a|x, s; θ) Q^θ(x, s, a),   (10)

where π^θ_γ is the discounted visiting distribution (defined in Section 2) and Q^θ is the action-value function of policy θ in the augmented MDP M̄. We can show that (1/(1 − γ)) ∇ log µ(a_k|x_k, s_k; θ) · δ_k is an unbiased estimate of ∇_θ L(θ, ν, λ), where δ_k = C̄(x_k, s_k, a_k) + γ V̂(x_{k+1}, s_{k+1}) − V̂(x_k, s_k) is the temporal-difference (TD) error in M̄, and V̂ is an unbiased estimator of V^θ (see e.g., [6, 7]). In our actor-critic algorithms, the critic uses a linear approximation for the value function, V^θ(x, s) ≈ v^⊤φ(x, s) = Ṽ^{θ,v}(x, s), where the feature vector φ(·) belongs to the low-dimensional space R^κ2.

5.2 Gradient w.r.t. the Lagrangian Parameter λ

We may rewrite the gradient of our objective function w.r.t. the Lagrangian parameter λ in (7) as

∇_λ L(θ, ν, λ) = ν − β + ∇_λ ( E[D^θ(x0)] + (λ/(1 − α)) E[(D^θ(x0) − ν)^+] ) (a)= ν − β + ∇_λ V^θ(x0, ν).   (11)

Similar to Section 5.1, (a) comes from the fact that the quantity in the parenthesis in (11) is V^θ(x0, ν), the value function of the policy θ at state (x0, ν) in the augmented MDP M̄. Note that the dependence of V^θ(x0, ν) on λ comes from the definition of the cost function C̄ in M̄. 
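The bookkeeping of the augmented MDP M̄ is simple to simulate. The sketch below is our own illustration (hypothetical helper name, fixed numeric inputs): it applies the transition s' = (s − C(x, a))/γ along a cost sequence, charges the terminal cost λ(−s_T)^+/(1 − α), and numerically checks identity (9).

```python
def augmented_loss(costs, nu, lam, alpha, gamma):
    """Discounted loss in the augmented MDP for one trajectory of costs."""
    s, D, total = nu, 0.0, 0.0
    for k, c in enumerate(costs):
        D += gamma**k * c          # loss D accumulated in the original MDP
        total += gamma**k * c      # same per-step costs in the augmented MDP
        s = (s - c) / gamma        # augmented state update: s' = (s - C(x,a)) / gamma
    T = len(costs)
    # terminal cost C_bar(x_T, s_T, a) = lam * (-s_T)^+ / (1 - alpha)
    total += gamma**T * lam * max(-s, 0.0) / (1 - alpha)
    return total, D

total, D = augmented_loss([1.0, 2.0, 3.0], nu=2.0, lam=0.5, alpha=0.95, gamma=0.9)
# identity (9): total == D + lam/(1-alpha) * (D - nu)^+
assert abs(total - (D + 0.5 / (1 - 0.95) * max(D - 2.0, 0.0))) < 1e-9
```

The check confirms that minimizing the expected discounted loss of the augmented MDP is the same as minimizing the quantity in the parenthesis of (8).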
We now derive an expression for ∇_λ V^θ(x0, ν), which in turn will give us an expression for ∇_λ L(θ, ν, λ).

Lemma 1 The gradient of V^θ(x0, ν) w.r.t. the Lagrangian parameter λ may be written as

∇_λ V^θ(x0, ν) = (1/(1 − γ)) Σ_{x,s,a} π^θ_γ(x, s, a|x0, ν) (1/(1 − α)) 1{x = x_T}(−s)^+.   (12)

Proof. See Appendix B.2 of [13]. □

From Lemma 1 and (11), it is easy to see that ν − β + (1/((1 − γ)(1 − α))) 1{x = x_T}(−s)^+ is an unbiased estimate of ∇_λ L(θ, ν, λ). An issue with this estimator is that its value is fixed to ν_k − β all along a system trajectory, and only changes at the end to ν_k − β + (1/((1 − γ)(1 − α)))(−s_T)^+. This may affect the incremental nature of our actor-critic algorithm. To address this issue, we propose a different approach to estimate the gradients w.r.t. θ and λ in Sec. 5.4 (of course this does not come for free). Another important issue is that the above estimator is unbiased only if the samples are generated from the distribution π^θ_γ(·|x0, ν). If we just follow the policy, then we may use ν_k − β + (γ^k/(1 − α)) 1{x_k = x_T}(−s_k)^+ as an estimate for ∇_λ L(θ, ν, λ). Note that this is an issue for all discounted actor-critic algorithms whose (likelihood-ratio-based) estimates of the gradient are unbiased only if the samples are generated from π^θ_γ, and not when we simply follow the policy. This might be a reason why there is (to the best of our knowledge) no convergence analysis for (likelihood-ratio-based) discounted actor-critic algorithms.^2

5.3 Sub-Gradient w.r.t. the VaR Parameter ν

We may rewrite the sub-gradient of our objective function w.r.t. the VaR parameter ν (Eq. 6) as

∂_ν L(θ, ν, λ) ∋ λ ( 1 − (1/(1 − α)) P( Σ_{k=0}^∞ γ^k C(x_k, a_k) ≥ ν | x_0 = x0; θ ) ).   (13)

From the definition of the augmented MDP M̄, the probability in (13) may be written as P(s_T ≤ 0 | x_0 = x0, s_0 = ν; θ), where s_T is the s part of the state in M̄ when we reach a terminal state, i.e., x = x_T (see Section 5.1). Thus, we may rewrite (13) as

∂_ν L(θ, ν, λ) ∋ λ ( 1 − (1/(1 − α)) P(s_T ≤ 0 | x_0 = x0, s_0 = ν; θ) ).   (14)

From (14), it is easy to see that λ − λ1{s_T ≤ 0}/(1 − α) is an unbiased estimate of the sub-gradient of L(θ, ν, λ) w.r.t. ν. An issue with this (unbiased) estimator is that it can only be applied at the end of a system trajectory (i.e., when we reach the terminal state x_T), and thus, using it prevents us from having a fully incremental algorithm. In fact, this is the estimator that we use in our semi trajectory-based actor-critic algorithm. One approach to estimate this sub-gradient incrementally is to use the simultaneous perturbation stochastic approximation (SPSA) method [8]. 
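The trajectory-end estimator from (14) is trivial to average over completed episodes. A minimal sketch, with our own hypothetical function name, replacing P(s_T ≤ 0) by its empirical frequency:

```python
def nu_subgradient_estimate(s_T_samples, lam, alpha):
    """Estimate of the sub-gradient in (14):
    lambda * (1 - P(s_T <= 0) / (1 - alpha)),
    with P(s_T <= 0) estimated from terminal s_T values of completed trajectories."""
    p_tail = sum(1.0 for s in s_T_samples if s <= 0) / len(s_T_samples)
    return lam * (1.0 - p_tail / (1.0 - alpha))
```

With one terminal s_T per episode, this is exactly the average of the per-episode unbiased estimates λ − λ1{s_T ≤ 0}/(1 − α).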
The idea of SPSA is to estimate the sub-gradient g(ν) ∈ ∂_ν L(θ, ν, λ) using two values of g at ν^− = ν − Δ and ν^+ = ν + Δ, where Δ > 0 is a positive perturbation (see [8, 17] for a detailed description of Δ).^3 In order to see how SPSA can help us estimate our sub-gradient incrementally, note that

∂_ν L(θ, ν, λ) = λ + ∂_ν ( E[D^θ(x0)] + (λ/(1 − α)) E[(D^θ(x0) − ν)^+] ) (a)= λ + ∂_ν V^θ(x0, ν).   (15)

Similar to Section 5.1, (a) comes from the fact that the quantity in the parenthesis in (15) is V^θ(x0, ν), the value function of the policy θ at state (x0, ν) in the augmented MDP M̄. Since the critic uses a linear approximation for the value function, i.e., V^θ(x, s) ≈ v^⊤φ(x, s), in our actor-critic algorithms (see Section 5.1 and Algorithm 2), the SPSA estimate of the sub-gradient would be of the form g(ν) ≈ λ + v^⊤[φ(x0, ν^+) − φ(x0, ν^−)]/2Δ.

^2 Note that the discounted actor-critic algorithm with convergence proof in [5] is based on SPSA.
^3 The SPSA-based gradient estimate was first proposed in [25] and has been widely used in various settings, especially those involving high-dimensional parameters. The SPSA estimate described above is two-sided; it can also be implemented single-sided, where we use the values of the function at ν and ν^+. We refer the readers to [8] for more details on SPSA and to [17] for its application to learning in risk-sensitive MDPs.

5.4 An Alternative Approach to Compute the Gradients

In this section, we present an alternative way to compute the gradients, especially those w.r.t. θ and λ. This allows us to estimate the gradient w.r.t. λ in a (more) incremental fashion (compared to the method of Section 5.3), at the cost of needing two different linear function approximators (instead of the one used in Algorithm 2). In this approach, we define the augmented MDP slightly differently than the one in Section 5.3. The only difference is in the definition of the cost function, which is defined here as (note that C(x, a) has been replaced by 0 and λ has been removed)

C̄(x, s, a) = (−s)^+/(1 − α) if x = x_T, and C̄(x, s, a) = 0 otherwise,

where x_T is any terminal state of the original MDP M. It is easy to see that the term (1/(1 − α)) E[(D^θ(x0) − ν)^+] appearing in the gradients of Eqs. 5-7 is the value function of the policy θ at state (x0, ν) in this augmented MDP. As a result, we have:

Gradient w.r.t. θ: It is easy to see that now this gradient (Eq. 5) is the gradient of the value function of the original MDP, ∇_θ V^θ(x0), plus λ times the gradient of the value function of the augmented MDP, ∇_θ V^θ(x0, ν), both at the initial states of these MDPs (with abuse of notation, we use V for the value function of both MDPs). Thus, using linear approximators u^⊤f(x, s) and v^⊤φ(x, s) for the value functions of the original and augmented MDPs, ∇_θ L(θ, ν, λ) can be estimated as ∇_θ log µ(a_k|x_k, s_k; θ) · (ε_k + λδ_k), where ε_k and δ_k are the TD-errors of these MDPs.

Gradient w.r.t. λ: Similar to the case for θ, it is easy to see that this gradient (Eq. 
7) is \u03bd \u2212 \u03b2 plus\nthe value function of the augmented MDP, V \u03b8(x0, \u03bd), and thus, can be estimated incrementally as\n\u2207\u03bbL(\u03b8, \u03bd, \u03bb) \u2248 \u03bd \u2212 \u03b2 + v(cid:62)\u03c6(x, s).\nSub-Gradient w.r.t. \u03bd: This sub-gradient (Eq. 6) is \u03bb times one plus the gradient w.r.t. \u03bd of the\nvalue function of the augmented MDP, \u2207\u03bdV \u03b8(x0, \u03bd), and thus, it can be estimated incrementally\n\nv(cid:62)(cid:2)\u03c6(x0,\u03bd+)\u2212\u03c6(x0,\u03bd\u2212)(cid:3)\n\nusing SPSA as \u03bb(cid:0)1 +\n\n(cid:1) .\n\n2\u2206\n\nAlgorithm 3 in Appendix B.3 of [13] contains the pseudo-code of the resulting algorithm.\nAlgorithm 2 Actor-Critic Algorithms for CVaR Optimization\n\nInput: Parameterized policy \u00b5(\u00b7|\u00b7; \u03b8) and value function feature vector \u03c6(\u00b7) (both over the augmented\nMDP \u00afM), con\ufb01dence level \u03b1, and loss tolerance \u03b2\nInitialization: policy parameters \u03b8 = \u03b80; VaR parameter \u03bd = \u03bd0; Lagrangian parameter \u03bb = \u03bb0; value\nfunction weight vector v = v0\n// (1) SPSA-based Algorithm:\nfor k = 0, 1, 2, . . . 
do\n\nDraw action ak \u223c \u00b5(\u00b7|xk, sk; \u03b8k);\nObserve next state (xk+1, sk+1) \u223c \u00afP (\u00b7|xk, sk, ak);\n\nTD Error:\nCritic Update:\n\n\u03b4k = \u00afC(xk, sk, ak) + \u03b3v\nvk+1 = vk + \u03b64(k)\u03b4k\u03c6(xk, sk)\n\nk \u03c6(xk+1, sk+1) \u2212 v\n(cid:62)\n\nv(cid:62)\n\n(cid:16)\n\n\u03bdk \u2212 \u03b63(k)\n\n(cid:32)\n(cid:16)\n(cid:16)\n\u03bbk + \u03b61(k)(cid:0)\u03bdk \u2212 \u03b2 +\n\n\u03b8k \u2212 \u03b62(k)\n1 \u2212 \u03b3\n\n\u03bbk +\n\nk\n\n\u03bd Update: \u03bdk+1 = \u0393\u03bd\n\n\u03b8 Update:\n\n\u03b8k+1 = \u0393\u03b8\n\n\u03bb Update: \u03bbk+1 = \u0393\u03bb\n\n(cid:62)\nk \u03c6(xk, sk)\n\nObserve cost \u00afC(xk, sk, ak) (with \u03bb = \u03bbk);\n\n// note that sk+1 = (sk \u2212 C(cid:0)xk, ak)(cid:1)/\u03b3\n(cid:17)(cid:33)\n(cid:1) \u2212 \u03c6(x0, \u03bdk \u2212 \u2206k)(cid:3)\n(cid:2)\u03c6(cid:0)x0, \u03bdk + \u2206k\n(cid:17)\n1{xk = xT}(\u2212sk)+(cid:1)(cid:17)\n\n2\u2206k\n\n1\n\n(1 \u2212 \u03b1)(1 \u2212 \u03b3)\n\n(16)\n(17)\n\n(18)\n\n(19)\n\n(20)\n\n\u2207\u03b8 log \u00b5(ak|xk, sk; \u03b8) \u00b7 \u03b4k\n\nthen set (xk+1, sk+1) = (x0, \u03bdk+1)\n\nif xk = xT (reach a terminal state),\n\nend for\n// (2) Semi Trajectory-based Algorithm:\nfor k = 0, 1, 2, . . . do\n\nif xk (cid:54)= xT then\n\nDraw action ak \u223c \u00b5(\u00b7|xk, sk; \u03b8k), observe cost \u00afC(xk, sk, ak) (with \u03bb = \u03bbk), and next state\n(xk+1, sk+1) \u223c \u00afP (\u00b7|xk, sk, ak); Update (\u03b4k, vk, \u03b8k, \u03bbk) using Eqs. 16, 17, 19, and 20\n\nelse\n\nUpdate (\u03b4k, vk, \u03b8k, \u03bbk) using Eqs. 
16, 17, 19, and 20; Update \u03bd as\n\u03bbk \u2212 \u03bbk\n1 \u2212 \u03b1\n\n\u03bd Update: \u03bdk+1 = \u0393\u03bd\n\n\u03bdk \u2212 \u03b63(k)\n\n(cid:18)\n\n(cid:16)\n\n1(cid:8)sT \u2264 0(cid:9)(cid:17)(cid:19)\n\n(21)\n\nSet (xk+1, sk+1) = (x0, \u03bdk+1)\n\nend if\nend for\nreturn policy and value function parameters \u03b8, \u03bd, \u03bb, v\n\n7\n\n\f6 Experimental Results\nWe consider an optimal stopping problem in which the state at each time step k \u2264 T consists of the\ncost ck and time k, i.e., x = (ck, k), where T is the stopping time. The agent (buyer) should decide\neither to accept the present cost or wait. If she accepts or when k = T , the system reaches a terminal\nstate and the cost ck is received, otherwise, she receives the cost ph and the new state is (ck+1, k+1),\nwhere ck+1 is fuck w.p. p and fdck w.p. 1 \u2212 p (fu > 1 and fd < 1 are constants). Moreover, there\nis a discounted factor \u03b3 \u2208 (0, 1) to account for the increase in the buyer\u2019s affordability. The problem\nhas been described in more details in Appendix C of [13]. Note that if we change cost to reward\nand minimization to maximization, this is exactly the American option pricing problem, a standard\ntestbed to evaluate risk-sensitive algorithms (e.g., [26]). Since the state space is continuous, \ufb01nding\nan exact solution via DP is infeasible, and thus, it requires approximation and sampling techniques.\nWe compare the performance of our risk-sensitive policy gradient Algorithm 1 (PG-CVaR) and two\nactor-critic Algorithms 2 (AC-CVaR-SPSA,AC-CVaR-Semi-Traj) with their risk-neutral counterparts\n(PG and AC) (see Appendix C of [13] for the details of these experiments). Figure 1 shows the\ndistribution of the discounted cumulative cost D\u03b8(x0) for the policy \u03b8 learned by each of these\nalgorithms. The results indicate that the risk-sensitive algorithms yield a higher expected loss, but\nless variance, compared to the risk-neutral methods. 
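To make the stopping dynamics above concrete, here is a minimal simulator sketch for this problem (the parameter values and the threshold policy are illustrative assumptions, not the paper's experimental settings, which are in Appendix C of [13]):

```python
import random

def simulate_episode(policy, c0=1.0, T=20, p=0.65, f_u=2.0, f_d=0.5,
                     p_h=0.1, gamma=0.95, rng=random):
    """Roll out one episode of the optimal stopping problem and return the
    discounted cumulative cost D(x0). `policy(c, k)` returns True to accept."""
    c, total = c0, 0.0
    for k in range(T):
        if policy(c, k):                # accept: pay the current cost, terminate
            return total + gamma ** k * c
        total += gamma ** k * p_h       # wait: pay the holding cost p_h
        c = f_u * c if rng.random() < p else f_d * c  # cost moves up or down
    return total + gamma ** T * c       # forced acceptance at k = T

# Hypothetical threshold policy: accept as soon as the cost drops below 0.8.
rng = random.Random(0)
D = simulate_episode(lambda c, k: c < 0.8, rng=rng)
```

Averaging such rollouts gives Monte Carlo estimates of the loss distribution of a fixed policy, which is how the histograms and statistics in this section can be read.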
More precisely, the loss distributions of the risk-sensitive algorithms have a lower right-tail than their risk-neutral counterparts. Table 1 summarizes the performance of these algorithms. The numbers reiterate what we concluded from Figure 1.

Figure 1: Loss distributions for the policies learned by the risk-sensitive and risk-neutral policy gradient and actor-critic algorithms. The two left figures correspond to the PG methods, and the two right figures correspond to the AC algorithms. In all cases, the loss tolerance equals β = 40.

                        E(Dθ(x0))   σ(Dθ(x0))   CVaR(Dθ(x0))
PG                         16.08       17.53         69.18
PG-CVaR                    19.75        7.06         25.75
AC                         16.96       32.09        122.61
AC-CVaR-SPSA               22.86        3.40         31.36
AC-CVaR-Semi-Traj.         23.01        4.98         34.81

Table 1: Performance comparison for the policies learned by the risk-sensitive and risk-neutral algorithms.

7 Conclusions and Future Work
We proposed novel policy gradient and actor-critic (AC) algorithms for CVaR optimization in MDPs. We provided proofs of convergence (in [13]) to locally risk-sensitive optimal policies for the proposed algorithms. Further, using an optimal stopping problem, we observed that our algorithms resulted in policies whose loss distributions have a lower right-tail compared to their risk-neutral counterparts. This is extremely important for a risk-averse decision-maker, especially if the right-tail contains catastrophic losses.
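Statistics like the CVaR column in Table 1 can be estimated from loss samples via the Rockafellar-Uryasev representation CVaRα(D) = minν { ν + E[(D − ν)+]/(1 − α) } [23], evaluated at the empirical α-quantile. A minimal sketch (the sample losses below are made up for illustration):

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """CVaR_alpha = E[D | D >= VaR_alpha], estimated from i.i.d. loss samples
    using the Rockafellar-Uryasev formula with nu set to the empirical
    alpha-quantile (the VaR estimate)."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)                       # VaR_alpha estimate
    return var + np.mean(np.maximum(losses - var, 0.0)) / (1.0 - alpha)

# Illustrative check on 10 equally likely losses: with alpha = 0.9,
# CVaR is driven by the largest ~10% of the losses.
samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
cvar = empirical_cvar(samples, alpha=0.9)
```

This is also why a mean-CVaR objective penalizes exactly the right-tail behavior highlighted in Figure 1: losses below the VaR estimate contribute nothing to the second term.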
Future work includes: 1) providing convergence proofs for our AC algorithms when the samples are generated by following the policy and not from its discounted visiting distribution, 2) using importance sampling methods [2, 27] to improve gradient estimates in the right-tail of the loss distribution (worst-case events that are observed with low probability) of the CVaR objective function, and 3) evaluating our algorithms in more challenging problems.

Acknowledgement
The authors would like to thank Professor Marco Pavone and Lucas Janson for their comments that helped us with some technical details in the proofs of the algorithms.

References
[1] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Journal of Mathematical Finance, 9(3):203-228, 1999.
[2] O. Bardou, N. Frikha, and G. Pagès. Computing VaR and CVaR using stochastic approximation and adaptive unconstrained importance sampling. Monte Carlo Methods and Applications, 15(3):173-210, 2009.
[3] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361-379, 2011.
[4] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[5] S. Bhatnagar. An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes. Systems & Control Letters, 59(12):760-766, 2010.
[6] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. In Proceedings of Advances in Neural Information Processing Systems 20, pages 105-112, 2008.
[7] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471-2482, 2009.
[8] S. Bhatnagar, H. Prasad, and L.A. Prashanth. Stochastic Recursive Algorithms for Optimization, volume 434. Springer, 2013.
[9] K. Boda and J. Filar. Time consistent dynamic risk measures. Mathematical Methods of Operations Research, 63(1):169-186, 2006.
[10] V. Borkar. A sensitivity formula for the risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44:339-346, 2001.
[11] V. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27:294-311, 2002.
[12] V. Borkar and R. Jain. Risk-constrained Markov decision processes. IEEE Transactions on Automatic Control, 2014.
[13] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Algorithms for CVaR optimization in MDPs. arXiv:1406.3339, 2014.
[14] J. Filar, L. Kallenberg, and H. Lee. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147-161, 1989.
[15] J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2-10, 1995.
[16] R. Howard and J. Matheson. Risk sensitive Markov decision processes. Management Science, 18(7):356-369, 1972.
[17] L.A. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Proceedings of Advances in Neural Information Processing Systems 26, pages 252-260, 2013.
[18] H. Markowitz. Portfolio Selection: Efficient Diversification of Investment. John Wiley and Sons, 1959.
[19] T. Morimura, M. Sugiyama, M. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, pages 799-806, 2010.
[20] J. Ott. A Markov Decision Model for a Surveillance Application and Risk-Sensitive Markov Decision Processes. PhD thesis, Karlsruhe Institute of Technology, 2010.
[21] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In Proceedings of the Sixteenth European Conference on Machine Learning, pages 280-291, 2005.
[22] M. Petrik and D. Subramanian. An approximate solution method for large risk-averse Markov decision processes. In Proceedings of the 28th International Conference on Uncertainty in Artificial Intelligence, 2012.
[23] R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 26:1443-1471, 2002.
[24] M. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, pages 794-802, 1982.
[25] J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332-341, 1992.
[26] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387-396, 2012.
[27] A. Tamar, Y. Glassner, and S. Mannor. Policy gradients beyond expectations: Conditional value-at-risk. arXiv:1404.3862v1, 2014.