{"title": "Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 1522, "page_last": 1530, "abstract": "In this paper we address the problem of decision making within a Markov decision process (MDP) framework where risk and modeling errors are taken into account. Our approach is to minimize a risk-sensitive conditional-value-at-risk (CVaR) objective, as opposed to a standard risk-neutral expectation. We refer to such problem as CVaR MDP. Our first contribution is to show that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget. This result, which is of independent interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust decision making. Our second contribution is to present a value-iteration algorithm for CVaR MDPs, and analyze its convergence rate. To our knowledge, this is the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach.", "full_text": "Risk-Sensitive and Robust Decision-Making:\n\na CVaR Optimization Approach\n\nYinlam Chow\n\nStanford University\n\nychow@stanford.edu\n\nShie Mannor\n\nTechnion\n\nshie@ee.technion.ac.il\n\nAviv Tamar\nUC Berkeley\n\navivt@berkeley.edu\n\nMarco Pavone\n\nStanford University\n\npavone@stanford.edu\n\nAbstract\n\nIn this paper we address the problem of decision making within a Markov de-\ncision process (MDP) framework where risk and modeling errors are taken into\naccount. Our approach is to minimize a risk-sensitive conditional-value-at-risk\n(CVaR) objective, as opposed to a standard risk-neutral expectation. We refer to\nsuch problem as CVaR MDP. 
Our first contribution is to show that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget. This result, which is of independent interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust decision making. Our second contribution is to present an approximate value-iteration algorithm for CVaR MDPs and analyze its convergence rate. To our knowledge, this is the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach.

1 Introduction

Decision making within the Markov decision process (MDP) framework typically involves the minimization of a risk-neutral performance objective, namely the expected total discounted cost [3]. This approach, while very popular, natural, and attractive from a computational standpoint, neither takes into account the variability of the cost (i.e., fluctuations around the mean), nor its sensitivity to modeling errors, which may significantly affect overall performance [12]. Risk-sensitive MDPs [9] address the first aspect by replacing the risk-neutral expectation with a risk-measure of the total discounted cost, such as variance, Value-at-Risk (VaR), or Conditional-VaR (CVaR).
Robust MDPs [15], on the other hand, address the second aspect by defining a set of plausible MDP parameters and optimizing decisions with respect to the expected cost under worst-case parameters.

In this work we consider risk-sensitive MDPs with a CVaR objective, referred to as CVaR MDPs. CVaR [1, 20] is a risk-measure that is rapidly gaining popularity in various engineering applications, e.g., finance, due to its favorable computational properties [1] and superior ability to safeguard a decision maker from the "outcomes that hurt the most" [22]. In this paper, by relating risk to robustness, we derive a novel result that further motivates the usage of a CVaR objective in a decision-making context. Specifically, we show that the CVaR of a discounted cost in an MDP is equivalent to the expected value of the same discounted cost in the presence of worst-case perturbations of the MDP parameters (specifically, transition probabilities), provided that such perturbations are within a certain error budget. This result suggests CVaR MDP as a method for decision making under both cost variability and model uncertainty, motivating it as a unified framework for planning under uncertainty.

Literature review: Risk-sensitive MDPs have been studied for over four decades, with earlier efforts focusing on exponential utility [9], mean-variance [24], and percentile risk criteria [7]. Recently, for the reasons explained above, several authors have investigated CVaR MDPs [20]. Specifically, in [4], the authors propose a dynamic programming algorithm for finite-horizon risk-constrained MDPs where risk is measured according to CVaR. The algorithm is proven to asymptotically converge to an optimal risk-constrained policy. However, the algorithm involves computing integrals over continuous variables (Algorithm 1 in [4]) and, in general, its implementation appears particularly difficult.
In [2], the authors investigate the structure of CVaR optimal policies and show that a Markov policy is optimal on an augmented state space, where the additional (continuous) state variable is represented by the running cost. In [8], the authors leverage this result to design an algorithm for CVaR MDPs that relies on discretizing occupation measures in the augmented-state MDP. This approach, however, involves solving a non-convex program via a sequence of linear-programming approximations, which can only be shown to converge asymptotically. A different approach is taken by [5], [19] and [25], which consider a finite-dimensional parameterization of control policies, and show that a CVaR MDP can be optimized to a local optimum using stochastic gradient descent (policy gradient). A recent result by Pflug and Pichler [17] showed that CVaR MDPs admit a dynamic programming formulation by using a state-augmentation procedure different from the one in [2]. The augmented state is also continuous, making the design of a solution algorithm challenging.

Contributions: The contribution of this paper is twofold. First, as discussed above, we provide a novel interpretation for CVaR MDPs in terms of robustness to modeling errors. This result is of independent interest and further motivates the usage of CVaR MDPs for decision making under uncertainty. Second, we provide a new optimization algorithm for CVaR MDPs, which leverages the state-augmentation procedure introduced by Pflug and Pichler [17]. We overcome the aforementioned computational challenges (due to the continuous augmented state) by designing an algorithm that merges approximate value iteration [3] with linear interpolation. Remarkably, we are able to provide explicit error bounds and convergence rates based on contraction-style arguments.
In contrast to the algorithms in [4, 8, 5, 25], given the explicit MDP model our approach leads to finite-time error guarantees with respect to the globally optimal policy. In addition, our algorithm is significantly simpler than previous methods, and calculates the optimal policy for all CVaR confidence intervals and initial states simultaneously. The practicality of our approach is demonstrated in numerical experiments involving planning a path on a grid with thousands of states. To the best of our knowledge, this is the first algorithm to approximate globally-optimal policies for non-trivial CVaR MDPs whose error depends on the resolution of interpolation.

Organization: This paper is structured as follows. In Section 2 we provide background on CVaR and MDPs, state the problem we wish to solve (i.e., CVaR MDPs), and motivate the CVaR MDP formulation by establishing a novel relation between CVaR and model perturbations. Section 3 provides the basis for our solution algorithm, based on a Bellman-style equation for the CVaR. Then, in Section 4 we present our algorithm and correctness analysis. In Section 5 we evaluate our approach via numerical experiments. Finally, in Section 6, we draw some conclusions and discuss directions for future work.

2 Preliminaries, Problem Formulation, and Motivation

2.1 Conditional Value-at-Risk

Let Z be a bounded-mean random variable, i.e., E[|Z|] < ∞, on a probability space (Ω, F, P), with cumulative distribution function F(z) = P(Z ≤ z). In this paper we interpret Z as a cost. The value-at-risk (VaR) at confidence level α ∈ (0, 1) is the 1 − α quantile of Z, i.e., VaR_α(Z) = min{z | F(z) ≥ 1 − α}.
The conditional value-at-risk (CVaR) at confidence level α ∈ (0, 1) is defined as [20]:

CVaR_α(Z) = min_{w∈R} { w + (1/α) E[(Z − w)^+] },    (1)

where (x)^+ = max(x, 0) represents the positive part of x. If there is no probability atom at VaR_α(Z), it is well known from Theorem 6.2 in [23] that CVaR_α(Z) = E[Z | Z ≥ VaR_α(Z)]. Therefore, CVaR_α(Z) may be interpreted as the expected value of Z, conditioned on the α-portion of the tail distribution. It is well known that CVaR_α(Z) is decreasing in α, CVaR_1(Z) equals E(Z), and CVaR_α(Z) tends to max(Z) as α ↓ 0. During the last decade, the CVaR risk-measure has gained popularity in financial applications, among others. It is especially useful for controlling rare, but potentially disastrous, events, which occur above the 1 − α quantile and are neglected by the VaR [22]. Furthermore, CVaR enjoys desirable axiomatic properties, such as coherence [1]. We refer to [26] for further motivation about CVaR and a comparison with other risk measures such as VaR.

A useful property of CVaR, which we exploit in this paper, is its alternative dual representation [1]:

CVaR_α(Z) = max_{ξ∈U_CVaR(α,P)} E_ξ[Z],    (2)

where E_ξ[Z] denotes the ξ-weighted expectation of Z, and the risk envelope U_CVaR is given by U_CVaR(α, P) = { ξ : ξ(ω) ∈ [0, 1/α], ∫_{ω∈Ω} ξ(ω)P(ω)dω = 1 }. Thus, the CVaR of a random variable Z may be interpreted as the worst-case expectation of Z under a perturbed distribution ξP. In this paper, we are interested in the CVaR of the total discounted cost in a sequential decision-making setting, as discussed next.

2.2 Markov Decision Processes

An MDP is a tuple M = (X, A, C, P, x_0, γ), where X and A are finite state and action spaces; C(x, a) ∈ [−C_max, C_max] is a bounded deterministic cost; P(·|x, a) is the transition probability distribution; γ ∈ [0, 1) is the discounting factor; and x_0 is the initial state. (Our results easily generalize to random initial states and random costs.)

Let the space of admissible histories up to time t be H_t = H_{t−1} × A × X for t ≥ 1, and H_0 = X. A generic element h_t ∈ H_t is of the form h_t = (x_0, a_0, ..., x_{t−1}, a_{t−1}, x_t). Let Π_{H,t} be the set of all history-dependent policies with the property that at each time t the randomized control action is a function of h_t. In other words, Π_{H,t} := { μ_0 : H_0 → P(A), μ_1 : H_1 → P(A), ..., μ_t : H_t → P(A) | μ_j(h_j) ∈ P(A) for all h_j ∈ H_j, 1 ≤ j ≤ t }. We also let Π_H = lim_{t→∞} Π_{H,t} be the set of all history-dependent policies.

2.3 Problem Formulation

Let C(x_t, a_t) denote the stage-wise costs observed along a state/control trajectory in the MDP model, and let C_{0,T} = Σ_{t=0}^{T} γ^t C(x_t, a_t) denote the total discounted cost up to time T. The risk-sensitive discounted-cost problem we wish to address is as follows:

min_{μ∈Π_H} CVaR_α( lim_{T→∞} C_{0,T} | x_0, μ ),    (3)

where μ = {μ_0, μ_1, ...} is the policy sequence with actions a_t = μ_t(h_t) for t ∈ {0, 1, ...}. We refer to problem (3) as a CVaR MDP. (One may also consider a related formulation combining mean and CVaR; the details are presented in the supplementary material.)

The problem formulation in (3) directly addresses the aspect of risk sensitivity, as demonstrated by the numerous applications of CVaR optimization in finance (see, e.g., [21, 11, 6]) and the recent approaches for CVaR optimization in MDPs [4, 8, 5, 25]. In the following, we show a new result providing additional motivation for CVaR MDPs, from the point of view of robustness to modeling errors.

2.4 Motivation - Robustness to Modeling Errors

We show a new result relating the CVaR objective in (3) to the expected discounted cost in the presence of worst-case perturbations of the MDP parameters, where the perturbations are budgeted according to the "number of things that can go wrong". Thus, by minimizing CVaR, the decision maker also guarantees robustness of the policy.

Consider a trajectory (x_0, a_0, ..., x_T) in a finite-horizon MDP problem with transitions P_t(x_t|x_{t−1}, a_{t−1}). We explicitly denote the time index of the transition matrices for reasons that will become clear shortly. The total probability of the trajectory is P(x_0, a_0, ..., x_T) = P_0(x_0)P_1(x_1|x_0, a_0)···P_T(x_T|x_{T−1}, a_{T−1}), and we let C_{0,T}(x_0, a_0, ..., x_T) denote its discounted cost, as defined above.

We consider an adversarial setting, where an adversary is allowed to change the transition probabilities at each stage, under some budget constraints. We will show that, for a specific budget and perturbation structure, the expected cost under the worst-case perturbation is equivalent to the CVaR of the cost.
Thus, we shall establish that, in this perspective, being risk sensitive is equivalent to being robust against model perturbations.

For each stage 1 ≤ t ≤ T, consider a perturbed transition matrix P̂_t = P_t ∘ δ_t, where δ_t ∈ R^{X×A×X} is a multiplicative probability perturbation and ∘ is the Hadamard product, under the condition that P̂_t is a stochastic matrix. Let Δ_t denote the set of perturbation matrices that satisfy this condition, and let Δ = Δ_1 × ··· × Δ_T denote the set of all possible perturbations to the trajectory distribution.

We now impose a budget constraint on the perturbations as follows. For some budget η ≥ 1, we consider the constraint

δ_1(x_1|x_0, a_0) δ_2(x_2|x_1, a_1) ··· δ_T(x_T|x_{T−1}, a_{T−1}) ≤ η,  ∀x_0, ..., x_T ∈ X, ∀a_0, ..., a_{T−1} ∈ A.    (4)

Essentially, the product in Eq. (4) states that with a small budget the worst cannot happen at each time. Instead, the perturbation budget has to be split (multiplicatively) along the trajectory. We note that Eq. (4) is in fact a constraint on the perturbation matrices, and we denote by Δ_η ⊂ Δ the set of perturbations that satisfy this constraint with budget η. The following result shows an equivalence between the CVaR and the worst-case expected loss.

Proposition 1 (Interpretation of CVaR as a Robustness Measure) It holds that

CVaR_{1/η}(C_{0,T}(x_0, a_0, ..., x_T)) = sup_{(δ_1,...,δ_T)∈Δ_η} E_{P̂}[C_{0,T}(x_0, a_0, ..., x_T)],    (5)

where E_{P̂}[·] denotes expectation with respect to a Markov chain with transitions P̂_t.

The proof of Proposition 1 is in the supplementary material. It is instructive to compare Proposition 1 with the dual representation of CVaR in (2): both results convert the CVaR risk into a robustness measure.
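To make the equivalent characterizations of CVaR concrete, the following standalone sketch (our own illustration, not part of the paper; function names are ours) checks numerically, on a small discrete cost distribution, that three forms agree: the α-tail expectation, the Rockafellar-Uryasev minimization in Eq. (1), and the dual reweighting in Eq. (2), where the adversary greedily places the maximal weight 1/α on the worst outcomes.

```python
import numpy as np

def cvar_tail(z, p, alpha):
    """CVaR_alpha as the alpha-tail expectation of the cost Z (larger = worse)."""
    order = np.argsort(z)[::-1]                  # sort outcomes from worst to best
    z, p = z[order], p[order]
    # probability mass each outcome contributes to the alpha-tail
    w = np.minimum(p, np.maximum(alpha - np.concatenate(([0.0], np.cumsum(p)[:-1])), 0.0))
    return float(np.dot(w, z) / alpha)

def cvar_rockafellar_uryasev(z, p, alpha):
    """CVaR_alpha via Eq. (1): min_w  w + E[(Z - w)+]/alpha.
    For a discrete Z the objective is piecewise linear in w with breakpoints
    at the atoms, so minimizing over the support is exact."""
    candidates = [w + np.dot(p, np.maximum(z - w, 0.0)) / alpha for w in z]
    return float(min(candidates))

def cvar_dual(z, p, alpha):
    """CVaR_alpha via Eq. (2): max E_xi[Z] with xi in [0, 1/alpha], sum xi*p = 1.
    The greedy optimum saturates xi = 1/alpha on the worst outcomes (p > 0 assumed)."""
    order = np.argsort(z)[::-1]
    xi = np.zeros_like(p)
    budget = 1.0                                 # total xi-weighted probability mass
    for i in order:
        m = min(p[i] / alpha, budget)            # contribution xi(i)*p(i), capped
        xi[i] = m / p[i]
        budget -= m
        if budget <= 0:
            break
    return float(np.dot(xi * p, z))

z = np.array([0.0, 1.0, 5.0, 20.0])              # costs
p = np.array([0.4, 0.4, 0.15, 0.05])             # probabilities
for alpha in (0.05, 0.2, 1.0):
    a = cvar_tail(z, p, alpha)
    b = cvar_rockafellar_uryasev(z, p, alpha)
    c = cvar_dual(z, p, alpha)
    assert abs(a - b) < 1e-9 and abs(a - c) < 1e-9
```

At α = 1 all three reduce to E[Z], and as α shrinks below the probability of the worst outcome they converge to max(Z), matching the limits stated after Eq. (1).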
Note, in particular, that the perturbation budget in Proposition 1 has a temporal structure, which constrains the adversary from choosing the worst perturbation at each time step.

Remark 1 An equivalence between robustness and risk-sensitivity was previously suggested by Osogami [16]. In that study, the iterated (dynamic) coherent risk was shown to be equivalent to a robust MDP [10] with a rectangular uncertainty set. The iterated risk (and, correspondingly, the rectangular uncertainty set) is very conservative [27], in the sense that the worst can happen at each time step. In contrast, the perturbations considered here are much less conservative. In general, solving robust MDPs without the rectangularity assumption is NP-hard. Nevertheless, Mannor et al. [13] showed that, for cases where the number of perturbations to the parameters along a trajectory is upper bounded (budget-constrained perturbation), the corresponding robust MDP problem is tractable. Analogously to the constraint set (1) in [13], the perturbation set in Proposition 1 limits the total number of log-perturbations along a trajectory. Accordingly, we shall later see that optimizing problem (3) with perturbation structure (4) is indeed also tractable.

The next section provides the fundamental theoretical ideas behind our approach to the solution of (3).

3 Bellman Equation for CVaR

In this section, by leveraging a recent result from [17], we present a dynamic programming (DP) formulation for the CVaR MDP problem in (3). As we shall see, the value function in this formulation depends on both the state and the CVaR confidence level α. We then establish important properties of this DP formulation, which will later enable us to derive an efficient DP-based approximate solution algorithm and provide correctness guarantees on the approximation error.
All proofs are presented in the supplementary material.

Our starting point is a recursive decomposition of CVaR, whose proof is detailed in Theorem 10 of [17].

Theorem 2 (CVaR Decomposition, Theorem 21 in [17]) For any t ≥ 0, denote by Z = (Z_{t+1}, Z_{t+2}, ...) the cost sequence from time t + 1 onwards. The conditional CVaR under policy μ, i.e., CVaR_α(Z | h_t, μ), obeys the following decomposition:

CVaR_α(Z | h_t, μ) = max_{ξ∈U_CVaR(α, P(·|x_t, a_t))} E[ ξ(x_{t+1}) · CVaR_{αξ(x_{t+1})}(Z | h_{t+1}, μ) | h_t, μ ],

where a_t is the action induced by policy μ_t(h_t), and the expectation is with respect to x_{t+1}.

Theorem 2 concerns a fixed policy μ; we now extend it to a general DP formulation. Note that in the recursive decomposition in Theorem 2 the right-hand side involves CVaR terms with confidence levels different from that on the left-hand side. Accordingly, we augment the state space X with an additional continuous state Y = (0, 1], which corresponds to the confidence level. For any x ∈ X and y ∈ Y, the value function V(x, y) for the augmented state (x, y) is defined as:

V(x, y) = min_{μ∈Π_H} CVaR_y( lim_{T→∞} C_{0,T} | x_0 = x, μ ).

Similar to standard DP, it is convenient to work with operators defined on the space of value functions [3]. In our case, Theorem 2 leads to the following definition of the CVaR Bellman operator T:

T[V](x, y) = min_{a∈A} [ C(x, a) + γ max_{ξ∈U_CVaR(y, P(·|x,a))} Σ_{x'∈X} ξ(x') V(x', yξ(x')) P(x'|x, a) ].    (6)

We now establish several useful properties of the Bellman operator T[V].

Lemma 3 (Properties of CVaR Bellman Operator) The Bellman operator T[V] has the following properties:

1. (Contraction.) ‖T[V_1] − T[V_2]‖_∞ ≤ γ‖V_1 − V_2‖_∞, where ‖f‖_∞ = sup_{x∈X, y∈Y} |f(x, y)|.
2. (Concavity preservation in y.) For any x ∈ X, suppose yV(x, y) is concave in y ∈ Y. Then the maximization problem in (6) is concave. Furthermore, yT[V](x, y) is concave in y.

The first property in Lemma 3 is similar to standard DP [3], and is instrumental to the design of a converging value-iteration approach. The second property is nonstandard and specific to our approach. It will be used to show that the computation of value-iteration updates involves concave, and therefore tractable, optimization problems. Furthermore, it will be used to show that a linear interpolation of V(x, y) in the augmented state y has a bounded error.

Equipped with the results in Theorem 2 and Lemma 3, we can now show that the fixed-point solution of T[V](x, y) = V(x, y) is unique, and equals the solution of the CVaR MDP problem (3) with x_0 = x and α = y.

Theorem 4 (Optimality Condition) For any x ∈ X and y ∈ (0, 1], the solution to T[V](x, y) = V(x, y) is unique, and equals V*(x, y) = min_{μ∈Π_H} CVaR_y(lim_{T→∞} C_{0,T} | x_0 = x, μ).

Next, we show that the optimal value of the CVaR MDP problem (3) can be attained by a stationary Markov policy, defined as a greedy policy with respect to the value function V*(x, y). Thus, while the original problem is defined over the intractable space of history-dependent policies, a stationary Markov policy (over the augmented state space) is optimal, and can be readily derived from V*(x, y). Furthermore, an optimal history-dependent policy can be readily obtained from an (augmented) optimal Markov policy according to the following theorem.

Theorem 5 (Optimal Policies) Let π*_H = {μ_0, μ_1, ...} ∈ Π_H be a history-dependent policy recursively defined as:

μ_k(h_k) = u*(x_k, y_k), ∀k ≥ 0,    (7)

with initial conditions x_0 and y_0 = α, and state transitions

x_k ∼ P(· | x_{k−1}, u*(x_{k−1}, y_{k−1})),  y_k = y_{k−1} ξ*_{x_{k−1}, y_{k−1}, u*}(x_k), ∀k ≥ 1,    (8)

where the stationary Markovian policy u*(x, y) and risk factor ξ*_{x,y,u*}(·) are the solution to the min-max optimization problem in the CVaR Bellman operator T[V*](x, y). Then, π*_H is an optimal policy for problem (3) with initial state x_0 and CVaR confidence level α.

Theorems 4 and 5 suggest that a value-iteration DP method [3] can be used to solve the CVaR MDP problem (3). Let an initial value-function guess V_0 : X × Y → R be chosen arbitrarily. Value iteration proceeds recursively as follows:

V_{k+1}(x, y) = T[V_k](x, y), ∀(x, y) ∈ X × Y, k ∈ {0, 1, ...}.    (9)

Specifically, by combining the contraction property in Lemma 3 and the uniqueness of fixed-point solutions from Theorem 4, one concludes that lim_{k→∞} V_k(x, y) = V*(x, y). By selecting x = x_0 and y = α, one immediately obtains V*(x_0, α) = min_{μ∈Π_H} CVaR_α(lim_{T→∞} C_{0,T} | x_0, μ). Furthermore, an optimal policy may be derived from V*(x, y) according to the policy construction procedure in Theorem 5.

Unfortunately, while value iteration is conceptually appealing, its direct implementation in our setting is generally impractical since, e.g., the state y is continuous. In the following, we pursue an approximation to the value-iteration algorithm (9), based on a linear interpolation scheme for y.

Algorithm 1 CVaR Value Iteration with Linear Interpolation
1: Given:
   • N(x) interpolation points Y(x) = {y_1, ..., y_{N(x)}} ∈ [0, 1]^{N(x)} for every x ∈ X, with y_i < y_{i+1}, y_1 = 0, and y_{N(x)} = 1.
   • Initial value function V_0(x, y) that satisfies Assumption 1.
2: For t = 1, 2, ...
   • For each x ∈ X and each y_i ∈ Y(x), update the value function estimate as follows:
     V_t(x, y_i) = T_I[V_{t−1}](x, y_i).
3: Set the converged value-iteration estimate as V̂*(x, y_i), for any x ∈ X and y_i ∈ Y(x).

4 Value Iteration with Linear Interpolation

In this section we present an approximate DP algorithm for solving CVaR MDPs, based on the theoretical results of Section 3. The value-iteration algorithm in Eq. (9) presents two main implementation challenges. The first is due to the fact that the augmented state y is continuous. We handle this challenge by using interpolation, and exploit the concavity of yV(x, y) to bound the error introduced by this procedure. The second challenge stems from the fact that applying T involves maximizing over ξ. Our strategy is to exploit the concavity of the maximization problem to guarantee that such optimization can indeed be performed effectively.

As discussed, our approach relies on the fact that the Bellman operator T preserves concavity, as established in Lemma 3. Accordingly, we require the following assumption on the initial guess V_0(x, y).

Assumption 1 The guess for the initial value function V_0(x, y) satisfies the following properties: 1) yV_0(x, y) is concave in y ∈ Y, and 2) V_0(x, y) is continuous in y ∈ Y for any x ∈ X.

Assumption 1 may easily be satisfied, for example, by choosing V_0(x, y) = CVaR_y(Z | x_0 = x), where Z is any arbitrary bounded random variable. As stated earlier, a key difficulty in applying value iteration (9) is that, for each state x ∈ X, the Bellman operator has to be calculated for each y ∈ Y, and Y is continuous.
As an approximation, we propose to calculate the Bellman operator only for a finite set of values y, and to interpolate the value function in between such interpolation points.

Formally, let N(x) denote the number of interpolation points. For every x ∈ X, denote by Y(x) = {y_1, ..., y_{N(x)}} ∈ [0, 1]^{N(x)} the set of interpolation points. We denote by I_x[V](y) the linear interpolation of the function yV(x, y) on these points, i.e.,

I_x[V](y) = y_i V(x, y_i) + ( ( y_{i+1} V(x, y_{i+1}) − y_i V(x, y_i) ) / ( y_{i+1} − y_i ) ) (y − y_i),

where y_i = max{y' ∈ Y(x) : y' ≤ y} and y_{i+1} is the closest interpolation point such that y ∈ [y_i, y_{i+1}], i.e., y_{i+1} = min{y' ∈ Y(x) : y' ≥ y}. The interpolation of yV(x, y) instead of V(x, y) is key to our approach. The motivation is twofold: first, it can be shown [20] that for a discrete random variable Z, yCVaR_y(Z) is piecewise linear in y. Second, one can show that the Lipschitzness of yV(x, y) is preserved during value iteration, and exploit this fact to bound the linear interpolation error.

We now define the interpolated Bellman operator T_I as follows:

T_I[V](x, y) = min_{a∈A} [ C(x, a) + γ max_{ξ∈U_CVaR(y, P(·|x,a))} Σ_{x'∈X} ( I_{x'}[V](yξ(x')) / y ) P(x'|x, a) ].    (10)

Remark 2 Notice that by L'Hôpital's rule one has lim_{y→0} I_x[V](yξ(x))/y = V(x, 0)ξ(x). This implies that at y = 0 the interpolated Bellman operator is equivalent to the original Bellman operator, i.e., T[V](x, 0) = min_{a∈A} { C(x, a) + γ max_{x'∈X : P(x'|x,a)>0} V(x', 0) } = T_I[V](x, 0).

Algorithm 1 presents CVaR value iteration with linear interpolation.
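Because yV(x, y) is concave, the inner maximization over ξ in Eq. (10) is a concave piecewise-linear program and can be written as a small LP over the hypograph of the interpolation I_{x'}[V] (our experiments solve the analogous LPs with CPLEX; see Section 5). The following standalone sketch (our own illustration, not code from the paper; it assumes `scipy` is available and the function name `cvar_bellman_backup` is ours) computes one backup T_I[V](x, y) for a single state-action pair.

```python
import numpy as np
from scipy.optimize import linprog

def cvar_bellman_backup(cost_a, P_a, V, y_grid, y, gamma=0.95):
    """One interpolated CVaR Bellman backup for a single action a:
        C(x,a) + gamma * max_xi  sum_x' (I_x'[V](y*xi(x')) / y) * P(x'|x,a),
    subject to xi(x') in [0, 1/y] and sum_x' xi(x') P(x'|x,a) = 1.
    Since z -> I_x'[V](z) is concave piecewise linear (Assumption 1), it equals
    the pointwise minimum of its extended segment lines, so the maximization is
    an LP over hypograph variables t_x' <= I_x'[V](y*xi(x')).
    V has shape (n_states, n_grid); y_grid holds the interpolation points."""
    n = len(P_a)
    g = np.asarray(y_grid, dtype=float)
    zV = g[None, :] * V                                   # y_i * V(x', y_i)
    slopes = np.diff(zV, axis=1) / np.diff(g)[None, :]    # one slope per interval
    # LP variables: [xi_1..xi_n, t_1..t_n]; minimize -sum_x' t_x' P(x') / y.
    c = np.concatenate([np.zeros(n), -np.asarray(P_a) / y])
    A_ub, b_ub = [], []
    for j in range(n):
        for i in range(len(g) - 1):
            # hypograph: t_j <= y_i V_i + slope_i * (y*xi_j - y_i)
            row = np.zeros(2 * n)
            row[n + j] = 1.0
            row[j] = -slopes[j, i] * y
            A_ub.append(row)
            b_ub.append(zV[j, i] - slopes[j, i] * g[i])
    A_eq = np.concatenate([np.asarray(P_a), np.zeros(n)])[None, :]
    bounds = [(0.0, 1.0 / y)] * n + [(None, None)] * n    # xi box, t free
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return cost_a + gamma * (-res.fun)
```

As a sanity check, for a value function that is constant in y the backup reduces to C(x, a) + γ·CVaR_y of the successor values under P(·|x, a), which can be verified against a direct tail computation. The full Algorithm 1 would take the minimum of such backups over actions at every (x, y_i) pair.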
The only difference between this algorithm and standard value iteration (9) is the linear interpolation procedure described above. In the following, we show that Algorithm 1 converges, and we bound the error due to interpolation. We begin by showing that the useful properties established in Lemma 3 for the Bellman operator T extend to the interpolated Bellman operator T_I.

Lemma 6 (Properties of Interpolated Bellman Operator) T_I[V] has the same properties as T[V] in Lemma 3, namely 1) contraction and 2) concavity preservation.

Lemma 6 implies several important consequences for Algorithm 1. The first one is that the maximization problem in (10) is concave, and thus may be solved efficiently at each step. This guarantees that the algorithm is tractable. Second, the contraction property in Lemma 6 guarantees that Algorithm 1 converges, i.e., there exists a value function V̂* ∈ R^{|X|×|Y|} such that lim_{n→∞} T_I^n[V_0](x, y_i) = V̂*(x, y_i). In addition, the convergence rate is geometric and equals γ.

The following theorem provides an error bound between approximate value iteration and exact value iteration for problem (3) in terms of the interpolation resolution.

Theorem 7 (Convergence and Error Bound) Suppose the initial value function V_0(x, y) satisfies Assumption 1 and let ε > 0 be an error tolerance parameter. For any state x ∈ X and step t ≥ 0, choose y_2 > 0 such that V_t(x, y_2) − V_t(x, 0) ≥ −ε, and update the interpolation points according to the logarithmic rule y_{i+1} = θy_i, ∀i ≥ 2, with uniform constant θ ≥ 1. Then, Algorithm 1 has the following error bound:

0 ≥ V̂*(x_0, α) − min_{μ∈Π_H} CVaR_α( lim_{T→∞} C_{0,T} | x_0, μ ) ≥ −( γ / (1 − γ) ) O((θ − 1) + ε),

and the following finite-time convergence error bound:

| T_I^n[V_0](x_0, α) − min_{μ∈Π_H} CVaR_α( lim_{T→∞} C_{0,T} | x_0, μ ) | ≤ ( O((θ − 1) + ε) + O(γ^n) ) / (1 − γ).

Theorem 7 shows that 1) the interpolation-based value function is a conservative estimate of the optimal solution to problem (3); 2) the interpolation procedure is consistent, i.e., when the number of interpolation points is arbitrarily large (specifically, ε → 0 and y_{i+1}/y_i → 1), the approximation error tends to zero; and 3) the approximation error bound is O((θ − 1) + ε), where log θ is the log-difference of the interpolation points, i.e., log θ = log y_{i+1} − log y_i, ∀i.

For a pre-specified ε, the condition V_t(x, y_2) − V_t(x, 0) ≥ −ε may be satisfied by a simple adaptive procedure for selecting the interpolation points Y(x). At each iteration t > 0, after calculating V_t(x, y_i) in Algorithm 1, at each state x in which the condition does not hold, add a new interpolation point y'_2 = εy_2 / |V_t(x, y_2) − V_t(x, 0)|, and additional points between y'_2 and y_2 such that the condition log θ ≥ log y_{i+1} − log y_i is maintained. Since all the additional points belong to the segment [0, y_2], the linearly interpolated V_t(x, y_i) remains unchanged, and Algorithm 1 proceeds as is.
For bounded costs and ε > 0, the number of additional points required is bounded.

The full proof of Theorem 7 is detailed in the supplementary material; we highlight the main ideas and challenges involved. In the first part of the proof we bound, for all t > 0, the Lipschitz constant of yV_t(x, y) in y. The key to this result is to show that the Bellman operator T preserves the Lipschitz property of yV_t(x, y). Using the Lipschitz bound and the concavity of yV_t(x, y), we then bound the error I_x[V_t](y)/y − V_t(x, y) for all y. The condition on y_2 is required for this bound to hold when y → 0. Finally, we use this result to bound ‖T_I[V_t](x, y) − T[V_t](x, y)‖_∞. The results of Theorem 7 follow from contraction arguments, similar to approximate dynamic programming [3].

5 Experiments

We validate Algorithm 1 on a rectangular grid world, where states represent grid points on a 2D terrain map. An agent (e.g., a robotic vehicle) starts in a safe region and its objective is to travel to a given destination. At each time step the agent can move to any of its four neighboring states. Due to sensing and control noise, however, with probability δ a move to a random neighboring state occurs. The stage-wise cost of each move until reaching the destination is 1, to account for fuel usage. In between the starting point and the destination there are a number of obstacles that the agent should avoid. Hitting an obstacle costs M ≫ 1 and terminates the mission. The objective is to compute a safe (i.e., obstacle-free) path that is fuel efficient.

For our experiments, we choose a 64 × 53 grid-world (see Figure 1), for a total of 3,312 states. The destination is at position (60, 2), and there are 80 obstacles plotted in yellow. By leveraging Theorem 7, we use 21 log-spaced interpolation points for Algorithm 1 in order to achieve a small value-function error.
We choose δ = 0.05 and a discount factor γ = 0.95, for an effective horizon of 200 steps. Furthermore, we set the penalty cost to M = 2/(1 − γ); this choice trades off a high penalty for collisions against computational complexity (which increases as M increases). For the interpolation parameters discussed in Theorem 7, we set ε = 0.1 and θ = 2.067 (in order to have 21 logarithmically distributed grid points for the CVaR confidence parameter in [0, 1]).

Figure 1: Grid-world simulation. The left three plots show the value functions and corresponding paths for different CVaR confidence levels. The rightmost plot shows a cost histogram (for 400 Monte Carlo trials) for a risk-neutral policy and a CVaR policy with confidence level α = 0.11.

In Figure 1 we plot the value function V(x, y) for three different values of the CVaR confidence parameter α, and the corresponding paths starting from the initial position (60, 50). The first three plots in Figure 1 show that, by decreasing the confidence parameter α, the average travel distance (and hence fuel consumption) slightly increases, but the collision probability decreases, as expected.
We next discuss robustness to modeling errors. We conducted simulations in which, with probability 0.5, each obstacle position is perturbed in a random direction to one of the neighboring grid cells. This emulates, for example, measurement errors in the terrain map. We then trained both the risk-averse (α = 0.11) and risk-neutral (α = 1) policies on the nominal (i.e., unperturbed) terrain map, and evaluated them on 400 perturbed scenarios (20 perturbed maps with 20 Monte Carlo evaluations each). While the risk-neutral policy finds a shorter route (with average cost equal to 18.137 on successful runs), it is vulnerable to perturbations and fails more often (with over 120 failed runs).
In contrast, the risk-averse policy chooses slightly longer routes (with average cost equal to 18.878 on successful runs), but is much more robust to model perturbations (with only 5 failed runs).
For the computation in Algorithm 1, we represented the concave piecewise-linear maximization problem in (10) as a linear program, and concatenated several problems to reduce the repeated overhead stemming from the initialization of the CPLEX linear programming solver. This resulted in a computation time on the order of two hours. We believe there is ample room for improvement, for example by leveraging parallelization and sampling-based methods. Overall, we believe our proposed approach is currently the most practical method available for solving CVaR MDPs (as a comparison, the recently proposed method in [8] involves infinite-dimensional optimization). The MATLAB code used for the experiments is provided in the supplementary material.

6 Conclusion
In this paper we presented an algorithm for CVaR MDPs, based on approximate value iteration on an augmented state space. We established convergence of our algorithm, and derived finite-time error bounds. These bounds can be used to stop the algorithm once a desired error threshold is reached.
In addition, we uncovered an interesting relationship between the CVaR of the total cost and the worst-case expected cost under adversarial model perturbations. In this formulation the perturbations are correlated in time, leading to a robustness framework significantly less conservative than the popular robust-MDP framework, where the uncertainty is temporally independent.
Collectively, our work suggests CVaR MDPs as a unifying and practical framework for computing control policies that are robust with respect to both cost stochasticity and model perturbations. Future work should address extensions to large state spaces.
We conjecture that a sampling-based approximate DP approach [3] should be feasible since, as proven in this paper, the CVaR Bellman equation is contracting (as required by approximate DP methods).

Acknowledgement

The authors would like to thank Mohammad Ghavamzadeh for helpful comments on the technical details, and Daniel Vainsencher for practical optimization advice. Y-L. Chow and M. Pavone are partially supported by the Croucher Foundation doctoral scholarship and the Office of Naval Research, Science of Autonomy Program, under Contract N00014-15-1-2673. Funding for Shie Mannor and Aviv Tamar was partially provided by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 306638 (SUPREL).

References
[1] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
[2] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
[3] D. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 4th edition, 2012.
[4] V. Borkar and R. Jain. Risk-constrained Markov decision processes. IEEE Transactions on Automatic Control, 59(9):2574–2579, 2014.
[5] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems 27, pages 3509–3517, 2014.
[6] K. Dowd. Measuring Market Risk. John Wiley & Sons, 2007.
[7] J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.
[8] W. Haskell and R. Jain. A convex analytic approach to risk-aware Markov decision processes. SIAM Journal on Control and Optimization, 2014.
[9] R. A. Howard and J. E. Matheson. Risk-sensitive Markov decision processes.
Management Science, 18(7):356–369, 1972.
[10] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
[11] G. Iyengar and A. Ma. Fast gradient descent method for mean-CVaR optimization. Annals of Operations Research, 205(1):203–212, 2013.
[12] S. Mannor, D. Simester, P. Sun, and J. Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
[13] S. Mannor, O. Mebel, and H. Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning, pages 385–392, 2012.
[14] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
[15] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
[16] T. Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.
[17] G. Pflug and A. Pichler. Time consistent decisions and temporal decomposition of coherent risk functionals. Optimization Online, 2015.
[18] M. Phillips. Interpolation and Approximation by Polynomials, volume 14. Springer Science & Business Media, 2003.
[19] L. Prashanth. Policy gradients for CVaR-constrained MDPs. In Algorithmic Learning Theory, pages 155–169. Springer, 2014.
[20] R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[21] R. Rockafellar, S. Uryasev, and M. Zabarankin. Master funds in portfolio analysis with general deviation measures. Journal of Banking & Finance, 30(2):743–778, 2006.
[22] G. Serraino and S. Uryasev. Conditional value-at-risk (CVaR). In Encyclopedia of Operations Research and Management Science, pages 258–266.
Springer, 2013.
[23] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming. SIAM, 2009.
[24] M. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, pages 794–802, 1982.
[25] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. In AAAI, 2015.
[26] S. Uryasev, S. Sarykalin, G. Serraino, and K. Kalinchenko. VaR vs CVaR in risk management and optimization. In CARISMA Conference, 2010.
[27] H. Xu and S. Mannor. The robustness-performance tradeoff in Markov decision processes. In Advances in Neural Information Processing Systems, pages 1537–1544, 2006.