{"title": "Robustness in Markov Decision Problems with Uncertain Transition Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 839, "page_last": 846, "abstract": "", "full_text": "Robustness in Markov Decision Problems with\n\nUncertain Transition Matrices\u2217\n\nArnab Nilim\n\nDepartment of EECS \u2020\nUniversity of California\n\nBerkeley, CA 94720\n\nnilim@eecs.berkeley.edu\n\nLaurent El Ghaoui\nDepartment of EECS\nUniversity of California\n\nBerkeley, CA 94720\n\nelghaoui@eecs.berkeley.edu\n\nAbstract\n\nOptimal solutions to Markov Decision Problems (MDPs) are very sen-\nsitive with respect to the state transition probabilities. In many practi-\ncal problems, the estimation of those probabilities is far from accurate.\nHence, estimation errors are limiting factors in applying MDPs to real-\nworld problems. We propose an algorithm for solving \ufb01nite-state and\n\ufb01nite-action MDPs, where the solution is guaranteed to be robust with\nrespect to estimation errors on the state transition probabilities. Our al-\ngorithm involves a statistically accurate yet numerically ef\ufb01cient repre-\nsentation of uncertainty, via Kullback-Leibler divergence bounds. The\nworst-case complexity of the robust algorithm is the same as the origi-\nnal Bellman recursion. Hence, robustness can be added at practically no\nextra computing cost.\n\n1 Introduction\n\nWe consider a \ufb01nite-state and \ufb01nite-action Markov decision problem in which the transi-\ntion probabilities themselves are uncertain, and seek a robust decision for it. Our work\nis motivated by the fact that in many practical problems, the transition matrices have to\nbe estimated from data. This may be a dif\ufb01cult task and the estimation errors may have\na huge impact on the solution, which is often quite sensitive to changes in the transition\nprobabilities [3]. A number of authors have addressed the issue of uncertainty in the transi-\ntion matrices of an MDP. 
A Bayesian approach such as the one described by [9] requires perfect knowledge of the whole prior distribution on the transition matrix, making it difficult to apply in practice. Other authors have considered the transition matrix to lie in a given set, most typically a polytope: see [8, 10, 5]. Although our approach allows us to describe the uncertainty on the transition matrix by a polytope, we may argue against choosing such a model for the uncertainty. First, a general polytope is often not a tractable way to address the robustness problem, as it incurs a significant additional computational effort to handle uncertainty. Perhaps more importantly, polytopic models, especially interval matrices, may be very poor representations of statistical uncertainty and lead to very conservative robust policies.

∗Research funded in part by Eurocontrol-014692, DARPA-F33615-01-C-3150, and NSF-ECS-9983874.
†Electrical Engineering and Computer Sciences

In [1], the authors consider a problem dual to ours, and provide a general statement according to which the cost of solving their problem is polynomial in problem size, provided the uncertainty on the transition matrices is described by convex sets, without proposing any specific algorithm. This paper is a short version of a longer report [2], which contains all the proofs of the results summarized here.

Notation. P > 0 or P ≥ 0 refers to the strict or non-strict componentwise inequality for matrices or vectors. For a vector p > 0, log p refers to the componentwise operation. The notation 1 refers to the vector of ones, with size determined from context. The probability simplex in R^n is denoted Δ_n = {p ∈ R^n_+ : p^T 1 = 1}, while Θ_n is the set of n×n transition matrices (componentwise non-negative matrices with rows summing to one). 
We use σ_P to denote the support function of a set P ⊆ R^n: for v ∈ R^n, σ_P(v) := sup{p^T v : p ∈ P}.

2 The problem description

We consider a finite-horizon Markov decision process with finite decision horizon T = {0, 1, 2, . . . , N − 1}. At each stage, the system occupies a state i ∈ X, where n = |X| is finite, and a decision maker is allowed to choose an action a deterministically from a finite set of allowable actions A = {a_1, . . . , a_m} (for notational simplicity we assume that A is not state-dependent). The system starts in a given initial state i_0. The states make Markov transitions according to a collection of (possibly time-dependent) transition matrices τ := (P^a_t)_{a∈A, t∈T}, where for every a ∈ A, t ∈ T, the n × n transition matrix P^a_t contains the probabilities of transition under action a at stage t. We denote by π = (a_0, . . . , a_{N−1}) a generic controller policy, where a_t(i) denotes the controller action when the system is in state i ∈ X at time t ∈ T. Let Π = A^{nN} be the corresponding strategy space. Define by c_t(i, a) the cost corresponding to state i ∈ X and action a ∈ A at time t ∈ T, and by c_N the cost function at the terminal stage. 
We assume that c_t(i, a) is non-negative and finite for every i ∈ X and a ∈ A.

For a given set of transition matrices τ, we define the finite-horizon nominal problem by

φ_N(Π, τ) := min_{π∈Π} C_N(π, τ),   (1)

where C_N(π, τ) denotes the expected total cost under controller policy π and transitions τ:

C_N(π, τ) := E ( Σ_{t=0}^{N−1} c_t(i_t, a_t(i_t)) + c_N(i_N) ).   (2)

A special case of interest is when the expected total cost function bears the form (2), where the terminal cost is zero, and c_t(i, a) = ν^t c(i, a), with c(i, a) now a constant cost function, which we assume non-negative and finite everywhere, and ν ∈ (0, 1) a discount factor. We refer to this cost function as the discounted cost function, and denote by C_∞(π, τ) the limit of the discounted cost (2) as N → ∞.

When the transition matrices are exactly known, the corresponding nominal problem can be solved via a dynamic programming algorithm, which has total complexity of nmN flops in the finite-horizon case. In the infinite-horizon case with a discounted cost function, the cost of computing an ε-suboptimal policy via the Bellman recursion is O(nm log(1/ε)); see [7] for more details.

2.1 The robust control problems

At first we assume that, for each action a and time t, the corresponding transition matrix P^a_t is only known to lie in some given subset P^a. Two models for transition matrix uncertainty are possible, leading to two possible forms of finite-horizon robust control problems. In a first model, referred to as the stationary uncertainty model, the transition matrices are chosen by nature depending on the controller policy once and for all, and remain fixed thereafter. 
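For reference, the nominal backward recursion mentioned above can be sketched as follows. This is a minimal illustration on our own synthetic arrays (the function name, array shapes, and the two-state example are ours, not the paper's):

```python
import numpy as np

def nominal_dp(P, c, c_N):
    """Backward dynamic programming for the nominal finite-horizon MDP.

    P   : shape (N, m, n, n); P[t, a] is the transition matrix under action a at stage t.
    c   : shape (N, n, m);    c[t, i, a] is the stage cost.
    c_N : shape (n,);         terminal cost.
    Returns the stage-0 value function and a greedy policy table.
    """
    N, m, n, _ = P.shape
    v = np.asarray(c_N, dtype=float)
    policy = np.zeros((N, n), dtype=int)
    for t in range(N - 1, -1, -1):
        # Q[i, a] = c_t(i, a) + sum_j P^a_t(i, j) * v_{t+1}(j)
        Q = c[t] + np.einsum('aij,j->ia', P[t], v)
        policy[t] = Q.argmin(axis=1)
        v = Q.min(axis=1)
    return v, policy
```

Each of the N stages performs one expectation per state-action pair, which is the source of the nmN flop count quoted above (counting each row-times-vector product as n flops).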
In a second model, which we refer to as the time-varying uncertainty model, the transition matrices can vary arbitrarily with time, within their prescribed bounds. Each problem leads to a game between the controller and nature, where the controller seeks to minimize the maximum expected cost, with nature being the maximizing player.

Let us define our two problems more formally. A policy of nature refers to a specific collection of time-dependent transition matrices τ = (P^a_t)_{a∈A, t∈T} chosen by nature, and the set of admissible policies of nature is T := (⊗_{a∈A} P^a)^N. Denote by T_s the set of stationary admissible policies of nature:

T_s = {τ = (P^a_t)_{a∈A, t∈T} ∈ T : P^a_t = P^a_s for every t, s ∈ T, a ∈ A}.

The stationary uncertainty model leads to the problem

φ_N(Π, T_s) := min_{π∈Π} max_{τ∈T_s} C_N(π, τ).   (3)

In contrast, the time-varying uncertainty model leads to a relaxed version of the above:

φ_N(Π, T_s) ≤ φ_N(Π, T) := min_{π∈Π} max_{τ∈T} C_N(π, τ).   (4)

The first model is attractive for statistical reasons, as it is much easier to develop statistically accurate sets of confidence when the underlying process is time-invariant. Unfortunately, the resulting game (3) seems to be hard to solve. The second model is attractive as one can solve the corresponding game (4) using a variant of the dynamic programming algorithm seen later, but we are left with a difficult task, that of estimating a meaningful set of confidence for the time-varying matrices P^a_t. In this paper we will use the first model of uncertainty in order to derive statistically meaningful sets of confidence for the transition matrices, based on likelihood or entropy bounds. 
Then, instead of solving the corresponding difficult control problem (3), we use an approximation that is common in robust control, and solve the time-varying upper bound (4), using the uncertainty sets P^a derived from a stationarity assumption about the transition matrices. We will also consider a variant of the finite-horizon time-varying problem (4), where controller and nature play alternately, leading to a repeated game

φ^rep_N(Π, Q) := min_{a_0} max_{τ_0∈Q} min_{a_1} max_{τ_1∈Q} . . . min_{a_{N−1}} max_{τ_{N−1}∈Q} C_N(π, τ),   (5)

where the notation τ_t = (P^a_t)_{a∈A} denotes the collection of transition matrices at a given time t ∈ T, and Q := ⊗_{a∈A} P^a is the corresponding set of confidence.

Finally, we will consider an infinite-horizon robust control problem, with the discounted cost function referred to above, and where we restrict control and nature policies to be stationary:

φ_∞(Π_s, T_s) := min_{π∈Π_s} max_{τ∈T_s} C_∞(π, τ),   (6)

where Π_s denotes the space of stationary control policies. We define φ_∞(Π, T), φ_∞(Π, T_s) and φ_∞(Π_s, T) accordingly.

In the sequel, for a given control policy π ∈ Π and subset S ⊆ T, the notation φ_N(π, S) := max_{τ∈S} C_N(π, τ) denotes the worst-case expected total cost for the finite-horizon problem, and φ_∞(π, S) is defined likewise.

2.2 Main results

Our main contributions are as follows. First, we provide a recursion, the “robust dynamic programming” algorithm, which solves the finite-horizon robust control problem (4). We provide a simple proof in [2] of the optimality of the recursion, where the main ingredient is to show that perfect duality holds in the game (4). 
As a corollary of this result, we obtain that the repeated game (5) is equivalent to its non-repeated counterpart (4). Second, we provide similar results for the infinite-horizon problem with discounted cost function, (6). Moreover, we obtain that if we consider a finite-horizon problem with a discounted cost function, then the gap between the optimal value of the stationary uncertainty problem (3) and that of its time-varying counterpart (4) goes to zero as the horizon length goes to infinity, at a rate determined by the discount factor. Finally, we identify several classes of uncertainty models which result in an algorithm that is both statistically accurate and numerically tractable. We provide precise complexity results that imply that, with the proposed approach, robustness can be handled at practically no extra computing cost.

3 Finite-Horizon Robust MDP

We consider the finite-horizon robust control problem defined in section 2.1. For a given state i ∈ X, action a ∈ A, and P^a ∈ P^a, we denote by p^a_i the next-state distribution drawn from P^a corresponding to state i ∈ X; thus p^a_i is the i-th row of matrix P^a. We define P^a_i as the projection of the set P^a onto the set of p^a_i-variables. By assumption, these sets are included in the probability simplex of R^n, Δ_n; no other property is assumed. 
The following theorem is proved in [2].

Theorem 1 (robust dynamic programming) For the robust control problem (4), perfect duality holds:

φ_N(Π, T) = min_{π∈Π} max_{τ∈T} C_N(π, τ) = max_{τ∈T} min_{π∈Π} C_N(π, τ) := ψ_N(Π, T).

The problem can be solved via the recursion

v_t(i) = min_{a∈A} ( c_t(i, a) + σ_{P^a_i}(v_{t+1}) ),   i ∈ X, t ∈ T,   (7)

where σ_P(v) := sup{p^T v : p ∈ P} denotes the support function of a set P, and v_t(i) is the worst-case optimal value function in state i at stage t. A corresponding optimal control policy π* = (a*_0, . . . , a*_{N−1}) is obtained by setting

a*_t(i) ∈ arg min_{a∈A} ( c_t(i, a) + σ_{P^a_i}(v_{t+1}) ),   i ∈ X.   (8)

The effect of uncertainty on a given strategy π = (a_0, . . . , a_{N−1}) can be evaluated by the following recursion

v^π_t(i) = c_t(i, a_t(i)) + σ_{P^{a_t(i)}_i}(v^π_{t+1}),   i ∈ X,   (9)

which provides the worst-case value function v^π for the strategy π.

The above result has a nice consequence for the repeated game (5):

Corollary 2 The repeated game (5) is equivalent to the game (4):

φ^rep_N(Π, Q) = φ_N(Π, T),

and the optimal strategies for φ_N(Π, T) given in theorem 1 are optimal for φ^rep_N(Π, Q) as well.

The interpretation of the perfect duality result given in theorem 1, and of its consequence given in corollary 2, is that it does not matter whether the controller or nature plays first, or whether they alternate; all these games are equivalent.

Each step of the robust dynamic programming algorithm involves the solution of an optimization problem, referred to as the “inner problem”, of the form

σ_{P^a_i}(v) = max_{p∈P^a_i} v^T p,   (10)

where P^a_i is the set that describes the uncertainty on the i-th row of the transition matrix P^a, and v contains the elements of the value function at some given stage. The complexity of the sets P^a_i for each i ∈ X and a ∈ A is a key component in the complexity of the robust dynamic programming algorithm. Beyond numerical tractability, an additional criterion for the choice of a specific uncertainty model is of course that the sets P^a should represent accurate (non-conservative) descriptions of the statistical uncertainty on the transition matrices. Perhaps surprisingly, there are statistical models of uncertainty, such as those described in section 5, that are good on both counts. Precisely, these models result in inner problems (10) that can be solved in worst-case time of O(n log(v_max/δ)) via a simple bisection algorithm, where n is the size of the state space, v_max is a global upper bound on the value function, and δ > 0 specifies the accuracy at which the optimal value of the inner problem (10) is computed. In the finite-horizon case, we can bound v_max by O(N).

Now consider the following algorithm, where the uncertainty is described in terms of one of the models described in section 5:

Robust Finite Horizon Dynamic Programming Algorithm

1. Set ε > 0. Initialize the value function to its terminal value v̂_N = c_N.
2. Repeat until t = 0:

(a) For every state i ∈ X and action a ∈ A, compute, using the bisection algorithm given in [2], a value σ̂^a_i such that

σ̂^a_i − ε/N ≤ σ_{P^a_i}(v̂_t) ≤ σ̂^a_i.

(b) Update the value function by v̂_{t−1}(i) = min_{a∈A} (c_{t−1}(i, a) + σ̂^a_i), i ∈ X.
(c) Replace t by t − 1 and go to 2.

3. 
For every i ∈ X and t ∈ T, set π_ε = (a^ε_0, . . . , a^ε_{N−1}), where

a^ε_t(i) ∈ arg min_{a∈A} { c_t(i, a) + σ̂^a_i },   i ∈ X.

As shown in [2], the above algorithm provides an ε-suboptimal policy π_ε that achieves the exact optimum within the prescribed accuracy ε, with a required number of flops bounded above by O(mnN log(N/ε)). This means that robustness is obtained at a relative increase of computational cost of only log(N/ε) with respect to the classical dynamic programming algorithm, which is small for moderate values of N. If N is very large, we can turn instead to the infinite-horizon problem examined in section 4, and similar complexity results hold.

4 Infinite-Horizon MDP

In this section, we address the infinite-horizon robust control problem, with a discounted cost function of the form (2), where the terminal cost is zero, and c_t(i, a) = ν^t c(i, a), where c(i, a) is now a constant cost function, which we assume non-negative and finite everywhere, and ν ∈ (0, 1) is a discount factor.
We begin with the infinite-horizon problem involving stationary control and nature policies defined in (6). 
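Before turning to the infinite-horizon case, the finite-horizon recursion just described can be sketched compactly. In this minimal illustration (our own names, not the paper's), the support function is supplied as an oracle `sigma` standing in for the bisection algorithm of [2]; for a finite scenario set, `sigma` is simply a maximum over the candidate rows:

```python
import numpy as np

def robust_dp(sigma, c, c_N, N, n, m):
    """Robust finite-horizon recursion, in the spirit of recursion (7).

    sigma(i, a, v) must return (an upper bound on) the support function
    sigma_{P^a_i}(v) = max{p @ v : p in P^a_i}.
    c is indexed as c[t][i][a]; c_N is the terminal cost vector.
    """
    v = np.asarray(c_N, dtype=float)
    policy = np.zeros((N, n), dtype=int)
    for t in range(N - 1, -1, -1):
        # worst-case Q-value for every state-action pair at stage t
        Q = np.array([[c[t][i][a] + sigma(i, a, v) for a in range(m)]
                      for i in range(n)])
        policy[t] = Q.argmin(axis=1)
        v = Q.min(axis=1)
    return v, policy
```

For instance, with a two-scenario set P^a_i = {p1, p2}, one would pass `sigma = lambda i, a, v: max(p1 @ v, p2 @ v)`; only the oracle changes between uncertainty models.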
The following theorem is proved in [2].

Theorem 3 (Robust Bellman recursion) For the infinite-horizon robust control problem (6) with stationary uncertainty on the transition matrices, stationary control policies, and a discounted cost function with discount factor ν ∈ [0, 1), perfect duality holds:

φ_∞(Π_s, T_s) = max_{τ∈T_s} min_{π∈Π_s} C_∞(π, τ) := ψ_∞(Π_s, T_s).   (11)

The optimal value is given by φ_∞(Π_s, T_s) = v(i_0), where i_0 is the initial state, and where the value function v satisfies the optimality conditions

v(i) = min_{a∈A} ( c(i, a) + ν σ_{P^a_i}(v) ),   i ∈ X.   (12)

The value function is the unique limit of the convergent vector sequence defined by

v_{k+1}(i) = min_{a∈A} ( c(i, a) + ν σ_{P^a_i}(v_k) ),   i ∈ X, k = 1, 2, . . .   (13)

A stationary, optimal control policy π* = (a*, a*, . . .) is obtained as

a*(i) ∈ arg min_{a∈A} ( c(i, a) + ν σ_{P^a_i}(v) ),   i ∈ X.   (14)

Note that the problem of computing the dual quantity ψ_∞(Π_s, T_s) given in (11) has been addressed in [1], where the authors provide the recursion (13) without proof.

Theorem 3 leads to the following corollary, also proved in [2].

Corollary 4 In the infinite-horizon problem, we can without loss of generality assume that the control and nature policies are stationary, that is,

φ_∞(Π, T) = φ_∞(Π_s, T_s) = φ_∞(Π_s, T) = φ_∞(Π, T_s).   (15)

Furthermore, in the finite-horizon case, with a discounted cost function, the gap between the optimal values of the finite-horizon problems under stationary and time-varying uncertainty models, φ_N(Π, T) − φ_N(Π, T_s), goes to zero as the horizon length N goes to infinity, at a geometric rate ν.

Now consider the following algorithm, where we describe the uncertainty using one of the models of section 5.

Robust Infinite Horizon Dynamic Programming Algorithm

1. Set ε > 0, initialize the value function v̂_1 > 0 and set k = 1.

2. (a) For all states i and controls a, compute, using the bisection algorithm given in [2], a value σ̂^a_i such that

σ̂^a_i − δ ≤ σ_{P^a_i}(v̂_k) ≤ σ̂^a_i,   where δ = (1 − ν)ε/2ν.

(b) For all states i and controls a, compute v̂_{k+1}(i) by

v̂_{k+1}(i) = min_{a∈A} (c(i, a) + ν σ̂^a_i).

3. If ‖v̂_{k+1} − v̂_k‖ < (1 − ν)ε/2ν, go to 4. Otherwise, replace k by k + 1 and go to 2.

4. For each i ∈ X, set π_ε = (a_ε, a_ε, . . 
.), where

a_ε(i) ∈ arg min_{a∈A} { c(i, a) + ν σ̂^a_i },   i ∈ X.

In [2], we establish that the above algorithm finds an ε-suboptimal robust policy in at most O(nm log(1/ε)^2) flops. Thus, the extra computational cost incurred by robustness in the infinite-horizon case is only O(log(1/ε)).

5 Kullback-Leibler Divergence Uncertainty Models

We now address the inner problem (10) for a specific action a ∈ A and state i ∈ X. Denote by D(p‖q) the Kullback-Leibler (KL) divergence (relative entropy) from the probability distribution q ∈ Δ_n to the probability distribution p ∈ Δ_n:

D(p‖q) := Σ_j p(j) log ( p(j)/q(j) ).

The above function provides a natural way to describe errors in (rows of) the transition matrices; examples of models based on this function are given below.

Likelihood Models: Our first uncertainty model is derived from a controlled experiment starting from state i = 1, 2, . . . , n and the count of the number of transitions to different states. We denote by F^a the matrix of empirical frequencies of transition with control a in the experiment; denote by f^a_i its i-th row. We have F^a ≥ 0 and F^a 1 = 1, where 1 denotes the vector of ones. The “plug-in” estimate P̂^a = F^a is the solution to the maximum-likelihood problem

max_P Σ_{i,j} F^a(i, j) log P(i, j) : P ≥ 0, P1 = 1.   (16)

The optimal log-likelihood is β^a_max = Σ_{i,j} F^a(i, j) log F^a(i, j). A classical description of uncertainty in a maximum-likelihood setting is via the “likelihood region” [6]

P^a = { P ∈ R^{n×n} : P ≥ 0, P1 = 1, Σ_{i,j} F^a(i, j) log P(i, j) ≥ β^a },   (17)

where β^a < β^a_max is a pre-specified number, which represents the uncertainty level. In practice, the designer specifies an uncertainty level β^a based on resampling methods, or on a large-sample Gaussian approximation, so as to ensure that the set above achieves a desired level of confidence.

With the above model, we note that the inner problem (10) only involves the set

P^a_i := { p^a_i ∈ R^n : p^a_i ≥ 0, (p^a_i)^T 1 = 1, Σ_j F^a(i, j) log p^a_i(j) ≥ β^a_i },

where the parameter β^a_i := β^a − Σ_{k≠i} Σ_j F^a(k, j) log F^a(k, j). The set P^a_i is the projection of the set described in (17) onto a specific axis of p^a_i-variables. Noting further that the likelihood function can be expressed in terms of KL divergence, the corresponding uncertainty model on the row p^a_i for given i ∈ X, a ∈ A, is given by a set of the form P^a_i = {p ∈ Δ_n : D(f^a_i ‖ p) ≤ γ^a_i}, where γ^a_i = Σ_j F^a(i, j) log F^a(i, j) − β^a_i is a function of the uncertainty level.

Maximum A Posteriori (MAP) Models: A variation on likelihood models involves Maximum A Posteriori (MAP) estimates. 
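The reduction of a KL-constrained inner problem (10) to a one-dimensional convex program can be made concrete for a set of the form {p ∈ Δ_n : D(p‖q) ≤ γ}, for which convex duality gives σ(v) = min_{λ>0} λγ + λ log Σ_j q(j) e^{v(j)/λ}. The sketch below is ours: the helper name is hypothetical, and plain ternary search on the convex dual stands in for the bisection algorithm of [2]:

```python
import math

def sigma_rel_entropy(v, q, gamma, iters=300):
    """Support function max{p.v : D(p||q) <= gamma, p in the simplex},
    computed via its scalar convex dual in lam (a sketch, not the paper's
    exact bisection scheme)."""
    vmax = max(v)

    def f(lam):
        # stable evaluation of gamma*lam + lam*log sum_j q_j exp(v_j/lam)
        s = sum(qj * math.exp((vj - vmax) / lam) for vj, qj in zip(v, q))
        return gamma * lam + vmax + lam * math.log(s)

    # bracket the minimizer by doubling; f eventually increases or flattens
    lo, hi = 1e-12, 1.0
    while hi < 1e8 and f(2 * hi) < f(hi):
        hi *= 2
    for _ in range(iters):          # ternary search on the convex f
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return f((lo + hi) / 2)
```

The two limiting cases behave as expected: γ = 0 recovers the nominal expectation q·v, while a γ large enough to contain the whole simplex recovers max_j v(j).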
If prior information regarding the uncertainty on the i-th row of P^a exists, and can be described via a Dirichlet distribution [4] with parameter α^a_i, the resulting MAP estimation problem takes the form

max_p (f^a_i + α^a_i − 1)^T log p : p^T 1 = 1, p ≥ 0.

Thus, the MAP uncertainty model is equivalent to a likelihood model, with the sample distribution f^a_i replaced by f^a_i + α^a_i − 1, where α^a_i is the prior corresponding to state i and action a.

Relative Entropy Models: Likelihood or MAP models involve the KL divergence from the unknown distribution to a reference distribution. We can also choose to describe uncertainty by exchanging the order of the arguments of the KL divergence. This results in a so-called “relative entropy” model, where the uncertainty on the i-th row of the transition matrix P^a is described by a set of the form P^a_i = {p ∈ Δ_n : D(p‖q^a_i) ≤ γ^a_i}, where γ^a_i > 0 is fixed and q^a_i > 0 is a given “reference” distribution (for example, the maximum-likelihood distribution).

Equipped with one of the above uncertainty models, we can address the inner problem (10). As shown in [2], the inner problem can be converted, by convex duality, to the problem of minimizing a single-variable convex function. In turn, this one-dimensional convex optimization problem can be solved via a bisection algorithm with a worst-case complexity of O(n log(v_max/δ)), where δ > 0 specifies the accuracy at which the optimal value of the inner problem (10) is computed, and v_max is a global upper bound on the value function.

Remark: We can also use models where the uncertainty in the i-th row of the transition matrix P^a is described by a finite set of vectors, P^a_i = {p^{a,1}_i, . . . , p^{a,K}_i}. In this case the complexity of the corresponding robust dynamic programming algorithm is increased by a relative factor of K with respect to its classical counterpart, which makes the approach attractive when the number of “scenarios” K is moderate.

6 Concluding remarks

We proposed a “robust dynamic programming” algorithm for solving finite-state and finite-action MDPs whose solutions are guaranteed to tolerate arbitrary changes of the transition probability matrices within given sets. We proposed models based on KL divergence, which is a natural way to describe estimation errors. The resulting robust dynamic programming algorithm has almost the same computational cost as the classical dynamic programming algorithm: the relative increase to compute an ε-suboptimal policy is O(N log(1/ε)) in the N-horizon case, and O(log(1/ε)) in the infinite-horizon case.

References

[1] J. Bagnell, A. Ng, and J. Schneider. Solving uncertain Markov decision problems. Technical Report CMU-RI-TR-01-25, Robotics Institute, Carnegie Mellon University, August 2001.

[2] L. El Ghaoui and A. Nilim. Robust solution to Markov decision problems with uncertain transition matrices: proofs and complexity analysis. Technical Report UCB/ERL M04/07, Department of EECS, University of California, Berkeley, January 2004. A related version was submitted to Operations Research in December 2003.

[3] E. Feinberg and A. Shwartz. Handbook of Markov Decision Processes: Methods and Applications. Kluwer Academic Publishers, Boston, 2002.

[4] T. Ferguson. Prior distributions on spaces of probability measures. The Annals of Statistics, 2(4):615–629, 1974.

[5] R. Givan, S. Leach, and T. Dean. Bounded parameter Markov decision processes. In Fourth European Conference on Planning, pages 234–246, 1997.

[6] E. Lehmann and G. Casella. 
Theory of Point Estimation. Springer-Verlag, New York, 1998.

[7] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.

[8] J. K. Satia and R. L. Lave. Markov decision processes with uncertain transition probabilities. Operations Research, 21(3):728–740, 1973.

[9] A. Shapiro and A. J. Kleywegt. Minimax analysis of stochastic problems. Optimization Methods and Software, 2002. To appear.

[10] C. C. White and H. K. Eldeib. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4):739–749, 1994.
", "award": [], "sourceid": 2367, "authors": [{"given_name": "Arnab", "family_name": "Nilim", "institution": null}, {"given_name": "Laurent", "family_name": "Ghaoui", "institution": null}]}