{"title": "Safe Policy Improvement by Minimizing Robust Baseline Regret", "book": "Advances in Neural Information Processing Systems", "page_first": 2298, "page_last": 2306, "abstract": "An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to the existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and seamlessly fall back to the baseline policy, otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.", "full_text": "Safe Policy Improvement by Minimizing Robust\n\nBaseline Regret\n\nMarek Petrik\n\nUniversity of New Hampshire\n\nmpetrik@cs.unh.edu\n\nMohammad Ghavamzadeh\nAdobe Research & INRIA Lille\n\nghavamza@adobe.com\n\nAbstract\n\nYinlam Chow\n\nStanford University\n\nychow@stanford.edu\n\nAn important problem in sequential decision-making under uncertainty is to use\nlimited data to compute a safe policy, which is guaranteed to outperform a given\nIn this paper, we develop and analyze a new model-based\nbaseline strategy.\napproach that computes a safe policy, given an inaccurate model of the system\u2019s\ndynamics and guarantees on the accuracy of this model. The new robust method\nuses this model to directly minimize the (negative) regret w.r.t. 
the baseline policy.\nContrary to existing approaches, minimizing the regret allows one to improve\nthe baseline policy in states with accurate dynamics and to seamlessly fall back\nto the baseline policy, otherwise. We show that our formulation is NP-hard and\npropose a simple approximate algorithm. Our empirical results on several domains\nfurther show that even the simple approximate algorithm can outperform standard\napproaches.\n\n1\n\nIntroduction\n\nMany problems in science and engineering can be formulated as a sequential decision-making\nproblem under uncertainty. A common scenario in such problems that occurs in many different \ufb01elds,\nsuch as online marketing, inventory control, health informatics, and computational \ufb01nance, is to \ufb01nd\na good or an optimal strategy/policy, given a batch of data generated by the current strategy of the\ncompany (hospital, investor). Although there are many techniques to \ufb01nd a good policy given a batch\nof data, only a few of them guarantee that the obtained policy will perform well, when it is deployed.\nSince deploying an untested policy can be risky for the business, the product (hospital, investment)\nmanager does not usually allow it to happen, unless we provide her/him with some performance\nguarantees of the obtained strategy, in comparison to the baseline policy (for example the policy that\nis currently in use).\nIn this paper, we focus on the model-based approach to this fundamental problem in the context\nof in\ufb01nite-horizon discounted Markov decision processes (MDPs). In this approach, we use the\nbatch of data and build a model or a simulator that approximates the true behavior of the dynamical\nsystem, together with an error function that captures the accuracy of the model at each state of the\nsystem. Our goal is to compute a safe policy, i.e., a policy that is guaranteed to perform at least\nas well as the baseline strategy, using the simulator and error function. 
Most of the work on this\ntopic has been in the model-free setting, where safe policies are computed directly from the batch of\ndata, without building an explicit model of the system [Thomas et al., 2015b,a]. Another class of\nmodel-free algorithms are those that use a batch of data generated by the current policy and return a\npolicy that is guaranteed to perform better. They optimize for the policy by repeating this process\nuntil convergence [Kakade and Langford, 2002; Pirotta et al., 2013].\nA major limitation of the existing methods for computing safe policies is that they either adopt a\nnewly learned policy with provable improvements or do not make any improvement at all by returning\nthe baseline policy. These approaches may be quite limiting when model uncertainties are not uniform\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\facross the state space. In such cases, it is desirable to guarantee an improvement over the baseline\npolicy by combining it with a learned policy on a state-by-state basis. In other words, we want to use\nthe learned policy at the states in which either the improvement is signi\ufb01cant or the model uncertainty\n(error function) is small, and to use the baseline policy everywhere else. However, computing a\nlearned policy that can be effectively combined with a baseline policy is non-trivial due to the complex\neffects of policy changes in an MDP. Our key insight is that this goal can be achieved by minimizing\nthe (negative) robust regret w.r.t. the baseline policy. This uni\ufb01es the sources of uncertainties in the\nlearned and baseline policies and allows a more systematic performance comparison. Note that our\napproach differs signi\ufb01cantly from the standard one, which compares a pessimistic performance\nestimate of the learned policy with an optimistic estimate of the baseline strategy. 
That may result in\nrejecting a learned policy with a performance (slightly) better than the baseline, simply due to the\ndiscrepancy between the pessimistic and optimistic evaluations.\nThe model-based approach of this paper builds on robust Markov decision processes [Iyengar, 2005;\nWiesemann et al., 2013; Ahmed and Varakantham, 2013]. The main difference is the availability\nof the baseline policy that creates unique challenges for sequential optimization. To the best of\nour knowledge, such challenges have not yet been fully investigated in the literature. A possible\nsolution is to solve the robust formulation of the problem and then accept the resulted policy only\nif its conservative performance estimate is better than the baseline. While a similar idea has been\ninvestigated in the model-free setting (e.g., [Thomas et al., 2015a]), we show in this paper that it can\nbe overly conservative.\nAs the main contribution of the paper, we propose and analyze a new robust optimization formulation\nthat captures the above intuition of minimizing robust regret w.r.t. the baseline policy. After a\npreliminary discussion in Section 2, we formally describe our model and analyze its main properties\nin Section 3. We show that in solving this optimization problem, we may have to go beyond the\nstandard space of deterministic policies and search in the space of randomized policies; we derive a\nbound on the performance loss of its solutions; and we prove that solving this problem is NP-hard.\nWe also propose a simple and practical approximate algorithm. Then, in Section 4, we show that\nthe standard model-based approach is really a tractable approximation of robust baseline regret\nminimization. 
Finally, our experimental results in Section 5 indicate that even the simple approximate algorithm significantly outperforms the standard model-based approach when the model is uncertain.

2 Preliminaries

We consider problems in which the agent's interaction with the environment is modeled as an infinite-horizon γ-discounted MDP. A γ-discounted MDP is a tuple M = ⟨X, A, r, P, p_0, γ⟩, where X and A are the state and action spaces, r(x, a) ∈ [−Rmax, Rmax] is the bounded reward function, P(·|x, a) is the transition probability function, p_0(·) is the initial state distribution, and γ ∈ (0, 1] is a discount factor. We use Π_R = {π : X → Δ_A} and Π_D = {π : X → A} to denote the sets of randomized and deterministic stationary Markovian policies, respectively, where Δ_A is the set of probability distributions over the action space A.

Throughout the paper, we assume that the true reward r of the MDP is known, but the true transition probability is not given. The generalization to include reward estimation is straightforward and is omitted for the sake of brevity. We use historical data to build an MDP model with transition probability denoted by P̂. Due to the limited number of samples and other modeling issues, it is unlikely that P̂ matches the true transition probability of the system, P⋆. We also require that the estimated model P̂ deviates from the true transition probability P⋆ as stated in the following assumption:

Assumption 1.
For each (x, a) ∈ X × A, the error function e(x, a) bounds the ℓ1 difference between the estimated and the true transition probabilities, i.e.,

‖P⋆(·|x, a) − P̂(·|x, a)‖_1 ≤ e(x, a).   (1)

The error function e can be derived either directly from samples using high-probability concentration bounds, as we briefly outline in Appendix A, or based on specific domain properties.

To model the uncertainty in the transition probability, we adopt the notion of robust MDP (RMDP) [Iyengar, 2005; Nilim and El Ghaoui, 2005; Wiesemann et al., 2013], i.e., an extension of MDP in which nature adversarially chooses the transitions from a given uncertainty set

Ξ(P̂, e) = { ξ : X × A → Δ_X : ‖ξ(·|x, a) − P̂(·|x, a)‖_1 ≤ e(x, a), ∀(x, a) ∈ X × A }.

From Assumption 1, we notice that the true transition probability is in the set of uncertain transition probabilities, i.e., P⋆ ∈ Ξ(P̂, e). The above ℓ1 constraint is common in the RMDP literature (e.g., [Iyengar, 2005; Wiesemann et al., 2013; Petrik and Subramanian, 2014]). The uncertainty set Ξ in RMDP is (x, a)-rectangular and randomized [Le Tallec, 2007; Wiesemann et al., 2013]. One of the motivations for considering (x, a)-rectangular sets in RMDPs is that they lead to tractable solutions in the conventional reward maximization setting. However, in the robust regret minimization problem that we propose in this paper, even if we assume that the uncertainty set is (x, a)-rectangular, this does not guarantee tractability of the solution.
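In practice, the error function e in Assumption 1 is derived from concentration inequalities (the paper's Appendix A). As a minimal illustrative sketch, assuming a Weissman-style ℓ1 concentration bound for an empirical distribution over successor states (the constants below are standard for that bound, not necessarily those of Appendix A):

```python
import math

def l1_error(n_samples, n_states, delta=0.05):
    # High-probability bound on ||P*(.|x,a) - P_hat(.|x,a)||_1 when the
    # empirical transition distribution over n_states successor states is
    # estimated from n_samples observed transitions from (x, a).
    # Weissman-style bound; illustrative, not the paper's exact derivation.
    if n_samples == 0:
        return 2.0  # the l1 distance between two distributions is at most 2
    bound = math.sqrt(2.0 * (n_states * math.log(2.0) - math.log(delta)) / n_samples)
    return min(2.0, bound)

print(l1_error(100, 10), l1_error(10000, 10))
```

As more transitions are observed for a state-action pair, e(x, a) shrinks and the uncertainty set Ξ(P̂, e) tightens around P̂.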
While it is of great interest to investigate the structure of uncertainty sets that lead to tractable algorithms in robust regret minimization, it is beyond the main scope of this paper and we leave it for future work.

For each policy π ∈ Π_R and nature's choice ξ ∈ Ξ, the discounted return is defined as

ρ(π, ξ) = lim_{T→∞} E_ξ [ Σ_{t=0}^{T−1} γ^t r(X_t, A_t) | X_0 ∼ p_0, A_t ∼ π(X_t) ] = p_0^⊤ v_π^ξ,

where X_t and A_t are the state and action random variables at time t, and v_π^ξ is the corresponding value function. An optimal policy for a given ξ is defined as π⋆_ξ ∈ arg max_{π∈Π_R} ρ(π, ξ). Similarly, under the true transition probability P⋆, the true return of a policy π and a truly optimal policy are defined as ρ(π, P⋆) and π⋆ ∈ arg max_{π∈Π_R} ρ(π, P⋆), respectively. Although we define the optimal policy using arg max over Π_R, it is known that every reward maximization problem in MDPs has at least one optimal policy in Π_D.

Finally, given a deterministic baseline policy π_B, we call a policy π safe if its \"true\" performance is guaranteed to be no worse than that of the baseline policy, i.e., ρ(π, P⋆) ≥ ρ(π_B, P⋆).

3 Robust Policy Improvement Model

In this section, we introduce and analyze an optimization procedure that robustly improves over a given baseline policy π_B. As described above, the main idea is to find a policy that is guaranteed to be an improvement for any realization of the uncertain model parameters. The following definition formalizes this intuition.

Definition 2 (The Robust Policy Improvement Problem).
Given a model uncertainty set Ξ(P̂, e) and a baseline policy π_B, find a maximal ζ ≥ 0 such that there exists a policy π ∈ Π_R for which ρ(π, ξ) ≥ ρ(π_B, ξ) + ζ for every ξ ∈ Ξ(P̂, e).¹

The problem posed in Definition 2 readily translates to the following optimization problem:

π_S ∈ arg max_{π∈Π_R} min_{ξ∈Ξ} ( ρ(π, ξ) − ρ(π_B, ξ) ).   (2)

Note that since the baseline policy π_B achieves value 0 in (2), ζ in Definition 2 is always non-negative. Therefore, any solution π_S of (2) is safe, because under the true transition probability P⋆ ∈ Ξ(P̂, e), we have the guarantee that

ρ(π_S, P⋆) − ρ(π_B, P⋆) ≥ min_{ξ∈Ξ} ( ρ(π_S, ξ) − ρ(π_B, ξ) ) ≥ 0.

It is important to highlight how Definition 2 differs from the standard approach (e.g., [Thomas et al., 2015a]) for determining whether a policy π is an improvement over the baseline policy π_B. The standard approach considers a statistical error bound that translates to the test: min_{ξ∈Ξ} ρ(π, ξ) ≥ max_{ξ∈Ξ} ρ(π_B, ξ). The uncertainty parameters ξ on the two sides of this test are not necessarily the same. Therefore, any optimization procedure derived from this test is more conservative than the problem in (2).

¹From now on, for brevity, we omit the parameters P̂ and e, and use Ξ to denote the model uncertainty set.
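The gap between the two criteria is easy to check numerically. Representing two model realizations only by the returns they induce (the values below match the illustrative example of Section 5.1), the coupled regret criterion of (2) certifies an improvement while the decoupled test rejects it:

```python
# Returns of the candidate policy and the baseline under each realization xi.
returns_pi = {'xi1': 11.0, 'xi2': -9.0}
returns_base = {'xi1': 10.0, 'xi2': -10.0}

# Coupled criterion of (2): nature picks ONE xi for both policies.
coupled = min(returns_pi[xi] - returns_base[xi] for xi in returns_pi)

# Decoupled (standard) test: pessimistic candidate vs. optimistic baseline.
decoupled = min(returns_pi.values()) - max(returns_base.values())

print(coupled, decoupled)  # 1.0 -19.0
```

The candidate is better than the baseline under every single realization (coupled regret 1.0 ≥ 0), yet the decoupled test compares returns under two different realizations and rejects it by a wide margin.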
Indeed, when the error function in Ξ is large, even the baseline policy (π = π_B) may not pass this test. In Section 5.1, we show the conditions under which this approach fails. Our approach also differs from other related work in that we consider regret with respect to the baseline policy, and not the optimal policy, as considered in [Xu and Mannor, 2009].

Figure 1: (left) A robust/uncertain MDP used in Example 4 that illustrates the sub-optimality of deterministic policies in solving the optimization problem (2). (right) A Markov decision process with significant uncertainty in the baseline policy.

In the remainder of this section, we highlight some major properties of the optimization problem (2). Specifically, we show that its solution policy may be purely randomized, we compute a bound on the performance loss of its solution policy w.r.t. π⋆, and we finally prove that it is an NP-hard problem.

3.1 Policy Class

The following theorem shows that we should search for the solutions of the optimization problem (2) in the space of randomized policies Π_R.

Theorem 3. The optimal solution to the optimization problem (2) may not be attained by a deterministic policy. Moreover, the loss due to considering only deterministic policies cannot be bounded, i.e., there exists no constant c ∈ R such that

max_{π∈Π_R} min_{ξ∈Ξ} ( ρ(π, ξ) − ρ(π_B, ξ) ) ≤ c · max_{π∈Π_D} min_{ξ∈Ξ} ( ρ(π, ξ) − ρ(π_B, ξ) ).

Proof.
The proof follows directly from Example 4. The optimal policy in this example is randomized and achieves a guaranteed improvement of ζ = 1/2. No deterministic policy guarantees a positive improvement over the baseline policy, which proves the second part of the theorem.

Example 4. Consider the robust/uncertain MDP on the left panel of Figure 1 with states {x1, x11} ⊂ X, actions A = {a1, a2, a11, a12}, and discount factor γ = 1. Actions a1 and a2 are shown as solid black nodes. A number with no state represents a terminal state with the corresponding reward. The robust outcomes {ξ1, ξ2} correspond to the uncertainty set of transition probabilities Ξ. The baseline policy π_B is deterministic and is denoted by double edges. It can be readily seen from the monotonicity of the Bellman operator that any improved policy π will satisfy π(a12|x11) = 1. Therefore, we focus only on the policy at state x1. The returns of the two deterministic choices at x1 and of the baseline policy under the two realizations are:

π \ ξ    ξ1   ξ2
a1        3    1
a2        2    2
π_B       2    1

so that for every deterministic policy π, min_{ξ∈Ξ} ( ρ(π, ξ) − ρ(π_B, ξ) ) = 0.

This shows that no deterministic policy can achieve a positive improvement in this problem. However, the randomized policy π(a1|x1) = π(a2|x1) = 1/2 attains the maximum improvement ζ = 1/2. Randomized policies can do better than their deterministic counterparts because they allow for hedging among the various realizations of the MDP parameters. Example 4 shows a problem in which, for each deterministic policy, there exists a realization of the parameters under which that policy improves over the baseline when executed.
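Restated as a matrix game over regrets, Example 4 can be verified directly; this sketch hard-codes the returns from Example 4 (a1 → (3, 1), a2 → (2, 2), π_B → (2, 1) under (ξ1, ξ2)):

```python
# Regret matrix regret[a][xi] = rho(a, xi) - rho(pi_B, xi) from Example 4.
regret = [[3 - 2, 1 - 1],   # action a1: regrets (1, 0) under (xi1, xi2)
          [2 - 2, 2 - 1]]   # action a2: regrets (0, 1) under (xi1, xi2)

# Best guaranteed improvement over deterministic policies: nature can always
# pick the realization where the chosen action matches the baseline.
det = max(min(row) for row in regret)

# Guaranteed improvement of the uniform randomized policy at x1.
mix = min(0.5 * regret[0][j] + 0.5 * regret[1][j] for j in range(2))

print(det, mix)  # 0 0.5
```

The regret matrix is the identity, i.e., a matching-pennies-like game: its deterministic max-min value is 0, while mixing the two actions equally guarantees 1/2 regardless of nature's choice.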
However, in this example there is no single realization of the parameters that provides an improvement for all deterministic policies simultaneously. Therefore, randomizing the policy guarantees an improvement independent of the parameters' choice.

3.2 Performance Bound

Generally, one cannot compute the truly optimal policy π⋆ using an imprecise model. Nevertheless, it is still crucial to understand how errors in the model translate into a performance loss w.r.t. an optimal policy. The following theorem (proved in Appendix C) provides a bound on the performance loss of any solution π_S to the optimization problem (2).

Theorem 5. A solution π_S to the optimization problem (2) is safe and its performance loss is bounded by the following inequality:

Φ(π_S) := ρ(π⋆, P⋆) − ρ(π_S, P⋆) ≤ min { (2γRmax / (1 − γ)²) ( ‖e_{π⋆}‖_{1, u⋆_{π⋆}} + ‖e_{π_B}‖_{1, u⋆_{π_B}} ), Φ(π_B) },

where u⋆_{π⋆} and u⋆_{π_B} are the state occupancy distributions of the optimal and baseline policies in the true MDP P⋆. Furthermore, the above bound is tight.

3.3 Computational Complexity

In this section, we analyze the computational complexity of solving the optimization problem (2) and prove that the problem is NP-hard. In particular, we proceed by showing that the following sub-problem of (2):

arg min_{ξ∈Ξ} ( ρ(π, ξ) − ρ(π_B, ξ) ),   (3)

for a fixed π ∈ Π_R, is NP-hard. The optimization problem (3) can be interpreted as computing a choice of nature ξ that simultaneously minimizes the returns of two MDPs, whose transitions are induced by the policies π and π_B, respectively. The proof of Theorem 6 is given in Appendix D.

Theorem 6.
Both optimization problems (2) and (3) are NP-hard.

Although the optimization problem (2) is NP-hard in general, it can be tractable in certain settings. One such setting is when the Markov chain induced by the baseline policy is known precisely, as the following proposition states. See Appendix E for the proof.

Proposition 7. Assume that for each x ∈ X, the error function induced by the baseline policy is zero, i.e., e(x, π_B(x)) = 0.² Then, the optimization problem (2) is equivalent to the following robust MDP (RMDP) problem and can be solved in polynomial time:

arg max_{π∈Π_R} min_{ξ∈Ξ} ρ(π, ξ).   (4)

3.4 Approximate Algorithm

Solving for the optimal solution of (2) may not be possible in practice, since the problem is NP-hard. In this section, we propose a simple and practical approximate algorithm. The empirical results of Section 5 indicate that this algorithm holds promise and also suggest that the approach may be a good starting point for building better approximate algorithms in the future.

Algorithm 1: Approximate Robust Baseline Regret Minimization Algorithm
input: empirical transition probabilities P̂, baseline policy π_B, and the error function e
output: policy π̃_S
1 foreach x ∈ X, a ∈ A do
2   ẽ(x, a) ← e(x, a) when π_B(x) ≠ a, and 0 otherwise;
3 end
4 π̃_S ← arg max_{π∈Π_R} min_{ξ∈Ξ(P̂, ẽ)} ( ρ(π, ξ) − ρ(π_B, ξ) );
5 return π̃_S

Algorithm 1 contains the pseudocode of the proposed approximate method.
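The only change Algorithm 1 makes relative to the exact formulation is line 2, which zeroes the error on baseline actions. A minimal sketch of that step, with hypothetical dictionary-based state and action encodings:

```python
def modified_error(error, pi_b):
    # Line 2 of Algorithm 1: keep the estimated error e(x, a) for
    # non-baseline actions and force it to zero on baseline actions, so that
    # the Markov chain induced by the baseline policy is treated as exact.
    return {(x, a): (0.0 if pi_b[x] == a else e)
            for (x, a), e in error.items()}

# Tiny example with two states and two actions (hypothetical encoding):
e = {('x0', 'a1'): 0.4, ('x0', 'a2'): 0.4,
     ('x1', 'a1'): 0.1, ('x1', 'a2'): 0.1}
pi_b = {'x0': 'a1', 'x1': 'a2'}
e_tilde = modified_error(e, pi_b)
print(e_tilde[('x0', 'a1')], e_tilde[('x0', 'a2')])  # 0.0 0.4
```

With the modified error function ẽ, the resulting robust problem falls under Proposition 7 and is solvable in polynomial time.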
The main idea is to use a modified uncertainty model that assumes no error in the transition probabilities of the baseline policy.² Then, by Proposition 7, it is possible to minimize the robust baseline regret in polynomial time. Assuming no error in the baseline transition probabilities is reasonable for two main reasons. First, in practice, data is often generated by executing the baseline policy, and thus we may have enough data for a good approximation of the baseline's transition probabilities: P̂(·|x, π_B(x)) ≈ P⋆(·|x, π_B(x)) for all x ∈ X. Second, transition probabilities often affect the baseline and improved policies similarly and, as a result, have little effect on the difference between their returns (i.e., the regret). See Section 5.1 for an example of such behavior.

²Note that this is equivalent to precisely knowing the Markov chain induced by the baseline policy, P⋆_{π_B}.

4 Standard Policy Improvement Methods

In Section 3, we showed that finding an exact solution to the optimization problem (2) is computationally expensive and proposed an approximate algorithm. In this section, we describe and analyze two standard methods for computing safe policies and show how they can be interpreted as approximations of our proposed baseline regret minimization. Due to space limitations, we describe another method, called reward-adjusted MDP, in Appendix H, but report its performance in Section 5.

4.1 Solving the Simulator

The simplest solution to (2) is to assume that our simulator is accurate and to solve the reward maximization problem of the MDP with transition probability P̂, i.e., π_sim ∈ arg max_{π∈Π_R} ρ(π, P̂).
Theorem 8 quantifies the performance loss of the resulting policy π_sim.

Theorem 8. Let π_sim be an optimal policy of the reward maximization problem of an MDP with transition probability P̂. Then, under Assumption 1, the performance loss of π_sim is bounded by

Φ(π_sim) := ρ(π⋆, P⋆) − ρ(π_sim, P⋆) ≤ (2γRmax / (1 − γ)²) ‖e‖_∞.

The proof is available in Appendix F. Note that there is no guarantee that π_sim is safe, and thus deploying it may lead to undesirable outcomes due to model uncertainties. Moreover, the performance guarantee of π_sim in Theorem 8 is weaker than that of Theorem 5, due to the L∞ norm.

4.2 Solving Robust MDP

Another standard solution to the problem in (2) is based on solving the RMDP problem (4). We prove that the policy returned by this algorithm is safe and has better (sharper) worst-case guarantees than the simulator-based policy π_sim. Details of this algorithm are summarized in Algorithm 2. The algorithm first constructs and solves an RMDP.
It then returns the solution policy if its worst-case performance over the uncertainty set is better than the optimistic performance of the baseline, max_{ξ∈Ξ} ρ(π_B, ξ), and it returns the baseline policy π_B otherwise.

Algorithm 2: RMDP-based Algorithm
input: simulated MDP P̂, baseline policy π_B, and the error function e
output: policy π_R
1 π_0 ← arg max_{π∈Π_R} min_{ξ∈Ξ(P̂, e)} ρ(π, ξ);
2 if min_{ξ∈Ξ(P̂, e)} ρ(π_0, ξ) > max_{ξ∈Ξ} ρ(π_B, ξ) then return π_0 else return π_B;

Algorithm 2 makes use of the following lower-bound approximation to the objective of (2):

max_{π∈Π_R} min_{ξ∈Ξ} ( ρ(π, ξ) − ρ(π_B, ξ) ) ≥ max_{π∈Π_R} min_{ξ∈Ξ} ρ(π, ξ) − max_{ξ∈Ξ} ρ(π_B, ξ),

and guarantees safety by returning π_0 only when the right-hand side of this inequality is non-negative. The performance bound of π_R is identical to that of Theorem 5; it is stated and proved as Theorem 12 in Appendix G. Although the worst-case bounds are the same, we show in Section 5.1 that the performance loss of π_R may be worse than that of π_S by an arbitrarily large margin.

It is important to discuss the difference between Algorithms 1 and 2. Although both solve an RMDP, they use different uncertainty sets Ξ. The uncertainty set used in Algorithm 2 is the true error function in building the simulator, while the uncertainty set used in Algorithm 1 assumes that the error function is zero for all the actions suggested by the baseline policy.
As a result, both algorithms approximately\nsolve (2) but approximate the problem in different ways.\n\n5 Experimental Evaluation\n\nIn this section, we experimentally evaluate the bene\ufb01ts of minimizing the robust baseline regret. First,\nwe demonstrate that solving the problem in (2) may outperform the regular robust formulation by an\narbitrarily large margin. Then, in the remainder of the section, we compare the solution quality of\nAlgorithm 1 with simpler methods in more complex and realistic experimental domains. The purpose\nof our experiments is to show how solution quality depends on the degree of model uncertainties.\n\n5.1 An Illustrative Example\n\nConsider the example depicted on the right panel of Figure 1. White nodes represent states and black\nnodes represent state-action pairs. Labels on the edges originated from states indicate the policy\naccording to which the action is taken; labels on the edges originated from actions denote the rewards\nand, if necessary, the name of the uncertainty realization. The baseline policy is \u03c0B, the optimal\npolicy is \u03c0(cid:63), and the discount factor is \u03b3 \u2208 (0, 1).\nThis example represents a setting in which the level of uncertainty varies signi\ufb01cantly across the\nindividual states: the transition model is precise in state x0 and uncertain in state x1. The baseline\npolicy \u03c0B takes a suboptimal action in state x0 and the optimal action in the uncertain state x1. To\nprevent being overly conservative in computing a safe policy, one needs to consider that the realization\nof uncertainty in x1 in\ufb02uences both the baseline and improved policies.\nUsing the plain robust optimization formulation in Algorithm 2, even the optimal policy \u03c0(cid:63) is not\nconsidered safe in this example. In particular, the robust return of \u03c0(cid:63) is min\u03be \u03c1(\u03c0(cid:63), \u03be) = \u22129, while\nthe optimistic return of \u03c0B is max\u03be \u03c1(\u03c0B, \u03be) = +10. 
On the other hand, solving (2) returns the optimal policy, since the regret is the same under both realizations: min_{ξ∈Ξ} ( ρ(π⋆, ξ) − ρ(π_B, ξ) ) = min{ 11 − 10, −9 − (−10) } = 1 > 0. Even the heuristic method of Section 3.4 will return the optimal policy. Note that since the reward-adjusted formulation (see its description in Appendix H) is even more conservative than the robust formulation, it will also fail to improve on the baseline policy.

5.2 Grid Problem

In this section, we use a simple grid problem to compare the solution quality of Algorithm 1 with simpler methods. The grid problem is motivated by modeling customer interactions with an online system. States in the problem represent a two-dimensional grid. Columns capture states of interaction with the website and rows capture customer states, such as overall satisfaction. Actions can move customers along either dimension with some probability of failure. A more detailed description of this domain is provided in Appendix I.1.

Our goal is to evaluate how the solution quality of the various methods depends on the magnitude of the model error e. The model is constructed from samples, and thus the magnitude of its error depends on the number of samples used to build it. We use a uniform random policy to gather samples. The model error function e is then constructed from this simulated data using the bounds in Appendix B. The baseline policy is constructed to be optimal when ignoring the row part of the state; see Appendix I.1 for more details.

All methods are compared in terms of the percentage improvement in total return over the baseline policy. Figure 2 depicts the results as a function of the number of transition samples used in constructing the uncertain model and represents the mean of 40 runs.
Methods used in the comparison are as follows: 1) EXP represents solving the nominal model, as described in Section 4.1; 2) RWA represents the reward-adjusted formulation in Algorithm 3 of Appendix H; 3) ROB represents the robust method in Algorithm 2; and 4) RBC represents our approximate solution of Algorithm 1.

Figure 2: Improvement in return over the baseline policy in (left) the grid problem and (right) the energy arbitrage problem. The dashed line shows the return of the optimal policy.

Figure 2 shows that Algorithm 1 not only reliably computes policies that are safe, but also significantly improves on the quality of the baseline policy when the model error is large. When the number of samples is small, Algorithm 1 is significantly better than the other methods: it relies on the baseline policy in states with a large model error and takes improving actions only where the model error is small. Note that EXP can be significantly worse than the baseline policy, especially when the number of samples is small.

5.3 Energy Arbitrage

In this section, we compare model-based policy improvement methods on a more complex domain. The problem is to determine an energy arbitrage policy given limited energy storage (a battery) and stochastic prices. At each time period, the decision-maker observes the available battery charge and a Markov state of the energy price, and decides on the amount of energy to purchase or to sell.

The set of states in the energy arbitrage problem consists of three components: the current state of charge, the current capacity, and a Markov state representing the price; the actions represent the amount of energy purchased or sold; the rewards indicate the profit/loss of the transactions. We discretize the state of charge and the action sets into 10 separate levels.
The problem is based on the domain of [Petrik and Wu, 2015], which is described in detail in Appendix I.2.
Energy arbitrage is a good fit for model-based approaches because it combines known and unknown dynamics. The physics of battery charging and discharging can be modeled with high confidence, while the evolution of energy prices is uncertain. As a result, when an explicit battery model is used, the only uncertainty is in the transition probabilities among the 10 states of the price process, rather than over the entire 1000 state-action pairs. This significantly reduces the number of samples needed.
As in the previous experiments, we estimate the uncertainty model in a data-driven manner. Notice that the inherent uncertainty is only in the price transitions and is independent of the policy used (which controls the storage dynamics). Here, the uncertainty set of transition probabilities is estimated using the method in Appendix A, but it is non-singleton only w.r.t. the price states. Figure 2 shows the percentage improvement over the baseline policy, averaged over 5 runs. We clearly observe that the heuristic RBC method, described in Section 3.4, effectively interleaves the baseline policy (in states with a high level of uncertainty) with an improved policy (in states with a low level of uncertainty), and achieves the best performance in most cases. Solving a robust MDP with no baseline policy performed similarly to directly solving the simulator.

6 Conclusion

In this paper, we study the model-based approach to the fundamental problem of learning safe policies given a batch of data. A policy is considered safe if it is guaranteed to improve on the performance of a baseline policy. Solving the problem of safety in sequential decision-making can immensely increase the applicability of existing techniques to real-world problems.
We show that the standard robust formulation may be overly conservative, and we formulate a better approach that interleaves an improved policy with the baseline policy based on the model error at each state. We propose and analyze an optimization problem based on this idea (see (2)) and prove that solving it is NP-hard. Furthermore, we propose several approximate solutions and experimentally evaluate their performance.

References

A. Ahmed and P. Varakantham. Regret-based robust solutions for uncertain Markov decision processes. In Advances in Neural Information Processing Systems, pages 1–9, 2013.

T. Hansen, P. Miltersen, and U. Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM, 60(1):1–16, 2013.

G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, pages 267–274, 2002.

Y. Le Tallec. Robust, Risk-Sensitive, and Data-Driven Control of Markov Decision Processes. PhD thesis, MIT, 2007.

A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

M. Petrik and D. Subramanian. RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Advances in Neural Information Processing Systems, 2014.

M. Petrik and X. Wu. Optimal threshold control for energy arbitrage with degradable battery storage.
In Uncertainty in Artificial Intelligence, pages 692–701, 2015.

M. Pirotta, M. Restelli, and D. Calandriello. Safe policy iteration. In Proceedings of the 30th International Conference on Machine Learning, 2013.

P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015.

P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Technical report, Hewlett-Packard Labs, 2003.

W. Wiesemann, D. Kuhn, and B. Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.

H. Xu and S. Mannor. Parametric regret in uncertain Markov decision processes. In Proceedings of the IEEE Conference on Decision and Control, pages 3606–3613, 2009.