{"title": "Reinforcement Learning in Robust Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 701, "page_last": 709, "abstract": "An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case.", "full_text": "Reinforcement Learning in Robust Markov Decision\n\nProcesses\n\nShiau Hong Lim\n\nHuan Xu\n\nDepartment of Mechanical Engineering\n\nNational University of Singapore\n\nDepartment of Mechanical Engineering\n\nNational University of Singapore\n\nSingapore\n\nmpelsh@nus.edu.sg\n\nSingapore\n\nmpexuh@nus.edu.sg\n\nDepartment of Electrical Engineering\n\nShie Mannor\n\nTechnion, Israel\n\nshie@ee.technion.ac.il\n\nAbstract\n\nAn important challenge in Markov decision processes is to ensure robustness with\nrespect to unexpected or adversarial system behavior while taking advantage of\nwell-behaving parts of the system. We consider a problem setting where some\nunknown parts of the state space can have arbitrary transitions while other parts\nare purely stochastic. We devise an algorithm that is adaptive to potentially ad-\nversarial behavior and show that it achieves similar regret bounds as the purely\nstochastic case.\n\n1\n\nIntroduction\n\nMarkov decision processes (MDPs) [Puterman, 1994] have been widely used to model and solve\nsequential decision problems in stochastic environments. 
Given the parameters of an MDP, namely, the rewards and transition probabilities, an optimal policy can be computed. In practice, these parameters are often estimated from noisy data and, furthermore, they may change during the execution of a policy. Hence, the performance of the chosen policy may deteriorate significantly; see [Mannor et al., 2007] for numerical experiments.

The robust MDP framework has been proposed to address this issue of parameter uncertainty (e.g., [Nilim and El Ghaoui, 2005] and [Iyengar, 2005]). The robust MDP setting assumes that the true parameters fall within some uncertainty set U and seeks a policy that performs best under the worst realization of the parameters. These solutions, however, can be overly conservative since they are based on the worst-case realization. Variants of robust MDP formulations have been proposed to mitigate this conservativeness when additional information on the parameter distribution [Strens, 2000, Xu and Mannor, 2012] or coupling among the parameters [Mannor et al., 2012] is known. A major drawback of previous work on robust MDPs is that it focused on the planning problem, with no effort to learn the uncertainty. Since in practice it is often difficult to accurately quantify the uncertainty, the solutions to the robust MDP can be conservative if a too-large uncertainty set is used.

In this work, we make the first attempt to perform learning in robust MDPs. We assume that some of the state-action pairs are adversarial in the sense that their parameters can change arbitrarily within U from one step to another. Others, however, are benign in the sense that they are fixed and behave purely stochastically. The learner is given only the uncertainty set U and knows neither the parameters nor the true nature of each state-action pair.

In this setting, a traditional robust MDP approach would be equivalent to assuming that all parameters are adversarial and would therefore always execute the minimax policy. This is too conservative, since it could be the case that most of the parameters are stochastic. Alternatively, one could use an existing online learning algorithm such as UCRL2 [Jaksch et al., 2010] and assume that all parameters are stochastic. This, as we show in the next section, may lead to suboptimal performance when some of the states are adversarial.

Instead, we propose an online learning approach to robust MDPs. We show that the cumulative reward obtained by this method is as good as that of the minimax policy that knows the true nature of each state-action pair. This means that by incorporating learning in robust MDPs, we can effectively resolve the “conservativeness due to not knowing the uncertainty” effect.

The rest of the paper is structured as follows. Section 2 discusses the key difficulties in our setting and explains why existing solutions are not applicable. In subsequent sections, we present our algorithm, its theoretical performance bound and its analysis. Sections 3 and 4 cover the finite-horizon case while Section 5 deals with the infinite-horizon case. We present experimental results in Section 6 and conclude in Section 7.

2 Problem setting

We consider an MDP M with a finite state space S and a finite action space A. Let S = |S| and A = |A|. Executing action a in state s results in a random transition according to a distribution ps,a(·), where ps,a(s') gives the probability of transitioning to state s', and accumulates an immediate reward r(s, a).

A robust MDP considers the case where the transition probability is determined in an adversarial way. 
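To make the minimax computation concrete, the sketch below performs one worst-case Bellman backup, the basic step of robust MDP planning: for each state-action pair the adversary picks the worst transition distribution available to it, and the decision maker then maximizes over actions. This is an illustrative sketch only; the function name and the representation of U(s, a) as an explicit finite list of candidate distributions are our own assumptions, not the paper's.

```python
def robust_backup(V_next, r, U_sets):
    """One minimax Bellman backup (a sketch, not the paper's exact routine).

    V_next : list of next-stage values, one entry per state
    r      : r[s][a] gives the immediate reward of action a in state s
    U_sets : U_sets[s][a] is a finite list of candidate distributions p(.|s, a)
             standing in for the uncertainty set U(s, a)
    """
    V, policy = [], []
    for s in range(len(r)):
        best_q, best_a = None, None
        for a in range(len(r[s])):
            # Adversarial (worst-case) choice of transition distribution.
            worst = min(sum(p[s2] * V_next[s2] for s2 in range(len(V_next)))
                        for p in U_sets[s][a])
            q = r[s][a] + worst
            if best_q is None or q > best_q:      # decision maker maximizes
                best_q, best_a = q, a
        V.append(best_q)
        policy.append(best_a)
    return V, policy
```

Iterating this backup to convergence (or for T stages) yields the minimax value and policy.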
That is, when action a is taken at state s, the transition probability ps,a(·) can be an arbitrary element of the uncertainty set U(s, a). In particular, for different visits of the same (s, a), the realization of ps,a can be different, possibly depending on the history. This can model cases where the system dynamics are influenced by competitors or exogenous factors that are hard to model, or where the MDP is a simplification of a complicated dynamic system.

Previous research in robust MDPs focused exclusively on the planning problem. Here, the power of the adversary (the uncertainty set of the parameters) is precisely known, and the goal is to find the minimax policy, i.e., the policy with the best performance under the worst admissible parameters.

This paper considers the learning problem of robust MDPs. We ask the following question: suppose the power of the adversary (the extent to which it can affect the system) is not completely revealed to the decision maker; if we are allowed to play the MDP many times, can we still obtain an optimal policy as if we knew the true extent of its power? Or, to put it differently, can we develop a procedure that provides the exact amount of protection against the unknown adversary?

Our specific setup is as follows: for each (s, a) ∈ S × A an uncertainty set U(s, a) is given. However, not all states are adversarial. Only a subset F ⊂ S × A is truly adversarial, while all the other state-action pairs behave purely stochastically, i.e., with a fixed unknown ps,a. Moreover, the set F is not known to the algorithm.

This setting differs from existing setups, and is challenging for the following reasons:

1. The adversarial actions ps,a are not directly observable.
2. The adversarial behavior is not constrained, except that it must belong to the uncertainty set.
3. Ignoring the adversarial component results in sub-optimal behavior.

The first challenge precludes the use of algorithms based on stochastic games, such as R-Max [Brafman and Tennenholtz, 2002]. The R-Max algorithm deals with stochastic games where the opponent's action-set for each state is known and the opponent's actions are always observable. In our setting, only the outcome (i.e., the next state and the reward) of each transition is observable. The algorithm does not observe the action ps,a taken by the adversary. Indeed, because the set F is unknown, even the action set of the adversary is unknown to the algorithm.

The second challenge is due to the unconstrained adversarial behavior. For state-action pairs (s, a) ∈ F, the opponent is free to choose any ps,a ∈ U(s, a) for each transition, possibly depending on the history and the strategy of the decision maker (i.e., non-oblivious). This affects the sort of performance guarantee one can reasonably expect from any algorithm. In particular, when considering the regret against the best stationary policy “in hindsight”, [Yu and Mannor, 2009] show that a small change in transition probabilities can cause large regret. Even with additional constraints on the allowed adversarial behavior, they showed that the regret bound still does not vanish with respect to the number of steps. 
Indeed, most results for adversarial MDPs [Even-Dar et al., 2005, Even-Dar et al., 2009, Yu et al., 2009, Neu et al., 2010, Neu et al., 2012] only deal with adversarial rewards while the transitions are assumed stochastic and fixed, which is considerably simpler than our setting.

Since it is not possible to achieve vanishing regret against the best stationary policy in hindsight, we choose to measure the regret against the performance of a minimax policy that knows exactly which state-action pairs are adversarial (i.e., the set F) as well as the true ps,a for all stochastic state-action pairs. Intuitively, this means that if the adversary chooses to play “nicely”, we are not required to exploit this.

Finally, given that we are competing against the minimax policy, one might ask whether we could simply apply existing algorithms such as UCRL2 [Jaksch et al., 2010] and treat every state-action pair as stochastic. The following example shows that ignoring any adversarial behavior may lead to large regret compared to the minimax policy.

Figure 1: Example MDP with adversarial transitions.

Consider the MDP in Figure 1. Suppose that a UCRL2-like algorithm is used, where all transitions are assumed purely stochastic. There are 3 alternative policies, each corresponding to choosing action a1, a2 and a3 respectively in state s0. Action a1 leads to the optimal minimax average reward of g*. State s2 leads to an average reward of g* + β for some β > 0. State s1 has an adversarial transition, where both s2 and s4 are possible next states. State s4 has a similar behavior, where it may either lead to g* + β or to a “bad” region with average reward g* − α, for some 2β < α < 3β.

We consider two phases. In phase 1, the adversary behaves “benignly” by choosing all solid-line transitions. Since both a2 and a3 lead to similar outcomes, we assume that in phase 1, both a2 and a3 are chosen for T steps each. In phase 2, the adversary chooses the dashed-line transitions in both s1 and s4. Due to a2 and a3 having similar values (both g* + β > g*) we can assume that a2 is always chosen in phase 2 (if a3 is ever chosen in phase 2, its value will quickly drop below that of a2). Suppose that a2 also runs for T steps in phase 2. A little algebra (see the supplementary material for details) shows that at the end of phase 2 the expected value of s4 (from the learner's point of view) is g4 = g* + (β − α)/2, and therefore the expected value of s1 is g1 = g* + (3β − α)/4 > g*. The total accumulated reward over both phases is, however, 3Tg* + T(2β − α). Let c = α − 2β > 0. This means that the overall total regret is cT, which is linear in T.

Note that in the above example, the expected value of a2 (namely g1) remains greater than the minimax value g* throughout phase 2 and therefore the algorithm will continue to prefer a2, even though the actual accumulated average value is already way below g*. The reason behind this is that the Markov property, which is crucial for UCRL2-like algorithms to work, has been violated due to s1 and s4 behaving in a non-independent way caused by the adversary.

3 Algorithm and main result

In this section, we present our algorithm and the main result for the finite-horizon case with the total reward as the performance measure. Section 5 provides the corresponding algorithm and result for the infinite-horizon average-reward case.

For simplicity, we assume without loss of generality a deterministic and known reward function r(s, a). We also assume that rewards are bounded such that r(s, a) ∈ [0, 1]. 
It is straightforward, by introducing additional states, to extend the algorithm and analysis to the case where the reward function is random, unknown, or even adversarial.

In the finite-horizon case, we consider an episodic setting where each episode has a fixed and known length T. The algorithm starts at a (possibly random) state s0 and executes T stages. After that, a new episode begins, with an arbitrarily chosen start state (it can simply be the last state of the previous episode). This goes on indefinitely.

Let π be a finite-horizon (non-stationary) policy where πt(s) gives the action to be executed in state s at step t in an episode, where t = 0, ..., (T − 1). Let Pt be a particular choice of ps,a ∈ U(s, a) for every (s, a) ∈ F at step t. For each t = 0, ..., (T − 1), we define

V^π_t(s) = min_{Pt,...,P(T−2)} E_{Pt,...,P(T−2)} [ Σ_{t'=t}^{T−1} r(st', πt'(st')) ]   and   V*_t(s) = max_π V^π_t(s),

where st = s and st+1, ..., sT−1 are random variables due to the random transitions. We assume that U is such that the minimum above exists (e.g., a compact set). It is not hard to show that, given state s, there exists a policy π with V^π_0(s) = V*_0(s), and we can compute such a minimax policy if the algorithm knows F and ps,a for all (s, a) ∉ F, from the literature on robust MDPs (e.g., [Nilim and El Ghaoui, 2005] and [Iyengar, 2005]).

The main message of this paper is that we can determine a policy as good as the minimax policy without knowing either F or ps,a for (s, a) ∉ F. To make this formal, we define the regret (against the minimax performance) in episode i, for i = 1, 2, ..., as

Δi = V*_0(s^i_0) − Σ_{t=0}^{T−1} r(s^i_t, a^i_t),

where s^i_t and a^i_t denote the actual state visited and action taken at step t of episode i.[1] The total regret for m episodes, which we want to minimize, is thus defined as

Δ(m) = Σ_{i=1}^{m} Δi.

The main algorithm is given in Figure 2. OLRM is basically UCRL2 [Jaksch et al., 2010] with an additional stochasticity check to detect adversarial state-action pairs. Like UCRL2, the algorithm employs the “optimism under uncertainty” principle. We start by assuming that all states are stochastic. If the adversary plays “nicely”, nothing else has to be done. The key challenge, however, is to successfully identify the adversarial state-action pairs when they start to behave maliciously.

A similar scenario in the multi-armed bandit setting has been addressed by [Bubeck and Slivkins, 2012]. They show that it is possible to achieve near-optimal regret without knowing a priori whether a bandit is stochastic or adversarial. In [Bubeck and Slivkins, 2012], the key is to check some consistency conditions that would be satisfied if the behavior were stochastic. We use the same strategy; the question is then: which condition? We discuss this in Section 3.2.

Note that the index k = 1, 2, ... tracks the number of policies. A policy is executed until either a new pair (s, a) fails the stochasticity check, and is hence deemed adversarial, or some state-action pair has been executed too many times. In either case, we need to re-compute the current optimistic policy (see Section 3.1 for the details). 
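The regret definitions Δi and Δ(m) above are simple bookkeeping. A minimal sketch (the function names are ours):

```python
def episode_regret(v_star_start, rewards):
    """Delta_i: minimax value of the episode's start state minus the
    reward actually collected over the T steps of that episode."""
    return v_star_start - sum(rewards)

def total_regret(v_star_starts, reward_sequences):
    """Delta(m): sum of the per-episode regrets over m episodes."""
    return sum(episode_regret(v, rs)
               for v, rs in zip(v_star_starts, reward_sequences))
```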
Every time a new policy is computed we call it a new epoch. While each episode has the same length (T), each epoch can span multiple episodes, and an epoch can begin in the middle of an episode.

[1] We provide high-probability regret bounds for any single trial, from which the expected regret can be readily derived, if desired.

Input: S, A, T, δ, and for each (s, a), U(s, a)
1. Initialize the set F ← {}.
2. Initialize k ← 1.
3. Compute an optimistic policy π̃, assuming all state-action pairs in F are adversarial (Section 3.1).
4. Execute π̃ until one of the following happens:
   • The execution count of some state-action pair (s, a) has doubled.
   • The executed state-action pair (s, a) fails the stochasticity check (Section 3.2). In this case (s, a) is added to F.
5. Increment k. Go back to step 3.

Figure 2: The OLRM algorithm

3.1 Computing an optimistic policy

Figure 3 shows the algorithm for computing the optimistic minimax policy, where we treat all state-action pairs in the set F as adversarial, and (similarly to UCRL2) use optimistic values for the other state-action pairs.

Here, to simplify notation, we frequently use V(·) to mean the vector whose elements are V(s) for each s ∈ S. This applies to value functions as well as probability distributions over S. In particular, we use p(·)V(·) to mean the dot product between two such vectors, i.e., Σ_s p(s)V(s). We use Nk(s, a) to denote the total number of times the state-action pair (s, a) has been executed before epoch k. The corresponding empirical next-state distribution based on these transitions is denoted as P̂k(·|s, a). 
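The outer-loop bookkeeping of Figure 2 can be sketched as follows. This is our own skeleton, not the paper's code: it tracks the execution counts N(s, a), the empirical next-state frequencies behind P̂k, the flagged set F, and the doubling condition that triggers a new epoch; the policy computation and the stochasticity check themselves are left as hooks.

```python
from collections import defaultdict

class OLRMSkeleton:
    """Bookkeeping sketch for the OLRM outer loop (hypothetical naming)."""

    def __init__(self):
        self.F = set()                                # pairs deemed adversarial
        self.count = defaultdict(int)                 # total N(s, a)
        self.count_at_epoch = defaultdict(lambda: 1)  # N_k(s, a), defined as 1
                                                      # for never-executed pairs
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def start_epoch(self):
        """Snapshot the counts at the start of a new epoch."""
        for sa, n in self.count.items():
            self.count_at_epoch[sa] = max(n, 1)

    def record(self, s, a, s_next):
        """Record one transition; return True when the pair's count has
        doubled since the epoch snapshot, i.e. a new epoch is needed."""
        self.count[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1
        return self.count[(s, a)] >= 2 * self.count_at_epoch[(s, a)]

    def p_hat(self, s, a, s_next):
        """Empirical next-state probability P_hat(s_next | s, a)."""
        n = max(self.count[(s, a)], 1)
        return self.next_counts[(s, a)][s_next] / n
```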
If (s, a) has never been executed before epoch k, we define Nk(s, a) = 1 and take P̂k(·|s, a) to be arbitrary.

Input: S, A, T, δ, F, k, and for each (s, a), U(s, a), P̂k(·|s, a) and Nk(s, a).
1. Set Ṽ^k_{T−1}(s) = max_a r(s, a) for all s.
2. Repeat, for t = T − 2, ..., 0:
   • For each (s, a) ∈ F, set
     Q̃^k_t(s, a) = min{ T − t, min_{p ∈ U(s,a)} [ r(s, a) + p(·)Ṽ^k_{t+1}(·) ] }.
   • For each (s, a) ∉ F, set
     Q̃^k_t(s, a) = min{ T − t, r(s, a) + P̂k(·|s, a)Ṽ^k_{t+1}(·) + T √( (2/Nk(s, a)) log(2SATk²/δ) ) }.
   • For each s, set Ṽ^k_t(s) = max_a Q̃^k_t(s, a) and π̃_t(s) = argmax_a Q̃^k_t(s, a).
3. Output π̃.

Figure 3: Algorithm for computing an optimistic minimax policy.

3.2 Stochasticity check

Every time a state-action pair (s, a) ∉ F is executed, the outcome is recorded and subjected to a “stochasticity check”. Let n be the total number of times (s, a) has been executed (including the latest one) and let s'_1, ..., s'_n be the next states for each of these transitions. Let k_1, ..., k_n be the epochs in which each of these transitions happens. Let t_1, ..., t_n be the steps within the episodes (i.e., the episode stages) where these transitions happen. Let τ be the total number of steps executed by the algorithm (from the beginning) so far. The stochasticity check fails if:

Σ_{j=1}^{n} P̂_{k_j}(·|s, a)Ṽ^{k_j}_{t_j+1}(·) − Σ_{j=1}^{n} Ṽ^{k_j}_{t_j+1}(s'_j) > 5T √( nS log(4SAT τ²/δ) ).

The stochasticity check follows the intuitive saying “if it is not broken, don't fix it”, by checking whether the value of the actual transitions from (s, a) falls below what is expected from the parameter estimates. One can show that, with high probability, all stochastic state-action pairs will always pass the stochasticity check. Now consider an adversarial (s, a) pair: if the adversary plays “nicely”, the current policy accumulates satisfactory reward and hence nothing needs to be changed, even if the transitions themselves fail to “look” stochastic; if the adversary plays “nasty”, then the stochasticity check will detect it, and subsequently protect against it.

3.3 Main result

The following theorem summarizes the performance of OLRM. Here and in the sequel, we use Õ when log terms are omitted. Our result for the infinite-horizon case is similar (see Section 5).

Theorem 1. Given δ, T, S, A, the total regret of OLRM is

Δ(m) ≤ Õ(ST^{3/2} √(Am))

for all m, with probability at least 1 − δ.

Note that the above is with respect to the total number of episodes m. Since the total number of steps is τ = mT, the regret bound in terms of τ is therefore Õ(ST √(Aτ)). This gives the familiar √τ regret as in UCRL2. Also, the bound has the same dependencies on S and A as in UCRL2. The horizon length T plays the role of the “diameter” in the infinite-horizon case and again it has the same dependency as its counterpart in UCRL2.

The result shows that even though the algorithm deals with unknown stochastic and potentially adversarial states, it achieves the same regret bound as in the fully stochastic case. 
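The stochasticity check of Section 3.2 compares a sum of predicted values against a sum of realized values. A minimal sketch of the comparison (function name and the pre-computed value lists are our assumptions; in the algorithm the predicted terms are the dot products P̂_{k_j}(·|s, a)Ṽ^{k_j}_{t_j+1}(·) and the realized terms are Ṽ^{k_j}_{t_j+1}(s'_j)):

```python
import math

def stochastic_check_fails(predicted, realized, S, A, T, tau, delta):
    """Return True when the realized values fall short of the predicted
    values by more than the confidence radius 5 T sqrt(n S log(4SAT tau^2 / delta)),
    in which case the pair is flagged as adversarial."""
    n = len(predicted)                       # number of recorded transitions
    radius = 5.0 * T * math.sqrt(n * S * math.log(4 * S * A * T * tau ** 2 / delta))
    return sum(predicted) - sum(realized) > radius
```

A pair whose adversary keeps the realized values close to the predictions passes the check indefinitely, which is exactly the intended behavior: a “nice” adversary need not be flagged.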
In the case where all states are in fact stochastic, this reduces to the same UCRL2 result.

4 Analysis of OLRM

We briefly explain the roadmap of the proof of Theorem 1. The complete proof can be found in the supplementary material.

Our proof starts with the following technical lemma.

Lemma 1. The following holds for all state-action pairs (s, a) ∉ F and for t = 0, ..., (T − 1) in all epochs k ≥ 1, with probability at least 1 − δ:

P̂k(·|s, a)Ṽ^k_{t+1}(·) − ps,a(·)Ṽ^k_{t+1}(·) ≤ T √( (2S/Nk(s, a)) log(4SATk²/δ) ).

Proof sketch. Since (s, a) ∉ F is stochastic, we apply the bound from [Weissman et al., 2003] on the 1-norm deviation between P̂k(·|s, a) and ps,a. The bound then follows from ||Ṽ^k_{t+1}(·)||_∞ ≤ T.

Using Lemma 1, we show the following lemma: with high probability, all purely stochastic state-action pairs will always pass the stochasticity check.

Lemma 2. The probability that any state-action pair (s, a) ∉ F gets added into the set F while running the algorithm is at most 2δ.

Proof sketch. Each (s, a) ∉ F is purely stochastic. Suppose (s, a) has been executed n times and s'_1, ..., s'_n are the next states for these transitions. Recall that the check fails if

Σ_{j=1}^{n} P̂_{k_j}(·|s, a)Ṽ^{k_j}_{t_j+1}(·) − Σ_{j=1}^{n} Ṽ^{k_j}_{t_j+1}(s'_j) > 5T √( nS log(4SAT τ²/δ) ).

We can derive a high-probability bound that satisfies the stochasticity check by applying the Azuma-Hoeffding inequality to the martingale difference sequence

X_j = ps,a(·)Ṽ^{k_j}_{t_j+1}(·) − Ṽ^{k_j}_{t_j+1}(s'_j),

followed by an application of Lemma 1.

We then show that all value estimates Ṽ^k_t are always optimistic.

Lemma 3. With probability at least 1 − δ, and assuming that no state-action pair (s, a) ∉ F has been added to F, the following holds for every state s ∈ S, every t ∈ {0, ..., T − 1} and every k ≥ 1:

Ṽ^k_t(s) ≥ V*_t(s).

Proof sketch. The key challenge is to prove that state-action pairs in F (adversarial) that have not yet been identified (i.e., all of whose past transitions passed the check) have optimistic Q̃ values. This can be done by, again, applying the Azuma-Hoeffding inequality.

Equipped with the previous three lemmas, we are now able to establish Theorem 1.

Proof sketch. Lemma 3 establishes that all value estimates Ṽ^k_t are always optimistic. We can therefore bound the regret by bounding the difference between Ṽ^k_t and the actual rewards received by the algorithm. If all state-action pairs are stochastic, this “optimistic gap” shrinks in the expected manner as the number of steps executed by the algorithm grows. For an adversarial state-action pair (s, a) ∈ F, we use the following facts to ensure the above: (i) if (s, a) has been added to F (i.e., it failed the stochasticity check), then all policies afterwards correctly evaluate its value; (ii) all transitions before (s, a) is added to F (if ever) must have passed the stochasticity check, and the check condition ensures that its behavior is consistent with what one would expect if (s, a) were stochastic.

5 Infinite-horizon case

In the infinite-horizon case, let P be a particular choice of ps,a ∈ U(s, a) for every (s, a) ∈ F. Given a (stationary) policy π, its average undiscounted reward (or “gain”) is defined as follows:

g^π_P(s) = lim_{τ→∞} E_P [ (1/τ) Σ_{t=1}^{τ} r(st, π(st)) ],

where s1 = s. 
The limit always exists for finite MDPs [Puterman, 1994]. We make the assumption that, regardless of the choice of P, the resulting MDP is communicating and unichain.[2] In this case g^π_P(s) is a constant, independent of s, so we can drop the argument s.

We define the worst-case average reward of π over all possible P as g^π = min_P g^π_P. An optimal minimax policy π* is any policy whose gain satisfies g^{π*} = g* = max_π g^π. We define the regret after executing the MDP M for τ steps as

Δ(τ) = τ g* − Σ_{t=1}^{τ} r(st, at).

The main algorithm for the infinite-horizon case, which we refer to as OLRM2, is essentially identical to OLRM. The main difference is in computing the optimistic policy and the corresponding stochasticity check. The detailed algorithm is presented in the supplementary material.

The algorithms from [Tewari and Bartlett, 2007] can be used to compute an optimistic minimax policy. In particular, for each (s, a) ∈ F, its transition function is chosen pessimistically from U(s, a). For each (s, a) ∉ F, its transition function is chosen optimistically from the following set:

{ p : ||p(·) − P̂k(·|s, a)||_1 ≤ σ }   where   σ = √( (2S/Nk(s, a)) log(4SAk²/δ) ).

[2] In more general settings, such as communicating or weakly communicating MDPs, although the optimal policies (for a fixed P) always have constant gain, the optimal minimax policies (over all possible P) might have non-constant gain. Additional assumptions on U, as well as a slight change in the definition of the regret, are needed to deal with these cases. This is left for future research.

Let P̃k(·|s, π̃k(s)) be the minimax choice of transition functions for each s where the minimax gain g^{π̃k} is attained. The bias hk can be obtained by solving the following system of equations for h(·) (see [Puterman, 1994]):

g^{π̃k} + h(s) = r(s, π̃k(s)) + P̃k(·|s, π̃k(s))h(·),   ∀s ∈ S.   (1)

The stochasticity check for the infinite-horizon case is mostly identical to the finite-horizon case, except that we replace T with the maximal span H̃ of the bias, defined as follows:

H̃ = max_{k ∈ {k_1,...,k_n}} ( max_s hk(s) − min_s hk(s) ).

The stochasticity check fails if:

Σ_{j=1}^{n} P̃_{k_j}(·|s, a)h_{k_j}(·) − Σ_{j=1}^{n} h_{k_j}(s'_j) > 5H̃ √( nS log(4SA τ²/δ) ).

Let H be the maximal span of the bias of any optimal minimax policy. The following theorem summarizes the performance of OLRM2. The proof, deferred to the supplementary material, is similar to that of Theorem 1.

Theorem 2. Given δ, S, A, the total regret of OLRM2 is

Δ(τ) ≤ Õ(SH √(Aτ))

for all τ, with probability at least 1 − δ.

6 Experiment

We run both our algorithm and UCRL2 on the example MDP in Figure 1 for the infinite-horizon case. Figure 4 shows the results for g* = 0.18, β = 0.07 and α = 0.17. UCRL2 accumulates a smaller total reward than the optimal minimax policy, while our algorithm actually accumulates a larger total reward than the minimax policy. We also include the result for a standard robust MDP approach that treats all state-action pairs as adversarial and therefore performs poorly. Additional details are provided in the supplementary material.

Figure 4: Total accumulated rewards. The vertical line marks the start of the “breakdown”.

7 Conclusion

We presented an algorithm for online learning of robust MDPs with unknown parameters, some of which can be adversarial. 
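Once the policy and its minimax transition choice are fixed, equation (1) is a linear system in the gain g and the bias h, made well-posed by a normalization such as h(s0) = 0. A minimal sketch of solving it for a given transition matrix and reward vector (the function name, the dense representation, and the use of plain Gaussian elimination are our own assumptions):

```python
def gain_and_bias(P, r):
    """Solve g + h(s) = r(s) + sum_s' P[s][s'] h(s') for a fixed policy on a
    unichain MDP, normalizing h[0] = 0.  P: transition matrix under the
    policy; r: per-state reward.  Returns (g, h)."""
    n = len(r)
    # Unknowns x = [g, h(1), ..., h(n-1)]; equation s reads
    #   g + sum_{j>=1} (delta_{sj} - P[s][j]) h(j) = r(s)   (h(0) drops out).
    A = [[1.0] + [(1.0 if s == j else 0.0) - P[s][j] for j in range(1, n)]
         for s in range(n)]
    b = list(r)
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= f * A[col][j]
            b[row] -= f * b[col]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = b[row] - sum(A[row][j] * x[j] for j in range(row + 1, n))
        x[row] = acc / A[row][row]
    return x[0], [0.0] + x[1:]
```

The unichain assumption made above is what guarantees this system has a unique solution once h(0) is pinned down.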
We show that it achieves a similar regret bound as in the fully stochastic case. A natural extension is to allow the learning of the uncertainty sets in adversarial states, where the true uncertainty set is unknown. Our preliminary results show that very similar regret bounds can be obtained for learning from a class of nested uncertainty sets.

Acknowledgments

This work is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112 and NUS startup grant R-265-000-384-133. The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 306638.

References

[Brafman and Tennenholtz, 2002] Brafman, R. I. and Tennenholtz, M. (2002). R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231.

[Bubeck and Slivkins, 2012] Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. Journal of Machine Learning Research - Proceedings Track, 23:42.1–42.23.

[Even-Dar et al., 2005] Even-Dar, E., Kakade, S. M., and Mansour, Y. (2005). Experts in a Markov decision process. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 401–408. MIT Press, Cambridge, MA.

[Even-Dar et al., 2009] Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online Markov decision processes. Math. Oper. Res., 34(3):726–736.

[Iyengar, 2005] Iyengar, G. N. (2005). Robust dynamic programming. Math. Oper. Res., 30(2):257–280.

[Jaksch et al., 2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 99:1563–1600.

[Mannor et al., 2012] Mannor, S., Mebel, O., and Xu, H. (2012). Lightning does not strike twice: Robust MDPs with coupled uncertainty. In ICML.

[Mannor et al., 2007] Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. (2007). Bias and variance approximation in value function estimates. Manage. Sci., 53(2):308–322.

[McDiarmid, 1989] McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, number 141 in London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press.

[Neu et al., 2012] Neu, G., György, A., and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. Journal of Machine Learning Research - Proceedings Track, 22:805–813.

[Neu et al., 2010] Neu, G., György, A., Szepesvári, C., and Antos, A. (2010). Online Markov decision processes under bandit feedback. In NIPS, pages 1804–1812.

[Nilim and El Ghaoui, 2005] Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Oper. Res., 53(5):780–798.

[Puterman, 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

[Strens, 2000] Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 943–950. ICML.

[Tewari and Bartlett, 2007] Tewari, A. and Bartlett, P. (2007). Bounded parameter Markov decision processes with average reward criterion. In Learning Theory, pages 263–277.

[Weissman et al., 2003] Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. (2003). Inequalities for the L1 deviation of the empirical distribution. Technical report, Information Theory Research Group, HP Laboratories.

[Xu and Mannor, 2012] Xu, H. and Mannor, S. (2012). Distributionally robust Markov decision processes. Math. Oper. Res., 37(2):288–300.

[Yu and Mannor, 2009] Yu, J. Y. and Mannor, S. (2009). Arbitrarily modulated Markov decision processes. In CDC, pages 2946–2953.

[Yu et al., 2009] Yu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Math. Oper. Res., 34(3):737–757.
", "award": [], "sourceid": 413, "authors": [{"given_name": "Shiau Hong", "family_name": "Lim", "institution": "National University of Singapore"}, {"given_name": "Huan", "family_name": "Xu", "institution": "NUS"}, {"given_name": "Shie", "family_name": "Mannor", "institution": "Technion"}]}