{"title": "Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2994, "page_last": 3004, "abstract": "While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This results in weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^c$ communicating states, $A$ actions and $\\Gamma^c \\leq S^c$ possible communicating next states, we derive a $O(D^c \\sqrt{\\Gamma^c S^c A T})$ regret bound, where $D^c$ is the diameter (i.e., the length of the longest shortest path between any two states) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration and achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. 
Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.", "full_text": "Near Optimal Exploration-Exploitation in\n\nNon-Communicating Markov Decision Processes\n\nRonan Fruit\n\nSequel Team - Inria Lille\nronan.fruit@inria.fr\n\nMatteo Pirotta\n\nSequel Team - Inria Lille\n\nmatteo.pirotta@inria.fr\n\nAlessandro Lazaric\nFacebook AI Research\n\nlazaric@fb.com\n\nAbstract\n\nWhile designing the state space of an MDP, it is common to include states that are\ntransient or not reachable by any policy (e.g., in mountain car, the product space of\nspeed and position contains con\ufb01gurations that are not physically reachable). This\nresults in weakly-communicating or multi-chain MDPs. In this paper, we introduce\nTUCRL, the \ufb01rst algorithm able to perform ef\ufb01cient exploration-exploitation in\nany \ufb01nite Markov Decision Process (MDP) without requiring any form of prior\nknowledge. In particular, for any MDP with SC communicating states, A actions\n\nand \u0393C \u2264 SC possible communicating next states, we derive a (cid:101)O(DC\u221a\u0393CSCAT )\n\nregret bound, where DC is the diameter (i.e., the length of the longest shortest\npath between any two states) of the communicating part of the MDP. This is in\ncontrast with existing optimistic algorithms (e.g., UCRL, Optimistic PSRL) that\nsuffer linear regret in weakly-communicating MDPs, as well as posterior sampling\nor regularised algorithms (e.g., REGAL), which require prior knowledge on the bias\nspan of the optimal policy to achieve sub-linear regret. We also prove that in weakly-\ncommunicating MDPs, no algorithm can ever achieve a logarithmic growth of the\nregret without \ufb01rst suffering a linear regret for a number of steps that is exponential\nin the parameters of the MDP. 
Finally, we report numerical simulations supporting\nour theoretical \ufb01ndings and showing how TUCRL overcomes the limitations of the\nstate-of-the-art.\n\nIntroduction\n\n1\nReinforcement learning (RL) [1] studies the problem of learning in sequential decision-making\nproblems where the dynamics of the environment is unknown, but can be learnt by performing\nactions and observing their outcome in an online fashion. A sample-ef\ufb01cient RL agent must trade\noff the exploration needed to collect information about the environment, and the exploitation of\nthe experience gathered so far to gain as much reward as possible. In this paper, we focus on the\nregret framework in in\ufb01nite-horizon average-reward problems [2], where the exploration-exploitation\nperformance is evaluated by comparing the rewards accumulated by the learning agent and an optimal\npolicy. Jaksch et al. [2] showed that it is possible to ef\ufb01ciently solve the exploration-exploitation\ndilemma using the optimism in face of uncertainty (OFU) principle. OFU methods build con\ufb01dence\nintervals on the dynamics and reward (i.e., construct a set of plausible MDPs), and execute the optimal\npolicy of the \u201cbest\u201d MDP in the con\ufb01dence region [e.g., 2, 3, 4, 5, 6]. An alternative approach is\nposterior sampling (PS) [7], which maintains a posterior distribution over MDPs and, at each step,\nsamples an MDP and executes the corresponding optimal policy [e.g., 8, 9, 10, 11, 12].\nWeakly-communicating MDPs and misspeci\ufb01ed states. One of the main limitations of UCRL [2]\nand optimistic PSRL [12] is that they require the MDP to be communicating so that its diameter\nD (i.e., the length of the longest path among all shortest paths between any pair of states) is \ufb01nite.\nWhile assuming that all states are reachable may seem a reasonable assumption, it is rarely veri\ufb01ed in\npractice. 
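For a finite MDP, the communicating requirement discussed above can be checked directly: the MDP is communicating (and hence has a finite diameter D) exactly when the directed graph with an edge s → s' whenever some action reaches s' with positive probability is strongly connected. The sketch below is an illustrative check on a toy MDP, not the authors' code; the dictionary-based encoding of the transition kernel is our own assumption.

```python
from collections import deque

def support_graph(P):
    """Edges s -> s' whenever some action reaches s' with positive probability.
    P[s][a] is a dict {next_state: probability}."""
    adj = {s: set() for s in P}
    for s, actions in P.items():
        for dist in actions.values():
            adj[s].update(s2 for s2, prob in dist.items() if prob > 0)
    return adj

def reachable_from(adj, s0):
    """Breadth-first search over the support graph."""
    seen, queue = {s0}, deque([s0])
    while queue:
        s = queue.popleft()
        for s2 in adj[s] - seen:
            seen.add(s2)
            queue.append(s2)
    return seen

def is_communicating(P):
    """True iff every state is reachable from every other state (finite diameter)."""
    adj = support_graph(P)
    return all(reachable_from(adj, s) == set(P) for s in P)

# Two toy MDPs: a 2-state cycle (communicating) and a chain whose
# second state can never be entered (weakly-communicating).
P_comm = {0: {0: {1: 1.0}}, 1: {0: {0: 1.0}}}
P_weak = {0: {0: {0: 1.0}}, 1: {0: {0: 1.0}}}
```

For `P_weak`, state 1 is exactly the kind of "misspecified" state the paper discusses: it is in the designed state space but unreachable from state 0.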
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(a) Breakout (b) Mountain Car

Figure 1: Examples of non-communicating domains. Fig. 1b represents a phase plane plot of the Mountain Car domain (x, ẋ) ∈ [−1.2, 0.6] × [−0.07, 0.07]. The initial state is (−0.5, 0) and the red area corresponds to non-reachable states from the initial state. Other non-reachable states may exist. Fig. 1a shows the initial state, one reachable state (middle) and an unreachable one (right).

In fact, it requires a designer to carefully define a state space S that contains all reachable states (otherwise it may not be possible to learn the optimal policy), but excludes unreachable states (otherwise the resulting MDP would be non-communicating). This requires a considerable amount of prior knowledge about the environment. Consider a problem where we learn from images, e.g., the Atari Breakout game [13]. The state space is the set of "plausible" configurations of the brick wall, ball and paddle positions. The situation in which the wall has a hole in the middle is a valid state (e.g., as an initial state) but it cannot be observed/reached starting from a dense wall (see Fig. 1a). As such, it should be removed to obtain a "well-designed" state space. While it may be possible to design a suitable set of "reachable" states that define a communicating MDP, this is often a difficult and tedious task, sometimes even impossible. Now consider a continuous domain, e.g., the Mountain Car problem [14]. The state is described by the position x and velocity ẋ along the x-axis. The state space of this domain is usually defined as the cartesian product [−1.2, 0.6] × [−0.07, 0.07]. Unfortunately, this set contains configurations that are not physically reachable, as shown on Fig.
1b. The dynamics of the system is constrained by the evolution equations, so the car cannot go arbitrarily fast. At the leftmost position (x = −1.2) the speed ẋ cannot exceed 0, since that position can be reached only with velocity ẋ ≤ 0. To have a higher velocity, the car would need to acquire momentum from further left (i.e., x < −1.2), which is impossible by design (−1.2 is the left boundary of the position domain). The maximal speed reachable for x > −1.2 can be attained by applying the maximum acceleration at every time step starting from the state (x, ẋ) = (−1.2, 0). This identifies the curve reported in Fig. 1b, which denotes the boundary of the unreachable region. Note that other states may not be reachable. Whenever the state space is misspecified or the MDP is weakly communicating (i.e., D = +∞), OFU-based algorithms (e.g., UCRL) optimistically attribute large reward and non-zero probability to reaching states that have never been observed, and thus they tend to repeatedly attempt to explore unreachable states. This results in poor performance and linear regret. A first attempt to overcome this major limitation is REGAL.C [3] (Fruit et al. [6] recently proposed SCAL, an implementable efficient version of REGAL.C), which requires prior knowledge of an upper bound H on the span (i.e., range) of the optimal bias function h*. The optimism of UCRL is then "constrained" to policies whose bias has span smaller than H. This implicitly "removes" non-reachable states, whose large optimistic reward would cause the span to become too large. Unfortunately, an accurate knowledge of the bias span may not be easier to obtain than designing a well-specified state space. Bartlett and Tewari [3] proposed an alternative algorithm –REGAL.D– that leverages the doubling trick [15] to avoid any prior knowledge on the span.
Nonetheless,\nwe recently noticed a major \ufb02aw in the proof of [3, Theorem 3] that questions the validity of the\nalgorithm (see App. A for further details). PS-based algorithms also suffer from similar issues.1 To\nthe best of our knowledge, the only regret guarantees available in the literature for this setting are\n[17, 18, 19]. However, the counter-example of Osband and Roy [20] seems to invalidate the result of\nAbbasi-Yadkori and Szepesv\u00e1ri [17]. On the other hand, Ouyang et al. [18] and Theocharous et al.\n[19] present PS algorithms with expected Bayesian regret scaling linearly with H, where H is an\nupper-bound on the optimal bias spans of all the MDPs that can be drawn from the prior distribution\n([18, Asm. 1] and [19, Sec. 5]). In [18, Remark 1], the authors claim that their algorithm does not\nrequire the knowledge of H to derive the regret bound. However, in App. B we show on a very simple\nexample that for most continuous prior distributions (e.g., uninformative priors like Dirichlet), it is\nvery likely that H = +\u221e implying that the regret bound may not hold (similarly for [19]). 
As a result, similarly to REGAL.C, the prior distribution should contain prior knowledge on the bias span to avoid poor performance.

In this paper, we present TUCRL, an algorithm designed to trade off exploration and exploitation in weakly-communicating and multi-chain MDPs (e.g., MDPs with misspecified states) without any prior knowledge and under the only assumption that the agent starts from a state in a communicating subset of the MDP (Sec. 3). In communicating MDPs, TUCRL eventually (after a finite number of steps) performs as UCRL, thus achieving problem-dependent logarithmic regret. When the true MDP is weakly-communicating, we prove that TUCRL achieves a $\tilde{O}(\sqrt{T})$ regret with polynomial dependency on the MDP parameters. We also show that it is not possible to design an algorithm achieving logarithmic regret in weakly-communicating MDPs without having an exponential dependence on the MDP parameters (see Sec. 5). TUCRL is the first computationally tractable algorithm in the OFU literature that is able to adapt to the MDP nature without any prior knowledge. The theoretical findings are supported by experiments on several domains (see Sec. 4).

¹We notice that the problem of weakly-communicating MDPs and misspecified states does not hold in the more restrictive setting of finite horizon [e.g., 8], since exploration is directly tailored to the states that are reachable within the known horizon, or under the assumption of the existence of a recurrent state [e.g., 16].

2 Preliminaries

We consider a finite weakly-communicating Markov decision process [21, Sec. 8.3] M = ⟨S, A, r, p⟩ with a set of states S and a set of actions A = ∪_{s∈S} A_s.
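The tuple ⟨S, A, r, p⟩ just introduced can be encoded concretely as two arrays. A minimal sketch, assuming a dense `numpy` convention r[s, a] and p[s, a, s'] (our own choice, not the paper's code), with the basic sanity check that every p(·|s, a) is a probability distribution:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP <S, A, r, p> in a dense encoding (illustrative convention):
    r[s, a] is the mean reward in [0, r_max]; p[s, a, s'] the transition kernel."""
    r: np.ndarray  # shape (S, A)
    p: np.ndarray  # shape (S, A, S)

    def __post_init__(self):
        S, A = self.r.shape
        assert self.p.shape == (S, A, S), "kernel must be (S, A, S)"
        assert (self.p >= 0).all(), "probabilities must be non-negative"
        assert np.allclose(self.p.sum(axis=2), 1.0), "each p(.|s,a) must sum to 1"

# A 2-state, 1-action toy: each action deterministically swaps the state.
p = np.zeros((2, 1, 2))
p[0, 0, 1] = 1.0
p[1, 0, 0] = 1.0
mdp = FiniteMDP(r=np.array([[1.0], [0.0]]), p=p)
```

The validation in `__post_init__` catches the most common modelling bug (rows of the kernel not summing to one) before any planning or learning is run on the model.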
Each state-action pair (s, a) ∈ S × A_s is characterized by a reward distribution with mean r(s, a) and support in [0, r_max], as well as a transition probability distribution p(·|s, a) over next states. In a weakly-communicating MDP, the state space S can be partitioned into two subspaces [21, Section 8.3.1]: a communicating set of states (denoted S^C in the rest of the paper), with each state in S^C accessible –with non-zero probability– from any other state in S^C under some stationary deterministic policy, and a –possibly empty– set of states that are transient under all policies (denoted S^T). We also denote by S = |S|, S^C = |S^C| and A = max_{s∈S} |A_s| the number of states and actions, and by Γ^C = max_{s∈S^C, a∈A} ||p(·|s, a)||_0 the maximum support of all transition probabilities p(·|s, a) with s ∈ S^C. The sets S^C and S^T form a partition of S, i.e., S^C ∩ S^T = ∅ and S^C ∪ S^T = S. A deterministic policy π: S → A maps states to actions and it has an associated long-term average reward (or gain) and a bias function defined as

$g^\pi_M(s) := \lim_{T \to \infty} \mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^T r\big(s_t, \pi(s_t)\big)\Big]; \qquad h^\pi_M(s) := \underset{T \to \infty}{\text{C-lim}}\ \mathbb{E}\Big[\sum_{t=1}^T \big(r(s_t, \pi(s_t)) - g^\pi_M(s_t)\big)\Big],$

where the bias h^π_M(s) measures the expected total difference between the rewards accumulated by π starting from s and the stationary reward in Cesàro-limit² (denoted C-lim). Accordingly, the difference of bias values h^π_M(s) − h^π_M(s') quantifies the (dis-)advantage of starting in state s rather than s'. In the following, we drop the dependency on M whenever clear from the context and denote by sp_S{h^π} := max_{s∈S} h^π(s) − min_{s∈S} h^π(s) the span of the bias function.
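For a unichain policy, the gain and bias above solve the evaluation equations g·e + h = r + P h, with h pinned at a reference state since the bias is only defined up to an additive constant (its span is unaffected by the choice). A small illustrative solver, under the assumption that the policy's chain is unichain so the linear system is non-singular; this is standard average-reward theory, not code from the paper:

```python
import numpy as np

def gain_bias(P, r, ref=0):
    """Solve g*1 + h = r + P h with the normalization h[ref] = 0.
    P: (n, n) transition matrix of a unichain policy; r: (n,) mean rewards.
    Returns (g, h); the span of the bias is h.max() - h.min()."""
    n = len(r)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = np.eye(n) - P   # (I - P) h
    A[:n, n] = 1.0              # + g * 1
    A[n, ref] = 1.0             # normalization row: h[ref] = 0
    b = np.concatenate([r, [0.0]])
    x = np.linalg.solve(A, b)
    return x[n], x[:n]          # gain g, bias vector h

# Toy chain: uniform transitions, reward 1 only in state 0.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([1.0, 0.0])
g, h = gain_bias(P, r)  # g = 0.5, h = [0, -1], span(h) = 1
```

Here the span of 1 quantifies exactly the advantage of starting in state 0 rather than state 1, as described in the text.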
In weakly-communicating MDPs, any optimal policy π* ∈ arg max_π g^π(s) has constant gain, i.e., g^{π*}(s) = g* for all s ∈ S. Finally, we denote by D, resp. D^C, the diameter of M, resp. the diameter of the communicating part of M (i.e., restricted to the set S^C):

$D := \max_{(s,s') \in S \times S,\, s \neq s'} \{\tau_M(s \to s')\}, \qquad D^C := \max_{(s,s') \in S^C \times S^C,\, s \neq s'} \{\tau_M(s \to s')\}, \qquad (1)$

where τ_M(s → s') is the expected time of the shortest path from s to s' in M.

Learning problem. Let M* be the true (unknown) weakly-communicating MDP. We consider the learning problem where S, A and r_max are known, while the sets S^C and S^T, the rewards r and the transition probabilities p are unknown and need to be estimated on-line. We evaluate the performance of a learning algorithm A after T time steps by its cumulative regret $\Delta(\mathfrak{A}, T) = T g^* - \sum_{t=1}^T r_t(s_t, a_t)$. Furthermore, we state the following assumption.

Assumption 1. The initial state s1 belongs to the communicating set of states S^C.

While this assumption somehow restricts the scenario we consider, it is fairly common in practice. For example, all the domains that are characterized by the presence of a resetting distribution (e.g., episodic problems) satisfy this assumption (e.g., mountain car, cart pole, Atari games, taxi, etc.).

Multi-chain MDPs.
While we consider weakly-communicating MDPs for ease of notation, all our results extend to the more general case of multi-chain MDPs.³ In this case, there may be multiple communicating and transient sets of states, and the optimal gain g* is different in each communicating subset. We then define S^C as the set of states that are accessible –with non-zero probability– from s1 (s1 included) under some stationary deterministic policy, and S^T as the complement of S^C in S, i.e., S^T := S \ S^C. With these new definitions of S^C and S^T, Asm. 1 needs to be reformulated as follows:

Assumption 1 for multi-chain MDPs. The initial state s1 is accessible –with non-zero probability– from any other state in S^C under some stationary deterministic policy. Equivalently, S^C is a communicating set of states.

Note that the states belonging to S^T can either be transient or belong to other communicating subsets of the MDP disjoint from S^C. This distinction does not matter because the states in S^T will never be visited by definition.

²For policies whose associated Markov chain is aperiodic, the standard limit exists.
³In the case of misspecified states, we implicitly define a multi-chain MDP, where each non-reachable state has a self-loop dynamics and defines a "singleton" communicating subset.
As a result, the regret is still defined as before, where the learning performance is compared to the optimal gain g*(s1) related to the communicating set of states S^C ∋ s1.

3 Truncated Upper-Confidence for Reinforcement Learning (TUCRL)

In this section we introduce Truncated Upper-Confidence for Reinforcement Learning (TUCRL), an optimistic online RL algorithm that efficiently balances exploration and exploitation to learn in non-communicating MDPs without prior knowledge (Fig. 2).

Similar to UCRL, at the beginning of each episode k, TUCRL constructs confidence intervals for the reward and the dynamics of the MDP. Formally, for any (s, a) ∈ S × A we define

$B_{p,k}(s,a) = \big\{\tilde{p}(\cdot|s,a) \in \mathcal{C} : \forall s' \in S,\ |\tilde{p}(s'|s,a) - \hat{p}_k(s'|s,a)| \leq \beta^{sas'}_{p,k}\big\}, \qquad (2)$
$B_{r,k}(s,a) := [\hat{r}_k(s,a) - \beta^{sa}_{r,k},\ \hat{r}_k(s,a) + \beta^{sa}_{r,k}] \cap [0, r_{\max}], \qquad (3)$

where $\mathcal{C} = \{p \in \mathbb{R}^S \mid \forall s',\ p(s') \geq 0 \wedge \sum_{s'} p(s') = 1\}$ is the (S−1)-probability simplex, while the size of the confidence intervals is constructed using the empirical Bernstein's inequality [22, 23] as

$\beta^{sa}_{r,k} := \sqrt{\frac{14 \hat{\sigma}^2_{r,k}(s,a)\, b_{k,\delta}}{N^+_k(s,a)}} + \frac{49}{3} \frac{r_{\max}\, b_{k,\delta}}{N^\pm_k(s,a)}, \qquad \beta^{sas'}_{p,k} := \sqrt{\frac{14 \hat{\sigma}^2_{p,k}(s'|s,a)\, b_{k,\delta}}{N^+_k(s,a)}} + \frac{49}{3} \frac{b_{k,\delta}}{N^\pm_k(s,a)},$

where N_k(s, a) is the number of visits in (s, a) before episode k, N^+_k(s, a) := max{1, N_k(s, a)}, N^±_k(s, a) := max{1, N_k(s, a) − 1}, $\hat{\sigma}^2_{r,k}(s,a)$ and $\hat{\sigma}^2_{p,k}(s'|s,a)$ are the empirical variances of r(s, a) and p(s'|s, a), and b_{k,δ} = ln(2SAt_k/δ). The set of plausible MDPs associated with the confidence intervals is then $\mathcal{M}_k = \{M = (S, A, \tilde{r}, \tilde{p}) : \tilde{r}(s,a) \in B_{r,k}(s,a),\ \tilde{p}(\cdot|s,a) \in B_{p,k}(s,a)\}$. UCRL is optimistic w.r.t. the confidence intervals, so that for all states s that have never been visited the optimistic reward $\tilde{r}(s,a)$ is set to r_max, while all transitions to s (i.e., $\tilde{p}(s|\cdot,\cdot)$) are set to the largest value compatible with B_{p,k}(·,·). Unfortunately, some of the states with N_k(s, a) = 0 may actually be unreachable (i.e., s ∈ S^T) and UCRL would uniformly explore the policy space with the hope that at least one policy reaches those (optimistically desirable) states. TUCRL addresses this issue by first constructing empirical estimates of S^C and S^T (i.e., the sets of communicating and transient states in M*) using the states that have been visited so far, that is $S^C_k := \{s \in S \mid \sum_{a \in A_s} N_k(s,a) > 0\} \cup \{s_{t_k}\}$ and $S^T_k := S \setminus S^C_k$, where t_k is the starting time of episode k.

In order to avoid optimistic exploration attempts to unreachable states, we could simply execute UCRL on S^C_k, which is guaranteed to contain only states in the communicating set (since s1 ∈ S^C by Asm. 1, we have that S^C_k ⊆ S^C). Nonetheless, this algorithm could under-explore state-action pairs that would allow discovering other states in S^C, thus getting stuck in a subset of the communicating states of the MDP and suffering linear regret. While the states in S^C_k are guaranteed to be in the communicating subset, it is not possible to know whether states in S^T_k are actually reachable from S^C_k or not. Then TUCRL first "guesses" a lower bound on the probability of transition from states s ∈ S^C_k to s' ∈ S^T_k and, whenever the maximum transition probability from s to s' compatible with the confidence intervals (i.e., $\hat{p}_k(s'|s,a) + \beta^{sas'}_{p,k}$) is below the lower bound, it assumes that such transition is not possible. This strategy is based on the intuition that a transition either does not exist or it should have a sufficiently "big" mass.
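The "guess and truncate" idea can be sketched as a simple mask over the optimistic transition probabilities of a fixed (s, a) pair: upper bounds to never-visited states are zeroed whenever even their most optimistic value falls below the threshold, while already-explored states keep their confidence intervals untouched. The array encoding and function name below are our own illustration, not the paper's implementation:

```python
import numpy as np

def truncate_unseen(p_hat, beta_p, visited, rho):
    """Optimistic transition envelope with TUCRL-style truncation.
    p_hat:   empirical means p_hat(s'|s,a) over next states, for a fixed (s, a)
    beta_p:  confidence widths beta^{sas'}_{p,k}
    visited: boolean mask, True for states already observed (in S^C_k)
    rho:     truncation threshold rho_{t_k}"""
    upper = p_hat + beta_p               # most optimistic transition probability
    keep = visited | (upper >= rho)      # explored states are never truncated
    return np.where(keep, upper, 0.0)    # plausible-but-tiny unseen transitions -> 0

p_hat = np.array([0.5, 0.0, 0.0])            # only state 0 has been observed
beta = np.array([0.1, 0.2, 0.05])
visited = np.array([True, False, False])
env = truncate_unseen(p_hat, beta, visited, rho=0.1)  # -> [0.6, 0.2, 0.0]
```

The second state survives (its optimistic mass 0.2 is above the threshold, so the transition is still "plausible"), while the third is cut: if it existed, its probability would most likely be too small to be worth exploring now.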
However, these transitions should be periodically reconsidered in order to avoid under-exploration issues. More formally, let (ρ_t)_{t∈N} be a non-increasing sequence to be defined later. For all s' ∈ S^T_k, s ∈ S^C_k and a ∈ A_s, the empirical mean $\hat{p}_k(s'|s,a)$ and variance $\hat{\sigma}^2_{p,k}(s'|s,a)$ are zero (i.e., this transition has never been observed so far), so the largest probability (most optimistic) of transition from s to s' through any action a is $\tilde{p}^+_k(s'|s,a) = \frac{49}{3} \frac{b_{k,\delta}}{N^\pm_k(s,a)}$. TUCRL compares $\tilde{p}^+_k(s'|s,a)$ to $\rho_{t_k}$ and forces all transition probabilities below the threshold to zero, while the confidence intervals of transitions to states that have already been explored (i.e., in S^C_k) are preserved unchanged. This corresponds to constructing the alternative confidence interval

$\bar{B}_{p,k}(s,a) = B_{p,k}(s,a) \cap \big\{\tilde{p}(\cdot|s,a) \in \mathcal{C} : \forall s' \in S^T_k \text{ s.t. } \tilde{p}^+_k(s'|s,a) < \rho_{t_k},\ \tilde{p}(s'|s,a) = 0\big\}. \qquad (4)$

Given $\bar{B}_{p,k}$, TUCRL (implicitly) constructs the corresponding set of plausible MDPs $\bar{\mathcal{M}}_k$ and then solves the optimistic optimization problem

$(\tilde{M}_k, \tilde{\pi}_k) = \arg\max_{M \in \bar{\mathcal{M}}_k,\, \pi} \{g^\pi_M\}. \qquad (5)$

The resulting algorithm follows the same structure as UCRL and is shown in Fig. 2. The episode stopping condition at line 4 is slightly modified w.r.t. UCRL. In fact, it guarantees that one action is always executed and it forces an episode to terminate as soon as a state previously in S^T_k is visited (i.e., N_k(s_t, a) = 0). This minor change guarantees that N_{k+1}(s, a) = 0 for all the states s ∈ S^T_k that were not reachable at the beginning of the episode. The algorithm also needs minor modifications to the extended value iteration (EVI) algorithm used to solve (5) to guarantee both efficiency and convergence. All technical details are reported in App. C.

Input: Confidence δ ∈ ]0, 1[, r_max, S, A
Initialization: Set N_0(s, a) := 0 for any (s, a) ∈ S × A, t := 1 and observe s1.
For episodes k = 1, 2, ... do
1. Set t_k = t and episode counters ν_k(s, a) = 0
2. Compute estimates $\hat{p}_k(s'|s,a)$, $\hat{r}_k(s,a)$ and a set $\bar{\mathcal{M}}_k$
3. Compute an $r_{\max}/\sqrt{t_k}$-approximation $\tilde{\pi}_k$ of Eq. 5
4. While t_k == t or $\big(\sum_{a \in A_{s_t}} N_k(s_t, a) > 0$ and $\nu_k(s_t, \tilde{\pi}_k(s_t)) \leq \max\{1, N_k(s_t, \tilde{\pi}_k(s_t))\}\big)$ do
   (a) Execute $a_t = \tilde{\pi}_k(s_t)$, obtain reward r_t, and observe s_{t+1}
   (b) Set ν_k(s_t, a_t) += 1 and set t += 1
5. Set N_{k+1}(s, a) = N_k(s, a) + ν_k(s, a)

Figure 2: TUCRL algorithm.

In practice, we set $\rho_t = \frac{49}{3} b_{t,\delta} \sqrt{\frac{SA}{t}}$, so that the condition to remove a transition reduces to $N^\pm_k(s,a) > \sqrt{t_k/SA}$. This shows that only transitions from state-action pairs that have been poorly visited so far are enabled, while if the state-action pair has already been tried often and yet no transition to s' ∈ S^T_k is observed, then it is assumed that s' is not reachable from (s, a). When the number of visits in (s, a) is big, the transitions to "unvisited" states should be discarded because if the transition actually exists, it is most likely extremely small and so it is worth exploring other parts of the MDP first.
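The stated reduction follows by substituting $\tilde{p}^+ = \frac{49}{3} b_{k,\delta}/N^\pm$ and $\rho_t = \frac{49}{3} b_{t,\delta}\sqrt{SA/t}$: the common factor $\frac{49}{3} b$ cancels, leaving $1/N^\pm < \sqrt{SA/t}$, i.e., $N^\pm > \sqrt{t/SA}$. A small sketch checking this equivalence numerically (the function name is our own):

```python
import math

def transition_enabled(N_pm, t, S, A, delta):
    """Is a transition to a never-visited state still enabled at time t?
    N_pm = N±(s,a); per the paper it is disabled iff N±(s,a) > sqrt(t/(S*A))."""
    b = math.log(2 * S * A * t / delta)             # b_{t,delta} = ln(2SAt/delta)
    p_plus = (49.0 / 3.0) * b / N_pm                # most optimistic unseen transition
    rho = (49.0 / 3.0) * b * math.sqrt(S * A / t)   # threshold rho_t
    # the two formulations of the test must agree for every input:
    assert (p_plus < rho) == (N_pm > math.sqrt(t / (S * A)))
    return p_plus >= rho

# With t = 100, S = 5, A = 2: sqrt(t/SA) ~ 3.16, so a pair visited once keeps
# its unseen transitions enabled, while a pair visited 10 times has them cut.
```

This makes the schedule concrete: early on (or for rarely tried pairs) unseen transitions remain plausible; once a pair has been tried roughly $\sqrt{t/SA}$ times with no such transition observed, they are truncated until t grows enough to re-enable them.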
Symmetrically, when the number of visits in (s, a) is small, the transitions to "unvisited" states should be enabled because the transitions are quite plausible and the algorithm should try to explore the outcome of taking action a in s and possibly reach states in S^T_k. We denote the set of state-action pairs that are not sufficiently explored by $K_k = \big\{(s,a) \in S^C_k \times A : N^\pm_k(s,a) \leq \sqrt{t_k/SA}\big\}$.

3.1 Analysis of TUCRL

We prove that the regret of TUCRL is bounded as follows.

Theorem 1. For any weakly communicating MDP M, with probability at least 1 − δ it holds that for any T > 1, the regret of TUCRL is bounded as

$\Delta(\text{TUCRL}, T) = O\Big(r_{\max} D^C \sqrt{\Gamma^C S^C A T \ln\big(\tfrac{SAT}{\delta}\big)} + r_{\max} \big(D^C\big)^2 S^3 A \ln^2\big(\tfrac{SAT}{\delta}\big)\Big).$

The first term in the regret shows the ability of TUCRL to adapt to the communicating part of the true MDP M* by scaling with the communicating diameter D^C and MDP parameters S^C and Γ^C. The second term corresponds to the regret incurred in the early stage where the regret grows linearly. When M* is communicating, we match the square-root term of UCRL (first term), while the second term is bigger than the one appearing in UCRL by a multiplicative factor D^C·S (ignoring logarithmic terms, see Sec. 5).

We now provide a sketch of the proof of Thm. 1 (the full proof is reported in App. D). In order to preserve readability, all following inequalities should be interpreted up to minor approximations and in high probability.

Let $\Delta_k := \sum_{s,a} \nu_k(s,a)(g^* - r(s,a))$ be the regret incurred in episode k, where ν_k(s, a) is the number of visits to (s, a) in episode k. We decompose the regret as

$\Delta(\text{TUCRL}, T) \lesssim \sum_{k=1}^m \Delta_k \cdot \mathbb{1}\{M^* \in \mathcal{M}_k\} \lesssim \sum_{k=1}^m \Delta_k \cdot \mathbb{1}\{t_k < C(k)\} + \sum_{k=1}^m \Delta_k \cdot \mathbb{1}\{t_k \geq C(k)\},$

where $C(k) = O\big((D^C)^2 S^3 A \ln^2(2SAt_k/\delta)\big)$ defines the length of a full exploratory phase, where the agent may suffer linear regret.

Optimism. The first technical difficulty is that whenever some transitions are disabled, the plausible set of MDPs $\bar{\mathcal{M}}_k$ may actually be biased and not contain the true MDP M*. This requires to prove that TUCRL (i.e., the gain of the solution returned by EVI) is always optimistic despite "wrong" confidence intervals. The following lemma helps to identify the possible scenarios that TUCRL can produce (see App. D.2).⁴

Lemma 1. Let episode k be such that M* ∈ M_k, S^T_k ≠ ∅ and t_k ≥ C(k). Then, either S^T_k = S^T (case I) or K_k ≠ ∅, i.e., ∃(s, a) ∈ S^C_k × A for which transitions to S^T_k are allowed (case II).

This result basically excludes the case where S^T_k ⊃ S^T (i.e., some states have not been reached) and yet no transition from S^C_k to them is enabled. We start noticing that when S^T_k = ∅, the true MDP M* ∈ $\bar{\mathcal{M}}_k$ = M_k w.h.p. by construction of the confidence intervals. Similarly, if S^T_k = S^T then M* ∈ $\bar{\mathcal{M}}_k$ w.h.p., since TUCRL only truncates transitions that are indeed forbidden in M* itself. In both cases, we can use the same arguments as in [2] to prove optimism. In case II the gain of any state s' ∈ S^T_k is set to r_max and, since there exists a path from S^C_k to S^T_k, the gain of the solution returned by EVI is r_max, which makes it trivially optimistic. As a result we can conclude that $\tilde{g}_k \gtrsim g^*$ (up to the precision of EVI).

⁴Notice that M* ∈ M_k is true w.h.p. since M_k is obtained using non-truncated confidence intervals.

Per-episode regret. After bounding the optimistic reward $\tilde{r}_k(s,a)$ w.r.t. r(s, a), the only part left to bound the per-episode regret Δ_k is the term $\tilde{\Delta}_k = \sum_{s,a} \nu_k(s,a)(\tilde{g}_k - \tilde{r}_k(s,a))$. Similar to UCRL, we could use the (optimistic) optimality equation and rewrite $\tilde{\Delta}_k$ as

$\tilde{\Delta}_k = \sum_{s \in S} \nu_k(s, \tilde{\pi}_k(s)) \Big(\sum_{s' \in S} \tilde{p}_k(s'|s, \tilde{\pi}_k(s)) \tilde{h}_k(s') - \tilde{h}_k(s)\Big) = \nu_k' \big(\tilde{P}_k - I\big) w_k, \qquad (6)$

where $w_k := \tilde{h}_k - \min_{s \in S}\{\tilde{h}_k\} e$ is a shifted version of the vector $\tilde{h}_k$ returned by EVI at episode k, and then proceed by bounding the difference between $\tilde{P}_k$ and P_k using standard concentration inequalities. Nonetheless, we would be left with the problem of bounding the ℓ∞ norm of w_k (i.e., the range of the optimistic vector $\tilde{h}_k$) over the whole state space, i.e., $\|w_k\|_\infty = \text{sp}_S\{\tilde{h}_k\} = \max_{s \in S} \tilde{h}_k(s) - \min_{s \in S} \tilde{h}_k(s)$. While in communicating MDPs it is possible to bound this quantity by the diameter of the MDP as sp_S{h_k} ≤ D [2, Sec. 4.3], in weakly-communicating MDPs D = +∞, thus making this result uninformative. As a result, we need to restrict our attention to the subset of communicating states S^C, where the diameter is finite. We then split the per-step regret over states depending on whether they are explored enough or not as $\Delta_k \lesssim \sum_{s,a} \nu_k(s,a)(\tilde{g}_k - \tilde{r}_k(s,a)) \mathbb{1}\{(s,a) \notin K_k\} + r_{\max} \sum_{s,a} \nu_k(s,a) \mathbb{1}\{(s,a) \in K_k\}$. We start focusing on the poorly visited state-action pairs, i.e., (s, a) ∈ K_k. In this case TUCRL may suffer the maximum per-step regret r_max, but the number of times this event happens is cumulatively "small" (App. D.4.1):

Lemma 2. For any T ≥ 1 and any sequence of states and actions {s1, a1, ..., sT, aT} we have:

$\sum_{k=1}^m \underbrace{\sum_{s,a} \nu_k(s,a) \mathbb{1}\{N^\pm_k(s,a) \leq \sqrt{t_k/SA}\}}_{(s,a) \in K_k} \leq \sum_{t=1}^T \mathbb{1}\big\{N^\pm_{k_t}(s_t, a_t) \leq \sqrt{t/SA}\big\} \leq 2\big(\sqrt{S^C A T} + S^C A\big)$

Figure 3: Cumulative regret in the taxi with misspecified states (left-top) and in the communicating taxi (left-bottom), and in the weakly communicating three-states domain with D = +∞ (right). Confidence intervals β_{r,k} and β_{p,k} are shrunk by a factor 0.05 and 0.01 for the three-states domain and taxi, respectively. Results are averaged over 20 runs and 95% confidence intervals are reported.

When (s, a) ∉ K_k (i.e., $N^\pm_k(s,a) > \sqrt{t_k/SA}$ holds), $\sum_{s,a} \nu_k(s,a)(\tilde{g}_k - \tilde{r}_k(s,a)) \cdot \mathbb{1}\{(s,a) \notin K_k\}$ can be bounded as in Eq. 6 but now restricted on S^C_k, so that

$\nu_k \big(\tilde{P}_k - I\big) \tilde{h}_k = \sum_{s \in S^C_k} \nu_k(s, \tilde{\pi}_k(s)) \Big(\sum_{s' \in S^C_k} \tilde{p}_k(s'|s, \tilde{\pi}_k(s)) w_k(s') - w_k(s)\Big).$

Since the stopping condition guarantees that $\nu_k(s, \tilde{\pi}_k(s)) = 0$ for all s ∈ S^T_k, we can first restrict the outer summation to states in S^C.
Furthermore, all state-action pairs $(s,a) \notin K_k$ are such that the optimistic transition probability $\widetilde{p}_k(s'|s,a)$ is forced to zero for all $s' \in S^T_k$, thus reducing the inner summation. We are then left with providing a bound for the range of $w_k$ restricted to the states in $S^C_k$, i.e., $\mathrm{sp}_{S^C_k}\{w_k\} = \max_{s \in S^C_k} w_k(s) - \min_{s \in S^C_k} w_k(s)$. We recall that EVI run on a set of plausible MDPs $\mathcal{M}_k$ returns a function $\widetilde{h}_k$ such that $\widetilde{h}_k(s') - \widetilde{h}_k(s) \le r_{\max} \cdot \tau_{\mathcal{M}_k}(s \to s')$ for any pair $s, s' \in S$, where $\tau_{\mathcal{M}_k}(s \to s')$ is the expected shortest path in the extended MDP $\mathcal{M}_k$. Furthermore, since $M^* \in \mathcal{M}_k$ (see footnote 4), for all $s, s' \in S^C_k$, $\tau_{\mathcal{M}_k}(s \to s') \le D^C$. Unfortunately, since $M^*$ may not belong to the truncated set $\mathcal{M}^{\pm}_k$ used by TUCRL, the bound on the shortest path in $\mathcal{M}_k$ (i.e., $\tau_{\mathcal{M}_k}(s \to s')$) may not directly translate into a bound for the shortest path in $\mathcal{M}^{\pm}_k$, thus preventing us from bounding the range of $\widetilde{h}_k$ even on the subset of states $S^C_k$. Nonetheless, in App. E we show that a minor modification to the confidence intervals of $\mathcal{M}^{\pm}_k$ makes the shortest paths between any two states $s, s' \in S^C_k$ equivalent in both sets of plausible MDPs, thus providing the bound $\mathrm{sp}_{S^C_k}\{w_k\} \le D^C$ (see footnote 5). The final regret in Thm. 1 is then obtained by combining all the different terms.

4 Experiments

In this section, we present experiments to validate the theoretical findings of Sec. 3. We compare TUCRL against UCRL and SCAL (footnote 6). We first consider the taxi problem [24] implemented in OpenAI Gym [25] (footnote 7). Even such a simple domain contains misspecified states, since the state space is constructed as the outer product of the taxi position, the passenger position and the destination. This leads to states that cannot be reached from any possible starting configuration (all the starting states belong to $S^C$). More precisely, out of 500 states in $S$, 100 are non-reachable.
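The 500/100 counts follow directly from the outer-product construction. The sketch below assumes the standard Gym Taxi layout (25 grid cells, 5 passenger locations of which one is "in the taxi", 4 destination depots); attributing the 100 unreachable states to configurations where a waiting passenger already stands at the destination depot is our reading of the construction, not a claim made in the paper:

```python
# State space = taxi cell x passenger location x destination (outer product).
TAXI_CELLS = 25       # assumed 5x5 grid
PASSENGER_LOCS = 5    # assumed: 4 depots + "in the taxi"
DESTINATIONS = 4      # assumed: 4 depot destinations

total_states = TAXI_CELLS * PASSENGER_LOCS * DESTINATIONS  # 500

# For each taxi cell and each destination there is exactly one passenger
# location (the destination depot itself, passenger not in the taxi) that
# no start configuration or transition ever produces.
unreachable_states = TAXI_CELLS * DESTINATIONS  # 100

print(total_states, unreachable_states, total_states - unreachable_states)
```

Running it prints the 500 total states, the 100 unreachable ones, and the 400 that remain reachable.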
[Figure 3 plots: cumulative regret $\Delta(T)$ as a function of the duration $T$ for SCAL, TUCRL and UCRL; the right panel also reports the count $\sum_{t=1}^{T} \mathbb{1}\{N^{\pm}_k(s,a) \le \sqrt{t_k/SA}\}$.]

Footnote 5: Note that there is not a single way to modify the confidence intervals of $\mathcal{M}^{\pm}_k$ to keep $\mathrm{sp}_{S^C_k}\{w_k\}$ under control. In App. F we present an alternative modification for which the shortest paths between any two states $s, s' \in S^C_k$ are not equal but smaller than in $\mathcal{M}_k$, thus ensuring that $\mathrm{sp}_{S^C_k}\{w_k\} \le D^C$.

Footnote 6: To the best of our knowledge, there exists no implementable algorithm to solve the optimization step of REGAL and REGAL.D.

Footnote 7: The code is available on GitHub.

Figure 4: 4a Expected regret of UCRL (with known horizon $T$ given as input) as a function of $T$. 4b, 4c Toy example illustrating the difficulty of learning non-communicating MDPs. We represent a family of possible MDPs $\mathcal{M} = (M_\varepsilon)_{\varepsilon \in [0,1]}$ where the probability $\varepsilon$ to go from $x$ to $y$ lies in $[0,1]$. [Panel 4a sketches the upper-bound on $\mathbb{E}[\Delta(\text{UCRL}, T, M)]$ with regimes $O(T)$, $O(DS\sqrt{AT \ln(T)})$ and $O\big(\tfrac{D^2 S^2 A}{\gamma} \ln(T)\big)$ separated at $T^{\dagger}_M$ and $T^*_M$; panels 4b, 4c depict two-state MDPs with states $x$, $y$, actions $b$, $d$ and rewards $r \in \{0, 1/2, 1\}$.]

On Fig. 3(left) we compare the regret of UCRL, SCAL and TUCRL when the misspecified states are present (top) and when they are removed (bottom). In the presence of misspecified states (top), the regret of UCRL clearly grows linearly with $T$, while TUCRL is able to learn as expected. On the other hand, when the MDP is communicating (bottom), TUCRL performs similarly to UCRL.
The small loss in performance is most likely due to the initial exploration phase, during which the confidence intervals on the transition probabilities used by UCRL (see definition of $\mathcal{M}_k$) are tighter than those used by TUCRL (see definition of $\mathcal{M}^{\pm}_k$): TUCRL uses a "loose" bound on the $\ell_1$-norm, while UCRL uses $S$ different bounds, one for every possible next state. Finally, SCAL outperforms TUCRL by exploiting prior knowledge on the bias span.

We further study the regret of TUCRL in the simple three-state domain introduced in [6] (see App. H for details) with different reward distributions (uniform instead of Bernoulli). The environment is composed of only three states ($s_0$, $s_1$ and $s_2$) and one action per state, except in $s_2$ where two actions are available. As a result, the agent only has the choice between two possible policies. Fig. 3(right) shows the cumulative regret achieved by TUCRL and SCAL (with different upper-bounds on the bias span) when the diameter is infinite, i.e., $S^C = \{s_0, s_2\}$ and $S^T = \{s_1\}$ (we omit UCRL, since it suffers linear regret). Both SCAL and TUCRL quickly achieve sub-linear regret, as predicted by the theory. However, SCAL and TUCRL seem to achieve different growth rates in regret: while SCAL appears to reach a logarithmic growth, the regret of TUCRL seems to grow as $\sqrt{T}$, with periodic "jumps" that are increasingly distant (in time) from each other. This can be explained by the way the algorithm works: while most of the time TUCRL is optimistic on the restricted state space $S^C$ (i.e., $S^C_k = S^C$), it periodically allows transitions to the set $S^T$ (i.e., $S^C_k = S$), even though $S^T$ is in fact not reachable. Enabling these transitions triggers aggressive exploration during an entire episode. The policy played is then sub-optimal, creating a "jump" in the regret.
At the end of this exploratory episode, $S^C_k$ will be set again to $S^C$ and the regret will stop increasing until the condition $N^{\pm}_k \le \sqrt{t_k/SA}$ occurs again (the time between two consecutive exploratory episodes grows quadratically). The cumulative regret incurred during exploratory episodes can be bounded by the term plotted in green on Fig. 3(right). In Lem. 2 we proved that this term is always bounded by $O(\sqrt{S^C A T})$. Therefore, it is not surprising to observe a $\sqrt{T}$ increase of both the green and red curves. Unfortunately, the growth rate of the regret will keep increasing as $\sqrt{T}$ and will never become logarithmic, unlike SCAL (or UCRL when the MDP is communicating). This is because the condition $N^{\pm}_k \le \sqrt{t_k/SA}$ will always be triggered $\Theta(\sqrt{T})$ times for any $T$. In Sec. 5 we show that this is not just a drawback specific to TUCRL, but rather an intrinsic limitation of learning in weakly-communicating MDPs.

5 Exploration-exploitation dilemma with infinite diameter

In this section we further investigate the empirical difference between SCAL and TUCRL and prove an impossibility result characterising the exploration-exploitation dilemma when the diameter is allowed to be infinite and no prior knowledge on the optimal bias span is available.

We first recall that the expected regret $\mathbb{E}[\Delta(\text{UCRL}, M, T)]$ of UCRL (with input parameter $\delta = 1/3T$) after $T \ge 1$ time steps and for any finite MDP $M$ can be bounded in several ways:
$$\mathbb{E}[\Delta(\text{UCRL}, M, T)] \le \begin{cases} r_{\max} T & \text{(by definition)} \\ C_1 \cdot r_{\max} D \sqrt{\Gamma S A T \ln(3T^2)} + \frac{1}{3} & \text{[2, Theorem 2]} \\ C_2 \cdot r_{\max} \frac{D^2 \Gamma S A}{\gamma} \ln(T) + C_3(M) & \text{[2, Theorem 4]} \end{cases} \qquad (7)$$
where $\gamma = g^*_M - \max_{s,\pi}\{g^{\pi}_M(s) : g^{\pi}_M(s) < g^*_M\}$ is the gap in gain, $C_1 := 34$ and $C_2 := 34^2$ are numerical constants independent of $M$, and $C_3(M) :=$
$O(\max_{\pi:\pi(s)=a} T_{\pi})$, with $T_{\pi}$ a measure of the "mixing time" of policy $\pi$. The three different bounds lead to three different growth rates for the function $T \longmapsto \mathbb{E}[\Delta(\text{UCRL}, M, T)]$ (see Fig. 4a): 1) for $0 \le T \le T^{\dagger}_M$, the expected regret is linear in $T$; 2) for $T^{\dagger}_M \le T \le T^*_M$, the expected regret grows as $\sqrt{T}$; 3) finally, for $T \ge T^*_M$, the increase in regret is only logarithmic in $T$. These different "regimes" can be observed empirically (see [6, Fig. 5, 12]). Using (7), it is easy to show that the time it takes for UCRL to achieve sub-linear regret is at most $T^{\dagger}_M = \widetilde{O}(D^2 \Gamma S A)$. We say that an algorithm is efficient when it achieves sub-linear regret after a number of steps that is polynomial in the parameters of the MDP (UCRL is thus efficient). We now show with an example that, without prior knowledge, any efficient learning algorithm must satisfy $T^*_M = +\infty$ when $M$ has infinite diameter (i.e., it cannot achieve logarithmic regret).

Example 1. We consider a family of weakly-communicating MDPs $\mathcal{M} = (M_{\varepsilon})_{\varepsilon \in [0,1]}$ represented on Fig. 4(right). Every MDP instance in $\mathcal{M}$ is characterised by a specific value of $\varepsilon \in [0,1]$, which corresponds to the probability to go from $x$ to $y$. For $\varepsilon > 0$ (Fig. 4b), the optimal policy of $M_{\varepsilon}$ is such that $\pi^*(x) = b$ and the optimal gain is $g^*_{\varepsilon} = 1$, while for $\varepsilon = 0$ (Fig. 4c) the optimal policy is such that $\pi^*(x) = d$ and the optimal gain is $g^*_0 = 1/2$. We assume that the learning agent knows that the true MDP $M^*$ belongs to $\mathcal{M}$ but does not know the value $\varepsilon^*$ associated to $M^* = M_{\varepsilon^*}$. We assume that all rewards are deterministic and that the agent starts in state $x$ (coloured in grey).

Lemma 3.
Let $C_1, C_2, \alpha, \beta > 0$ be positive real numbers and $f$ a function defined for all $\varepsilon \in\, ]0,1]$ by $f(\varepsilon) = C_1 (1/\varepsilon)^{\alpha}$. There exists no learning algorithm $A_T$ (with known horizon $T$) satisfying both:
1. for all $\varepsilon \in\, ]0,1]$, there exists $T^{\dagger}_{\varepsilon} \le f(\varepsilon)$ such that $\mathbb{E}[\Delta(A_T, M_{\varepsilon}, x, T)] < 1/6 \cdot T$ for all $T \ge T^{\dagger}_{\varepsilon}$,
2. and there exists $T^*_0 < +\infty$ such that $\mathbb{E}[\Delta(A_T, M_0, x, T)] \le C_2 (\ln(T))^{\beta}$ for all $T \ge T^*_0$.

Note that point 1 in Lem. 3 formalizes the concept of "efficient learnability" introduced by Sutton and Barto [26, Section 11.6], i.e., "learnable within a polynomial rather than exponential number of time steps". All the MDPs in $\mathcal{M}$ share the same number of states $S = 2 \ge \Gamma$, number of actions $A = 2$, and gap in average reward $\gamma = 1/2$. As a result, any function of $S$, $\Gamma$, $A$ and $\gamma$ will be considered as constant. For $\varepsilon > 0$, the diameter coincides with the optimal bias span of the MDP and $D = \mathrm{sp}_S\{h^*\} = 1/\varepsilon < +\infty$, while for $\varepsilon = 0$, $D = +\infty$ but $\mathrm{sp}_S\{h^*\} = 1/2$. As shown in Eq. 7 and Thm. 1, UCRL and TUCRL satisfy property 1. of Lem. 3 with $\alpha = 2$ and $C_1 = O(S^2 A)$ but do not satisfy 2. On the other hand, SCAL satisfies 2. with $\beta = 1$ and $C_2 = O(H^2 S A / \gamma)$ (although this result is not available in the literature, it is straightforward to adapt the proof of UCRL [2, Theorem 4] to SCAL), but since [6, Theorem 12] holds only when $H \ge \mathrm{sp}_S\{h^*\}$, SCAL only satisfies 1. for $\varepsilon \ge 1/H$ and $\varepsilon = 0$ (not for $\varepsilon \in\, ]0, 1/H[$). Lem. 3 proves that no algorithm can actually achieve both 1. and 2. As a result, since TUCRL satisfies 1., it cannot satisfy 2. This matches the empirical results presented in Sec.
4, where we observed that when the diameter is infinite, the growth rates of the regret of SCAL and TUCRL were respectively logarithmic and of order $\Theta(\sqrt{T})$. An algorithm that does not satisfy 1. could potentially satisfy 2. but, by definition of 1., it would suffer linear regret for a number of steps that is more than polynomial in the parameters of the MDP (more precisely, $e^{D^{1/\beta}}$). This is not a very desirable property, and we claim that an efficient learning algorithm should always prefer finite-time guarantees (1.) over asymptotic guarantees (2.) when they cannot be accommodated.

6 Conclusion

We introduced TUCRL, an algorithm that efficiently balances exploration and exploitation in weakly-communicating and multi-chain MDPs when the starting state $s_1$ belongs to a communicating set (Asm. 1). We showed that TUCRL achieves a square-root regret bound and that, in the general case, it is not possible to design an algorithm with logarithmic regret and polynomial dependence on the MDP parameters. Several questions remain open: 1) relaxing Asm. 1 by considering a transient initial state (i.e., $s_1 \in S^T$); 2) refining the lower bound of Jaksch et al. [2] to finally understand whether it is possible to scale with $\mathrm{sp}_S\{h^*\}$ (at least in communicating MDPs) instead of $D$ without any prior knowledge (the flaw in REGAL.D may suggest it is indeed impossible).

Acknowledgments

This research was supported in part by the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais Regional Council and the French National Research Agency (ANR) under project ExTra-Learn (n.ANR-14-CE24-0010-01).

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.

[2] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning.
Journal of Machine Learning Research, 11:1563–1600, 2010.

[3] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.

[4] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. In NIPS, pages 3169–3179, 2017.

[5] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.

[6] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. CoRR, abs/1802.04020, 2018.

[7] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

[8] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.

[9] Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11. AUAI Press, 2015.

[10] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.

[11] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, pages 1333–1342, 2017.

[12] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.

[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A.
Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[14] Andrew William Moore. Efficient memory-based learning for robot control. Technical report, University of Cambridge, 1990.

[15] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331, Oct 1995. doi: 10.1109/SFCS.1995.492488.

[16] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 861–898. JMLR.org, 2015.

[17] Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, pages 2–11, Arlington, Virginia, United States, 2015. AUAI Press. ISBN 978-0-9966431-0-8.

[18] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems 30, pages 1333–1342. Curran Associates, Inc., 2017.

[19] Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, and Nikos Vlassis. Posterior sampling for large scale reinforcement learning. CoRR, abs/1711.07979, 2017.

[20] Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. CoRR, abs/1608.02731, 2016.

[21] Martin L. Puterman.
Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994. ISBN 0471619779.

[22] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory, pages 150–165, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

[23] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.

[24] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res., 13:227–303, 2000.

[25] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016.

[26] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. Adaptive Computation and Machine Learning. MIT Press, second edition, 2018. ISBN 9780262039246.

[27] Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, and Daniil Ryabko. Optimal regret bounds for selecting the state representation in reinforcement learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 543–551, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.