{"title": "Exploration Bonus for Regret Minimization in Discrete and Continuous Average Reward MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 4890, "page_last": 4899, "abstract": "The exploration bonus is an effective approach to manage the exploration-exploitation trade-off in Markov Decision Processes (MDPs).\nWhile it has been analyzed in infinite-horizon discounted and finite-horizon problems, we focus on designing and analysing the exploration bonus in the more challenging infinite-horizon undiscounted setting.\nWe first introduce SCAL+, a variant of SCAL (Fruit et al. 2018), that uses a suitable exploration bonus to solve any discrete unknown weakly-communicating MDP for which an upper bound $c$ on the span of the optimal bias function is known. We prove that SCAL+ enjoys the same regret guarantees as SCAL, which relies on the less efficient extended value iteration approach.\nFurthermore, we leverage the flexibility provided by the exploration bonus scheme to generalize SCAL+ to smooth MDPs with continuous state space and discrete actions. We show that the resulting algorithm (SCCAL+) achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm for this setting.", "full_text": "Exploration Bonus for Regret Minimization in\nDiscrete and Continuous Average Reward MDPs\n\nJian Qian, Ronan Fruit\nSequel Team - Inria Lille\n\njian.qian@ens.fr, ronan.fruit@inria.fr\n\nMatteo Pirotta, Alessandro Lazaric\n\nFacebook AI Research\n\n{pirotta, lazaric}@fb.com\n\nAbstract\n\nThe exploration bonus is an effective approach to manage the exploration-\nexploitation trade-off in Markov Decision Processes (MDPs). While it has been\nanalyzed in in\ufb01nite-horizon discounted and \ufb01nite-horizon problems, we focus on\ndesigning and analysing the exploration bonus in the more challenging in\ufb01nite-\nhorizon undiscounted setting. 
We first introduce SCAL+, a variant of SCAL [1], that uses a suitable exploration bonus to solve any discrete unknown weakly-communicating MDP for which an upper bound $c$ on the span of the optimal bias function is known. We prove that SCAL+ enjoys the same regret guarantees as SCAL, which relies on the less efficient extended value iteration approach. Furthermore, we leverage the flexibility provided by the exploration bonus scheme to generalize SCAL+ to smooth MDPs with continuous state space and discrete actions. We show that the resulting algorithm (SCCAL+) achieves the same regret bound as UCCRL [2] while being the first implementable algorithm for this setting.

1 Introduction

While learning in an unknown environment, a reinforcement learning (RL) agent must trade off the exploration needed to collect information about the dynamics and reward, and the exploitation of the experience gathered so far to gain reward. An effective strategy to trade off exploration and exploitation is the optimism in the face of uncertainty (OFU) principle. A popular technique to ensure optimism is to use an exploration bonus. This approach has been successfully implemented in $H$-step finite-horizon and infinite-horizon $\gamma$-discounted settings with provable guarantees in finite MDPs. Furthermore, its simple structure (i.e., it only requires solving an estimated MDP with a reward increased by the bonus) allowed it to be integrated in deep RL algorithms [e.g., 3, 4, 5, 6]. As the exploration bonus is designed to bound estimation errors on the value function, it requires knowing the maximum reward $r_{max}$ and the intrinsic horizon of the problem [e.g., 7, 8, 9] (e.g., $H$ in finite-horizon and $1/(1-\gamma)$ in discounted problems). Here we consider the challenging infinite-horizon undiscounted setting [10, Chap. 8], which generalizes the two previous settings when $H \to \infty$ and $\gamma \to 1$.
While several algorithms implementing the OFU principle in this setting have been proposed [11, 2, 12, 1, 13], none of them exploits the idea of an exploration bonus.

In this paper we study the problem of defining and analysing an exploration bonus approach in the infinite-horizon undiscounted setting. Contrary to the other settings, in average reward there is no information about the intrinsic horizon. As a consequence, we follow the approach in [14, 1] and we assume that an upper-bound $c$ on the range of the optimal bias (i.e., value function) is known. We define SCAL+ and we show that its regret is bounded by $\widetilde{O}(\max\{c, r_{max}\}\sqrt{\Gamma S A T})$ w.h.p. for any MDP with $S$ states, $A$ actions and $\Gamma$ possible next states. We prove that the bonus used by SCAL+ ensures optimism using a novel technical argument. We no longer use an inclusion argument (i.e., the true MDP is contained in a set of plausible MDPs) but we reason directly at the level of the Bellman operator. We show that the optimistic Bellman operator defined by the empirical MDP with optimistic reward $\hat{r}(s,a) + b(s,a)$ dominates the Bellman operator of the true MDP when applied to the optimal bias function. This is sufficient to prove that the solution of the optimistic MDP is indeed (gain-)optimistic. This proof technique has two main advantages w.r.t. the inclusion argument. First, it directly applies to slightly perturbed empirical MDPs, without re-deriving confidence sets. Second, as we study the optimistic Bellman operator applied only to the optimal bias function (rather than all possible vectors in $\mathbb{R}^S$), we save a factor $\sqrt{\Gamma}$ in designing the exploration bonus, compared to the (implicit) bounds obtained by algorithms relying on confidence sets on the MDP.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Furthermore, as SCAL+ only solves the estimated MDP with optimistic reward, it is computationally cheaper than UCRL-based algorithms, which require computing the optimal policy for an extended MDP with a continuous action space defined by the confidence set over MDPs.

Surprisingly, the "tighter" optimism of SCAL+ does not translate into a better regret, which actually matches the one of SCAL and still depends on the factor $\sqrt{\Gamma}$. We isolate and discuss where the term $\sqrt{\Gamma}$ appears in the proof sketch of Sect. 3.3. While Azar et al. [8] and Jin et al. [9] managed to remove the $\sqrt{\Gamma}$ term in the finite-horizon setting, their proof techniques cannot be directly applied to the infinite-horizon case. Recently, Ortner [15] derived an algorithm achieving an $O(\sqrt{t_{mix} S A T})$ regret bound under the assumption that the true MDP is ergodic ($t_{mix}$ denotes the maximum mixing time of any policy). It remains an open question whether a regret scaling with $\sqrt{S}$ (instead of $\sqrt{\Gamma S}$) can be achieved in the infinite-horizon case without any ergodicity assumption. We report preliminary experiments showing that the exploration bonus may indeed limit over-exploration and lead to better empirical performance w.r.t. approaches based on confidence intervals on the MDP itself (i.e., UCRL and SCAL). A more detailed comparison to existing literature is postponed to App. A.

To further illustrate the generality of the exploration bonus approach, we also present SCCAL+, an extension of SCAL+ to continuous state MDPs. As in [2, 16], we require the reward and transition functions to be Hölder continuous with parameters $\rho_L$ and $\alpha$. SCCAL+ is also the first implementable algorithm for continuous average reward problems with theoretical guarantees (existing algorithms with theoretical guarantees, such as UCCRL [2], cannot be implemented).
The key result is a regret bound of $\widetilde{O}(\max\{c, r_{max}\}\, \rho_L \sqrt{A}\, T^{(\alpha+2)/(2\alpha+2)})$ w.h.p. Finally, we provide an empirical comparison of SCCAL+ with a Q-learning algorithm with exploration bonus for average reward problems (RVIQ-UCB) inspired by [17, 9] and the results in this paper (to deal with continuous states).

2 Preliminaries

We consider a weakly-communicating MDP [10, Sec. 8.3] $M = (\mathcal{S}, \mathcal{A}, p, r)$ with state space $\mathcal{S}$ and action space $\mathcal{A}$. Every state-action pair $(s,a)$ is characterized by a reward distribution with mean $r(s,a)$ and support in $[0, r_{max}]$, and a transition distribution $p(\cdot|s,a)$ over next states. In this section, we assume the finite case (i.e., $|\mathcal{S}|, |\mathcal{A}| < +\infty$), although all following definitions extend to continuous state spaces under mild assumptions on $r$ and $p$ (see Sect. 4). We denote by $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$ the number of states and actions, by $\Gamma(s,a) = \|p(\cdot|s,a)\|_0$ the number of states reachable by selecting action $a$ in state $s$, and by $\Gamma = \max_{s,a} \Gamma(s,a)$ its maximum. A stationary Markov randomized policy $\pi : \mathcal{S} \to P(\mathcal{A})$ maps states to distributions over actions. The set of stationary randomized (resp. deterministic) policies is denoted by $\Pi^{SR}$ (resp. $\Pi^{SD}$). Any policy $\pi \in \Pi^{SR}$ has an associated long-term average reward (or gain) and a bias function defined as

$g^\pi(s) := \lim_{T\to+\infty} \mathbb{E}^\pi_s\Big[\frac{1}{T}\sum_{t=1}^T r(s_t,a_t)\Big] \quad \text{and} \quad h^\pi(s) := \operatorname*{C-lim}_{T\to+\infty} \mathbb{E}^\pi_s\Big[\sum_{t=1}^T \big(r(s_t,a_t) - g^\pi(s_t)\big)\Big],$

where $\mathbb{E}^\pi_s$ denotes the expectation over trajectories generated starting from $s_1 = s$ with $a_t \sim \pi(s_t)$. The bias $h^\pi(s)$ measures the expected total difference between the reward and the stationary reward in Cesaro-limit (denoted by C-lim).
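For a finite MDP, the gain and bias of a fixed stationary policy defined above can be computed exactly by solving the evaluation equations $h^\pi + g^\pi e = r_\pi + P_\pi h^\pi$ together with a normalization. A minimal NumPy sketch (the function name and the least-squares formulation are ours; it assumes the chain induced by the policy is unichain):

```python
import numpy as np

def gain_and_bias(P, r, ref=0):
    """Gain g and bias h of a fixed policy in a finite unichain MDP.

    P: (S, S) transition matrix of the Markov chain induced by the policy.
    r: (S,) expected rewards under the policy.
    Solves  g * e + (I - P) h = r  with the normalization h[ref] = 0.
    """
    S = P.shape[0]
    A = np.zeros((S + 1, S + 1))
    b = np.zeros(S + 1)
    A[:S, 0] = 1.0                 # the gain g appears in every evaluation equation
    A[:S, 1:] = np.eye(S) - P      # (I - P) h
    b[:S] = r
    A[S, 1 + ref] = 1.0            # normalization: h[ref] = 0
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[0], x[1:]             # (gain, bias vector)
```

For instance, on a two-state symmetric chain with rewards $(1, 0)$ the gain is $1/2$ and the bias span is $1$, consistent with the definitions above.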
Accordingly, the difference of bias values $h^\pi(s) - h^\pi(s')$ quantifies the (dis-)advantage of starting in state $s$ rather than $s'$. We denote by $sp(h^\pi) := \max_s h^\pi(s) - \min_s h^\pi(s)$ the span of the bias function. In weakly communicating MDPs, any optimal policy $\pi^* \in \arg\max_\pi g^\pi(s)$ has constant gain, i.e., $g^{\pi^*}(s) = g^*$ for all $s \in \mathcal{S}$. Moreover, there exists a policy $\pi^* \in \arg\max_\pi g^\pi(s)$ for which $(g^*, h^*) = (g^{\pi^*}, h^{\pi^*})$ satisfies the optimality equation

$h^*(s) + g^* = L h^*(s) := \max_{a \in \mathcal{A}} \{ r(s,a) + p(\cdot|s,a)^T h^* \}, \quad \forall s \in \mathcal{S},$   (1)

where $L$ is the optimal Bellman operator. Finally, $D = \max_{s \neq s'} \{\tau(s \to s')\}$ denotes the diameter of $M$, where $\tau(s \to s')$ is the minimal expected number of steps needed to reach $s'$ from $s$.

Input: Confidence $\delta \in (0,1)$, $r_{max}$, $S$ ($\mathcal{I}$ for SCCAL+), $A$, $c \ge 0$ (and $\rho_L$ and $\alpha$ for SCCAL+)
For episodes $k = 1, 2, \ldots$ do
1. Set $t_k = t$ and episode counters $\nu_k(s,a) = 0$.
2. Compute estimates $\hat p_k^+(I(s')|I(s),a)$, $\hat r_k^+(I(s),a)$, $b_k(I(s),a)$ (Eq. 3 or 6) and build the MDP $\hat M_k^+$ (SCAL+) or $\hat M_k^{ag+}$ (SCCAL+).
3. Compute an $r_{max}/\sqrt{t_k}$-approximate solution of Eq. 4 for $\hat M_k^+$ (SCAL+) or $\hat M_k^{ag+}$ (SCCAL+) using SCOPT.
4. Sample action $a_t \sim \pi_k(\cdot|I(s_t))$.
5. While $\nu_k(I(s_t), a_t) < \max\{1, N_k(I(s_t), a_t)\}$ do
   (a) Execute $a_t$, obtain reward $r_t$, and observe next state $s_{t+1}$.
   (b) Increment counter $\nu_k(s_t, a_t)$ += 1.
   (c) Sample action $a_{t+1} \sim \pi_k(\cdot|I(s_{t+1}))$ and increment $t$ += 1.
6. Set $N_{k+1}(s,a) := N_k(s,a) + \nu_k(s,a)$ for all $(s,a)$.

Figure 1: Shared pseudo-code for SCAL+ and SCCAL+.
For SCAL+, $I(s) = s$ by definition.

Learning Problem. Let $M^*$ be the true MDP. We consider the learning problem where $S$, $A$ and $r_{max}$ are known, while rewards $r$ and dynamics $p$ are unknown and need to be estimated on-line. We evaluate the performance of a learning algorithm $\mathfrak{A}$ after $T$ time steps by its cumulative regret $\Delta(\mathfrak{A}, T) = \sum_{t=1}^T (g^* - r_t(s_t, a_t))$. Finally, we make the following assumption.

Assumption 1. There exists a known upper-bound $c > 0$ on the optimal bias span, i.e., $c \ge sp(h^*)$.

This assumption is common in the literature [see e.g., 18, 2, 1]. Such a bound on the "range" of the value function is already available in discounted and finite-horizon problems (i.e., as $\frac{1}{1-\gamma}$ and $H$), so Asm. 1 is not more restrictive. While the span $sp(h^*)$ is a non-trivial function of the dynamics and the rewards of the MDP, some intuition about how the cumulative reward varies depending on different starting states is often available. Furthermore, as $sp(h^*) \le r_{max} D$ [e.g., 14], it is sufficient to have prior knowledge about the diameter $D$ and the range of the reward $r_{max}$ to provide a rough upper-bound on the span.

3 SCAL+: SCAL with exploration bonus

In this section, we introduce SCAL+, the first online RL algorithm in the infinite-horizon undiscounted setting that leverages an exploration bonus to achieve near-optimal regret guarantees. Similar to SCAL [1], SCAL+ takes as input an upper-bound $c$ on the optimal bias span (i.e., $sp(h^*) \le c$) to constrain the planning problem solved over time. The crucial difference is that SCAL+ does not compute an optimistic MDP within a high-probability confidence set, but directly computes the optimal policy of the estimated MDP, with the reward increased by an exploration bonus. The bonus is carefully tuned so as to guarantee optimism and small regret at the same time (Thm. 1).

3.1 The Algorithm

Similar to other OFU-based algorithms, SCAL+ proceeds in episodes (see Fig. 1; the algorithm is reported in its general form, which applies to both discrete and continuous state spaces). The policy computed at the beginning of an episode is executed until the number of visits of at least one state-action pair has doubled within the episode. Denote by $t_k$ the starting time of episode $k$, $N_k(s,a,s')$ the number of observations of the tuple $(s,a,s')$ before episode $k$, and $N_k(s,a) := \sum_{s'} N_k(s,a,s')$. We define the estimators of the rewards and transitions as

$\bar r_k(s,a) = \frac{1}{N_k(s,a)} \sum_{t=1}^{t_k-1} r_t(s_t,a_t)\, \mathbb{1}\big((s_t,a_t) = (s,a)\big), \qquad \hat p_k^+(s'|s,a) = \frac{N_k(s,a,s')}{N_k(s,a)+1} + \frac{\mathbb{1}(s' = \bar s)}{N_k(s,a)+1},$   (2)

where $\bar s \in \mathcal{S}$ is an arbitrary state, and $\bar r_k(s,a) := r_{max}$, $\hat p_k^+(s'|s,a) := 1/S$ when $N_k(s,a) = 0$. The transition model $\hat p_k^+(\cdot|s,a)$ is a biased (but asymptotically consistent) estimator of $p(\cdot|s,a)$. We further define the exploration bonus

$b_k(s,a) := (c + r_{max}) \underbrace{\sqrt{\frac{\ln\big(20 S A N_k^+(s,a)/\delta\big)}{N_k^+(s,a)}}}_{:= \beta_k^{sa}} + \frac{c}{N_k(s,a)+1},$   (3)

where $N_k^+(s,a) = \max\{1, N_k(s,a)\}$. Intuitively, the exploration bonus is large for poorly visited state-action pairs, while it decreases with the number of visits.
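To make Eq. 2 and Eq. 3 concrete, the following NumPy sketch implements the biased transition estimator and the exploration bonus (the helper names and standalone structure are ours; the constants follow the displayed formulas):

```python
import numpy as np

def transition_estimate(N_sas, s_bar):
    """Biased transition estimator of Eq. (2).

    N_sas: (S,) array of counts N_k(s, a, s') for a fixed (s, a).
    s_bar: index of the arbitrary reference state.
    """
    S = N_sas.shape[0]
    n = N_sas.sum()
    if n == 0:
        return np.full(S, 1.0 / S)       # uniform before any observation
    p = N_sas / (n + 1.0)
    p[s_bar] += 1.0 / (n + 1.0)          # extra mass on the reference state
    return p

def exploration_bonus(n, c, r_max, S, A, delta):
    """Exploration bonus b_k(s, a) of Eq. (3) for a pair with n = N_k(s, a) visits."""
    n_plus = max(1, n)
    beta = np.sqrt(np.log(20 * S * A * n_plus / delta) / n_plus)
    return (c + r_max) * beta + c / (n + 1.0)
```

Note that the extra mass placed on $\bar s$ is exactly what makes $\hat p_k^+$ biased while keeping it a valid probability distribution, and that the bonus decreases roughly as $1/\sqrt{N_k(s,a)}$.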
A crucial aspect in the formulation of $b_k$ is that it scales with the span $c$. In fact, the exploration bonus is not used to obtain an upper-confidence bound on the reward (setting $b_k(s,a) = \beta_k^{sa}$ would be sufficient), but it is designed to take into consideration how estimation errors on $p$ and $r$, which are bounded by $\beta_k^{sa}$, may propagate to the bias and gain through repeated applications of the Bellman operator. As the span $c$ provides prior knowledge about the "range" of the optimal bias function, the exploration bonus is obtained by considering that "local" estimation errors may be amplified up to a factor $c$. The specific shape of $b_k$ and $\beta_k^{sa}$ and their theoretical properties are derived in Lem. 1. At each episode $k$, SCAL+ builds an MDP $\hat M_k^+ = (\mathcal{S}, \mathcal{A}^+, \hat p_k^+, \hat r_k^+)$ obtained by duplicating every action in $\mathcal{A}$, with transition probabilities unchanged and optimistic reward set to 0 for the duplicate. Formally, let $\mathcal{A}^+ := \mathcal{A} \times \{1,2\}$ and denote any pair $(a,i) \in \mathcal{A} \times \{1,2\}$ by $a_i$. We then define $\hat r_k^+(s, a_i) := (\bar r_k(s,a) + b_k(s,a)) \cdot \mathbb{1}(i=1)$. SCAL+ proceeds by computing the optimal policy of the MDP $\hat M_k^+$ subject to the constraint on the bias span:

$\pi_k := \arg\sup_{\pi \in \Pi_c(\hat M_k^+)} \{g^\pi\}, \qquad g_c^*(\hat M_k^+) := \sup_{\pi \in \Pi_c(\hat M_k^+)} \{g^\pi\},$   (4)

where the constraint set is $\Pi_c(M) := \{\pi \in \Pi^{SR} : sp(h^\pi) \le c \wedge sp(g^\pi) = 0\}$. Problem (4) is well posed and can be solved using SCOPT. Let $\hat L^+$ be the optimal Bellman operator associated to $\hat M_k^+$. Given $v \in \mathbb{R}^S$ and $c \ge 0$, we define the value operator $\hat T_c^+ : \mathbb{R}^S \to \mathbb{R}^S$ as

$\hat T_c^+ v = \Gamma_c \hat L^+ v = \begin{cases} \hat L^+ v(s) & \forall s \in \big\{s \in \mathcal{S} \mid \hat L^+ v(s) \le \min_{s'}\{\hat L^+ v(s')\} + c\big\} \\ c + \min_{s'}\{\hat L^+ v(s')\} & \text{otherwise,} \end{cases}$   (5)

where $\Gamma_c$ is the span constraint projection operator (see [1, App. D] for details). In other words, the operator $\hat T_c^+$ applies a span truncation to the one-step application of $\hat L^+$, which guarantees that $sp(\hat T_c^+ v) \le c$.
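The span-truncated operator and the relative value iteration built on it can be sketched in NumPy as follows (a simplified illustration assuming the convergence conditions of Prop. 1 hold; `scopt_like_vi` and its stopping rule are our simplification, not the paper's exact SCOPT):

```python
import numpy as np

def truncated_bellman(v, r, P, c):
    """One application of T_c^+: the optimal Bellman operator followed by the
    span truncation Gamma_c (values above min + c are clipped).

    r: (S, A) rewards; P: (S, A, S) transition kernel; c: span constraint.
    """
    # L v(s) = max_a { r(s, a) + sum_{s'} P[s, a, s'] v(s') }
    Lv = (r + np.tensordot(P, v, axes=([2], [0]))).max(axis=1)
    return np.minimum(Lv, Lv.min() + c)

def scopt_like_vi(r, P, c, s_bar=0, tol=1e-8, max_iter=10_000):
    """Relative value iteration v_{n+1} = T_c^+ v_n - T_c^+ v_n(s_bar) e."""
    v = np.zeros(r.shape[0])
    g, v_new = 0.0, v
    for _ in range(max_iter):
        Tv = truncated_bellman(v, r, P, c)
        g, v_new = Tv[s_bar], Tv - Tv[s_bar]   # gain estimate at the reference state
        if np.abs(v_new - v).max() < tol:
            break
        v = v_new
    return g, v_new
```

On a toy two-state MDP the role of the constraint is visible: with a loose $c$ the iteration recovers the unconstrained optimal gain, while a small $c$ caps the achievable gain through the truncation.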
Given a vector $v_0 \in \mathbb{R}^S$ and a reference state $\bar s$, SCOPT runs relative value iteration where $\hat L^+$ is replaced by $\hat T_c^+$: $v_{n+1} = \hat T_c^+ v_n - \hat T_c^+ v_n(\bar s) e$. The policy $\pi_k$ returned by SCOPT takes actions in the augmented set $\mathcal{A}^+$ and it can be "projected" onto $\mathcal{A}$ as $\pi_k(s,a) \leftarrow \pi_k(s,a_1) + \pi_k(s,a_2)$ (we use the same notation for the two policies); this is the policy actually executed through the episode. Following similar steps as in [1], we can prove that $\hat M_k^+$ satisfies all sufficient conditions for SCOPT to converge and return the optimal policy (see App. B).

Proposition 1. The MDP $\hat M_k^+$ satisfies the following properties: 1) the associated optimal Bellman operator $\hat L^+$ is a $\gamma$-span-contraction; 2) all policies are unichain; 3) the operator $\hat T_c^+$ is globally feasible at any vector $v \in \mathbb{R}^S$ such that $sp(v) \le c$, i.e., for all $s \in \mathcal{S}$, $\min_{a \in \mathcal{A}}\{r(s,a) + p(\cdot|s,a)^T v\} \le \min_{s'}\{Lv(s')\} + c$. As a consequence, SCOPT converges and returns a policy $\pi_k$ solving (4).

3.2 Optimistic Exploration Bonus

All regret proofs for OFU-based algorithms rely on the property that the optimal gain of the MDP used to compute $\pi_k$ ($\hat M_k^+$ in our case) is an upper-bound on $g^*$. If we want to use the same proof technique for SCAL+, we need to ensure that the policy $\pi_k$ is gain-optimistic, i.e., $\hat g_k^+ := g_c^*(\hat M_k^+) \ge g^*$. Recall that the optimal gain and bias of the true MDP $(g^*, h^*)$ satisfy the optimality equation $Lh^* = h^* + g^* e$ where $e = (1, \ldots, 1)$. Since $sp(h^*) \le c$ (by assumption), we also have $sp(Lh^*) = sp(h^* + g^* e) = sp(h^*) \le c$ and so $T_c h^* = Lh^*$. A minor variation of Lemma 8 of Fruit et al. [1] shows that a sufficient condition for optimistic gain is that the operator $\hat T_c^+$ is optimistic w.r.t. its exact version when applied to the optimal bias function, i.e. (see Prop. 3 in App. B),

$\hat T_c^+ h^* \ge h^* + g^* e = T_c h^*.$

As the truncation operated by $T_c$ (i.e., $\Gamma_c$) is monotone, this inequality is implied by $\hat L^+ h^* \ge Lh^*$. Finally, since $\hat p_k^+(s'|s,a_1) = \hat p_k^+(s'|s,a_2) = \hat p_k(s'|s,a)$ and $\hat r_k^+(s,a_2) \le \hat r_k^+(s,a_1)$, it is immediate to see that $\hat L_k^+ h^* = \hat L_k h^*$, thus implying that a sufficient condition for $\hat g_k^+ \ge g^*$ is to have $\hat L_k h^* \ge Lh^*$, which reduces to verifying optimism for the Bellman operator of $\hat M_k$ when applied to the exact optimal bias function. The exploration bonus is tailored to achieve this condition with high probability.

Lemma 1. Denote by $\hat L_k$ the optimal Bellman operator of $\hat M_k$. With probability at least $1 - \frac{\delta}{5}$, for all $k \ge 1$, $\hat L_k h^* \ge Lh^*$ (componentwise) and as a consequence $\hat g_k^+ \ge g^*$.

Proof (see App. D). By using the Hoeffding-Azuma inequality and a union bound, we can show that for all $k \ge 1$, $|\bar r_k(s,a) - r(s,a)| \le r_{max} \beta_k^{sa}$ and $|(\bar p_k(\cdot|s,a) - p(\cdot|s,a))^T h^*| \le c\, \beta_k^{sa}$ w.h.p. ($\bar p_k$ is the MLE of $p$). We also need to take into account the small bias introduced by $\hat p_k(\cdot|s,a)$ compared to $\bar p_k(\cdot|s,a)$, which is not bigger than $c/(N_k(s,a)+1)$ by definition. Then, with high probability, for all $k \ge 1$, $\bar r_k(s,a) + b_k(s,a) + \hat p_k(\cdot|s,a)^T h^* \ge r(s,a) + p(\cdot|s,a)^T h^*$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$.

The argument used to prove optimism (Lem. 1) significantly differs from the one used for UCRL and SCAL. Confidence-based methods compute the optimal policy of an extended MDP that "contains" the true MDP $M^*$ (w.h.p.), which directly implies that the gain of the extended MDP is bigger than $g^*$. The main advantage of our argument is that it allows for a "tighter" optimism (i.e., less prone to over-exploration). In fact, the exploration bonus quantifies by how much $\hat L_k^+ h^*$ is bigger than $Lh^*$, and it approximately scales as $b_k(s,a) = \widetilde{\Theta}\big(\max\{r_{max}, c\}/\sqrt{N_k(s,a)}\big)$. In contrast, UCRL and SCAL use an optimistic Bellman operator $\widetilde{L}$ such that $\widetilde{L}h^*$ is bigger than $Lh^*$ by respectively $\widetilde{\Theta}\big(r_{max} D \sqrt{\Gamma/N_k(s,a)}\big)$ (UCRL) and $\widetilde{\Theta}\big(\max\{r_{max}, c\} \sqrt{\Gamma/N_k(s,a)}\big)$ (SCAL). In other words, the optimism in SCAL+ is tighter by a multiplicative factor $\sqrt{\Gamma}$.

3.3 Regret Analysis of SCAL+

We report the main result of this section.

Theorem 1. For any weakly communicating MDP $M$ such that $sp(h^*) \le c$, with probability at least $1 - \delta$ it holds that for any $T \ge 1$, the regret of SCAL+ is bounded as

$\Delta(\text{SCAL+}, T) = O\Big( \max\{r_{max}, c\} \Big( \sqrt{\big(\textstyle\sum_{s,a} \Gamma(s,a)\big)\, T \ln(T/\delta)} + S^2 A \ln^2(T/\delta) \Big) \Big).$

Since the optimism in SCAL+ is tighter than in UCRL and SCAL by a factor $\sqrt{\Gamma}$, one may expect to get a regret bound scaling as $c\sqrt{SAT}$ instead of $c\sqrt{\Gamma SAT}$, thus matching the lower bound of Jaksch et al. [11] as for the dependency on $S$. Unfortunately, such a bound seems difficult to achieve with SCAL+ (and even SCAL) due to the correlation between $h_k$ and $\bar p_k$ (see App. D). Azar et al. [8] managed to achieve the optimal dependence on $S$ in finite-horizon problems.
In this setting, the definition of regret is different and it is not clear whether it is possible to adapt their guarantees and techniques to infinite horizon without introducing a $\Theta(T)$-term. Agrawal and Jia [19] showed that optimistic posterior sampling has a regret of $\widetilde{O}(D\sqrt{SAT})$ in the infinite-horizon undiscounted setting. Unfortunately, their proof critically relies on the concentration inequality $|(\bar p_k(\cdot|s,a) - p(\cdot|s,a))^T h_k| \lesssim r_{max} D \beta_k^{sa}$, which is incorrect (see https://arxiv.org/abs/1705.07041). It remains an open question whether the $\sqrt{\Gamma}$ term can actually be removed. Finally, SCAL+'s regret does not scale with $\min\{r_{max} D, c\}$ as for SCAL, implying that SCAL+ may perform worse when $c$ is too large. The difference resides in the fact that SCAL builds an extended MDP that contains the true MDP (w.h.p.). The shortest path between two states in the extended MDP is therefore shorter than in the true MDP and, consequently, the diameter of the extended MDP is smaller than the true diameter $D$. This explains why the regret of SCAL depends on both $D$ and $c$ (which is provided as input to the algorithm). Unfortunately, in SCAL+ it is not clear how to bound the diameter of $\hat M_k^+$, and the only information that can be exploited to bound the regret is the constraint $c$.

4 SCCAL+: SCAL+ for continuous state space

We now consider an MDP with continuous state space $\mathcal{S} = [0, 1]$ and discrete action space $\mathcal{A}$. In general, it is impossible to learn an arbitrary real-valued function with only a finite number of samples. We therefore introduce the same smoothness assumption as Ortner and Ryabko [2]:

Assumption 2 (Hölder continuity). There exist $\rho_L, \alpha > 0$ s.t. for any two states $s, s' \in \mathcal{S}$ and any action $a \in \mathcal{A}$, $|r(s,a) - r(s',a)| \le r_{max} \rho_L |s - s'|^\alpha$ and $\|p(\cdot|s,a) - p(\cdot|s',a)\|_1 \le \rho_L |s - s'|^\alpha$.

As in Sec. 3, we start by introducing our proposed algorithm SCCAL+, a variant of SCAL+ for continuous state space (Sec. 4.1), and then analyze its regret (Sec. 4.2).

4.1 The algorithm

In order to apply SCAL+ to a continuous problem, we discretize the state space as in [2]. We partition $\mathcal{S}$ into $S$ intervals defined as $I_1 := [0, \frac{1}{S}]$ and $I_j := ]\frac{j-1}{S}, \frac{j}{S}]$ for $j = 2, \ldots, S$. The set of aggregated states is then $\mathcal{I} := \{I_1, \ldots, I_S\}$ ($|\mathcal{I}| = S$). The number of intervals $S$ is a parameter of the algorithm and plays a central role in its performance. Note that the terms $N_k(s,a,s')$ and $N_k(s,a)$ defined in Sec. 3 are still well-defined for $s$ and $s'$ lying in $[0,1]$ but are 0 except for a finite number of $s$ and $s'$. For any subset $I \subseteq \mathcal{S}$, the sum $\sum_{s \in I} u_s$ is also well-defined as long as the collection $(u_s)_{s \in I}$ contains only a finite number of non-zero elements. We can therefore define the aggregated counts, rewards and transition probabilities for all $I, J \in \mathcal{I}$ as:

$N_k(I,a) := \sum_{s \in I} N_k(s,a), \quad \bar r_k^{ag}(I,a) := \frac{1}{N_k(I,a)} \sum_{t=1}^{t_k-1} r_t(s_t,a_t)\, \mathbb{1}(s_t \in I, a_t = a), \quad \bar p_k^{ag}(J|I,a) := \frac{\sum_{s \in I} \sum_{s' \in J} N_k(s,a,s')}{\sum_{s \in I} N_k(s,a)}.$

Similar to Eq. 3, we define the exploration bonus of an aggregated state as

$b_k(I,a) := (c + r_{max})\big(\beta_k^{Ia} + \rho_L S^{-\alpha}\big) + \frac{c}{N_k(I,a)+1},$   (6)

where $\beta_k^{Ia}$ is defined as in (3). The main difference is an additional $O(c\, \rho_L S^{-\alpha})$ term that accounts for the fact that the states we aggregate are not completely identical but have parameters that differ by at most $\rho_L S^{-\alpha}$.
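The discretization and the aggregated bonus of Eq. 6 can be sketched as follows (the interval-indexing helper and function names are ours):

```python
import numpy as np

def interval_index(s, S):
    """I(s): 0-based index of the interval containing s in [0, 1],
    with I_1 = [0, 1/S] and I_j = ]((j-1)/S, j/S] for j >= 2."""
    if s <= 0.0:
        return 0
    return min(int(np.ceil(s * S)) - 1, S - 1)

def aggregated_bonus(n_Ia, c, r_max, S, A, delta, rho_L, alpha):
    """Bonus of Eq. (6): the discrete bonus of Eq. (3) evaluated on aggregated
    counts, plus the aggregation-error term rho_L * S**(-alpha)."""
    n_plus = max(1, n_Ia)
    beta = np.sqrt(np.log(20 * S * A * n_plus / delta) / n_plus)
    return (c + r_max) * (beta + rho_L * S ** (-alpha)) + c / (n_Ia + 1.0)
```

Unlike the discrete bonus, the aggregation term $\rho_L S^{-\alpha}$ does not vanish with the number of visits; it is the price paid for treating all states in an interval as identical.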
We pick an arbitrary reference aggregated state $\bar I$ and define $\hat M_k^{ag} = (\mathcal{I}, \mathcal{A}, \hat p_k^{ag}, \hat r_k^{ag})$, the aggregated (discrete) analogue of $\hat M_k$ defined in Sec. 3, where $\hat r_k^{ag} = \bar r_k^{ag} + b_k$ and

$\hat p_k^{ag}(J|I,a) := \frac{N_k(I,a)\, \bar p_k^{ag}(J|I,a)}{N_k(I,a)+1} + \frac{\mathbb{1}(J = \bar I)}{N_k(I,a)+1}.$

Similarly, we "augment" $\hat M_k^{ag}$ into $\hat M_k^{ag+} = (\mathcal{I}, \mathcal{A}^+, \hat p_k^{ag+}, \hat r_k^{ag+})$ (the analogue of $\hat M_k^+$ in Sec. 3) by duplicating each action of $\hat M_k^{ag}$. At each episode $k$, SCCAL+ uses SCOPT (with the same parameters as in Sec. 3) to solve optimization problem (4) on $\hat M_k^{ag+}$. This is possible because, although the state space of $M^*$ is uncountable, $\hat M_k^{ag+}$ has only $S < +\infty$ states. SCOPT returns an optimistic optimal policy $\pi_k$ satisfying the span constraint. This policy is defined on the discrete aggregated state space but can easily be extended to the continuous case by setting $\pi_k(s,a) := \pi_k(I(s),a)$ for any $(s,a)$ (with $I(s)$ mapping a state to the interval containing it).

4.2 Regret Analysis of SCCAL+

This section is devoted to the regret analysis of SCCAL+, with the main result summarized in Thm. 2.

Theorem 2. For any MDP $M$ satisfying Asm. 2 and such that $sp(h_M^*) \le c$, with probability at least $1 - \delta$ it holds that for any $T \ge 1$, the regret of SCCAL+ is bounded as

$\Delta(\text{SCCAL+}, T) = O\Big( \max\{r_{max}, c\} \Big( \sqrt{S A T \ln(T/\delta)} + S^2 A \ln^2(T/\delta) + \rho_L S^{-\alpha} T \Big) \Big).$

By setting $S = (T/A)^{1/(2\alpha+2)}$, the bound becomes $\widetilde{O}\big( \max\{r_{max}, c\}\, \rho_L \sqrt{A}\, T^{(\alpha+2)/(2\alpha+2)} \big)$.

Thm. 2 shows that SCCAL+ achieves the same regret as UCCRL [2] while being the only implementable algorithm with such theoretical guarantees for this setting. Thm. 2 can be extended to the more general case where $\mathcal{S}$ is $d$-dimensional. As pointed out by [2], in this case $S^d$ intervals are needed for the discretization, leading to a regret bound of order $\widetilde{O}(T^{(2d+\alpha)/(2d+2\alpha)})$ after tuning $S = T^{1/(2d+2\alpha)}$. Finally, we believe that SCCAL+ can be extended to the setting considered by [16] where, in addition to Hölder conditions, the transition function is assumed to be $\kappa$-times smoothly differentiable. In the case of a Lipschitz model, i.e., $\alpha = 1$, this means that it is possible to obtain an asymptotic regret (as $\kappa \to \infty$) of $\widetilde{O}(T^{2/3})$ while SCCAL+ achieves $\widetilde{O}(T^{3/4})$.

Proof sketch. Thm. 2 can be seen as a generalization of Thm. 1, but the continuous nature of the state space makes the analysis more difficult. The main technical challenge lies in relating two MDPs with different state spaces: $\hat M_k^{ag}$ (with finite state space) and $M^*$ (with continuous state space). For instance, it is necessary to compare these two MDPs to prove optimism. To facilitate the comparison, we introduce an "intermediate" MDP $\hat M_k$ which has a continuous state space like $M^*$, but which also depends on the samples collected before episode $k$ like $\hat M_k^{ag}$.

Definition 1 (Empirical MDP with continuous state space). Let $\hat M_k = (\mathcal{S}, \mathcal{A}, \hat p_k, \hat r_k)$ be the continuous state space MDP s.t. for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, $\bar r_k(s,a) := \bar r_k^{ag}(I(s),a)$, $\hat r_k(s,a) := \bar r_k(s,a) + b_k(I(s),a)$ and

$\hat p_k(s'|s,a) := \frac{N_k(I(s),a)\, \bar p_k(s'|s,a)}{N_k(I(s),a)+1} + \frac{S \cdot \mathbb{1}(s' \in \bar I)}{N_k(I(s),a)+1},$

where $I : \mathcal{S} \to \mathcal{I}$ is the function mapping a state $s$ to the interval containing it, and $\bar p_k(\cdot|s,a)$ is the Radon-Nikodym derivative of the cumulative distribution function $F(u) = \sum_{s' \le u} \sum_{x \in I(s)} N_k(x,a,s')/N_k(I(s),a)$. The MDP $\hat M_k$ is designed so that: 1) the reward function is piece-wise constant over any interval in $\mathcal{I}$ and matches the reward function of $\hat M_k^{ag}$; 2) the transitions integrated over any $J \in \mathcal{I}$ are piece-wise constant and match the transitions of the discrete state space MDP $\hat M_k^{ag}$. More precisely, $\forall J \in \mathcal{I}$, $\int_J \bar p_k(s'|s,a)\, ds' = \bar p_k^{ag}(J|I(s),a)$, and so $\forall (s,J) \in \mathcal{S} \times \mathcal{I}$:

$\int_J \hat p_k(s'|s,a)\, ds' = \frac{N_k(I(s),a)\, \bar p_k^{ag}(J|I(s),a)}{N_k(I(s),a)+1} + \frac{\mathbb{1}(J = \bar I)}{N_k(I(s),a)+1} = \hat p_k^{ag}(J|I(s),a).$   (7)

This ensures that $\hat M_k^{ag}$ and $\hat M_k$ can be easily compared (and, as a consequence, so can $\hat M_k^{ag+}$ and $\hat M_k^+$, the augmented versions of $\hat M_k^{ag}$ and $\hat M_k$) although they have different state spaces, and we obtain:

Lemma 2. For any $k \ge 1$, $\hat g_k^{ag+} := g_c^*(\hat M_k^{ag+}) = g_c^*(\hat M_k^+) =: \hat g_k^+$.

Proof (see App. C.2). We notice that for any continuous function $v(s)$ defined on $\mathcal{S}$ and piece-wise constant on the intervals of $\mathcal{I}$, we can associate a discrete function $v'(I)$ (defined on $\mathcal{I}$) such that for all $s \in \mathcal{S}$, $v'(I(s)) = v(s)$.
Leveraging Lem. 2, it is sufficient to compare the gains of $\widehat{M}^{+}_k$ and $M^*$ to prove optimism. Since both MDPs have the same (continuous) state space, we can proceed as in Sec. 3.2 and just show that $\widehat{L}_k h^* \geq L h^*$ (analogue of Lem. 1), with the difference that $h^*$ is defined on a continuous space.

Lemma 3. Denote by $\widehat{L}_k$ the optimal Bellman operator of $\widehat{M}_k$. With probability at least $1 - \delta/5$, for all $k \geq 1$ we have $\widehat{L}_k h^* \geq L h^*$ (on the whole state space) and, as a consequence, $\widehat{g}^{ag+}_k \geq g^*$.

Proof (see Lem. 4 and 5 in App. C). The proof is similar to Lem. 1: we compare $\widehat{r}_k$ and $\widehat{p}_k$ with the true reward function $r$ and transition probabilities $p$ using concentration inequalities. Due to the aggregation of states, there are two major differences with the discrete case. The first difference is that $\widehat{p}_k$ is even more biased than before. Thanks to the smoothness assumption (Asm. 2), the extra bias is only of order $O(L S^{-\alpha})$ (this explains why this term appears in the definition of the bonus in (6)). The second difference is that, since there are uncountably many states, it is impossible to use a union bound argument on the set of states (as in Lem. 1). Instead, we show using optional skipping that the terms of interest are martingales, and we apply Azuma's and Freedman's inequalities. The rest of the proof is similar to SCAL+, with additional steps to deal with the continuous state space.

5 Numerical Simulations

We design experiments to investigate the learning performance in discrete and continuous MDPs (see App. E for details). In the discrete case, the main theoretical open question is whether the tighter exploration bonus translates into a better regret, that is, whether the dependency on the branching factor $\Gamma$ in the regret bound is due to the analysis or not.
Unfortunately, it is difficult to design experiments to thoroughly investigate the actual dependency. First, it is challenging to design MDPs where all parameters (i.e., gain, span, diameter, number of states and actions) are fixed but $\Gamma$ varies (e.g., the bigger $\Gamma$, the smaller the span, as the MDP is more connected). Furthermore, the regret bound is worst-case w.r.t. all MDPs with a given set of parameters, which is difficult to realize in practice. For these reasons, instead of investigating the exact dependency, we rather focus on comparing the performance of SCAL+ to UCRL for different values of $\Gamma$.

Figure 2: Cumulative regret (Garnet MDPs) and cumulative reward (continuous state MDPs). We report mean, max and min curves obtained over 50 independent runs.

We consider the Garnet($S$, $A$, $\Gamma$) family [20] of random MDPs. In all the experiments we take $S = 200$, $A = 3$ and $c = 2$, and we guarantee the MDPs to be communicating by setting $p(s_0 \mid s, a) \geq 0.01$ for every pair $(s, a)$ and an arbitrary state $s_0$. In order to provide a fair comparison of UCRL, SCAL and SCAL+, we consider Hoeffding-based confidence bounds with standardized constants: $\beta^r_k(s, a) = r_{\max}\sqrt{L_k / N_k(s, a)}$ and $\beta^p_k(s, a) = \sqrt{\Gamma L_k / N_k(s, a)}$ with $L_k = \log(SA/\delta_k)/4$ for UCRL and SCAL, and $b_k(s, a) = \beta^r_k(s, a) + c\,\beta^p_k(s, a)$ for SCAL+. Since Garnet($S$, $A$, $\Gamma$) defines a distribution over MDPs, we evaluate the algorithms on the MDP with median bias span (since the distribution shows relatively long tails, see App. E).
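The Garnet construction with the communicating fix described above can be sketched as follows; this is a hypothetical generator (the original Garnet procedure [20] and the exact experimental setup may differ in details such as how the successor weights are drawn):

```python
import random

def garnet(S, A, Gamma, p_min=0.01, s0=0, seed=0):
    """Sample a Garnet(S, A, Gamma)-style transition kernel: each state-action
    pair has Gamma randomly chosen successors with random probabilities.
    Every (s, a) also keeps mass p_min on a fixed state s0, which makes the
    MDP communicating as in the experiments."""
    rng = random.Random(seed)
    P = [[[0.0] * S for _ in range(A)] for _ in range(S)]
    for s in range(S):
        for a in range(A):
            successors = rng.sample(range(S), Gamma)   # branching factor Gamma
            weights = [rng.random() for _ in successors]
            total = sum(weights)
            for t, w in zip(successors, weights):
                P[s][a][t] = (1.0 - p_min) * w / total
            P[s][a][s0] += p_min   # s0 is reachable from every (s, a)
    return P

P = garnet(S=20, A=3, Gamma=5)
assert all(abs(sum(P[s][a]) - 1.0) < 1e-9 for s in range(20) for a in range(3))
```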
According to the theoretical analysis, the per-episode regret of UCRL scales as $O(\mathrm{sp}(h_k)\sqrt{\Gamma})$, where $\mathrm{sp}(h_k)$ is the span of the optimistic MDP, while SCAL+ has a per-episode regret of $O(c\sqrt{\Gamma})$, where $c$ is an upper bound on $\mathrm{sp}(h^*)$. While in the worst case $\mathrm{sp}(h_k) \leq D$, in the MDP we selected, UCRL always generates optimistic MDPs with span $\mathrm{sp}(h_k)$ smaller than $\mathrm{sp}(h^*) \leq c$. In this favorable case for UCRL, the only hope for SCAL+ to achieve better performance is if the tighter optimism translates into a per-episode regret of $O(c)$, with no dependency on $\Gamma$. This is indeed what we observed empirically. When $\Gamma = 5$, as expected, UCRL outperforms SCAL+ since $\mathrm{sp}(h_k)\sqrt{\Gamma} \leq c$ for most of the episodes. On the other hand, when $\Gamma = 144$, the tighter optimism of SCAL+ allows a faster convergence to the optimal solution compared to UCRL, since $\mathrm{sp}(h_k)\sqrt{\Gamma} \geq c$. Although this result does not provide a definite answer on whether and how the regret of SCAL+ scales with $\Gamma$, it hints at the fact that tighter optimism does indeed translate into better empirical performance w.r.t. confidence-based algorithms such as UCRL.

As SCCAL+ is the first implementable model-based algorithm with regret guarantees in continuous MDPs, we compare it to model-free heuristic variants. We consider RVI Q-learning [17] with either $\varepsilon$-greedy or UCB [9] exploration.³ Since Q-learning is model-free, it does not perform planning and updates the policy at each time step (the action selection is greedy w.r.t. the current estimate). Even in this case we harmonize the bonus such that $b(s, a) = \beta^r(s, a) + c\,\beta^p(s, a) + (r_{\max} + c)\rho L S^{-\alpha}$. We use the same uniform discretization of the state space for all the algorithms.
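A minimal sketch of one step of the heuristic baseline just described, with the harmonized bonus added to the reward. The code is illustrative: function names and the fixed reference-pair rule are assumptions, not the exact implementation of RVIQ [17]:

```python
import math

def harmonized_bonus(n_visits, L_k, r_max, c, Gamma, rho_L, S, alpha):
    """b(s,a) = beta_r + c*beta_p + (r_max + c)*rho_L*S^(-alpha), with the
    standardized Hoeffding-style widths used in the experiments."""
    beta_r = r_max * math.sqrt(L_k / n_visits)      # reward confidence width
    beta_p = math.sqrt(Gamma * L_k / n_visits)      # transition confidence width
    return beta_r + c * beta_p + (r_max + c) * rho_L * S ** (-alpha)

def rviq_step(Q, s, a, r, s_next, b, lr, ref=(0, 0)):
    """One RVI Q-learning update with an exploration bonus: the Q-value of a
    fixed reference pair plays the role of the average-reward estimate,
    which is subtracted to keep the iterates bounded."""
    target = r + b + max(Q[s_next]) - Q[ref[0]][ref[1]]
    Q[s][a] += lr * (target - Q[s][a])

# Toy usage: optimistic initialization q0 = c on a 2-state, 2-action table.
c = 2.0
Q = [[c, c], [c, c]]
b = harmonized_bonus(n_visits=10, L_k=1.0, r_max=1.0, c=c, Gamma=5,
                     rho_L=1.0, S=50, alpha=1.0)
rviq_step(Q, s=0, a=1, r=0.5, s_next=1, b=b, lr=0.1)
```

Note how the bonus shrinks as $1/\sqrt{N_k(s,a)}$: heavily visited pairs receive little optimism, while the constant $(r_{\max}+c)\rho L S^{-\alpha}$ term accounts for the discretization bias.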
We considering a\ncontinuous version of the RiverSwim [7] discretized into S = 50 states (\u03c1L = \u03b1 = 1, c = 30, S \u2286 R)\nand the ShipSteering domain [21] with S = |I| = 512 discrete states (\u03c1L = 5, \u03b1 = 1, c = 1.5,\nS \u2286 R3) (see the App. E for MountainCar [22]). In both cases, RVIQ shows an unstable behaviour.\nIn the RiverSwim it outperforms the other approaches when optimistically initialized (i.e., q0 = c)\nwhile the same con\ufb01guration fails to learn in the ShipSteering. Moreover, RVIQ with q0 = 0 shows\nthe ability to learn in the ShipSteering but also high variance. This undesired behavior is typical\nof unstable algorithms (we observed linear regret in some run). RVIQ-UCB is able to learn in the\nRiverSwim but not in the ShipSteering. The only stable algorithm in both domains is SCCAL+.\n6 Conclusion\nWe derive the \ufb01rst regret analysis of exploration bonus for average reward with discrete and continuous\nstate space by leveraging on an upper-bound to the range of the optimal bias function to properly scale\nthe bonus (as done in other settings). It is an open question whether an exploration bonus approach\nis still possible when no prior knowledge on the span of the optimal bias function is available [see\ne.g., 11, 23]. Despite the \u221a\u0393 improvement in the de\ufb01nition of the exploration bonus (i.e., optimism)\ncompared to con\ufb01dence-set-based algorithms, the \ufb01nal regret still scales with \u0393 leaving it as an open\nquestion whether such dependency can be actually removed in non-ergodic MDPs.\n\n3Refer to App. E.1 for details about RVIQ and RVIQ-UCB. 
There is no known regret bound for model-free algorithms in the average reward setting; we think this is an interesting line of research for future work.

References

[1] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In ICML, Proceedings of Machine Learning Research. PMLR, 2018.

[2] Ronald Ortner and Daniil Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In NIPS, pages 1772–1780, 2012.

[3] Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In NIPS, pages 1471–1479, 2016.

[4] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, pages 2750–2759, 2017.

[5] Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2721–2730. PMLR, 2017.

[6] Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. CoRR, abs/1706.08090, 2017.

[7] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

[8] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos.
Minimax regret bounds for reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 2017.

[9] Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? CoRR, abs/1807.03765, 2018.

[10] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994. ISBN 0471619779.

[11] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[12] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. In NIPS, pages 3169–3179, 2017.

[13] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.

[14] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.

[15] Ronald Ortner. Regret bounds for reinforcement learning via Markov chain concentration. CoRR, abs/1808.01813, 2018. URL http://arxiv.org/abs/1808.01813.

[16] K. Lakshmanan, Ronald Ortner, and Daniil Ryabko. Improved regret bounds for undiscounted continuous reinforcement learning. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 524–532. JMLR.org, 2015.

[17] Jinane Abounadi, Dimitri P. Bertsekas, and Vivek S. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM J. Control and Optimization, 40(3):681–698, 2001.

[18] Ronald Ortner. Optimism in the face of uncertainty should be refutable.
Minds and Machines, 18(4):521–526, 2008.

[19] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.

[20] T. W. Archibald, K. I. M. McKinnon, and L. C. Thomas. On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3):354–361, 1995.

[21] Michael T. Rosenstein and Andrew G. Barto. Supervised actor-critic reinforcement learning. Handbook of learning and approximate dynamic programming, 2:359, 2004.

[22] Andrew William Moore. Efficient memory-based learning for robot control. Technical report, University of Cambridge, 1990.

[23] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In NIPS, 2018.

[24] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012. ISSN 1935-8237. doi: 10.1561/2200000024.

[25] Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. CoRR, abs/1802.09184, 2018.

[26] Dimitri P. Bertsekas. Dynamic programming and optimal control. Vol. II. Athena Scientific, Belmont, MA, 1995.

[27] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Pre-publication version, 2018. URL http://downloads.tor-lattimore.com/banditbook/book.pdf.

[28] Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales. Springer Texts in Statistics. World Publishing Company, 1988. ISBN 9780387966953.

[29] Eyal Even-Dar and Yishay Mansour. Convergence of optimistic and incremental Q-learning. In NIPS, pages 1499–1506. MIT Press, 2001.

[30] A. Klenke and M. Loève. Probability Theory: A Comprehensive Course.
Graduate Texts in Mathematics. Springer, 2013. ISBN 9781447153627.

[31] David A. Freedman. On tail probabilities for martingales. Ann. Probab., 3(1):100–118, 1975. doi: 10.1214/aop/1176996452.

[32] Nicolò Cesa-Bianchi and Claudio Gentile. Improved risk tail bounds for on-line algorithms. In NIPS, pages 195–202, 2005.