{"title": "Worst-Case Regret Bounds for Exploration via Randomized Value Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 14433, "page_last": 14443, "abstract": "This paper studies a recent proposal to use randomized value functions to drive exploration in reinforcement learning. These randomized value functions are generated by injecting random noise into the training data, making the approach compatible with many popular methods for estimating parameterized value functions. By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, we show that planning with respect to these randomized value functions can induce provably efficient exploration.", "full_text": "Worst-Case Regret Bounds for Exploration via\n\nRandomized Value Functions\n\nDaniel Russo\n\nColumbia University\n\ndjr2174@gsb.columbia.edu\n\nAbstract\n\nThis paper studies a recent proposal to use randomized value functions to drive\nexploration in reinforcement learning. These randomized value functions are\ngenerated by injecting random noise into the training data, making the approach\ncompatible with many popular methods for estimating parameterized value func-\ntions. By providing a worst-case regret bound for tabular \ufb01nite-horizon Markov\ndecision processes, we show that planning with respect to these randomized value\nfunctions can induce provably ef\ufb01cient exploration.\n\n1\n\nIntroduction\n\nExploration is one of the central challenges in reinforcement learning (RL). 
A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near optimal policy through interaction alone [5, 8, 10, 11, 13-16, 25, 26]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP.

It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through ε-greedy or Boltzmann exploration. Those simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite state MDPs [see e.g. 24]. This paper is inspired by the search for principled exploration algorithms that both (1) are compatible with practical function learning algorithms and (2) provide robust performance, at least when specialized to simple benchmarks like tabular MDPs.

Our focus will be on methods that generate exploration by planning with respect to randomized value function estimates. This idea was first proposed in a conference paper by [22] and is investigated more thoroughly in the journal paper [24]. It is inspired by work on posterior sampling for reinforcement learning (a.k.a. Thompson sampling) [20, 27], which could be interpreted as sampling a value function from a posterior distribution and following the optimal policy under that value function for some extended period of time before resampling. A number of papers have subsequently investigated approaches that generate randomized value functions in complex reinforcement learning problems [6, 9, 12, 21, 23, 28, 29].
Our theory will focus on a specific approach of [22, 24], dubbed randomized least squares value iteration (RLSVI), as specialized to tabular MDPs. The name is a play on the classic least-squares policy iteration algorithm (LSPI) of [17]. RLSVI generates a randomized value function (essentially) by judiciously injecting Gaussian noise into the training data and then applying LSPI to this noisy dataset. One could naturally follow the same template while using other value learning algorithms in place of LSPI.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This is a strikingly simple algorithm, but providing rigorous theoretical guarantees has proved challenging. One challenge is that, despite the appealing conceptual connections, there are significant subtleties to any precise link between RLSVI and posterior sampling. The issue is that posterior sampling based approaches are derived from a true Bayesian perspective in which one maintains beliefs over the underlying MDP. The approaches of [6, 9, 12, 23, 24, 28, 29] model only the value function, so Bayes rule is not even well defined.¹ The work of [22, 24] uses stochastic dominance arguments to relate the value function sampling distribution of RLSVI to a correct posterior in a Bayesian model where the true MDP is randomly drawn. This gives substantial insight, but the resulting analysis is not entirely satisfying as a robustness guarantee. It bounds regret on average over MDPs with transition kernels drawn from a particular Dirichlet prior, but one may worry that hard reinforcement learning instances are extremely unlikely under this particular prior.

This paper develops a very different proof strategy and provides a worst-case regret bound for RLSVI applied to tabular finite-horizon MDPs.
The crucial proof steps are to show that each randomized value function sampled by RLSVI has a significant probability of being optimistic (see Lemma 4) and then to show that from this property one can reduce regret analysis to concentration arguments pioneered by [13] (see Lemmas 6, 7). This approach is inspired by frequentist analysis of Thompson sampling for linear bandits [2] and especially the lucid description of [1]. However, applying these ideas in reinforcement learning appears to require novel analysis. The only prior extension of these proof techniques to tabular reinforcement learning was carried out by [3]. Reflecting the difficulty of such analyses, that paper does not provide regret bounds for a pure Thompson sampling algorithm; instead their algorithm samples many times from the posterior to form an optimistic model, as in the BOSS algorithm [4]. Also, unfortunately, there is a significant error in that paper's analysis and the correction has not yet been posted online, making a careful comparison difficult at this time.

The established regret bounds are not state of the art for tabular finite-horizon MDPs. Two steps in the proofs introduce extra factors of $\sqrt{S}$ in the bounds, where $S$ denotes the number of states. I hope some smart reader can improve this by intelligently adapting the techniques of [5, 11]. However, the primary goal of the paper is not to give the tightest possible regret bound, but to broaden the set of exploration approaches known to satisfy polynomial worst-case regret bounds. To this author, it is both fascinating and beautiful that carefully adding noise to the training data generates sophisticated exploration, and proving this formally is worthwhile.

2 Problem formulation

We consider the problem of learning to optimize performance through repeated interactions with an unknown finite horizon MDP $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$.
The agent interacts with the environment across $K$ episodes. Each episode proceeds over $H$ periods, where for period $h \in \{1, \ldots, H\}$ of episode $k$ the agent is in state $s^k_h \in \mathcal{S} = \{1, \ldots, S\}$, takes action $a^k_h \in \mathcal{A} = \{1, \ldots, A\}$, observes the reward $r^k_h \in [0, 1]$ and, for $h < H$, also observes the next state $s^k_{h+1} \in \mathcal{S}$. Let $\mathcal{H}_{k-1} = \{(s^i_h, a^i_h, r^i_h) : h = 1, \ldots, H,\ i = 1, \ldots, k-1\}$ denote the history of interactions prior to episode $k$. The Markov transition kernel $P$ encodes the transition probabilities, with

$$P_{h, s^k_h, a^k_h}(s) = \mathbb{P}\big(s^k_{h+1} = s \mid a^k_h, s^k_h, \ldots, a^k_1, s^k_1, \mathcal{H}_{k-1}\big).$$

The reward distribution is encoded in $R$, with

$$R_{h, s^k_h, a^k_h}(dr) = \mathbb{P}\big(r^k_h = dr \mid a^k_h, s^k_h, \ldots, a^k_1, s^k_1, \mathcal{H}_{k-1}\big).$$

We usually instead refer to expected rewards encoded in a vector $R$ that satisfies $R_{h,s,a} = \mathbb{E}[r^k_h \mid s^k_h = s, a^k_h = a]$. We then refer to an MDP $(H, \mathcal{S}, \mathcal{A}, P, R, s_1)$, described in terms of its expected rewards rather than its reward distribution, as this is sufficient to determine the expected value accrued by any policy. The variable $s_1$ denotes a deterministic initial state and we assume $s^k_1 = s_1$ for every episode $k$. At the expense of complicating some formulas, the entire paper could also be written assuming initial states are drawn from some distribution over $\mathcal{S}$, which is more standard in the literature.

A deterministic Markov policy $\pi = (\pi_1, \ldots, \pi_H)$ is a sequence of functions, where each $\pi_h : \mathcal{S} \to \mathcal{A}$ prescribes an action to play in each state. We let $\Pi$ denote the space of all such policies. We use $V^\pi_h \in \mathbb{R}^S$ to denote the value function associated with policy $\pi$ in the sub-episode consisting of periods $\{h, \ldots, H\}$.

¹The precise issue is that, even given a prior over value functions, there is no likelihood function.
Given an MDP, there is a well specified likelihood of transitioning from state $s$ to another $s'$, but a value function does not specify a probabilistic data-generating model.

To simplify many expressions, we set $V^\pi_{H+1} = 0 \in \mathbb{R}^S$. Then the value functions for $h \le H$ are the unique solution to the Bellman equations

$$V^\pi_h(s) = R_{h,s,\pi(s)} + \sum_{s' \in \mathcal{S}} P_{h,s,\pi(s)}(s')\, V^\pi_{h+1}(s') \qquad s \in \mathcal{S},\ h = 1, \ldots, H.$$

The optimal value function is $V^*_h(s) = \max_{\pi \in \Pi} V^\pi_h(s)$.

An episodic reinforcement learning algorithm Alg is a possibly randomized procedure that associates each history with a policy to employ throughout the next episode. Formally, a randomized algorithm can depend on random seeds $\{\xi_k\}_{k \in \mathbb{N}}$ drawn independently of the past from some prespecified distribution. Such an episodic reinforcement learning algorithm selects a policy $\pi_k = \mathrm{Alg}(\mathcal{H}_{k-1}, \xi_k)$ to be employed throughout episode $k$.

The cumulative expected regret incurred by Alg over $K$ episodes of interaction with the MDP $M$ is

$$\mathrm{Regret}(M, K, \mathrm{Alg}) = \mathbb{E}_{\mathrm{Alg}}\left[\sum_{k=1}^{K} V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1)\right]$$

where the expectation is taken over the random seeds used by a randomized algorithm and the randomness in the observed rewards and state transitions that influence the algorithm's chosen policies. This expression captures the algorithm's cumulative expected shortfall in performance relative to an omniscient benchmark, which knows and always employs the true optimal policy.

Of course, regret as formulated above depends on the MDP $M$ to which the algorithm is applied. Our goal is not to minimize regret under a particular MDP but to provide a guarantee that holds uniformly across a class of MDPs.
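As a sanity check on the finite-horizon Bellman recursion above, a tabular MDP can be solved by backward induction in a few lines. This is a minimal sketch, not the paper's code; the array layout (time-indexed reward and transition arrays) is an assumption of this sketch:

```python
import numpy as np

def backward_induction(P, R):
    """Solve a finite-horizon tabular MDP by dynamic programming.

    P: array (H, S, A, S), P[h, s, a, s'] = transition probability.
    R: array (H, S, A), expected rewards.
    Returns optimal values V of shape (H + 1, S) and a greedy policy pi of shape (H, S).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))          # V[H] = 0 is the terminal convention
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Q[s, a] = R[h, s, a] + sum_{s'} P[h, s, a, s'] V[h + 1, s']
        Q = R[h] + P[h] @ V[h + 1]
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi
```

The same routine, run on the noise-perturbed empirical MDP introduced in Section 3, yields the policy RLSVI plays in an episode.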
This can be expressed more formally by considering a class $\mathcal{M}$ containing all MDPs with $S$ states, $A$ actions, $H$ periods, and reward distributions bounded in $[0, 1]$. Our goal is to bound the worst-case regret $\sup_{M \in \mathcal{M}} \mathrm{Regret}(M, K, \mathrm{Alg})$ incurred by an algorithm throughout $K$ episodes of interaction with an unknown MDP in this class. We aim for a bound on worst-case regret that scales sublinearly in $K$ and has some reasonable polynomial dependence on the size of the state space, action space, and horizon. We won't explicitly maximize over $\mathcal{M}$ in the analysis. Instead, we fix an arbitrary MDP $M$ and seek to bound regret in a way that does not depend on the particular transition probabilities or reward distributions under $M$.

It is worth remarking that, as formulated, our algorithm knows $S$, $A$, and $H$ but does not have knowledge of the number of episodes $K$. Indeed, we study a so-called anytime algorithm that has good performance for all sufficiently long sequences of interaction.

Notation for empirical estimates. We define $n_k(h, s, a) = \sum_{\ell=1}^{k-1} \mathbb{1}\{(s^\ell_h, a^\ell_h) = (s, a)\}$ to be the number of times action $a$ has been sampled in state $s$, period $h$.
For every tuple $(h, s, a)$ with $n_k(h, s, a) > 0$, we define the empirical mean reward and empirical transition probabilities up to period $h$ by

$$\hat{R}^k_{h,s,a} = \frac{1}{n_k(h,s,a)} \sum_{\ell=1}^{k-1} \mathbb{1}\{(s^\ell_h, a^\ell_h) = (s, a)\}\, r^\ell_h \tag{1}$$

$$\hat{P}^k_{h,s,a}(s') = \frac{1}{n_k(h,s,a)} \sum_{\ell=1}^{k-1} \mathbb{1}\{(s^\ell_h, a^\ell_h, s^\ell_{h+1}) = (s, a, s')\} \qquad \forall s' \in \mathcal{S}. \tag{2}$$

If $(h, s, a)$ was never sampled before episode $k$, we define $\hat{R}^k_{h,s,a} = 0$ and $\hat{P}^k_{h,s,a} = 0 \in \mathbb{R}^S$.

3 Randomized Least Squares Value Iteration

This section describes an algorithm called Randomized Least Squares Value Iteration (RLSVI). We describe RLSVI as specialized to a simple tabular problem in a way that is most convenient for the subsequent theoretical analysis. A mathematically equivalent definition, which defines RLSVI as estimating a value function on randomized training data, extends more gracefully. This interpretation is given at the end of the section and more carefully in [24].

At the start of episode $k$, the agent has observed a history of interactions $\mathcal{H}_{k-1}$. Based on this, it is natural to consider an estimated MDP $\hat{M}^k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k, s_1)$ with empirical estimates of mean rewards and transition probabilities. These are precisely defined in Equations (1)-(2) and the surrounding text. We could use backward recursion to solve for the optimal policy and value functions under the empirical MDP, but applying this policy would not generate exploration.

RLSVI builds on this idea, but to induce exploration it judiciously adds Gaussian noise before solving for an optimal policy. We can define RLSVI concisely as follows.
In episode $k$ it samples a random vector with independent components $w_k \in \mathbb{R}^{HSA}$, where $w_k(h, s, a) \sim N\big(0, \sigma^2_k(h, s, a)\big)$. We define $\sigma_k(h, s, a) = \sqrt{\beta_k / (n_k(h,s,a) + 1)}$, where $\beta_k$ is a tuning parameter and the denominator shrinks like the standard deviation of the average of $n_k(h, s, a)$ i.i.d. samples. Given $w_k$, we construct a randomized perturbation of the empirical MDP $\overline{M}^k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + w_k, s_1)$ by adding the Gaussian noise to estimated rewards. RLSVI solves for the optimal policy $\pi_k$ under this MDP and applies it throughout the episode. This policy is, of course, greedy with respect to the (randomized) value functions under $\overline{M}^k$. The random noise $w_k$ in RLSVI should be large enough to dominate the error introduced by performing a noisy Bellman update using $\hat{P}^k$ and $\hat{R}^k$. We set $\beta_k = \tilde{O}(H^3)$ in the analysis, where functions of $H$ offer a coarse bound on quantities like the variance of an empirically estimated Bellman update. For $\beta = \{\beta_k\}_{k \in \mathbb{N}}$, we denote this algorithm by RLSVI$_\beta$.

RLSVI as regression on perturbed data. To extend beyond simple tabular problems, it is fruitful to view RLSVI, as in Algorithm 1, as an algorithm that performs recursive least squares estimation on the state-action value function. Randomization is injected into these value function estimates by perturbing observed rewards and by regularizing to a randomized prior sample. This prior sample is essential, as otherwise there would be no randomness in the estimated value function in initial periods. This procedure is the LSPI algorithm of [17] applied with noisy data and a tabular representation.
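The concise definition above can be sketched directly: perturb the empirical rewards with noise of standard deviation $\sigma_k(h,s,a) = \sqrt{\beta_k/(n_k(h,s,a)+1)}$ and solve the perturbed MDP by backward induction. This is an illustrative sketch under assumed array conventions, not the paper's implementation:

```python
import numpy as np

def rlsvi_episode(counts, R_hat, P_hat, beta_k, rng):
    """One episode of tabular RLSVI.

    counts: (H, S, A) visit counts n_k(h, s, a)
    R_hat:  (H, S, A) empirical mean rewards
    P_hat:  (H, S, A, S) empirical transition probabilities
    Returns the greedy policy (H, S) under the noise-perturbed MDP.
    """
    H, S, A = R_hat.shape
    sigma = np.sqrt(beta_k / (counts + 1.0))           # sigma_k(h, s, a)
    R_pert = R_hat + sigma * rng.normal(size=(H, S, A))
    V = np.zeros(S)                                    # terminal value V_{H+1} = 0
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = R_pert[h] + P_hat[h] @ V                   # perturbed Bellman update, (S, A)
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi
```

With large visit counts (or tiny $\beta_k$) the noise vanishes and the sketch reduces to greedy planning in the empirical MDP, which is exactly the no-exploration baseline the noise is meant to fix.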
The paper [24] includes many experiments with non-tabular representations. It should be stressed that although data perturbations are sometimes used to regularize machine learning algorithms, here they are used only to drive exploration.

Algorithm 1: RLSVI for Tabular, Finite Horizon MDPs

input: $H$, $S$, $A$, tuning parameters $\{\beta_k\}_{k \in \mathbb{N}}$

(1) for episodes $k = 1, 2, \ldots$ do
      /* Define squared temporal difference error */
(2)   $L(Q \mid Q^{\mathrm{next}}, \mathcal{D}) = \sum_{(s,a,r,s') \in \mathcal{D}} \big(Q(s, a) - r - \max_{a' \in \mathcal{A}} Q^{\mathrm{next}}(s', a')\big)^2$
      /* Past data */
(3)   $\mathcal{D}_h = \{(s^\ell_h, a^\ell_h, r^\ell_h, s^\ell_{h+1}) : \ell < k\}$ for $h < H$;
(4)   $\mathcal{D}_H = \{(s^\ell_H, a^\ell_H, r^\ell_H, \emptyset) : \ell < k\}$
      /* Randomly perturb data */
(5)   for time periods $h = 1, \ldots, H$ do
(6)       Sample array $\tilde{Q}_h \sim N(0, \beta_k I)$  /* Draw prior sample */
(7)       $\tilde{\mathcal{D}}_h \leftarrow \{\}$
(8)       for $(s, a, r, s') \in \mathcal{D}_h$ do
(9)           sample $w \sim N(0, \beta_k)$
(10)          $\tilde{\mathcal{D}}_h \leftarrow \tilde{\mathcal{D}}_h \cup \{(s, a, r + w, s')\}$
(11)      end
(12)  end
      /* Estimate Q on noisy data */
(13)  Define terminal value $\hat{Q}_{H+1}(s, a) \leftarrow 0 \;\; \forall s, a$
(14)  for time periods $h = H, \ldots, 1$ do
(15)      $\hat{Q}_h \leftarrow \mathrm{argmin}_{Q \in \mathbb{R}^{SA}}\; L(Q \mid \hat{Q}_{h+1}, \tilde{\mathcal{D}}_h) + \|Q - \tilde{Q}_h\|^2_2$
(16)  end
(17)  Apply greedy policy with respect to $(\hat{Q}_1, \ldots, \hat{Q}_H)$ throughout episode;
(18)  Observe data $s^k_1, a^k_1, r^k_1, \ldots, s^k_H, a^k_H, r^k_H$
(19) end

To understand this presentation of RLSVI, it is helpful to understand an equivalence between posterior sampling in a Bayesian linear model and fitting a regularized least squares estimate to
Consider Bayes updating of a scalar parameter \u03b8 \u223c N (0, \u03b2) based on noisy\nobservations Y = (y1, . . . , yn) where yi | \u03b8 \u223c N (0, \u03b2). The posterior distribution has the closed\nform \u03b8 | Y \u223c N\n. We could generate a sample from this distribution by \ufb01tting a\nleast squares estimate to noise perturbed data. Sample W = (w1, . . . , wn) where each wi \u223c N (0, \u03b2)\nis drawn independently and sample \u02dc\u03b8 \u223c N (0, \u03b2). Set \u02dcyi = yi + wi. Then\n\n(cid:80)n\n\n1 yi ,\n\n(cid:17)\n\nn+1\n\nn+1\n\n\u03b2\n\n(\u03b8 \u2212 \u02dcyi)2 + (\u03b8 \u2212 \u02dc\u03b8)2 =\n\n1\n\nn + 1\n\n\u02dcyi + \u02dc\u03b8\n\n(3)\n\n(cid:33)\n\n(cid:32) n(cid:88)\n\ni=1\n\nn(cid:88)\n\ni=1\n\n\u02c6\u03b8 = argmin\n\n(cid:16) 1\n\n\u03b8\u2208R\n\n(cid:80)n\n\n(cid:17)\n\n(cid:32)\n\n(cid:88)\n\ns(cid:48)\u2208S\n\n(cid:33)\n\n.\n\nsatis\ufb01es \u02c6\u03b8 | Y \u223c N\n. For more complex models, where exact posterior sampling\nis impossible, we may still hope estimation on randomly perturbed data generates samples that re\ufb02ect\nuncertainty in a sensible way. As far as RLSVI is concerned, roughly the same calculation shows that\nin Algorithm 1 \u02c6Qh(s, a) is equal to an empirical Bellman update plus Gaussian noise:\n\n1 yi ,\n\nn+1\n\nn+1\n\n\u03b2\n\n\u02c6Qh(s, a) | \u02c6Qh+1 \u223c N\n\n\u02c6Rh,s,a +\n\n\u02c6Ph,s,a(s(cid:48)) max\na(cid:48)\u2208A\n\n\u02c6Qh+1(s(cid:48), a(cid:48)) ,\n\n\u03b2k\n\nnk(h, s, a) + 1\n\nIt is worth noting that Algorithm 1 can be naturally applied to settings with function approximation.\nIn line 15, instead of minimizing over all possible state-action value functions Q \u2208 RS\u00d7A, we\nminimize over the parameter \u03b8 de\ufb01ning some approximate value function Q\u03b8. Instead of regularizing\ntoward a random prior sample \u02dcQh in line 15, the methods in [24] regularize toward a random prior\nparameter \u02dc\u03b8h. 
See [23] for a study of these randomized prior samples in deep reinforcement learning.

4 Main result

Theorem 1 establishes that RLSVI satisfies a worst-case polynomial regret bound for tabular finite-horizon MDPs. It is worth contrasting RLSVI with ε-greedy exploration and Boltzmann exploration, which are both widely used randomization approaches to exploration. Those simple methods explore by directly injecting randomness into the action chosen at each timestep. Unfortunately, they can fail catastrophically even on simple examples with a finite state space, requiring a time to learn that scales exponentially in the size of the state space. Instead, RLSVI generates randomization by training value functions with randomly perturbed rewards. Theorem 1 confirms that this approach generates a sophisticated form of exploration fundamentally different from ε-greedy exploration and Boltzmann exploration. The notation $\tilde{O}$ ignores poly-logarithmic factors in $H$, $S$, $A$ and $K$.

Theorem 1. Let $\mathcal{M}$ denote the set of MDPs with horizon $H$, $S$ states, $A$ actions, and rewards bounded in $[0, 1]$. Then for a tuning parameter sequence $\beta = \{\beta_k\}_{k \in \mathbb{N}}$ with $\beta_k = \frac{1}{2} S H^3 \log(2HSAk)$,

$$\sup_{M \in \mathcal{M}} \mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le \tilde{O}\left(H^3 S^{3/2} \sqrt{AK}\right).$$

This bound is not state of the art and that is not the main goal of this paper. I conjecture that the extra factor of $\sqrt{S}$ can be removed from this bound through a careful analysis, making the dependence on $S$, $A$, and $K$ optimal. This conjecture is supported by numerical experiments and (informally) by a Bayesian regret analysis [24]. One extra $\sqrt{S}$ appears to come from a step at the very end of the proof in Lemma 7, where we bound a certain $L_1$ norm as in the analysis style of [13]. For optimistic algorithms, some recent work has avoided directly bounding that $L_1$-norm, yielding a tighter regret guarantee [5, 11].
Another factor of $\sqrt{S}$ stems from the choice of $\beta_k$, which is used in the proof of Lemma 5. This seems similar to an extra $\sqrt{d}$ factor that appears in worst-case regret upper bounds for Thompson sampling in $d$-dimensional linear bandit problems [1].

Remark 1. Some translation is required to relate the dependence on $H$ with other literature. Many results are given in terms of the number of periods $T = KH$, which masks a factor of $H$. Also, unlike e.g. [5], this paper treats time-inhomogeneous transition kernels. In some sense agents must learn about $H$ extra state/action pairs. Roughly speaking then, our result exactly corresponds to what one would get by applying the UCRL2 analysis [13] to a time-inhomogeneous finite-horizon problem.

5 Proof of Theorem 1

The proof follows from several lemmas. Some are (possibly complex) technical adaptations of ideas present in many regret analyses. Lemmas 4 and 6 are the main discoveries that prompted this paper. Throughout we use the following notation: for any MDP $\tilde{M} = (H, \mathcal{S}, \mathcal{A}, \tilde{P}, \tilde{R}, s_1)$, let $V(\tilde{M}, \pi) \in \mathbb{R}$ denote the value function corresponding to policy $\pi$ from the initial state $s_1$. In this notation, for the true MDP $M$ we have $V(M, \pi) = V^\pi_1(s_1)$.

A concentration inequality. Through a careful application of Hoeffding's inequality, one can give a high probability bound on the error in applying a Bellman update to the (non-random) optimal value function $V^*_{h+1}$.
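For intuition, the resulting width $\sqrt{e_k(h, s, a)} = H\sqrt{\log(2HSAk)/(n_k(h,s,a)+1)}$, defined below, shrinks like $1/\sqrt{n}$ as a state-action pair is visited. A tiny sketch with hypothetical counts (the function name and arguments are illustrative):

```python
import math

def confidence_width(H, S, A, k, n):
    """Hoeffding-style width sqrt(e_k) = H * sqrt(log(2*H*S*A*k) / (n + 1))."""
    return H * math.sqrt(math.log(2 * H * S * A * k) / (n + 1))

# The width shrinks like 1/sqrt(n + 1); the log factor is shared,
# so the ratio between n = 0 and n = 99 is exactly sqrt(100) = 10.
w_unvisited = confidence_width(H=10, S=5, A=3, k=100, n=0)
w_visited = confidence_width(H=10, S=5, A=3, k=100, n=99)
```
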
Through this, and a union bound, Lemma 2 bounds the expected number of times the empirically estimated MDP falls outside the confidence set

$$\mathcal{M}_k = \left\{ (H, \mathcal{S}, \mathcal{A}, P', R', s_1) : \forall (h, s, a)\ \ \big| (R'_{h,s,a} - R_{h,s,a}) + \langle P'_{h,s,a} - P_{h,s,a},\, V^*_{h+1} \rangle \big| \le \sqrt{e_k(h, s, a)} \right\}$$

where we define

$$\sqrt{e_k(h, s, a)} = H \sqrt{\frac{\log(2HSAk)}{n_k(h, s, a) + 1}}.$$

This set is only a tool in the analysis and cannot be used by the agent since $V^*_{h+1}$ is unknown.

Lemma 2 (Validity of confidence sets). $\sum_{k=1}^\infty \mathbb{P}\big(\hat{M}^k \notin \mathcal{M}_k\big) \le \frac{\pi^2}{6}$.

From value function error to on policy Bellman error. For some fixed policy $\pi$, the next simple lemma expresses the gap between the value functions under two MDPs in terms of the differences between their Bellman operators. Results like this are critical to many analyses in the RL literature. Notice the asymmetric roles of $\tilde{M}$ and $\overline{M}$. The value functions correspond to one MDP while the state trajectory is sampled in the other. We'll apply the lemma twice: once where $\tilde{M}$ is the true MDP and $\overline{M}$ is the estimated one used by RLSVI and once where the roles are reversed.

Lemma 3. Consider any policy $\pi$ and two MDPs $\tilde{M} = (H, \mathcal{S}, \mathcal{A}, \tilde{P}, \tilde{R}, s_1)$ and $\overline{M} = (H, \mathcal{S}, \mathcal{A}, \overline{P}, \overline{R}, s_1)$. Let $\tilde{V}^\pi_h$ and $\overline{V}^\pi_h$ denote the respective value functions of $\pi$ under $\tilde{M}$ and $\overline{M}$. Then

$$\overline{V}^\pi_1(s_1) - \tilde{V}^\pi_1(s_1) = \mathbb{E}^{\pi, \overline{M}} \left[ \sum_{h=1}^H \left( \overline{R}_{h, s_h, \pi(s_h)} - \tilde{R}_{h, s_h, \pi(s_h)} \right) + \langle \overline{P}_{h, s_h, \pi(s_h)} - \tilde{P}_{h, s_h, \pi(s_h)},\, \tilde{V}^\pi_{h+1} \rangle \right],$$

where $\tilde{V}^\pi_{H+1} \equiv 0 \in \mathbb{R}^S$ and the expectation is over the sampled state trajectory $s_1, \ldots, s_H$ drawn from following $\pi$ in the MDP $\overline{M}$.

Proof.

$$\begin{aligned} \overline{V}^\pi_1(s_1) - \tilde{V}^\pi_1(s_1) &= \overline{R}_{1, s_1, \pi(s_1)} + \langle \overline{P}_{1, s_1, \pi(s_1)},\, \overline{V}^\pi_2 \rangle - \tilde{R}_{1, s_1, \pi(s_1)} - \langle \tilde{P}_{1, s_1, \pi(s_1)},\, \tilde{V}^\pi_2 \rangle \\ &= \overline{R}_{1, s_1, \pi(s_1)} - \tilde{R}_{1, s_1, \pi(s_1)} + \langle \overline{P}_{1, s_1, \pi(s_1)} - \tilde{P}_{1, s_1, \pi(s_1)},\, \tilde{V}^\pi_2 \rangle + \langle \overline{P}_{1, s_1, \pi(s_1)},\, \overline{V}^\pi_2 - \tilde{V}^\pi_2 \rangle \\ &= \overline{R}_{1, s_1, \pi(s_1)} - \tilde{R}_{1, s_1, \pi(s_1)} + \langle \overline{P}_{1, s_1, \pi(s_1)} - \tilde{P}_{1, s_1, \pi(s_1)},\, \tilde{V}^\pi_2 \rangle + \mathbb{E}^{\pi, \overline{M}}\left[ \overline{V}^\pi_2(s_2) - \tilde{V}^\pi_2(s_2) \right]. \end{aligned}$$

Expanding this recursion gives the result.

Sufficient optimism through randomization. There is always the risk that, based on noisy observations, an RL algorithm incorrectly forms a low estimate of the value function at some state. This may lead the algorithm to avoid that state, therefore failing to gather the data needed to correct its faulty estimate. To avoid such scenarios, nearly all provably efficient RL exploration algorithms build purposefully optimistic estimates. RLSVI does not do this and instead generates a randomized value function. The following lemma is key to our analysis. It shows that, except in the rare event when it has grossly mis-estimated the underlying MDP, RLSVI has at least a constant chance of sampling an optimistic value function. Similar results can be proved for Thompson sampling with linear models [1]. Recall $M$ is the unknown true MDP with optimal policy $\pi^*$ and $\overline{M}^k$ is RLSVI's noise perturbed MDP under which $\pi_k$ is an optimal policy.

Lemma 4.
Let $\pi^*$ be an optimal policy for the true MDP $M$. If $\hat{M}^k \in \mathcal{M}_k$, then

$$\mathbb{P}\left( V(\overline{M}^k, \pi_k) \ge V(M, \pi^*) \mid \mathcal{H}_{k-1} \right) \ge \Phi(-1).$$

This result is more easily established through the following lemma, which avoids the need to carefully condition on the history $\mathcal{H}_{k-1}$ at each step. We conclude with the proof of Lemma 4 after.

Lemma 5. Fix any policy $\pi = (\pi_1, \ldots, \pi_H)$ and vector $e \in \mathbb{R}^{HSA}$ with $e(h, s, a) \ge 0$. Consider the MDP $M = (H, \mathcal{S}, \mathcal{A}, P, R, s_1)$ and alternative $\overline{R}$ and $\overline{P}$ obeying the inequality

$$-\sqrt{e(h, s, a)} \le \overline{R}_{h,s,a} - R_{h,s,a} + \langle \overline{P}_{h,s,a} - P_{h,s,a},\, V^\pi_{h+1} \rangle \le \sqrt{e(h, s, a)}$$

for every $s \in \mathcal{S}$, $a \in \mathcal{A}$ and $h \in \{1, \ldots, H\}$. Take $W \in \mathbb{R}^{HSA}$ to be a random vector with independent components where $W(h, s, a) \sim N(0, HS\, e(h, s, a))$. Let $\overline{V}^\pi_{1, W}$ denote the (random) value function of the policy $\pi$ under the MDP $\overline{M} = (H, \mathcal{S}, \mathcal{A}, \overline{P}, \overline{R} + W)$. Then

$$\mathbb{P}\left( \overline{V}^\pi_{1, W}(s_1) \ge V^\pi_1(s_1) \right) \ge \Phi(-1).$$

Proof. To start, we consider an arbitrary deterministic vector $w \in \mathbb{R}^{HSA}$ (thought of as a possible realization of $W$) and evaluate the gap in value functions $\overline{V}^\pi_{1, w}(s_1) - V^\pi_1(s_1)$. We can re-write this quantity by applying Lemma 3. Let $s = (s_1, \ldots, s_H)$ denote a random sequence of states drawn by simulating the policy $\pi$ in the MDP $\overline{M}$ from the deterministic initial state $s_1$. Set $a_h = \pi_h(s_h)$ for $h = 1, \ldots, H$. Then

$$\overline{V}^\pi_{1, w}(s_1) - V^\pi_1(s_1) = \mathbb{E}\left[ \sum_{h=1}^H w(h, s_h, \pi_h(s_h)) + \overline{R}_{h, s_h, \pi_h(s_h)} - R_{h, s_h, \pi_h(s_h)} + \langle \overline{P}_{h, s_h, \pi_h(s_h)} - P_{h, s_h, \pi_h(s_h)},\, V^\pi_{h+1} \rangle \right] \ge H\, \mathbb{E}\left[ \sum_{h=1}^H \frac{1}{H} \left( w(h, s_h, \pi_h(s_h)) - \sqrt{e(h, s_h, \pi_h(s_h))} \right) \right]$$

where the expectation is taken over the sequence of states $s = (s_1, \ldots, s_H)$. Define $d(h, s) = \frac{1}{H}\, \mathbb{P}(s_h = s)$ for every $h \le H$ and $s \in \mathcal{S}$. Then the above equation can be written as

$$\frac{1}{H}\left( \overline{V}^\pi_{1, w}(s_1) - V^\pi_1(s_1) \right) \ge \sum_{s \in \mathcal{S}, h \le H} d(h, s) \left( w(h, s, \pi_h(s)) - \sqrt{e(h, s, \pi_h(s))} \right) \ge \left( \sum_{s \in \mathcal{S}, h \le H} d(h, s)\, w(h, s, \pi_h(s)) \right) - \sqrt{HS} \sqrt{\sum_{s \in \mathcal{S}, h \le H} d(h, s)^2\, e(h, s, \pi_h(s))} := X(w),$$

where the second inequality applies Cauchy-Schwarz. Now, since

$$\sum_{s \in \mathcal{S}, h \le H} d(h, s)\, W(h, s, \pi_h(s)) \sim N\left(0,\ \sum_{s \in \mathcal{S}, h \le H} d(h, s)^2\, HS\, e(h, s, \pi_h(s))\right),$$

we have

$$X(W) \sim N\left( -\sqrt{HS \sum_{s \in \mathcal{S}, h \le H} d(h, s)^2\, e(h, s, \pi_h(s))},\ \ HS \sum_{s \in \mathcal{S}, h \le H} d(h, s)^2\, e(h, s, \pi_h(s)) \right).$$

By standardization, $\mathbb{P}(X(W) \ge 0) = \Phi(-1)$. Therefore, $\mathbb{P}\big(\overline{V}^\pi_{1, W}(s_1) - V^\pi_1(s_1) \ge 0\big) \ge \Phi(-1)$.

Proof of Lemma 4. Consider some history $\mathcal{H}_{k-1}$ with $\hat{M}^k \in \mathcal{M}_k$. Recall $\pi_k$ is the policy chosen by RLSVI, which is optimal under the MDP $\overline{M}^k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + w_k, s_1)$. Since $\sigma^2_k(h, s, a) = HS\, e_k(h, s, a)$, applying Lemma 5 conditioned on $\mathcal{H}_{k-1}$ shows that with probability at least $\Phi(-1)$, $V(\overline{M}^k, \pi^*) \ge V(M, \pi^*)$.
When this occurs, we always have $V(\overline{M}^k, \pi_k) \ge V(M, \pi^*)$, since by definition $\pi_k$ is optimal under $\overline{M}^k$.

Reduction to bounding online prediction error. The next lemma shows that the cumulative expected regret of RLSVI is bounded in terms of the total prediction error in estimating the value function of $\pi_k$. The critical feature of the result is that it only depends on the algorithm being able to estimate the performance of the policies it actually employs and therefore gathers data about. From here, the regret analysis will follow from concentration arguments alone. For the purposes of analysis, we let $\tilde{M}^k$ denote an imagined second sample drawn from the same distribution as the perturbed MDP $\overline{M}^k$ under RLSVI. More formally, let $\tilde{M}^k = (H, \mathcal{S}, \mathcal{A}, \hat{P}^k, \hat{R}^k + \tilde{w}_k, s_1)$ where $\tilde{w}_k(h, s, a) \mid \mathcal{H}_{k-1} \sim N(0, \sigma^2_k(h, s, a))$ is independent Gaussian noise. Conditioned on the history, $\tilde{M}^k$ has the same marginal distribution as $\overline{M}^k$, but it is statistically independent of the policy $\pi_k$ selected by RLSVI.

Lemma 6. For an absolute constant $c = \Phi(-1)^{-1} < 6.31$, we have

$$\mathrm{Regret}(M, K, \mathrm{RLSVI}_\beta) \le (c + 1)\, \mathbb{E}\left[ \sum_{k=1}^K |V(\overline{M}^k, \pi_k) - V(M, \pi_k)| \right] + c\, \mathbb{E}\left[ \sum_{k=1}^K |V(\tilde{M}^k, \pi_k) - V(M, \pi_k)| \right] + H \underbrace{\sum_{k=1}^K \mathbb{P}\big(\hat{M}^k \notin \mathcal{M}_k\big)}_{\le \pi^2/6}.$$

Online prediction error bounds. We complete the proof with concentration arguments. Set $\Delta^k_R(h, s, a) = \hat{R}^k_{h,s,a} - R_{h,s,a} \in \mathbb{R}$ and $\Delta^k_P(h, s, a) = \hat{P}^k_{h,s,a} - P_{h,s,a} \in \mathbb{R}^S$ to be the errors in estimating the mean reward and transition vector corresponding to $(h, s, a)$. The next result follows by bounding each term in Lemma 6.
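The constant $c = \Phi(-1)^{-1}$ in Lemma 6 comes from the optimism probability in Lemma 4: there, $X(W)$ is a Gaussian whose mean equals minus its standard deviation, and any such variable exceeds zero with probability exactly $\Phi(-1) \approx 0.1587$, regardless of scale. A quick numerical check (the scale is chosen arbitrarily):

```python
import math
import numpy as np

# Standard normal CDF at -1, via the error function: Phi(-1) = (1 + erf(-1/sqrt(2))) / 2
phi_minus_1 = 0.5 * (1 + math.erf(-1 / math.sqrt(2)))

# If X ~ N(-m, m^2) for any m > 0, then P(X >= 0) = P(Z >= 1) = Phi(-1).
rng = np.random.default_rng(1)
m = 3.7                                   # arbitrary scale; the answer does not depend on it
x = rng.normal(-m, m, size=1_000_000)
frac_optimistic = (x >= 0).mean()
```
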
Bounding each term is done by using Lemma 3 to expand $V(\bar{M}^k, \pi^k) - V(M, \pi^k)$ and $V(\bar{M}^k, \pi^k) - V(\tilde{M}^k, \pi^k)$. We focus our analysis on bounding $\mathbb{E}\big[\sum_{k=1}^{K} |V(\bar{M}^k, \pi^k) - V(M, \pi^k)|\big]$. The other term can be bounded in an identical manner², so we omit this analysis.

Lemma 7. Let $c = \Phi(-1)^{-1} < 6.31$. Then for any $K \in \mathbb{N}$,
$$
\mathbb{E}\left[\sum_{k=1}^{K} \left|V(\bar{M}^k, \pi^k) - V(M, \pi^k)\right|\right] \leq \sqrt{\mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\right]} \sqrt{\mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H-1} \left\|V^k_{h+1}\right\|_\infty^2\right]} + \mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H} \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] + \mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H} \left|w^k(h, s^k_h, a^k_h)\right|\right].
$$

The remaining lemmas complete the proof. At each stage, RLSVI adds Gaussian noise with standard deviation no larger than $\tilde{O}(H^{3/2}\sqrt{S})$. Ignoring extremely low probability events, we expect $\|V^k_{h+1}\|_\infty \leq \tilde{O}(H^{5/2}\sqrt{S})$ and hence $\sum_{h=1}^{H-1} \|V^k_{h+1}\|_\infty^2 \leq \tilde{O}(H^6 S)$.
The proof of this lemma makes this precise by applying appropriate maximal inequalities.

Lemma 8.
$$
\sqrt{\mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H-1} \left\|V^k_{h+1}\right\|_\infty^2\right]} = \tilde{O}\left(H^3\sqrt{SK}\right).
$$

²In particular, an analogue of Lemma 7 holds where we replace $\bar{M}^k$ with $\tilde{M}^k$, $V^k_{h+1}$ with the value function $\tilde{V}^k_{h+1}$ corresponding to policy $\pi^k$ in the MDP $\tilde{M}^k$, and the Gaussian noise $w^k$ with the fictitious noise terms $\tilde{w}^k$.

The next few lemmas are essentially a consequence of the analysis in [13] and many subsequent papers. We give proof sketches in the appendix. The main idea is to apply known concentration inequalities to bound $\|\epsilon^k_P(h, s^k_h, a^k_h)\|_1$, $|\epsilon^k_R(h, s^k_h, a^k_h)|$, or $|w^k(h, s^k_h, a^k_h)|$ in terms of either $1/n^k(h, s^k_h, a^k_h)$ or $1/\sqrt{n^k(h, s^k_h, a^k_h)}$, up to factors of $O(\log(SAKH))$. The pigeonhole principle gives $\sum_{k=1}^{K}\sum_{h=1}^{H-1} 1/\sqrt{n^k(h, s^k_h, a^k_h)} = O(\sqrt{SAKH})$.

Lemma 9.
$$
\mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H-1} \left\|\epsilon^k_P(h, s^k_h, a^k_h)\right\|_1^2\right] = \tilde{O}\left(S^2AH\right).
$$

Lemma 10.
$$
\mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H} \left|\epsilon^k_R(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(\sqrt{SAKH}\right).
$$

Lemma 11.
$$
\mathbb{E}\left[\sum_{k=1}^{K}\sum_{h=1}^{H} \left|w^k(h, s^k_h, a^k_h)\right|\right] = \tilde{O}\left(H^{3/2}S\sqrt{AKH}\right).
$$

6 Extensions and open directions

This paper gives the first worst-case regret bounds for algorithms that use randomized value functions to drive exploration.
That the bounds are polynomial in all parameters indicates that adding noise during value function training generates a sophisticated form of deep exploration that randomizing actions does not [24]. I hope this paper serves as a useful foundation for future analysis, as many questions remain open. One glaring open problem is to study these approaches in problems that require generalization across large state spaces. Another is to study ensemble approaches [19, 21, 24] that avoid re-estimating the value function in each episode.

There are also clear open questions in the tabular setting. The first, which I am pursuing, is to tighten the dependence on $S$ in the bounds. Another is to tighten the dependence on $H$. I suspect attaining the optimal dependence on $H$ would require adjusting the variances of the noise perturbations in a more adaptive manner. Another question is to extend these proof techniques to handle time-homogeneous MDPs, where there are additional statistical dependencies that would break the current proof. Finally, I believe the proof techniques in this paper could yield high-probability bounds on regret. To see this, set $\Delta_k = V(M, \pi^*) - V(M, \pi^k)$ to be the regret incurred in period $k$. Lemma 4 together with the proof of Lemma 7 essentially bounds the conditional expected regret $\mathbb{E}[\Delta_k \mid \mathcal{H}_{k-1}]$ with high probability. Since each $\Delta_k$ is bounded, one should be able to apply concentration inequalities to bound the sum of martingale differences $\sum_{k=1}^{K} (\Delta_k - \mathbb{E}[\Delta_k \mid \mathcal{H}_{k-1}])$ with high probability.

Acknowledgments. Much of my understanding of randomized value functions comes from a collaboration with Ian Osband, Ben Van Roy, and Zheng Wen. Mark Sellke and Chao Qin each noticed the same error in the proof of Lemma 6 in the initial draft of this paper. The lemma has now been revised.
I am extremely grateful for their careful reading of the paper.

References

[1] Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[3] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.

[4] John Asmuth, Lihong Li, Michael L Littman, Ali Nouri, and David Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19–26. AUAI Press, 2009.

[5] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.

[6] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.

[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[8] Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

[9] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019.

[10] Christoph Dann and Emma Brunskill.
Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

[11] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

[12] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. In International Conference on Learning Representations, 2018.

[13] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[14] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

[15] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

[16] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[17] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

[18] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.

[19] Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems, pages 3258–3266, 2017.

[20] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

[21] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy.
Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[22] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016.

[23] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.

[24] Ian Osband, Benjamin Van Roy, Daniel J. Russo, and Zheng Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019.

[25] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.

[26] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

[27] Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.

[28] Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence, 2019.

[29] Nikolaos Tziortziotis, Christos Dimitrakakis, and Michalis Vazirgiannis. Randomised Bayesian least-squares policy iteration. arXiv preprint arXiv:1904.03535, 2019.

[30] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech.
Rep., 2003.