{"title": "Boltzmann Exploration Done Right", "book": "Advances in Neural Information Processing Systems", "page_first": 6284, "page_last": 6293, "abstract": "Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate? In this paper, we address several of these questions for the classic setup of stochastic multi-armed bandits. One of our main results is showing that the Boltzmann exploration strategy with any monotone learning-rate sequence will induce suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $\\Delta$). More importantly, we propose a novel variant that uses different learning rates for different arms, and achieves a distribution-dependent regret bound of order $\\frac{K\\log^2 T}{\\Delta}$ and a distribution-independent bound of order $\\sqrt{KT}\\log K$ without requiring such prior knowledge. 
To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed.", "full_text": "Boltzmann Exploration Done Right\n\nNicol\u00f2 Cesa-Bianchi\n\nUniversit\u00e0 degli Studi di Milano\n\nMilan, Italy\n\nnicolo.cesa-bianchi@unimi.it\n\nG\u00e1bor Lugosi\n\nICREA & Universitat Pompeu Fabra\n\nBarcelona, Spain\n\ngabor.lugosi@gmail.com\n\nClaudio Gentile\n\nINRIA Lille \u2013 Nord Europe\nVilleneuve d\u2019Ascq, France\ncla.gentile@gmail.com\n\nGergely Neu\n\nUniversitat Pompeu Fabra\n\nBarcelona, Spain\n\ngergely.neu@gmail.com\n\nAbstract\n\nBoltzmann exploration is a classic strategy for sequential decision-making under\nuncertainty, and is one of the most standard tools in Reinforcement Learning (RL).\nDespite its widespread use, there is virtually no theoretical understanding about\nthe limitations or the actual bene\ufb01ts of this exploration scheme. Does it drive\nexploration in a meaningful way? Is it prone to misidentifying the optimal actions\nor spending too much time exploring the suboptimal ones? What is the right tuning\nfor the learning rate? In this paper, we address several of these questions for the\nclassic setup of stochastic multi-armed bandits. One of our main results is showing\nthat the Boltzmann exploration strategy with any monotone learning-rate sequence\nwill induce suboptimal behavior. As a remedy, we offer a simple non-monotone\nschedule that guarantees near-optimal performance, albeit only when given prior\naccess to key problem parameters that are typically not available in practical\nsituations (like the time horizon T and the suboptimality gap \u2206). 
More importantly, we propose a novel variant that uses different learning rates for different arms, and achieves a distribution-dependent regret bound of order $\\frac{K\\log^2 T}{\\Delta}$ and a distribution-independent bound of order $\\sqrt{KT}\\log K$ without requiring such prior knowledge. To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed.\n\n1 Introduction\n\nExponential weighting strategies are fundamental tools in a variety of areas, including Machine Learning, Optimization, Theoretical Computer Science, and Decision Theory [3]. Within Reinforcement Learning [23, 25], exponential weighting schemes are broadly used for balancing exploration and exploitation, and are equivalently referred to as Boltzmann, Gibbs, or softmax exploration policies [22, 14, 24, 19]. In the most common version of Boltzmann exploration, the probability of choosing an arm is proportional to an exponential function of the empirical mean of the reward of that arm. Despite the popularity of this policy, very little is known about its theoretical performance, even in the simplest reinforcement learning setting of stochastic bandit problems.\nThe variant of Boltzmann exploration we focus on in this paper is defined by\n\n$$p_{t,i} \\propto e^{\\eta_t \\widehat{\\mu}_{t,i}}, \\qquad (1)$$\n\nwhere $p_{t,i}$ is the probability of choosing arm $i$ in round $t$, $\\widehat{\\mu}_{t,i}$ is the empirical average of the rewards obtained from arm $i$ up until round $t$, and $\\eta_t > 0$ is the learning rate. This variant is broadly used in reinforcement learning [23, 25, 14, 26, 16, 18].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
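As a minimal illustration (our own sketch, not code from the paper; the array values are hypothetical), the sampling rule (1) is a softmax over the empirical means:

```python
import numpy as np

def boltzmann_probs(mu_hat, eta):
    # Softmax distribution p_{t,i} proportional to exp(eta * mu_hat_i), as in
    # Eq. (1); subtracting the max before exponentiating avoids overflow.
    z = eta * (mu_hat - np.max(mu_hat))
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
mu_hat = np.array([0.5, 0.6, 0.4])  # hypothetical empirical means
for eta in (0.0, 10.0, 100.0):
    p = boltzmann_probs(mu_hat, eta)
    arm = int(rng.choice(len(p), p=p))
    # eta = 0 explores uniformly; as eta grows the rule approaches greedy.
```

The learning rate $\eta$ interpolates between uniform exploration ($\eta = 0$) and the greedy rule ($\eta \to \infty$), which is exactly the tuning problem studied below.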
In the multiarmed bandit literature, exponential-weights algorithms are also widespread, but they typically use importance-weighted estimators for the rewards (see, e.g., [6, 8] for the nonstochastic setting, [12] for the stochastic setting, and [20] for both stochastic and nonstochastic regimes). The theoretical behavior of these algorithms is generally well understood. For example, in the stochastic bandit setting Seldin and Slivkins [20] show a regret bound of order $\\frac{K\\log^2 T}{\\Delta}$, where $\\Delta$ is the suboptimality gap (i.e., the smallest difference between the mean reward of the optimal arm and the mean reward of any other arm).\nIn this paper, we aim to achieve a better theoretical understanding of the basic variant of the Boltzmann exploration policy that relies on the empirical mean rewards. We first show that any monotone learning-rate schedule will inevitably force the policy to either spend too much time drawing suboptimal arms or completely fail to identify the optimal arm. Then, we show that a specific non-monotone schedule of the learning rates can lead to a regret bound of order $\\frac{K\\log T}{\\Delta^2}$. However, the learning schedule has to rely on full knowledge of the gap $\\Delta$ and the number of rounds $T$. Moreover, our negative result helps us to identify a crucial shortcoming of the Boltzmann exploration policy: it does not reason about the uncertainty of the empirical reward estimates. To alleviate this issue, we propose a variant that takes this uncertainty into account by using separate learning rates for each arm, where the learning rates account for the uncertainty of each reward estimate.
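For contrast with the empirical means of Eq. (1), an importance-weighted reward estimate of the kind used by the exponential-weights bandit algorithms cited above (e.g., Exp3 [8]) can be sketched as follows; this is our own illustrative snippet, not code from any of the cited papers:

```python
import numpy as np

def iw_update(reward_sum_hat, chosen_arm, reward, probs):
    # Importance-weighted estimate: X_hat_{t,i} = X_t * I{I_t = i} / p_{t,i}.
    # Dividing by the probability of the chosen arm makes the estimate
    # unbiased for every arm, observed or not.
    out = reward_sum_hat.copy()
    out[chosen_arm] += reward / probs[chosen_arm]
    return out

est = iw_update(np.zeros(2), chosen_arm=0, reward=1.0, probs=np.array([0.25, 0.75]))
# arm 0 was observed with probability 1/4, so its estimate receives 1.0 / 0.25 = 4.0
```

Averaged over the randomness of the arm choice, each coordinate of this estimate equals the true reward, which is what makes the adversarial analyses of [6, 8] go through.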
We show that the resulting algorithm guarantees a distribution-dependent regret bound of order $\\frac{K\\log^2 T}{\\Delta}$, and a distribution-independent bound of order $\\sqrt{KT}\\log K$. Our algorithm and analysis is based on the so-called Gumbel\u2013softmax trick that connects the exponential-weights distribution with the maximum of independent random variables from the Gumbel distribution.\n\n2 The stochastic multi-armed bandit problem\n\nConsider the setting of stochastic multi-armed bandits: each arm $i \\in [K] \\stackrel{\\mathrm{def}}{=} \\{1, 2, \\ldots, K\\}$ yields a reward with distribution $\\nu_i$ and mean $\\mu_i$, with the optimal mean reward being $\\mu^* = \\max_i \\mu_i$. Without loss of generality, we will assume that the optimal arm is unique and has index 1. The gap of arm $i$ is defined as $\\Delta_i = \\mu^* - \\mu_i$. We consider a repeated game between the learner and the environment, where in each round $t = 1, 2, \\ldots$, the following steps are repeated:\n\n1. The learner chooses an arm $I_t \\in [K]$,\n2. the environment draws a reward $X_{t,I_t} \\sim \\nu_{I_t}$ independently of the past,\n3. the learner receives and observes the reward $X_{t,I_t}$.\n\nThe performance of the learner is measured in terms of the pseudo-regret defined as\n\n$$R_T = \\mu^* T - \\sum_{t=1}^{T} \\mathbb{E}\\left[X_{t,I_t}\\right] = \\mu^* T - \\mathbb{E}\\left[\\sum_{t=1}^{T} \\mu_{I_t}\\right] = \\mathbb{E}\\left[\\sum_{t=1}^{T} \\Delta_{I_t}\\right] = \\sum_{i=1}^{K} \\Delta_i \\mathbb{E}\\left[N_{T,i}\\right], \\qquad (2)$$\n\nwhere we defined $N_{t,i} = \\sum_{s=1}^{t} \\mathbb{I}_{\\{I_s = i\\}}$, that is, the number of times that arm $i$ has been chosen until the end of round $t$. We aim at constructing algorithms that guarantee that the regret grows sublinearly. We will consider the above problem under various assumptions on the distribution of the rewards.
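The interaction protocol and the pseudo-regret (2) translate directly into a simulation harness; the following is our own generic scaffolding (the policy and parameter names are ours), with realized pull counts standing in for the expectations in (2):

```python
import numpy as np

def run_bandit(policy, means, T, rng):
    # Simulate the repeated game and return the (single-run) pseudo-regret
    # sum_i Delta_i * N_{T,i}, using realized pull counts in place of E[N_{T,i}].
    K = len(means)
    pulls = np.zeros(K, dtype=int)
    for t in range(T):
        arm = policy(t)                  # step 1: the learner chooses I_t
        _ = rng.binomial(1, means[arm])  # steps 2-3: a reward is drawn and observed
        pulls[arm] += 1
    gaps = means.max() - means
    return float((gaps * pulls).sum())

rng = np.random.default_rng(1)
regret_uniform = run_bandit(lambda t: t % 2, np.array([0.6, 0.5]), T=1000, rng=rng)
# alternating between the two arms pulls arm 2 exactly 500 times,
# so the pseudo-regret is Delta * 500 = 0.1 * 500 = 50
```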
For most of our results, we will assume that each $\\nu_i$ is $\\sigma$-subgaussian with a known parameter $\\sigma > 0$, that is, that\n\n$$\\mathbb{E}\\left[e^{y(X_{1,i} - \\mathbb{E}[X_{1,i}])}\\right] \\le e^{\\sigma^2 y^2 / 2}$$\n\nholds for all $y \\in \\mathbb{R}$ and $i \\in [K]$. It is easy to see that any random variable bounded in an interval of length $B$ is $B^2/4$-subgaussian. Under this assumption, it is well known that any algorithm will suffer a regret of at least $\\Omega\\left(\\sum_{i>1} \\frac{\\sigma^2 \\log T}{\\Delta_i}\\right)$, as shown in the classic paper of Lai and Robbins [17]. There exist several algorithms guaranteeing matching upper bounds, even for finite horizons [7, 10, 15]. We refer to the survey of Bubeck and Cesa-Bianchi [9] for an exhaustive treatment of the topic.\n\n3 Boltzmann exploration done wrong\n\nWe now formally describe the heuristic form of Boltzmann exploration that is commonly used in the reinforcement learning literature [23, 25, 14]. This strategy works by maintaining the empirical estimates of each $\\mu_i$ defined as\n\n$$\\widehat{\\mu}_{t,i} = \\frac{\\sum_{s=1}^{t} X_{s,i} \\mathbb{I}_{\\{I_s = i\\}}}{N_{t,i}} \\qquad (3)$$\n\nand computing the exponential-weights distribution (1) for an appropriately tuned sequence of learning rate parameters $\\eta_t > 0$ (which are often referred to as the inverse temperature). As noted on several occasions in the literature, finding the right schedule for $\\eta_t$ can be very difficult in practice [14, 26]. Below, we quantify this difficulty by showing that natural learning-rate schedules may fail to achieve near-optimal regret guarantees. More precisely, they may draw suboptimal arms too much even after having estimated all the means correctly, or commit too early to a suboptimal arm and never recover afterwards. We partially circumvent this issue by proposing an admittedly artificial learning-rate schedule that actually guarantees near-optimal performance.
However, a serious limitation of this schedule is that it relies on prior knowledge of the problem parameters $\\Delta$ and $T$ that are typically unknown at the beginning of the learning procedure. These observations lead us to the conclusion that the Boltzmann exploration policy as described by Equations (1) and (3) is no more effective for regret minimization than the simplest alternative of $\\varepsilon$-greedy exploration [23, 7].\nBefore we present our own technical results, we mention that Singh et al. [21] propose a learning-rate schedule $\\eta_t$ for Boltzmann exploration that simultaneously guarantees that all arms will be drawn infinitely often as $T$ goes to infinity, and that the policy becomes greedy in the limit. This property is proven by choosing a learning-rate schedule adaptively so as to ensure that in each round $t$, each arm gets drawn with probability at least $1/t$, making it similar in spirit to $\\varepsilon$-greedy exploration. While this strategy clearly leads to sublinear regret, it is easy to construct examples on which it suffers a regret of at least $\\Omega\\left(T^{1-\\alpha}\\right)$ for any small $\\alpha > 0$. In this paper, we pursue a more ambitious goal: we aim to find out whether Boltzmann exploration can actually guarantee polylogarithmic regret. In the rest of this section, we present both negative and positive results concerning the standard variant of Boltzmann exploration, and then move on to providing an efficient generalization that achieves consistency in a more universal sense.\n\n3.1 Boltzmann exploration with monotone learning rates is suboptimal\n\nIn this section, we study the most natural variant of Boltzmann exploration that uses a monotone learning-rate schedule. It is easy to see that in order to achieve sublinear regret, the learning rate $\\eta_t$ needs to increase with $t$ so that the suboptimal arms are drawn with less and less probability as time progresses.
For the sake of clarity, we study the simplest possible setting with two arms with a gap of $\\Delta$ between their means. We first show that, in order to guarantee near-optimal (logarithmic) regret, the learning rate has to increase at least at a rate $\\frac{\\log t}{\\Delta}$ even when the mean rewards are perfectly known, and that any learning-rate sequence that increases at a slower logarithmic rate will lead to polynomial regret. In other words, $\\frac{\\log t}{\\Delta}$ is the minimal affordable learning rate.\n\nProposition 1. Let us assume that $\\widehat{\\mu}_{t,i} = \\mu_i$ for all $t$ and $i = 1, 2$ with $\\mu_1 > \\mu_2$. Assume that for some constants $k \\ge 1$, $\\alpha \\ge 0$ and $\\varepsilon \\le \\frac{1}{\\Delta}$, the learning rate satisfies $\\eta_t \\le \\frac{\\log(t\\Delta^2)}{(1+\\alpha)\\Delta} + \\varepsilon$ for all $t \\ge k$. Then, the regret grows as\n\n\u2022 $R_T = \\Omega\\left(\\frac{\\log T}{\\Delta}\\right)$ if $\\alpha = 0$, and\n\u2022 $R_T = \\Omega\\left(T^{\\frac{\\alpha}{1+\\alpha}} \\left(\\frac{1}{\\Delta}\\right)^{\\frac{1-\\alpha}{1+\\alpha}}\\right)$ if $\\alpha > 0$.\n\nProof. For $t \\ge k$, the probability of pulling the suboptimal arm can be bounded as\n\n$$\\mathbb{P}\\left[I_t = 2\\right] = \\frac{1}{1 + e^{\\eta_t \\Delta}} \\ge \\frac{e^{-\\eta_t \\Delta}}{2} = \\Omega\\left(\\left(\\Delta^2 t\\right)^{-\\frac{1}{1+\\alpha}}\\right)$$\n\nby our assumption on $\\eta_t$. Summing up for all $t$, we get that the regret is at least\n\n$$R_T = \\Delta \\sum_{t=1}^{T} \\mathbb{P}\\left[I_t = 2\\right] \\ge \\Delta \\cdot \\Omega\\left(\\sum_{t=k}^{T} \\left(\\Delta^2 t\\right)^{-\\frac{1}{1+\\alpha}}\\right).$$\n\nThe proof is concluded by observing that the sum $\\sum_{t=k}^{T} t^{-\\frac{1}{1+\\alpha}}$ is of the order $\\Omega(\\log T)$ if $\\alpha = 0$ and $\\Omega\\left(T^{\\frac{\\alpha}{1+\\alpha}}\\right)$ if $\\alpha > 0$.\n\nThis simple proposition thus implies an asymptotic lower bound on the schedule of learning rates $\\eta_t$ that provide near-optimal guarantees.
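Proposition 1 is easy to check numerically in the borderline case $\alpha = 0$ (a quick sanity check of ours, not an experiment from the paper): with frozen, exact means and $\eta_t = \log(t\Delta^2)/\Delta$, the per-round probability of the suboptimal pull decays like $1/(\Delta^2 t)$, so the summed regret grows logarithmically in $T$:

```python
import math

def frozen_means_regret(T, delta, alpha=0.0):
    # Pseudo-regret delta * sum_t P[I_t = 2] for two arms with frozen, exact
    # means and eta_t = log(t * delta^2) / ((1 + alpha) * delta); the schedule
    # is clipped at 0 while t * delta^2 < 1 (our choice, for well-definedness).
    total = 0.0
    for t in range(1, T + 1):
        eta = max(0.0, math.log(t * delta ** 2)) / ((1.0 + alpha) * delta)
        total += delta / (1.0 + math.exp(eta * delta))
    return total

r1 = frozen_means_regret(10_000, delta=0.1)
r2 = frozen_means_regret(100_000, delta=0.1)
# each tenfold increase of T adds roughly (log 10) / delta to the regret
```

Here $r_2 - r_1$ comes out close to $\log(10)/\Delta \approx 23$, matching the $\Omega(\log T / \Delta)$ rate of the first bullet.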
In contrast, Theorem 1 below shows that all learning-rate sequences that grow faster than $2\\log t$ yield a linear regret, provided this schedule is adopted since the beginning of the game. This should be contrasted with Theorem 2, which exhibits a schedule achieving logarithmic regret where $\\eta_t$ grows faster than $2\\log t$ only after the first $\\tau$ rounds.\n\nTheorem 1. There exists a 2-armed stochastic bandit problem with rewards bounded in $[0, 1]$ where Boltzmann exploration using any learning rate sequence $\\eta_t$ such that $\\eta_t > 2\\log t$ for all $t \\ge 1$ has regret $R_T = \\Omega(T)$.\n\nProof. Consider the case where arm 2 gives a reward deterministically equal to $\\frac{1}{2}$ whereas the optimal arm 1 has a Bernoulli distribution of parameter $p = \\frac{1}{2} + \\Delta$ for some $0 < \\Delta < \\frac{1}{2}$. Note that the regret of any algorithm satisfies $R_T \\ge \\Delta (T - t_0)\\, \\mathbb{P}\\left[\\forall t > t_0,\\ I_t = 2\\right]$. Without loss of generality, assume that $\\widehat{\\mu}_{1,1} = 0$ and $\\widehat{\\mu}_{1,2} = 1/2$. Then for all $t$, independent of the algorithm, $\\widehat{\\mu}_{t,2} = 1/2$ and\n\n$$p_{t,1} = \\frac{e^{\\eta_t \\widehat{\\mu}_{t,1}}}{e^{\\eta_t/2} + e^{\\eta_t \\widehat{\\mu}_{t,1}}} \\qquad \\text{and} \\qquad p_{t,2} = \\frac{e^{\\eta_t/2}}{e^{\\eta_t/2} + e^{\\eta_t \\widehat{\\mu}_{t,1}}},$$\n\nwhere $\\widehat{\\mu}_{t,1}$ is distributed as $\\mathrm{Bin}(N_{t-1,1}, p)/N_{t-1,1}$. For $t_0 \\ge 1$, let $E_{t_0}$ be the event that $\\mathrm{Bin}(N_{t_0,1}, p) = 0$, that is, up to time $t_0$, arm 1 gives only zero reward whenever it is sampled. Then\n\n$$\\mathbb{P}\\left[\\forall t > t_0\\ I_t = 2\\right] \\ge \\mathbb{P}\\left[E_{t_0}\\right]\\left(1 - \\mathbb{P}\\left[\\exists t > t_0\\ I_t = 1 \\mid E_{t_0}\\right]\\right) \\ge \\left(\\frac{1}{2} - \\Delta\\right)^{t_0}\\left(1 - \\mathbb{P}\\left[\\exists t > t_0\\ I_t = 1 \\mid E_{t_0}\\right]\\right).$$\n\nFor $t > t_0$, let $A_{t,t_0}$ be the event that arm 1 is sampled at time $t$ but not at any of the times $t_0 + 1, t_0 + 2, \\ldots, t - 1$.
Then, for any $t_0 \\ge 1$,\n\n$$\\mathbb{P}\\left[\\exists t > t_0\\ I_t = 1 \\mid E_{t_0}\\right] = \\mathbb{P}\\left[\\exists t > t_0\\ A_{t,t_0} \\mid E_{t_0}\\right] \\le \\sum_{t > t_0} \\mathbb{P}\\left[A_{t,t_0} \\mid E_{t_0}\\right] = \\sum_{t > t_0} \\left(\\prod_{s=t_0+1}^{t-1} \\left(1 - \\frac{1}{1 + e^{\\eta_s/2}}\\right)\\right) \\frac{1}{1 + e^{\\eta_t/2}} \\le \\sum_{t > t_0} e^{-\\eta_t/2}.$$\n\nTherefore\n\n$$R_T \\ge \\Delta (T - t_0) \\left(\\frac{1}{2} - \\Delta\\right)^{t_0} \\left(1 - \\sum_{t > t_0} e^{-\\eta_t/2}\\right).$$\n\nAssume $\\eta_t \\ge c \\log t$ for some $c > 2$ and for all $t \\ge t_0$. Then\n\n$$\\sum_{t > t_0} e^{-\\eta_t/2} \\le \\sum_{t > t_0} t^{-c/2} \\le \\int_{t_0}^{\\infty} x^{-c/2}\\, dx = \\frac{t_0^{-(c/2 - 1)}}{c/2 - 1} \\le \\frac{1}{2}$$\n\nwhenever $t_0 \\ge (2a)^{1/a}$ where $a = \\frac{c}{2} - 1$. This implies $R_T = \\Omega(T)$.\n\n3.2 A learning-rate schedule with near-optimal guarantees\n\nThe above negative result is indeed heavily relying on the assumption that $\\eta_t > 2\\log t$ holds from the beginning. If we instead start off from a constant learning rate which we keep for a logarithmic number of rounds, then a logarithmic regret bound can be shown. Arguably, this results in a rather simplistic exploration scheme, which can be essentially seen as an explore-then-commit strategy (e.g., [13]). Despite its simplicity, this strategy can be shown to achieve near-optimal performance if the parameters are tuned as a function of the suboptimality gap $\\Delta$ (although its regret scales at the suboptimal rate of $1/\\Delta^2$ with this parameter). The following theorem (proved in Appendix A.1) states this performance guarantee.\n\nTheorem 2. Assume the rewards of each arm are in $[0, 1]$ and let $\\tau = \\frac{16 e K \\log T}{\\Delta^2}$. Then the regret of Boltzmann exploration with learning rate $\\eta_t = \\mathbb{I}_{\\{t < \\tau\\}} + \\frac{\\log(t\\Delta^2)}{\\Delta} \\mathbb{I}_{\\{t \\ge \\tau\\}}$ satisfies\n\n$$R_T \\le \\frac{16 e K \\log T}{\\Delta^2} + \\Delta + \\frac{9K}{\\Delta^2}.$$\n\n4 Boltzmann exploration done right\n\nWe now turn to give a variant of Boltzmann exploration that achieves near-optimal guarantees without prior knowledge of either $\\Delta$ or $T$. Our approach is based on the observation that the distribution $p_{t,i} \\propto \\exp(\\eta_t \\widehat{\\mu}_{t,i})$ can be equivalently specified by the rule $I_t = \\arg\\max_j \\{\\eta_t \\widehat{\\mu}_{t,j} + Z_{t,j}\\}$, where $Z_{t,j}$ is a standard Gumbel random variable\u00b9 drawn independently for each arm $j$ (see, e.g., Abernethy et al. [1] and the references therein). As we saw in the previous section, this scheme fails to guarantee consistency in general, as it does not capture the uncertainty of the reward estimates. We now propose a variant that takes this uncertainty into account by choosing different scaling factors for each perturbation. In particular, we will use the simple choice $\\beta_{t,i} = \\sqrt{C^2 / N_{t,i}}$ with some constant $C > 0$ that will be specified later. Our algorithm operates by independently drawing perturbations $Z_{t,i}$ from a standard Gumbel distribution for each arm $i$, then choosing action\n\n$$I_{t+1} = \\arg\\max_i \\left\\{\\widehat{\\mu}_{t,i} + \\beta_{t,i} Z_{t,i}\\right\\}. \\qquad (4)$$\n\nWe refer to this algorithm as Boltzmann\u2013Gumbel exploration, or, in short, BGE. Unfortunately, the probabilities $p_{t,i}$ no longer have a simple closed form; nevertheless, the algorithm is very straightforward to implement. Our main positive result is the following performance guarantee for the algorithm.\u00b2\n\nTheorem 3.
Assume that the rewards of each arm are $\\sigma^2$-subgaussian and let $c > 0$ be arbitrary. Then, the regret of Boltzmann\u2013Gumbel exploration satisfies\n\n$$R_T \\le \\sum_{i=2}^{K} \\frac{9 C^2 \\log_+^2\\left(T \\Delta_i^2 / c^2\\right)}{\\Delta_i} + \\sum_{i=2}^{K} \\frac{c^2 e^{\\gamma} + 18 C^2 e^{\\sigma^2/2C^2}\\left(1 + e^{-\\gamma}\\right)}{\\Delta_i} + \\sum_{i=2}^{K} \\Delta_i.$$\n\nIn particular, choosing $C = \\sigma$ and $c = \\sigma$ guarantees a regret bound of\n\n$$R_T = O\\left(\\sum_{i=2}^{K} \\frac{\\sigma^2 \\log^2\\left(T \\Delta_i^2 / \\sigma^2\\right)}{\\Delta_i}\\right).$$\n\nNotice that, unlike any other algorithm that we are aware of, Boltzmann\u2013Gumbel exploration still continues to guarantee meaningful regret bounds even if the subgaussianity constant $\\sigma$ is underestimated, although such misspecification is penalized exponentially in the true $\\sigma^2$. A downside of our bound is that it shows a suboptimal dependence on the number of rounds $T$: it grows asymptotically as $\\sum_{i>1} \\log^2\\left(T \\Delta_i^2\\right) \\big/ \\Delta_i$, in contrast to the standard regret bounds for the UCB algorithm of Auer et al. [7] that grow as $\\sum_{i>1} (\\log T) \\big/ \\Delta_i$. However, our guarantee improves on the distribution-independent regret bounds of UCB that are of order $\\sqrt{KT \\log T}$. This is shown in the following corollary.\n\nCorollary 1. Assume that the rewards of each arm are $\\sigma^2$-subgaussian. Then, the regret of Boltzmann\u2013Gumbel exploration with $C = \\sigma$ satisfies $R_T \\le 200\\,\\sigma\\sqrt{KT \\log K}$.\n\nNotably, this bound shows optimal dependence on the number of rounds $T$, but is suboptimal in terms of the number of arms. To complement this upper bound, we also show that these bounds are tight in the sense that the $\\log K$ factor cannot be removed.\n\nTheorem 4.
For any $K$ and $T$ such that $\\sqrt{K/T}\\,\\log K \\le 1$, there exists a bandit problem with rewards bounded in $[0, 1]$ where the regret of Boltzmann\u2013Gumbel exploration with $C = 1$ is at least $R_T \\ge \\frac{1}{2}\\sqrt{KT \\log K}$.\n\n\u00b9The cumulative density function of a standard Gumbel random variable is $F(x) = \\exp(-e^{-x+\\gamma})$, where $\\gamma$ is the Euler\u2013Mascheroni constant.\n\u00b2We use the notation $\\log_+(\\cdot) = \\max\\{0, \\cdot\\}$.\n\nThe proofs can be found in Appendices A.5 and A.6. Note that more sophisticated policies are known to have better distribution-free bounds. The algorithm MOSS [4] achieves minimax-optimal $\\sqrt{KT}$ distribution-free bounds, but distribution-dependent bounds of the form $(K/\\Delta)\\log(T\\Delta^2)$, where $\\Delta$ is the suboptimality gap. A variant of UCB using action elimination, due to Auer and Ortner [5], has regret $\\sum_{i>1} \\log(T\\Delta_i^2) \\big/ \\Delta_i$, corresponding to a $\\sqrt{KT \\log K}$ distribution-free bound. The same bounds are achieved by the Gaussian Thompson sampling algorithm of Agrawal and Goyal [2], given that the rewards are subgaussian.\nWe finally provide a simple variant of our algorithm that allows us to handle heavy-tailed rewards, intended here as reward distributions that are not subgaussian. We propose to use a technique due to Catoni [11] based on the influence function\n\n$$\\psi(x) = \\begin{cases} \\log\\left(1 + x + x^2/2\\right), & \\text{for } x \\ge 0, \\\\ -\\log\\left(1 - x + x^2/2\\right), & \\text{for } x \\le 0. \\end{cases}$$\n\nUsing this function, we define our estimates as\n\n$$\\widehat{\\mu}_{t,i} = \\beta_{t,i} \\sum_{s=1}^{t} \\mathbb{I}_{\\{I_s = i\\}}\\, \\psi\\left(\\frac{X_{s,i}}{\\beta_{t,i} N_{t,i}}\\right).$$\n\nWe prove the following result regarding Boltzmann\u2013Gumbel exploration run with the above estimates.\n\nTheorem 5.
Assume that the second moment of the rewards of each arm is bounded uniformly as $\\mathbb{E}\\left[X_i^2\\right] \\le V$ and let $c > 0$ be arbitrary. Then, the regret of Boltzmann\u2013Gumbel exploration satisfies\n\n$$R_T \\le \\sum_{i=2}^{K} \\frac{9 C^2 \\log_+^2\\left(T \\Delta_i^2 / c^2\\right)}{\\Delta_i} + \\sum_{i=2}^{K} \\frac{c^2 e^{\\gamma} + 18 C^2 e^{V/2C^2}\\left(1 + e^{-\\gamma}\\right)}{\\Delta_i} + \\sum_{i=2}^{K} \\Delta_i.$$\n\nNotably, this bound coincides with that of Theorem 3, except that $\\sigma^2$ is replaced by $V$. Thus, by following the proof of Corollary 1, we can show a distribution-independent regret bound of order $\\sqrt{KT \\log K}$.\n\n5 Analysis\n\nLet us now present the proofs of our main results concerning Boltzmann\u2013Gumbel exploration, Theorems 3 and 5. Our analysis builds on several ideas from Agrawal and Goyal [2]. We first provide generic tools that are independent of the reward estimator and then move on to providing specifics for both estimators.\nWe start with introducing some notation. We define $\\widetilde{\\mu}_{t,i} = \\widehat{\\mu}_{t,i} + \\beta_{t,i} Z_{t,i}$, so that the algorithm can be simply written as $I_t = \\arg\\max_i \\widetilde{\\mu}_{t,i}$. Let $\\mathcal{F}_{t-1}$ be the sigma-algebra generated by the actions taken by the learner and the realized rewards up to the beginning of round $t$. Let us fix thresholds $x_i, y_i$ satisfying $\\mu_i \\le x_i \\le y_i \\le \\mu_1$ and define $q_{t,i} = \\mathbb{P}\\left[\\widetilde{\\mu}_{t,1} > y_i \\mid \\mathcal{F}_{t-1}\\right]$. Furthermore, we define the events $E^{\\widehat{\\mu}}_{t,i} = \\{\\widehat{\\mu}_{t,i} \\le x_i\\}$ and $E^{\\widetilde{\\mu}}_{t,i} = \\{\\widetilde{\\mu}_{t,i} \\le y_i\\}$. With this notation at hand, we can decompose the number of draws of any suboptimal $i$ as follows (overlines denote complements of events):\n\n$$\\mathbb{E}\\left[N_{T,i}\\right] = \\sum_{t=1}^{T} \\mathbb{P}\\left[I_t = i,\\ E^{\\widetilde{\\mu}}_{t,i},\\ E^{\\widehat{\\mu}}_{t,i}\\right] + \\sum_{t=1}^{T} \\mathbb{P}\\left[I_t = i,\\ \\overline{E}^{\\widetilde{\\mu}}_{t,i},\\ E^{\\widehat{\\mu}}_{t,i}\\right] + \\sum_{t=1}^{T} \\mathbb{P}\\left[I_t = i,\\ \\overline{E}^{\\widehat{\\mu}}_{t,i}\\right]. \\qquad (5)$$\n\nIt remains to choose the thresholds $x_i$ and $y_i$ in a meaningful way: we pick $x_i = \\mu_i + \\frac{\\Delta_i}{3}$ and $y_i = \\mu_1 - \\frac{\\Delta_i}{3}$. The rest of the proof is devoted to bounding each term in Eq. (5). Intuitively, the individual terms capture the following events:\n\n\u2022 The first term counts the number of times that, even though the estimated mean reward of arm $i$ is well-concentrated and the additional perturbation $Z_{t,i}$ is not too large, arm $i$ was drawn instead of the optimal arm 1. This happens when the optimal arm is poorly estimated or when the perturbation $Z_{t,1}$ is not large enough. Intuitively, this term measures the interaction between the perturbations $Z_{t,1}$ and the random fluctuations of the reward estimate $\\widehat{\\mu}_{t,1}$ around its true mean, and will be small if the perturbations tend to be large enough and the tail of the reward estimates is light enough.\n\u2022 The second term counts the number of times that the mean reward of arm $i$ is well-estimated, but it ends up being drawn due to a large perturbation. This term can be bounded independently of the properties of the mean estimator and is small when the tail of the perturbation distribution is not too heavy.\n\u2022 The last term counts the number of times that the reward estimate of arm $i$ is poorly concentrated. This term is independent of the perturbations and only depends on the properties of the reward estimator.\n\nAs we will see, the first and the last terms can be bounded in terms of the moment generating function of the reward estimates, which makes subgaussian reward estimators particularly easy to treat.
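For concreteness, the two reward estimators covered by the analysis, the empirical mean of Eq. (3) and the Catoni influence-function estimate used for Theorem 5, can be sketched as follows (our own simplified illustration; the data values are hypothetical):

```python
import math

def psi(x):
    # Catoni's influence function: log(1 + x + x^2/2) for x >= 0 and
    # -log(1 - x + x^2/2) for x <= 0; it grows only logarithmically.
    if x >= 0:
        return math.log1p(x + 0.5 * x * x)
    return -math.log1p(-x + 0.5 * x * x)

def catoni_estimate(rewards, C=1.0):
    # mu_hat = beta * sum_s psi(X_s / (beta * n)) with beta = sqrt(C^2 / n),
    # matching the estimator defined in Section 4.
    n = len(rewards)
    beta = math.sqrt(C * C / n)
    return beta * sum(psi(x / (beta * n)) for x in rewards)

def empirical_mean(rewards):
    return sum(rewards) / len(rewards)

clean = [0.5] * 100
spiked = [0.5] * 99 + [1000.0]  # one heavy-tailed outlier
# On clean data the two estimates nearly coincide; the outlier drags the
# empirical mean to about 10.5 while the Catoni estimate moves far less.
```

The logarithmic growth of $\psi$ is what keeps the moment generating function of the estimate bounded even when the rewards themselves are heavy-tailed.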
We begin by the most standard part of our analysis: bounding the third term on the right-hand side of (5) in terms of the moment-generating function.\n\nLemma 1. Let us fix any $i$ and define $\\tau_k$ as the $k$'th time that arm $i$ was drawn. We have\n\n$$\\sum_{t=1}^{T} \\mathbb{P}\\left[I_t = i,\\ \\overline{E}^{\\widehat{\\mu}}_{t,i}\\right] \\le 1 + \\sum_{k=1}^{T-1} \\mathbb{E}\\left[\\exp\\left(\\frac{\\widehat{\\mu}_{\\tau_k,i} - \\mu_i}{\\beta_{\\tau_k,i}}\\right)\\right] \\cdot e^{-\\frac{\\Delta_i \\sqrt{k}}{3C}}.$$\n\nInterestingly, our next key result shows that the first term can be bounded by a nearly identical expression:\n\nLemma 2. Let us define $\\tau_k$ as the $k$'th time that arm 1 was drawn. For any $i$, we have\n\n$$\\sum_{t=1}^{T} \\mathbb{P}\\left[I_t = i,\\ E^{\\widetilde{\\mu}}_{t,i},\\ E^{\\widehat{\\mu}}_{t,i}\\right] \\le \\sum_{k=0}^{T-1} \\mathbb{E}\\left[\\exp\\left(\\frac{\\mu_1 - \\widehat{\\mu}_{\\tau_k,1}}{\\beta_{\\tau_k,1}}\\right)\\right] e^{-\\gamma - \\frac{\\Delta_i \\sqrt{k}}{3C}}.$$\n\nIt remains to bound the second term in Equation (5), which we do in the following lemma:\n\nLemma 3. For any $i \\ne 1$ and any constant $c > 0$, we have\n\n$$\\sum_{t=1}^{T} \\mathbb{P}\\left[I_t = i,\\ \\overline{E}^{\\widetilde{\\mu}}_{t,i},\\ E^{\\widehat{\\mu}}_{t,i}\\right] \\le \\frac{9 C^2 \\log_+^2\\left(T \\Delta_i^2 / c^2\\right) + c^2 e^{\\gamma}}{\\Delta_i^2}.$$\n\nThe proofs of these three lemmas are included in the supplementary material.\n\n5.1 The proof of Theorem 3\n\nFor this section, we assume that the rewards are $\\sigma$-subgaussian and that $\\widehat{\\mu}_{t,i}$ is the empirical-mean estimator. Building on the results of the previous section, observe that we are left with bounding the terms appearing in Lemmas 1 and 2. To this end, let us fix a $k$ and an $i$ and notice that, by the subgaussianity assumption on the rewards, the empirical mean $\\widehat{\\mu}_{\\tau_k,i}$ is $\\frac{\\sigma}{\\sqrt{k}}$-subgaussian (as $N_{\\tau_k,i} = k$). In other words,\n\n$$\\mathbb{E}\\left[e^{\\alpha\\left(\\widehat{\\mu}_{\\tau_k,i} - \\mu_i\\right)}\\right] \\le e^{\\alpha^2 \\sigma^2 / 2k}$$\n\nholds for any $\\alpha$. In particular, using the above formula for $\\alpha = 1/\\beta_{\\tau_k,i} = \\sqrt{k/C^2}$, we obtain\n\n$$\\mathbb{E}\\left[\\exp\\left(\\frac{\\widehat{\\mu}_{\\tau_k,i} - \\mu_i}{\\beta_{\\tau_k,i}}\\right)\\right] \\le e^{\\sigma^2/2C^2}.$$\n\nThus, the sum appearing in Lemma 1 can be bounded as\n\n$$\\sum_{k=1}^{T-1} \\mathbb{E}\\left[\\exp\\left(\\frac{\\widehat{\\mu}_{\\tau_k,i} - \\mu_i}{\\beta_{\\tau_k,i}}\\right)\\right] \\cdot e^{-\\frac{\\Delta_i \\sqrt{k}}{3C}} \\le e^{\\sigma^2/2C^2} \\sum_{k=1}^{T-1} e^{-\\frac{\\Delta_i \\sqrt{k}}{3C}} \\le \\frac{18 C^2 e^{\\sigma^2/2C^2}}{\\Delta_i^2},$$\n\nwhere the last step follows from the fact\u00b3 that $\\sum_{k=0}^{\\infty} e^{-c\\sqrt{k}} \\le \\frac{2}{c^2}$ holds for all $c > 0$. The statement of Theorem 3 now follows from applying the same argument to the bound of Lemma 2, using Lemma 3, and the standard expression for the regret in Equation (2).\n\n\u00b3This can be easily seen by bounding the sum with an integral.\n\nFigure 1: Empirical performance of Boltzmann exploration variants, Boltzmann\u2013Gumbel exploration and UCB for (a) i.i.d. initialization and (b) malicious initialization, as a function of $C^2$. The dotted vertical line corresponds to the choice $C^2 = 1/4$ suggested by Theorem 3.\n\n5.2 The proof of Theorem 5\n\nWe now drop the subgaussian assumption on the rewards and consider reward distributions that are possibly heavy-tailed, but have bounded variance.
The proof of Theorem 5 trivially follows from the arguments in the previous subsection and from Proposition 2.1 of Catoni [11] (with $\\theta = 0$), which guarantees the bound\n\n$$\\mathbb{E}\\left[\\exp\\left(\\pm\\frac{\\mu_i - \\widehat{\\mu}_{t,i}}{\\beta_{t,i}}\\right) \\,\\middle|\\, N_{t,i} = n\\right] \\le \\exp\\left(\\frac{\\mathbb{E}\\left[X_i^2\\right]}{2C^2}\\right). \\qquad (6)$$\n\n6 Experiments\n\nThis section concludes by illustrating our theoretical results through some experiments, highlighting the limitations of Boltzmann exploration and contrasting it with the performance of Boltzmann\u2013Gumbel exploration. We consider a stochastic multi-armed bandit problem with $K = 10$ arms, each yielding Bernoulli rewards with mean $\\mu_i = 1/2$ for all suboptimal arms $i > 1$ and $\\mu_1 = 1/2 + \\Delta$ for the optimal arm. We set the horizon to $T = 10^6$ and the gap parameter to $\\Delta = 0.01$. We compare three variants of Boltzmann exploration with inverse learning rate parameters\n\n\u2022 $\\beta_t = C^2$ (BE-const),\n\u2022 $\\beta_t = C^2 / \\log t$ (BE-log), and\n\u2022 $\\beta_t = C^2 / \\sqrt{t}$ (BE-sqrt)\n\nfor all $t$, and compare them with Boltzmann\u2013Gumbel exploration (BGE) and UCB with exploration bonus $\\sqrt{C^2 \\log(t) / N_{t,i}}$.\nWe study two different scenarios: (a) all rewards drawn i.i.d. from the Bernoulli distributions with the means given above, and (b) the first $T_0 = 5000$ rewards set to 0 for arm 1. The latter scenario simulates the situation described in the proof of Theorem 1, and in particular exposes the weakness of Boltzmann exploration with increasing learning rate parameters.
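The comparison above can be condensed into a short script. The sketch below is our own reimplementation of Boltzmann–Gumbel exploration as in Eq. (4), with parameters chosen for a fast run (K = 3 and a gap of 0.2 rather than the paper's settings):

```python
import numpy as np

def run_bge(means, T, C=0.5, corrupt_until=0, seed=0):
    # Boltzmann-Gumbel exploration: pull argmax_i mu_hat_{t,i} + beta_{t,i} Z_{t,i}
    # with beta_{t,i} = sqrt(C^2 / N_{t,i}) and i.i.d. Gumbel noise Z_{t,i}.
    rng = np.random.default_rng(seed)
    K = len(means)
    sums = np.zeros(K)
    pulls = np.zeros(K)
    for t in range(T):
        if t < K:
            arm = t  # pull every arm once so that N_{t,i} >= 1
        else:
            beta = np.sqrt(C * C / pulls)
            arm = int(np.argmax(sums / pulls + beta * rng.gumbel(size=K)))
        r = float(rng.binomial(1, means[arm]))
        if arm == 0 and t < corrupt_until:
            r = 0.0  # scenario (b): zero out the optimal arm's early rewards
        sums[arm] += r
        pulls[arm] += 1
    gaps = means.max() - means
    return float((gaps * pulls).sum()), pulls

means = np.array([0.7, 0.5, 0.5])  # arm with index 0 is optimal, gap 0.2
regret, pulls = run_bge(means, T=20_000)
# with these settings BGE concentrates its pulls on the optimal arm; setting
# corrupt_until > 0 reproduces the malicious scenario (b) on a small scale
```

Note that numpy's Gumbel omits the $\gamma$ shift of the paper's footnote; since the same shift would be applied to every arm, the argmax is unaffected. Fixing $\beta_t$ to one of the BE schedules above in place of $\beta_{t,i}$ recovers the Boltzmann exploration baselines.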
The results in Figure 1(a) and (b) show that while some variants of Boltzmann exploration may perform reasonably well when initial rewards take typical values and the parameters are chosen luckily, all standard versions fail to identify the optimal arm when the initial draws are not representative of the true mean (which happens with a small constant probability). On the other hand, UCB and Boltzmann–Gumbel exploration continue to perform well even under this unlikely event, as predicted by their respective theoretical guarantees. Notably, Boltzmann–Gumbel exploration performs comparably to UCB in this example (even slightly outperforming its competitor here), and performs notably well for the recommended parameter setting of C² = σ² = 1/4 (noting that Bernoulli random variables are 1/4-subgaussian).

Acknowledgements  Gábor Lugosi was supported by the Spanish Ministry of Economy and Competitiveness, Grant MTM2015-67304-P and FEDER, EU. Gergely Neu was supported by the UPFellows Fellowship (Marie Curie COFUND program n° 600387).

References

[1] J. Abernethy, C. Lee, A. Sinha, and A. Tewari. Online linear optimization via smoothing. In M.-F. Balcan and Cs. Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of JMLR Proceedings, pages 807–823. JMLR.org, 2014.

[2] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In AISTATS, pages 99–107, 2013.

[3] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8:121–164, 2012.

[4] J.-Y. Audibert and S. Bubeck. Minimax policies for bandits games. In S. Dasgupta and A. Klivans, editors, Proceedings of the 22nd Annual Conference on Learning Theory.
Omnipress, June 18–21, 2009.

[5] P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61:55–65, 2010. ISSN 0031-5303.

[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995.

[7] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, May 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352. URL http://dx.doi.org/10.1023/A:1013689704352.

[8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002. ISSN 0097-5397.

[9] S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Now Publishers Inc, 2012.

[10] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, G. Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

[11] O. Catoni. Challenging the empirical mean and empirical variance: A deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 11 2012.

[12] N. Cesa-Bianchi and P. Fischer. Finite-time regret bounds for the multiarmed bandit problem. In ICML, pages 100–108, 1998.

[13] A. Garivier, E. Kaufmann, and T. Lattimore. On explore-then-commit strategies. In NIPS, 2016.

[14] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[15] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis.
In ALT'12, pages 199–213, 2012.

[16] V. Kuleshov and D. Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.

[17] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

[18] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. 2016.

[19] T. Perkins and D. Precup. A convergent form of approximate policy iteration. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1595–1602, Cambridge, MA, USA, 2003. MIT Press.

[20] Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1287–1295, 2014.

[21] S. P. Singh, T. Jaakkola, M. L. Littman, and Cs. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000. URL citeseer.ist.psu.edu/article/singh98convergence.html.

[22] R. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, San Mateo, CA, 1990.

[23] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[24] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press.

[25] Cs. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

[26] J.
Vermorel and M. Mohri. Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning, pages 437–448. Springer, 2005.