{"title": "Explore no more: Improved high-probability regret bounds for non-stochastic bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3168, "page_last": 3176, "abstract": "This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature since proving them requires a large deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold on expectation. One of these modifications is forcing the learner to sample arms from the uniform distribution at least $\\Omega(\\sqrt{T})$ times over $T$ rounds, which can adversely affect performance if many of the arms are suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework.Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique.", "full_text": "Explore no more: Improved high-probability regret\n\nbounds for non-stochastic bandits\n\nGergely Neu\u2217\nSequeL team\n\nINRIA Lille \u2013 Nord Europe\n\ngergely.neu@gmail.com\n\nAbstract\n\n\u221a\n\nThis work addresses the problem of regret minimization in non-stochastic multi-\narmed bandit problems, focusing on performance guarantees that hold with high\nprobability. 
Such results are rather scarce in the literature since proving them requires a great deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold in expectation. One of these modifications is forcing the learner to sample arms from the uniform distribution at least $\Omega(\sqrt{T})$ times over $T$ rounds, which can adversely affect performance if many of the arms are suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework. Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique.

1 Introduction

Consider the problem of regret minimization in non-stochastic multi-armed bandits, as defined in the classic paper of Auer, Cesa-Bianchi, Freund, and Schapire [5]. This sequential decision-making problem can be formalized as a repeated game between a learner and an environment (sometimes called the adversary). In each round $t = 1, 2, \dots, T$, the two players interact as follows: The learner picks an arm (also called an action) $I_t \in [K] = \{1, 2, \dots, K\}$ and the environment selects a loss function $\ell_t : [K] \to [0, 1]$, where the loss associated with arm $i \in [K]$ is denoted as $\ell_{t,i}$. Subsequently, the learner incurs and observes the loss $\ell_{t,I_t}$. Based solely on these observations, the goal of the learner is to choose its actions so as to accumulate as little loss as possible during the course of the game. As traditional in the online learning literature [10], we measure the performance of the learner in terms of the regret defined as
$$R_T = \sum_{t=1}^{T} \ell_{t,I_t} - \min_{i \in [K]} \sum_{t=1}^{T} \ell_{t,i}.$$
We say that the environment is oblivious if it selects the sequence of loss vectors irrespective of the past actions taken by the learner, and adaptive (or non-oblivious) if it is allowed to choose $\ell_t$ as a function of the past actions $I_{t-1}, \dots, I_1$. An equivalent formulation of the multi-armed bandit game uses the concept of rewards (also called gains or payoffs) instead of losses: in this version, the adversary chooses the sequence of reward functions $(r_t)$ with $r_{t,i}$ denoting the reward given to the learner for choosing action $i$ in round $t$. In this game, the learner aims at maximizing its total rewards. We will refer to the above two formulations as the loss game and the reward game, respectively.

Our goal in this paper is to construct algorithms for the learner that guarantee that the regret grows sublinearly. Since it is well known that no deterministic learning algorithm can achieve this goal [10], we are interested in randomized algorithms. Accordingly, the regret $R_T$ becomes a random variable that we need to bound in some probabilistic sense.

* The author is currently with the Department of Information and Communication Technologies, Pompeu Fabra University, Barcelona, Spain.
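The interaction protocol and the regret $R_T$ above can be sketched in a few lines of code; the instance below is a toy illustration (i.i.d. uniform losses against a uniformly random learner, with arbitrary constants), not any particular adversary or algorithm from the paper:

```python
import random

random.seed(0)
T, K = 1000, 5

# Oblivious adversary: the whole loss sequence ell[t][i] in [0, 1] is fixed up
# front, independently of the learner's actions (i.i.d. uniform losses here,
# purely for illustration).
ell = [[random.random() for _ in range(K)] for _ in range(T)]

# Placeholder randomized learner: pull a uniformly random arm each round.
pulls = [random.randrange(K) for _ in range(T)]

incurred = sum(ell[t][pulls[t]] for t in range(T))
best_fixed_arm = min(sum(ell[t][i] for t in range(T)) for i in range(K))
regret = incurred - best_fixed_arm  # the random variable R_T defined above
```

Note that `regret` is itself random, both through the learner's coin flips and (for a randomized adversary) through the losses, which is exactly why the guarantees in this paper are stated in probabilistic terms.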
Most of the existing literature on non-stochastic bandits is concerned with bounding the pseudo-regret (or weak regret) defined as
$$\widehat{R}_T = \max_{i \in [K]} \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,I_t} - \sum_{t=1}^{T} \ell_{t,i}\right],$$
where the expectation integrates over the randomness injected by the learner. Proving bounds on the actual regret that hold with high probability is considered to be a significantly harder task, one that can only be achieved through substantial changes to the learning algorithms and much more complicated analyses. One common belief is that in order to obtain high-confidence performance guarantees, the learner cannot avoid repeatedly sampling arms from a uniform distribution, typically $\Omega(\sqrt{KT})$ times [5, 4, 7, 9]. It is easy to see that such explicit exploration can impact the empirical performance of learning algorithms in a very negative way if there are many arms with high losses: even if the base learning algorithm quickly learns to focus on good arms, explicit exploration still forces the regret to grow at a steady rate. As a result, algorithms with high-probability performance guarantees tend to perform poorly even in very simple problems [25, 7].

In the current paper, we propose an algorithm that guarantees strong regret bounds that hold with high probability without the explicit exploration component. One component that we preserve from the classical recipe for such algorithms is the biased estimation of losses, although our bias is of a much more delicate nature, and arguably more elegant than previous approaches. In particular, we adopt the implicit exploration (IX) strategy first proposed by Kocák, Neu, Valko, and Munos [19] for the problem of online learning with side-observations. As we show below, this simple loss-estimation strategy allows proving high-probability bounds for a range of non-stochastic bandit problems including bandits with expert advice, tracking the best arm, and bandits with side-observations. Our proofs are arguably cleaner and less involved than previous ones, and very elementary in the sense that they do not rely on advanced results from probability theory like Freedman's inequality [12]. The resulting bounds are tighter than all previously known bounds and, unlike most previous results [5, 7], hold simultaneously for all confidence levels. For the first time in the literature, we also provide high-probability bounds for anytime algorithms that do not require prior knowledge of the time horizon $T$. A minor conceptual improvement in our analysis is a direct treatment of the loss game, as opposed to previous analyses that focused on the reward game, making our treatment more coherent with other state-of-the-art results in the online learning literature.^1

The rest of the paper is organized as follows. In Section 2, we review the known techniques for proving high-probability regret bounds for non-stochastic bandits and describe our implicit exploration strategy in precise terms. Section 3 states our main result concerning the concentration of the IX loss estimates and shows applications of this result to several problem settings. Finally, we conduct a set of simple experiments to illustrate the benefits of implicit exploration over previous techniques in Section 4.

2 Explicit and implicit exploration

Most principled learning algorithms for the non-stochastic bandit problem are constructed by using a standard online learning algorithm such as the exponentially weighted forecaster ([26, 20, 13]) or follow-the-perturbed-leader ([14, 18]) as a black box, with the true (unobserved) losses replaced by some appropriate estimates.
One of the key challenges is constructing reliable estimates of the losses $\ell_{t,i}$ for all $i \in [K]$ based on the single observation $\ell_{t,I_t}$. Following Auer et al. [5], this is traditionally achieved by using importance-weighted loss/reward estimates of the form
$$\hat{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i}}\,\mathbb{I}\{I_t=i\} \qquad\text{or}\qquad \hat{r}_{t,i} = \frac{r_{t,i}}{p_{t,i}}\,\mathbb{I}\{I_t=i\}, \qquad (1)$$
where $p_{t,i} = \mathbb{P}[I_t = i \mid \mathcal{F}_{t-1}]$ is the probability that the learner picks action $i$ in round $t$, conditioned on the observation history $\mathcal{F}_{t-1}$ of the learner up to the beginning of round $t$. It is easy to show that these estimates are unbiased for all $i$ with $p_{t,i} > 0$, in the sense that $\mathbb{E}\,\hat{\ell}_{t,i} = \ell_{t,i}$ for all such $i$.

For concreteness, consider the EXP3 algorithm of Auer et al. [5] as described in Bubeck and Cesa-Bianchi [9, Section 3]. In every round $t$, this algorithm uses the loss estimates defined in Equation (1) to compute the weights $w_{t,i} = \exp\left(-\eta\sum_{s=1}^{t-1}\hat{\ell}_{s,i}\right)$ for all $i$ and some positive parameter $\eta$ that is often called the learning rate. Having computed these weights, EXP3 draws arm $I_t = i$ with probability proportional to $w_{t,i}$. Relying on the unbiasedness of the estimates (1) and an optimized setting of $\eta$, one can prove that EXP3 enjoys a pseudo-regret bound of $\sqrt{2TK\log K}$. However, the fluctuations of the loss estimates around the true losses are too large to permit bounding the true regret with high probability. To keep these fluctuations under control, Auer et al. [5] propose to use the biased reward-estimates
$$\tilde{r}_{t,i} = \hat{r}_{t,i} + \frac{\beta}{p_{t,i}} \qquad (2)$$
with an appropriately chosen $\beta > 0$. Given these estimates, the EXP3.P algorithm of Auer et al. [5] computes the weights $w_{t,i} = \exp\left(\eta\sum_{s=1}^{t-1}\tilde{r}_{s,i}\right)$ for all arms $i$ and then samples $I_t$ according to the distribution
$$p_{t,i} = (1-\gamma)\frac{w_{t,i}}{\sum_{j=1}^{K}w_{t,j}} + \frac{\gamma}{K},$$
where $\gamma \in [0,1]$ is the exploration parameter. The argument for this explicit exploration is that it helps to keep the range (and thus the variance) of the above reward estimates bounded, thus enabling the use of (more or less) standard concentration results.^2 In particular, the key element in the analysis of EXP3.P [5, 9, 7, 6] is showing that the inequality
$$\sum_{t=1}^{T}\left(r_{t,i} - \tilde{r}_{t,i}\right) \le \frac{\log(K/\delta)}{\beta}$$
holds simultaneously for all $i$ with probability at least $1-\delta$. In other words, this shows that the cumulative estimates $\sum_{t=1}^{T}\tilde{r}_{t,i}$ are upper confidence bounds for the true rewards $\sum_{t=1}^{T}r_{t,i}$.

In the current paper, we propose to use the loss estimates defined as
$$\tilde{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i}+\gamma_t}\,\mathbb{I}\{I_t=i\} \qquad (3)$$
for all $i$ and an appropriately chosen $\gamma_t > 0$, and then use the resulting estimates in an exponential-weights algorithm scheme without any explicit exploration. Loss estimates of this form were first used by Kocák et al. [19]; following them, we refer to this technique as Implicit eXploration, or, in short, IX. In what follows, we argue that IX as defined above achieves a similar variance-reducing effect as the one achieved by the combination of explicit exploration and the biased reward estimates of Equation (2).

^1 In fact, studying the loss game is colloquially known to allow better constant factors in the bounds in many settings (see, e.g., Bubeck and Cesa-Bianchi [9]). Our result further reinforces these observations.
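To make this variance-reduction effect concrete, here is a minimal Monte Carlo sketch comparing the vanilla importance-weighted estimate (1) with the IX estimate (3) for a single fixed arm; the numbers ($p = 0.01$, $\gamma = 0.05$, a constant loss of 1) are purely illustrative and not taken from the paper:

```python
import random

random.seed(1)
p, loss, gamma, n = 0.01, 1.0, 0.05, 200_000  # illustrative values only

iw, ix = [], []
for _ in range(n):
    pulled = random.random() < p  # indicator of the event {I_t = i}
    iw.append(loss / p if pulled else 0.0)            # Eq. (1): unbiased, range up to 1/p
    ix.append(loss / (p + gamma) if pulled else 0.0)  # Eq. (3): range strictly below 1/gamma

mean_iw = sum(iw) / n  # close to the true loss 1.0 (unbiased)
mean_ix = sum(ix) / n  # close to p / (p + gamma), i.e. biased downward
```

The vanilla estimate is unbiased but takes values as large as $1/p$, which is exactly the fluctuation that explicit exploration is meant to cap; the IX estimate is biased downward but its range never exceeds $1/\gamma$ regardless of how small $p_{t,i}$ gets.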
In particular, we show that the IX estimates (3) constitute a lower confidence bound for the true losses, which allows proving high-probability bounds for a number of variants of the multi-armed bandit problem.

3 High-probability regret bounds via implicit exploration

In this section, we present a concentration result concerning the IX loss estimates of Equation (3), and apply this result to prove high-probability performance guarantees for a number of non-stochastic bandit problems. The following lemma states our concentration result in its most general form:

^2 Explicit exploration is believed to be inevitable for proving bounds in the reward game for various other reasons, too; see Bubeck and Cesa-Bianchi [9] for a discussion.

Lemma 1. Let $(\gamma_t)$ be a fixed non-increasing sequence with $\gamma_t \ge 0$ and let $\alpha_{t,i}$ be nonnegative $\mathcal{F}_{t-1}$-measurable random variables satisfying $\alpha_{t,i} \le 2\gamma_t$ for all $t$ and $i$. Then, with probability at least $1-\delta$,
$$\sum_{t=1}^{T}\sum_{i=1}^{K}\alpha_{t,i}\left(\tilde{\ell}_{t,i} - \ell_{t,i}\right) \le \log(1/\delta).$$

A particularly important special case of the above lemma is the following:

Corollary 1. Let $\gamma_t = \gamma > 0$ for all $t$. With probability at least $1-\delta$,
$$\sum_{t=1}^{T}\left(\tilde{\ell}_{t,i} - \ell_{t,i}\right) \le \frac{\log(K/\delta)}{2\gamma}$$
simultaneously holds for all $i \in [K]$.

This corollary follows from applying Lemma 1 to the functions $\alpha_{t,i} = 2\gamma\,\mathbb{I}\{i=j\}$ for all $j$ and applying the union bound. The full proof of Lemma 1 is presented in the Appendix. For didactic purposes, we now present a direct proof of Corollary 1, which is essentially a simplified version of the proof of Lemma 1.

Proof of Corollary 1. For convenience, we will use the notation $\beta = 2\gamma$. First, observe that
$$\tilde{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i}+\gamma}\,\mathbb{I}\{I_t=i\} \le \frac{\ell_{t,i}}{p_{t,i}+\gamma\ell_{t,i}}\,\mathbb{I}\{I_t=i\} = \frac{1}{2\gamma}\cdot\frac{2\gamma\ell_{t,i}/p_{t,i}}{1+\gamma\ell_{t,i}/p_{t,i}}\,\mathbb{I}\{I_t=i\} \le \frac{1}{\beta}\log\left(1+\beta\hat{\ell}_{t,i}\right),$$
where the first step follows from $\ell_{t,i} \in [0,1]$ and the last one from the elementary inequality $\frac{z}{1+z/2} \le \log(1+z)$ that holds for all $z \ge 0$. Using the above inequality, we get that
$$\mathbb{E}\left[\exp\left(\beta\tilde{\ell}_{t,i}\right)\,\middle|\,\mathcal{F}_{t-1}\right] \le \mathbb{E}\left[1+\beta\hat{\ell}_{t,i}\,\middle|\,\mathcal{F}_{t-1}\right] \le 1+\beta\ell_{t,i} \le \exp\left(\beta\ell_{t,i}\right),$$
where the second and third steps are obtained by using $\mathbb{E}[\hat{\ell}_{t,i}\,|\,\mathcal{F}_{t-1}] \le \ell_{t,i}$, which holds by definition of $\hat{\ell}_{t,i}$, and the inequality $1+z \le e^z$ that holds for all $z \in \mathbb{R}$. As a result, the process $Z_t = \exp\left(\beta\sum_{s=1}^{t}\left(\tilde{\ell}_{s,i}-\ell_{s,i}\right)\right)$ is a supermartingale with respect to $(\mathcal{F}_t)$: $\mathbb{E}[Z_t\,|\,\mathcal{F}_{t-1}] \le Z_{t-1}$. Observe that, since $Z_0 = 1$, this implies $\mathbb{E}[Z_T] \le \mathbb{E}[Z_{T-1}] \le \dots \le 1$, and thus by Markov's inequality,
$$\mathbb{P}\left[\sum_{t=1}^{T}\left(\tilde{\ell}_{t,i}-\ell_{t,i}\right) > \varepsilon\right] \le \mathbb{E}\left[\exp\left(\beta\sum_{t=1}^{T}\left(\tilde{\ell}_{t,i}-\ell_{t,i}\right)\right)\right]\cdot\exp(-\beta\varepsilon) \le \exp(-\beta\varepsilon)$$
holds for any $\varepsilon > 0$. The statement of the corollary follows from solving $\exp(-\beta\varepsilon) = \delta/K$ for $\varepsilon$ and using the union bound over all arms $i$.

In what follows, we put Lemma 1 to use and prove improved high-probability performance guarantees for several well-studied variants of the non-stochastic bandit problem, namely, the multi-armed bandit problem with expert advice, tracking the best arm for multi-armed bandits, and bandits with side-observations. The general form of Lemma 1 will allow us to prove high-probability bounds for anytime algorithms that can operate without prior knowledge of $T$. For clarity, we will only provide such bounds for the standard multi-armed bandit setting; extending the derivations to other settings is left as an easy exercise. For all algorithms, we prove bounds that scale linearly with $\log(1/\delta)$ and hold simultaneously for all levels $\delta$. Note that this dependence can be improved to $\sqrt{\log(1/\delta)}$ for a fixed confidence level $\delta$, if the algorithm can use this $\delta$ to tune its parameters. This is the way that Table 1 presents our new bounds side-by-side with the best previously known ones.

| Setting | Best known regret bound | Our new regret bound |
| Multi-armed bandits | $5.15\sqrt{TK\log(K/\delta)}$ | $2\sqrt{2TK\log(K/\delta)}$ |
| Bandits with expert advice | $6\sqrt{TK\log(N/\delta)}$ | $2\sqrt{2TK\log(N/\delta)}$ |
| Tracking the best arm | $7\sqrt{KTS\log(KT/\delta S)}$ | $2\sqrt{2KTS\log(KT/\delta S)}$ |
| Bandits with side-observations | $\tilde{O}(\sqrt{mT})$ | $\tilde{O}(\sqrt{\alpha T})$ |

Table 1: Our results compared to the best previously known results in the four settings considered in Sections 3.1–3.4. See the respective sections for references and notation.

3.1 Multi-armed bandits

In this section, we propose a variant of the EXP3 algorithm of Auer et al. [5] that uses the IX loss estimates (3): EXP3-IX.
The algorithm in its most general form uses two nonincreasing sequences of nonnegative parameters: $(\eta_t)$ and $(\gamma_t)$. In every round, EXP3-IX chooses action $I_t = i$ with probability proportional to
$$p_{t,i} \propto w_{t,i} = \exp\left(-\eta_t\sum_{s=1}^{t-1}\tilde{\ell}_{s,i}\right), \qquad (4)$$
without mixing any explicit exploration term into the distribution. A fixed-parameter version of EXP3-IX is presented as Algorithm 1.

Algorithm 1 EXP3-IX
Parameters: $\eta > 0$, $\gamma > 0$.
Initialization: $w_{1,i} = 1$ for all $i \in [K]$.
For $t = 1, 2, \dots, T$, repeat:
1. $p_{t,i} = \frac{w_{t,i}}{\sum_{j=1}^{K} w_{t,j}}$ for all $i \in [K]$.
2. Draw $I_t \sim p_t = (p_{t,1}, \dots, p_{t,K})$.
3. Observe loss $\ell_{t,I_t}$.
4. $\tilde{\ell}_{t,i} \leftarrow \frac{\ell_{t,i}}{p_{t,i}+\gamma}\,\mathbb{I}\{I_t=i\}$ for all $i \in [K]$.
5. $w_{t+1,i} \leftarrow w_{t,i}\,e^{-\eta\tilde{\ell}_{t,i}}$ for all $i \in [K]$.

Our theorem below states a high-probability bound on the regret of EXP3-IX. Notably, our bound exhibits the best known constant factor of $2\sqrt{2}$ in the leading term, improving on the factor of 5.15 due to Bubeck and Cesa-Bianchi [9]. The best known leading constant for the pseudo-regret bound of EXP3 is $\sqrt{2}$, also proved in Bubeck and Cesa-Bianchi [9].

Theorem 1. Fix an arbitrary $\delta > 0$. With $\eta_t = 2\gamma_t = \sqrt{\frac{2\log K}{KT}}$ for all $t$, EXP3-IX guarantees
$$R_T \le 2\sqrt{2KT\log K} + \left(\sqrt{\frac{2KT}{\log K}} + 1\right)\log(2/\delta)$$
with probability at least $1-\delta$. Furthermore, setting $\eta_t = 2\gamma_t = \sqrt{\frac{\log K}{Kt}}$ for all $t$, the bound becomes
$$R_T \le 4\sqrt{KT\log K} + \left(2\sqrt{\frac{KT}{\log K}} + 1\right)\log(2/\delta).$$

Proof. Let us fix an arbitrary $\delta' \in (0,1)$. Following the standard analysis of EXP3 in the loss game with nonincreasing learning rates [9], we can obtain the bound
$$\sum_{t=1}^{T}\sum_{i=1}^{K} p_{t,i}\tilde{\ell}_{t,i} - \sum_{t=1}^{T}\tilde{\ell}_{t,j} \le \frac{\log K}{\eta_T} + \sum_{t=1}^{T}\frac{\eta_t}{2}\sum_{i=1}^{K} p_{t,i}\left(\tilde{\ell}_{t,i}\right)^2 \qquad (5)$$
for any $j$. Now observe that
$$\sum_{i=1}^{K} p_{t,i}\tilde{\ell}_{t,i} = \sum_{i=1}^{K}\frac{(p_{t,i}+\gamma_t)-\gamma_t}{p_{t,i}+\gamma_t}\,\ell_{t,i}\,\mathbb{I}\{I_t=i\} = \ell_{t,I_t} - \gamma_t\sum_{i=1}^{K}\tilde{\ell}_{t,i}.$$
Similarly, $\sum_{i=1}^{K} p_{t,i}\tilde{\ell}_{t,i}^2 \le \sum_{i=1}^{K}\tilde{\ell}_{t,i}$ holds by the boundedness of the losses. Thus, we get that
$$\sum_{t=1}^{T}\left(\ell_{t,I_t} - \ell_{t,j}\right) \le \frac{\log K}{\eta_T} + \sum_{t=1}^{T}\left(\frac{\eta_t}{2}+\gamma_t\right)\sum_{i=1}^{K}\tilde{\ell}_{t,i} + \sum_{t=1}^{T}\left(\tilde{\ell}_{t,j} - \ell_{t,j}\right)$$
$$\le \frac{\log K}{\eta_T} + \sum_{t=1}^{T}\left(\frac{\eta_t}{2}+\gamma_t\right)\sum_{i=1}^{K}\ell_{t,i} + \log(1/\delta') + \frac{\log(K/\delta')}{2\gamma_T}$$
holds with probability at least $1-2\delta'$, where the last line follows from an application of Lemma 1 with $\alpha_{t,i} = \eta_t/2 + \gamma_t$ for all $t, i$, another application with $\alpha_{t,i} = 2\gamma_T\,\mathbb{I}\{i=j\}$ for each fixed $j$, and taking the union bound. By taking $j = \arg\min_i L_{T,i}$ and $\delta' = \delta/2$, and using the boundedness of the losses, we obtain
$$R_T \le \frac{\log(2K/\delta)}{2\gamma_T} + \frac{\log K}{\eta_T} + \sum_{t=1}^{T}\left(\frac{\eta_t}{2}+\gamma_t\right)K + \log(2/\delta).$$
The statements of the theorem then follow immediately, noting that $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T}$.

3.2 Bandits with expert advice

We now turn to the setting of multi-armed bandits with expert advice, as defined in Auer et al. [5], and later revisited by McMahan and Streeter [22] and Beygelzimer et al. [7]. In this setting, we assume that in every round $t = 1, 2, \dots, T$, the learner observes a set of $N$ probability distributions $\xi_t(1), \xi_t(2), \dots, \xi_t(N) \in [0,1]^K$ over the $K$ arms, such that $\sum_{i=1}^{K}\xi_{t,i}(n) = 1$ for all $n \in [N]$. We assume that the sequences $(\xi_t(n))$ are measurable with respect to $(\mathcal{F}_t)$. The $n$th of these vectors represents the probabilistic advice of the corresponding $n$th expert. The goal of the learner in this setting is to pick a sequence of arms so as to minimize the regret against the best expert:
$$R_T^{\xi} = \sum_{t=1}^{T}\ell_{t,I_t} - \min_{n\in[N]}\sum_{t=1}^{T}\sum_{i=1}^{K}\xi_{t,i}(n)\,\ell_{t,i}.$$
To tackle this problem, we propose a modification of the EXP4 algorithm of Auer et al. [5] that uses the IX loss estimates (3), and also drops the explicit exploration component of the original algorithm. Specifically, EXP4-IX uses the loss estimates defined in Equation (3) to compute the weights
$$w_{t,n} = \exp\left(-\eta\sum_{s=1}^{t-1}\sum_{i=1}^{K}\xi_{s,i}(n)\,\tilde{\ell}_{s,i}\right)$$
for every expert $n \in [N]$, and then draws arm $i$ with probability $p_{t,i} \propto \sum_{n=1}^{N} w_{t,n}\,\xi_{t,i}(n)$. We now state the performance guarantee of EXP4-IX.
Our bound improves the best known leading constant of 6, due to Beygelzimer et al. [7], to $2\sqrt{2}$, and is a factor of 2 worse than the best known constant in the pseudo-regret bound for EXP4 [9]. The proof of the theorem is presented in the Appendix.

Theorem 2. Fix an arbitrary $\delta > 0$ and set $\eta = 2\gamma = \sqrt{\frac{2\log N}{KT}}$ for all $t$. Then, with probability at least $1-\delta$, the regret of EXP4-IX satisfies
$$R_T^{\xi} \le 2\sqrt{2KT\log N} + \left(\sqrt{\frac{2KT}{\log N}} + 1\right)\log(2/\delta).$$

3.3 Tracking the best sequence of arms

In this section, we consider the problem of competing with sequences of actions. Similarly to Herbster and Warmuth [17], we consider the class of sequences that switch at most $S$ times between actions. We measure the performance of the learner in this setting in terms of the regret against the best sequence from this class $\mathcal{C}(S) \subseteq [K]^T$, defined as
$$R_T^{S} = \sum_{t=1}^{T}\ell_{t,I_t} - \min_{(J_t)\in\mathcal{C}(S)}\sum_{t=1}^{T}\ell_{t,J_t}.$$
Similarly to Auer et al. [5], we now propose to adapt the Fixed Share algorithm of Herbster and Warmuth [17] to our setting. Our algorithm, called EXP3-SIX, updates a set of weights $w_{t,\cdot}$ over the arms in a recursive fashion. In the first round, EXP3-SIX sets $w_{1,i} = 1/K$ for all $i$. In the following rounds, the weights are updated for every arm $i$ as
$$w_{t+1,i} = (1-\alpha)\,w_{t,i}\,e^{-\eta\tilde{\ell}_{t,i}} + \frac{\alpha}{K}\sum_{j=1}^{K} w_{t,j}\,e^{-\eta\tilde{\ell}_{t,j}}.$$
In round $t$, the algorithm draws arm $I_t = i$ with probability $p_{t,i} \propto w_{t,i}$. Below, we give the performance guarantees of EXP3-SIX. Note that our leading factor of $2\sqrt{2}$ again improves over the best previously known leading factor of 7, shown by Audibert and Bubeck [3]. The proof of the theorem is given in the Appendix.

Theorem 3. Fix an arbitrary $\delta > 0$ and set $\eta = 2\gamma = \sqrt{\frac{2\bar{S}\log K}{KT}}$ and $\alpha = \frac{S}{T-1}$, where $\bar{S} = S+1$. Then, with probability at least $1-\delta$, the regret of EXP3-SIX satisfies
$$R_T^{S} \le 2\sqrt{2KT\bar{S}\log\left(\frac{eKT}{S}\right)} + \left(\sqrt{\frac{2KT}{\bar{S}\log K}} + 1\right)\log(2/\delta).$$

3.4 Bandits with side-observations

Let us now turn to the problem of online learning in bandit problems in the presence of side observations, as defined by Mannor and Shamir [21] and later elaborated by Alon et al. [1]. In this setting, the learner and the environment interact exactly as in the multi-armed bandit problem, the main difference being that in every round, the learner observes the losses of some arms other than its actually chosen arm $I_t$. The structure of the side observations is described by the directed graph $G$: nodes of $G$ correspond to individual arms, and the presence of arc $i \to j$ implies that the learner will observe $\ell_{t,j}$ upon selecting $I_t = i$.

Implicit exploration and EXP3-IX were first proposed by Kocák et al. [19] for this precise setting. To describe this variant, let us introduce the notations $O_{t,i} = \mathbb{I}\{I_t=i\} + \mathbb{I}\{(I_t\to i)\in G\}$ and $o_{t,i} = \mathbb{E}[O_{t,i}\,|\,\mathcal{F}_{t-1}]$.
Then, the IX loss estimates in this setting are defined for all $t, i$ as $\tilde{\ell}_{t,i} = \frac{O_{t,i}\,\ell_{t,i}}{o_{t,i}+\gamma_t}$. With these estimates at hand, EXP3-IX draws arm $I_t$ from the exponentially weighted distribution defined in Equation (4). The following theorem provides the regret bound concerning this algorithm.

Theorem 4. Fix an arbitrary $\delta > 0$. Assume that $T \ge K^2/(8\alpha)$ and set $\eta = 2\gamma = \sqrt{\frac{\log K}{2\alpha T\log(KT)}}$, where $\alpha$ is the independence number of $G$. With probability at least $1-\delta$, EXP3-IX guarantees
$$R_T \le \left(4 + 2\sqrt{\log(4/\delta)}\right)\cdot\sqrt{2\alpha T\left(\log^2 K + \log KT\right)} + 2\sqrt{\frac{\alpha T\log(KT)}{\log K}}\,\log(4/\delta) + \sqrt{2T\log(4/\delta)}.$$

The proof of the theorem is given in the Appendix. While the proof of this statement is significantly more involved than the other proofs presented in this paper, it provides a fundamentally new result. In particular, our bound is in terms of the independence number $\alpha$ and thus matches the minimax regret bound proved by Alon et al. [1] for this setting up to logarithmic factors. In contrast, the only high-probability regret bound for this setting, due to Alon et al. [2], scales with the size $m$ of the maximal acyclic subgraph of $G$, which can be much larger than $\alpha$ in general (i.e., $\alpha$ may be $o(m)$ for some graphs [1]).

4 Empirical evaluation

We conduct a simple experiment to demonstrate the robustness of EXP3-IX as compared to EXP3, and its superior performance as compared to EXP3.P. Our setting is a 10-arm bandit problem where all losses are independent draws of Bernoulli random variables. The mean losses of arms 1 through 8 are $1/2$ and the mean loss of arm 9 is $1/2 - \Delta$ for all rounds $t = 1, 2, \dots, T$. The mean losses of arm 10 change over time: for rounds $t \le T/2$, the mean is $1/2 + \Delta$, and $1/2 - 4\Delta$ afterwards. This choice ensures that up to at least round $T/2$, arm 9 is clearly better than the other arms. In the second half of the game, arm 10 starts to outperform arm 9 and eventually becomes the leader.

We have evaluated the performance of EXP3, EXP3.P and EXP3-IX in the above setting with $T = 10^6$ and $\Delta = 0.1$. For fairness of comparison, we evaluate all three algorithms for a wide range of parameters.
In particular, for all three algorithms, we set a base learning rate $\eta$ according to the best known theoretical results [9, Theorems 3.1 and 3.3] and varied the multiplier of the respective base parameters between 0.01 and 100. Other parameters are set as $\gamma = \eta/2$ for EXP3-IX and $\beta = \gamma/K = \eta$ for EXP3.P. We studied the regret up to two interesting rounds in the game: up to $T/2$, where the losses are i.i.d., and up to $T$, where the algorithms have to notice the shift in the loss distributions. Figure 1 shows the empirical means and standard deviations over 50 runs of the regrets of the three algorithms as a function of the multipliers. The results clearly show that EXP3-IX largely improves on the empirical performance of EXP3.P and is also much more robust in the non-stochastic regime than vanilla EXP3.

[Figure 1: Regret of EXP3, EXP3.P, and EXP3-IX, respectively, in the problem described in Section 4.]

5 Discussion

In this paper, we have shown that, contrary to popular belief, explicit exploration is not necessary to achieve high-probability regret bounds for non-stochastic bandit problems. Interestingly, however, we have observed in several of our experiments that our IX-based algorithms still draw every arm roughly $\sqrt{T}$ times, even though this is not explicitly enforced by the algorithm. This suggests a need for a more complete study of the role of exploration, to find out whether pulling every single arm $\Omega(\sqrt{T})$ times is necessary for achieving near-optimal guarantees.

One can argue that tuning the IX parameter that we introduce may actually be just as difficult in practice as tuning the parameters of EXP3.P. However, every aspect of our analysis suggests that $\gamma_t = \eta_t/2$ is the most natural choice for these parameters, and thus this is the choice that we recommend.
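The recommended tuning plugs directly into Algorithm 1. Below is a minimal, self-contained sketch of EXP3-IX with the fixed-parameter choice $\eta = 2\gamma = \sqrt{2\log K/(KT)}$ from Theorem 1, run on a simple stochastic instance; the Bernoulli means and horizon are illustrative and differ from the shifting instance of Section 4:

```python
import math
import random

random.seed(2)
T, K = 5000, 10
eta = math.sqrt(2 * math.log(K) / (K * T))  # fixed-parameter tuning of Theorem 1
gamma = eta / 2                             # the recommended choice gamma = eta / 2

# Illustrative Bernoulli loss means (not the paper's instance): arm 0 is best.
means = [0.2] + [0.5] * (K - 1)

cum_est = [0.0] * K      # cumulative IX loss estimates, one per arm
pull_counts = [0] * K
for t in range(T):
    m = min(cum_est)     # shift estimates before exponentiating, for stability
    weights = [math.exp(-eta * (c - m)) for c in cum_est]
    total = sum(weights)
    probs = [w / total for w in weights]
    arm = random.choices(range(K), weights=probs)[0]   # draw I_t ~ p_t
    loss = float(random.random() < means[arm])         # Bernoulli loss in {0, 1}
    cum_est[arm] += loss / (probs[arm] + gamma)        # IX estimate, Equation (3)
    pull_counts[arm] += 1
```

Note that no uniform-exploration term is ever mixed into `probs`; with these toy means the learner nevertheless concentrates its pulls on the good arm, while the IX denominator keeps every estimate bounded by $1/\gamma$.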
One limitation of our current analysis is that it only permits deterministic learning-rate and IX parameters (see the conditions of Lemma 1). That is, proving adaptive regret bounds in the vein of [15, 24, 23] that hold with high probability is still an open challenge.

Another interesting direction for future work is whether the implicit exploration approach can help in advancing the state of the art in the more general setting of linear bandits. All known algorithms for this setting rely on explicit exploration techniques, and the strength of the obtained results depends crucially on the choice of the exploration distribution (see [8, 16] for recent advances). Interestingly, IX has a natural extension to the linear bandit problem. To see this, consider the vector $V_t = e_{I_t}$ and the matrix $P_t = \mathbb{E}[V_t V_t^{\mathsf{T}}]$. Then, the IX loss estimates can be written as $\tilde{\ell}_t = (P_t + \gamma I)^{-1} V_t V_t^{\mathsf{T}}\ell_t$. Whether or not this estimate is the right choice for linear bandits remains to be seen.

Finally, we note that our estimates (3) are certainly not the only ones that allow avoiding explicit exploration. In fact, the careful reader might deduce from the proof of Lemma 1 that the same concentration can be shown to hold for the alternative loss estimates $\ell_{t,i}\,\mathbb{I}\{I_t=i\}/\left(p_{t,i}+\gamma\ell_{t,i}\right)$ and $\log\left(1 + 2\gamma\ell_{t,i}\,\mathbb{I}\{I_t=i\}/p_{t,i}\right)/(2\gamma)$. Actually, a variant of the latter estimate was used previously for proving high-probability regret bounds in the reward game by Audibert and Bubeck [4]; however, their proof still relied on explicit exploration. It is not hard to verify that all the results we presented in this paper (except Theorem 4) can be shown to hold for the above two estimates, too.

Acknowledgments

This work was supported by INRIA, the French Ministry of Higher Education and Research, and by FUI project Hermès. The author wishes to thank Haipeng Luo for catching a
The author wishes to thank Haipeng Luo for catching a bug in an earlier version of the paper, and the anonymous reviewers for their helpful suggestions.

References

[1] N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour. From Bandits to Experts: A Tale of Domination and Independence. In NIPS-25, pages 1610–1618, 2012.

[2] N. Alon, N. Cesa-Bianchi, C. Gentile, S. Mannor, Y. Mansour, and O. Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. arXiv preprint arXiv:1409.8428, 2014.

[3] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

[4] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836, 2010.

[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[6] P. L. Bartlett, V. Dani, T. P. Hayes, S. Kakade, A. Rakhlin, and A. Tewari. High-probability regret bounds for bandit online linear optimization. In COLT, pages 335–342, 2008.

[7] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In AISTATS, pages 19–26, 2011.

[8] S. Bubeck, N. Cesa-Bianchi, and S. M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. 2012.

[9] S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Now Publishers Inc., 2012.

[10] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games.
Cambridge University Press, New York, NY, USA, 2006.

[11] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In NIPS-25, pages 989–997, 2012.

[12] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3:100–118, 1975.

[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

[14] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[15] E. Hazan and S. Kale. Better algorithms for benign bandits. Journal of Machine Learning Research, 12:1287–1311, 2011.

[16] E. Hazan, Z. Karnin, and R. Meka. Volumetric spanners: an efficient exploration basis for learning. In COLT, pages 408–422, 2014.

[17] M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.

[18] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307, 2005.

[19] T. Kocák, G. Neu, M. Valko, and R. Munos. Efficient learning by implicit exploration in bandit problems with side observations. In NIPS-27, pages 613–621, 2014.

[20] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

[21] S. Mannor and O. Shamir. From Bandits to Experts: On the Value of Side-Observations. In Neural Information Processing Systems, 2011.

[22] H. B. McMahan and M. Streeter. Tighter bounds for multi-armed bandits with expert advice. In COLT, 2009.

[23] G. Neu. First-order regret bounds for combinatorial semi-bandits. In COLT, pages 1360–1375, 2015.

[24] A. Rakhlin and K. Sridharan. Online learning with predictable sequences.
In COLT, pages 993–1019, 2013.

[25] Y. Seldin, N. Cesa-Bianchi, P. Auer, F. Laviolette, and J. Shawe-Taylor. PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, 2012.

[26] V. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory (COLT), pages 371–386, 1990.