{"title": "Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 2184, "page_last": 2192, "abstract": "We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over $[0,1]$. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and the first $m$ failures, respectively, where $m$ is a fixed parameter. This two-target algorithm achieves a long-term average regret in $\\sqrt{2n}$ for a large parameter $m$ and a known time horizon $n$. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in $2\\sqrt{n}$. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.", "full_text": "Two-Target Algorithms for In\ufb01nite-Armed Bandits\n\nwith Bernoulli Rewards\n\nThomas Bonald\u2217\n\nTelecom ParisTech\n\nParis, France\n\nthomas.bonald@telecom-paristech.fr\n\nDepartment of Networking and Computer Science\n\nAutomatic Control Department\n\nAlexandre Prouti`ere\u2217\u2020\n\nKTH\n\nStockholm, Sweden\nalepro@kth.se\n\nAbstract\n\nWe consider an in\ufb01nite-armed bandit problem with Bernoulli rewards. The mean\nrewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are\nreferred to as a success and a failure, respectively. We propose a novel algorithm\nwhere the decision to exploit any arm is based on two successive targets, namely,\nthe total number of successes until the \ufb01rst failure and until the \ufb01rst m failures,\nrespectively, where m is a \ufb01xed parameter. 
This two-target algorithm achieves a long-term average regret in \u221a(2n) for a large parameter m and a known time horizon n. This regret is optimal and strictly less than the regret achieved by the best known algorithms, which is in 2\u221an. The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.\n\n1 Introduction\n\nMotivation. While classical multi-armed bandit problems assume a finite number of arms [9], many practical situations involve a large, possibly infinite set of options for the player. This is the case for instance of on-line advertisement and content recommendation, where the objective is to propose the most appropriate categories of items to each user in a very large catalogue. In such situations, it is usually not possible to explore all options, a constraint that is best represented by a bandit problem with an infinite number of arms. Moreover, even when the set of options is limited, the time horizon may be too short in practice to enable the full exploration of these options. Unlike classical algorithms like UCB [10, 1], which rely on an initial phase where all arms are sampled once, algorithms for infinite-armed bandits have an intrinsic stopping rule in the number of arms to explore. We believe that this provides useful insights into the design of efficient algorithms for usual finite-armed bandits when the time horizon is relatively short.\n\nOverview of the results. We consider a stochastic infinite-armed bandit with Bernoulli rewards, the mean reward of each arm having a uniform distribution over [0, 1]. This model is representative of a number of practical situations, such as content recommendation systems with like/dislike feedback and without any prior information on the user preferences. 
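A minimal simulation sketch of this reward model (infinitely many arms, Bernoulli rewards, mean rewards i.i.d. uniform over [0, 1]); the helper names are ours, for illustration only, not from the paper:

```python
import random

def new_arm(rng):
    """Sample a fresh arm: its mean reward is uniform over [0, 1]."""
    return rng.random()

def pull(theta, rng):
    """Bernoulli reward: 1 (a success) with probability theta, else 0 (a failure)."""
    return 1 if rng.random() < theta else 0

def run_length(theta, rng):
    """A run is a consecutive sequence of successes followed by a failure;
    return the number of successes in one run of an arm with mean theta."""
    length = 0
    while pull(theta, rng) == 1:
        length += 1
    return length

def regret(n, rewards):
    """Regret after n steps: R_n = n - sum of the n rewards received."""
    return n - sum(rewards)
```

Under this model, an algorithm effectively chooses, after each failure, whether to keep playing the current arm or to sample a fresh one.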
We propose a two-target algorithm based on some fixed parameter m that achieves a long-term average regret in \u221a(2n) for large m and a large known time horizon n. We prove that this regret is optimal. The anytime version of this algorithm achieves a long-term average regret in 2\u221an for unknown time horizon n, which we conjecture to be also optimal. The results are extended to any mean-reward distribution whose support contains 1. Specifically, if the probability that the mean reward exceeds u is equivalent to \u03b1(1 \u2212 u)^\u03b2 when u \u2192 1\u2212, the two-target algorithm achieves a long-term average regret in C(\u03b1, \u03b2) n^{\u03b2/(\u03b2+1)}, with some explicit constant C(\u03b1, \u03b2) that depends on whether the time horizon is known or not. This regret is provably optimal when the time horizon is known. The precise statements and proofs of these more general results are given in the appendix.\n\n\u2217The authors are members of the LINCS, Paris, France. See www.lincs.fr.\n\u2020Alexandre Prouti\u00e8re is also affiliated with INRIA, Paris-Rocquencourt, France.\n\nRelated work. The stochastic infinite-armed bandit problem has first been considered in a general setting by Mallows and Robbins [12] and then in the particular case of Bernoulli rewards by Herschkorn, Pek\u00f6z and Ross [6]. The proposed algorithms are first-order optimal in the sense that they minimize the ratio Rn/n for large n, where Rn is the regret after n time steps. In the considered setting of Bernoulli rewards with mean rewards uniformly distributed over [0, 1], this means that the ratio Rn/n tends to 0 almost surely. We are interested in second-order optimality, namely, in minimizing the equivalent of Rn for large n. This issue is addressed by Berry et al. [2], who propose various algorithms achieving a long-term average regret in 2\u221an, conjecture that this regret is optimal and provide a lower bound in \u221a(2n). Our algorithm achieves a regret that is arbitrarily close to \u221a(2n), which invalidates the conjecture. We also provide a proof of the lower bound in \u221a(2n) since that given in [2, Theorem 3] relies on the incorrect argument that the number of explored arms and the mean rewards of these arms are independent random variables\u00b9; the extension to any mean-reward distribution [2, Theorem 11] is based on the same erroneous argument\u00b2.\n\nThe algorithms proposed by Berry et al. [2] and applied in [11, 4, 5, 7] to various mean-reward distributions are variants of the 1-failure strategy where each arm is played until the first failure, called a run. For instance, the non-recalling \u221an-run policy consists in exploiting the first arm giving a run larger than \u221an. For a uniform mean-reward distribution over [0, 1], the average number of explored arms is \u221an and the selected arm is exploited for the equivalent of n time steps with an expected failure rate of 1/\u221an, yielding the regret of 2\u221an. We introduce a second target to improve the expected failure rate of the selected arm, at the expense of a slightly more expensive exploration phase. Specifically, we show that it is optimal to explore \u221a(n/2) arms on average, resulting in the expected failure rate 1/\u221a(2n) of the exploited arm, for the equivalent of n time steps, hence the regret of \u221a(2n). For unknown horizon times, anytime versions of the algorithms of Berry et al. [2] are proposed by Teytaud, Gelly and Sebag in [13] and proved to achieve a regret in O(\u221an). We show that the anytime version of our algorithm achieves a regret arbitrarily close to 2\u221an, which we conjecture to be optimal.\n\nOur results extend to any mean-reward distribution whose support contains 1, the regret depending on the characteristics of this distribution around 1. This problem has been considered in the more general setting of bounded rewards by Wang, Audibert and Munos [15]. When the time horizon is known, their algorithms consist in exploring a pre-defined set of K arms, which depends on the parameter \u03b2 mentioned above, using variants of the UCB policy [10, 1]. In the present case of Bernoulli rewards and mean-reward distributions whose support contains 1, the corresponding regret is in n^{\u03b2/(\u03b2+1)}, up to logarithmic terms coming from the exploration of the K arms, as in usual finite-armed bandit algorithms [9]. The nature of our algorithm is very different in that it is based on a stopping rule in the exploration phase that depends on the observed rewards. This does not only remove the logarithmic terms in the regret but also achieves the optimal constant.\n\n2 Model\n\nWe consider a stochastic multi-armed bandit with an infinite number of arms. For any k = 1, 2, . . ., the rewards of arm k are Bernoulli with unknown parameter \u03b8k. We refer to rewards 0 and 1 as a failure and a success, respectively, and to a run as a consecutive sequence of successes followed by a failure. The mean rewards \u03b81, \u03b82, . . . are themselves random, uniformly distributed over [0, 1].\n\n\u00b9Specifically, it is assumed that for any policy, the mean rewards of the explored arms have a uniform distribution over [0, 1], independently of the number of explored arms. This is incorrect. For the 1-failure policy for instance, given that only one arm has been explored until time n, the mean reward of this arm has a beta distribution with parameters 1, n.\n\u00b2This lower bound is 4\u221a(n/3) for a beta distribution with parameters 1/2, 1, see [11], while our algorithm achieves a regret arbitrarily close to 2\u221an in this case, since C(\u03b1, \u03b2) = 2 for \u03b1 = 1/2 and \u03b2 = 1, see the appendix. 
Thus the statement of [2, Theorem 11] is false.\n\nAt any time t = 1, 2, . . ., we select some arm It and receive the corresponding reward Xt, which is a Bernoulli random variable with parameter \u03b8It. We take I1 = 1 by convention. At any time t = 2, 3, . . ., the arm selection only depends on previous arm selections and rewards; formally, the random variable It is Ft\u22121-measurable, where Ft denotes the \u03c3-field generated by the set {I1, X1, . . . , It, Xt}. Let Kt be the number of arms selected until time t. Without any loss of generality, we assume that {I1, . . . , It} = {1, 2, . . . , Kt} for all t = 1, 2, . . ., i.e., new arms are selected sequentially. We also assume that It+1 = It whenever Xt = 1: if the selection of arm It gives a success at time t, the same arm is selected at time t + 1.\n\nThe objective is to maximize the cumulative reward or, equivalently, to minimize the regret defined by Rn = n \u2212 \u2211_{t=1}^{n} Xt. Specifically, we focus on the average regret E(Rn), where expectation is taken over all random variables, including the sequence of mean rewards \u03b81, \u03b82, . . .. The time horizon n may be known or unknown.\n\n3 Known time horizon\n\n3.1 Two-target algorithm\n\nThe two-target algorithm consists in exploring new arms until two successive targets \u21131 and \u21132 are reached, in which case the current arm is exploited until the time horizon n. The first target aims at discarding \u201cbad\u201d arms while the second aims at selecting a \u201cgood\u201d arm. Specifically, using the names of the variables indicated in the pseudo-code below, if the length L of the first run of the current arm I is less than \u21131, this arm is discarded and a new arm is selected; otherwise, arm I is pulled for m \u2212 1 additional runs and exploited until time n if the total length L of the m runs is at least \u21132, where m \u2265 2 is a fixed parameter of the algorithm. We prove in Proposition 1 below that,
We prove in Proposition 1 below that,\n\nfor large m, the target values3 (cid:96)1 = (cid:98) 3(cid:112) n\n\n2(cid:99) and (cid:96)2 = (cid:98)m(cid:112) n\n\n2(cid:99) provide a regret in\n\n2n.\n\n\u221a\n\n(cid:96)1 = (cid:98) 3(cid:112) n\n\n2(cid:99), (cid:96)2 = (cid:98)m(cid:112) n\n\nAlgorithm 1: Two-target algorithm with known time horizon n.\nParameters: m, n\nFunction:\nExplore\nI \u2190 I + 1, L \u2190 0, M \u2190 0\nAlgorithm:\n2(cid:99)\nI \u2190 0\nExplore\nExploit \u2190 false\nforall the t = 1, 2, . . . , n do\nGet reward X from arm I\nif not Exploit then\nif X = 1 then\nL \u2190 L + 1\nM \u2190 M + 1\nif M = 1 then\n\nelse\n\nif L < (cid:96)1 then\n\nExplore\n\nelse if M = m then\nif L < (cid:96)2 then\n\nelse\n\nExplore\nExploit \u2190 true\n\n3The \ufb01rst target could actually be any function (cid:96)1 of the time horizon n such that (cid:96)1 \u2192 +\u221e and (cid:96)1/\n\nwhen n \u2192 +\u221e. Only the second target is critical.\n\n\u221a\nn \u2192 0\n\n3\n\n\f3.2 Regret analysis\n\nProposition 1 The two-target algorithm with targets (cid:96)1 = (cid:98) 3(cid:112) n\n(cid:18) (cid:96)2 \u2212 m + 2\n\n(cid:96)2 + 1\n\n2(cid:99) and (cid:96)2 = (cid:98)m(cid:112) n\n(cid:19)m(cid:18)\n\n\u2200n \u2265 m2\n2\n\n, E(Rn) \u2264 m +\n\nm\n\n(cid:96)2 \u2212 (cid:96)1 \u2212 m + 2\n\n(cid:19)\n2(cid:99) satis\ufb01es:\n\nm + 1\n(cid:96)1 + 1\n\n.\n\n2 +\n\n+ 2\n\n1\nm\n\nIn particular,\n\nlim sup\nn\u2192+\u221e\n\nE(Rn)\u221a\nn\n\n\u2264\n\n\u221a\n\n2 +\n\n\u221a\n1\nm\n\n2\n\n.\n\nProof. Note that Let U1 = 1 if arm 1 is used until time n and U1 = 0 otherwise. Denote by M1 the\ntotal number of 0\u2019s received from arm 1. 
We have:\n\nE(Rn) \u2264 P(U1 = 0)(E(M1|U1 = 0) + E(Rn)) + P(U1 = 1)(m + nE(1 \u2212 \u03b81|U1 = 1)),\n\nso that:\n\nE(Rn) \u2264 E(M1|U1 = 0)/P(U1 = 1) + m + nE(1 \u2212 \u03b81|U1 = 1).  (1)\n\nLet Nt be the number of 0\u2019s received from arm 1 until time t when this arm is played until time t. Note that n \u2265 m\u00b2/2 implies n \u2265 \u21132. Since P(N_{\u21131} = 0|\u03b81 = u) = u^{\u21131}, the probability that the first target is achieved by arm 1 is given by:\n\nP(N_{\u21131} = 0) = \u222b\u2080\u00b9 u^{\u21131} du = 1/(\u21131 + 1).\n\nSimilarly,\n\nP(N_{\u21132\u2212\u21131} < m|\u03b81 = u) = \u2211_{j=0}^{m\u22121} binom(\u21132\u2212\u21131, j) u^{\u21132\u2212\u21131\u2212j}(1 \u2212 u)^j,\n\nso that the probability that arm 1 is used until time n is given by:\n\nP(U1 = 1) = \u222b\u2080\u00b9 P(N_{\u21131} = 0|\u03b81 = u) P(N_{\u21132\u2212\u21131} < m|\u03b81 = u) du = \u2211_{j=0}^{m\u22121} [(\u21132\u2212\u21131)!/(\u21132\u2212\u21131\u2212j)!] \u00d7 [(\u21132\u2212j)!/(\u21132+1)!].\n\nWe deduce:\n\n(m/(\u21132 + 1)) ((\u21132 \u2212 \u21131 \u2212 m + 2)/(\u21132 + 1))^m \u2264 P(U1 = 1) \u2264 m/(\u21132 \u2212 m + 2).  (2)\n\nMoreover,\n\nE(M1|U1 = 0) = 1 + (m \u2212 1)P(N_{\u21131} = 0|U1 = 0) \u2264 1 + (m \u2212 1) P(N_{\u21131} = 0)/P(U1 = 0) \u2264 1 + 2(m + 1)/(\u21131 + 1),\n\nwhere the last inequality follows from (2) and the fact that \u21132 \u2265 m\u00b2/2.\n\nIt remains to calculate E(1 \u2212 \u03b81|U1 = 1). Since:\n\nP(U1 = 1|\u03b81 = u) = \u2211_{j=0}^{m\u22121} binom(\u21132\u2212\u21131, j) u^{\u21132\u2212j}(1 \u2212 u)^j,\n\nwe deduce:\n\nE(1 \u2212 \u03b81|U1 = 1) = (1/P(U1 = 1)) \u2211_{j=0}^{m\u22121} binom(\u21132\u2212\u21131, j) \u222b\u2080\u00b9 u^{\u21132\u2212j}(1 \u2212 u)^{j+1} du = (1/P(U1 = 1)) \u2211_{j=0}^{m\u22121} [(\u21132\u2212\u21131)!/(\u21132\u2212\u21131\u2212j)!] \u00d7 [(\u21132\u2212j)!/(\u21132+2)!] (j + 1) \u2264 (1/P(U1 = 1)) \u00d7 m(m + 1)/(2(\u21132 + 1)(\u21132 + 2)).\n\nThe proof then follows from (1) and (2). \u25a1\n\n3.3 Lower bound\n\nThe following result shows that the two-target algorithm is asymptotically optimal (for large m).\n\nTheorem 1 For any algorithm with known time horizon n,\n\nlim inf_{n\u2192+\u221e} E(Rn)/\u221an \u2265 \u221a2.\n\nProof. We present the main ideas of the proof. The details are given in the appendix. Assume an oracle reveals the parameter of each arm after the first failure of this arm. With this information, the optimal policy explores a random number of arms, each until the first failure, then plays only one of these arms until time n. Let \u00b5 be the parameter of the best known arm at time t. Since the probability that any new arm is better than this arm is 1 \u2212 \u00b5, the mean cost of exploration to find a better arm is 1/(1 \u2212 \u00b5). The corresponding mean reward has a uniform distribution over [\u00b5, 1] so that the mean gain of exploitation is less than (n \u2212 t)(1 \u2212 \u00b5)/2 (it is not equal to this quantity due to the time spent in exploration). 
Thus if 1 \u2212 \u00b5 < \u221a(2/(n \u2212 t)), it is preferable not to explore new arms and to play the best known arm, with mean reward \u00b5, until time n. A fortiori, the best known arm is played until time n whenever its parameter is larger than 1 \u2212 \u221a(2/n). We denote by An the first arm whose parameter is larger than 1 \u2212 \u221a(2/n). We have Kn \u2264 An (the optimal policy cannot explore more than An arms) and\n\nE(An) = \u221a(n/2).\n\nThe parameter \u03b8_{An} of arm An is uniformly distributed over [1 \u2212 \u221a(2/n), 1], so that\n\nE(\u03b8_{An}) = 1 \u2212 \u221a(1/(2n)).\n\nFor all k = 1, 2, . . ., let L1(k) be the length of the first run of arm k. We have:\n\nE(L1(1) + . . . + L1(An \u2212 1)) = (E(An) \u2212 1) E(L1(1)|\u03b81 \u2264 1 \u2212 \u221a(2/n)) = (\u221a(n/2) \u2212 1) \u00d7 (\u2212ln \u221a(2/n))/(1 \u2212 \u221a(2/n)),  (3)\n\nusing the fact that:\n\nE(L1(1)|\u03b81 \u2264 1 \u2212 \u221a(2/n)) = (1/(1 \u2212 \u221a(2/n))) \u222b\u2080^{1\u2212\u221a(2/n)} du/(1 \u2212 u) = (\u2212ln \u221a(2/n))/(1 \u2212 \u221a(2/n)).  (4)\n\nIn particular,\n\nlim_{n\u2192+\u221e} (1/n) E(L1(1) + . . . + L1(An \u2212 1)) = 0 and lim_{n\u2192+\u221e} P(L1(1) + . . . + L1(An \u2212 1) \u2264 n^{4/5}) = 1.  (5)\n\nTo conclude, we write:\n\nE(Rn) \u2265 E(Kn) + E((n \u2212 L1(1) \u2212 . . . \u2212 L1(An \u2212 1))(1 \u2212 \u03b8_{An})).\n\nObserve that, on the event {L1(1) + . . . + L1(An \u2212 1) \u2264 n^{4/5}}, the number of explored arms satisfies Kn \u2265 A\u2032n, where A\u2032n denotes the first arm whose parameter is larger than 1 \u2212 \u221a(2/(n \u2212 n^{4/5})). Since P(L1(1) + . . . + L1(An \u2212 1) \u2264 n^{4/5}) \u2192 1 and E(A\u2032n) = \u221a((n \u2212 n^{4/5})/2), we deduce that:\n\nlim inf_{n\u2192+\u221e} E(Kn)/\u221an \u2265 1/\u221a2.\n\nBy the independence of \u03b8_{An} and L1(1), . . . , L1(An \u2212 1),\n\n(1/\u221an) E((n \u2212 L1(1) \u2212 . . . \u2212 L1(An \u2212 1))(1 \u2212 \u03b8_{An})) = (1/\u221an) (n \u2212 E(L1(1) + . . . + L1(An \u2212 1)))(1 \u2212 E(\u03b8_{An})),\n\nwhich tends to 1/\u221a2 in view of (3) and (5). The announced bound follows. \u25a1\n\n4 Unknown time horizon\n\n4.1 Anytime version of the algorithm\n\nWhen the time horizon is unknown, the targets depend on the current time t, say \u21131(t) and \u21132(t). Now any arm that is exploited may be eventually discarded, in the sense that a new arm is explored. This happens whenever either L1 < \u21131(t) or L2 < \u21132(t), where L1 and L2 are the respective lengths of the first run and the first m runs of this arm. Thus, unlike the previous version of the algorithm which consists in an exploration phase followed by an exploitation phase, the anytime version of the algorithm continuously switches between exploration and exploitation. We prove in Proposition 2 below that, for large m, the target values \u21131(t) = \u230a\u221bt\u230b and \u21132(t) = \u230am\u221at\u230b given in the pseudo-code achieve an asymptotic regret in 2\u221an.\n\nAlgorithm 2: Two-target algorithm with unknown time horizon.\nParameter: m\nFunction Explore: I \u2190 I + 1, L \u2190 0, M \u2190 0\nAlgorithm:\nI \u2190 0\nExplore\nExploit \u2190 false\nforall the t = 1, 2, . . . 
do\n  \u21131 \u2190 \u230a\u221bt\u230b, \u21132 \u2190 \u230am\u221at\u230b\n  Get reward X from arm I\n  if Exploit then\n    if L1 < \u21131 or L2 < \u21132 then\n      Explore\n      Exploit \u2190 false\n  else\n    if X = 1 then\n      L \u2190 L + 1\n    else\n      M \u2190 M + 1\n      if M = 1 then\n        if L < \u21131 then Explore\n        else L1 \u2190 L\n      else if M = m then\n        if L < \u21132 then Explore\n        else\n          L2 \u2190 L\n          Exploit \u2190 true\n\n4.2 Regret analysis\n\nProposition 2 The two-target algorithm with time-dependent targets \u21131(t) = \u230a\u221bt\u230b and \u21132(t) = \u230am\u221at\u230b satisfies:\n\nlim sup_{n\u2192+\u221e} E(Rn)/\u221an \u2264 2 + 1/m.\n\nProof. For all k = 1, 2, . . ., denote by L1(k) and L2(k) the respective lengths of the first run and of the first m runs of arm k when this arm is played continuously. Since arm k cannot be selected before time k, the regret at time n satisfies:\n\nRn \u2264 Kn + m \u2211_{k=1}^{Kn} 1{L1(k) > \u21131(k)} + \u2211_{t=1}^{n} (1 \u2212 Xt) 1{L2(It) > \u21132(t)}.\n\nFirst observe that, since the target functions \u21131(t) and \u21132(t) are non-decreasing, Kn is less than or equal to K\u2032n, the number of arms selected by a two-target policy with known time horizon n and fixed targets \u21131(n) and \u21132(n). In this scheme, let U\u20321 = 1 if arm 1 is used until time n and U\u20321 = 0 otherwise. 
It then follows from (2) that P(U\u20321 = 1) \u223c 1/\u221an and E(Kn) \u2264 E(K\u2032n) \u223c \u221an when n \u2192 +\u221e.\n\nNow,\n\nE(\u2211_{k=1}^{Kn} 1{L1(k) > \u21131(k)}) = \u2211_{k=1}^{\u221e} P(L1(k) > \u21131(k), Kn \u2265 k) = \u2211_{k=1}^{\u221e} P(L1(k) > \u21131(k)) P(Kn \u2265 k | L1(k) > \u21131(k)) \u2264 \u2211_{k=1}^{\u221e} P(L1(k) > \u21131(k)) P(Kn \u2265 k) \u2264 \u2211_{k=1}^{E(Kn)} P(L1(k) > \u21131(k)),\n\nwhere the first inequality follows from the fact that for any arm k and all u \u2208 [0, 1],\n\nP(\u03b8k \u2265 u | L1(k) > \u21131(k)) \u2265 P(\u03b8k \u2265 u) and P(Kn \u2265 k | \u03b8k \u2265 u) \u2264 P(Kn \u2265 k),\n\nand the second inequality follows from the fact that the random variables L1(1), L1(2), . . . are i.i.d. and the sequence \u21131(1), \u21131(2), . . . is non-decreasing. Since E(Kn) \u2264 E(K\u2032n) \u223c \u221an when n \u2192 +\u221e and P(L1(k) > \u21131(k)) \u223c 1/\u221bk when k \u2192 +\u221e, we deduce:\n\nlim_{n\u2192+\u221e} (1/\u221an) E(\u2211_{k=1}^{Kn} 1{L1(k) > \u21131(k)}) = 0.\n\nFinally,\n\nE((1 \u2212 Xt) 1{L2(It) > \u21132(t)}) \u2264 E(1 \u2212 Xt | L2(It) > \u21132(t)) \u223c ((m + 1)/m) \u00d7 1/(2\u221at) when t \u2192 +\u221e,\n\nso that\n\nlim sup_{n\u2192+\u221e} (1/\u221an) \u2211_{t=1}^{n} E((1 \u2212 Xt) 1{L2(It) > \u21132(t)}) \u2264 ((m + 1)/m) lim_{n\u2192+\u221e} (1/n) \u2211_{t=1}^{n} (1/2)\u221a(n/t) = ((m + 1)/m) \u222b\u2080\u00b9 du/(2\u221au) = (m + 1)/m.\n\nCombining the previous results yields:\n\nlim sup_{n\u2192+\u221e} E(Rn)/\u221an \u2264 2 + 1/m. \u25a1\n\n4.3 Lower bound\n\nWe believe that if E(Rn)/\u221an tends to some limit, then this limit is at 
least 2. To support this conjecture, consider an oracle that reveals the parameter of each arm after the first failure of this arm, as in the proof of Theorem 1. With this information, an optimal policy exploits an arm whenever its parameter is larger than some increasing function \u00af\u03b8t of time t. Assume that 1 \u2212 \u00af\u03b8t \u223c 1/(c\u221at) for some c > 0 when t \u2192 +\u221e. Then proceeding as in the proof of Theorem 1, we get:\n\nlim inf_{n\u2192+\u221e} E(Rn)/\u221an \u2265 c + lim_{n\u2192+\u221e} (1/n) \u2211_{t=1}^{n} (1/(2c))\u221a(n/t) = c + (1/c) \u222b\u2080\u00b9 du/(2\u221au) = c + 1/c \u2265 2.\n\n5 Numerical results\n\nFigure 1 gives the expected failure rate E(Rn)/n with respect to the time horizon n, which is supposed to be known. The results are derived from the simulation of 10\u2075 independent samples and shown with 95% confidence intervals. The mean rewards have (a) a uniform distribution or (b) a Beta(1,2) distribution, corresponding to the probability density function u \u21a6 2(1 \u2212 u). The single-target algorithm corresponds to the run policy of Berry et al. [2] with the asymptotically optimal target values \u221an and \u221b(2n), respectively. For the two-target algorithm, we take m = 3 and the target values given in Proposition 1 and Proposition 3 (in the appendix). The results are compared with the respective asymptotic lower bounds \u221a(2/n) and 3\u221b(3/n). 
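The known-horizon experiment can be sketched as follows for the uniform case; this is our own minimal re-implementation of Algorithm 1 with the targets of Proposition 1 and m = 3, not the authors' code:

```python
import math
import random

def two_target_regret(n, m=3, rng=None):
    """One sample of the regret R_n of the two-target algorithm with known
    horizon n, on the infinite-armed Bernoulli bandit with uniform means."""
    rng = rng or random.Random()
    l1 = int((n / 2) ** (1 / 3))    # first target:  floor of cube root of n/2
    l2 = int(m * math.sqrt(n / 2))  # second target: floor of m * sqrt(n/2)
    theta = rng.random()            # hidden mean reward of the current arm
    L = M = 0                       # total run length, failure count of current arm
    exploit = False
    successes = 0
    for _ in range(n):
        x = 1 if rng.random() < theta else 0
        successes += x
        if exploit:
            continue
        if x == 1:
            L += 1                  # success: the current run grows
        else:
            M += 1                  # failure: a run just ended
            if (M == 1 and L < l1) or (M == m and L < l2):
                theta, L, M = rng.random(), 0, 0  # target missed: new arm
            elif M == m:
                exploit = True      # both targets reached: exploit until n
    return n - successes

# Estimate the expected failure rate E(R_n)/n, as plotted in Figure 1(a).
n, samples = 10_000, 100
rate = sum(two_target_regret(n, rng=random.Random(s)) for s in range(samples)) / (samples * n)
```

For large n the estimated failure rate should approach the asymptotic lower bound of the order of \u221a(2/n), up to the finite-m factor appearing in Proposition 1.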
The performance gains of the two-target algorithm turn out to be negligible for the uniform distribution but substantial for the Beta(1,2) distribution, where \u201cgood\u201d arms are less frequent.\n\nFigure 1: Expected failure rate E(Rn)/n with respect to the time horizon n. (a) Uniform mean-reward distribution. (b) Beta(1,2) mean-reward distribution. Each panel shows the asymptotic lower bound, the single-target algorithm and the two-target algorithm.\n\n6 Conclusion\n\nThe proposed algorithm uses two levels of sampling in the exploration phase: the first eliminates \u201cbad\u201d arms while the second selects \u201cgood\u201d arms. To our knowledge, this is the first algorithm that achieves the optimal regrets in \u221a(2n) and 2\u221an for known and unknown horizon times, respectively. Future work will be devoted to the proof of the lower bound in the case of unknown horizon time. We also plan to study various extensions of the present work, including mean-reward distributions whose support does not contain 1 and distribution-free algorithms. Finally, we would like to compare the performance of our algorithm for finite-armed bandits with those of the best known algorithms like KL-UCB [10, 3] and Thompson sampling [14, 8] over short time horizons where the full exploration of the arms is generally not optimal.\n\nAcknowledgments\n\nThe authors acknowledge the support of the European Research Council, of the French ANR (GAP project), of the Swedish Research Council and of the Swedish SSF.\n\nReferences\n\n[1] Peter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235\u2013256, 2002.\n\n[2] Donald A. Berry, Robert W. 
Chen, Alan Zame, David C. Heath, and Larry A. Shepp. Bandit problems with infinitely many arms. Annals of Statistics, 25(5):2103\u20132116, 1997.\n\n[3] Olivier Capp\u00e9, Aur\u00e9lien Garivier, Odalric-Ambrym Maillard, R\u00e9mi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. To appear in Annals of Statistics, 2013.\n\n[4] Kung-Yu Chen and Chien-Tai Lin. A note on strategies for bandit problems with infinitely many arms. Metrika, 59(2):193\u2013203, 2004.\n\n[5] Kung-Yu Chen and Chien-Tai Lin. A note on infinite-armed Bernoulli bandit problems with generalized beta prior distributions. Statistical Papers, 46(1):129\u2013140, 2005.\n\n[6] Stephen J. Herschkorn, Erol Pek\u00f6z, and Sheldon M. Ross. Policies without memory for the infinite-armed Bernoulli bandit under the average-reward criterion. Probability in the Engineering and Informational Sciences, 10:21\u201328, 1996.\n\n[7] Ying-Chao Hung. Optimal Bayesian strategies for the infinite-armed Bernoulli bandit. Journal of Statistical Planning and Inference, 142(1):86\u201394, 2012.\n\n[8] Emilie Kaufmann, Nathaniel Korda, and R\u00e9mi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199\u2013213. Springer, 2012.\n\n[9] Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n\n[10] Tze Leung Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091\u20131114, 1987.\n\n[11] Chien-Tai Lin and C. J. Shiau. Some optimal strategies for bandit problems with beta prior distributions. Annals of the Institute of Statistical Mathematics, 52(2):397\u2013405, 2000.\n\n[12] C. L. Mallows and Herbert Robbins. Some problems of optimal sampling strategy. 
Journal of Mathematical Analysis and Applications, 8(1):90\u2013103, 1964.\n\n[13] Olivier Teytaud, Sylvain Gelly, and Mich\u00e8le Sebag. Anytime many-armed bandits. In CAP07, 2007.\n\n[14] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285\u2013294, 1933.\n\n[15] Yizao Wang, Jean-Yves Audibert, and R\u00e9mi Munos. Algorithms for infinitely many-armed bandits. In NIPS, 2008.\n", "award": [], "sourceid": 1069, "authors": [{"given_name": "Thomas", "family_name": "Bonald", "institution": "Telecom ParisTech"}, {"given_name": "Alexandre", "family_name": "Proutiere", "institution": "KTH"}]}