{"title": "Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence", "book": "Advances in Neural Information Processing Systems", "page_first": 3212, "page_last": 3220, "abstract": "We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms.", "full_text": "Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh
INRIA Lille - Nord Europe, Team SequeL

Abstract

We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity.
We also show how the UGapE algorithm, as well as its theoretical analysis, can be extended to take into account the variance of the arms and multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms.

1 Introduction

The problem of best arm(s) identification [6, 3, 1] in the stochastic multi-armed bandit setting has recently received much attention. In this problem, a forecaster repeatedly selects an arm and observes a sample drawn from its reward distribution during an exploration phase, and is then asked to return the best arm(s). Unlike the standard multi-armed bandit problem, where the goal is to maximize the cumulative sum of rewards obtained by the forecaster (see e.g., [15, 2]), here the forecaster is evaluated on the quality of the arm(s) returned at the end of the exploration phase. This abstract problem models a wide range of applications. For instance, consider a company that has K different variants of a product and needs to identify the best one(s) before actually placing it on the market. The company sets up a testing phase in which the products are tested by potential customers. Each customer tests one product at a time and gives it a score (a reward). The objective of the company is to return, at the end of the test phase, a product that is likely to be successful once placed on the market (i.e., best arm identification); it is not interested in the scores collected during the test phase (i.e., the cumulative reward).

The problem of best arm(s) identification has been studied in two distinct settings in the literature.

Fixed budget. In the fixed budget setting (see e.g., [3, 1]), the number of rounds of the exploration phase is fixed and known by the forecaster, and the objective is to maximize the probability of returning the best arm(s).
In the above example, the company fixes the length of the test phase beforehand (e.g., enrolls a fixed number of customers) and defines a strategy for choosing which products to show to the testers, so that the finally selected product is the best with the highest probability. Audibert et al. [1] proposed two different strategies to solve this problem. They defined a strategy based on upper confidence bounds, called UCB-E, whose optimal parameterization is strictly related to a measure of the complexity of the problem. They also introduced an elimination algorithm, called Successive Rejects, which divides the budget n into phases and discards one arm per phase. Both algorithms were shown to have nearly optimal probability of returning the best arm. Deng et al. [5] and Gabillon et al. [8] considered the extension of the best arm identification problem to the multi-bandit setting, where the objective is to return the best arm for each bandit. Recently, Bubeck et al. [4] extended the previous results to the problem of m-best arm identification and introduced a new version of the Successive Rejects algorithm (with accept and reject) that is able to return the set of the m best arms with high probability.

Fixed confidence. In the fixed confidence setting (see e.g., [12, 6]), the forecaster tries to minimize the number of rounds needed to achieve a fixed confidence about the quality of the returned arm(s). In the above example, the company keeps enrolling customers in the test until it is, e.g., 95% confident that the best product has been identified. Maron & Moore [12] considered a slightly different setting where, besides a fixed confidence, the maximum number of rounds is also fixed. They designed an elimination algorithm, called Hoeffding Races, based on progressively discarding the arms that are suboptimal with enough confidence. Mnih et al. [14] introduced an improved algorithm, built on the Bernstein concentration inequality, which takes into account the empirical variance of each arm. Even-Dar et al. [6] studied the fixed confidence setting without any budget constraint and designed an elimination algorithm able to return an arm with a required accuracy ε (i.e., whose performance is at least ε-close to the optimal arm). Kalyanakrishnan & Stone [10] further extended this approach to the case where the m best arms must be returned with a given confidence. Finally, Kalyanakrishnan et al. [11] recently introduced an algorithm for m-best arm identification along with a thorough theoretical analysis showing the number of rounds needed to achieve the desired confidence.

Although the fixed budget and fixed confidence problems have been studied separately, they display several similarities. In this paper, we propose a unified approach to these two settings in the general case of m-best arm identification with accuracy ε.¹ The main contributions of the paper can be summarized as follows:

Algorithm. In Section 3, we propose a novel meta-algorithm, called unified gap-based exploration (UGapE), which uses the same arm selection and (arm) return strategies for the two settings. This algorithm allows us to solve settings that have not been covered in previous work (e.g., the case of ε ≠ 0 has not been studied in the fixed budget setting). Furthermore, we show in Appendix C of [7] that UGapE outperforms existing algorithms in some settings (e.g., it improves the performance of the algorithm by Mnih et al. [14] in the fixed confidence setting). We also provide a thorough empirical evaluation of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms in Appendix C of [7].

Theoretical analysis.
Similar to the algorithmic contribution, in Section 4 we show that a large portion of the theoretical analysis required to study the behavior of the two settings of the UGapE algorithm can be unified in a series of lemmas. The final theoretical guarantees are thus a direct consequence of these lemmas when used in the two specific settings.

Problem complexity. In Section 4.4, we show that the theoretical analysis indicates that the two problems share exactly the same definition of complexity. In particular, we show that the probability of success in the fixed budget setting, as well as the sample complexity in the fixed confidence setting, strictly depends on the inverse of the gaps of the arms and on the desired accuracy ε.

Extensions. Finally, in Appendix B of [7], we discuss how the proposed algorithm and analysis can be extended to improved definitions of the confidence interval (e.g., Bernstein-based bounds) and to more complex settings, such as the multi-bandit best arm identification problem introduced in [8].

2 Problem Formulation

In this section, we introduce the notation used throughout the paper. Let A = {1, . . . , K} be the set of arms, where each arm k ∈ A is characterized by a distribution ν_k bounded in [0, b] with mean μ_k and variance σ²_k. We define the m-max and m-argmax operators as²

μ_(m) = max^(m)_{k∈A} μ_k   and   (m) = argmax^(m)_{k∈A} μ_k ,

where (m) denotes the index of the m-th best arm in A and μ_(m) is its corresponding mean, so that μ_(1) ≥ μ_(2) ≥ . . . ≥ μ_(K). We denote by S^m ⊂ A any subset of m arms (i.e., |S^m| = m < K) and by S^{m,*} the subset of the m best arms (i.e., k ∈ S^{m,*} iff μ_k ≥ μ_(m)). Without loss of generality, we assume there exists a unique set S^{m,*}. In the following, we drop the superscript m and use S = S^m and S* = S^{m,*} whenever m is clear from the context. With a slight abuse of notation, we further extend the m-max operator to an operator returning a set of arms, such that

{μ_(1), . . . , μ_(m)} = max^(1..m)_{k∈A} μ_k   and   S* = argmax^(1..m)_{k∈A} μ_k .

For each arm k ∈ A, we define the gap Δ_k as

Δ_k = μ_k − μ_(m+1)   if k ∈ S* ,
Δ_k = μ_(m) − μ_k     if k ∉ S* .

This definition of gap indicates that if k ∈ S*, Δ_k represents the "advantage" of arm k over the suboptimal arms, and if k ∉ S*, Δ_k denotes how suboptimal arm k is. Note that we can also write the gap as Δ_k = |max^(m)_{i≠k} μ_i − μ_k|. Given an accuracy ε and a number of arms m, we say that an arm k is (ε,m)-optimal if μ_k ≥ μ_(m) − ε. Thus, we define the (ε,m)-best arm identification problem as the problem of finding a set S of m (ε,m)-optimal arms.

The (ε,m)-best arm identification problem can be formalized as a game between a stochastic bandit environment and a forecaster. The distributions {ν_k} are unknown to the forecaster. At each round t, the forecaster pulls an arm I(t) ∈ A and observes an independent sample drawn from the distribution ν_I(t). The forecaster estimates the expected value of each arm by computing the average of the samples observed over time. Let T_k(t) be the number of times that arm k has been pulled by the end of round t; then the mean of this arm is estimated as μ̂_k(t) = (1/T_k(t)) Σ_{s=1}^{T_k(t)} X_k(s), where X_k(s) is the s-th sample observed from ν_k.

¹Note that when ε = 0 and m = 1, this reduces to the standard best arm identification problem.
²Ties are broken in an arbitrary but consistent manner.
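As a concrete illustration of these definitions, the gaps Δ_k and the set of (ε,m)-optimal arms can be computed directly from a vector of (in practice unknown) means. The sketch below is illustrative only and not part of the paper; the means are hypothetical values, and ties are ignored for brevity.

```python
# Illustrative sketch (not from the paper): the gaps Delta_k and the
# (eps, m)-optimal arms, computed from a vector of true means.

def gaps(mu, m):
    """Delta_k = mu_k - mu_(m+1) if k is among the m best, else mu_(m) - mu_k."""
    order = sorted(mu, reverse=True)
    mu_m, mu_m1 = order[m - 1], order[m]                 # mu_(m) and mu_(m+1)
    best = set(sorted(range(len(mu)), key=lambda k: -mu[k])[:m])   # S*
    return [mu[k] - mu_m1 if k in best else mu_m - mu[k] for k in range(len(mu))]

def eps_m_optimal(mu, m, eps):
    """Arms k with mu_k >= mu_(m) - eps."""
    mu_m = sorted(mu, reverse=True)[m - 1]
    return [k for k in range(len(mu)) if mu[k] >= mu_m - eps]

mu = [0.9, 0.8, 0.75, 0.4]               # hypothetical means, K = 4
deltas = gaps(mu, m=2)                   # advantage of the 2 best, suboptimality of the rest
good = eps_m_optimal(mu, m=2, eps=0.1)   # arm 2 qualifies although it is not in S*
```

Note that with ε = 0.1 the third arm (μ = 0.75) is (ε,2)-optimal even though it does not belong to S*; this relaxation is exactly what will lower the complexity H_ε later on.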
For any arm k ∈ A, we define the notion of arm simple regret as

r_k = μ_(m) − μ_k ,   (1)

and for any set S ⊂ A of m arms, we define the simple regret as

r_S = max_{k∈S} r_k = μ_(m) − min_{k∈S} μ_k .   (2)

We denote by Ω(t) ⊂ A the set of m arms returned by the forecaster at the end of the exploration phase (when the algorithm stops after t rounds), and by r_Ω(t) its corresponding simple regret. Returning m (ε,m)-optimal arms is then equivalent to having r_Ω(t) smaller than ε. Given an accuracy ε and a number of arms m to return, we now formalize the two settings of fixed budget and fixed confidence.

Fixed budget. The objective is to design a forecaster capable of returning a set of m (ε,m)-optimal arms with the largest possible confidence using a fixed budget of n rounds. More formally, given a budget n, the performance of the forecaster is measured by the probability δ̃ of not meeting the (ε,m) requirement, i.e., δ̃ = P[r_Ω(n) ≥ ε]; the smaller δ̃, the better the algorithm.

Fixed confidence. The goal is to design a forecaster that stops as soon as possible and returns a set of m (ε,m)-optimal arms with a fixed confidence. We denote by ñ the time when the algorithm stops and by Ω(ñ) its set of returned arms. Given a confidence level δ, the forecaster has to guarantee that P[r_Ω(ñ) ≥ ε] ≤ δ. The performance of the forecaster is then measured by the number of rounds ñ, either in expectation or in high probability.

Although these settings have been considered as two distinct problems, in Section 3 we introduce a unified arm selection strategy that can be used in both cases by simply changing the stopping criterion. Moreover, we show in Section 4 that the bounds on the performance of the algorithm in the two settings share the same notion of complexity and can be derived using very similar arguments.

3 Unified Gap-based Exploration Algorithm

In this section, we describe the unified gap-based exploration (UGapE) meta-algorithm and show how it is implemented in the fixed-budget and fixed-confidence settings. As shown in Figure 1, both the fixed-budget (UGapEb) and fixed-confidence (UGapEc) instances of UGapE use the same arm-selection strategy, SELECT-ARM (described in Figure 2), and, upon stopping, return the m best arms in the same manner (using Ω). The two algorithms only differ in their stopping criteria. More precisely, both algorithms receive as input the definition of the problem (ε, m), a constraint (the budget n in UGapEb and the confidence level δ in UGapEc), and a parameter (a or c). While UGapEb runs for n rounds and then returns the set of arms Ω(n), UGapEc runs until it achieves the desired accuracy ε with the requested confidence level δ. This difference is due to the two different objectives targeted by the algorithms: while UGapEc optimizes its budget for a given confidence level, UGapEb's goal is to optimize the quality of its recommendation for a fixed budget.

UGapEb (ε, m, n, a)
  Parameters: accuracy ε, number of arms m, budget n, exploration parameter a
  Initialize: Pull each arm k once, update μ̂_k(K), and set T_k(K) = 1
  for t = K + 1, . . . , n do
    SELECT-ARM (t)
  end for
  Return Ω(n) = arg min_{J(t)} B_{J(t)}(t)

UGapEc (ε, m, δ, c)
  Parameters: accuracy ε, number of arms m, confidence level δ, exploration parameter c
  Initialize: Pull each arm k once, update μ̂_k(K), set T_k(K) = 1 and t ← K + 1
  while B_{J(t)}(t) ≥ ε do
    SELECT-ARM (t)
    t ← t + 1
  end while
  Return Ω(t) = J(t)

Figure 1: The pseudo-code for the UGapE algorithm in the fixed-budget (UGapEb) (left) and fixed-confidence (UGapEc) (right) settings.

SELECT-ARM (t)
  Compute B_k(t) for each arm k ∈ A
  Identify the set of m arms J(t) ∈ arg min^(1..m)_{k∈A} B_k(t)
  Pull the arm I(t) = arg max_{k∈{l_t, u_t}} β_k(t − 1)
  Observe X_{I(t)}(T_{I(t)}(t − 1) + 1) ∼ ν_{I(t)}
  Update μ̂_{I(t)}(t) and T_{I(t)}(t)

Figure 2: The pseudo-code for UGapE's arm-selection strategy. This routine is used in both the UGapEb and UGapEc instances of UGapE.

Regardless of the final objective, how to select an arm at each round (the arm-selection strategy) is the key component of any multi-armed bandit algorithm. One of the most important features of UGapE is that it has a unique arm-selection strategy for the fixed-budget and fixed-confidence settings. We now describe UGapE's arm-selection strategy, whose pseudo-code is reported in Figure 2. At each time step t, UGapE first uses the observations up to time t − 1 and computes an index B_k(t) = max^(m)_{i≠k} U_i(t) − L_k(t) for each arm k ∈ A, where

U_k(t) = μ̂_k(t − 1) + β_k(t − 1) ,   L_k(t) = μ̂_k(t − 1) − β_k(t − 1) ,   ∀t, ∀k ∈ A.   (3)

In Eq. 3, β_k(t − 1) is a confidence interval,³ and U_k(t) and L_k(t) are high-probability upper and lower bounds on the mean μ_k of arm k after t − 1 rounds. Note that the parameters a and c are used in the definition of the confidence interval β_k, whose shape strictly depends on the concentration bound used by the algorithm. For example, we can derive β_k from the Chernoff-Hoeffding bound as
UGapEb:  β_k(t − 1) = b √( a / T_k(t − 1) ) ,    UGapEc:  β_k(t − 1) = b √( c log( 4K(t − 1)³ / δ ) / T_k(t − 1) ) .   (4)

In Section 4, we discuss how the parameters a and c can be tuned, and we show that while a should be tuned as a function of n and ε in UGapEb, c = 1/2 is always a good choice for UGapEc. Defining the confidence interval in the general form β_k(t − 1) allows us to easily extend the algorithm by taking into account different (higher) moments of the arms (see Appendix B of [7] for the case of variance, where β_k(t − 1) is obtained from the Bernstein inequality). From Eq. 3, we may see that the index B_k(t) is an upper bound on the simple regret r_k of the k-th arm (see Eq. 1). We also define an index for a set S as B_S(t) = max_{i∈S} B_i(t). Similar to the arm index, B_S is defined so as to upper-bound the simple regret r_S with high probability (see Lemma 1).

After computing the arm indices, UGapE finds a set of m arms J(t) with minimum upper bound on their simple regrets, i.e., J(t) = arg min^(1..m)_{k∈A} B_k(t). From J(t), it computes two arm indices u_t = arg max_{j∉J(t)} U_j(t) and l_t = arg min_{i∈J(t)} L_i(t), where in both cases the tie is broken in favor of the arm with the largest uncertainty β_k(t − 1).

³To be more precise, β_k(t − 1) is the width of a confidence interval, or a confidence radius.
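To make Eqs. 3-4 concrete, the following sketch (illustrative, not the authors' code) computes the radii β_k and the indices B_k(t) from empirical means and pull counts; b, a, c, and δ are the quantities named in the text, and the numbers in the example call are hypothetical.

```python
import math

def beta_budget(T_k, a, b=1.0):
    """UGapEb radius (Eq. 4): b * sqrt(a / T_k(t-1))."""
    return b * math.sqrt(a / T_k)

def beta_confidence(T_k, t, K, delta, c=0.5, b=1.0):
    """UGapEc radius (Eq. 4): b * sqrt(c * log(4 K (t-1)^3 / delta) / T_k(t-1))."""
    return b * math.sqrt(c * math.log(4 * K * (t - 1) ** 3 / delta) / T_k)

def b_indices(mu_hat, T, beta, m=1):
    """Eq. 3: B_k(t) = (m-th largest U_i(t) over i != k) - L_k(t)."""
    K = len(mu_hat)
    U = [mu_hat[k] + beta(T[k]) for k in range(K)]
    L = [mu_hat[k] - beta(T[k]) for k in range(K)]
    B = []
    for k in range(K):
        others = sorted((U[i] for i in range(K) if i != k), reverse=True)
        B.append(others[m - 1] - L[k])
    return B

# Hypothetical snapshot after a few pulls: the better an arm looks,
# the smaller its B-index (an upper bound on its simple regret).
B = b_indices([0.6, 0.5, 0.2], [10, 10, 10],
              lambda T_k: beta_budget(T_k, a=0.25))
```

With equal pull counts the radii coincide, so the index ordering follows the empirical means: the empirically best arm has the smallest index and would be placed in J(t).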
Arms l_t and u_t are the worst possible arm among those in J(t) and the best possible arm left outside J(t), respectively; together they represent how bad the choice of J(t) could be. Finally, the algorithm selects and pulls the arm I(t) with the larger β_k(t − 1) among u_t and l_t, observes a sample X_{I(t)}(T_{I(t)}(t − 1) + 1) from the distribution ν_{I(t)}, and updates the empirical mean μ̂_{I(t)}(t) and the number of pulls T_{I(t)}(t) of the selected arm I(t).

There are two more points that need to be discussed about the UGapE algorithm. 1) While UGapEc defines the set of returned arms as Ω(t) = J(t), UGapEb returns the set of arms J(t) with the smallest index, i.e., Ω(n) = arg min_{J(t)} B_{J(t)}(t), t ∈ {1, . . . , n}. 2) UGapEc stops (we refer to the number of rounds before stopping as ñ) when B_{J(ñ+1)}(ñ + 1) is less than the given accuracy ε, i.e., when even the m-th worst upper bound on the arm simple regret among all the arms in the selected set J(ñ + 1) is smaller than ε. This guarantees that the simple regret (see Eq. 2) of the set returned by the algorithm, Ω(ñ) = J(ñ + 1), is smaller than ε with probability larger than 1 − δ.

4 Theoretical Analysis

In this section, we provide high-probability upper bounds on the performance of the two instances of the UGapE algorithm, UGapEb and UGapEc, introduced in Section 3. An important feature of UGapE is that, since its fixed-budget and fixed-confidence versions share the same arm-selection strategy, a large part of their theoretical analysis can be unified.
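The selection and stopping rules just described can be walked through in a minimal sketch of the UGapEc loop for m = 1 (illustrative, not the authors' implementation). To keep the example deterministic and verifiable, each pull returns the arm's mean exactly; in actual use the sample would be an independent draw from ν_k. Means, ε, and δ below are hypothetical.

```python
import math

# Noiseless walk-through of the UGapEc loop for m = 1 (a sketch under
# simplifying assumptions): arms bounded in [0, 1] (b = 1), Hoeffding-style
# radius as in Eq. 4 with c = 1/2, and each pull of arm k returning mu_k exactly.

def ugapec(sample, K, eps, delta, c=0.5, max_steps=100_000):
    T = [1] * K
    mu = [sample(k) for k in range(K)]            # one initialization pull per arm
    t = K + 1
    while t <= max_steps:
        beta = [math.sqrt(c * math.log(4 * K * (t - 1) ** 3 / delta) / T[k])
                for k in range(K)]
        U = [mu[k] + beta[k] for k in range(K)]
        L = [mu[k] - beta[k] for k in range(K)]
        B = [max(U[i] for i in range(K) if i != k) - L[k] for k in range(K)]
        J = min(range(K), key=B.__getitem__)      # J(t): arm with the smallest index
        if B[J] < eps:                            # UGapEc stopping rule
            return J, t
        u = max((k for k in range(K) if k != J), key=U.__getitem__)   # u_t
        l = J                                     # for m = 1, l_t is the arm in J(t)
        k = u if beta[u] >= beta[l] else l        # pull the more uncertain of the two
        x = sample(k)
        mu[k] = (mu[k] * T[k] + x) / (T[k] + 1)
        T[k] += 1
        t += 1
    return J, t

means = [0.9, 0.1, 0.1]                           # hypothetical arm means
best, n = ugapec(lambda k: means[k], K=3, eps=0.3, delta=0.1)
```

Because the stopping test B_{J}(t) < ε can only be met when J(t) contains the (ε,1)-optimal arm, the sketch returns the best arm once the radii have shrunk enough; with noisy samples the same guarantee holds with probability at least 1 − δ.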
We \ufb01rst report this uni\ufb01ed part of\nthe proof in Section 4.1, and then provide the \ufb01nal performance bound for each of the algorithms,\nUGapEb and UGapEc, separately, in Sections 4.2 and 4.3, respectively.\nBefore moving to the main results, we de\ufb01ne additional notation used in the analysis. We \ufb01rst de\ufb01ne\nevent E as\n(5)\nwhere the values of T and \u03b2k are de\ufb01ned for each speci\ufb01c setting separately. Note that event E plays\nan important role in the sequel, since it allows us to \ufb01rst derive a series of results which are directly\nimplied by the event E and to postpone the study of the stochastic nature of the problem (i.e., the\nprobability of E) in the two speci\ufb01c settings. In particular, when E holds, we have that for any arm\nk \u2208 A and at any time t, Lk(t) \u2264 \u00b5k \u2264 Uk(t). Finally, we de\ufb01ne the complexity of the problem as\n\nE =(cid:8)\u2200k \u2208 A, \u2200t \u2208 {1, . . . , T}, (cid:12)(cid:12)(cid:98)\u00b5k(t) \u2212 \u00b5k\n\n(cid:12)(cid:12) < \u03b2k(t)(cid:9),\n\nK(cid:88)\n\nb2\n\nH\u0001 =\n\nmax( \u2206i+\u0001\n\n2\n\n, \u0001)2\n\ni=1\n\n.\n\n(6)\n\nNote that although the complexity has an explicit dependence on \u0001, it also depends on the number of\narms m through the de\ufb01nition of the gaps \u2206i, thus making it a complexity measure of the (\u0001, m) best\narm identi\ufb01cation problem. In Section 4.4, we will discuss why the complexity of the two instances\nof the problem is measured by this quantity.\n\n4.1 Analysis of the Arm-Selection Strategy\nHere we report lower (Lemma 1) and upper (Lemma 2) bounds for indices BS on the event E, which\nshow their connection with the regret and gaps. The technical lemmas used in the proofs (Lemmas 3\nand 4 and Corollary 1) are reported in Appendix A of [7]. We \ufb01rst prove that for any set S (cid:54)= S\u2217\nand any time t \u2208 {1, . . . 
, T}, the index BS(t) is an upper-bound on the simple regret of this set rS.\nLemma 1. On event E, for any set S (cid:54)= S\u2217 and any time t \u2208 {1, . . . , T}, we have BS(t) \u2265 rS.\nProof. On event E, for any arm i /\u2208 S\u2217 and each time t \u2208 {1, . . . , T}, we may write\n\n(cid:0)(cid:98)\u00b5j(t \u2212 1) + \u03b2j(t \u2212 1)(cid:1) \u2212(cid:0)(cid:98)\u00b5i(t \u2212 1) \u2212 \u03b2i(t \u2212 1)(cid:1)\n\nBi(t) =\n\nm\nmax\nj(cid:54)=i\n\u2265 m\nmax\nj(cid:54)=i\n\nUj(t) \u2212 Li(t) =\n\u00b5j \u2212 \u00b5i = \u00b5(m) \u2212 \u00b5i = ri .\n\nm\nmax\nj(cid:54)=i\n\n(7)\n\nUsing Eq. 7, we have\n\nBi(t) \u2265 max\ni\u2208(S\u2212S\u2217)\nwhere the last passage follows from the fact that ri \u2264 0 for any i \u2208 S\u2217.\n\nBi(t) \u2265 max\ni\u2208(S\u2212S\u2217)\n\nBS(t) = max\ni\u2208S\n\nri = rS,\n\n5\n\n\fLemma 2. On event E, if arm k \u2208 {lt, ut} is pulled at time t \u2208 {1, . . . , T}, we have\n\nBJ(t)(t) \u2264 min(cid:0)0,\u2212\u2206k + 2\u03b2k(t \u2212 1)(cid:1) + 2\u03b2k(t \u2212 1).\nB(t) \u2264 min(cid:0)0,\u2212\u2206k + 2\u03b2k(t \u2212 1)(cid:1) + 2\u03b2k(t \u2212 1).\n\nProof. We \ufb01rst prove the statement for B(t) = Uut(t) \u2212 Llt(t), i.e.,\n\n(8)\n\n(9)\n\nWe consider the following cases:\nCase 1. k = ut:\nCase 1.1. ut \u2208 S\u2217: Since by de\ufb01nition ut /\u2208 J(t), there exists an arm j /\u2208 S\u2217 such that j \u2208 J(t).\nNow we may write\n\n(c)\u2265 Lut(t) =(cid:98)\u00b5k(t \u2212 1) \u2212 \u03b2k(t \u2212 1)\n\n\u00b5(m+1) \u2265 \u00b5j\n\n(a)\u2265 Lj(t)\n\n(b)\u2265 Llt(t)\n\n(10)\n(a) and (d) hold because of event E, (b) follows from the fact that j \u2208 J(t) and from the de\ufb01nition\nof lt, and (c) is the result of Lemma 4. From Eq. 10, we may deduce that \u2212\u2206k + 2\u03b2k(t \u2212 1) \u2265 0,\nwhich together with Corollary 1 gives us the desired result (Eq. 9).\nCase 1.2. ut /\u2208 S\u2217:\nCase 1.2.1. 
lt \u2208 S\u2217: In this case, we may write\n\n(d)\u2265 \u00b5k \u2212 2\u03b2k(t \u2212 1)\n\nB(t) = Uut(t) \u2212 Llt(t)\n\n(a)\u2264 \u00b5ut + 2\u03b2ut(t \u2212 1) \u2212 \u00b5lt + 2\u03b2lt(t \u2212 1)\n\n(b)\u2264 \u00b5ut + 2\u03b2ut(t \u2212 1) \u2212 \u00b5(m) + 2\u03b2lt(t \u2212 1)\n\n(c)\u2264 \u2212\u2206ut + 4\u03b2ut(t \u2212 1)\n\n(11)\n(a) holds because of event E, (b) is from the fact that lt \u2208 S\u2217, and (c) is because ut is pulled, and\nthus, \u03b2ut(t \u2212 1) \u2265 \u03b2lt(t \u2212 1). The \ufb01nal result follows from Eq. 11 and Corollary 1.\nCase 1.2.2.\narm j \u2208 S\u2217 such that j /\u2208 J(t). Now we may write\n(a)\u2265 Uut(t)\n\nlt /\u2208 S\u2217: Since lt /\u2208 S\u2217 and the fact that by de\ufb01nition lt \u2208 J(t), there exists an\n\n(12)\n(a) and (c) hold because of event E, (b) is from the de\ufb01nition of ut and the fact that j /\u2208 J(t), and\n(d) holds because j \u2208 S\u2217. From Eq. 12, we may deduce that \u2212\u2206ut + 2\u03b2ut(t \u2212 1) \u2265 0, which\ntogether with Corollary 1 gives us the \ufb01nal result (Eq. 9).\n\n\u00b5ut + 2\u03b2ut(t \u2212 1)\n\n(b)\u2265 Uj(t)\n\n(c)\u2265 \u00b5j\n\n(d)\u2265 \u00b5(m)\n\nWith similar arguments and cases, we prove the result of Eq. 9 for k = lt. The \ufb01nal state-\nment of the lemma (Eq. 8) follows directly from BJ(t)(t) \u2265 B(t) as shown in Lemma 3.\n\nUsing Lemmas 1 and 2, we de\ufb01ne an upper and a lower bounds on BJ(t) in terms of quantities\nrelated to the regret of J(t). Lemma 1 con\ufb01rms the intuition that the B-values upper-bound the\nregret of the corresponding set of arms (with high probability). Unfortunately, this is not enough\nto claim that selecting J(t) as the set of arms with smallest B-values actually correspond to arms\nwith small regret, since BJ(t) could be an arbitrary loose bound on the regret. 
Lemma 2 provides this complementary guarantee specifically for the set J(t), in the form of an upper bound on B_{J(t)} in terms of the gap of k ∈ {u_t, l_t}. This implies that as the algorithm runs, the choice of J(t) becomes more and more accurate, since B_{J(t)} is squeezed between r_{J(t)} and a quantity (Eq. 8) that keeps shrinking; selecting the arms with the smallest B-values, i.e., the set J(t), thus eventually corresponds to selecting the arms that actually have the smallest regret, i.e., the arms in S*. This argument is implicitly at the basis of the proofs of the two following theorems.

4.2 Regret Bound for the Fixed-Budget Setting

Here we prove an upper bound on the simple regret of UGapEb. Since the setting considered by the algorithm is fixed budget, we may set T = n. From the definition of the confidence interval β_i(t) in Eq. 4 and a union bound, we have that P(E) ≥ 1 − 2Kn exp(−2a).⁴ We now have all the tools needed to prove the performance of UGapEb for the (ε,m)-best arm identification problem.

Theorem 1. If we run UGapEb with parameter 0 < a ≤ (n − K)/(4H_ε), its simple regret r_Ω(n) satisfies

δ̃ = P( r_Ω(n) ≥ ε ) ≤ 2Kn exp(−2a) ,

and in particular this probability is minimized for a = (n − K)/(4H_ε).

Proof. The proof is by contradiction. We assume that r_Ω(n) > ε on event E and consider the following two steps:

Step 1: We show that on event E, the number of pulls of any arm i ∈ A is upper-bounded as

T_i(n) < 4ab² / max( (Δ_i + ε)/2 , ε )² + 1 .   (13)

Let t_i be the last time that arm i is pulled. If arm i has been pulled only during the initialization phase, T_i(n) = 1 and Eq. 13 trivially holds. If i has been selected by SELECT-ARM, then we have

min( −Δ_i + 2β_i(t_i−1), 0 ) + 2β_i(t_i−1) ≥(a) B(t_i) ≥(b) B_{J(t_i)}(t_i) ≥(c) B_Ω(n)(t_ℓ) >(d) ε ,   (14)

where t_ℓ ∈ {1, . . . , n} is the time such that Ω(n) = J(t_ℓ). Here (a) and (b) are the results of Lemmas 2 and 3, (c) holds by the definition of Ω(n), and (d) holds because, using Lemma 1, we know that if the algorithm suffers a simple regret r_Ω(n) > ε (as assumed at the beginning of the proof), then B_Ω(n)(t) > ε for all t = 1, . . . , n + 1. By the definition of t_i, we know that T_i(n) = T_i(t_i − 1) + 1. Using this fact, the definition of β_i(t_i − 1), and Eq. 14, it is straightforward to show that Eq. 13 holds.

Step 2: We know that Σ_{i=1}^{K} T_i(n) = n. Using Eq. 13, on event E we have

Σ_{i=1}^{K} 4ab² / max( (Δ_i + ε)/2 , ε )² + K > n .

It is easy to see that by selecting a ≤ (n − K)/(4H_ε), the left-hand side of this inequality will be smaller than or equal to n, which is a contradiction. Thus, we conclude that r_Ω(n) ≤ ε on event E. The final result follows from the probability of event E defined at the beginning of this section.

⁴The extension to a confidence interval that takes into account the variance of the arms is discussed in Appendix B of [7].

4.3 Regret Bound for the Fixed-Confidence Setting

Here we prove an upper bound on the simple regret of UGapEc. Since the setting considered by the algorithm is fixed confidence, we may set T = +∞. From the definition of the confidence interval β_i(t) in Eq. 4 and a union bound on T_k(t) ∈ {0, . . . , t}, t = 1, . . . , ∞, we have that P(E) ≥ 1 − δ.

Theorem 2.
The UGapEc algorithm stops after(cid:101)n rounds and returns a set of m arms, \u2126((cid:101)n), that\n\nP(cid:0)r\u2126((cid:101)n+1) \u2264 \u0001 \u2227(cid:101)n \u2264 N(cid:1) \u2265 1 \u2212 \u03b4,\n\nwhere N = K + O(H\u0001 log H\u0001\n\n\u03b4 ) and c has been set to its optimal value 1/2.\n\nProof. We \ufb01rst prove the bound on the simple regret of UGapEc. Using Lemma 1, we have that on\n\nevent E, the simple regret of UGapEc upon stopping satis\ufb01es BJ((cid:101)n+1)((cid:101)n + 1) = B\u2126((cid:101)n+1)((cid:101)n + 1) \u2265\nr\u2126((cid:101)n+1). As a result, on event E, the regret of UGapEc cannot be bigger than \u0001, because then it\ncontradicts the stopping condition of the algorithm, i.e., BJ((cid:101)n+1)((cid:101)n + 1) < \u0001. Therefore, we have\nP(cid:0)r\u2126((cid:101)n+1) \u2264 \u0001(cid:1) \u2265 1 \u2212 \u03b4. Now we prove the bound for the sample complexity. Similar to the proof\n\nof Theorem 1, we consider the following two steps:\nStep 1: Here we show that on event E, we have the following upper-bound on the number of pulls\nof any arm i \u2208 A:\n\nTi((cid:101)n) \u2264 2b2 log(4K((cid:101)n \u2212 1)3/\u03b4)\n, \u0001(cid:1)2\n\nmax(cid:0) \u2206i+\u0001\n\n+ 1.\n\nphase, Ti((cid:101)n) = 1 and Eq. 15 trivially holds. If i has been selected by SELECT-ARM, then we have\nLet ti be the last time that arm i is pulled. If arm i has been pulled only during the initialization\nBJ(ti)(ti) \u2265 \u0001. Now using Lemma 2, we may write\n\nBJ(ti)(ti) \u2264 min(cid:0)0,\u2212\u2206i + 2\u03b2i(ti \u2212 1)(cid:1) + 2\u03b2i(ti \u2212 1).\n\n(16)\nWe can prove Eq. 15 by plugging in the value of \u03b2i(ti \u2212 1) from Eq. 4 and solving Eq. 16 for Ti(ti)\ntaking into account that Ti(ti \u2212 1) + 1 = Ti(ti).\n\n2\n\n(15)\n\n7\n\n\fStep 2: We know that(cid:80)K\n1)3/\u03b4(cid:1) + K \u2265(cid:101)n. Solving this inequality gives us(cid:101)n \u2264 N.\n\ni=1 Ti((cid:101)n) = (cid:101)n. Using Eq. 
15, on event E, we have 2H_ε log(4K(ñ − 1)³/δ) + K ≥ ñ. Solving this inequality gives us ñ ≤ N.

4.4 Problem Complexity

Theorems 1 and 2 indicate that both the probability of success and the sample complexity of UGapE are directly related to the complexity H_ε defined by Eq. 6. This implies that H_ε captures the intrinsic difficulty of the (ε, m)-best arm(s) identification problem independently of the specific setting considered. Furthermore, note that this definition generalizes existing notions of complexity. For example, for ε = 0 and m = 1 we recover the complexity used in the definition of UCB-E [1] for the fixed budget setting and the one defined in [6] for the fixed accuracy problem. Let us analyze H_ε in the general case of ε > 0. We define the complexity of a single arm i ∈ A as H_{ε,i} = b² / max((Δ_i + ε)/2, ε)². When the gap Δ_i is smaller than the desired accuracy ε, i.e., Δ_i ≤ ε, the complexity reduces to H_{ε,i} = b²/ε². In fact, the algorithm can stop as soon as the desired accuracy ε is achieved, which means that there is no need to exactly discriminate between arm i and the best arm. On the other hand, when Δ_i > ε, the complexity becomes H_{ε,i} = 4b²/(Δ_i + ε)². This shows that when the desired accuracy is smaller than the gap, the complexity of the problem is smaller than in the case of ε = 0, for which we have H_{0,i} = 4b²/Δ_i².

More generally, the analysis reported in the paper suggests that the performance of an upper-confidence-bound-based algorithm such as UGapE is characterized by the same notion of complexity in both settings. Thus, whenever the complexity is known, it is possible to exploit the theoretical analysis (bounds on the performance) to easily switch from one setting to the other.
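To make the case analysis above concrete, the per-arm and total complexities can be sketched in a few lines of Python. The function names are ours, and the gaps Δ_i and the range parameter b are assumed to be known, which is of course not the case while the algorithm actually runs:

```python
def arm_complexity(gap, eps, b=1.0):
    # Per-arm complexity H_{eps,i} = b^2 / max((Delta_i + eps)/2, eps)^2.
    return b ** 2 / max((gap + eps) / 2.0, eps) ** 2

def total_complexity(gaps, eps, b=1.0):
    # H_eps is the sum of the per-arm complexities over all K arms.
    return sum(arm_complexity(g, eps, b) for g in gaps)

# Sanity check of the two regimes discussed in the text:
# when Delta_i <= eps the per-arm term is b^2/eps^2,
# and when Delta_i > eps it is 4b^2/(Delta_i + eps)^2.
```

This matches the two regimes above: for instance, with b = 1 and ε = 0.1, a gap of 0.05 yields a per-arm term of 1/ε² = 100, while a gap of 0.5 yields 4/(0.6)² ≈ 11.1.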
For instance, as also suggested in Section 5.4 of [9], if the complexity H is known, an algorithm like UGapEc can be adapted to run in the fixed budget setting by inverting the bound on its sample complexity. This would lead to an algorithm similar to UGapEb with similar performance, although the parameter tuning could be more difficult because of the intrinsically poor accuracy of the constants in the bound. On the other hand, it is an open question whether it is possible to find an "equivalence" between algorithms for the two settings when the complexity is not known. In particular, it would be important to derive a distribution-dependent lower bound of the form reported in [1], for the general case of ε ≥ 0 and m ≥ 1, for both the fixed budget and fixed confidence settings.

5 Summary and Discussion

We proposed a meta-algorithm, called unified gap-based exploration (UGapE), that unifies the two settings of the best arm(s) identification problem in the stochastic multi-armed bandit: fixed budget and fixed confidence. UGapE can be instantiated as two algorithms with a common structure (the same arm-selection and arm-return strategies) corresponding to these two settings, whose performance can be analyzed in a unified way, i.e., a large portion of their theoretical analysis can be unified in a series of lemmas. We proved a performance bound for the UGapE algorithm in the two settings. We also showed how UGapE and its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluated the performance of UGapE and compared it with a number of existing fixed budget and fixed confidence algorithms.

This unification is important for both theoretical and algorithmic reasons.
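As a toy illustration of the switch between settings mentioned above, one can take the sample-complexity bound N = K + O(H_ε log(H_ε/δ)) at face value with a hypothetical explicit constant C (the true constants in the analysis are not this sharp) and invert it to estimate the confidence δ reachable with a fixed budget n:

```python
import math

def achievable_confidence(n, K, H, C=2.0):
    # Invert n = K + C * H * log(H / delta) for delta. C is an
    # illustrative constant, not the constant hidden in the bound.
    if n <= K:
        return 1.0  # budget below one pull per arm: no guarantee
    return min(1.0, H * math.exp(-(n - K) / (C * H)))
```

This is only a back-of-the-envelope inversion; as noted above, the loose constants make such a conversion delicate in practice.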
Despite their similarities, the fixed budget and fixed confidence settings have been treated differently in the literature. We believe that this unification provides a better understanding of the intrinsic difficulties of the best arm(s) identification problem. In particular, our analysis showed that the same complexity term characterizes the hardness of both settings. As mentioned in the introduction, there was no algorithm available for several settings considered in this paper, e.g., (ε, m)-best arm identification with fixed budget. With UGapE, we introduced an algorithm that can be easily adapted to all these settings.

Acknowledgments This work was supported by the Ministry of Higher Education and Research, the Nord-Pas de Calais Regional Council and FEDER through the "contrat de projets état region 2007–2013", the French National Research Agency (ANR) under project LAMPADA n° ANR-09-EMER-007, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 270327, and the PASCAL2 European Network of Excellence.

References
[1] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Proceedings of the Twenty-Third Annual Conference on Learning Theory, pages 41–53, 2010.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.
[3] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandit problems. In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, pages 23–37, 2009.
[4] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. CoRR, abs/1205.3181, 2012.
[5] K. Deng, J. Pineau, and S. Murphy.
Active learning for developing personalized treatment. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 161–168, 2011.
[6] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.
[7] V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. Technical report 00747005, October 2012.
[8] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. In Proceedings of Advances in Neural Information Processing Systems 25, pages 2222–2230, 2011.
[9] S. Kalyanakrishnan. Learning Methods for Sequential Decision Making with Imperfect Representations. PhD thesis, Department of Computer Science, The University of Texas at Austin, Austin, Texas, USA, December 2011. Published as UT Austin Computer Science Technical Report TR-11-41.
[10] S. Kalyanakrishnan and P. Stone. Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 511–518, 2010.
[11] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.
[12] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proceedings of Advances in Neural Information Processing Systems 6, pages 59–66, 1993.
[13] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, 2009.
[14] V. Mnih, Cs. Szepesvári, and J.-Y. Audibert.
Empirical Bernstein stopping. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 672–679, 2008.
[15] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.