{"title": "Best-Arm Identification in Linear Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 828, "page_last": 836, "abstract": "We study the best-arm identification problem in linear bandit, where the rewards of the arms depend linearly on an unknown parameter $\\theta^*$ and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies that pull arms to identify the best arm with a fixed confidence, while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimate of the reward of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the $G$-optimality criterion used in optimal experimental design.", "full_text": "Best-Arm Identi\ufb01cation in Linear Bandits\n\nMarta Soare\n\nAlessandro Lazaric\n\nR\u00e9mi Munos\u2217 \u2020\n\n{marta.soare,alessandro.lazaric,remi.munos}@inria.fr\n\nINRIA Lille \u2013 Nord Europe, SequeL Team\n\nAbstract\n\nWe study the best-arm identi\ufb01cation problem in linear bandit, where the rewards\nof the arms depend linearly on an unknown parameter \u03b8\u2217 and the objective is to\nreturn the arm with the largest reward. We characterize the complexity of the\nproblem and introduce sample allocation strategies that pull arms to identify the\nbest arm with a \ufb01xed con\ufb01dence, while minimizing the sample budget. In partic-\nular, we show the importance of exploiting the global linear structure to improve\nthe estimate of the reward of near-optimal arms. We analyze the proposed strate-\ngies and compare their empirical performance. 
Finally, as a by-product of our\nanalysis, we point out the connection to the G-optimality criterion used in optimal\nexperimental design.\n\n1 Introduction\nThe stochastic multi-armed bandit problem (MAB) [16] offers a simple formalization for the study\nof sequential design of experiments. In the standard model, a learner sequentially chooses an arm\nout of K and receives a reward drawn from a \ufb01xed, unknown distribution relative to the chosen\narm. While most of the literature in bandit theory focused on the problem of maximization of\ncumulative rewards, where the learner needs to trade-off exploration and exploitation, recently the\npure exploration setting [5] has gained a lot of attention. Here, the learner uses the available budget\nto identify as accurately as possible the best arm, without trying to maximize the sum of rewards.\nAlthough many results are by now available in a wide range of settings (e.g., best-arm identi\ufb01cation\nwith \ufb01xed budget [2, 11] and \ufb01xed con\ufb01dence [7], subset selection [6, 12], and multi-bandit [9]),\nmost of the work considered only the multi-armed setting, with K independent arms.\nAn interesting variant of the MAB setup is the stochastic linear bandit problem (LB), introduced\nin [3]. In the LB setting, the input space X is a subset of Rd and when pulling an arm x, the learner\nobserves a reward whose expected value is a linear combination of x and an unknown parameter\n\u03b8\u2217 \u2208 Rd. Due to the linear structure of the problem, pulling an arm gives information about the\nparameter \u03b8\u2217 and indirectly, about the value of other arms. Therefore, the estimation of K mean-\nrewards is replaced by the estimation of the d features of \u03b8\u2217. 
While in the exploration-exploitation setting the LB has been widely studied both in theory and in practice (e.g., [1, 14]), in this paper we focus on the pure-exploration scenario.\nThe fundamental difference between the MAB and the LB best-arm identification strategies stems from the fact that in MAB an arm is no longer pulled as soon as its sub-optimality is evident (with high probability), while in the LB setting even a sub-optimal arm may offer valuable information about the parameter vector θ∗ and thus improve the accuracy of the estimation in discriminating among near-optimal arms. For instance, consider the situation when K−2 out of K arms are already discarded. In order to identify the best arm, MAB algorithms would concentrate the sampling on the two remaining arms to increase the accuracy of the estimate of their mean rewards until the discarding condition is met for one of them. On the contrary, a LB pure-exploration strategy would seek to pull the arm x ∈ X whose observed reward allows the estimate of θ∗ to be refined along the dimensions best suited to discriminating between the two remaining arms. Recently, best-arm identification in linear bandits has been studied in a fixed-budget setting [10]; in this paper we study the sample complexity required to identify the best linear arm with a fixed confidence.\n∗This work was done when the author was a visiting researcher at Microsoft Research New-England.\n†Current affiliation: Google DeepMind.\n\n2 Preliminaries\nThe setting. We consider the standard linear bandit model. 
Let X \u2286 Rd be a \ufb01nite set of arms,\nwhere |X| = K and the \u21132-norm of any arm x \u2208 X , denoted by ||x||, is upper-bounded by L.\nGiven an unknown parameter \u03b8\u2217 \u2208 Rd, we assume that each time an arm x \u2208 X is pulled, a random\nreward r(x) is generated according to the linear model r(x) = x\u22a4\u03b8\u2217 + \u03b5, where \u03b5 is a zero-mean\ni.i.d. noise bounded in [\u2212\u03c3; \u03c3]. Arms are evaluated according to their expected reward x\u22a4\u03b8\u2217 and\nwe denote by x\u2217 = arg maxx\u2208X x\u22a4\u03b8\u2217 the best arm in X . Also, we use \u03a0(\u03b8) = arg maxx\u2208X x\u22a4\u03b8\nto refer to the best arm corresponding to an arbitrary parameter \u03b8. Let \u0394(x, x\u2032) = (x \u2212 x\u2032)\u22a4\u03b8\u2217 be\nthe value gap between two arms, then we denote by \u0394(x) = \u0394(x\u2217, x) the gap of x w.r.t. the optimal\narm and by \u0394min = minx\u2208X \u0394(x) the minimum gap, where \u0394min > 0. We also introduce the sets\nY = {y = x \u2212 x\u2032,\u2200x, x\u2032 \u2208 X} and Y \u2217 = {y = x\u2217 \u2212 x,\u2200x \u2208 X} containing all the directions\nobtained as the difference of two arms (or an arm and the optimal arm) and we rede\ufb01ne accordingly\nthe gap of a direction as \u0394(y) = \u0394(x, x\u2032) whenever y = x \u2212 x\u2032.\nThe problem. We study the best-arm identi\ufb01cation problem. Let \u02c6x(n) be the estimated best arm\nreturned by a bandit algorithm after n steps. We evaluate the quality of \u02c6x(n) by the simple regret\nRn = (x\u2217 \u2212 \u02c6x(n))\u22a4\u03b8\u2217. 
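As a concrete illustration of the quantities just defined (best arm, gaps, and the direction sets Y and Y∗), the following sketch builds them for a small invented instance; the arm set and parameter are assumptions made for the example, not taken from the paper.

```python
import itertools
import numpy as np

def best_arm(X, theta):
    """Pi(theta) = arg max_x x^T theta."""
    return int(np.argmax(X @ theta))

def gaps_and_directions(X, theta):
    """Gaps Delta(x) and the direction sets Y (all pairwise differences)
    and Y* (differences between the best arm and every other arm)."""
    values = X @ theta
    best = best_arm(X, theta)
    Delta = values[best] - values           # Delta(x) = (x* - x)^T theta*
    Y = [X[i] - X[j] for i, j in itertools.permutations(range(len(X)), 2)]
    Y_star = [X[best] - X[i] for i in range(len(X)) if i != best]
    return Delta, Y, Y_star

# Toy instance in R^2: two canonical arms plus a correlated, near-optimal arm.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.4]])
theta_star = np.array([1.0, 0.2])
Delta, Y, Y_star = gaps_and_directions(X, theta_star)
```

Here the third arm has a small gap (0.02), which is the regime where, as the paper argues, exploiting the linear structure matters most.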
While different settings can be defined (see [8] for an overview), here we focus on the (ε, δ)-best-arm identification problem (the so-called PAC setting), where given ε and δ ∈ (0, 1), the objective is to design an allocation strategy and a stopping criterion so that when the algorithm stops, the returned arm ˆx(n) is such that P(R_n ≥ ε) ≤ δ, while minimizing the needed number of steps. More specifically, we will focus on the case of ε = 0 and we will provide high-probability bounds on the sample complexity n.\nThe multi-armed bandit case. In MAB, the complexity of best-arm identification is characterized by the gaps between arm values, following the intuition that the more similar the arms, the more pulls are needed to distinguish between them. More formally, the complexity is given by the problem-dependent quantity H_MAB = Σ_{i=1}^K 1/Δ_i², i.e., the sum of the inverse squared gaps between the best arm and the suboptimal arms. In the fixed-budget case, H_MAB determines the probability of returning the wrong arm [2], while in the fixed-confidence case, it characterizes the sample complexity [7].\nTechnical tools. Unlike in the multi-armed bandit scenario, where pulling one arm does not provide any information about other arms, in a linear model we can leverage the rewards observed over time to estimate the expected reward of all the arms in X. Let x_n = (x_1, ..., x_n) ∈ X^n be a sequence of arms and (r_1, ..., r_n) the corresponding observed (random) rewards. An unbiased estimate of θ∗ can be obtained by ordinary least-squares (OLS) as ˆθ_n = A_{x_n}^{-1} b_{x_n}, where A_{x_n} = Σ_{t=1}^n x_t x_t^⊤ ∈ R^{d×d} and b_{x_n} = Σ_{t=1}^n x_t r_t ∈ R^d. For any fixed sequence x_n, through Azuma's inequality, the prediction error of the OLS estimate is upper-bounded in high probability as follows.\nProposition 1. Let c = 2σ√2 and c′ = 6/π². For every fixed sequence x_n, we have¹\nP( ∀n ∈ N, ∀x ∈ X, |x^⊤θ∗ − x^⊤ˆθ_n| ≤ c ||x||_{A_{x_n}^{-1}} √log(c′n²K/δ) ) ≥ 1 − δ.   (1)\nWhile in the previous statement x_n is fixed, a bandit algorithm adapts the allocation in response to the rewards observed over time. In this case a different high-probability bound is needed.\nProposition 2 (Thm. 2 in [1]). Let ˆθ^η_n be the solution to the regularized least-squares problem with regularizer η and let Ã^η_x = ηI_d + A_x. Then for all x ∈ X and every adaptive sequence x_n such that at any step t, x_t only depends on (x_1, r_1, ..., x_{t−1}, r_{t−1}), w.p. 1 − δ, we have\n|x^⊤θ∗ − x^⊤ˆθ^η_n| ≤ ||x||_{(Ã^η_{x_n})^{-1}} ( σ √( d log((1 + nL²/η)/δ) ) + η^{1/2} ||θ∗|| ).   (2)\nThe crucial difference w.r.t. Eq. 1 is an additional factor √d, the price to pay for adapting x_n to the samples. In the sequel we will often resort to the notion of design (or "soft" allocation) λ ∈ D^K, which prescribes the proportions of pulls to each arm x, where D^K denotes the simplex over X. The counterpart of the design matrix A for a design λ is the matrix Λ_λ = Σ_{x∈X} λ(x) x x^⊤. From an allocation x_n we can derive the corresponding design λ_{x_n} as λ_{x_n}(x) = T_n(x)/n, where T_n(x) is the number of times arm x is selected in x_n, and the corresponding design matrix is A_{x_n} = n Λ_{λ_{x_n}}.\n¹Whenever Prop. 1 is used for all directions y ∈ Y, the logarithmic term becomes log(c′n²K²/δ) because of an additional union bound. 
For the sake of simplicity, in the sequel we always use the shorthand log_n(K²/δ) = log(c′n²K²/δ).\n\n3 The Complexity of the Linear Best-Arm Identification Problem\nAs reviewed in Sect. 2, in the MAB case the complexity of the best-arm identification task is characterized by the reward gaps between the optimal and suboptimal arms. In this section, we propose an extension of the notion of complexity to the case of linear best-arm identification. In particular, we characterize the complexity by the performance of an oracle with access to the parameter θ∗.\nStopping condition. Let C(x) = {θ ∈ R^d, x ∈ Π(θ)} be the set of parameters θ which admit x as an optimal arm. As illustrated in Fig. 1, C(x) is the cone defined by the intersection of half-spaces, C(x) = ∩_{x′∈X} {θ ∈ R^d, (x − x′)^⊤θ ≥ 0}, and all the cones together form a partition of the Euclidean space R^d. We assume that the oracle knows the cone C(x∗) containing all the parameters for which x∗ is optimal. Furthermore, we assume that for any allocation x_n, it is possible to construct a confidence set S∗(x_n) ⊆ R^d such that θ∗ ∈ S∗(x_n) and the (random) OLS estimate ˆθ_n belongs to S∗(x_n) with high probability, i.e., P(ˆθ_n ∈ S∗(x_n)) ≥ 1 − δ. As a result, the oracle stopping criterion simply checks whether the confidence set S∗(x_n) is contained in C(x∗) or not. In fact, whenever for an allocation x_n the set S∗(x_n) overlaps the cones of different arms x ∈ X, there is ambiguity in the identity of the arm Π(ˆθ_n). 
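The cone-membership test underlying this stopping rule, together with the fixed-sequence confidence width of Prop. 1 that drives the confidence sets, can be sketched as follows; the instance, noise level, and function names are assumptions made for illustration.

```python
import numpy as np

def ols_estimate(pulls, rewards):
    """theta_hat_n = A^{-1} b with A = sum_t x_t x_t^T and b = sum_t x_t r_t."""
    A = sum(np.outer(x, x) for x in pulls)
    b = sum(x * r for x, r in zip(pulls, rewards))
    return np.linalg.solve(A, b), A

def prop1_width(x, A_inv, n, K, sigma, delta):
    """Prop. 1 bound on |x^T theta* - x^T theta_hat_n| for a fixed sequence,
    with c = 2*sigma*sqrt(2) and c' = 6/pi^2."""
    c = 2 * sigma * np.sqrt(2.0)
    c_prime = 6.0 / np.pi**2
    return c * np.sqrt(x @ A_inv @ x) * np.sqrt(np.log(c_prime * n**2 * K / delta))

rng = np.random.default_rng(0)
theta_star = np.array([1.0, 0.2])
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
pulls = arms * 500                              # a fixed, alternating sequence
rewards = [x @ theta_star + rng.uniform(-0.1, 0.1) for x in pulls]
theta_hat, A = ols_estimate(pulls, rewards)
w = prop1_width(arms[0], np.linalg.inv(A), len(pulls), K=2, sigma=0.1, delta=0.05)
```

With 1000 samples the width `w` is already far smaller than the gaps of the toy instance, which is exactly when the confidence set fits inside a single cone.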
On the other hand, when all possible values of ˆθ_n are included with high probability in the "right" cone C(x∗), then the optimal arm is returned.\n\nFigure 1: The cones corresponding to three arms (dots) in R². Since θ∗ ∈ C(x_1), then x∗ = x_1. The confidence set S∗(x_n) (in green) is aligned with directions x_1 − x_2 and x_1 − x_3. Given the uncertainty in S∗(x_n), both x_1 and x_3 may be optimal.\n\nLemma 1. Let x_n be an allocation such that S∗(x_n) ⊆ C(x∗). Then P( Π(ˆθ_n) ≠ x∗ ) ≤ δ.\nArm selection strategy. From the previous lemma² it follows that the objective of an arm selection strategy is to define an allocation x_n which leads to S∗(x_n) ⊆ C(x∗) as quickly as possible.³ Since this condition only depends on deterministic objects (S∗(x_n) and C(x∗)), it can be computed independently from the actual reward realizations. From a geometrical point of view, this corresponds to choosing arms so that the confidence set S∗(x_n) shrinks into the optimal cone C(x∗) within the smallest number of pulls. To characterize this strategy we need to make explicit the form of S∗(x_n). Intuitively speaking, the more S∗(x_n) is "aligned" with the boundaries of the cone, the easier it is to shrink it into the cone. More formally, the condition S∗(x_n) ⊆ C(x∗) is equivalent to\n∀x ∈ X, ∀θ ∈ S∗(x_n), (x∗ − x)^⊤θ ≥ 0 ⇔ ∀y ∈ Y∗, ∀θ ∈ S∗(x_n), y^⊤(θ∗ − θ) ≤ Δ(y).\nThen we can simply use Prop. 1 to directly control the term y^⊤(θ∗ − θ) and define\nS∗(x_n) = { θ ∈ R^d, ∀y ∈ Y∗, y^⊤(θ∗ − θ) ≤ c ||y||_{A_{x_n}^{-1}} √log_n(K²/δ) }.   (3)\nThus the stopping condition S∗(x_n) ⊆ C(x∗) is equivalent to the condition that, for any y ∈ Y∗,\nc ||y||_{A_{x_n}^{-1}} √log_n(K²/δ) ≤ Δ(y).   (4)\nFrom this condition, the oracle allocation strategy simply follows as\nx∗_n = arg min_{x_n} max_{y∈Y∗} c ||y||_{A_{x_n}^{-1}} √log_n(K²/δ) / Δ(y) = arg min_{x_n} max_{y∈Y∗} ||y||_{A_{x_n}^{-1}} / Δ(y).   (5)\nNotice that this strategy does not return a uniformly accurate estimate of θ∗ but rather pulls arms that reduce the uncertainty of the estimate of θ∗ over the directions of interest (i.e., Y∗) below their corresponding gaps. This implies that the objective of Eq. 5 is to exploit the global linear assumption by pulling any arm in X that could give information about θ∗ over the directions in Y∗, so that directions with small gaps are better estimated than those with bigger gaps.\n²For all the proofs in this paper, we refer the reader to the long version of the paper [18].\n³Notice that by definition of the confidence set and since ˆθ_n → θ∗ as n → ∞, any strategy repeatedly pulling all the arms would eventually meet the stopping condition.\nSample complexity. We are now ready to define the sample complexity of the oracle, which corresponds to the minimum number of steps needed by the allocation in Eq. 5 to achieve the stopping condition in Eq. 4. From a technical point of view, it is more convenient to express the complexity of the problem in terms of the optimal design (soft allocation) instead of the discrete allocation x_n. Let ρ∗(λ) = max_{y∈Y∗} ||y||²_{Λ_λ^{-1}} / Δ²(y) be the square of the objective function in Eq. 5 for any design λ ∈ D^K. We define the complexity of a linear best-arm identification problem as the performance achieved by the optimal design λ∗ = arg min_λ ρ∗(λ), i.e.,\nH_LB = min_{λ∈D^K} max_{y∈Y∗} ||y||²_{Λ_λ^{-1}} / Δ²(y) = ρ∗(λ∗).   (6)\nThis definition of complexity is less explicit than in the case of H_MAB but it contains similar elements, notably the inverse of the gaps squared. Nonetheless, instead of summing the inverses over all the arms, H_LB implicitly takes into consideration the correlation between the arms in the term ||y||²_{Λ_λ^{-1}}, which represents the uncertainty in the estimation of the gap between x∗ and x (when y = x∗ − x). As a result, from Eq. 4 the sample complexity becomes\nN∗ = c² H_LB log_n(K²/δ),   (7)\nwhere we use the fact that, if implemented over n steps, λ∗ induces a design matrix A_{λ∗} = n Λ_{λ∗} and max_y ||y||²_{A_{λ∗}^{-1}} / Δ²(y) = ρ∗(λ∗)/n. Finally, we bound the range of the complexity.\nLemma 2. Given an arm set X ⊆ R^d and a parameter θ∗, the complexity H_LB (Eq. 6) is such that\nmax_{y∈Y∗} ||y||² / (L Δ²_min) ≤ H_LB ≤ 4d / Δ²_min.   (8)\nFurthermore, if X is the canonical basis, the problem reduces to a MAB and H_MAB ≤ H_LB ≤ 2 H_MAB.\nThe previous bounds show that Δ_min plays a significant role in defining the complexity of the problem, while the specific shape of X impacts the numerator in different ways. 
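The oracle allocation of Eq. 5 and the resulting value ρ∗(λ) can be approximated by a simple greedy procedure that, at each step, pulls the arm most reducing the worst-case ratio over Y∗. This is only an illustrative sketch under assumptions (a greedy surrogate for the optimal design, a small ridge for invertibility, an invented instance), not the paper's exact optimal-design computation.

```python
import numpy as np

def oracle_greedy(X, theta_star, n):
    """Greedy approximation of the oracle allocation in Eq. 5: each step pulls
    the arm minimizing max_{y in Y*} ||y||^2_{A^{-1}} / Delta(y)^2."""
    values = X @ theta_star
    best = int(np.argmax(values))
    Y_star = [(X[best] - X[i], values[best] - values[i])
              for i in range(len(X)) if i != best]
    A = 1e-6 * np.eye(X.shape[1])           # tiny ridge so early inverses exist
    counts = np.zeros(len(X), dtype=int)
    for _ in range(n):
        scores = []
        for x in X:
            A_inv = np.linalg.inv(A + np.outer(x, x))
            scores.append(max(y @ A_inv @ y / gap**2 for y, gap in Y_star))
        k = int(np.argmin(scores))
        counts[k] += 1
        A += np.outer(X[k], X[k])
    rho = n * max(y @ np.linalg.inv(A) @ y / gap**2 for y, gap in Y_star)
    return counts, rho                      # rho ~ rho*(lambda) of Eq. 6

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.4]])
counts, rho = oracle_greedy(X, np.array([1.0, 0.2]), n=200)
```

The allocation concentrates pulls on arms informative for the small-gap directions, which is the behavior the oracle complexity H_LB rewards.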
In the worst case the full dimensionality d appears (upper bound), while more arm-set-specific quantities, such as the norm of the arms L and of the directions Y∗, appear in the lower bound.\n\n4 Static Allocation Strategies\nThe oracle stopping condition (Eq. 4) and allocation strategy (Eq. 5) cannot be implemented in practice since θ∗, the gaps Δ(y), and the directions Y∗ are unknown. In this section we investigate how to define algorithms that only rely on the information available from X and the samples collected over time. We introduce an empirical stopping criterion and two static allocations.\nEmpirical stopping criterion. The stopping condition S∗(x_n) ⊆ C(x∗) cannot be tested since S∗(x_n) is centered in the unknown parameter θ∗ and C(x∗) depends on the unknown optimal arm x∗. Nonetheless, we notice that given X, for each x ∈ X the cones C(x) can be constructed beforehand. Let ˆS(x_n) be a high-probability confidence set such that for any x_n, ˆθ_n ∈ ˆS(x_n) and P(θ∗ ∈ ˆS(x_n)) ≥ 1 − δ. Unlike S∗, ˆS can be directly computed from samples and we can stop whenever there exists an x such that ˆS(x_n) ⊆ C(x).\nLemma 3. Let x_n = (x_1, ..., x_n) be an arbitrary allocation sequence. If after n steps there exists an arm x ∈ X such that ˆS(x_n) ⊆ C(x), then P( Π(ˆθ_n) ≠ x∗ ) ≤ δ.\nArm selection strategy. Similarly to the oracle algorithm, we should design an allocation strategy that guarantees that the (random) confidence set ˆS(x_n) shrinks into one of the cones C(x) within the fewest number of steps. Let ˆΔ_n(x, x′) = (x − x′)^⊤ˆθ_n be the empirical gap between arms x, x′. Then the stopping condition ˆS(x_n) ⊆ C(x) can be written as\n∃x ∈ X, ∀x′ ∈ X, ∀θ ∈ ˆS(x_n), (x − x′)^⊤θ ≥ 0 ⇔ ∃x ∈ X, ∀x′ ∈ X, ∀θ ∈ ˆS(x_n), (x − x′)^⊤(ˆθ_n − θ) ≤ ˆΔ_n(x, x′).   (9)\nThis suggests that the empirical confidence set can be defined as\nˆS(x_n) = { θ ∈ R^d, ∀y ∈ Y, y^⊤(ˆθ_n − θ) ≤ c ||y||_{A_{x_n}^{-1}} √log_n(K²/δ) }.   (10)\nUnlike S∗(x_n), ˆS(x_n) is centered in ˆθ_n and it considers all directions y ∈ Y. As a result, the stopping condition in Eq. 9 can be reformulated as\n∃x ∈ X, ∀x′ ∈ X, c ||x − x′||_{A_{x_n}^{-1}} √log_n(K²/δ) ≤ ˆΔ_n(x, x′).   (11)\nAlthough similar to Eq. 4, unfortunately this condition cannot be directly used to derive an allocation strategy. In fact, it is considerably more difficult to define a suitable allocation strategy to fit a random confidence set ˆS into a cone C(x) for an x which is not known in advance. In the following we propose two allocations that try to achieve the condition in Eq. 11 as fast as possible by implementing a static arm selection strategy, while we present a more sophisticated adaptive strategy in Sect. 5. The general structure of the static allocations is summarized in Fig. 2.\n\nFigure 2: Static allocation algorithms.\nInput: decision space X ⊆ R^d, confidence δ > 0\nSet: t = 0; Y = {y = (x − x′); x ≠ x′ ∈ X}\nwhile Eq. 11 is not true do\n  if G-allocation then\n    x_t = arg min_{x∈X} max_{x′∈X} x′^⊤(A + xx^⊤)^{-1} x′\n  else if XY-allocation then\n    x_t = arg min_{x∈X} max_{y∈Y} y^⊤(A + xx^⊤)^{-1} y\n  end if\n  Update ˆθ_t = A_t^{-1} b_t, t = t + 1\nend while\nReturn arm Π(ˆθ_t)\n\nG-Allocation Strategy. The definition of the G-allocation strategy directly follows from the observation that for any pair (x, x′) ∈ X² we have ||x − x′||_{A_{x_n}^{-1}} ≤ 2 max_{x′′∈X} ||x′′||_{A_{x_n}^{-1}}. This suggests that an allocation minimizing max_{x∈X} ||x||_{A_{x_n}^{-1}} reduces an upper bound on the quantity tested in the stopping condition in Eq. 11. Thus, for any fixed n, we define the G-allocation as\nx^G_n = arg min_{x_n} max_{x∈X} ||x||_{A_{x_n}^{-1}}.   (12)\nWe notice that this formulation coincides with the standard G-optimal design (hence the name of the allocation) defined in experimental design theory [15, Sect. 9.2] to minimize the maximal mean-squared prediction error in linear regression. The G-allocation can be interpreted as the design that allows θ∗ to be estimated uniformly well over all the arms in X. Notice that the G-allocation in Eq. 12 is well defined only for a fixed number of steps n and it cannot be directly implemented in our case, since n is unknown in advance. Therefore we have to resort to a more "incremental" implementation. In the experimental design literature a wide number of approximate solutions have been proposed to solve the NP-hard discrete optimization problem in Eq. 
12 (see [4, 17] for some recent results and [18] for a more thorough discussion). For any approximate G-allocation strategy with performance no worse than a factor (1 + β) of the optimal strategy x^G_n, the sample complexity N^G is bounded as follows.\nTheorem 1. If the G-allocation strategy is implemented with a β-approximate method and the stopping condition in Eq. 11 is used, then\nP( N^G ≤ 16 c² d (1 + β) log_n(K²/δ) / Δ²_min ∧ Π(ˆθ_{N^G}) = x∗ ) ≥ 1 − δ.   (13)\nNotice that this result matches (up to constants) the worst-case value of N∗ given the upper bound on H_LB. This means that, although completely static, the G-allocation is already worst-case optimal.\nXY-Allocation Strategy. Despite being worst-case optimal, the G-allocation minimizes a rather loose upper bound on the quantity used to test the stopping criterion. Thus, we define an alternative static allocation that targets the stopping condition in Eq. 11 more directly by reducing its left-hand side for any possible direction in Y. For any fixed n, we define the XY-allocation as\nx^XY_n = arg min_{x_n} max_{y∈Y} ||y||_{A_{x_n}^{-1}}.   (14)\nXY-allocation is based on the observation that the stopping condition in Eq. 11 requires only the empirical gaps ˆΔ(x, x′) to be well estimated, hence arms are pulled with the objective of increasing the accuracy of directions in Y instead of arms in X. This problem can be seen as a transductive variant of the G-optimal design [19], where the target vectors Y are different from the vectors X used in the design. The sample complexity of the XY-allocation is as follows.\nTheorem 2. If the XY-allocation strategy is implemented with a β-approximate method and the stopping condition in Eq. 11 is used, then\nP( N^XY ≤ 32 c² d (1 + β) log_n(K²/δ) / Δ²_min ∧ Π(ˆθ_{N^XY}) = x∗ ) ≥ 1 − δ.   (15)\nAlthough the previous bound suggests that XY achieves a performance comparable to the G-allocation, in fact XY may be arbitrarily better than G-allocation (for an example, see [18]).\n\n5 XY-Adaptive Allocation Strategy\nFully adaptive allocation strategies. Although both G- and XY-allocation are sound since they minimize upper bounds on the quantities used by the stopping condition (Eq. 11), they may be very suboptimal w.r.t. the ideal performance of the oracle introduced in Sect. 3. Typically, an improvement can be obtained by moving to strategies that adapt to the rewards observed over time. Nonetheless, as reported in Prop. 2, whenever x_n is not a fixed sequence, the bound in Eq. 2 should be used. As a result, a factor √d would appear in the definition of the confidence sets and in the stopping condition. This directly implies that the sample complexity of a fully adaptive strategy would scale linearly with the dimensionality d of the problem, thus removing any advantage w.r.t. static allocations. In fact, the sample complexity of G- and XY-allocation already scales linearly with d, and from Lem. 2 we cannot expect to improve the dependency on Δ_min. Thus, on the one hand, we need to use the tighter bounds in Eq. 1 and, on the other hand, we require to be adaptive w.r.t. samples. In the sequel we propose a phased algorithm which successfully meets both requirements, using a static allocation within each phase but choosing the type of allocation depending on the samples observed in previous phases.\n\nFigure 3: XY-Adaptive allocation algorithm.\nInput: decision space X ⊆ R^d; parameter α; confidence δ\nSet j = 1; ˆX_j = X; ˆY_1 = Y; ρ_0 = 1; n_0 = d(d + 1) + 1\nwhile |ˆX_j| > 1 do\n  ρ_j = ρ_{j−1}; t = 1; A_0 = I\n  while ρ_j/t ≥ α ρ_{j−1}/n_{j−1} do\n    Select arm x_t = arg min_{x∈X} max_{y∈ˆY_j} y^⊤(A + xx^⊤)^{-1} y\n    Update A_t = A_{t−1} + x_t x_t^⊤, t = t + 1\n    ρ_j = max_{y∈ˆY_j} y^⊤ A_t^{-1} y\n  end while\n  Compute b = Σ_{s=1}^t x_s r_s; ˆθ_j = A_t^{-1} b\n  ˆX_{j+1} = X\n  for x ∈ X do\n    if ∃x′ : c ||x − x′||_{A_t^{-1}} √log_n(K²/δ) ≤ ˆΔ_j(x′, x) then\n      ˆX_{j+1} = ˆX_{j+1} − {x}\n    end if\n  end for\n  ˆY_{j+1} = {y = (x − x′); x, x′ ∈ ˆX_{j+1}}; j = j + 1\nend while\nReturn Π(ˆθ_j)\n\nAlgorithm. The ideal case would be to define an empirical version of the oracle allocation in Eq. 5 so as to adjust the accuracy of the prediction only on the directions of interest Y∗ and according to their gaps Δ(y). As discussed in Sect. 4 this cannot be obtained by a direct adaptation of Eq. 11. In the following, we describe a safe alternative to adjust the allocation strategy to the gaps.\nLemma 4. Let x_n be a fixed allocation sequence and ˆθ_n its corresponding estimate for θ∗. If an arm x ∈ X is such that\n∃x′ ∈ X s.t. c ||x′ − x||_{A_{x_n}^{-1}} √log_n(K²/δ) < ˆΔ_n(x′, x),   (16)\nthen arm x is sub-optimal. Moreover, if Eq. 16 is true, we say that x′ dominates x.\nLem. 4 allows us to easily construct the set of potentially optimal arms, denoted ˆX(x_n), by removing from X all the dominated arms. As a result, we can replace the stopping condition in Eq. 11 by just testing whether the number of non-dominated arms |ˆX(x_n)| is equal to 1, which corresponds to the case where the confidence set is fully contained into a single cone. 
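The discard rule of Lemma 4 / Eq. 16 can be sketched as follows; the instance, the assumed allocation (1000 pulls per canonical arm), and the noise level are all illustrative assumptions, and the constants follow Prop. 1 with the union-bound logarithmic term.

```python
import numpy as np

def discard_dominated(X, theta_hat, A_inv, n, sigma, delta):
    """Lemma 4 / Eq. 16: drop every arm x for which some x' satisfies
    c * ||x' - x||_{A^{-1}} * sqrt(log_n(K^2/delta)) < (x' - x)^T theta_hat."""
    K = len(X)
    c = 2 * sigma * np.sqrt(2.0)
    log_term = np.log((6.0 / np.pi**2) * n**2 * K**2 / delta)
    keep = []
    for i, x in enumerate(X):
        dominated = any(
            c * np.sqrt((xp - x) @ A_inv @ (xp - x)) * np.sqrt(log_term)
            < (xp - x) @ theta_hat
            for xp in X)
        if not dominated:
            keep.append(i)
    return keep

# Suppose 1000 pulls of each canonical arm in R^2 produced an accurate estimate.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
A_inv = np.eye(2) / 1000.0                  # A = 1000 * I after those pulls
active = discard_dominated(X, np.array([1.0, 0.2]), A_inv, n=2000,
                           sigma=0.1, delta=0.05)
```

With this much data the confidence width along e_1 − e_2 is far below the empirical gap, so the second arm is dominated and only the first remains active.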
Using \ufffdX (xn), we construct\n\ufffdY(xn) = {y = x\u2212 x\u2032; x, x\u2032 \u2208 \ufffdX (xn)}, the set of directions along which the estimation of \u03b8\u2217 needs\nto be improved to further shrink \ufffdS(xn) into a single cone and trigger the stopping condition. Note\nphases. Let j \u2208 N be the index of a phase and nj its corresponding length. We denote by \ufffdXj the set\nof non-dominated arms constructed on the basis of the samples collected in the phase j \u2212 1. This\nset is used to identify the directions \ufffdYj and to de\ufb01ne a static allocation which focuses on reducing\nthe uncertainty of \u03b8\u2217 along the directions in \ufffdYj. Formally, in phase j we implement the allocation\n\nthat if xn was an adaptive strategy, then we could not use Lem. 4 to discard arms but we should rely\non the bound in Prop. 2. To avoid this problem, an effective solution is to run the algorithm through\n\nxj\n\n(17)\n\nnj = arg min\nxnj\n\ny\u2208 \ufffdYj ||y||A\u22121\nmax\n\nxnj\n\n,\n\nwhich coincides with a XY-allocation (see Eq. 14) but restricted on \ufffdYj. Notice that xj\nnj may still\nuse any arm in X which could be useful in reducing the con\ufb01dence set along any of the directions in\n\n6\n\n\fj and then is used to test the stopping condition in Eq. 11. Whenever the stopping condition does\n\n\ufffdYj. Once phase j is over, the OLS estimate \u02c6\u03b8j is computed using the rewards observed within phase\nnot hold, a new set \ufffdXj+1 is constructed using the discarding condition in Lem. 4 and a new phase is\n\nstarted. Notice that through this process, at each phase j, the allocation xj\nthe previous allocations and the use of the bound from Prop. 1 is still correct.\nA crucial aspect of this algorithm is the length of the phases nj. 
On the one hand, short phases allow\n\nnj is static conditioned on\n\n\u03bb\n\nj\nn\n\nshort, it is very unlikely that the estimate \u02c6\u03b8j may be accurate enough to actually discard any arm.\nAn effective way to de\ufb01ne the length of a phase in a deterministic way is to relate it to the actual\n\na high rate of adaptivity, since \ufffdXj is recomputed very often. On the other hand, if a phase is too\nuncertainty of the allocation in estimating the value of all the active directions in \ufffdYj. In phase j, let\n\u03c1j(\u03bb) = maxy\u2208 \ufffdYj ||y||2\n\n, then given a parameter \u03b1 \u2208 (0, 1), we de\ufb01ne\n\n\u039b\u22121\n\nnj = min\ufffdn \u2208 N : \u03c1j(\u03bbx\n\n)/n \u2264 \u03b1\u03c1j\u22121(\u03bbj\u22121)/nj\u22121\ufffd,\n\nn is the allocation de\ufb01ned in Eq. 17 and \u03bbj\u22121 is the design corresponding to xj\u22121\n\n(18)\nnj\u22121, the\nwhere xj\nallocation performed at phase j \u2212 1.\nIn words, nj is the minimum number of steps needed by\nthe XY-adaptive allocation to achieve an uncertainty over all the directions of interest which is a\nfraction \u03b1 of the performance obtained in the previous iteration. Notice that given \ufffdYj and \u03c1j\u22121 this\nquantity can be computed before the actual beginning of phase j. The resulting algorithm using the\nXY-Adaptive allocation strategy is summarized in Fig. 3.\nSample complexity. Although the XY-Adaptive allocation strategy is designed to approach the\noracle sample complexity N \u2217, in early phases it basically implements a XY-allocation and no sig-\nni\ufb01cant improvement can be expected until some directions are discarded from \ufffdY. At that point,\nXY-adaptive starts focusing on directions which only contain near-optimal arms and it starts ap-\nproaching the behavior of the oracle. As a result, in studying the sample complexity of XY-Adaptive\nwe have to take into consideration the unavoidable price of discarding \u201csuboptimal\u201d directions. 
This cost is directly related to the geometry of the arm space, which influences the number of samples needed before arms can be discarded from X. To take this problem-dependent quantity into account, we introduce a slightly relaxed definition of complexity. More precisely, we define the number of steps needed to discard all the directions which do not contain x*, i.e., Y − Y*. From a geometrical point of view, this corresponds to the case when, for any pair of suboptimal arms (x, x′), the confidence set S*(x_n) does not intersect the hyperplane separating the cones C(x) and C(x′). Fig. 1 offers a simple illustration of such a situation: S* no longer intercepts the border line between C(x_2) and C(x_3), which implies that the direction x_2 − x_3 can be discarded. More formally, the hyperplane containing the parameters θ for which x and x′ are equivalent is simply C(x) ∩ C(x′), and the quantity

    M* = min{ n ∈ ℕ : ∀x ≠ x*, ∀x′ ≠ x*, S*(x^{XY}_n) ∩ (C(x) ∩ C(x′)) = ∅ }    (19)

corresponds to the minimum number of steps needed by the static XY-allocation strategy to discard all the suboptimal directions. This term, together with the oracle complexity N*, characterizes the sample complexity of the phases of the XY-Adaptive allocation. In fact, the length of the phases is such that either they correspond to the complexity of the oracle or they never last more than the number of steps needed to discard all the suboptimal directions. As a result, the overall sample complexity of the XY-Adaptive algorithm is bounded as in the following theorem.

Theorem 3. If the XY-Adaptive allocation strategy is implemented with a β-approximate method and the stopping condition in Eq. 11 is used, then

    P[ N ≤ ((1 + β) max{M*, (16/α) N*} / log(1/α)) · log( c √(log_n(K²/δ)) / Δ_min )  ∧  Π(θ̂_N) = x* ] ≥ 1 − δ.    (20)

We first remark that, unlike G and XY, the sample complexity of XY-Adaptive does not have any direct dependency on d and Δ_min (except in the logarithmic term) but rather scales with the oracle complexity N* and the cost of discarding suboptimal directions M*. Although this additional cost is probably unavoidable, one may have expected that XY-Adaptive would need to discard all the suboptimal directions before performing as well as the oracle, thus having a sample complexity of O(M* + N*). Instead, we notice that N scales with the maximum of M* and N*, thus implying that XY-Adaptive may actually catch up with the performance of the oracle (with only a multiplicative factor of 16/α) whenever discarding suboptimal directions is less expensive than actually identifying the best arm.

[Figure 4: number of samples (×10^5) required by Fully adaptive, G, XY, XY-Adaptive, and XY-Oracle, for input dimensions d = 2, …, 10.]

6 Numerical Simulations
We illustrate the performance of XY-Adaptive and compare it to the XY-Oracle strategy (Eq. 5), the static allocations XY and G, as well as to the fully-adaptive version of XY where X̂ is updated at each round and the bound from Prop. 2 is used. For a fixed confidence δ = 0.05, we compare the sampling budget needed to identify the best arm with probability at least 1 − δ. We consider a set of arms X ⊂ ℝ^d, with |X| = d + 1, including the canonical basis (e_1, …, e_d) and an additional arm x_{d+1} = [cos(ω) sin(ω) 0 … 0]^⊤.
We choose θ* = [2 0 0 … 0]^⊤ and fix ω = 0.01, so that Δ_min = (x_1 − x_{d+1})^⊤ θ* is much smaller than the other gaps. In this setting, an efficient sampling strategy should focus on reducing the uncertainty in the direction ỹ = x_1 − x_{d+1} by pulling the arm x_2 = e_2, which is almost aligned with ỹ. In fact, from the rewards obtained from x_2 it is easier to decrease the uncertainty about the second component of θ*, which is precisely the dimension that allows us to discriminate between x_1 and x_{d+1}. Also, we fix α = 1/10 and the noise ε ∼ N(0, 1). Each phase begins with an initialization matrix A_0, obtained by pulling each canonical arm once. In Fig. 4 we report the sampling budget of the algorithms, averaged over 100 runs, for d = 2, …, 10.

The results. The numerical results show that XY-Adaptive is effective in allocating the samples so as to shrink the uncertainty in the direction ỹ. Indeed, XY-Adaptive identifies the most important direction after a few phases and is able to perform an allocation which mimics that of the oracle. On the contrary, XY and G do not adjust to the empirical gaps and consider all directions as equally important. This behavior forces XY and G to allocate samples until the uncertainty is smaller than Δ_min in all directions. Even though the Fully-adaptive algorithm also identifies the most informative direction rapidly, the √d term in its bound delays the discarding of arms and prevents the algorithm from gaining any advantage over XY and G. As shown in Fig. 4, the difference between the budget of XY-Adaptive and that of the static strategies increases with the number of dimensions.
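The gap structure driving these results is easy to verify numerically. The snippet below (a hypothetical reconstruction of the setup, here with d = 5) rebuilds the arm set and shows that Δ_min = 2(1 − cos ω) ≈ 10⁻⁴ while every other gap equals 2, which is why the static allocations must keep sampling in all directions:

```python
import numpy as np

# Hypothetical reconstruction of the experimental setup, with d = 5:
# the canonical arms e_1, ..., e_d plus the almost-aligned arm x_{d+1}.
d, omega = 5, 0.01
X = np.vstack([np.eye(d),
               np.r_[np.cos(omega), np.sin(omega), np.zeros(d - 2)]])
theta_star = np.zeros(d)
theta_star[0] = 2.0

rewards = X @ theta_star
best = int(np.argmax(rewards))           # x_1 = e_1 is the best arm
delta_min = rewards[best] - rewards[-1]  # gap of x_{d+1}: 2(1 - cos(omega))
other_gap = rewards[best] - rewards[1]   # gap of any canonical arm e_2, ..., e_d

print(delta_min, other_gap)              # ~1e-4 versus 2.0
```

The four-orders-of-magnitude spread between Δ_min and the remaining gaps is exactly what makes the single direction ỹ = x_1 − x_{d+1} dominate the sample complexity.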
In fact, while additional dimensions have little to no impact on XY-Oracle and XY-Adaptive (the only important direction remains ỹ, independently of the number of unknown features of θ*), for the static allocations more dimensions imply more directions to be considered and more features of θ* to be estimated uniformly well until the uncertainty falls below Δ_min.

7 Conclusions
In this paper we studied the problem of best-arm identification with a fixed confidence in the linear bandit setting. First, we offered a preliminary characterization of the problem-dependent complexity of the best-arm identification task and showed its connection with the complexity in the MAB setting. Then, we designed and analyzed efficient sampling strategies for this problem. The G-allocation strategy allowed us to point out a close connection with optimal experimental design techniques, and in particular with the G-optimality criterion. Through the second proposed strategy, XY-allocation, we introduced a novel optimal design problem where the testing arms do not coincide with the arms chosen in the design. Lastly, we pointed out the limits that a fully-adaptive allocation strategy might have in the linear bandit setting and proposed a phased algorithm, XY-Adaptive, that learns from previous observations without suffering from the dimensionality of the problem. Since this is one of the first works to analyze pure-exploration problems in the linear-bandit setting, it opens the way to a large number of similar problems already studied in the MAB setting. For instance, we can investigate strategies to identify the best linear arm with a limited budget, or study best-arm identification when the set of arms is very large (or infinite).
Some interesting extensions also emerge from the optimal experimental design literature, such as the study of sampling strategies for meeting the G-optimality criterion when the noise is heteroscedastic, or the design of efficient strategies for satisfying other related optimality criteria, such as V-optimality.

Acknowledgments This work was supported by the French Ministry of Higher Education and Research, Nord-Pas de Calais Regional Council and FEDER through the "Contrat de Projets Etat Region 2007–2013", and European Community's Seventh Framework Programme under grant agreement no. 270327 (project CompLACS).

Figure 4: The sampling budget needed to identify the best arm, when the dimension grows from ℝ² to ℝ¹⁰.

References
[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), 2011.
[2] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Conference on Learning Theory (COLT), 2010.
[3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[4] Mustapha Bouhtou, Stephane Gaubert, and Guillaume Sagnol. Submodularity and randomized rounding techniques for optimal experimental design. Electronic Notes in Discrete Mathematics, 36:679–686, 2010.
[5] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT), 2009.
[6] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan.
Multiple identifications in multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), pages 258–265, 2013.
[7] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, December 2006.
[8] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
[9] Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Sébastien Bubeck. Multi-bandit best arm identification. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), pages 2222–2230, 2011.
[10] Matthew D. Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 365–374, 2014.
[11] Kevin G. Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the 27th Conference on Learning Theory (COLT), 2014.
[12] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Proceedings of the 26th Conference on Learning Theory (COLT), pages 228–251, 2013.
[13] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
[14] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation.
In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661–670, 2010.
[15] Friedrich Pukelsheim. Optimal Design of Experiments. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 2006.
[16] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, pages 527–535, 1952.
[17] Guillaume Sagnol. Approximation of a maximum-submodular-coverage problem involving spectral functions, with application to experimental designs. Discrete Applied Mathematics, 161(1-2):258–276, January 2013.
[18] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. Technical report, http://arxiv.org/abs/1409.6110.
[19] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 1081–1088, 2006.