{"title": "Sequential Experimental Design for Transductive Linear Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 10667, "page_last": 10677, "abstract": "In this paper we introduce the pure exploration transductive linear bandit problem: given a set of measurement vectors $\\mathcal{X}\\subset \\mathbb{R}^d$, a set of items $\\mathcal{Z}\\subset \\mathbb{R}^d$, a fixed confidence $\\delta$, and an unknown vector $\\theta^{\\ast}\\in \\mathbb{R}^d$, the goal is to infer $\\arg\\max_{z\\in \\mathcal{Z}} z^\\top\\theta^\\ast$ with probability $1-\\delta$ by making as few sequentially chosen noisy measurements of the form $x^\\top\\theta^{\\ast}$ as possible. When $\\mathcal{X}=\\mathcal{Z}$, this setting generalizes linear bandits, and when $\\mathcal{X}$ is the standard basis vectors and $\\mathcal{Z}\\subset \\{0,1\\}^d$, combinatorial bandits. The transductive setting naturally arises when the set of measurement vectors is limited due to factors such as availability or cost. As an example, in drug discovery the compounds and dosages $\\mathcal{X}$ a practitioner may be willing to evaluate in the lab in vitro due to cost or safety reasons may differ vastly from those compounds and dosages $\\mathcal{Z}$ that can be safely administered to patients in vivo. Alternatively, in recommender systems for books, the set of books $\\mathcal{X}$ a user is queried about may be restricted to known best-sellers even though the goal might be to recommend more esoteric titles $\\mathcal{Z}$. In this paper, we provide instance-dependent lower bounds for the transductive setting, an algorithm that matches these up to logarithmic factors, and an evaluation. 
In particular, we present the first non-asymptotic algorithm for linear bandits that nearly achieves the information-theoretic lower bound.", "full_text": "Sequential Experimental Design for Transductive\n\nLinear Bandits\n\nTanner Fiez\n\nElectrical & Computer Engineering\n\nUniversity of Washington\n\n\ufb01ezt@uw.edu\n\nAllen School of Computer Science & Engineering\n\nLalit Jain\u21e4\n\nUniversity of Washington\nlalitj@cs.washington.edu\n\nAllen School of Computer Science & Engineering\n\nKevin Jamieson\n\nUniversity of Washington\n\njamieson@cs.washington.edu\n\nLillian Ratliff\n\nElectrical & Computer Engineering\n\nUniversity of Washington\n\nratlif\ufb02@uw.edu\n\nAbstract\n\nIn this paper we introduce the pure exploration transductive linear bandit problem:\ngiven a set of measurement vectors X\u21e2 Rd, a set of items Z\u21e2 Rd, a \ufb01xed\ncon\ufb01dence , and an unknown vector \u2713\u21e4 2 Rd, the goal is to infer argmaxz2Z z>\u2713\u21e4\nwith probability 1 by making as few sequentially chosen noisy measurements\nof the form x>\u2713\u21e4 as possible. When X = Z, this setting generalizes linear bandits,\nand when X is the standard basis vectors and Z\u21e2{ 0, 1}d, combinatorial bandits.\nThe transductive setting naturally arises when the set of measurement vectors is\nlimited due to factors such as availability or cost. As an example, in drug discovery\nthe compounds and dosages X a practitioner may be willing to evaluate in the lab\nin vitro due to cost or safety reasons may differ vastly from those compounds and\ndosages Z that can be safely administered to patients in vivo. Alternatively, in\nrecommender systems for books, the set of books X a user is queried about may be\nrestricted to known best-sellers even though the goal might be to recommend more\nesoteric titles Z. In this paper, we provide instance-dependent lower bounds for\nthe transductive setting, an algorithm that matches these up to logarithmic factors,\nand an evaluation. 
In particular, we present the first non-asymptotic algorithm for linear bandits that nearly achieves the information-theoretic lower bound.

1 Introduction

In content recommendation or property optimization in the physical sciences, often there is a set of items (e.g., products to purchase, drugs) described by a set of feature vectors Z ⊂ R^d, and the goal is to find the z ∈ Z that maximizes some response or property (e.g., affinity of a user to the product, a drug combating disease). A natural model for these settings is to assume that there is an unknown vector θ* ∈ R^d and that the expected response to any item z ∈ Z, if evaluated, is equal to zᵀθ*. However, we often cannot measure zᵀθ* directly, but we may infer it transductively through some potentially noisy probes. That is, given a finite set of probes X ⊂ R^d, we observe xᵀθ* + η for any x ∈ X, where η is independent, mean-zero, sub-Gaussian noise. Given a set of measurements {(x_i, r_i)}_{i=1}^N, one can construct the least squares estimator θ̂ = argmin_θ Σ_{i=1}^N (r_i − x_iᵀθ)² and then use θ̂ as a plug-in estimate for θ* to estimate the optimal z* := argmax_{z∈Z} zᵀθ*. However, the accuracy of such a plug-in estimator depends critically on the number and choice of probes used to construct θ̂. Unfortunately, the optimal allocation of probes cannot be decided a priori: it must be chosen sequentially and adapt to the observations in real time to optimize the accuracy of the prediction.

*Contribution shared equally among T. Fiez and L. Jain.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

If the probing vectors (arms) X are equal to the item vectors Z, this problem is known as pure exploration for linear bandits, which is considered in [21, 30, 31, 33]. 
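The plug-in pipeline just described (least-squares fit, then argmax over Z) is short enough to sketch. This is a generic illustration on made-up data, not code from the paper:

```python
import numpy as np

def ols_plugin_best_item(X_probes, rewards, Z):
    """Fit the least-squares estimate theta_hat from probe/reward pairs,
    then return the index of the item in Z predicted to be best."""
    theta_hat, *_ = np.linalg.lstsq(X_probes, rewards, rcond=None)
    return int(np.argmax(Z @ theta_hat)), theta_hat

# Synthetic check: theta* = e1, each standard-basis probe pulled 500 times.
rng = np.random.default_rng(0)
d, reps = 4, 500
theta_star = np.eye(d)[0]
X_probes = np.tile(np.eye(d), (reps, 1))
rewards = X_probes @ theta_star + rng.normal(size=d * reps)  # noisy x^T theta*
Z = np.array([[1.0, 0, 0, 0], [0.9, 0.4, 0, 0], [0, 0, 1.0, 0]])
best, theta_hat = ols_plugin_best_item(X_probes, rewards, Z)
```

As the text stresses, how well θ̂ identifies the argmax depends entirely on which probes were pulled; choosing them is the subject of the rest of the paper.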
This naturally arises in content recommendation: for example, if X = Z is a feature representation of songs and θ* represents a user's music preferences, a music recommendation system can elicit the preference for a particular song z ∈ Z directly by enqueuing it in the user's playlist. However, oftentimes there are constraints on which items in Z can be shown to the user.

1. X ⊂ Z. Consider a whiskey bar with hundreds of whiskies ranging in price from dollars a shot to hundreds of dollars. The bartender may have an implicit feature representation of each whiskey, the patron has an implicit preference vector θ*, and the bartender wants to select the affordable whiskies X ⊂ Z for a taste test to get an idea of the patron's preferences before recommending the expensive whiskies that optimize the patron's preferences in Z.

2. Z ⊂ X. In drug discovery, thousands of compounds are evaluated in order to determine which ones are effective at combating a disease. However, while Z is the set of compounds and doses that are approved for medical use (e.g., safe), it may be advantageous to test even unsafe compounds or dosages X such that X ⊋ Z. Such unsafe X may aid in predicting the optimal z* ∈ Z because they provide more information about θ*.

3. Z ∩ X = ∅. Consider a user shopping for a home among a set Z where each home is parameterized by a number of factors like distance to work, school quality, crime rate, etc., so that each z ∈ Z can be described as a linear combination of the relevant factors described by X: z = Σ_{x∈X} α_{z,x} x, where we may take each x ∈ X to simply be one-hot encoded. The response xᵀθ* + η reflects the user's preference for the query x, a specific attribute of the house. Indeed, if all α_{z,x} ∈ {0, 1}, this is known as pure exploration for combinatorial bandits [10, 8]. 
That is, a house either has the attribute, or not.

Given items Z, measurement probes X, a confidence δ, and an unknown θ*, this paper develops algorithms that sequentially decide which measurements in X to take in order to minimize the number of measurements necessary to determine z* with high probability.

1.1 Contributions

Our goals are broadly to first define the transductive bandit problem and then characterize the instance-optimal sample complexity for this problem. Our contributions include the following.

1. In Section 2 we provide instance-dependent lower bounds for the transductive bandit problem that simultaneously generalize previously known lower bounds for linear bandits and combinatorial bandits using standard arguments.

2. In Section 3 we give an algorithm (Algorithm 1) for transductive linear bandits and prove an associated sample complexity result (Theorem 2). We show that the sample complexity we obtain matches the lower bound up to logarithmic factors. This is the primary contribution of the paper. Along the way, we discuss how rounding procedures can be used to improve upon the computational complexity of this algorithm.

3. In Sections 4 and 5 we contrast our algorithm with previous work from a theoretical and an empirical perspective, respectively. Our experiments show that our theoretically superior algorithm is empirically competitive with previous algorithms on a range of problem scenarios.

1.2 Notation

For each z ∈ Z define the gap of z as Δ(z) = (z* − z)ᵀθ*, and furthermore let Δmin = min_{z≠z*} Δ(z). If A ∈ R^{d×d} is a positive semidefinite matrix and y ∈ R^d is a vector, let ‖y‖²_A := yᵀAy denote the induced semi-norm. Let Δ_X := {λ ∈ R^{|X|} : λ ≥ 0, Σ_{x∈X} λ_x = 1} denote the set of probability distributions on X. 
Taking S ⊆ Z to be a subset of the item set, we define two operators: Y(S) = {z − z' : z, z' ∈ S, z ≠ z'}, the directions obtained from the differences between each pair of items, and Y*(S) = {z* − z : z ∈ S \ {z*}}, the directions obtained from the differences between the optimal item and each suboptimal item. Finally, for an arbitrary set of vectors V ⊆ R^d, define ρ(V) = min_{λ∈Δ_X} max_{v∈V} ‖v‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹}. This quantity will be crucial in the discussion of our sample complexity, and it is motivated in Section 2.2.

2 Transductive Linear Bandits Problem

Consider known finite collections of d-dimensional vectors X ⊂ R^d and Z ⊂ R^d, a known confidence δ ∈ (0, 1), and an unknown θ* ∈ R^d. The objective is to identify z* = argmax_{z∈Z} zᵀθ* with probability at least 1 − δ while taking as few measurements in X as possible. Formally, a transductive linear bandits algorithm is described by a selection rule X_t ∈ X at each time t given the history (X_s, R_s)_{s<t}, a stopping time τ, and a recommendation ẑ. Upon selecting X_t, the algorithm observes R_t = X_tᵀθ* + η_t, where η_t is independent, zero-mean, and 1-sub-Gaussian. Let P_{θ*}, E_{θ*} denote the probability law of R_t | F_{t−1} for all t.

Definition 1. We say that an algorithm for a transductive bandit problem is δ-PAC for X, Z ⊂ R^d if for all θ* ∈ R^d we have P_{θ*}(ẑ = z*) ≥ 1 − δ.

2.1 Optimal allocations

In this section we discuss a number of ways we can allocate a measurement budget to the different arms. The following establishes a lower bound on the expected number of samples any δ-PAC algorithm must take.

Theorem 1. Assume η_t ~iid N(0, 1) for all t. Then for any δ ∈ (0, 1), any δ-PAC algorithm must satisfy

  E_{θ*}[τ] ≥ log(1/2.4δ) · min_{λ∈Δ_X} max_{z∈Z\{z*}} ‖z* − z‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹} / ((z* − z)ᵀθ*)².

This lower bound is proved in Appendix C using standard techniques and employs the transportation inequality of [22]. It generalizes a previous lower bound in the setting of linear bandits [29] and lower bounds in the combinatorial bandit literature [10].

Optimal static allocation. To demonstrate that this lower bound is tight, define

  λ* := argmin_{λ∈Δ_X} max_{z∈Z\{z*}} ‖z* − z‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹} / ((z* − z)ᵀθ*)²  and  ρ* := max_{z∈Z\{z*}} ‖z* − z‖²_{(Σ_{x∈X} λ*_x x xᵀ)⁻¹} / ((z* − z)ᵀθ*)²,  (1)

where ρ* is the value of the lower bound and λ* is the allocation that achieves it. Suppose we sample arm x ∈ X exactly 2⌊λ*_x N⌋ times, where we assume² that N ∈ N is sufficiently large so that min_{x : λ*_x > 0} ⌊λ*_x N⌋ > 0. If N = ⌈2ρ* log(|Z|/δ)⌉ then, as we will show shortly (Section 2.2), the least squares estimator θ̂ satisfies (z* − z)ᵀθ̂ > 0 for all z ∈ Z \ {z*} with probability at least 1 − δ. Thus, with probability at least 1 − δ, z* is equal to ẑ = argmax_{z∈Z} zᵀθ̂, and the total number of samples is bounded by 2N, which is within 4 log(|Z|) of the lower bound. Unfortunately, of course, the allocation λ* relies on knowledge of θ* (which determines z*), which is unknown a priori, and thus this is not a realizable strategy.

Other static allocations. Short of λ*, it is natural to consider allocations that arise from optimal linear experimental design [27]. For the special case of X = Z, it has been argued ad nauseam that a G-optimal design, argmin_{λ∈Δ_X} max_{x∈X} ‖x‖²_{(Σ_{x'∈X} λ_{x'} x' x'ᵀ)⁻¹}, is woefully loose since it does not utilize the differences x − x', x, x' ∈ X [25, 30, 33]. Also for the X = Z case, [34, 30] have proposed the static XY-allocation given as argmin_{λ∈Δ_X} max_{x,x'∈X} ‖x − x'‖²_{(Σ_{x''∈X} λ_{x''} x'' x''ᵀ)⁻¹}. 
In [30] it is shown that no more than O(d/Δ²min · log(|X| log(1/Δmin)/δ)) samples from each of these allocations suffice to identify the best arm. While the above discussion demonstrates that for every θ* there exists an optimal static allocation (that explicitly uses θ*) nearly achieving the lower bound, any static allocation with no prior knowledge of θ* can require a factor of d more samples than necessary.

Proposition 1. Let c, c' be universal constants. For any δ > 0 and d even, there exist sets X = Z ⊂ R^d and a set Θ ⊂ R^d such that inf_A max_{θ∈Θ} E_θ[τ] ≥ c d log(1/δ)/Δ²min, where A is the set of all algorithms that are δ-PAC for X, Z and take a static allocation of samples. On the other hand, ρ*/c' ≤ d + 1 for every choice of θ* ∈ Θ.

²Such an assumption is avoided by a sophisticated rounding procedure that we will describe shortly.

This proposition indicates that it is necessary to devise an adaptive algorithm to obtain an instance-optimal sample complexity. The proof of this proposition can be found in Appendix D.

Adaptive allocations. As suggested by the problem definition, our strategy is to adapt the allocation over time, informed by the observations up to the current time. Specifically, our algorithm will proceed in rounds where at round t we perform an XY-allocation that is sufficient to remove all items z ∈ Z that have gaps of at least 2^{−(t−1)}. We show that the total number of measurements accumulates to ρ* log(|Z|²/δ) times some additional logarithmic factors, nearly achieving the optimal allocation as well as the lower bound. 
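For intuition, the oracle allocation λ* and its value ρ* can be approximated numerically. The sketch below runs plain Frank-Wolfe on the objective max_z ‖z* − z‖² (in the inverse design covariance) divided by the squared gap; the step size and iteration count are our own illustrative choices, not the paper's solver:

```python
import numpy as np

def oracle_allocation(X, Z, theta, iters=3000):
    """Frank-Wolfe sketch of the lower-bound allocation: minimize over the
    simplex the max over suboptimal z of
        ||z*-z||^2_{A(lam)^-1} / ((z*-z)^T theta)^2,
    with A(lam) = sum_x lam_x x x^T.  Returns (lam, achieved value)."""
    X, Z = np.asarray(X, float), np.asarray(Z, float)
    zstar = Z[np.argmax(Z @ theta)]
    Y = np.array([zstar - z for z in Z if not np.allclose(z, zstar)])
    gaps2 = (Y @ theta) ** 2
    lam = np.full(len(X), 1.0 / len(X))
    for t in range(iters):
        Ainv = np.linalg.inv(X.T @ (lam[:, None] * X))
        vals = np.einsum('ij,jk,ik->i', Y, Ainv, Y) / gaps2
        y = Y[np.argmax(vals)]              # currently hardest direction
        k = int(np.argmax((X @ (Ainv @ y)) ** 2))   # steepest-descent vertex
        gamma = 2.0 / (t + 3)
        lam = (1 - gamma) * lam + gamma * np.eye(len(X))[k]
    Ainv = np.linalg.inv(X.T @ (lam[:, None] * X))
    rho = float(np.max(np.einsum('ij,jk,ik->i', Y, Ainv, Y) / gaps2))
    return lam, rho
```

On the classic instance X = Z = {e1, e2, (cos 0.1, sin 0.1)} with θ* = 2e1, the allocation concentrates on e2, the arm most aligned with the hard direction e1 − x'.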
In Section 4, we review related procedures for the specific case of X = Z.

2.2 Review of Least Squares

Given a fixed design x_T = (x_t)_{t=1}^T with each x_t ∈ X and associated rewards (r_t)_{t=1}^T, a natural approach is to construct the ordinary least squares (OLS) estimate θ̂ = (Σ_{t=1}^T x_t x_tᵀ)⁻¹ (Σ_{t=1}^T r_t x_t). One can show θ̂ is unbiased with covariance (Σ_{t=1}^T x_t x_tᵀ)⁻¹. Moreover, for any y ∈ R^d, we have³

  P( yᵀ(θ* − θ̂) ≥ √( ‖y‖²_{(Σ_{t=1}^T x_t x_tᵀ)⁻¹} · 2 log(1/δ) ) ) ≤ δ.  (2)

In particular, if we want this to hold for all y ∈ Y*(Z), we need to union bound over Z, replacing δ with δ/|Z|. Let us now use this to analyze the procedure discussed above (in the discussion on the optimal static allocation after Theorem 1) that gives an allocation matching the lower bound. With the choice of N = ⌈2ρ* log(|Z|/δ)⌉ and the allocation 2⌊λ*_x N⌋ for each x ∈ X, we have for each z ∈ Z \ {z*} that, with probability at least 1 − δ,

  (z* − z)ᵀθ̂ ≥ (z* − z)ᵀθ* − √( ‖z* − z‖²_{(Σ_{x∈X} 2⌊Nλ*_x⌋ x xᵀ)⁻¹} · 2 log(|Z|/δ) ) ≥ 0

since for each y = z* − z ∈ Y*(Z) we have

  yᵀ(Σ_{x∈X} 2⌊Nλ*_x⌋ x xᵀ)⁻¹ y ≤ yᵀ(Σ_{x∈X} λ*_x x xᵀ)⁻¹ y / N ≤ ((z* − z)ᵀθ*)² / (2 log(|Z|/δ)),  (3)

where the last inequality plugs in the value of N and the definition of ρ*. The fact that at most one z' ∈ Z can satisfy (z' − z)ᵀθ̂ > 0 for all z ≠ z' ∈ Z, and that z' = z* does, certifies that ẑ = argmax_{z∈Z} zᵀθ̂ is indeed the best item with probability at least 1 − δ. Note that equation (3) provides the motivation for how the form of λ* is obtained. Rearranging, it is equivalent to

  N ≥ 2 log(|Z|/δ) · max_{z∈Z\{z*}} ‖z* − z‖²_{(Σ_{x∈X} λ*_x x xᵀ)⁻¹} / ((z* − z)ᵀθ*)².

Thinking of the right-hand side of the inequality as a function of λ, λ* is precisely chosen to minimize this quantity and hence the sample complexity.

2.3 Rounding Procedures

We briefly digress to address a technical issue. Given an allocation λ and an arbitrary subset of vectors Y, in general, drawing N samples x_N := {x_1, . . . , x_N} at random from X according to the distribution λ may result in a design where max_{y∈Y} ‖y‖²_{(Σ_{t=1}^N x_t x_tᵀ)⁻¹} (which appears in the width of the confidence interval (2)) differs significantly from max_{y∈Y} ‖y‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹} / N. Naive strategies for choosing x_N will fail. We cannot simply use an allocation of N λ_x samples for any specific x, since this may not be an integer. Furthermore, greedily rounding N λ_x to an allocation ⌊N λ_x⌋ or ⌈N λ_x⌉ may result in fewer than necessary, or far more than N total samples if the support of λ is large. However, given ε > 0, there are efficient rounding procedures that produce (1 + ε) approximations as long as N is greater than some minimum number of samples r(ε). In short, given λ and a choice of N, they return an allocation x_N satisfying max_{y∈Y} ‖y‖²_{(Σ_{i=1}^N x_i x_iᵀ)⁻¹} ≤ (1 + ε) max_{y∈Y} ‖y‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹} / N. Such a procedure with r(ε) ≤ O(d/ε²) is described in Section B in the supplementary. In our experiments we use a rounding procedure from [27] that is easier to implement, with r(ε) = 2‖λ‖₀/ε ≤ (d(d + 1) + 2)/ε.

³There is a technical issue of whether the set Z lies in the span of X, which in general is necessary to obtain unbiased estimates of (z* − z)ᵀθ*. Throughout the following we assume that span(X) = R^d. 
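The rounding step is easy to sketch. The following is one standard efficient-apportionment scheme in the spirit of [27] (our simplified rendering, not the exact routine used in the experiments): start from ceilings of (N − k/2)λ_x on the support of size k, then increment or decrement arms until the counts sum to N:

```python
import math
import numpy as np

def round_allocation(lam, N):
    """Efficient-apportionment sketch: turn a design lam on the simplex into
    integer pull counts summing to N, preserving the support of lam.
    Assumes N is large enough that the initial ceilings are sensible."""
    support = np.flatnonzero(lam > 0)
    k = len(support)
    counts = np.zeros(len(lam), dtype=int)
    counts[support] = [math.ceil((N - k / 2) * lam[i]) for i in support]
    while counts.sum() < N:                 # too few: grow the neediest arm
        j = support[np.argmin(counts[support] / lam[support])]
        counts[j] += 1
    while counts.sum() > N:                 # too many: shrink the greediest arm
        j = support[np.argmax((counts[support] - 1) / lam[support])]
        counts[j] -= 1
    return counts
```

For example, lam = (0.5, 0.3, 0.2) with N = 10 rounds to pull counts (5, 3, 2).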
In general, ε should be thought of as a constant. The number of samples N we need to take in our algorithm will be significantly larger than r(ε), so the impact of the rounding procedure is minimal. We provide details on this rounding procedure in Section B of the supplementary (also see [30, Appendix C]).

3 Sequential Experimental Design for Transductive Linear Bandits

Our algorithm for the pure exploration transductive bandit is presented in Algorithm 1. The algorithm proceeds in rounds, keeping track of the active items Ẑ_t ⊆ Z in each round t. At the start of round t, the algorithm samples in such a way as to remove all items with gaps greater than 2^{−(t−1)}. Thus, denoting S_t := {z ∈ Z : Δ(z) ≤ 4 · 2^{−t}}, in round t we expect Ẑ_t ⊆ S_t.

As described above, if we knew θ*, we would sample according to the optimal allocation argmin_{λ∈Δ_X} max_{z∈Ẑ_t} ‖z* − z‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹} / ((z* − z)ᵀθ*)². However, at the start of the round we only have an upper bound on the gaps, Δ(z) ≤ 4 · 2^{−t}, and we do not know z*. Using the triangle inequality we obtain max_{z∈Ẑ_t} ‖z* − z‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹} ≤ 4 max_{y∈Y(Ẑ_t)} ‖y‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹}, and we can lower-bound the squared gaps by (2^{−t})², so the objective is bounded by a constant times 2^{2t} min_{λ∈Δ_X} max_{y∈Y(Ẑ_t)} ‖y‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹}. This motivates our choices of λ_t and ρ(Y(Ẑ_t)). Thus, by the same logic used in Section 2.2, N_t = ⌈2 · 2^{2t} (1 + ε) ρ(Y(Ẑ_t)) log(|Z|²/δ_t)⌉ samples suffice to guarantee that we can construct a confidence interval of width at most 2^{−t} on each (z − z')ᵀθ* for (z − z') ∈ Y(Ẑ_t), with the |Z|² in the logarithm accounting for a union bound over pairs of items and the (1 + ε) accounting for slack from the rounding procedure. Finally, these confidence intervals allow us to provably remove any item z ∈ Ẑ_t such that Δ(z) > 2^{−(t−1)} in round t.

Algorithm 1: RAGE(X, Z, ε, r(·), δ): Randomized Adaptive Gap Elimination
Input: Arms X ⊂ R^d, items Z ⊂ R^d, rounding approximation factor ε with default value 1/10, function r(·) giving the minimum number of samples to obtain rounding approximation ε, and confidence level δ ∈ (0, 1).
Initialize: Let Ẑ_1 ← Z, t ← 1.
while |Ẑ_t| > 1 do
  δ_t ← δ/t²
  λ_t ← argmin_{λ∈Δ_X} max_{y∈Y(Ẑ_t)} ‖y‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹}
  ρ(Y(Ẑ_t)) ← min_{λ∈Δ_X} max_{y∈Y(Ẑ_t)} ‖y‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹}
  N_t ← max{⌈2 · 2^{2t} ρ(Y(Ẑ_t))(1 + ε) log(|Z|²/δ_t)⌉, r(ε)}
  x_{N_t} ← ROUND(λ_t, N_t)
  Pull arms x_1, . . . , x_{N_t} and obtain rewards r_1, . . . , r_{N_t}
  Compute θ̂_t = A_t⁻¹ b_t using A_t := Σ_{j=1}^{N_t} x_j x_jᵀ and b_t := Σ_{j=1}^{N_t} x_j r_j
  Ẑ_{t+1} ← Ẑ_t \ {z ∈ Ẑ_t : ∃ z' ∈ Ẑ_t with ‖z' − z‖_{A_t⁻¹} √(2 log(|Z|²/δ_t)) < (z' − z)ᵀθ̂_t}
  t ← t + 1
Output: Ẑ_t

Theorem 2. Assume that max_{z∈Z} Δ(z) ≤ 2. Then with probability at least 1 − δ, using an ε-efficient rounding procedure, Algorithm 1 returns z* and requires a worst-case sample complexity of

  N ≤ Σ_{t=1}^{⌊log₂(4/Δmin)⌋} max{⌈2 · 2^{2t} ρ(Y(S_t))(1 + ε) log(t²|Z|²/δ)⌉, r(ε)},

where S_t = {z ∈ Z : Δ(z) ≤ 4 · 2^{−t}}. In particular, ROUND can be chosen so that r(ε) = O(d/ε²). Furthermore, N ≤ c ρ* log₂(4/Δmin) log(|Z|² log₂(4/Δmin)²/δ) + r(ε) log₂(4/Δmin) for some absolute constant c; in other words, Algorithm 1 is instance optimal up to logarithmic factors.

We provide a proof of the sample complexity bound in Section A. 
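To make the round structure concrete, here is a compressed, self-contained sketch of the algorithm. It is our simplification, not the paper's implementation: a short Frank-Wolfe loop stands in for the exact design solver, multinomial sampling stands in for ROUND, a tiny ridge guards numerics, and the constants are looser than Theorem 2 requires:

```python
import numpy as np

def rage_sketch(X, Z, theta_star, delta=0.05, max_rounds=10, seed=0):
    """Round t: compute a design for the active differences Y(Z_t),
    pull ~2^{2t} * rho * log(.) samples, then eliminate items whose
    estimated deficit to the leader exceeds 2^{-t}."""
    rng = np.random.default_rng(seed)
    X, Z = np.asarray(X, float), np.asarray(Z, float)
    d = X.shape[1]
    active = list(range(len(Z)))
    best, total = active[0], 0
    for t in range(1, max_rounds + 1):
        if len(active) == 1:
            return active[0], total
        Y = np.array([Z[i] - Z[j] for i in active for j in active if i != j])
        lam = np.full(len(X), 1.0 / len(X))
        for s in range(400):                 # Frank-Wolfe for rho(Y(Z_t))
            Ainv = np.linalg.inv(X.T @ (lam[:, None] * X))
            y = Y[np.argmax(np.einsum('ij,jk,ik->i', Y, Ainv, Y))]
            k = int(np.argmax((X @ (Ainv @ y)) ** 2))
            lam += (2.0 / (s + 3)) * (np.eye(len(X))[k] - lam)
        Ainv = np.linalg.inv(X.T @ (lam[:, None] * X))
        rho = float(np.max(np.einsum('ij,jk,ik->i', Y, Ainv, Y)))
        eps = 2.0 ** -t
        N = int(np.ceil(8 * rho * np.log(len(Z) ** 2 * t ** 2 / delta) / eps ** 2))
        counts = rng.multinomial(N, lam / lam.sum())      # stand-in for ROUND
        A = X.T @ (counts[:, None] * X) + 1e-9 * np.eye(d)  # ridge: rank guard
        b = np.zeros(d)
        for x, c in zip(X, counts):          # simulate noisy pulls of arm x
            if c:
                b += x * (c * (x @ theta_star) + rng.normal(size=c).sum())
        theta_hat = np.linalg.solve(A, b)
        total += N
        best = max(active, key=lambda i: Z[i] @ theta_hat)
        active = [i for i in active if (Z[best] - Z[i]) @ theta_hat <= eps]
    return best, total
```

On a small instance the sketch recovers the best item; the content of Theorem 2 is that the total pull count of the real algorithm tracks ρ* up to logarithmic factors.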
The primary novelty in our analysis is in quantifying the relationship between the algorithm's sample complexity and the lower bound.

⁴Recall that for any subset S ⊆ Z, Y(S) := {z − z' : z, z' ∈ S} and for an arbitrary subset V ⊆ R^d we have ρ(V) = min_{λ∈Δ_X} max_{v∈V} ‖v‖²_{(Σ_{x∈X} λ_x x xᵀ)⁻¹}.

3.1 Interpreting the sample complexity

Up to logarithmic factors, Algorithm 1 matches the lower bound obtained in Theorem 1. However, the term ρ(Y(S_t)) may seem a bit mysterious. In this section we try to interpret this quantity in terms of the geometry of X and Z.

Let conv(X ∪ −X) denote the convex hull of X ∪ −X, and for any set Y ⊆ R^d define the gauge of Y,

  γ_Y = max{c > 0 : cY ⊆ conv(X ∪ −X)}.  (4)

In the case where Y is a singleton Y = {y}, γ(y) := γ_Y is the gauge norm of y with respect to conv(X ∪ −X), a familiar quantity from convex analysis [28]. We can provide a natural upper bound for ρ(Y) in terms of the gauge.

Lemma 1. Let Y ⊆ R^d. Then

  max_{y∈Y} ‖y‖²₂ / (max_{x∈X} ‖x‖²₂) ≤ ρ(Y) ≤ d/γ²_Y.

In the case of a singleton Y = {y}, we can improve the upper bound to ρ(Y) ≤ 1/γ(y)².

The proof of this lemma is in Appendix E. To see the potential for adaptive gains, we focus on the case of linear bandits where X = Z. Consider an example with X = {e_i}_{i=1}^d ∪ {z'} for z' = (cos(α), sin(α), 0, · · · , 0) where α ∈ [0, 0.1), and θ* = e_1. Note that Δmin ≈ 1 − cos(α) ≈ α²/2. Then S_1 = X, and an easy computation shows ρ(Y(X)) is a constant bounded away from zero. After the first round, all arms except e_1 and z' will be removed, so Y(S_t) = {e_1 − z'} for t ≥ 2, and γ_{Y(S_t)} ≈ 1/sin(α) ≈ 1/α. 
Summing over all rounds, we see that this implies a sample complexity of Õ(d + 1/α²) up to log factors, which is a significant improvement over the static XY-allocation sample complexity of Õ(d/α²).

4 Related Work

When X = Z = {e_1, · · · , e_d} ⊂ R^d is the set of standard basis vectors, the problem reduces to the best-arm identification problem for multi-armed bandits, which has been extensively studied [14, 19, 20, 22, 11]. In addition, pure exploration for combinatorial bandits, where X = {e_1, · · · , e_d} ⊂ R^d and Z ⊆ {0, 1}^d, has also received a great deal of attention [10, 8, 12, 9]. In the setting of linear bandits when X = Z, despite a great deal of work in the regret and contextual settings [1, 26, 25, 13], there has been far less work on linear bandits for pure exploration. This problem was first introduced in [30] and since then there have been a few other works on this topic, [31, 21, 33], that we now discuss.

• Soare et al. [30] made the initial connections to G-optimal experimental design. That work provides the first passive algorithm, with a sample complexity of O(d/Δ²min · log(|X|/δ) + d²). Note that the d² comes from the minimum number of samples needed for an efficient rounding procedure and thus could be reduced to d using improved rounding procedures (see [2]). They also provide an adaptive algorithm, XY-adaptive, for linear bandits. Their algorithm is very similar to ours, with two notable differences. Firstly, instead of using an efficient rounding procedure, they use a greedy iterative scheme to compute an optimal allocation. Secondly, their algorithm does not discard items that are provably sub-optimal. As a result, their sample complexity (up to logarithmic factors) scales as max{M*, ρ*} log(|X|/(Δmin δ)) + d², where M* is defined (informally) as the number of samples needed using a static allocation to remove all sub-optimal directions in Y(X) \ Y*(X).

• In Tao et al. [31], the focus is on developing different estimators with the goal of removing the constant term d² in Soare et al.'s passive sample complexity. Instead of using a rounding procedure, they use an estimator different from the OLS estimator. Note that the rounding procedure in [2], described in the supplementary, could have been applied directly to Soare et al.'s static allocation algorithm, giving the same sample complexity as the one obtained in [31]. They also provide an adaptive algorithm, ALBA, that achieves a sample complexity of O(Σ_{i=1}^d 1/Δ²_i), where Δ_i is the i-th smallest gap of the vectors in X. It is easy to see that this sample complexity is not optimal: imagine a situation in which the vectors of X with the (d − 1) smallest gaps are identical to a vector x' ≠ x*. Then we only need to pay once for the samples needed to remove x', not (d − 1) times. Finally, their algorithms do not compute the optimal allocation over differences of vectors in X, but instead on X directly, à la G-optimal design. We will see the inefficiency of this strategy in the experiments.

• Karnin [21] provides an algorithm that uses repeated rounds (for probability amplification) of exploration phases combined with verification phases to provide an asymptotically optimal algorithm, meaning that as δ → 0 the sample complexity divided by log(1/δ) approaches ρ*. Though this is a nice theoretical result, the algorithm is not practical; the exploration phase is simply a naïve passive G-optimal design.

• In Xu et al. 
[33], a fully adaptive algorithm called LinGapE, inspired by the UGapE algorithm [15], is proposed. Since LinGapE is fully adaptive, a confidence bound allowing for dependence in the samples is necessary, and the authors employ the self-normalized bound of [1]. The algorithm requires each arm to be pulled once, an undesirable characteristic for a linear bandit algorithm since the structure of the problem allows information to be obtained about arms that are not pulled. A recent work [23] extends this algorithm to generalized linear models, where the expected reward of pulling arm z is given by a non-linear link function of zᵀθ*.

Finally, we mention [34], which considers transductive experimental design from a computational and optimization perspective, and explores the XY-allocation for arbitrary kernels.

5 Experiments

In this section, we present simulations for the linear bandit pure exploration problem and the general transductive bandit problem. We compare our proposed algorithm with both adaptive and non-adaptive strategies. The adaptive strategies are the XY-Adaptive allocation from [30], LinGapE from [33], and ALBA from [31]. The non-adaptive strategies are the static XY-allocation, as described in Section 2, and an oracle strategy that knows θ* and samples according to λ*. We do not compare to the algorithm given in [21] since it is primarily a theoretical contribution and in moderate-confidence regimes obtains only the non-adaptive sample complexity. We run each algorithm at a confidence level of δ = 0.05. The empirical failure probability of each of the algorithms in the simulations is zero. To compute the samples for RAGE, we first used the Frank-Wolfe algorithm (with a precise stopping condition given in the supplementary) to find λ_t, and then the rounding procedure from [27] with ε = 1/10. 
Further implementation details of RAGE and discussion pertaining to the implementation\nof the other algorithms can be found in the supplementary material in Section F. We remark here that\nin our implementation of the XY-Adaptive allocation, we follow the experiments in [30] and allow\nfor provably suboptimal arms to be discarded (though this is not how the algorithm is written in their\npaper). The resulting algorithm is then similar to our algorithm. Unless explicitly mentioned, noise\nin the observations was generated from a standard normal distribution.\nLinear bandits: benchmark example. The \ufb01rst experiment we present has become a benchmark in\nthe linear bandit pure exploration literature since it was introduced in [30]. In this problem, X =\nZ = {e1, . . . , ed, x0}\u21e2 Rd where ei is the i-th standard basis vector, x0 = cos(.01)e1 + sin(.01)e2,\nand \u2713\u21e4 = 2e1 so that x\u21e4 = x1. An ef\ufb01cient sampling strategy for this problem needs to focus on\nreducing uncertainty in the direction (x1 xd+1), which can be achieved by focusing pulls on arm\nx2 = e2 since it is most aligned with this direction.\nThe results for this experiment are shown in Fig. 1a. The RAGE algorithm performs competitively\nwith existing algorithms and the oracle allocation. The XY-Adaptive algorithm is similar to RAGE,\nbut with weaker theoretical guarantees, so naturally it performs nearly equivalently. We omit it from\nthe remaining experiments for this reason. The LinGapE algorithm performs well when the number\nof dimensions and arms is small. However, as the number of arms grows, LinGapE suffers from a\nworse dimension dependency in the con\ufb01dence interval. ALBA performs the worst of the recently\nproposed algorithms and this is to be expected since it computes an allocation on the X set instead of\non the Y(X ) set. This example clearly highlights the gains of adaptive sampling over non-adaptive\nallocations such as the static XY-allocation. 
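A quick numeric check of this claim (our own illustration, here with d = 5): compare the key variance term ‖x1 − x0‖² in the inverse design covariance under a uniform design versus one that concentrates its mass on e2.

```python
import numpy as np

d = 5
X = np.vstack([np.eye(d),
               np.cos(0.01) * np.eye(d)[0] + np.sin(0.01) * np.eye(d)[1]])
y = X[0] - X[d]                        # the hard direction x1 - x0

def width2(lam):
    """||y||^2 in the inverse design covariance: proportional to the
    variance of estimating y^T theta* under allocation lam."""
    A = X.T @ (lam[:, None] * X)
    return float(y @ np.linalg.solve(A, y))

uniform = np.full(d + 1, 1.0 / (d + 1))
focused = np.full(d + 1, 0.01)         # a little mass everywhere (invertibility)
focused[1] += 1.0 - focused.sum()      # ...and the rest on arm x2 = e2
```

The focused design yields a several-times smaller width, which is exactly why an efficient strategy pours pulls into e2 on this benchmark.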
However, since X is relatively small in this case, it fails to tease out important differences between the algorithms that can greatly increase the sample complexity. We construct examples to demonstrate these claims now.

Many arms with moderate gaps. In this example, for a given value of n ≥ 3, we construct a set of arms X ⊂ R², where X = Z = {e_1, cos(3π/4)e_1 + sin(3π/4)e_2} ∪ {cos(π/4 + φ_i)e_1 + sin(π/4 + φ_i)e_2}_{i=3}^n with φ_i ~ N(0, 0.09) for each i ∈ {3, . . . , n}. The parameter vector is fixed to be θ* = e_1, so that x_1 is the optimal arm, x_2 gives the most information to identify the optimal arm, and the remaining arms roughly point in the same direction with an expected gap of ≈ 0.3.

[Figure 1: (a) Benchmark, (b) Duplicate arms, (c) Uniform sphere, (d) Transductive.]

In Fig. 1b, we show the results of the experiment as we increase the number of arms. The LinGapE algorithm suffers from a linear scaling in the number of arms since it must sample each arm as an initialization. An efficient sampling strategy should focus energy on x_2, and as it does so, it will gain information about the arms that are nearly duplicates of each other, which is how RAGE performs.

Uniform distribution on a sphere. In this example, X = Z is sampled from a unit sphere of dimension d = 9 centered at the origin. Following [31], we select the two closest arms x, x' ∈ X and let θ* = x. In Fig. 1c, we show the sample complexity of the algorithms as the number of arms grows. The RAGE algorithm significantly outperforms ALBA, and this is primarily due to the fact that ALBA computes a G-optimal design on the active vectors in each round instead of on the differences between these vectors. Thus the ALBA sampling distribution can be focused on a very different set of arms from the optimal one.

Transductive example. We now present a general transductive bandit example. 
Since the existing algorithms in the linear bandit literature do not generalize to this problem, we compare with a static XY-allocation on X, Y(Z) and an oracle XY-allocation on X, Y*(Z) that knows the optimal arm and the gaps. We construct an example in R^d with d even where X = {e_1, . . . , e_d}. The set Z is also chosen so that |Z| = d: the first d/2 vectors are given by z_j = e_j for j ∈ {1, . . . , d/2}, and then z_{d/2+j} = cos(0.1)e_j + sin(0.1)e_{j+d/2} for each j ∈ {1, . . . , d/2}. Take θ* = e_1 so z_1 is the optimal arm. The results of this simulation are depicted in Fig. 1d. The RAGE algorithm significantly outperforms the static allocation and nearly matches the oracle allocation.

We now present examples motivated by real-world applications.

Multivariate testing example. In many experimental design settings, there is a series of D factors, each of which can be in one of N states, and the goal is to determine the treatment configuration that has the highest outcome for a given metric. As a concrete example in web page optimization, it is common that an advertisement layout may consist of several choices such as an image, background color, and keyword to display (e.g., [16]), and we seek to find the combination with the highest click-through rate. To formalize the problem, consider a webpage consisting of D distinct slots and suppose that there are 2 content choices that can be presented in each slot. Let the set W = {−1, 1}^D satisfying |W| = 2^D encode each layout. We model the problem using a factorial design (see, e.g., [6]) including pairwise interaction features to generate a linear bandit problem. Each layout is represented by an arm x ∈ X where X = Z ⊂ {−1, 1}^{1+D+D(D−1)/2} and |X| = 2^D. The expected reward of any x ∈ X corresponding to a layout w ∈ W is given by

x^⊤θ* = θ*_0 + α_1 Σ_{j=1}^{D} θ*_j w_j + α_2 Σ_{k=1}^{D} Σ_{ℓ=k+1}^{D} θ*_{k,ℓ} w_k w_ℓ,

where θ*_0 is a common bias weight, θ*_j is a weight for the j-th slot, and θ*_{k,ℓ} is a weight for the interaction between the content in the k-th and ℓ-th slots. We also include known parameters α_1 = 1 and α_2 = 0.5 that control the strength of the first and second order interactions, respectively. The weights of the parameter vector are drawn from a discrete uniform distribution over the range [−0.3, 0.3] with a granularity of 0.01. The results of this example are shown in Fig. 2a. The RAGE algorithm performs close to the oracle on this example, while the sample complexity of the rest of the algorithms grows as the number of arms and the dimension of the problem go up.

Figure 2: (a) Experimental Design, (b) Yahoo Example.

Click-through example. To conduct an experiment based on real data, we build a problem using the Yahoo! Webscope Dataset R6A.⁵ The dataset contains user click log records for news articles displayed uniformly at random on the Yahoo! front page between May 1st, 2009 and May 10th, 2009. Each click log record consists of a binary outcome along with 6 features identifying the user and 6 features identifying the article.

To build a linear bandit problem from the dataset, we construct an arm set X = Z ⊂ R^36 by taking the outer product of the user and article features for each click log record on May 1st, 2009. We then fit a regularized least squares estimate using a regularization parameter of 0.01 to obtain θ*. To model binary rewards, we let the observed reward be generated by a draw of a Bernoulli random variable with parameter x^⊤θ* for any arm selection x ∈ X. Since x^⊤θ* ∈ (0, 0.11) for all x ∈ X, the noise is bounded in [−1, 1] and is therefore 1-sub-Gaussian.
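The reward model just described can be sketched as follows; the feature dimensions match the text, but the features, parameter values, and helper names below are synthetic stand-ins rather than the Webscope data:

```python
# Sketch of the click-through reward model: an arm is the flattened outer
# product of 6 user features and 6 article features (d = 36), and an
# observation is a Bernoulli draw with mean x^T theta*. All values here
# are synthetic placeholders, not fit to the Yahoo! data.
import numpy as np

def make_arm(user_feats, article_feats):
    return np.outer(user_feats, article_feats).ravel()  # d = 36 arm vector

def pull(x, theta, rng):
    p = float(x @ theta)       # expected reward; lies in (0, 0.11) in the paper
    return rng.binomial(1, p)  # binary click outcome

rng = np.random.default_rng(1)
user = np.full(6, 1 / 6)
article = np.full(6, 1 / 6)
x = make_arm(user, article)    # entries sum to 1
theta = np.full(36, 0.05)      # so the click probability is p = 0.05
clicks = [pull(x, theta, rng) for _ in range(10_000)]
# Since each observation lies in {0, 1} with mean p, the noise r - p is
# bounded in [-p, 1 - p], a subset of [-1, 1], hence 1-sub-Gaussian.
```

Averaging many pulls of a fixed arm recovers its click probability, which is exactly the quantity the bandit algorithms estimate adaptively.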
We simulate the problem with 40 arms, including the arm with the maximum reward in the dataset; the remaining arms were selected at random from the set of arms with a gap of at least 0.01 from the optimal arm so that the problem is not too hard. The experimental setup is similar to that of [33] for this dataset. The results are presented in Fig. 2b. We see that the RAGE algorithm performs well on this example based on real-world data.

6 Conclusion

In this paper we have proposed the problem of best-arm identification for transductive linear bandits, and provided an algorithm together with upper and lower bounds that match up to logarithmic factors. As a remark, it is straightforward to exit our algorithm early with an ε-good arm. It still remains to develop anytime algorithms for this problem that do not throw out samples, as has been done in pure exploration for multi-armed bandits [19]. In addition, we suspect our algorithm actually matches the lower bound and the log(1/Δ_min) factor is unnecessary. Finally, it is possible that some of the ideas developed here extend to the setting of regret and could be used to give instance-dependent regret bounds for linear bandits [25]. We hope to explore connections to both the regret and fixed budget settings in future work.

⁵ https://webscope.sandbox.yahoo.com/

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal discrete optimization for experimental design: A regret minimization approach. arXiv preprint arXiv:1711.05174, 2017.

[3] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory, pages 41–53, 2010.

[4] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs.
Journal of Machine Learning Research, 3:397–422, 2002.

[5] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[6] George E. P. Box, J. Stuart Hunter, and William G. Hunter. Statistics for Experimenters. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, USA, 2005.

[7] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23–37, 2009.

[8] Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization. arXiv preprint arXiv:1711.08018, 2017.

[9] Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In Conference on Learning Theory, pages 647–669, 2016.

[10] Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. arXiv preprint arXiv:1706.01081, 2017.

[11] Lijie Chen and Jian Li. On the optimal sample complexity for best arm identification. arXiv preprint arXiv:1511.03774, 2015.

[12] Shouyuan Chen, Tian Lin, Irwin King, Michael R. Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014.

[13] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. 2008.

[14] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

[15] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric.
Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212–3220, 2012.

[16] Daniel N. Hill, Houssam Nassif, Yi Liu, Anand Iyer, and S. V. N. Vishwanathan. An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1813–1821. ACM, 2017.

[17] Matthew Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In International Conference on Artificial Intelligence and Statistics, pages 365–374, 2014.

[18] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.

[19] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.

[20] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.

[21] Zohar S. Karnin. Verification based solution for structured MAB problems. In Advances in Neural Information Processing Systems, pages 145–153, 2016.

[22] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1–42, 2016.

[23] Abbas Kazerouni and Lawrence M. Wein. Best arm identification in generalized linear bandits. arXiv preprint arXiv:1905.08224, 2019.

[24] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems.
Canadian Journal of Mathematics, 12:363–366, 1960.

[25] Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491, 2016.

[26] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[27] Friedrich Pukelsheim. Optimal Design of Experiments. SIAM, 2006.

[28] Ralph Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.

[29] Marta Soare. Sequential Resource Allocation in Linear Stochastic Bandits. PhD thesis, Université Lille 1 - Sciences et Technologies, 2015.

[30] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.

[31] Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pages 4884–4893, 2018.

[32] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135–166, 2004.

[33] Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 843–851, 2018.

[34] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning, pages 1081–1088.
ACM, 2006.