{"title": "Combinatorial Pure Exploration of Multi-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 379, "page_last": 387, "abstract": "We study the {\\em combinatorial pure exploration (CPE)} problem in the stochastic multi-armed bandit setting, where a learner explores a set of arms with the objective of identifying the optimal member of a \\emph{decision class}, which is a collection of subsets of arms with certain combinatorial structures such as size-$K$ subsets, matchings, spanning trees or paths, etc. The CPE problem represents a rich class of pure exploration tasks which covers not only many existing models but also novel cases where the object of interest has a non-trivial combinatorial structure. In this paper, we provide a series of results for the general CPE problem. We present general learning algorithms which work for all decision classes that admit offline maximization oracles in both fixed confidence and fixed budget settings. We prove problem-dependent upper bounds of our algorithms. Our analysis exploits the combinatorial structures of the decision classes and introduces a new analytic tool. We also establish a general problem-dependent lower bound for the CPE problem. Our results show that the proposed algorithms achieve the optimal sample complexity (within logarithmic factors) for many decision classes. In addition, applying our results back to the problems of top-$K$ arms identification and multiple bandit best arms identification, we recover the best available upper bounds up to constant factors and partially resolve a conjecture on the lower bounds.", "full_text": "Combinatorial Pure Exploration of\n\nMulti-Armed Bandits\n\nShouyuan Chen1\u21e4 Tian Lin2\n1The Chinese University of Hong Kong\n1{sychen,king,lyu}@cse.cuhk.edu.hk\n\nIrwin King1 Michael R. 
Lyu1 Wei Chen3\n\n2lint10@mails.tsinghua.edu.cn 3weic@microsoft.com\n\n2Tsinghua University\n\n3Microsoft Research Asia\n\nAbstract\n\nWe study the combinatorial pure exploration (CPE) problem in the stochastic multi-armed\nbandit setting, where a learner explores a set of arms with the objective of identifying\nthe optimal member of a decision class, which is a collection of subsets of arms with\ncertain combinatorial structures such as size-K subsets, matchings, spanning trees or paths,\netc. The CPE problem represents a rich class of pure exploration tasks which covers not\nonly many existing models but also novel cases where the object of interest has a non-\ntrivial combinatorial structure. In this paper, we provide a series of results for the general\nCPE problem. We present general learning algorithms which work for all decision classes\nthat admit of\ufb02ine maximization oracles in both \ufb01xed con\ufb01dence and \ufb01xed budget settings.\nWe prove problem-dependent upper bounds of our algorithms. Our analysis exploits the\ncombinatorial structures of the decision classes and introduces a new analytic tool. We also\nestablish a general problem-dependent lower bound for the CPE problem. Our results show\nthat the proposed algorithms achieve the optimal sample complexity (within logarithmic\nfactors) for many decision classes. In addition, applying our results back to the problems\nof top-K arms identi\ufb01cation and multiple bandit best arms identi\ufb01cation, we recover the\nbest available upper bounds up to constant factors and partially resolve a conjecture on the\nlower bounds.\nIntroduction\n\n1\nMulti-armed bandit (MAB) is a predominant model for characterizing the tradeoff between explo-\nration and exploitation in decision-making problems. 
Although this is an intrinsic tradeoff in many\ntasks, some application domains prefer a dedicated exploration procedure in which the goal is to\nidentify an optimal object among a collection of candidates and the reward or loss incurred during\nexploration is irrelevant. In light of these applications, the related learning problem, called pure ex-\nploration in MABs, has received much attention. Recent advances in pure exploration MABs have\nfound potential applications in many domains including crowdsourcing, communication network\nand online advertising.\nIn many of these application domains, a recurring problem is to identify the optimal object with\ncertain combinatorial structure. For example, a crowdsourcing application may want to \ufb01nd the best\nassignment from workers to tasks such that overall productivity of workers is maximized. A network\nrouting system during the initialization phase may try to build a spanning tree that minimizes the\ndelay of links, or attempts to identify the shortest path between two sites. An online advertising\nsystem may be interested in \ufb01nding the best matching between ads and display slots. The literature\nof pure exploration MAB problems lacks a framework that encompasses these kinds of problems\nwhere the object of interest has a non-trivial combinatorial structure. Our paper contributes such\na framework which accounts for general combinatorial structures, and develops a series of results,\nincluding algorithms, upper bounds and lower bounds for the framework.\nIn this paper, we formulate the combinatorial pure exploration (CPE) problem for stochastic multi-\narmed bandits. In the CPE problem, a learner has a \ufb01xed set of arms and each arm is associated with\nan unknown reward distribution. The learner is also given a collection of sets of arms called decision\nclass, which corresponds to a collection of certain combinatorial structures. 
During the exploration\nperiod, in each round the learner chooses an arm to play and observes a random reward sampled from\n\n\u21e4This work was done when the \ufb01rst two authors were interns at Microsoft Research Asia.\n\n1\n\n\fthe associated distribution. The objective is when the exploration period ends, the learner outputs a\nmember of the decision class that she believes to be optimal, in the sense that the sum of expected\nrewards of all arms in the output set is maximized among all members in the decision class.\nThe CPE framework represents a rich class of pure exploration problems. The conventional pure ex-\nploration problem in MAB, whose objective is to \ufb01nd the single best arm, clearly \ufb01ts into this frame-\nwork, in which the decision class is the collection of all singletons. This framework also naturally\nencompasses several recent extensions, including the problem of \ufb01nding the top K arms (henceforth\nTOPK) [18, 19, 8, 20, 31] and the multi-bandit problem of \ufb01nding the best arms simultaneously\nfrom several disjoint sets of arms (henceforth MB) [12, 8]. Further, this framework covers many\nmore interesting cases where the decision classes correspond to collections of non-trivial combina-\ntorial structures. For example, suppose that the arms represent the edges in a graph. Then a decision\nclass could be the set of all paths between two vertices, all spanning trees or all matchings of the\ngraph. And, in these cases, the objectives of CPE become identifying the optimal paths, spanning\ntrees and matchings through bandit explorations, respectively. To our knowledge, there are no results\navailable in the literature for these pure exploration tasks.\nThe CPE framework raises several interesting challenges to the design and analysis of pure explo-\nration algorithms. 
One challenge is that, instead of solving each type of CPE task in an ad-hoc way,\none requires a uni\ufb01ed algorithm and analysis that support different decision classes. Another chal-\nlenge stems from the combinatorial nature of CPE, namely that the optimal set may contain some\narms with very small expected rewards (e.g., it is possible that a maximum matching contains the\nedge with the smallest weight); hence, arms cannot be eliminated simply based on their own re-\nwards in the learning algorithm or ignored in the analysis. This differs from many existing approach\nof pure exploration MABs. Therefore, the design and analysis of algorithms for CPE demands novel\ntechniques which take both rewards and combinatorial structures into account.\nOur results. In this paper, we propose two novel learning algorithms for general CPE problem: one\nfor the \ufb01xed con\ufb01dence setting and one for the \ufb01xed budget setting. Both algorithms support a wide\nrange of decision classes in a uni\ufb01ed way. In the \ufb01xed con\ufb01dence setting, we present Combinatorial\nLower-Upper Con\ufb01dence Bound (CLUCB) algorithm. The CLUCB algorithm does not need to know\nthe de\ufb01nition of the decision class, as long as it has access to the decision class through a maximiza-\ntion oracle. We upper bound the number of samples used by CLUCB. This sample complexity bound\ndepends on both the expected rewards and the structure of decision class. Our analysis relies on a\nnovel combinatorial construction called exchange class, which may be of independent interest for\nother combinatorial optimization problems. Specializing our result to TOPK and MB, we recover\nthe best available sample complexity bounds [19, 13, 20] up to constant factors. While for other de-\ncision classes in general, our result establishes the \ufb01rst sample complexity upper bound. 
We further\nshow that CLUCB can be easily extended to the \ufb01xed budget setting and PAC learning setting and\nwe provide related theoretical guarantees in the supplementary material.\nMoreover, we establish a problem-dependent sample complexity lower bound for the CPE problem.\nOur lower bound shows that the sample complexity of the proposed CLUCB algorithm is optimal\n(to within logarithmic factors) for many decision classes, including TOPK, MB and the decision\nclasses derived from matroids (e.g., spanning tree). Therefore our upper and lower bounds provide\na nearly full characterization of the sample complexity of these CPE problems. For more general\ndecision classes, our results show that the upper and lower bounds are within a relatively benign\nfactor. To the best of our knowledge, there are no problem-dependent lower bounds known for pure\nexploration MABs besides the case of identifying the single best arm [24, 1]. We also notice that\nour result resolves the conjecture of Bubeck et al. [8] on the problem-dependent sample complexity\nlower bounds of TOPK and MB problems, for the cases of Gaussian reward distributions.\nIn the \ufb01xed budget setting, we present a parameter-free algorithm called Combinatorial Successive\nAccept Reject (CSAR) algorithm. We prove a probability of error bound of the CSAR algorithm. This\nbound can be shown to be equivalent to the sample complexity bound of CLUCB within logarithmic\nfactors, although the two algorithms are based on quite different techniques. Our analysis of CSAR\nre-uses exchange classes as tools. This suggests that exchange classes may be useful for analyzing\nsimilar problems. In addition, when applying the algorithm to back TOPK and MB, our bound\nrecovers the best known result in the \ufb01xed budget setting due to Bubeck et al. [8] up to constant\nfactors.\n\n2\n\n\f2 Problem Formulation\nIn this section, we formally de\ufb01ne the CPE problem. 
Suppose that there are n arms, numbered 1, 2, . . . , n. Assume that each arm e ∈ [n] is associated with a reward distribution φ_e. Let w = (w(1), . . . , w(n))^T denote the vector of expected rewards, where each entry w(e) = E_{X∼φ_e}[X] denotes the expected reward of arm e. Following standard assumptions of stochastic MABs, we assume that all reward distributions have R-sub-Gaussian tails for some known constant R > 0. Formally, if X is a random variable drawn from φ_e for some e ∈ [n], then, for all t ∈ ℝ, one has E[exp(tX − tE[X])] ≤ exp(R²t²/2). It is known that the family of R-sub-Gaussian tail distributions encompasses all distributions that are supported on [0, R] as well as many unbounded distributions, such as Gaussian distributions with variance R² (see, e.g., [27, 28]).
We define a decision class M ⊆ 2^[n] as a collection of sets of arms. Let M* = argmax_{M∈M} w(M) denote the optimal member of the decision class M which maximizes the sum of expected rewards1. A learner's objective is to identify M* from M by playing the following game with the stochastic environment. At the beginning of the game, the decision class M is revealed to the learner while the reward distributions {φ_e}_{e∈[n]} are unknown to her. Then, the learner plays the game over a sequence of rounds; in each round t, she pulls an arm p_t ∈ [n] and observes a reward sampled from the associated reward distribution φ_{p_t}. The game continues until a certain stopping condition is satisfied. After the game finishes, the learner needs to output a set Out ∈ M.
We consider two different stopping conditions for the game, which are known as the fixed confidence setting and the fixed budget setting in the literature. In the fixed confidence setting, the learner can stop the game at any round. She needs to guarantee that Pr[Out = M*] ≥ 1 − δ for a given confidence parameter δ.
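The game just described, and the optimal set M* it targets, can be sketched in a few lines. This is an illustrative sketch only: the names (`set_weight`, `optimal_set`, `pull`) and the toy instance are ours, not from the paper.

```python
import random

# Sketch of the CPE game: n arms with unknown reward distributions and a
# decision class given as an explicit collection of arm subsets.

def set_weight(v, M):
    """v(S) = sum_{i in S} v(i), following the paper's notation."""
    return sum(v[e] for e in M)

def optimal_set(w, decision_class):
    """M* = argmax_{M in decision_class} w(M)."""
    return max(decision_class, key=lambda M: set_weight(w, M))

def pull(w, e, R=1.0, rng=random):
    """One round: play arm e and observe a reward drawn from an
    R-sub-Gaussian distribution (here, a Gaussian with variance R^2)."""
    return rng.gauss(w[e], R)

# Toy instance: 4 arms; decision class = all size-2 subsets (TOPK with K = 2).
w = {0: 0.9, 1: 0.5, 2: 0.4, 3: 0.1}
decision_class = [frozenset({i, j}) for i in range(4) for j in range(i + 1, 4)]
assert optimal_set(w, decision_class) == frozenset({0, 1})
```

Enumerating the decision class explicitly is only feasible for tiny instances; the algorithms below instead access M solely through an optimization oracle.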
The learner\u2019s performance is evaluated by her sample complexity, i.e., the number of\npulls used by the learner. In the \ufb01xed budget setting, the game stops after a \ufb01xed number T of rounds,\nwhere T is given before the game starts. The learner tries to minimize the probability of error, which\nis formally Pr[Out 6= M\u21e4], within T rounds. In this setting, her performance is measured by the\nprobability of error.\n3 Algorithm, Exchange Class and Sample Complexity\nIn this section, we present Combinatorial Lower-Upper Con\ufb01dence Bound (CLUCB) algorithm, a\nlearning algorithm for the CPE problem in the \ufb01xed con\ufb01dence setting, and analyze its sample com-\nplexity. En route to our sample complexity bound, we introduce the notions of exchange classes and\nthe widths of decision classes, which play an important role in the analysis and sample complexity\nbound. Furthermore, the CLUCB algorithm can be extended to the \ufb01xed budget and PAC learning\nsettings, the discussion of which is included in the supplementary material (Appendix B).\nOracle. We allow the CLUCB algorithm to access a maximization oracle. A maximization oracle\ntakes a weight vector v 2 Rn as input and \ufb01nds an optimal set from a given decision class M with\nrespect to the weight vector v. Formally, we call a function Oracle: Rn !M a maximization oracle\nfor M if, for all v 2 Rn, we have Oracle(v) 2 arg maxM2M v(M ). It is clear that a wide range\nof decision classes admit such maximization oracles, including decision classes corresponding to\ncollections of matchings, paths or bases of matroids (see later for concrete examples). Besides the\naccess to the oracle, CLUCB does not need any additional knowledge of the decision class M.\nAlgorithm. Now we describe the details of CLUCB, as shown in Algorithm 1. During its execution,\nthe CLUCB algorithm maintains empirical mean \u00afwt(e) and con\ufb01dence radius radt(e) for each arm\ne 2 [n] and each round t. 
The construction of con\ufb01dence radius ensures that |w(e) \u00afwt(e)|\uf8ff\nradt(e) holds with high probability for each arm e 2 [n] and each round t > 0. CLUCB begins\nwith an initialization phase in which each arm is pulled once. Then, at round t n, CLUCB uses\nthe following procedure to choose an arm to play. First, CLUCB calls the oracle which \ufb01nds the\nset Mt = Oracle( \u00afwt). The set Mt is the \u201cbest\u201d set with respect to the empirical means \u00afwt. Then,\nCLUCB explores possible re\ufb01nements of Mt. In particular, CLUCB uses the con\ufb01dence radius to\ncompute an adjusted expectation vector \u02dcwt in the following way: for each arm e 2 Mt, \u02dcwt(e) is\nequal to to the lower con\ufb01dence bound \u02dcwt(e) = \u00afwt(e) radt(e); and for each arm e 62 Mt, \u02dcwt(e) is\nequal to the upper con\ufb01dence bound \u02dcwt(e) = \u00afwt(e) + radt(e). Intuitively, the adjusted expectation\nvector \u02dcwt penalizes arms belonging to the current set Mt and encourages exploring arms out of\n\n1We de\ufb01ne v(S) ,Pi2S v(i) for any vector v 2 Rn and any set S \u2713 [n]. In addition, for convenience,\n\nwe will assume that M\u21e4 is unique.\n\n3\n\n\fInitialize: Play each arm e 2 [n] once. Initialize empirical means \u00afwn and set Tn(e) 1 for all e.\n\nAlgorithm 1 CLUCB: Combinatorial Lower-Upper Con\ufb01dence Bound\nRequire: Con\ufb01dence 2 (0, 1); Maximization oracle: Oracle(\u00b7) : Rn !M\n1: for t = n, n + 1, . . . do\n2:\nMt Oracle( \u00afwt)\n3:\nCompute con\ufb01dence radius radt(e) for all e 2 [n]\nfor e = 1, . . . 
, n do
5:       if e ∈ Mt then w̃t(e) ← w̄t(e) − radt(e)    ▹ radt(e) is defined later in Theorem 1
6:       else w̃t(e) ← w̄t(e) + radt(e)
7:   M̃t ← Oracle(w̃t)
8:   if w̃t(M̃t) = w̃t(Mt) then
9:       Out ← Mt
10:      return Out
11:  pt ← argmax_{e ∈ (M̃t\Mt) ∪ (Mt\M̃t)} radt(e)    ▹ break ties arbitrarily
12:  Pull arm pt and observe the reward
13:  Update empirical means w̄t+1 using the observed reward
14:  Update number of pulls: Tt+1(pt) ← Tt(pt) + 1 and Tt+1(e) ← Tt(e) for all e ≠ pt

Mt. CLUCB then calls the oracle using the adjusted expectation vector w̃t as input to compute a refined set M̃t = Oracle(w̃t). If w̃t(M̃t) = w̃t(Mt) then CLUCB stops and returns Out = Mt. Otherwise, CLUCB pulls the arm that belongs to the symmetric difference between Mt and M̃t and has the largest confidence radius (intuitively, the largest uncertainty). This ends the t-th round of CLUCB. We note that CLUCB generalizes and unifies the ideas of several different fixed confidence algorithms dedicated to the TOPK and MB problems in the literature [19, 13, 20].
3.1 Sample complexity
Now we establish a problem-dependent sample complexity bound of the CLUCB algorithm. To formally state our result, we need to introduce several notions.
Gap. We begin by defining a natural hardness measure of the CPE problem. For each arm e ∈ [n], we define its gap Δe as

Δe = { w(M*) − max_{M∈M: e∈M} w(M)    if e ∉ M*,
     { w(M*) − max_{M∈M: e∉M} w(M)    if e ∈ M*,        (1)

where we adopt the convention that the maximum value of an empty set is −∞. We also define the hardness H as the sum of inverse squared gaps

H = Σ_{e∈[n]} Δe^{−2}.        (2)

We see that, for each arm e ∉ M*, the gap Δe represents the sub-optimality of the best set that includes arm e; and, for each arm e ∈ M*, the gap Δe is the sub-optimality of the best set that does not include arm e. This naturally generalizes and unifies previous definitions of gaps [1, 12, 18, 8].
Exchange class and the width of a decision class. A notable challenge of our analysis stems from the generality of CLUCB which, as we have seen, supports a wide range of decision classes M. Indeed, previous algorithms for special cases including TOPK and MB require a separate analysis for each individual type of problem. Such a strategy is intractable for our setting and we need a unified analysis for all decision classes. Our solution to this challenge is a novel combinatorial construction called exchange class, which is used as a proxy for the structure of the decision class. Intuitively, an exchange class B for a decision class M can be seen as a collection of "patches" (borrowing concepts from source code management) such that, for any two different sets M, M′ ∈ M, one can transform M to M′ by applying a series of patches of B; and each application of a patch yields a valid member of M. These patches are later used by our analysis to build gadgets that interpolate between different members of the decision class and serve to bridge key quantities. Furthermore, the maximum patch size of B will play an important role in our sample complexity bound.
Now we formally define the exchange class. We begin with the definition of exchange sets, which formalize the aforementioned "patches". We define an exchange set b as an ordered pair of disjoint sets b = (b+, b−) where b+ ∩ b− = ∅ and b+, b− ⊆ [n].
Then, we define the operator ⊕ such that, for any set M ⊆ [n] and any exchange set b = (b+, b−), we have M ⊕ b ≜ (M \ b−) ∪ b+. Similarly, we also define the operator ⊖ such that M ⊖ b ≜ (M \ b+) ∪ b−.
We call a collection of exchange sets B an exchange class for M if B satisfies the following property. For any M, M′ ∈ M such that M ≠ M′ and for any e ∈ (M \ M′), there exists an exchange set (b+, b−) ∈ B which satisfies five constraints: (a) e ∈ b−, (b) b+ ⊆ M′\M, (c) b− ⊆ M\M′, (d) (M ⊕ b) ∈ M and (e) (M′ ⊖ b) ∈ M.
Intuitively, constraints (b) and (c) resemble the concept of patches in the sense that b+ contains only the "new" elements from M′ and b− contains only the "old" elements of M; constraints (d) and (e) allow one to transform M one step closer to M′ by applying a patch b ∈ B to yield (M ⊕ b) ∈ M (and similarly for M′ ⊖ b). These transformations are the basic building blocks in our analysis. Furthermore, as we will see later in our examples, for many decision classes, there are exchange classes representing natural combinatorial structures, e.g., augmenting paths and cycles of matchings.
In our analysis, the key quantity of an exchange class is its width, which is defined as the size of the largest exchange set:

width(B) = max_{(b+,b−)∈B} |b+| + |b−|.        (3)

Let Exchange(M) denote the family of all possible exchange classes for M. We define the width of a decision class M as the width of the thinnest exchange class:

width(M) = min_{B∈Exchange(M)} width(B).        (4)

Sample complexity. The main result of this section is a problem-dependent sample complexity bound of the CLUCB algorithm which shows that, with high probability, CLUCB returns the optimal set M* and uses at most Õ(width(M)²H) samples.
Theorem 1. 
Given any δ ∈ (0, 1), any decision class M ⊆ 2^[n] and any expected rewards w ∈ ℝⁿ. Assume that the reward distribution φ_e for each arm e ∈ [n] has mean w(e) with an R-sub-Gaussian tail. Let M* = argmax_{M∈M} w(M) denote the optimal set. Set radt(e) = R √(2 log(4nt³/δ) / Tt(e)) for all t > 0 and e ∈ [n]. Then, with probability at least 1 − δ, the CLUCB algorithm (Algorithm 1) returns the optimal set Out = M* and

T ≤ O(R² width(M)² H log(nR²H/δ)),        (5)

where T denotes the number of samples used by Algorithm 1, H is defined in Eq. (2) and width(M) is defined in Eq. (4).
3.2 Examples of decision classes
Now we investigate several concrete types of decision classes, which correspond to different CPE tasks. We analyze the width of these decision classes and apply Theorem 1 to obtain the sample complexity bounds. A detailed analysis and the constructions of exchange classes can be found in the supplementary material (Appendix F). We begin with the problems of top-K arm identification (TOPK) and multi-bandit best arms identification (MB).
Example 1 (TOPK and MB). For any K ∈ [n], the problem of finding the top K arms with the largest expected rewards can be modeled by decision class M_TOPK(K) = {M ⊆ [n] : |M| = K}. Let A = {A1, . . . , Am} be a partition of [n]. The problem of identifying the best arms from each group of arms A1, . . . , Am can be modeled by decision class M_MB(A) = {M ⊆ [n] : ∀i ∈ [m], |M ∩ Ai| = 1}. 
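Both decision classes admit simple maximization oracles. A minimal sketch (function names are ours; the oracles maximize the sum of weights as defined in Section 3):

```python
def topk_oracle(v, K):
    """Maximization oracle for M_TOPK(K): the K arms with largest weights
    (ties broken arbitrarily)."""
    return frozenset(sorted(range(len(v)), key=lambda e: v[e], reverse=True)[:K])

def mb_oracle(v, partition):
    """Maximization oracle for M_MB(A): the best arm of each group A_i."""
    return frozenset(max(group, key=lambda e: v[e]) for group in partition)

v = [0.2, 0.7, 0.1, 0.9, 0.4]
assert topk_oracle(v, 2) == frozenset({1, 3})
assert mb_oracle(v, [[0, 1], [2, 3, 4]]) == frozenset({1, 3})
```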
Note that maximization oracles for these two decision classes are trivially the\nfunctions of returning the top k arms or the best arms of each group.\nThen we have width(MTOPK(K)) \uf8ff 2 and width(MMB(A)) \uf8ff 2 (see Fact 2 and 3 in the sup-\nplementary material) and therefore the sample complexity of CLUCB for solving TOPK and MB is\nOH log(nH/), which matches previous results in the \ufb01xed con\ufb01dence setting [19, 13, 20] up to\n\nconstant factors.\n\nNext we consider the problem of identifying the maximum matching and the problem of \ufb01nding\nthe shortest path (by negating the rewards), in a setting where arms correspond to edges. For these\nproblems, Theorem 1 establishes the \ufb01rst known sample complexity bound.\n\n5\n\n\fExample 2 (Matchings and Paths). Let G(V, E) be a graph with n edges and assume there is a one-\nto-one mapping between edges E and arms [n]. Suppose that G is a bipartite graph. Let MMATCH(G)\ncorrespond to the set of all matchings in G. Then we have width(MMATCH(G)) \uf8ff| V | (In fact, we\nconstruct an exchange class corresponding to the collection of augmenting cycles and augmenting\npaths of G; see Fact 4).\nNext suppose that G is a directed acyclic graph and let s, t 2 V be two vertices. Let MPATH(G,s,t)\ncorrespond to the set of all paths from s to t. Then we have width(MPATH(G,s,t)) \uf8ff| V | (In fact,\nwe construct an exchange class corresponding to the collection of disjoint pairs of paths; see\nFact 5). Therefore the sample complexity bounds of CLUCB for decision classes MMATCH(G) and\nMPATH(G,s,t) are O|V |2H log(nH/).\nLast, we investigate the general problem of identifying the maximum-weight basis of a matroid.\nAgain, Theorem 1 is the \ufb01rst sample complexity upper bound for this type of pure exploration tasks.\nExample 3 (Matroids). 
Let T = (E,I) be a \ufb01nite matroid, where E is a set of size n (called\nground set) and I is a family of subsets of E (called independent sets) which satis\ufb01es the axioms of\nmatroids (see Footnote 3 in Appendix F). Assume that there is a one-to-one mapping between E and\n[n]. Recall that a basis of matroid T is a maximal independent set. Let MMATROID(T ) correspond\nto the set of all bases of T . Then we have width(MMATROID(T )) \uf8ff 2 (derived from strong basis\nexchange property of matroids; see Fact 1) and the sample complexity of CLUCB for MMATROID(T )\nis OH log(nH/).\nThe last example MMATROID(T ) is a general type of decision class which encompasses many pure\nexploration tasks including TOPK and MB as special cases, where TOPK corresponds to uniform\nmatroids of rank K and MB corresponds to partition matroids. It is easy to see that MMATROID(T )\nalso covers the decision class that contains all spanning trees of a graph. On the other hand, it has\nbeen established that matchings and paths cannot be formulated as matroids since they are matroid\nintersections [26].\n4 Lower Bound\nIn this section, we present a problem-dependent lower bound on the sample complexity of the CPE\nproblem. To state our results, we \ufb01rst de\ufb01ne the notion of -correct algorithm as follows. For any\n 2 (0, 1), we call an algorithm A a -correct algorithm if, for any expected reward w 2 Rn, the\nprobability of error of A is at most , i.e., Pr[M\u21e4 6= Out] \uf8ff , where Out is the output of A.\nWe show that, for any decision class M and any expected rewards w, a -correct algorithm A must\nuse at least \u2326H log(1/) samples in expectation.\nTheorem 2. Fix any decision class M\u2713 2[n] and any vector w 2 Rn. Suppose that, for each\narm e 2 [n], the reward distribution 'e is given by 'e = N (w(e), 1), where we let N (\u00b5, 2)\ndenote Gaussian distribution with mean \u00b5 and variance 2. 
Then, for any δ ∈ (0, e⁻¹⁶/4) and any δ-correct algorithm A, we have

E[T] ≥ (1/16) H log(1/(4δ)),        (6)

where T denotes the total number of samples used by algorithm A and H is defined in Eq. (2).

In Example 1 and Example 3, we have seen that the sample complexity of CLUCB is O(H log(nH/δ)) for pure exploration tasks including TOPK, MB and, more generally, the CPE tasks with decision classes derived from matroids, i.e., M_MATROID(T) (including spanning trees). Hence, our upper and lower bounds show that the CLUCB algorithm achieves the optimal sample complexity within logarithmic factors for these pure exploration tasks. In addition, we remark that Theorem 2 resolves the conjecture of Bubeck et al. [8] that the sample complexity lower bounds of the TOPK and MB problems are Ω(H log(1/δ)), for the cases of Gaussian reward distributions.
On the other hand, for general decision classes with non-constant widths, we see that there is a gap of Θ̃(width(M)²) between the upper bound Eq. (5) and the lower bound Eq. (6). Notice that we have width(M) ≤ n for any decision class M and therefore the gap is relatively benign. Our lower bound also suggests that the dependency on H of the sample complexity of CLUCB cannot be improved up to logarithmic factors. Furthermore, we conjecture that the sample complexity lower bound might inherently depend on the size of exchange sets. In the supplementary material (Appendix C.2), we provide evidence for this conjecture in the form of a lower bound on the sample complexity of exploration of the exchange sets.
5 Fixed Budget Algorithm
In this section, we present the Combinatorial Successive Accept Reject (CSAR) algorithm, which is a parameter-free learning algorithm for the CPE problem in the fixed budget setting. Then, we upper bound the probability of error of CSAR in terms of the gaps and width(M).
Constrained oracle. 
The CSAR algorithm requires access to a constrained oracle, which is a function denoted as COracle : ℝⁿ × 2^[n] × 2^[n] → M ∪ {⊥} and satisfies

COracle(v, A, B) = { argmax_{M ∈ M_{A,B}} v(M)    if M_{A,B} ≠ ∅,
                   { ⊥                            if M_{A,B} = ∅,        (7)

where we define M_{A,B} = {M ∈ M | A ⊆ M, B ∩ M = ∅} as the collection of feasible sets and ⊥ is a null symbol. Hence we see that COracle(v, A, B) returns an optimal set that includes all elements of A while excluding all elements of B; and if there are no feasible sets, the constrained oracle COracle(v, A, B) returns the null symbol ⊥. In the supplementary material (Appendix G), we show that constrained oracles are equivalent to maximization oracles up to a transformation on the weight vector. In addition, similar to CLUCB, CSAR does not need any additional knowledge of M other than access to a constrained oracle for M.
Algorithm. The idea of the CSAR algorithm is as follows. The CSAR algorithm divides the budget of T rounds into n phases. At the end of each phase, CSAR either accepts or rejects a single arm. If an arm is accepted, then it is included in the final output. Conversely, if an arm is rejected, then it is excluded from the final output. The arms that are neither accepted nor rejected are sampled an equal number of times in the next phase.
Now we describe the procedure of the CSAR algorithm for choosing an arm to accept/reject. Let At denote the set of accepted arms before phase t and let Bt denote the set of rejected arms before phase t. We call an arm e active if e ∉ At ∪ Bt. At the beginning of phase t, CSAR samples each active arm T̃t − T̃t−1 times, where the definition of T̃t is given in Algorithm 2. Next, CSAR calls the constrained oracle to compute an optimal set Mt with respect to the empirical means w̄t, the accepted arms At and the rejected arms Bt, i.e., Mt = COracle(w̄t, At, Bt). 
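As noted above, a constrained oracle can be obtained from a plain maximization oracle via a transformation on the weight vector. Here is a minimal sketch of one such transformation; the shift constant `L` and the helper names are our illustrative choices, not necessarily the exact construction of Appendix G:

```python
def topk_oracle(v, K=2):
    """A plain maximization oracle for size-K subsets (used for the demo)."""
    return frozenset(sorted(range(len(v)), key=lambda e: v[e], reverse=True)[:K])

def constrained_oracle(oracle, v, A, B):
    """COracle(v, A, B) via weight shifting: boost arms in A by +L and
    penalize arms in B by -L, with L large enough that any feasible set
    beats any infeasible one under the shifted weights. Returns None
    (the null symbol) when no feasible set exists."""
    L = 1.0 + 2 * len(v) * max(1.0, max(abs(x) for x in v))
    shifted = [v[e] + (L if e in A else -L if e in B else 0.0)
               for e in range(len(v))]
    M = oracle(shifted)
    # If even the best shifted set is infeasible, no feasible set exists.
    if not (set(A) <= M) or (set(B) & M):
        return None
    return M

# Best size-2 set containing arm 2 and avoiding arm 3:
assert constrained_oracle(topk_oracle, [0.2, 0.7, 0.1, 0.9], {2}, {3}) == frozenset({1, 2})
# Forcing three arms into a size-2 set is infeasible:
assert constrained_oracle(topk_oracle, [0.2, 0.7, 0.1, 0.9], {0, 1, 2}, set()) is None
```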
It is clear that the output of COracle(w̄t, At, Bt) is independent of the input w̄t(e) for any e ∈ At ∪ Bt. Then, for each active arm e, CSAR estimates the "empirical gap" of e in the following way. If e ∈ Mt, then CSAR computes an optimal set M̃t,e that does not include e, i.e., M̃t,e = COracle(w̄t, At, Bt ∪ {e}). Conversely, if e ∉ Mt, then CSAR computes an optimal M̃t,e which includes e, i.e., M̃t,e = COracle(w̄t, At ∪ {e}, Bt). Then, the empirical gap of e is calculated as w̄t(Mt) − w̄t(M̃t,e). Finally, CSAR chooses the arm pt which has the largest empirical gap. If pt ∈ Mt then pt is accepted, otherwise pt is rejected. The pseudo-code of CSAR is shown in Algorithm 2. We note that CSAR can be considered a generalization of the ideas of the two versions of the SAR algorithm due to Bubeck et al. [8], which are designed specifically for the TOPK and MB problems respectively.
5.1 Probability of error
In the following theorem, we bound the probability of error of the CSAR algorithm.
Theorem 3. Given any T > n, any decision class M ⊆ 2^[n] and any expected rewards w ∈ ℝⁿ. Assume that the reward distribution φ_e for each arm e ∈ [n] has mean w(e) with an R-sub-Gaussian tail. Let Δ(1), . . . , Δ(n) be a permutation of Δ1, . . . , Δn (defined in Eq. (1)) such that Δ(1) ≤ . . . ≤ Δ(n). Define H2 ≜ max_{i∈[n]} i Δ(i)^{−2}. Then, the CSAR algorithm uses at most T samples and outputs a solution Out ∈ M ∪ {⊥} such that

Pr[Out ≠ M*] ≤ n² exp( −(T − n) / (18 R² log̃(n) width(M)² H2) ),        (8)

where log̃(n) ≜ Σ_{i=1}^{n} i⁻¹, M* = argmax_{M∈M} w(M) and width(M) is defined in Eq. (4).

One can verify that H2 is equivalent to H up to a logarithmic factor: H2 ≤ H ≤ log(2n)H2 (see [1]). Therefore, by setting the probability of error (the RHS of Eq. (8)) to a constant, one can see that CSAR requires a budget of T = Õ(width(M)²H) samples. 
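The accept/reject rule described above can be sketched as follows, using a brute-force constrained oracle over an explicit decision class for illustration (all names are ours; a real implementation would use an efficient constrained oracle):

```python
from itertools import combinations

def brute_coracle(decision_class):
    """Brute-force COracle(v, A, B) over an explicit decision class:
    maximize v over sets that contain all of A and avoid all of B."""
    def coracle(v, A, B):
        feasible = [M for M in decision_class if A <= M and not (B & M)]
        if not feasible:
            return None  # the null symbol
        return max(feasible, key=lambda M: sum(v[e] for e in M))
    return coracle

def csar_step(coracle, w_bar, A, B, n):
    """One phase-end decision of CSAR: accept or reject the active arm
    with the largest empirical gap w_bar(M_t) - w_bar(M~_{t,e})."""
    weight = lambda M: float('-inf') if M is None else sum(w_bar[e] for e in M)
    M_t = coracle(w_bar, A, B)
    if M_t is None:
        return None  # fail
    active = [e for e in range(n) if e not in A and e not in B]
    def gap(e):
        if e in M_t:   # best set forced to avoid e
            M_te = coracle(w_bar, A, B | {e})
        else:          # best set forced to contain e
            M_te = coracle(w_bar, A | {e}, B)
        return weight(M_t) - weight(M_te)
    p_t = max(active, key=gap)
    return ('accept' if p_t in M_t else 'reject', p_t)

# Toy run: size-2 subsets of 4 arms; arm 0 has the largest empirical gap.
dc = [frozenset(c) for c in combinations(range(4), 2)]
w_bar = [0.9, 0.5, 0.4, 0.1]
assert csar_step(brute_coracle(dc), w_bar, set(), set(), 4) == ('accept', 0)
```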
This is equivalent to the sample\ncomplexity bound of CLUCB up to logarithmic factors. In addition, applying Theorem 3 back to\nTOPK and MB, our bound matches the previous \ufb01xed budget algorithm due to Bubeck et al. [8].\n\n18R2 \u02dclog(n) width(M)2H2\u25c6 ,\n\n(T n)\n\n(8)\n\n7\n\n\f1\ni\n\ni=1\n\n.\n\n. set \u00afwt(e) = 0, 8e 2 At [ Bt\n\n. de\ufb01ne \u00afwt(?) = 1; break ties arbitrarily\n\npt arg maxe2[n]\\(At[Bt) \u00afwt(Mt) \u00afwt( \u02dcMt,e)\nif pt 2 Mt then\nelse\n\nAt+1 At [{ pt}, Bt+1 Bt\nAt+1 At, Bt+1 Bt [{ pt}\n\nfail: set Out ? and return Out\nif e 2 Mt then \u02dcMt,e COracle( \u00afwt, At, Bt [{ e})\nelse \u02dcMt,e COracle( \u00afwt, At [{ e}, Bt)\n\n\u02dcTt l\nPull each arm e 2 [n]\\(At [ Bt) for \u02dcTt \u02dcTt1 times\nUpdate the empirical means \u00afwt for each arm e 2 [n]\\(At [ Bt)\nMt COracle( \u00afwt, At, Bt)\nif Mt = ? then\nfor each e 2 [n]\\(At [ Bt) do\n\nAlgorithm 2 CSAR: Combinatorial Successive Accept Reject\nRequire: Budget: T > 0; Constrained oracle: COracle : Rn \u21e5 2[n] \u21e5 2[n] !M[{?}\n1: De\ufb01ne \u02dclog(n) ,Pn\n2: \u02dcT0 0, A1 ;, B1 ;\n3: for t = 1, . . . , n do\n\u02dclog(n)(nt+1)m\n4:\nTn\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14:\n15:\n16:\n17:\n18: Out An+1\n19: return Out\n6 Related Work\nThe multi-armed bandit problem has been extensively studied in both stochastic and adversarial\nsettings [22, 3, 2]. We refer readers to [5] for a survey on recent advances. Many work in MABs focus\non minimizing the cumulative regret, which is an objective known to be fundamentally different\nfrom the objective of pure exploration MABs [6]. Among these work, a recent line of research\nconsiders a generalized setting called combinatorial bandits in which a set of arms (satisfying certain\ncombinatorial constraints) are played on each round [9, 17, 25, 7, 10, 14, 23, 21]. 
Note that the objective of these works is to minimize the cumulative regret, which differs from ours.
In the literature on pure exploration MABs, the classical problem of identifying the single best arm has been well studied in both the fixed confidence and fixed budget settings [24, 11, 6, 1, 13, 15, 16]. A flurry of recent work extends this classical problem to the TOPK and MB problems and obtains algorithms with upper bounds [18, 12, 13, 19, 8, 20, 31] and worst-case lower bounds for TOPK [19, 31]. Our framework encompasses these two problems as special cases and covers a much larger class of combinatorial pure exploration problems, which have not been addressed in the current literature. Applying our results back to TOPK and MB, our upper bounds match the best available problem-dependent bounds up to constant factors [13, 19, 8] in both the fixed confidence and fixed budget settings; and our lower bound is the first proven problem-dependent lower bound for these two problems, which was conjectured earlier by Bubeck et al. [8].
7 Conclusion
In this paper, we proposed a general framework called combinatorial pure exploration (CPE) that can handle pure exploration tasks for many complex bandit problems with combinatorial constraints, with potential applications in various domains. We have shown a number of results for the framework, including two novel learning algorithms, their upper bounds and a novel lower bound. The proposed algorithms support a wide range of decision classes in a unified way, and our analysis introduced a novel tool called the exchange class, which may be of independent interest.
Our upper and lower bounds characterize the complexity of the CPE problem: the sample complexity of our algorithm is optimal (up to a logarithmic factor) for the decision classes derived from matroids (including TOPK and MB), while for general decision classes, our upper and lower bounds are within a relatively benign factor.
Acknowledgments. The work described in this paper was partially supported by the National Grand Fundamental Research 973 Program of China (No. 2014CB340401 and No. 2014CB340405), the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK 413212 and CUHK 415113), and Microsoft Research Asia Regional Seed Fund in Big Data Research (Grant No. FY13-RES-SPONSOR-036).

References
[1] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In COLT, 2010.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[4] C. Berge. Two theorems in graph theory. PNAS, 1957.
[5] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5:1–122, 2012.
[6] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412:1832–1852, 2010.
[7] S. Bubeck, N. Cesa-Bianchi, S. M. Kakade, S. Mannor, N. Srebro, and R. C. Williamson. Towards minimax policies for online linear optimization with bandit feedback. In COLT, 2012.
[8] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. In ICML, pages 258–265, 2013.
[9] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. JCSS, 78(5):1404–1422, 2012.
[10] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In ICML, pages 151–159, 2013.
[11] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 2006.
[12] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. In NIPS, 2011.
[13] V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In NIPS, 2012.
[14] A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In ICML, pages 100–108, 2014.
[15] K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
[16] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In COLT, 2014.
[17] S. Kale, L. Reyzin, and R. E. Schapire. Non-stochastic bandit slate problems. In NIPS, 2010.
[18] S. Kalyanakrishnan and P. Stone. Efficient selection of multiple bandit arms: Theory and practice. In ICML, pages 511–518, 2010.
[19] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, pages 655–662, 2012.
[20] E. Kaufmann and S. Kalyanakrishnan. Information complexity in bandit subset selection. In COLT, 2013.
[21] B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In UAI, 2014.
[22] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[23] T. Lin, B. Abrahao, R. Kleinberg, J. Lui, and W. Chen. Combinatorial partial monitoring game with linear feedback and its application. In ICML, 2014.
[24] S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. The Journal of Machine Learning Research, 5:623–648, 2004.
[25] G. Neu, A. György, and C. Szepesvári. The online loop-free stochastic shortest-path problem. In COLT, pages 231–243, 2010.
[26] J. G. Oxley. Matroid theory. Oxford University Press, 2006.
[27] D. Pollard. Asymptopia. Manuscript, Yale University, Dept. of Statistics, New Haven, Connecticut, 2000.
[28] O. Rivasplata. Subgaussian random variables: An expository note. 2012.
[29] S. M. Ross. Stochastic Processes, volume 2. John Wiley & Sons, New York, 1996.
[30] N. Spring, R. Mahajan, and D. Wetherall. Measuring ISP topologies with Rocketfuel. ACM SIGCOMM Computer Communication Review, 32(4):133–145, 2002.
[31] Y. Zhou, X. Chen, and J. Li. Optimal PAC multiple arm identification with applications to crowdsourcing. In ICML, 2014.