{"title": "A New Perspective on Pool-Based Active Classification and False-Discovery Control", "book": "Advances in Neural Information Processing Systems", "page_first": 13992, "page_last": 14003, "abstract": "In many scientific settings there is a need for adaptive experimental design to guide the process of identifying regions of the search space that contain as many true positives as possible subject to a low rate of false discoveries (i.e. false alarms). Such regions of the search space could differ drastically from a predicted set that minimizes 0/1 error and accurate identification could require very different sampling strategies. Like active learning for binary classification, this experimental design cannot be optimally chosen a priori, but rather the data must be taken sequentially and adaptively in a closed loop. However, unlike classification with 0/1 error, collecting data adaptively to find a set with high true positive rate and low false discovery rate (FDR) is not as well understood. In this paper, we provide the first provably sample efficient adaptive algorithm for this problem. Along the way, we highlight connections between classification, combinatorial bandits, and FDR control making contributions to each.", "full_text": "A New Perspective on Pool-Based Active\nClassi\ufb01cation and False-Discovery Control\n\n{lalitj, jamieson}@cs.washington.edu\n\nPaul G. Allen School of Computer Science & Engineering\n\nLalit Jain, Kevin Jamieson\n\nUniversity of Washington, Seattle, WA\n\nAbstract\n\nIn many scienti\ufb01c settings there is a need for adaptive experimental design to guide\nthe process of identifying regions of the search space that contain as many true\npositives as possible subject to a low rate of false discoveries (i.e. false alarms).\nSuch regions of the search space could differ drastically from a predicted set\nthat minimizes 0/1 error and accurate identi\ufb01cation could require very different\nsampling strategies. 
Like active learning for binary classi\ufb01cation, this experimental\ndesign cannot be optimally chosen a priori, but rather the data must be taken\nsequentially and adaptively. However, unlike classi\ufb01cation with 0/1 error, collecting\ndata adaptively to \ufb01nd a set with high true positive rate and low false discovery\nrate (FDR) is not as well understood. In this paper we provide the \ufb01rst provably\nsample ef\ufb01cient adaptive algorithm for this problem. Along the way we highlight\nconnections between classi\ufb01cation, combinatorial bandits, and FDR control making\ncontributions to each.\n\n1\n\nIntroduction\n\nAs machine learning has become ubiquitous in the biological, chemical, and material sciences, it\nhas become irresistible to use these techniques not only for making inferences about previously\ncollected data, but also for guiding the data collection process, closing the loop on inference and\ndata collection [10, 38, 41, 39, 33, 31]. However, though collecting data randomly or non-adaptively\ncan be inef\ufb01cient, ill-informed ways of collecting data adaptively can be catastrophic: a procedure\ncould collect some data, adopt an incorrect belief, collect more data based on this belief, and leave\nthe practitioner with insuf\ufb01cient data in the right places to infer anything with con\ufb01dence.\nIn a recent high-throughput protein synthesis experiment [33], thousands of short amino acid se-\nquences (length less than 60) were evaluated with the goal of identifying and characterizing a subset\nof the pool of all possible sequences ( \u2248 1080) containing many sequences that will fold into stable\nproteins. That is, given an evaluation budget that is just a minuscule proportion of the total number\nof sequences, the researchers sought to make predictions about individual sequences that would\nnever be evaluated. 
An initial \ufb01rst round of sequences uniformly sampled from a prede\ufb01ned subset\nwere synthesized to observe whether each sequence was in the set of sequences that will fold, H1,\nor in H0 = Hc\n1. Treating this as a classi\ufb01cation problem, a linear logistic regression classi\ufb01er was\ntrained, using these labels and physics based features. Then a set of sequences to test in the next\nround were chosen to maximize the probability of folding according to this empirical model - a\nprocedure repeated twice more. This strategy suffers two \ufb02aws. First, selecting a set to maximize\nthe likelihood of hits given past rounds\u2019 data is effectively using logistic regression to perform\noptimization similar to follow-the-leader strategies [14]. While more of the sequences evaluated\nmay fold, these observations may provide little information about whether sequences that were not\nevaluated will fold or not. Second, while it is natural to employ logistic regression or the SVM\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: The distribution of a feature that is highly correlated with the \ufb01tted logistic model (bottom plot) and\nthe proportion of sequences that fold (top plot). The distribution of this feature for the sequences drifts right.\nto discriminate between binary outcomes (e.g., fold/not-fold), in many scienti\ufb01c applications the\nproperty of interest is incredibly rare and an optimal classi\ufb01er will just predict a single class e.g.\nnot fold. This is not only an undesirable inference for prediction, but a useless signal for collecting\ndata to identify those regions with higher, but still unlikely, probabilities of folding. 
Consider the\ndata of [33] reproduced in Figure 1, where the proportion of sequences that fold along with their\ndistributions for a particularly informative feature (Buried NPSA) are shown in each round for two\ndifferent protein topologies (notated \u03b2\u03b1\u03b2\u03b2 and \u03b1\u03b1\u03b1). In the last column of Figure 1, even though\nmost of the sequences evaluated are likely to fold, we are sampling in a small part of the overall\nsearch space. This limits our overall ability to identify under-explored regions that could potentially\ncontain many sequences that fold, even though the logistic model does not achieve its maximum\nthere. On the other hand, in the top plot of Figure 1, sequences with topology \u03b2\u03b1\u03b2\u03b2 (shown in blue)\nso rarely folded that a near-optimal classi\ufb01er would predict \u201cnot fold\u201d for every sequence.\nInstead of using a procedure that seeks to maximize the probability of folding or classifying sequences\nas fold or not-fold, a more natural objective is to predict a set of sequences \u03c0 in such a way as to\nmaximize the true positive rate (TPR) |H1 \u2229 \u03c0|/|H1| while minimizing the false discovery rate (FDR)\ni.e. |H0 \u2229 \u03c0|/|\u03c0|. That is, \u03c0 is chosen to contain a large number of sequences that fold while the\nproportion of false-alarms among those predicted is relatively small. For example, if a set \u03c0 for \u03b2\u03b1\u03b2\u03b2\nwas found that maximized TPR subject to FDR being less than 9/10 then \u03c0 would be non-empty\nwith the guarantee that at least one in every 10 suggestions was a true-positive; not ideal, but making\nthe best of a bad situation. In some settings, such as for topology \u03b1\u03b1\u03b1 (shown in orange), training\na classi\ufb01er to minimize 0/1 loss may be reasonable. 
Of course, before seeing any data we would\nnot know whether classi\ufb01cation is a good objective so it is far more conservative to optimize for\nmaximizing the number of discoveries.\nContributions. We propose the \ufb01rst provably sample-ef\ufb01cient adaptive sampling algorithm for\nmaximizing TPR subject to an FDR constraint. This problem has deep connections to active binary\nclassi\ufb01cation (e.g., active learning) and pure-exploration for combinatorial bandits that are necessary\nsteps towards motivating our algorithm. We make the following contributions:\n1. We improve upon state of the art sample complexity for pool-based active classi\ufb01cation in the\nagnostic setting providing novel sample complexity bounds that do not depend on the disagreement-\ncoef\ufb01cient for sampling with or without replacement. Our bounds are more granular than previous\nresults as they describe the contribution of a single example to the overall sample complexity.\n\n2. We highlight an important connection between active classi\ufb01cation and combinatorial bandits.\nOur results follow directly from our improvements to the state of the art in combinatorial bandits,\nextending methods to be near-optimal for classes that go beyond matroids where one need not\nsample every arm at least once.\n\n3. Our main contribution is the development and analysis of an adaptive sampling algorithm that\nminimizes the number of samples to identify the set that maximizes the true positive rate subject\nto a false discovery constraint. 
To the best of our knowledge, this is the first work to demonstrate a sample complexity for this problem that is provably better than non-adaptive sampling.

1.1 Pool Based Classification and FDR Control

Here we describe what is known as the pool-based setting for active learning with stochastic labels. Throughout the following we assume access to a finite set of items [n] = {1, ··· , n} with an associated label space {0, 1}. The items can be fixed vectors {x_i}_{i=1}^n ⊂ R^d, but we do not restrict to this case. Associated to each i ∈ [n] there is a Bernoulli distribution Ber(η_i) with η_i ∈ [0, 1]. We imagine a setting where in each round a player chooses I_t ∈ [n] and observes an independent random variable Y_{I_t,t}. For any i, the Y_{i,t} ∼ Ber(η_i) are i.i.d. Borrowing from the multi-armed bandit literature, we may also refer to the items as arms, and pulling an arm means receiving a sample from its corresponding label distribution. We will refer to this level of generality as the stochastic noise setting. The case when η_i ∈ {0, 1}, i.e. each point i ∈ [n] has a deterministic label Y_{i,j} = η_i for all j ≥ 1, will be referred to as the persistent noise setting. In this setting we can define H1 = {i : η_i = 1} and H0 = [n] \ H1. This is a natural setting if the experimental noise is negligible, so that performing the same measurement multiple times gives the same result. A classifier is a decision rule f : [n] → {0, 1} that assigns each item i ∈ [n] a fixed label. 
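The observation model above can be made concrete with a minimal sketch; the helper names (make_pool, pull) are ours, not the paper's.

```python
import random

random.seed(0)

def make_pool(eta):
    """eta[i] in [0, 1] is the Bernoulli label probability of item i."""
    def pull(i):
        # Pulling arm i returns an independent label Y ~ Ber(eta[i]).
        return 1 if random.random() < eta[i] else 0
    return pull

# Stochastic noise: repeated pulls of the same arm may disagree.
pull = make_pool([0.9, 0.2, 0.5])
mean0 = sum(pull(0) for _ in range(2000)) / 2000   # concentrates near 0.9

# Persistent noise: eta[i] in {0, 1}, so every pull repeats the same label,
# and H1 = {i : eta[i] = 1}, H0 = [n] \ H1.
eta_p = [1, 0, 1]
pull_p = make_pool(eta_p)
H1 = {i for i, e in enumerate(eta_p) if e == 1}
assert all(pull_p(i) == (1 if i in H1 else 0) for i in range(3))
```

In the persistent case repeated measurement buys nothing, which is why the algorithms below sample without replacement there.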
We can identify any such decision rule with the set of items it maps to 1, i.e. the set π = {i : i ∈ [n], f(i) = 1}. Instead of considering all possible sets π ⊆ [n], we will restrict ourselves to a smaller class Π ⊆ 2^[n]. With this interpretation, one can imagine Π being a combinatorial class, such as the collection of all subsets of [n] of size k, or, if we have features, Π could be the sets induced by the set of all linear separators over {x_i}.
The classification error, or risk, of a classifier is the expected number of incorrect labels, i.e.

R(π) = P_{i∼Unif([n]), Y_i∼Ber(η_i)}(π(i) ≠ Y_i) = (1/n)(Σ_{i∉π} η_i + Σ_{i∈π}(1 − η_i))

for any π ∈ Π. In the case of persistent noise the above reduces to R(π) = (|π ∩ H0| + |π^c ∩ H1|)/n = |H1∆π|/n, where A∆B = (A ∪ B) − (A ∩ B) for any sets A, B.

Problem 1: (Classification) Given a hypothesis class Π ⊆ 2^[n], identify π* := argmin_{π∈Π} R(π) by requesting as few labels as possible.

As described in the introduction, in many situations we are not interested in finding the lowest-risk classifier, but instead in returning a π ∈ Π that contains many discoveries π ∩ H1 without too many false alarms π ∩ H0. Define η_π := Σ_{i∈π} η_i. The false discovery rate (FDR) and true positive rate (TPR) of a set π in the stochastic noise setting are given by

FDR(π) := 1 − η_π/|π|   and   TPR(π) := η_π/η_[n].

In the case of persistent noise, FDR(π) = |H0 ∩ π|/|π| and TPR(π) = |H1 ∩ π|/|H1|. 
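A small numeric check of the persistent-noise definitions above (the particular sets H1 and pi are ours, chosen for illustration):

```python
# Persistent noise: R(pi) = |H1 Δ pi| / n, FDR(pi) = |H0 ∩ pi| / |pi|,
# TPR(pi) = |H1 ∩ pi| / |H1|.
n = 10
H1 = {0, 1, 2, 3}          # items whose true label is 1
H0 = set(range(n)) - H1
pi = {2, 3, 4}             # a candidate prediction set

risk = len(H1 ^ pi) / n    # symmetric difference |H1 Δ pi| / n
fdr = len(H0 & pi) / len(pi)
tpr = len(H1 & pi) / len(H1)

# Same risk via the per-item error count: items in pi with label 0
# plus items outside pi with label 1.
risk_direct = (len(pi & H0) + len(H1 - pi)) / n
assert risk == risk_direct
print(risk, fdr, tpr)      # risk is 0.3 here: |{0, 1, 4}| / 10
```

Note that pi has low FDR and moderate TPR while a risk-minimizing set over an unrestricted class would simply be H1 itself; the two objectives generally pick different sets once Π is constrained.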
A convenient quantity that we can use to reparametrize these quantities is the true positives: TP(π) := Σ_{i∈π} η_i. Throughout the following we let Π_α = {π ∈ Π : FDR(π) ≤ α}.

Problem 2: (Combinatorial FDR Control) Given an α ∈ (0, 1) and a hypothesis class Π ⊆ 2^[n], identify π*_α = argmax_{π∈Π, FDR(π)≤α} TPR(π) by requesting as few labels as possible.

In this work we are agnostic about how η relates to Π, à la [2, 20]. For instance, we do not assume the Bayes classifier, argmin_{B∈{0,1}^n} R(B), is contained in Π.

2 Related Work

Active Classification. Active learning for binary classification is a mature field (see surveys [36, 25] and references therein). The major theoretical results of the field can coarsely be partitioned into the streaming setting [2, 6, 20, 26] and the pool-based setting [19, 24, 32], noting that algorithms for the former can be used for the latter; [2], an inspiration for our algorithm, is such an example. These results rely on different complexity measures known as the splitting index, the teaching dimension, and (arguably the most popular) the disagreement coefficient.
Computational Considerations. While there have been remarkable efforts to make some of these methods more computationally efficient [6, 26], we believe that even given infinite computation, many of these previous works are fundamentally inefficient from a sample complexity perspective. This stems from the fact that when applied to common combinatorial classes (for example the collection of all subsets of size k), these algorithms have sample complexities that are off by at least log(n) factors from the best algorithms for these classes. 
Consequently, in our work we focus on sample complexity\nalone, and leave matters of computational ef\ufb01ciency for future work.\n\n3\n\n\fOther Measures. Given a static dataset, the problem of \ufb01nding a set or classi\ufb01er that maximizes\nTPR subject to FDR-control in the information retrieval community is also known as \ufb01nding a\nbinary classi\ufb01er that maximizes recall for a given precision level. There is extensive work on the\nnon-adaptive sample complexity of computing measures related to precision and recall such as AUC,\nand F-scores [35, 9, 1]. However, there have been just a few works that consider adaptively collecting\ndata with the goal of maximizing recall with precision constraints [34, 5], with the latter work being\nthe most related. We will discuss it further after the statement of our main result. In [34], the problem\nof adaptively estimating the whole ROC curve for a threshold class is considered under a monotonicity\nassumption on the true positives; our algorithm is agnostic to this assumption.\nCombinatorial Bandits: The pure-exploration combinatorial bandit game has been studied for the\ncase of all subsets of [n] of size k known as the Top-K problem [22, 29, 30, 28, 37, 17], the bases of a\nrank-k matroid (for which Top-K is a particular instance) [18, 23, 15], and in the general case [11, 16].\nThe combinatorial bandit component of our work (see Section 3.2) is closest to [11]. The algorithm\nof [11] uses a disagreement-based algorithm in the spirit of Successive Elimination for bandits [22],\nor the A2 for binary classi\ufb01cation [2]. Exploring precisely what log factors are necessary has been an\nactive area. [16] demonstrates a family of instances in which they show in the worst-case, the sample\ncomplexity must scale with log(|\u03a0|). However, there are many classes like best-arm identi\ufb01cation\nand matroids where sample complexity does not scale with log(|\u03a0|) (see references above). 
Our own work provides some insight into what log factors are necessary by presenting our results in terms of the VC dimension. In addition, we discuss situations in which a log(n) could potentially be avoided by appealing to Sauer's lemma in the supplementary material.
Multiple Hypothesis Testing. Finally, though this work shares language with the adaptive multiple-hypothesis testing literature [12, 27, 42, 40], the goals are different. In that setting, there is a set of n hypothesis tests, where the null is that the mean of each distribution is zero and the alternative is that it is nonzero. [27] designs a procedure that adaptively allocates samples and uses the Benjamini-Hochberg procedure [4] on p-values to return an FDR-controlled set. We are not generally interested in finding which individual arms have means that are above a fixed threshold; instead, given a hypothesis class, we want to return an FDR-controlled set in the hypothesis class with high TPR. This is the situation in many structured problems in scientific discovery, where the set of arms corresponds to an extremely large set of experiments and we have a feature vector associated with each arm. We can't run each one, but we may have some hope of identifying a region of the search space which contains many discoveries. In summary, unlike the setting of [27], Π encodes structure among the sets, we do not insist each item is sampled, and we allow for persistent labels; overall we are solving a different and novel problem.

3 Pool Based Active Classification

We first establish a pool-based active classification algorithm that motivates our development of an adaptive algorithm for FDR control. 
For each i define µ_i := 2η_i − 1 ∈ [−1, 1], so η_i = (1 + µ_i)/2. By a simple manipulation of the definition of R(π) above we have

R(π) = (1/n) Σ_{i=1}^n η_i − (1/n) Σ_{i∈π} (2η_i − 1) = (1/n) Σ_{i=1}^n η_i − (1/n) Σ_{i∈π} µ_i

so that argmin_{π∈Π} R(π) = argmax_{π∈Π} Σ_{i∈π} µ_i. Define µ_π := Σ_{i∈π} µ_i. If for some i ∈ [n] we map the jth draw of its label Y_{i,j} ↦ 2Y_{i,j} − 1, then E[2Y_{i,j} − 1] = µ_i, and returning an optimal classifier in the set is equivalent to returning the π ∈ Π with the largest µ_π. Algorithm 1 exploits this.
The algorithm maintains a collection of active sets A_k ⊆ Π and an active set of items T_k ⊆ [n], which is the symmetric difference of all sets in A_k. To see why we only sample in T_k: if i ∈ ∩_{π∈A_k} π, then all sets π, π′ ∈ A_k agree on the label of item i, so any contribution of arm i is canceled in each difference µ̂_π − µ̂_{π′} = µ̂_{π\π′} − µ̂_{π′\π}, and we should not pay to sample it. In each round, sets π with lower empirical means that fall outside of the confidence interval of sets with higher empirical means are removed. There may be some concern that samples from previous rounds are reused. The estimator µ̂_{π′,k} − µ̂_{π,k} = (n/t) Σ_{s=1}^t R_{I_s,s}(1(I_s ∈ π′ \ π) − 1(I_s ∈ π \ π′)) depends on all t samples up to the t-th round, each of which is uniformly and independently drawn at each step. Thus each summand is an unbiased estimate of µ_{π′} − µ_π. 
However, for π, π′ active in round k, as explained above, a summand is only non-zero if I_s ∈ π∆π′ ⊆ T_k; hence we only need to observe R_{I_t,t} when I_t ∈ T_k, and the estimate of µ̂_{π′,k} − µ̂_{π,k} is unbiased.
In practice, since the number of samples that land in T_k follows a binomial distribution, instead of using rejection sampling we could have drawn a single sample from a binomial distribution and sampled that many items uniformly at random from T_k.

Algorithm 1: Action Elimination for Active Classification

Input: δ, Π ⊆ 2^[n], confidence bound C(π′, π, t, δ).
Let A_1 = Π, T_1 = (∪_{π∈A_1} π) − (∩_{π∈A_1} π), k = 1; A_k will be the active sets in round k.
for t = 1, 2, ···
    if t == 2^k:
        Set δ_k = 0.5δ/k². For each π, π′ let
            µ̂_{π′,k} − µ̂_{π,k} = (n/t) (Σ_{s=1}^t R_{I_s,s} 1{I_s ∈ π′ \ π} − Σ_{s=1}^t R_{I_s,s} 1{I_s ∈ π \ π′}).
        Set A_{k+1} = A_k − {π ∈ A_k : ∃π′ ∈ A_k with µ̂_{π′,k} − µ̂_{π,k} > C(π′, π, t, δ_k)}.
        Set T_{k+1} = (∪_{π∈A_{k+1}} π) − (∩_{π∈A_{k+1}} π).
        k ← k + 1
    endif
    Stochastic noise: If T_k = ∅, Break. Otherwise, draw I_t uniformly at random from [n] and, if I_t ∈ T_k, receive an associated reward R_{I_t,t} = 2Y_{I_t,t} − 1, with Y_{I_t,t} i.i.d. ∼ Ber(η_{I_t}).
    Persistent noise: If T_k = ∅ or t > n, Break. Otherwise, draw I_t uniformly at random from [n] \ {I_s : 1 ≤ s < t} and, if I_t ∈ T_k, receive the associated reward R_{I_t,t} = 2Y_{I_t,t} − 1, with Y_{I_t,t} = η_{I_t}.
Output: π′ ∈ A_k such that µ̂_{π′,k} − µ̂_{π,k} ≥ 0 for all π ∈ A_k \ π′.

For any A ⊆ 2^[n] define V(A) as the VC-dimension of the collection of sets A. Given a family of sets Π ⊆ 2^[n], define B1(k) := {π ∈ Π : |π| = k} and B2(k, π′) := {π ∈ Π : |π∆π′| = k}. Also define the following complexity measures:

V_π := V(B1(|π|)) ∧ |π|   and   V_{π,π′} := max{V(B2(|π∆π′|, π)), V(B2(|π∆π′|, π′))} ∧ |π∆π′|.

In general V_π, V_{π,π′} ≤ V(Π). A contribution of our work is the development of confidence intervals that do not depend on a union bound over the whole class but instead on local VC dimensions. These are described carefully in Lemma 1 in the supplementary materials.

Theorem 1 For each i ∈ [n] let µ_i ∈ [−1, 1] be fixed but unknown and assume {R_{i,j}}_{j=1}^∞ is an i.i.d. sequence of random variables such that E[R_{i,j}] = µ_i and R_{i,j} ∈ [−1, 1]. Define ∆̃_π = |µ_π − µ_{π*}|/|π∆π*|, and

τ_π = (V_{π,π*} / (|π*∆π| ∆̃_π²)) log(n log(∆̃_π^{−2})/δ).

Using C(π, π′, t, δ) := sqrt(8|π∆π′| n V_{π,π′} log(n/δ)/t) + 4n V_{π,π′} log(n/δ)/(3t), for a fixed constant c, with probability greater than 1 − δ, in the stochastic noise setting Algorithm 1 returns π* after a number of samples no more than c Σ_{i=1}^n max_{π∈Π : i∈π∆π*} τ_π, and in the persistent noise setting the number of samples needed is no more than c Σ_{i=1}^n min{1, max_{π∈Π : i∈π∆π*} τ_π}.

Heuristically, the expression 1/(|π∆π*| ∆̃_π²) roughly captures the number of times we would have to sample each i ∈ π∆π* to ensure that we can show µ_{π*} > µ_π. Thus, in the more general case, we may expect that we can stop pulling a specific i once each set π such that i ∈ π∆π* is removed, accounting for the expression max_{π∈Π : i∈π∆π*} τ_π. The VC dimension and the logarithmic term in τ_π are discussed further below and come primarily from a careful union bound over the class Π. One always has 1/|π*∆π| ≤ V_{π,π*}/|π*∆π| ≤ 1, and both bounds are achievable by different classes Π. In addition, in terms of risk, ∆̃_π = |µ_π − µ_{π*}|/|π∆π*| = n|R(π) − R(π*)|/|π∆π*|. Since sampling
Since sampling\n\nis done without replacement for persistent noise, there are improved con\ufb01dence intervals that one\ncan use in that setting described in Lemma 1 in the supplementary materials. Finally, if we had\nsampled non-adaptively, i.e. without rejection sampling, we would have had a sample complexity of\nO(n maxi\u2208[n] max\u03c0:\u03a0:i\u2208\u03c0\u2206\u03c0\u2217 \u03c4\u03c0).\n\n5\n\n\f3.1 Comparison with previous Active Classi\ufb01cation results.\n\n|\u03c0\u2206\u03c0\u2217| 1(cid:101)\u22062\n\n\u03c0\n\n2 + sign(z\u2212i/n)\n\n2\n\n= [i] and takes a value of(cid:0) 1+\u03b1\n\nOne Dimensional Thresholds: In the bound of Theorem 1, a natural question to ask is whether\nthe log(n) dependence can be improved. In the case of nested classes, such as thresholds on a\nline, we can replace the log(n) with a log log(n) using empirical process theory. This leads to\ncon\ufb01dence intervals dependent on log log(n) that can be used in place of C(\u03c0(cid:48), \u03c0, t, \u03b4) in Algorithm 1\n(see sections C for the con\ufb01dence intervals and 3.2 for a longer discussion). Under speci\ufb01c noise\nmodels we can give a more interpretable sample complexity. Let h \u2208 (0, 1], \u03b1 \u2265 0, z \u2208 [0, 1]\nfor some i \u2208 [n \u2212 1] and assume that \u03b7i = 1\nh|z \u2212 i/n|\u03b1 so that \u00b5i = h|z \u2212\ni/n|\u03b1sign(z \u2212 i/n) (this would be a reasonable noise model for topology \u03b1\u03b1\u03b1 in the introduction).\nLet \u03a0 = {[k] : k \u2264 n}. In this case, inspecting the dominating term of Theorem 1 for i \u2208 \u03c0\u2217\nn\u22121(z \u2212 i/n)\u22122\u03b1\u22121.\nwe have arg max\u03c0\u2208\u03a0:i\u2208\u03c0\u2206\u03c0\u2217 V\u03c0,\u03c0\u2217\nUpper bounding the other terms and summing, the sample complexities can be calculated to be\nO(log(n) log(log(n)/\u03b4)/h2) if \u03b1 = 0, and O(n2\u03b1 log(log(n)/\u03b4)/h2) if \u03b1 > 0. 
These rates match the minimax lower bound rates given in [13] up to log log factors. Unlike the algorithms given there, our algorithm works in the agnostic setting, i.e. it makes no assumptions about whether the Bayes classifier is in the class. In the case of non-adaptive sampling, the sum is replaced with n times the max, yielding n^{2α+1} log(log(n)/δ)/h², which is substantially worse than adaptive sampling.
Comparison to previous algorithms: One of the foundational works on active learning is the DHM algorithm of [20] and the A2 algorithm that preceded it [2]. Similar in spirit to our algorithm, DHM requests a label only when it is uncertain how π* would label the current point. In general, the analysis of the DHM algorithm cannot characterize the contribution of each arm to the overall sample complexity, leading to sub-optimal sample complexity for combinatorial classes. For example, in the case when Π = {[i]}_{i=1}^n, with i* = argmax_{i∈[n]} µ_i, ignoring logarithmic factors, one can show that the bound of Theorem 1 of [20] scales like n² max_{i≠i*} (µ_{i*} − µ_i)^{−2}, which is substantially worse than our bound for this problem, which scales like Σ_{i≠i*} (µ_{i*} − µ_i)^{−2}. Similar arguments can be made for other combinatorial classes such as all subsets of size k. While we are not particularly interested in applying algorithms like DHM to this specific problem, we note that the style of its analysis exposes such a gross inconsistency with past analyses of the best known algorithms that the approach leaves much to be desired. For more details, please see A.2 in the supplementary materials.

3.2 Connections to Combinatorial Bandits

A closely related problem to classification is the pure-exploration combinatorial bandit problem. As above, we have access to a set of arms [n], and associated to each arm is an unknown distribution ν_i with support in [−1, 1], which is arbitrary, not just a Bernoulli label distribution. We let {R_{i,j}}_{j=1}^∞ be a sequence of random variables where R_{i,j} ∼ ν_i is the jth (i.i.d.) draw from ν_i, satisfying E[R_{i,j}] = µ_i ∈ [−1, 1]. In the persistent noise setting we assume that ν_i is a point mass at µ_i ∈ [−1, 1]. Given a collection of sets Π ⊆ 2^[n], for each π ∈ Π we define µ_π := Σ_{i∈π} µ_i, the sum of the means in π. The pure-exploration combinatorial bandit problem asks, given a hypothesis class Π ⊆ 2^[n], to identify π* = argmax_{π∈Π} µ_π by requesting as few labels as possible. The combinatorial bandit extends many problems considered in the multi-armed bandit literature. For example, setting Π = {{i} : i ∈ [n]} is equivalent to the best-arm identification problem.
The discussion at the start of Section 3 shows that the classification problem can be mapped to combinatorial bandits; indeed, minimizing the 0/1 loss is equivalent to maximizing µ_π. In fact, Algorithm 1 gives state-of-the-art results for the pure-exploration combinatorial bandit problem, and furthermore Theorem 1 holds verbatim. Algorithm 1 is similar to previous action elimination algorithms for combinatorial bandits in the literature, e.g. Algorithm 4 in [11]. However, unlike previous algorithms, we do not insist on sampling each item once, an unrealistic requirement for classification settings; indeed, not having this constraint allows us to reach minimax rates for classification in one dimension as discussed above. 
In addition, this resolves a concern brought up in [11] for elimination being used for PAC-learning. We prove Theorem 1 in this more general setting in the supplementary materials, see A.3.
The connection between FDR control and combinatorial bandits is more direct: we are seeking to find π ∈ Π with maximum η_π subject to FDR constraints. This already highlights a key difference between classification and FDR control. In one we choose to sample to maximize η_π subject to FDR constraints, where each η_i ∈ [0, 1], whereas in classification we are trying to maximize µ_π, where each µ_i ∈ [−1, 1]. A major consequence of this difference is that η_π ≤ η_{π′} whenever π ⊆ π′, but such a condition does not hold for µ_π, µ_{π′}.

Algorithm 2: Active FDR control in persistent and bounded noise settings

Input: Confidence bounds C1(π, t, δ), C2(π, π′, t, δ).
A_1 = Π, C_1 = ∅, S_1 = ∪_{π∈Π} π, T_1 = ∪_{π∈Π} π − ∩_{π∈Π} π, k = 1.
A_k ⊆ Π will be the set of active sets in round k; C_k ⊆ Π is the set of FDR-controlled policies in round k.
for t = 1, 2, ···
    if t == 2^k:
        Let δ_k = 0.25δ/k². For each set π ∈ A_k, and each pair π′, π ∈ A_k, update the estimates:
            F̂DR(π) := 1 − (n/(|π| t)) Σ_{s=1}^t Y_{I_s,s} 1{I_s ∈ π}
            T̂P(π′) − T̂P(π) := (n/t) (Σ_{s=1}^t Y′_{J_s,s} 1{J_s ∈ π′ \ π} − Σ_{s=1}^t Y′_{J_s,s} 1{J_s ∈ π \ π′})
        Set C_{k+1} = C_k ∪ {π ∈ A_k \ C_k : F̂DR(π) + C1(π, t, δ_k)/|π| ≤ α}.
        Set A_{k+1} = A_k.
        Remove any π from A_{k+1} and C_{k+1} such that one of the following conditions is true:
            1. F̂DR(π) − C1(π, t, δ_k)/|π| > α
            2. ∃π′ ∈ C_{k+1} with T̂P(π′) − T̂P(π) > C2(π, π′, t, δ_k); add any such π to a set R
        Remove any π from A_{k+1} and C_{k+1} such that:
            3. ∃π′ ∈ C_{k+1} ∪ R such that π ⊂ π′.
        Set S_{k+1} := ∪_{π∈A_{k+1}\C_{k+1}} π and T_{k+1} = ∪_{π∈A_{k+1}} π − ∩_{π∈A_{k+1}} π.
        k ← k + 1
    endif
    Stochastic noise: If |A_k| = 1, Break. Otherwise:
        Sample I_t ∼ Unif([n]). If I_t ∈ S_k, then receive a label Y_{I_t,t} ∼ Ber(η_{I_t}).
        Sample J_t ∼ Unif([n]). If J_t ∈ T_k, then receive a label Y′_{J_t,t} ∼ Ber(η_{J_t}).
    Persistent noise: If |A_k| = 1 or t > n, Break. Otherwise:
        Sample I_t uniformly from [n] \ {I_s : 1 ≤ s < t}. If I_t ∈ S_k, then receive a label Y_{I_t,t} = η_{I_t}.
        Sample J_t uniformly from [n] \ {J_s : 1 ≤ s < t}. If J_t ∈ T_k, then receive a label Y′_{J_t,t} = η_{J_t}.
Output: Return argmax_{π∈C_{k+1}} T̂P(π).

Motivating the sample complexity: As mentioned above, the general combinatorial bandit problem is considered in [11]. There they present an algorithm with sample complexity

C Σ_{i=1}^n max_{π : i∈π∆π*} (1/(|π∆π*| ∆̃_π²)) log( max(|B(|π∆π*|, π)|, |B(|π∆π*|, π*)|) n/δ ).

This complexity parameter is difficult to interpret directly, so we compare it to one more familiar in statistical learning, the VC dimension. To see how this sample complexity relates to ours in Theorem 1, note that log₂|B(k, π*)| ≤ log₂(n choose k) ≲ k log₂(n). Thus, by the Sauer-Shelah lemma, V(B(r, π*)) ≲ log₂(|B(r, π*)|) ≲ min{V(B(r, π*)), r} log₂(n), where ≲ hides a constant. 
The\nproof of the con\ufb01dence intervals in the supplementary effectively combines these two facts along\nwith a union bound over all sets in B(r, \u03c0\u2217).\n\n(cid:1) (cid:46) k log2(n). Thus by the Sauer-Shelah lemma,\n\n(cid:0)n\n\nk\n\n4 Combinatorial FDR Control\nAlgorithm 2 provides an active sampling method for determining \u03c0 \u2208 \u03a0 with F DR(\u03c0) \u2264 \u03b1\nand maximal T P R, which we denote as \u03c0\u2217\n\u03b1. Since T P R(\u03c0) = T P (\u03c0)/\u03b7[n], we can ignore the\ndenominator and so maximizing the T P R is the same as maximizing T P . The algorithm proceeds in\nepochs. At all times a collection Ak \u2286 \u03a0 of active sets is maintained along with a collection of FDR-\ncontrolled sets Ck \u2286 Ak. In each time step, random indexes It and Jt are sampled from the union\nSk = \u222a\u03c0\u2208Ak\\Ck \u03c0 and the symmetric difference Tk = \u222a\u03c0\u2208Ak \u03c0 \u2212 \u2229\u03c0\u2208Ak \u03c0 respectively. Associated\nrandom labels YIt,t, YJt,t \u2208 {0, 1} are then obtained from the underlying label distributions Ber(\u03b7It)\nand Ber(\u03b7Jt). At the start of each epoch, any set with a F DR that is statistically known to be\n\n7\n\nn(cid:88)\n\ni=1\n\nC\n\nmax\n\n\u03c0:i\u2208\u03c0\u2206\u03c0\u2217\n\n1\n\n|\u03c0\u2206\u03c0\u2217|\n\n1(cid:101)\u22062\n\n\u03c0\n\n(cid:16)\n\nmax(|B(|\u03c0\u2206\u03c0\u2217|, \u03c0)|,|B(|\u03c0\u2206\u03c0\u2217|, \u03c0\u2217)|)\n\nlog\n\n(cid:17)\n\nn\n\u03b4\n\n\fFigure 2: Example run of Algorithm 2, showing the evolution of sampling regions Sk (blue stripes), Tk (pink\nstripes) and FDR controlled sets Ck (orange \ufb01ll) at each time kt.\nunder \u03b1 is added to Ck, and any sets whose F DR are greater than \u03b1 are removed from Ak in\ncondition 1. 
Similar to the active classification algorithm of Figure 1, a set π ∈ A_k is removed in condition 2 if TP(π) is shown to be statistically less than TP(π′) for some π′ ∈ C_k that, crucially, is FDR-controlled; in general there may be many sets π ∈ Π with TP(π) > TP(π*_α) that are not FDR-controlled. Finally, in condition 3 we exploit the positivity of the η_i's: if π ⊂ π′ then deterministically TP(π) ≤ TP(π′), so if π′ is FDR-controlled it can be used to eliminate π. The choice of T_k is motivated by active classification: we only need to sample in the symmetric difference. To determine which sets are FDR-controlled it is important that we sample in the entirety of the union of all π ∈ A_k \ C_k, not just the symmetric difference, which motivates the choice of S_k. In practical experiments persistent noise is not uncommon, and it avoids the potentially unbounded sample complexities that can occur when FDR(π) ≈ α. Figure 2 demonstrates a model run of the algorithm in the case of five sets Π = {π_1, . . . , π_5}.
Recall that Π_α is the subset of Π that is FDR-controlled, so that π*_α = arg max_{π∈Π_α} TP(π). The following gives a sample complexity result for the number of rounds before the algorithm terminates.

Theorem 2 Assume that for each i ≤ n there is an associated η_i ∈ [0, 1] and {Y_{i,j}}_{j=1}^∞ is an i.i.d. sequence of random variables such that Y_{i,j} ∼ Ber(η_i). For any π ∈ Π define Δ_{π,α} = |FDR(π) − α| and \tilde{Δ}_π = |TP(π*_α) − TP(π)| / |π∆π*_α| = |TP(π*_α \ π) − TP(π \ π*_α)| / |π∆π*_α|, and

    s^FDR_π = (V_π |π| / Δ²_{π,α}) log( n log(Δ⁻²_{π,α}) / δ ),    s^TP_π = (V_{π,π*_α} / (|π∆π*_α| \tilde{Δ}²_π)) log( n log(\tilde{Δ}⁻²_π) / δ ).

In addition define

    T^FDR_π = min{ s^FDR_π, max{ s^TP_π, s^FDR_{π*_α} }, min_{π′∈Π_α : π⊂π′} s^FDR_{π′} }  and  T^TP_π = min{ max{ s^TP_π, s^FDR_{π*_α} }, min_{π′∈Π_α : π⊂π′} s^FDR_{π′} }.

Using C1(π, t, δ) := √( 4|π| n V_π log(n/δ) / t ) + 4 n V_π log(n/δ) / (3t) and C2 = C for C defined in Theorem 1, for a fixed constant c, with probability at least 1 − δ, in the stochastic noise setting Algorithm 2 returns π*_α after a number of samples no more than

    c Σ_{i=1}^n max_{π∈Π : i∈π} T^FDR_π  (FDR-Control)  +  c Σ_{i=1}^n max_{π∈Π_α : i∈π∆π*_α} T^TP_π  (TPR-Elimination),

and in the persistent noise setting returns π*_α after no more than c Σ_{i=1}^n min{ 1, max_{π∈Π : i∈π} T^FDR_π + max_{π∈Π_α : i∈π∆π*_α} T^TP_π }.

Though this result is complicated, each term is understood by
considering each way a set can be removed and the time at which an arm i stops being sampled. Effectively the sample complexity decomposes into two parts: the complexity of showing that a set is or is not FDR-controlled, and how long it takes to eliminate a set based on TPR. To motivate s^FDR_π: if we have a single set π, then 1/(|π| Δ²_{π,α}) roughly captures the number of times we must sample each element of π to decide whether it is FDR-controlled, so in the general case an arm i must be sampled roughly max_{π∈Π : i∈π} s^FDR_π times. However, a set can also be removed before it is shown to be FDR-controlled using the other conditions, which T^FDR_π captures. The term in the sample complexity for elimination using TPR is similarly motivated. We now unpack the underbraced terms more carefully, simultaneously explaining the sample complexity and the motivation for the proof of Theorem 2.
Sample Complexity of FDR-Control. In any round where there exists a set π ∈ A_k \ C_k with arm i ∈ π, i.e. π is not yet FDR-controlled, there is the potential for sampling i ∈ S_k. A set π only leaves A_k if (i) it is shown to not be FDR-controlled (condition 1 of the algorithm), (ii) an FDR-controlled set eliminates it on the basis of TP (condition 2), or (iii) it is contained in an FDR-controlled set (condition 3). These three cases reflect the three arguments of the min in the defined quantity T^FDR_π, respectively. Taking the maximum over all sets containing an arm i and summing over all i gives the total FDR-control term.
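To make the quantities being controlled concrete, the following Monte Carlo sketch checks that the estimators of Algorithm 2 are unbiased under uniform sampling. The η values and sets are made-up toy numbers, and for simplicity the sketch ignores the restriction of sampling to S_k and T_k:

```python
import random

random.seed(0)

# Made-up label probabilities and candidate sets (illustrative only).
eta = [0.9, 0.8, 0.1, 0.05, 0.02]   # eta_i = P(Y_i = 1)
n = len(eta)
pi, pi2 = {0, 1, 2}, {0, 1}

# Ground truth: FDR(pi) = 1 - (sum_{i in pi} eta_i)/|pi|, TP(pi) = sum_{i in pi} eta_i.
true_fdr = 1 - sum(eta[i] for i in pi) / len(pi)                   # = 0.4
true_tp_gap = sum(eta[i] for i in pi2) - sum(eta[i] for i in pi)   # TP(pi2)-TP(pi) = -0.1

# Estimators as in Algorithm 2: draw I_s ~ Unif([n]), observe Y_s ~ Ber(eta_{I_s}), and use
#   FDR_hat(pi)            = 1 - n/(|pi| t) * sum_s Y_s 1{I_s in pi}
#   TP_hat(pi2)-TP_hat(pi) = n/t * sum_s Y_s (1{I_s in pi2\pi} - 1{I_s in pi\pi2})
only2, only1 = pi2 - pi, pi - pi2
t = 400_000
hits, gap = 0, 0
for _ in range(t):
    i = random.randrange(n)
    y = 1 if random.random() < eta[i] else 0
    if i in pi:
        hits += y
    if i in only2:
        gap += y
    elif i in only1:
        gap -= y

fdr_hat = 1 - n * hits / (len(pi) * t)
tp_gap_hat = n * gap / t
assert abs(fdr_hat - true_fdr) < 0.02
assert abs(tp_gap_hat - true_tp_gap) < 0.02
```

Note that only samples landing in the relevant set (or symmetric difference) contribute, which is why Algorithm 2 need only request labels for indices falling in S_k or T_k.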
This adaptive maximum is a large savings relative to naive non-adaptive algorithms that sample until the FDR status of every set π ∈ Π is resolved, which would take O(n max_{π∈Π} s^FDR_π) samples.
Sample Complexity of TPR-Elimination. An FDR-controlled set π ∈ Π_α is only removed from C_k when it is eliminated by an FDR-controlled set with higher TP, or when it is contained in an FDR-controlled set. In general we can upper bound the former time by the number of samples needed for π*_α to eliminate π once we know that π*_α is FDR-controlled; this gives rise to the term max_{π∈Π_α : i∈π∆π*_α} T^TP_π. Note that sets are removed in a procedure mimicking active classification, and so the gains from active sampling there apply to this setting as well. A naive passive algorithm that continues to sample until both the FDR of every set is determined and π*_α has higher TP than every other FDR-controlled set incurs a significantly worse sample complexity of O(n max{max_{π∈Π_α} s^FDR_π, max_{π∉Π_α} s^TP_π}).
Comparison with [5]. Similar to our proposed algorithm, [5] samples in the union of all active sets and maintains statistics on the empirical FDR of each set, along the way removing sets that are not FDR-controlled or have lower TPR than an FDR-controlled set. However, they fail to sample in the symmetric difference, missing an important link between FDR-control and active classification; in particular, the confidence intervals they use are far looser as a result. They also only consider the case of persistent noise. Their proven sample complexity results are no better than those achieved by the passive algorithm that samples each item uniformly, which is precisely the sample complexity described at the end of the previous paragraph.
One Dimensional Thresholds. Consider a stylized modeling of the topology βαββ from the introduction in the persistent noise setting, where Π = {[t] : t ≤ n}, η_i ∼ Ber(β 1{i ≤ z}) with β < .5, and z ∈ [n] is assumed to be small, i.e., we assume that there is only a small region in which positive labels can be found and the Bayes classifier is to predict 0 for all points. Assuming α > 1 − β, one can show that the sample complexity of Algorithm 2 satisfies O((1−α)⁻² (log(n/(1−α)) + (1+β)z/(1−α))), while any naive non-adaptive sampling strategy takes at least O(n) samples.
Implementation. For simple classes Π such as thresholds or axis-aligned rectangles, our algorithm can be made computationally efficient, but for more complex classes there may be a wide gap between theory and practice, just as in classification [36, 20]. However, the algorithm motivates two key ideas: sample in the union of potentially good sets to learn which are FDR-controlled, and sample in the symmetric difference to eliminate sets. The latter insight was originally made by A² in the case of classification and has justified heuristics such as uncertainty sampling [36]. Developing analogous heuristics for the former case of FDR-control is an exciting avenue for future work.

References

[1] Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, and Dan Roth. Generalization bounds for the area under the ROC curve.
Journal of Machine Learning Research, 6(Apr):393\u2013\n425, 2005.\n\n[2] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal\n\nof Computer and System Sciences, 75(1):78\u201389, 2009.\n\n[3] R\u00e9mi Bardenet, Odalric-Ambrym Maillard, et al. Concentration inequalities for sampling\n\nwithout replacement. Bernoulli, 21(3):1361\u20131385, 2015.\n\n[4] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and\npowerful approach to multiple testing. Journal of the Royal statistical society: series B\n(Methodological), 57(1):289\u2013300, 1995.\n\n[5] Paul N Bennett, David M Chickering, Christopher Meek, and Xiaojin Zhu. Algorithms for\nactive classi\ufb01er selection: Maximizing recall with precision constraints. In Proceedings of the\nTenth ACM International Conference on Web Search and Data Mining, pages 711\u2013719. ACM,\n2017.\n\n[6] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning.\n\narXiv preprint arXiv:0812.4952, 2008.\n\n[7] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Pascal Massart. Concentration inequalities: A\n\nnonasymptotic theory of independence. Oxford university press, 2013.\n\n[8] Olivier Bousquet. A bennett concentration inequality and its application to suprema of empirical\n\nprocesses. Comptes Rendus Mathematique, 334(6):495\u2013500, 2002.\n\n[9] Kendrick Boyd, Kevin H Eng, and C David Page. Area under the precision-recall curve: Point\nestimates and con\ufb01dence intervals. In Joint European Conference on Machine Learning and\nKnowledge Discovery in Databases, pages 451\u2013466. Springer, 2013.\n\n[10] Diogo M Camacho, Katherine M Collins, Rani K Powers, James C Costello, and James J\n\nCollins. Next-generation machine learning for biological networks. Cell, 2018.\n\n[11] Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration:\nEf\ufb01cient algorithms and an analysis with localization. 
arXiv preprint arXiv:1711.08018, 2017.

[12] Rui M Castro et al. Adaptive sensing performance lower bounds for sparse signal detection and support estimation. Bernoulli, 20(4):2217–2246, 2014.

[13] Rui M Castro and Robert D Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

[14] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[15] Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In Conference on Learning Theory, pages 647–669, 2016.

[16] Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pages 482–534, 2017.

[17] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110, 2017.

[18] Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014.

[19] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pages 235–242, 2006.

[20] Sanjoy Dasgupta, Daniel J Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353–360, 2008.

[21] Devdatt P Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.

[22] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.
Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.

[23] Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, and Peter Bartlett. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pages 1004–1012, 2016.

[24] Steve Hanneke. Teaching dimension and the complexity of active learning. In International Conference on Computational Learning Theory, pages 66–81. Springer, 2007.

[25] Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.

[26] Tzu-Kuo Huang, Alekh Agarwal, Daniel J Hsu, John Langford, and Robert E Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pages 2755–2763, 2015.

[27] Kevin Jamieson and Lalit Jain. A bandit approach to multiple testing with false discovery control. In Advances in Neural Information Processing Systems, 2018.

[28] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.

[29] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.

[30] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.

[31] Armaghan W Naik, Joshua D Kangas, Devin P Sullivan, and Robert F Murphy. Active machine learning-driven experimentation to determine compound effects on protein patterns. eLife, 5:e10047, 2016.

[32] Robert D Nowak. The geometry of generalized binary search.
IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.

[33] Gabriel J Rocklin, Tamuka M Chidyausiku, Inna Goreshnik, Alex Ford, Scott Houliston, Alexander Lemak, Lauren Carter, Rashmi Ravichandran, Vikram K Mulligan, Aaron Chevalier, et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357(6347):168–175, 2017.

[34] Ashish Sabharwal and Yexiang Xue. Adaptive stratified sampling for precision-recall estimation. pages 825–834, 2018.

[35] Christoph Sawade, Niels Landwehr, and Tobias Scheffer. Active estimation of F-measures. In Advances in Neural Information Processing Systems, pages 2083–2091, 2010.

[36] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.

[37] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794–1834, 2017.

[38] Yuriy Sverchkov and Mark Craven. A review of active learning approaches to experimental design for uncovering biological networks. PLoS Computational Biology, 13(6):e1005466, 2017.

[39] Lorillee Tallorin, JiaLei Wang, Woojoo E Kim, Swagat Sahu, Nicolas M Kosa, Pu Yang, Matthew Thompson, Michael K Gilson, Peter I Frazier, Michael D Burkart, et al. Discovering de novo peptide substrates for enzymes using machine learning. Nature Communications, 9(1):5253, 2018.

[40] Fanny Yang, Aaditya Ramdas, Kevin G Jamieson, and Martin J Wainwright. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pages 5957–5966, 2017.

[41] Lu Zhang, Jianjun Tan, Dan Han, and Hao Zhu. From machine learning to deep learning: progress in machine intelligence for rational drug discovery.
Drug Discovery Today, 22(11):1680–1685, 2017.

[42] Martin J Zhang, James Zou, and David Tse. Adaptive Monte Carlo multiple testing via multi-armed bandits. arXiv preprint arXiv:1902.00197, 2019.
", "award": [], "sourceid": 7813, "authors": [{"given_name": "Lalit", "family_name": "Jain", "institution": "University of Washington"}, {"given_name": "Kevin", "family_name": "Jamieson", "institution": "U Washington"}]}