{"title": "Buy-in-Bulk Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2229, "page_last": 2237, "abstract": "In many practical applications of active learning, it is more cost-effective to request labels in large batches, rather than one-at-a-time. This is because the cost of labeling a large batch of examples at once is often sublinear in the number of examples in the batch. In this work, we study the label complexity of active learning algorithms that request labels in a given number of batches, as well as the tradeoff between the total number of queries and the number of rounds allowed. We additionally study the total cost sufficient for learning, for an abstract notion of the cost of requesting the labels of a given number of examples at once. In particular, we find that for sublinear cost functions, it is often desirable to request labels in large batches (i.e., buying in bulk); although this may increase the total number of labels requested, it reduces the total cost required for learning.", "full_text": "Buy-in-Bulk Active Learning\n\nLiu Yang\n\nMachine Learning Department,\n\nCarnegie Mellon University\n\nJaime Carbonell\n\nLanguage Technologies Institute,\n\nCarnegie Mellon University\n\nliuy@cs.cmu.edu\n\njgc@cs.cmu.edu\n\nAbstract\n\nIn many practical applications of active learning, it is more cost-effective to re-\nquest labels in large batches, rather than one-at-a-time. This is because the cost of\nlabeling a large batch of examples at once is often sublinear in the number of ex-\namples in the batch. In this work, we study the label complexity of active learning\nalgorithms that request labels in a given number of batches, as well as the tradeoff\nbetween the total number of queries and the number of rounds allowed. We ad-\nditionally study the total cost suf\ufb01cient for learning, for an abstract notion of the\ncost of requesting the labels of a given number of examples at once. 
In particular, we find that for sublinear cost functions, it is often desirable to request labels in large batches (i.e., buying in bulk); although this may increase the total number of labels requested, it reduces the total cost required for learning.

1 Introduction

In many practical applications of active learning, the cost to acquire a large batch of labels at once is significantly less than the cost of the same number of sequential rounds of individual label requests. This is true for both practical reasons (overhead time for start-up, reserving equipment in discrete time-blocks, multiple labelers working in parallel, etc.) and for computational reasons (e.g., the time to update the learner's hypothesis and select the next examples may be large). Consider making one vs. multiple hematological diagnostic tests on an out-patient. There are fixed up-front costs: bringing the patient in for testing, drawing and storing the blood, entering the information in the hospital record system, etc. And there are variable costs, per specific test. Consider a microarray assay for gene expression data. There is a fixed cost in setting up and running the microarray, but virtually no incremental cost in the number of samples, just a constraint on the maximum allowed. Either of the above conditions is often the case in scientific experiments (e.g., [1]). As a different example, consider calling a focused group of experts to address questions w.r.t. a new product design or introduction. There is a fixed cost in forming the group (determining membership, contract, travel, etc.), and an incremental per-question cost. 
The common abstraction in such real-world versions of "oracles" is that learning can buy-in-bulk to advantage because oracles charge either per batch (answering a batch of questions for the same cost as answering a single question, up to a batch maximum), or a cost per batch of c(x) = ax^p + b, where b is the set-up cost, x is the number of queries, and p = 1 or p < 1 (for the case where practice yields efficiency).

Often we have other tradeoffs, such as delay vs. testing cost. For instance, in a medical diagnosis case, the most cost-effective way to minimize diagnostic tests is purely sequential active learning, where each test may rule out a set of hypotheses (diagnoses) and informs the next test to perform. But a patient suffering from a serious disease may worsen while sequential tests are being conducted. Hence batch testing makes sense if the batch can be tested in parallel. In general, one can convert delay into a second cost factor and optimize for the batch size that minimizes a combination of total delay and the sum of the costs for the individual tests. Parallelizing means more tests would be needed, since we lack the benefit of earlier tests to rule out future ones. In order to perform this batch-size optimization we also need to estimate the number of redundant tests incurred by turning a sequence into a shorter sequence of batches.

For the reasons cited above, it can be very useful in practice to generalize active learning to active-batch learning, with buy-in-bulk discounts. 
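To make the batch cost model concrete, here is a small numeric sketch (our illustration, not from the paper) of the cost c(x) = ax^p + b, with illustrative constants a, p, b, comparing one-at-a-time querying against bulk batches:

```python
# Illustrative sketch (not from the paper): total cost of answering q queries
# under the batch cost model c(x) = a * x**p + b, with set-up cost b and p <= 1.

def batch_cost(x, a=1.0, p=0.5, b=5.0):
    """Cost of requesting x labels as a single batch."""
    return a * x**p + b

def total_cost(q, k, a=1.0, p=0.5, b=5.0):
    """Cost of answering q queries split into k equal batches."""
    per_batch = q // k
    return k * batch_cost(per_batch, a, p, b)

q = 1000
sequential = total_cost(q, q)   # one query per round: pays the set-up cost q times
bulk = total_cost(q, 10)        # ten batches of 100: pays the set-up cost ten times
print(sequential, bulk)         # prints 6000.0 150.0
```

With these hypothetical constants, one-at-a-time querying pays the set-up cost b a thousand times, while ten bulk batches pay it only ten times; this is the sense in which buying in bulk reduces total cost even when the number of labels is unchanged.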
This paper develops a theoretical framework exploring the bounds and sample complexity of active buy-in-bulk machine learning, and analyzing the tradeoff that can be achieved between the number of batches and the total number of queries required for accurate learning.

In another example, if we have many labelers (virtually unlimited) operating in parallel, but must pay for each query, and the amount of time to get back the answer to each query is considered independent with some distribution, it may often be the case that the expected amount of time needed to get back the answers to m queries is sublinear in m, so that if the "cost" is a function of both the payment amounts and the time, it might sometimes be less costly to submit multiple queries to be labeled in parallel. In scenarios such as those mentioned above, a batch mode active learning strategy is desirable, rather than a method that selects instances to be labeled one-at-a-time.

There have recently been several attempts to construct heuristic approaches to the batch mode active learning problem (e.g., [2]). However, theoretical analysis has been largely lacking. In contrast, there has recently been significant progress in understanding the advantages of fully-sequential active learning (e.g., [3, 4, 5, 6, 7]). In the present work, we are interested in extending the techniques used for the fully-sequential active learning model, studying natural analogues of them for the batch-mode active learning model.

Formally, we are interested in two quantities: the sample complexity and the total cost. The sample complexity refers to the number of label requests used by the algorithm. We expect batch-mode active learning methods to use more label requests than their fully-sequential cousins. 
On the other hand, if the cost to obtain a batch of labels is sublinear in the size of the batch, then we may sometimes expect the total cost used by a batch-mode learning method to be significantly less than that of the analogous fully-sequential algorithms, which request labels individually.

2 Definitions and Notation

As in the usual statistical learning problem, there is a standard Borel space X, called the instance space, and a set C of measurable classifiers h : X → {−1, +1}, called the concept space. Throughout, we suppose that the VC dimension of C, denoted d below, is finite.

In the learning problem, there is an unobservable distribution DXY over X × {−1, +1}. Based on this quantity, we let Z = {(X_t, Y_t)}_{t=1}^∞ denote an infinite sequence of independent DXY-distributed random variables. We also denote by Z_t = {(X_1, Y_1), (X_2, Y_2), . . . , (X_t, Y_t)} the first t such labeled examples. Additionally denote by DX the marginal distribution of DXY over X. For a classifier h : X → {−1, +1}, denote er(h) = P_{(X,Y)∼DXY}(h(X) ≠ Y), the error rate of h. Additionally, for m ∈ N and Q ∈ (X × {−1, +1})^m, let er(h; Q) = (1/|Q|) Σ_{(x,y)∈Q} I[h(x) ≠ y], the empirical error rate of h. In the special case that Q = Z_m, abbreviate er_m(h) = er(h; Q). For r > 0, define B(h, r) = {g ∈ C : DX(x : h(x) ≠ g(x)) ≤ r}. For any H ⊆ C, define DIS(H) = {x ∈ X : ∃h, g ∈ H s.t. h(x) ≠ g(x)}. We also denote by η(x) = P(Y = +1|X = x), where (X, Y) ∼ DXY, and let h*(x) = sign(η(x) − 1/2) denote the Bayes optimal classifier.

In the active learning protocol, the algorithm has direct access to the X_t sequence, but must request to observe each label Y_t, sequentially. The algorithm asks up to a specified number of label requests n (the budget), and then halts and returns a classifier. 
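As a concrete instance of these definitions (our illustration, not from the paper), consider threshold classifiers h_t(x) = sign(x − t) on X = [0, 1] with DX uniform, for which DX(x : h_s(x) ≠ h_t(x)) = |s − t|:

```python
# Hypothetical illustration: the ball B(h, r) and disagreement region DIS(H)
# for threshold classifiers h_t(x) = sign(x - t) on X = [0, 1], DX uniform.
# Under the uniform marginal, DX(x : h_s(x) != h_t(x)) = |s - t|.

def ball(t_star, r, thresholds):
    """B(h_{t*}, r): thresholds whose disagreement mass with t* is at most r."""
    return [t for t in thresholds if abs(t - t_star) <= r]

def dis(thresholds):
    """DIS(H) for a set of thresholds is the interval between its extremes."""
    return (min(thresholds), max(thresholds))

V = [0.2, 0.45, 0.5, 0.55, 0.8]
print(ball(0.5, 0.1, V))  # thresholds within disagreement mass 0.1 of t* = 0.5
print(dis(V))             # the region where some pair of classifiers in V disagrees
```

For this class, B(h_{t*}, r) is the set of thresholds within r of t*, and DIS(B(h_{t*}, r)) is an interval of probability mass at most 2r.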
We are particularly interested in determining, for a given algorithm, how large this number of label requests needs to be in order to guarantee small error rate with high probability, a value known as the label complexity. In the present work, we are also interested in the cost expended by the algorithm. Specifically, in this context, there is a cost function c : N → (0, ∞), and to request the labels {Y_{i_1}, Y_{i_2}, . . . , Y_{i_m}} of m examples {X_{i_1}, X_{i_2}, . . . , X_{i_m}} at once requires the algorithm to pay c(m); we are then interested in the sum of these costs, over all batches of label requests made by the algorithm. Depending on the form of the cost function, minimizing the cost of learning may actually require the algorithm to request labels in batches, which we expect would actually increase the total number of label requests.

To help quantify the label complexity and cost complexity, we make use of the following definition, due to [6, 7].

Definition 2.1. [6, 7] Define the disagreement coefficient of h* as

θ(ε) = sup_{r > ε} DX(DIS(B(h*, r))) / r.

3 Buy-in-Bulk Active Learning in the Realizable Case: k-batch CAL

We begin our analysis with the simplest case: namely, the realizable case, with a fixed prespecified number of batches. We are then interested in quantifying the label complexity for such a scenario. Formally, in this section we suppose h* ∈ C and er(h*) = 0. This is referred to as the realizable case. We first review a well-known method for active learning in the realizable case, referred to as CAL after its discoverers Cohn, Atlas, and Ladner [8].

Algorithm: CAL(n)
1. t ← 0, m ← 0, Q ← ∅
2. While t < n
3.   m ← m + 1
4.   If max_{y∈{−1,+1}} min_{h∈C} er(h; Q ∪ {(X_m, y)}) = 0
5.     Request Y_m, let Q ← Q ∪ {(X_m, Y_m)}, t ← t + 1
6. 
Return ĥ = argmin_{h∈C} er(h; Q)

The label complexity of CAL is known to be O(θ(ε)(d log(θ(ε)) + log(log(1/ε)/δ)) log(1/ε)) [7]. That is, some n of this size suffices to guarantee that, with probability 1 − δ, the returned classifier ĥ has er(ĥ) ≤ ε.

One particularly simple way to modify this algorithm to make it batch-based is to simply divide up the budget into equal batch sizes. This yields the following method, which we refer to as k-batch CAL, where k ∈ {1, . . . , n}.

Algorithm: k-batch CAL(n)
1. Let Q ← ∅, b ← 1, V ← C
2. For m = 1, 2, . . .
3.   If X_m ∈ DIS(V)
4.     Q ← Q ∪ {X_m}
5.   If |Q| = ⌊n/k⌋
6.     Request the labels of examples in Q
7.     Let L be the corresponding labeled examples
8.     V ← {h ∈ V : er(h; L) = 0}
9.     b ← b + 1 and Q ← ∅
10.    If b > k, Return any ĥ ∈ V

We expect the label complexity of k-batch CAL to somehow interpolate between passive learning (at k = 1) and the label complexity of CAL (at k = n). Indeed, the following theorem bounds the label complexity of k-batch CAL by a function that exhibits this interpolation behavior with respect to the known upper bounds for these two cases.

Theorem 3.1. In the realizable case, for some

λ(ε, δ) = O( k ε^{−1/k} θ(ε)^{1−1/k} (d log(1/ε) + log(1/δ)) ),

for any n ≥ λ(ε, δ), with probability at least 1 − δ, running k-batch CAL with budget n produces a classifier ĥ with er(ĥ) ≤ ε.

Proof. Let M = ⌊n/k⌋. Define V_0 = C and i_{0M} = 0. Generally, for b ≥ 1, let i_{b1}, i_{b2}, . . . , i_{bM} denote the indices i of the first M points X_i ∈ DIS(V_{b−1}) for which i > i_{(b−1)M}, and let V_b = {h ∈ V_{b−1} : ∀j ≤ M, h(X_{i_{bj}}) = h*(X_{i_{bj}})}. 
These correspond to the version space at the conclusion of batch b in the k-batch CAL algorithm.

Note that X_{i_{b1}}, . . . , X_{i_{bM}} are conditionally iid given V_{b−1}, with the distribution of X given X ∈ DIS(V_{b−1}). Thus, the PAC bound of [9] implies that, for some constant c ∈ (0, ∞), with probability ≥ 1 − δ/k,

V_b ⊆ B( h*, c ((d log(M/d) + log(k/δ))/M) P(DIS(V_{b−1})) ).

By a union bound, the above holds for all b ≤ k with probability ≥ 1 − δ; suppose this is the case. Since P(DIS(V_{b−1})) ≤ θ(ε) max{ε, max_{h∈V_{b−1}} er(h)}, and any b with max_{h∈V_{b−1}} er(h) ≤ ε would also have max_{h∈V_b} er(h) ≤ ε, we have

max_{h∈V_b} er(h) ≤ max{ ε, c ((d log(M/d) + log(k/δ))/M) θ(ε) max_{h∈V_{b−1}} er(h) }.

Noting that P(DIS(V_0)) ≤ 1 implies V_1 ⊆ B(h*, c (d log(M/d) + log(k/δ))/M), by induction we have

max_{h∈V_k} er(h) ≤ max( ε, (c (d log(M/d) + log(k/δ))/M)^k θ(ε)^{k−1} ).

For some constant c′ > 0, any M ≥ c′ (θ(ε)^{(k−1)/k} / ε^{1/k}) (d log(1/ε) + log(k/δ)) makes the right hand side ≤ ε. Since M = ⌊n/k⌋, it suffices to have n ≥ k(1 + c′ (θ(ε)^{(k−1)/k} / ε^{1/k}) (d log(1/ε) + log(k/δ))).

Theorem 3.1 has the property that, when the disagreement coefficient is small, the stated bound on the total number of label requests sufficient for learning is a decreasing function of k. This makes sense, since θ(ε) small would imply that fully-sequential active learning is much better than passive learning. 
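To make the role of k concrete, the following is a minimal simulation sketch (our construction, not an experiment from the paper) of k-batch CAL for the threshold class h_t(x) = sign(x − t) on [0, 1] with uniform DX, where realizable labels come from a true threshold t_star:

```python
# Minimal simulation sketch (our illustration, not from the paper): k-batch CAL
# for threshold classifiers h_t(x) = sign(x - t) on X = [0, 1] with uniform DX.
import random

def k_batch_cal(n, k, t_star, rng):
    # Version space V = {thresholds t in (lo, hi)}; DIS(V) = (lo, hi).
    lo, hi = 0.0, 1.0
    batch_size = n // k
    for _ in range(k):
        # As noted in the proof of Theorem 3.1, stream points landing in DIS(V)
        # are conditionally iid with the distribution of X given X in DIS(V),
        # which here is uniform on (lo, hi), so we sample them directly.
        batch = [rng.uniform(lo, hi) for _ in range(batch_size)]
        for x in batch:  # request all labels in this batch at once
            if x >= t_star:
                hi = min(hi, x)  # positive label: consistent thresholds lie below x
            else:
                lo = max(lo, x)  # negative label: consistent thresholds lie above x
    return (lo + hi) / 2  # return any h in V; take the midpoint

rng = random.Random(0)
for k in (1, 2, 5, 10):
    errors = [abs(k_batch_cal(50, k, 0.3, rng) - 0.3) for _ in range(200)]
    print(k, sum(errors) / len(errors))
```

Averaged over repeated runs, the error of the returned threshold decreases as k grows from 1 (passive behavior) toward larger values, mirroring the ε^{−1/k} dependence in Theorem 3.1.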
Small values of k correspond to more passive-like behavior, while larger values of\nk take fuller advantage of the sequential nature of active learning. In particular, when k = 1, we\nrecover a well-known label complexity bound for passive learning by empirical risk minimization\n[10]. In contrast, when k = log(1/\u01eb), the \u01eb\u22121/k factor is e (constant), and the rest of the bound is at\nmost O(\u03b8(\u01eb)(d log(1/\u01eb) + log(1/\u03b4)) log(1/\u01eb)), which is (up to a log factor) a well-known bound on\nthe label complexity of CAL for active learning [7] (a slight re\ufb01nement of the proof would in fact\nrecover the exact bound of [7] for this case); for k larger than log(1/\u01eb), the label complexity can only\nimprove; for instance, consider that upon reaching a given data point Xm in the data stream, if V is\nthe version space in k-batch CAL (for some k), and V \u2032 is the version space in 2k-batch CAL, then\nwe have V \u2032 \u2286 V (supposing n is a multiple of 2k), so that Xm \u2208 DIS(V \u2032) only if Xm \u2208 DIS(V ).\nNote that even k = 2 can sometimes provide signi\ufb01cant reductions in label complexity over passive\nlearning: for instance, by a factor proportional to 1/\u221a\u01eb in the case that \u03b8(\u01eb) is bounded by a \ufb01nite\nconstant.\n\n4 Batch Mode Active Learning with Tsybakov noise\n\nThe above analysis was for the realizable case. While this provides a particularly clean and simple\nanalysis, it is not suf\ufb01ciently broad to cover many realistic learning applications. To move beyond\nthe realizable case, we need to allow the labels to be noisy, so that er(h\u2217) > 0. One popular noise\nmodel in the statistical learning theory literature is Tsybakov noise, which is de\ufb01ned as follows.\nDe\ufb01nition 4.1. 
[11] The distribution DXY satisfies Tsybakov noise if h* ∈ C, and for some c1 > 0 and α ∈ [0, 1],

∀t > 0, P(|η(x) − 1/2| < t) < c1 t^{α/(1−α)};

equivalently, ∀h, P(h(x) ≠ h*(x)) ≤ c2 (er(h) − er(h*))^α, where c1 and c2 are constants.

Supposing DXY satisfies Tsybakov noise, we define a quantity

E_m = c3 ((d log(m/d) + log(km/δ)) / m)^{1/(2−α)},

based on a standard generalization bound for passive learning [12]. Specifically, [12] have shown that, for any V ⊆ C, with probability at least 1 − δ/(4km²),

sup_{h,g∈V} |(er(h) − er(g)) − (er_m(h) − er_m(g))| < E_m.    (1)

Consider the following modification of k-batch CAL, designed to be robust to Tsybakov noise. We refer to this method as k-batch Robust CAL, where k ∈ {1, . . . , n}.

Algorithm: k-batch Robust CAL(n)
1. Let Q ← ∅, b ← 1, V ← C, m_1 ← 0
2. For m = 1, 2, . . .
3.   If X_m ∈ DIS(V)
4.     Q ← Q ∪ {X_m}
5.   If |Q| = ⌊n/k⌋
6.     Request the labels of examples in Q
7.     Let L be the corresponding labeled examples
8.     V ← {h ∈ V : (er(h; L) − min_{g∈V} er(g; L)) ⌊n/k⌋/(m − m_b) ≤ E_{m−m_b}}
9.     b ← b + 1 and Q ← ∅
10.    m_b ← m
11.    If b > k, Return any ĥ ∈ V

Theorem 4.2. 
Under the Tsybakov noise condition, letting β = α/(2−α) and β̄ = Σ_{i=0}^{k−1} β^i, for some

λ(ε, δ) = O( k (1/ε)^{(2−α)/β̄} (c2 θ(c2 ε^α))^{1 − β^{k−1}/β̄} (d log(d/ε) + log(kd/(δε)))^{(1 + β β̄ − β^k)/β̄} ),

for any n ≥ λ(ε, δ), with probability at least 1 − δ, running k-batch Robust CAL with budget n produces a classifier ĥ with er(ĥ) − er(h*) ≤ ε.

Proof. Let M = ⌊n/k⌋. Define i_{0M} = 0 and V_0 = C. Generally, for b ≥ 1, let i_{b1}, i_{b2}, . . . , i_{bM} denote the indices i of the first M points X_i ∈ DIS(V_{b−1}) for which i > i_{(b−1)M}, and let Q_b = {(X_{i_{b1}}, Y_{i_{b1}}), . . . , (X_{i_{bM}}, Y_{i_{bM}})} and V_b = {h ∈ V_{b−1} : (er(h; Q_b) − min_{g∈V_{b−1}} er(g; Q_b)) M/(i_{bM} − i_{(b−1)M}) ≤ E_{i_{bM} − i_{(b−1)M}}}. These correspond to the set V at the conclusion of batch b in the k-batch Robust CAL algorithm.

For b ∈ {1, . . . , k}, (1) (applied under the conditional distribution given V_{b−1}, combined with the law of total probability) implies that ∀m > 0, letting Z_{b,m} = {(X_{i_{(b−1)M}+1}, Y_{i_{(b−1)M}+1}), . . . , (X_{i_{(b−1)M}+m}, Y_{i_{(b−1)M}+m})}, with probability at least 1 − δ/(4km²), if h* ∈ V_{b−1}, then er(h*; Z_{b,m}) − min_{g∈V_{b−1}} er(g; Z_{b,m}) < E_m, and every h ∈ V_{b−1} with er(h; Z_{b,m}) − min_{g∈V_{b−1}} er(g; Z_{b,m}) ≤ E_m has er(h) − er(h*) < 2E_m. By a union bound, this holds for all m ∈ N, with probability at least 1 − δ/(2k). In particular, this means it holds for m = i_{bM} − i_{(b−1)M}. 
But note that for this value of m, any h, g ∈ V_{b−1} have er(h; Z_{b,m}) − er(g; Z_{b,m}) = (er(h; Q_b) − er(g; Q_b)) M/m (since for every (x, y) ∈ Z_{b,m} \ Q_b, either both h and g make a mistake, or neither does). Thus if h* ∈ V_{b−1}, we have h* ∈ V_b as well, and furthermore sup_{h∈V_b} er(h) − er(h*) < 2E_{i_{bM} − i_{(b−1)M}}. By induction (over b) and a union bound, these are satisfied for all b ∈ {1, . . . , k} with probability at least 1 − δ/2. For the remainder of the proof, we suppose this 1 − δ/2 probability event occurs.

Next, we focus on lower bounding i_{bM} − i_{(b−1)M}, again by induction. As a base case, we clearly have i_{1M} − i_{0M} ≥ M. Now suppose some b ∈ {2, . . . , k} has i_{(b−1)M} − i_{(b−2)M} ≥ T_{b−1} for some T_{b−1}. Then, by the above, we have sup_{h∈V_{b−1}} er(h) − er(h*) < 2E_{T_{b−1}}. By the Tsybakov noise condition, this implies V_{b−1} ⊆ B(h*, c2 (2E_{T_{b−1}})^α), so that if sup_{h∈V_{b−1}} er(h) − er(h*) > ε, then P(DIS(V_{b−1})) ≤ θ(c2 ε^α) c2 (2E_{T_{b−1}})^α. Now note that the conditional distribution of i_{bM} − i_{(b−1)M} given V_{b−1} is a negative binomial random variable with parameters M and 1 − P(DIS(V_{b−1})) (that is, a sum of M Geometric(P(DIS(V_{b−1}))) random variables). A Chernoff bound (applied under the conditional distribution given V_{b−1}) implies that P(i_{bM} − i_{(b−1)M} < M/(2P(DIS(V_{b−1}))) | V_{b−1}) < e^{−M/6}. Thus, for V_{b−1} as above, with probability at least 1 − e^{−M/6}, i_{bM} − i_{(b−1)M} ≥ M/(2θ(c2 ε^α) c2 (2E_{T_{b−1}})^α). Thus, we can define T_b as in the right hand side, which thereby defines a recurrence. 
By induction, with probability at least 1 − ke^{−M/6} > 1 − δ/2,

i_{kM} − i_{(k−1)M} ≥ M^{β̄} (1/(4c2 θ(c2 ε^α)))^{β̄ − β^{k−1}} (1/(2(d log(M) + log(kM/δ))))^{β(β̄ − β^{k−1})}.

By a union bound, with probability 1 − δ, this occurs simultaneously with the above sup_{h∈V_k} er(h) − er(h*) < 2E_{i_{kM} − i_{(k−1)M}} bound. Combining these two results yields

sup_{h∈V_k} er(h) − er(h*) = O( ( (c2 θ(c2 ε^α))^{β̄ − β^{k−1}} (d log(M) + log(kM/δ))^{1 + β(β̄ − β^{k−1})} / M^{β̄} )^{1/(2−α)} ).

Setting this to ε and solving for n, we find that it suffices to have

M ≥ c4 (1/ε)^{(2−α)/β̄} (c2 θ(c2 ε^α))^{1 − β^{k−1}/β̄} (d log(d/ε) + log(kd/(δε)))^{(1 + β β̄ − β^k)/β̄},

for some constant c4 ∈ [1, ∞), which then implies the stated result.

Note: the threshold E_m in k-batch Robust CAL has a direct dependence on the parameters of the Tsybakov noise condition. We have expressed the algorithm in this way only to simplify the presentation. In practice, such information is not often available. However, we can replace E_m with a data-dependent local Rademacher complexity bound Ê_m, as in [7], which also satisfies (1), and satisfies (with high probability) Ê_m ≤ c′E_m, for some constant c′ ∈ [1, ∞) (see [13]). This modification
This modi\ufb01cation\nwould therefore provide essentially the same guarantee stated above (up to constant factors), with-\nout having any direct dependence on the noise parameters, and the analysis gets only slightly more\ninvolved to account for the con\ufb01dences in the concentration inequalities for these \u02c6Em estimators.\nA similar result can also be obtained for batch-based variants of other noise-robust disagreement-\nbased active learning algorithms from the literature (e.g., a variant of A2 [5] that uses updates based\non quantities related to these \u02c6Em estimators, in place of the traditional upper-bound/lower-bound\nconstruction, would also suf\ufb01ce).\n\nWhen k = 1, Theorem 4.2 matches the best results for passive learning (up to log factors), which\nare known to be minimax optimal (again, up to log factors). If we let k become large (while still\nconsidered as a constant), our result converges to the known results for one-at-a-time active learning\nwith RobustCAL (again, up to log factors) [7, 14]. Although those results are not always minimax\noptimal, they do represent the state-of-the-art in the general analysis of active learning, and they are\nreally the best we could hope for from basing our algorithm on RobustCAL.\n\n5 Buy-in-Bulk Solutions to Cost-Adaptive Active Learning\n\nThe above sections discussed scenarios in which we have a \ufb01xed number k of batches, and we\nsimply bounded the label complexity achievable within that constraint by considering a variant of\nCAL that uses k equal-sized batches. In this section, we take a slightly different approach to the\nproblem, by going back to one of the motivations for using batch-based active learning in the \ufb01rst\nplace: namely, sublinear costs for answering batches of queries at a time. 
If the cost of answering m queries at once is sublinear in m, then batch-based algorithms arise naturally from the problem of optimizing the total cost required for learning.

Formally, in this section, we suppose we are given a cost function c : (0, ∞) → (0, ∞), which is nondecreasing, satisfies c(αx) ≤ αc(x) (for x, α ∈ [1, ∞)), and further satisfies the condition that for every q ∈ N, ∃q′ ∈ N such that 2c(q) ≤ c(q′) ≤ 4c(q), which typically amounts to a kind of smoothness assumption. For instance, c(q) = √q would satisfy these conditions (as would many other smooth increasing concave functions); the latter assumption can be generalized to allow other constants, though we only study this case below for simplicity.

To understand the total cost required for learning in this model, we consider the following cost-adaptive modification of the CAL algorithm.

Algorithm: Cost-Adaptive CAL(C)
1. Q ← ∅, R ← DIS(C), V ← C, t ← 0
2. Repeat
3.   q ← 1
4.   Do until P(DIS(V)) ≤ P(R)/2
5.     Let q′ > q be minimal such that c(q′ − q) ≥ 2c(q)
6.     If c(q′ − q) + t > C, Return any ĥ ∈ V
7.     Request the labels of the next q′ − q examples in DIS(V)
8.     Update V by removing those classifiers inconsistent with these labels
9.     Let t ← t + c(q′ − q)
10.    q ← q′
11.  R ← DIS(V)

Note that the total cost expended by this method never exceeds the budget argument C. We have the following result on how large of a budget C is sufficient for this method to succeed.

Theorem 5.1. 
In the realizable case, for some

λ(ε, δ) = O( c( θ(ε) (d log(θ(ε)) + log(log(1/ε)/δ)) ) log(1/ε) ),

for any C ≥ λ(ε, δ), with probability at least 1 − δ, Cost-Adaptive CAL(C) returns a classifier ĥ with er(ĥ) ≤ ε.

Proof. Supposing an unlimited budget (C = ∞), let us determine how much cost the algorithm incurs prior to having sup_{h∈V} er(h) ≤ ε; this cost would then be a sufficient size for C to guarantee this occurs. First, note that h* ∈ V is maintained as an invariant throughout the algorithm. Also, note that if q is ever at least as large as O(θ(ε)(d log(θ(ε)) + log(1/δ′))), then as in the analysis for CAL [7], we can conclude (via the PAC bound of [9]) that with probability at least 1 − δ′,

sup_{h∈V} P(h(X) ≠ h*(X) | X ∈ R) ≤ 1/(2θ(ε)),

so that

sup_{h∈V} er(h) = sup_{h∈V} P(h(X) ≠ h*(X) | X ∈ R) P(R) ≤ P(R)/(2θ(ε)).

We know R = DIS(V′) for the set V′ which was the value of the variable V at the time this R was obtained. Supposing sup_{h∈V′} er(h) > ε, we know (by the definition of θ(ε)) that

P(R) ≤ P( DIS( B(h*, sup_{h∈V′} er(h)) ) ) ≤ θ(ε) sup_{h∈V′} er(h).

Therefore,

sup_{h∈V} er(h) ≤ (1/2) sup_{h∈V′} er(h).

In particular, this implies the condition in Step 4 will be satisfied if this happens while sup_{h∈V} er(h) > ε. But this condition can be satisfied at most ⌈log₂(1/ε)⌉ times while sup_{h∈V} er(h) > ε (since sup_{h∈V} er(h) ≤ P(DIS(V))). 
So with probability at least 1 − δ′⌈log₂(1/ε)⌉, as long as sup_{h∈V} er(h) > ε, we always have c(q) ≤ 4c(O(θ(ε)(d log(θ(ε)) + log(1/δ′)))) ≤ O(c(θ(ε)(d log(θ(ε)) + log(1/δ′)))). Letting δ′ = δ/⌈log₂(1/ε)⌉, this is 1 − δ. So for each round of the outer loop while sup_{h∈V} er(h) > ε, by summing the geometric series of cost values c(q′ − q) in the inner loop, we find the total cost incurred is at most O(c(θ(ε)(d log(θ(ε)) + log(log(1/ε)/δ)))). Again, there are at most ⌈log₂(1/ε)⌉ rounds of the outer loop while sup_{h∈V} er(h) > ε, so that the total cost incurred before we have sup_{h∈V} er(h) ≤ ε is at most O(c(θ(ε)(d log(θ(ε)) + log(log(1/ε)/δ))) log(1/ε)).

Comparing this result to the known label complexity of CAL, which is (from [7])

O( θ(ε) (d log(θ(ε)) + log(log(1/ε)/δ)) log(1/ε) ),

we see that the major factor, namely the O(θ(ε)(d log(θ(ε)) + log(log(1/ε)/δ))) factor, is now inside the argument to the cost function c(·). In particular, when this cost function is sublinear, we expect this bound to be significantly smaller than the cost required by the original fully-sequential CAL algorithm, which uses batches of size 1, so that there is a significant advantage to using this batch-mode active learning algorithm.

Again, this result is formulated for the realizable case for simplicity, but can easily be extended to the Tsybakov noise model as in the previous section. 
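To see how the inner loop's doubling rule (Steps 5 and 10 of Cost-Adaptive CAL) behaves, here is a small sketch (ours, using the example cost c(q) = √q from above) that computes the resulting batch sizes:

```python
# Sketch (our illustration, not from the paper): the inner-loop batch schedule
# of Cost-Adaptive CAL under the example cost function c(q) = sqrt(q).
import math

def cost(q):
    return math.sqrt(q)

def batch_schedule(q_target):
    """Batch sizes q' - q chosen so that c(q' - q) >= 2 c(q), until q >= q_target."""
    q, schedule = 1, []
    while q < q_target:
        qp = q + 1
        while cost(qp - q) < 2 * cost(q):  # minimal q' with c(q' - q) >= 2 c(q)
            qp += 1
        schedule.append(qp - q)
        q = qp
    return schedule

sched = batch_schedule(10**4)
print(sched)  # batch sizes grow geometrically: [4, 20, 100, 500, 2500, 12500]
print(sum(cost(b) for b in sched))  # total cost of the schedule
```

Because each new batch's cost is at least twice c(q) for the current q, the per-batch costs form a geometric series, and the total cost of an inner loop is dominated by its final batch; this is exactly the summation step in the proof of Theorem 5.1.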
In particular, by reasoning quite similar to that above, a cost-adaptive variant of the Robust CAL algorithm of [14] achieves error rate er(ĥ) − er(h*) ≤ ε with probability at least 1 − δ using a total cost

O( c( θ(c2 ε^α) c2² ε^{2α−2} d polylog(1/(εδ)) ) log(1/ε) ).

We omit the technical details for brevity. However, the idea is similar to that above, except that the update to the set V is now as in k-batch Robust CAL (with an appropriate modification to the δ-related logarithmic factor in E_m), rather than simply retaining those classifiers making no mistakes. The proof then follows analogously to that of Theorem 5.1, the only major change being that now we bound the number of unlabeled examples processed in the inner loop before sup_{h∈V} P(h(X) ≠ h*(X)) ≤ P(R)/(2θ); letting V′ be the previous version space (the one for which R = DIS(V′)), we have P(R) ≤ θ c2 (sup_{h∈V′} er(h) − er(h*))^α, so that it suffices to have sup_{h∈V} P(h(X) ≠ h*(X)) ≤ (c2/2)(sup_{h∈V′} er(h) − er(h*))^α, and for this it suffices to have sup_{h∈V} er(h) − er(h*) ≤ 2^{−1/α} (sup_{h∈V′} er(h) − er(h*)); by inverting E_m, we find that it suffices to have a number of samples Õ((2^{−1/α}(sup_{h∈V′} er(h) − er(h*)))^{α−2} d). Since the number of label requests among m samples in the inner loop is roughly Õ(m P(R)) ≤ Õ(m θ c2 (sup_{h∈V′} er(h) − er(h*))^α), the batch size needed to make sup_{h∈V} P(h(X) ≠ h*(X)) ≤ P(R)/(2θ) is at most Õ(θ c2 2^{2/α} (sup_{h∈V′} er(h) − er(h*))^{2α−2} d). 
When $\sup_{h \in V'} er(h) - er(h^*) > \epsilon$, this is $\tilde{O}\left(\theta c_2 2^{2/\alpha} \epsilon^{2\alpha - 2} d\right)$. If $\sup_{h \in V} P(h(X) \neq h^*(X)) \leq P(R)/(2\theta)$ is ever satisfied, then by the same reasoning as above, the update condition in Step 4 would be satisfied. Again, this update can be satisfied at most $\log(1/\epsilon)$ times before achieving $\sup_{h \in V} er(h) - er(h^*) \leq \epsilon$.

6 Conclusions

We have seen that the analysis of active learning can be adapted to the setting in which labels are requested in batches. We studied this in two related models of learning. In the first case, we supposed the number $k$ of batches is specified, and we analyzed the number of label requests used by an algorithm that requested labels in $k$ equal-sized batches. As a function of $k$, this label complexity became closer to that of the analogous results for fully-sequential active learning for larger values of $k$, and closer to the label complexity of passive learning for smaller values of $k$, as one would expect. Our second model was based on a notion of the cost to request the labels of a batch of a given size. We studied an active learning algorithm designed for this setting, and found that the total cost used by this algorithm may often be significantly smaller than that used by the analogous fully-sequential active learning methods, particularly when the cost function is sublinear.

There are many active learning algorithms in the literature that can be described (or analyzed) in terms of batches of label requests. For instance, this is the case for the margin-based active learning strategy explored by [15]. Here we have only studied variants of CAL (and its noise-robust generalization).
However, one could also apply this style of analysis to other methods, to investigate analogous questions of how the label complexities of such methods degrade as the batch sizes increase, or how such methods might be modified to account for a sublinear cost function, and what results one might obtain on the total cost of learning with these modified methods. This could potentially be a fruitful future direction for the study of batch-mode active learning.

The tradeoff between the total number of queries and the number of rounds examined in this paper is natural to study. Similar tradeoffs have been studied in other contexts. In any two-party communication task, there are three measures of complexity that are typically used: communication complexity (the total number of bits exchanged), round complexity (the number of rounds of communication), and time complexity. The classic work [16] considered the tradeoffs between communication complexity and rounds of communication. [17] studies the tradeoffs among all three of communication complexity, round complexity, and time complexity. Interested readers may wish to go beyond the present work and study the tradeoffs among all three measures of complexity for batch-mode active learning.

References

[1] V. S. Sheng and C. X. Ling. Feature value acquisition in testing: a sequential batch test algorithm. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

[2] S. Chakraborty, V. Balasubramanian, and S. Panchanathan. An optimization based framework for dynamic batch mode active learning. In Advances in Neural Information Processing Systems, 2010.

[3] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.

[4] S. Dasgupta. Coarse sample complexity bounds for active learning.
In Advances in Neural Information Processing Systems 18, 2005.

[5] M. F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

[6] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.

[7] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

[8] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

[9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[10] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[11] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829, 1999.

[12] P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

[13] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

[14] S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. Journal of Machine Learning Research, 13(5):1469–1587, 2012.

[15] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Proceedings of the 20th Conference on Learning Theory, 2007.

[16] C. H. Papadimitriou and M. Sipser. Communication complexity. Journal of Computer and System Sciences, 28(2):260–269, 1984.

[17] P. Harsha, Y. Ishai, J. Kilian, K. Nissim, and S. Venkatesh. Communication versus computation.
In Proceedings of the 31st International Colloquium on Automata, Languages and Programming, pages 745–756, 2004.