{"title": "Beyond Disagreement-Based Agnostic Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 442, "page_last": 450, "abstract": "We study agnostic active learning, where the goal is to learn a classifier in a pre-specified hypothesis class interactively with as few label queries as possible, while making no assumptions on the true function generating the labels. The main algorithms for this problem are {\\em{disagreement-based active learning}}, which has a high label requirement, and {\\em{margin-based active learning}}, which only applies to fairly restricted settings. A major challenge is to find an algorithm which achieves better label complexity, is consistent in an agnostic setting, and applies to general classification problems. In this paper, we provide such an algorithm. Our solution is based on two novel contributions -- a reduction from consistent active learning to confidence-rated prediction with guaranteed error, and a novel confidence-rated predictor.", "full_text": "Beyond Disagreement-based Agnostic Active\n\nLearning\n\nChicheng Zhang\n\nUniversity of California, San Diego\n\n9500 Gilman Drive, La Jolla, CA 92093\n\nKamalika Chaudhuri\n\nUniversity of California, San Diego\n\n9500 Gilman Drive, La Jolla, CA 92093\n\nchichengzhang@ucsd.edu\n\nkamalika@cs.ucsd.edu\n\nAbstract\n\nWe study agnostic active learning, where the goal is to learn a classi\ufb01er in a pre-\nspeci\ufb01ed hypothesis class interactively with as few label queries as possible, while\nmaking no assumptions on the true function generating the labels. The main al-\ngorithm for this problem is disagreement-based active learning, which has a high\nlabel requirement. Thus a major challenge is to \ufb01nd an algorithm which achieves\nbetter label complexity, is consistent in an agnostic setting, and applies to general\nclassi\ufb01cation problems.\nIn this paper, we provide such an algorithm. 
Our solution is based on two novel\ncontributions; \ufb01rst, a reduction from consistent active learning to con\ufb01dence-rated\nprediction with guaranteed error, and second, a novel con\ufb01dence-rated predictor.\n\n1\n\nIntroduction\n\nIn this paper, we study active learning of classi\ufb01ers in an agnostic setting, where no assumptions\nare made on the true function that generates the labels. The learner has access to a large pool of\nunlabelled examples, and can interactively request labels for a small subset of these; the goal is to\nlearn an accurate classi\ufb01er in a pre-speci\ufb01ed class with as few label queries as possible. Speci\ufb01cally,\nwe are given a hypothesis class H and a target \ufffd, and our aim is to \ufb01nd a binary classi\ufb01er in H\nwhose error is at most \ufffd more than that of the best classi\ufb01er in H, while minimizing the number of\nrequested labels.\nThere has been a large body of previous work on active learning; see the surveys by [10, 28] for\noverviews. The main challenge in active learning is ensuring consistency in the agnostic setting\nwhile still maintaining low label complexity. In particular, a very natural approach to active learning\nis to view it as a generalization of binary search [17, 9, 27]. While this strategy has been extended\nto several different noise models [23, 27, 26], it is generally inconsistent in the agnostic case [11].\nThe primary algorithm for agnostic active learning is called disagreement-based active learning.\nThe main idea is as follows. A set Vk of possible risk minimizers is maintained with time, and the\nlabel of an example x is queried if there exist two hypotheses h1 and h2 in Vk such that h1(x) \ufffd=\nh2(x). This algorithm is consistent in the agnostic setting [7, 2, 12, 18, 5, 19, 6, 24]; however, due\nto the conservative label query policy, its label requirement is high. 
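The conservative query rule just described (query x whenever two surviving hypotheses disagree on it) is easy to state concretely. Below is a minimal illustrative sketch over a finite threshold class in the realizable case; all names are ours and this is not the algorithm analyzed in this paper, only the disagreement-based baseline it improves on:

```python
# Sketch of the disagreement-based label query rule: query x only if two
# surviving hypotheses h1, h2 in V disagree on it (h1(x) != h2(x)).

def make_threshold(t):
    return lambda x: 1 if x >= t else -1

hypotheses = [make_threshold(t / 10) for t in range(11)]  # thresholds 0.0 .. 1.0

def in_disagreement_region(V, x):
    labels = {h(x) for h in V}
    return len(labels) > 1  # some pair in V assigns x different labels

V = list(hypotheses)
true_label = make_threshold(0.35)         # labeling function (realizable case)
pool = [0.05, 0.2, 0.5, 0.8, 0.32, 0.38]  # unlabelled pool

queries = 0
for x in pool:
    if in_disagreement_region(V, x):      # conservative rule: any disagreement => query
        y = true_label(x)                 # labelling oracle
        queries += 1
        V = [h for h in V if h(x) == y]   # keep only consistent hypotheses
# Points outside the disagreement region cost nothing; points inside always do,
# which is the source of the high label requirement discussed above.
```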
A line of work due to [3, 4, 1] has provided algorithms that achieve better label complexity for linear classification on the uniform distribution over the unit sphere as well as log-concave distributions; however, their algorithms are limited to these specific cases, and it is unclear how to apply them more generally.
Thus, a major challenge in the agnostic active learning literature has been to find a general active learning strategy that applies to any hypothesis class and data distribution, is consistent in the agnostic case, and has a better label requirement than disagreement-based active learning. This has been mentioned as an open problem by several works, such as [2, 10, 4].

In this paper, we provide such an algorithm. Our solution is based on two key contributions, which may be of independent interest. The first is a general connection between confidence-rated predictors and active learning. A confidence-rated predictor is one that is allowed to abstain from prediction on occasion, and as a result, can guarantee a target prediction error. Given a confidence-rated predictor with guaranteed error, we show how to construct an active label query algorithm consistent in the agnostic setting. Our second key contribution is a novel confidence-rated predictor with guaranteed error that applies to any general classification problem. We show that our predictor is optimal in the realizable case, in the sense that it has the lowest abstention rate out of all predictors guaranteeing a certain error. Moreover, we show how to extend our predictor to the agnostic setting. Combining the label query algorithm with our novel confidence-rated predictor, we get a general active learning algorithm consistent in the agnostic setting. 
We provide a characterization of the label complexity of our algorithm, and show that it is better than the bounds known for disagreement-based active learning in general. Finally, we show that for linear classification with respect to the uniform distribution and log-concave distributions, our bounds reduce to those of [3, 4].

2 Algorithm

2.1 The Setting

We study active learning for binary classification. Examples belong to an instance space X, and their labels lie in a label space Y = {−1, 1}; labelled examples are drawn from an underlying data distribution D on X × Y. We use DX to denote the marginal of D on X, and DY|X to denote the conditional distribution on Y|X = x induced by D. Our algorithm has access to examples through two oracles – an example oracle U which returns an unlabelled example x ∈ X drawn from DX and a labelling oracle O which returns the label y of an input x ∈ X drawn from DY|X.
Given a hypothesis class H of VC dimension d, the error of any h ∈ H with respect to a data distribution Π over X × Y is defined as errΠ(h) = P_{(x,y)∼Π}(h(x) ≠ y). We define: h∗(Π) = argmin_{h∈H} errΠ(h), ν∗(Π) = errΠ(h∗(Π)). For a set S, we abuse notation and use S to also denote the uniform distribution over the elements of S. We define PΠ(·) := P_{(x,y)∼Π}(·), EΠ(·) := E_{(x,y)∼Π}(·).
Given access to examples from a data distribution D through an example oracle U and a labeling oracle O, we aim to provide a classifier ĥ ∈ H such that with probability ≥ 1 − δ, errD(ĥ) ≤ ν∗(D) + ε, for some target values of ε and δ; this is achieved in an adaptive manner by making as few queries to the labelling oracle O as possible. 
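The quantities errΠ(h), h∗(Π) and ν∗(Π) above are straightforward to mirror in code; the following toy sketch uses a finite threshold class and a small labelled sample standing in for Π (everything here, names included, is an illustrative stand-in for the abstract setting, not part of the paper's algorithm):

```python
# Empirical versions of err_Pi(h) and h*(Pi) for a finite hypothesis class,
# with the sample S doubling as the uniform distribution over its elements.

def err(h, S):
    """Empirical error of h on labelled sample S, mirroring err_Pi(h)."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def best_in_class(H, S):
    """Empirical risk minimizer, mirroring h*(Pi) = argmin_h err_Pi(h)."""
    return min(H, key=lambda h: err(h, S))

# Toy class: thresholds on [0, 1]; one label is "noise", so we are agnostic.
H = [(lambda t: (lambda x: 1 if x >= t else -1))(t / 4) for t in range(5)]
S = [(0.1, -1), (0.3, -1), (0.6, 1), (0.9, 1), (0.2, 1)]  # (0.2, 1) is noise

h_star = best_in_class(H, S)
nu_star = err(h_star, S)  # best achievable error nu*(Pi) within H; here 0.2
```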
When ν∗(D) = 0, we are said to be in the realizable case; in the more general agnostic case, we make no assumptions on the labels, and thus ν∗(D) can be positive.
Previous approaches to agnostic active learning have frequently used the notion of disagreements. The disagreement between two hypotheses h1 and h2 with respect to a data distribution Π is the fraction of examples according to Π to which h1 and h2 assign different labels; formally: ρΠ(h1, h2) = P_{(x,y)∼Π}(h1(x) ≠ h2(x)). Observe that a data distribution Π induces a pseudo-metric ρΠ on the elements of H; this is called the disagreement metric. For any r and any h ∈ H, define BΠ(h, r) to be the disagreement ball of radius r around h with respect to the data distribution Π. Formally: BΠ(h, r) = {h′ ∈ H : ρΠ(h, h′) ≤ r}.
For notational simplicity, we assume that the hypothesis space is “dense” with respect to the data distribution D, in the sense that ∀r > 0, sup_{h∈BD(h∗(D),r)} ρD(h, h∗(D)) = r. Our analysis will still apply without the denseness assumption, but will be significantly messier. Finally, given a set of hypotheses V ⊆ H, the disagreement region of V is the set of all examples x such that there exist two hypotheses h1, h2 ∈ V for which h1(x) ≠ h2(x).
This paper establishes a connection between active learning and confidence-rated predictors with guaranteed error. A confidence-rated predictor is a prediction algorithm that is occasionally allowed to abstain from classification. We will consider such predictors in the transductive setting. Given a set V of candidate hypotheses, an error guarantee η, and a set U of unlabelled examples, a confidence-rated predictor P either assigns a label or abstains from prediction on each unlabelled 
The labels are assigned with the guarantee that the expected disagreement1 between the\nlabel assigned by P and any h \u2208 V is \u2264 \u03b7. Speci\ufb01cally,\n\nfor all h \u2208 V, Px\u223cU (h(x) \ufffd= P (x), P (x) \ufffd= 0) \u2264 \u03b7\n\n(1)\nThis ensures that if some h\u2217 \u2208 V is the true risk minimizer, then, the labels predicted by P on U do\nnot differ very much from those predicted by h\u2217. The performance of a con\ufb01dence-rated predictor\nwhich has a guarantee such as in Equation (1) is measured by its coverage, or the probability of\nnon-abstention Px\u223cU (P (x) \ufffd= 0); higher coverage implies better performance.\n2.2 Main Algorithm\n\nOur active learning algorithm proceeds in epochs, where the goal of epoch k is to achieve excess\ngeneralization error \ufffdk = \ufffd2k0\u2212k+1, by querying a fresh batch of labels. The algorithm maintains a\ncandidate set Vk that is guaranteed to contain the true risk minimizer.\nThe critical decision at each epoch is how to select a subset of unlabelled examples whose labels\nshould be queried. We make this decision using a con\ufb01dence-rated predictor P . At epoch k, we run\nP with candidate hypothesis set V = Vk and error guarantee \u03b7 = \ufffdk/64. Whenever P abstains, we\nquery the label of the example. The number of labels mk queried is adjusted so that it is enough to\nachieve excess generalization error \ufffdk+1.\nAn outline is described in Algorithm 1; we next discuss each individual component in detail.\n\ncon\ufb01dence-rated predictor P , target excess error \ufffd and target con\ufb01dence \u03b4.\n\nAlgorithm 1 Active Learning Algorithm: Outline\n1: Inputs: Example oracle U, Labelling oracle O, hypothesis class H of VC dimension d,\n2: Set k0 = \ufffdlog 1/\ufffd\ufffd. 
Initialize candidate set V1 = H.
3: for k = 1, 2, . . . , k0 do
4: Set εk = ε2^(k0−k+1), δk = δ/(2(k0−k+1)²).
5: Call U to generate a fresh unlabelled sample Uk = {zk,1, . . . , zk,nk} of size nk = 192(512/εk)²(d ln 192(512/εk)² + ln(288/δk)).
6: Run confidence-rated predictor P with input V = Vk, U = Uk and error guarantee η = εk/64 to get abstention probabilities γk,1, . . . , γk,nk on the examples in Uk. These probabilities induce a distribution Γk on Uk. Let φk = P_{x∼Uk}(P(x) = 0) = (1/nk) Σ_{i=1}^{nk} γk,i.
7: if in the Realizable Case then
8: Let mk = (1536φk/εk)(d ln(1536φk/εk) + ln(48/δk)). Draw mk i.i.d examples from Γk and query O for labels of these examples to get a labelled data set Sk. Update Vk+1 using Sk: Vk+1 := {h ∈ Vk : h(x) = y, for all (x, y) ∈ Sk}.
9: else
10: In the non-realizable case, use Algorithm 2 with inputs hypothesis set Vk, distribution Γk, target excess error εk/(8φk), target confidence δk/2, and the labeling oracle O to get a new hypothesis set Vk+1.
11: return an arbitrary ĥ ∈ Vk0+1.

Candidate Sets. At epoch k, we maintain a set Vk of candidate hypotheses guaranteed to contain the true risk minimizer h∗(D) (w.h.p). In the realizable case, we use a version space as our candidate set. The version space with respect to a set S of labelled examples is the set of all h ∈ H such that h(xi) = yi for all (xi, yi) ∈ S.
Lemma 1. Suppose we run Algorithm 1 in the realizable case with inputs example oracle U, labelling oracle O, hypothesis class H, confidence-rated predictor P, target excess error ε and target confidence δ. Then, with probability 1, h∗(D) ∈ Vk, for all k = 1, 2, . . .
, k0 + 1.
In the non-realizable case, the version space is usually empty; we use instead a (1 − α)-confidence set for the true risk minimizer. Given a set S of n labelled examples, let C(S) ⊆ H be a function of S; C(S) is said to be a (1 − α)-confidence set for the true risk minimizer if for all data distributions Δ over X × Y,

P_{S∼Δⁿ}[h∗(Δ) ∈ C(S)] ≥ 1 − α.

1 where the expectation is with respect to the random choices made by P

Recall that h∗(Δ) = argmin_{h∈H} errΔ(h). In the non-realizable case, our candidate sets are (1 − α)-confidence sets for h∗(D), for α = δ. The precise setting of Vk is explained in Algorithm 2.
Lemma 2. Suppose we run Algorithm 1 in the non-realizable case with inputs example oracle U, labelling oracle O, hypothesis class H, confidence-rated predictor P, target excess error ε and target confidence δ. Then with probability 1 − δ, h∗(D) ∈ Vk, for all k = 1, 2, . . . , k0 + 1.
Label Query. We next discuss our label query procedure – which examples should we query labels for, and how many labels should we query at each epoch?

Which Labels to Query? Our goal is to query the labels of the most informative examples. To choose these examples while still maintaining consistency, we use a confidence-rated predictor P with guaranteed error. The inputs to the predictor are our candidate hypothesis set Vk, which contains (w.h.p) the true risk minimizer, a fresh set Uk of unlabelled examples, and an error guarantee η = εk/64. For notational simplicity, assume the elements in Uk are distinct. The output is a sequence of abstention probabilities {γk,1, γk,2, . . . , γk,nk}, one for each example in Uk. 
It induces a distribution Γk over Uk, from which we independently draw examples for label queries.

How Many Labels to Query? The goal of epoch k is to achieve excess generalization error εk. To achieve this, passive learning requires Õ(d/εk) labelled examples2 in the realizable case, and Õ(d(ν∗(D) + εk)/εk²) examples in the agnostic case. A key observation in this paper is that in order to achieve excess generalization error εk on D, it suffices to achieve a much larger excess generalization error O(εk/φk) on the data distribution induced by Γk and DY|X, where φk is the fraction of examples on which the confidence-rated predictor abstains.
In the realizable case, we achieve this by sampling mk = (1536φk/εk)(d ln(1536φk/εk) + ln(48/δk)) i.i.d examples from Γk, and querying their labels to get a labelled dataset Sk. Observe that as φk is the abstention probability of P with guaranteed error ≤ εk/64, it is generally smaller than the measure of the disagreement region of the version space; this key fact results in improved label complexity over disagreement-based active learning. This sampling procedure has the following property:
Lemma 3. Suppose we run Algorithm 1 in the realizable case with inputs example oracle U, labelling oracle O, hypothesis class H, confidence-rated predictor P, target excess error ε and target confidence δ. Then with probability 1 − δ, for all k = 1, 2, . . . , k0 + 1, and for all h ∈ Vk, errD(h) ≤ εk. In particular, the ĥ returned at the end of the algorithm satisfies errD(ĥ) ≤ ε.
The agnostic case has an added complication – in practice, the value of ν∗ is not known ahead of time. 
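Before turning to that complication, the realizable-case query step described above (draw mk examples from Γk in proportion to the abstention probabilities, then prune the version space) can be sketched as follows; the abstention probabilities are taken as given, and the class, pool and names are all illustrative:

```python
import random

# Sketch of the realizable-case query step: sample labelled examples from the
# distribution Gamma_k induced by the abstention probabilities, then keep only
# hypotheses consistent with every queried label.

def query_step(V, U, gammas, label_oracle, m_k, rng):
    total = sum(gammas)
    if total == 0:                       # predictor never abstains: nothing to query
        return V
    weights = [g / total for g in gammas]
    S_k = []
    for _ in range(m_k):                 # m_k i.i.d. draws from Gamma_k
        x = rng.choices(U, weights=weights, k=1)[0]
        S_k.append((x, label_oracle(x)))
    # version-space update: V_{k+1} = {h in V_k : h(x) = y for all (x, y) in S_k}
    return [h for h in V if all(h(x) == y for x, y in S_k)]

# Toy instantiation: thresholds on [0, 1]; abstention concentrated near the
# decision boundary, where the surviving hypotheses still disagree.
V = [(lambda t: (lambda x: 1 if x >= t else -1))(t / 20) for t in range(21)]
U = [i / 20 for i in range(21)]
gammas = [1.0 if 0.3 <= x <= 0.7 else 0.0 for x in U]
oracle = lambda x: 1 if x >= 0.45 else -1   # true threshold lies in the class

rng = random.Random(0)
V_next = query_step(V, U, gammas, oracle, m_k=30, rng=rng)
# The true hypothesis is consistent with every queried label, so it survives.
```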
Inspired by [24], we use a doubling procedure (stated in Algorithm 2) which adaptively finds the number mk of labelled examples to be queried and queries them. The following two lemmas illustrate its properties – that it is consistent, and that it does not use too many label queries.
Lemma 4. Suppose we run Algorithm 2 with inputs hypothesis set V, example distribution Δ, labelling oracle O, target excess error ε̃ and target confidence δ̃. Let Δ̃ be the joint distribution on X × Y induced by Δ and DY|X. Then there exists an event Ẽ, P(Ẽ) ≥ 1 − δ̃, such that on Ẽ, (1) Algorithm 2 halts and (2) the set Vj0 has the following properties:
(2.1) If for h ∈ H, errΔ̃(h) − errΔ̃(h∗(Δ̃)) ≤ ε̃/2, then h ∈ Vj0.
(2.2) On the other hand, if h ∈ Vj0, then errΔ̃(h) − errΔ̃(h∗(Δ̃)) ≤ ε̃.
When event Ẽ happens, we say Algorithm 2 succeeds.
Lemma 5. Suppose we run Algorithm 2 with inputs hypothesis set V, example distribution Δ, labelling oracle O, target excess error ε̃ and target confidence δ̃. There exists some absolute constant c1 > 0, such that on the event that Algorithm 2 succeeds, nj0 ≤ c1(d ln(1/ε̃) + ln(1/δ̃))(ν∗(Δ̃) + ε̃)/ε̃². Thus the total number of labels queried is Σ_{j=1}^{j0} nj ≤ 2nj0 ≤ 2c1(d ln(1/ε̃) + ln(1/δ̃))(ν∗(Δ̃) + ε̃)/ε̃².

2 Õ(·) hides logarithmic factors

A naive approach (see Algorithm 4 in the Appendix) which uses an additive VC bound gives a sample complexity of O((d ln(1/ε̃) + ln(1/δ̃))ε̃⁻²); Algorithm 2 gives a better sample complexity. The following lemma is a consequence of our label query procedure in the non-realizable case.
Lemma 6. Suppose we run Algorithm 1 in the non-realizable case with inputs example oracle U, labelling oracle O, hypothesis class H, confidence-rated predictor P, target excess error ε and target confidence δ. Then with probability 1 − δ, for all k = 1, 2, . . . , k0 + 1, and for all h ∈ Vk, errD(h) ≤ errD(h∗(D)) + εk. In particular, the ĥ returned at the end of the algorithm satisfies errD(ĥ) ≤ errD(h∗(D)) + ε.

Algorithm 2 An Adaptive Algorithm for Label Query Given Target Excess Error
1: Inputs: Hypothesis set V of VC dimension d, Example distribution Δ, Labeling oracle O, target excess error ε̃, target confidence δ̃.
2: for j = 1, 2, . . . do
3: Draw nj = 2^j i.i.d examples from Δ; query their labels from O to get a labelled dataset Sj. 
Denote δ̃j := δ̃/(j(j + 1)).
4: Train an ERM classifier ĥj ∈ V over Sj.
5: Define the set Vj as follows:

Vj = {h ∈ V : errSj(h) ≤ errSj(ĥj) + ε̃/2 + σ(nj, δ̃j) + √(σ(nj, δ̃j)ρSj(h, ĥj))}

where σ(n, δ) := (16/n)(2d ln(2en/d) + ln(24/δ)).
6: if sup_{h∈Vj}(σ(nj, δ̃j) + √(σ(nj, δ̃j)ρSj(h, ĥj))) ≤ ε̃/6 then
7: j0 = j, break
8: return Vj0.

2.3 Confidence-Rated Predictor

Our active learning algorithm uses a confidence-rated predictor with guaranteed error to make its label query decisions. In this section, we provide a novel confidence-rated predictor with guaranteed error. This predictor has optimal coverage in the realizable case, and may be of independent interest.
The predictor P receives as input a set V ⊆ H of hypotheses (which is likely to contain the true risk minimizer), an error guarantee η, and a set U of unlabelled examples. We consider a soft prediction algorithm; so, for each example in U, the predictor P outputs three probabilities that add up to 1 – the probability of predicting 1, −1 and 0. This output is subject to the constraint that the expected disagreement3 between the ±1 labels assigned by P and those assigned by any h ∈ V is at most η, and the goal is to maximize the coverage, or the expected fraction of non-abstentions.
Our key insight is that this problem can be written as a linear program, which is described in Algorithm 3. There are three variables, ξi, ζi and γi, for each unlabelled zi ∈ U; these are the probabilities with which we predict 1, −1 and 0 on zi respectively. 
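Written out explicitly for a finite pool and a finite hypothesis set, this LP has 3m variables and one disagreement constraint per hypothesis. The sketch below (our own illustration; a real run would hand the system to an LP solver such as scipy's linprog) only builds the constraint system and checks feasibility of the always-abstain point:

```python
# Sketch of the LP behind the confidence-rated predictor: variables
# (xi_i, zeta_i, gamma_i) per pool point z_i, one constraint per h in V.

def build_lp(V_labels, eta):
    """V_labels: list of +1/-1 label vectors over the pool, one per hypothesis."""
    m = len(V_labels[0])
    # objective: minimize sum_i gamma_i  <=>  maximize coverage
    c = [0.0] * m + [0.0] * m + [1.0] * m     # order: xi_1..m, zeta_1..m, gamma_1..m
    A_ub, b_ub = [], []
    for labels in V_labels:
        row = [0.0] * (3 * m)
        for i, y in enumerate(labels):
            if y == 1:
                row[m + i] = 1.0    # zeta_i counts when h(z_i) = +1 but P says -1
            else:
                row[i] = 1.0        # xi_i counts when h(z_i) = -1 but P says +1
        A_ub.append(row)
        b_ub.append(eta * m)        # constraint (2): total disagreement <= eta * m
    return c, A_ub, b_ub

def satisfies(A_ub, b_ub, point):
    return all(sum(a * p for a, p in zip(row, point)) <= b + 1e-12
               for row, b in zip(A_ub, b_ub))

# Toy pool of 4 points; 3 hypotheses given by their label vectors.
V_labels = [[1, 1, -1, -1], [1, -1, -1, -1], [1, 1, 1, -1]]
c, A_ub, b_ub = build_lp(V_labels, eta=0.1)
abstain_everywhere = [0.0] * 4 + [0.0] * 4 + [1.0] * 4   # xi = zeta = 0, gamma = 1
# gamma = 1 incurs no +1/-1 disagreement with any h, so the LP is always feasible.
```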
Constraint (2) ensures that the expected disagreement between the label predicted and any h ∈ V is no more than η, while the LP objective maximizes the coverage under these constraints. Observe that the LP is always feasible. Although the LP has infinitely many constraints, the number of constraints in Equation (2) distinguishable by Uk is at most (em/d)^d, where d is the VC dimension of the hypothesis class H.
The performance of a confidence-rated predictor is measured by its error and coverage. The error of a confidence-rated predictor is the probability with which it predicts the wrong label on an example, while the coverage is its probability of non-abstention. We can show the following guarantee on the performance of the predictor in Algorithm 3.
Theorem 1. In the realizable case, if the hypothesis set V is the version space with respect to a training set, then P_{x∼U}(P(x) ≠ h∗(x), P(x) ≠ 0) ≤ η. In the non-realizable case, if the hypothesis set V is a (1 − α)-confidence set for the true risk minimizer h∗, then, w.p. ≥ 1 − α, P_{x∼U}(P(x) ≠ y, P(x) ≠ 0) ≤ P_{x∼U}(h∗(x) ≠ y) + η.

3 where the expectation is taken over the random choices made by P

Algorithm 3 Confidence-rated Predictor
1: Inputs: hypothesis set V, unlabelled data U = {z1, . . .
, zm}, error bound η.
2: Solve the linear program:

minimize Σ_{i=1}^{m} γi
subject to: ∀i, ξi + ζi + γi = 1
∀h ∈ V, Σ_{i:h(zi)=1} ζi + Σ_{i:h(zi)=−1} ξi ≤ ηm    (2)
∀i, ξi, ζi, γi ≥ 0

3: For each zi ∈ U, output probabilities for predicting 1, −1 and 0: ξi, ζi, and γi.

In the realizable case, we can also show that our confidence-rated predictor has optimal coverage. Observe that we cannot directly show optimality in the non-realizable case, as the performance depends on the exact choice of the (1 − α)-confidence set.
Theorem 2. In the realizable case, suppose that the hypothesis set V is the version space with respect to a training set. If P′ is any confidence-rated predictor with error guarantee η, and if P is the predictor in Algorithm 3, then the coverage of P is at least as much as the coverage of P′.

3 Performance Guarantees

An essential property of any active learning algorithm is consistency – that it converges to the true risk minimizer given enough labelled examples. We observe that our algorithm is consistent provided we use any confidence-rated predictor P with guaranteed error as a subroutine. The consistency of our algorithm is a consequence of Lemmas 3 and 6 and is shown in Theorem 3.
Theorem 3 (Consistency). Suppose we run Algorithm 1 with inputs example oracle U, labelling oracle O, hypothesis class H, confidence-rated predictor P, target excess error ε and target confidence δ. 
Then with probability 1 \u2212 \u03b4, the classi\ufb01er \u02c6h returned by Algorithm 1 satis\ufb01es\nerrD(\u02c6h) \u2212 errD(h\u2217(D)) \u2264 \ufffd.\nWe now establish a label complexity bound for our algorithm; however, this label complexity bound\napplies only if we use the predictor described in Algorithm 3 as a subroutine.\nFor any hypothesis set V , data distribution D, and \u03b7, de\ufb01ne \u03a6D(V, \u03b7) to be the minimum absten-\ntion probability of a con\ufb01dence-rated predictor which guarantees that the disagreement between its\npredicted labels and any h \u2208 V under DX is at most \u03b7.\nFormally, \u03a6D(V, \u03b7) = min{ED\u03b3(x) : ED[I(h(x) = +1)\u03b6(x) + I(h(x) = \u22121)\u03be(x)] \u2264\n\u03b7 for all h \u2208 V, \u03b3(x) + \u03be(x) + \u03b6(x) \u2261 1, \u03b3(x), \u03be(x), \u03b6(x) \u2265 0}. De\ufb01ne \u03c6(r, \u03b7)\n:=\n\u03a6D(BD(h\u2217, r), \u03b7). The label complexity of our active learning algorithm can be stated as follows.\nTheorem 4 (Label Complexity). Suppose we run Algorithm 1 with inputs example oracle U, la-\nbelling oracle O, hypothesis class H, con\ufb01dence-rated predictor P of Algorithm 3, target excess\nerror \ufffd and target con\ufb01dence \u03b4. 
Then there exist constants c3, c4 > 0 such that with probability 1 − δ:
(1) In the realizable case, the total number of labels queried by Algorithm 1 is at most:

c3 Σ_{k=1}^{⌈log(1/ε)⌉} (d ln(φ(εk, εk/256)/εk) + ln((⌈log(1/ε)⌉ − k + 1)/δ)) · φ(εk, εk/256)/εk

(2) In the agnostic case, the total number of labels queried by Algorithm 1 is at most:

c4 Σ_{k=1}^{⌈log(1/ε)⌉} (d ln(φ(2ν∗(D) + εk, εk/256)/εk) + ln((⌈log(1/ε)⌉ − k + 1)/δ)) · (φ(2ν∗(D) + εk, εk/256)/εk) · (1 + ν∗(D)/εk)

Comparison. The label complexity of disagreement-based active learning is characterized in terms of the disagreement coefficient. Given a radius r, the disagreement coefficient θ(r) is defined as:

θ(r) = sup_{r′≥r} P(DIS(BD(h∗, r′)))/r′,

where for any V ⊆ H, DIS(V) is the disagreement region of V. As P(DIS(BD(h∗, r))) = φ(r, 0) [13], in our notation, θ(r) = sup_{r′≥r} φ(r′, 0)/r′.
In the realizable case, the best known bound for the label complexity of disagreement-based active learning is Õ(θ(ε) · ln(1/ε) · (d ln θ(ε) + ln ln(1/ε))) [20]4. Our label complexity bound may be simplified to:

Õ(ln(1/ε) · sup_{k≤⌈log(1/ε)⌉} (φ(εk, εk/256)/εk) · (d ln(sup_{k≤⌈log(1/ε)⌉} φ(εk, εk/256)/εk) + ln ln(1/ε))),

which is essentially the bound of [20] with θ(ε) replaced by sup_{k≤⌈log(1/ε)⌉} φ(εk, εk/256)/εk. As enforcing a lower error guarantee requires more abstention, φ(r, η) is a decreasing function of η; as a result,

sup_{k≤⌈log(1/ε)⌉} φ(εk, εk/256)/εk ≤ θ(ε),

and our label complexity bound is better.
In the agnostic case, [12] provides a label complexity bound of Õ(θ(2ν∗(D) + ε) · (d(ν∗(D)²/ε²) ln(1/ε) + d ln²(1/ε))) for disagreement-based active learning. In contrast, by Proposition 1 our label complexity is at most:

Õ(sup_{k≤⌈log(1/ε)⌉} (φ(2ν∗(D) + εk, εk/256)/(2ν∗(D) + εk)) · (d(ν∗(D)²/ε²) ln(1/ε) + d ln²(1/ε)))

Again, this is essentially the bound of [12] with θ(2ν∗(D) + ε) replaced by the smaller quantity sup_{k≤⌈log(1/ε)⌉} φ(2ν∗(D) + εk, εk/256)/(2ν∗(D) + εk). [20] has provided a more refined analysis of disagreement-based active learning that gives a label complexity of Õ(θ(ν∗(D) + ε)(ν∗(D)²/ε² + ln(1/ε))(d ln θ(ν∗(D) + ε) + ln ln(1/ε))); observe that their dependence is still on θ(ν∗(D) + ε). We leave a more refined label complexity analysis of our algorithm for future work.
An important sub-case of learning from noisy data is learning under the Tsybakov noise conditions [30]. We defer the discussion to the Appendix.

3.1 Case Study: Linear Classification under the Log-concave Distribution

We now consider learning linear classifiers with respect to a log-concave data distribution on Rd. 
In\n\nthis case, for any r, the disagreement coef\ufb01cient \u03b8(r) \u2264 O(\u221ad ln(1/r)) [4]; however, for any \u03b7 > 0,\n\u03c6(r,\u03b7)\nr \u2264 O(ln(r/\u03b7)) (see Lemma 14 in the Appendix), which is much smaller so long as \u03b7/r is not\ntoo small. This leads to the following label complexity bounds.\nCorollary 1. Suppose DX is isotropic and log-concave on Rd, and H is the set of homogeneous lin-\near classi\ufb01ers on Rd. Then Algorithm 1 with inputs example oracle U, labelling oracle O, hypothesis\nclass H, con\ufb01dence-rated predictor P of Algorithm 3, target excess error \ufffd and target con\ufb01dence \u03b4\nsatis\ufb01es the following properties. With probability 1 \u2212 \u03b4:\n(1) In the realizable case, there exists some absolute constant c8 > 0 such that the total number of\nlabels queried is at most c8 ln 1\n\n\u03b4 ).\n4Here the \u02dcO(\u00b7) notation hides factors logarithmic in 1/\u03b4\n\n\ufffd (d + ln ln 1\n\n\ufffd + ln 1\n\n7\n\n\f2\n\n\u03ba\u22122 ln 1\n\n3\n\n2 ln2 1\n\n\ufffd\n\n\ufffd\n\n+ ln 1\n\n\u03b4 ) + ln 1\n\n\ufffd2 + ln 1\n\n(d ln \ufffd+\u03bd\u2217(D)\n\n\ufffd ln \ufffd+\u03bd\u2217(D)\n\n\ufffd\n\n\ufffd (d ln 1\n\n\ufffd + ln 1\n\n\ufffd (ln d + ln ln 1\n\n\ufffd ) ln \ufffd+\u03bd\u2217(D)\n\n(2) In the agnostic case, there exists some absolute constant c9 > 0 such that the total number of la-\nbels queried is at most c9( \u03bd\u2217(D)2\n\ufffd .\nln ln 1\n(3) If (C0, \u03ba)-Tsybakov Noise condition holds for D with respect to H, then there exists some\nconstant c10 > 0 (that depends on C0, \u03ba) such that the total number of labels queried is at most\nc10\ufffd\n\n\u03b4 ).\nIn the realizable case, our bound matches [4]. For disagreement-based algorithms, the bound is\n\ufffd )), which is worse by a factor of O(\u221ad ln(1/\ufffd)). 
[4] does not address the\n\u02dcO(d\nfully agnostic case directly; however, if \u03bd\u2217(D) is known a-priori, then their algorithm can achieve\nroughly the same label complexity as ours.\nFor the Tsybakov Noise Condition with \u03ba > 1, [3, 4] provides a label complexity bound for\n\u02dcO(\ufffd\n\ufffd )) with an algorithm that has a-priori knowledge of C0 and \u03ba. We get\na slightly better bound. On the other hand, a disagreement based algorithm [20] gives a label\n\ufffd )). Again our bound is better by factor of \u03a9(\u221ad)\ncomplexity of \u02dcO(d\nover disagreement-based algorithms. For \u03ba = 1, we can tighten our label complexity to get a\n\u02dcO(ln 1\n\u03b4 )) bound, which again matches [4], and is better than the ones provided by\ndisagreement-based algorithm \u2013 \u02dcO(d\n\n\u03ba\u22122(ln d + ln ln 1\n\n\ufffd (d + ln ln 1\n\n\ufffd + ln 1\n\n\ufffd (d + ln ln 1\n\n2\n\n\u03ba\u22122 ln2 1\n\n3\n\n2 ln2 1\n\ufffd \ufffd\n\n2\n\n3\n\n2 ln2 1\n\n\ufffd (ln d + ln ln 1\n\n\ufffd )) [20].\n\n4 Related Work\nActive learning has seen a lot of progress over the past two decades, motivated by vast amounts of\nunlabelled data and the high cost of annotation [28, 10, 20]. According to [10], the two main threads\nof research are exploitation of cluster structure [31, 11], and ef\ufb01cient search in hypothesis space,\nwhich is the setting of our work. We are given a hypothesis class H, and the goal is to \ufb01nd an h \u2208 H\nthat achieves a target excess generalization error, while minimizing the number of label queries.\nThree main approaches have been studied in this setting. The \ufb01rst and most natural one is generalized\nbinary search [17, 8, 9, 27], which was analyzed in the realizable case by [9] and in various limited\nnoise settings by [23, 27, 26]. While this approach has the advantage of low label complexity, it is\ngenerally inconsistent in the fully agnostic setting [11]. 
The second approach, disagreement-based active learning, is consistent in the agnostic PAC model. [7] provides the first disagreement-based algorithm for the realizable case. [2] provides an agnostic disagreement-based algorithm, which is analyzed in [18] using the notion of the disagreement coefficient. [12] reduces disagreement-based active learning to passive learning; [5] and [6] further extend this work to provide practical and efficient implementations. [19, 24] give algorithms that are adaptive to the Tsybakov Noise condition. The third line of work [3, 4, 1] achieves a better label complexity than disagreement-based active learning for linear classifiers on the uniform distribution over the unit sphere and log-concave distributions. However, a limitation is that their algorithm applies only to these specific settings, and it is not apparent how to apply it generally.\nResearch on confidence-rated prediction has been mostly focused on empirical work, with relatively little theoretical development. Theoretical work on this topic includes KWIK learning [25], conformal prediction [29] and the weighted majority algorithm of [16]. The closest to our work is the recent learning-theoretic treatment by [13, 14]. [13] addresses confidence-rated prediction with guaranteed error in the realizable case, and provides a predictor that abstains in the disagreement region of the version space. This predictor achieves zero error, and coverage equal to the measure of the agreement region. [14] shows how to extend this algorithm to the non-realizable case and obtain zero error with respect to the best hypothesis in H.
Note that the predictors in [13, 14] generally achieve less coverage than ours for the same error guarantee; in fact, if we plug them into our Algorithm 1, then we recover the label complexity bounds of disagreement-based algorithms [12, 19, 24]. A formal connection between disagreement-based active learning in the realizable case and perfect confidence-rated prediction (with a zero error guarantee) was established by [15]. Our work can be seen as a step towards bridging these two areas, by demonstrating that active learning can be further reduced to imperfect confidence-rated prediction, with potentially higher label savings.\nAcknowledgements. We thank NSF under IIS-1162581 for research support. We thank Sanjoy Dasgupta and Yoav Freund for helpful discussions. CZ would like to thank Liwei Wang for introducing the problem of selective classification to him.\nReferences\n[1] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In STOC, 2014.\n[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.\n[3] M.-F. Balcan, A. Z. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.\n[4] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.\n[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.\n[6] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.\n[7] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.\n[8] S. Dasgupta. Analysis of a greedy active learning strategy. In NIPS, 2004.\n[9] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.\n[10] S. Dasgupta.
Two faces of active learning. Theor. Comput. Sci., 412(19), 2011.\n[11] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In ICML, 2008.\n[12] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.\n[13] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 2010.\n[14] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 2011.\n[15] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification. JMLR, 2012.\n[16] Y. Freund, Y. Mansour, and R. E. Schapire. Generalization bounds for averaged classifiers. The Ann. of Stat., 32, 2004.\n[17] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.\n[18] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.\n[19] S. Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.\n[20] S. Hanneke. A statistical theory of active learning. Manuscript, 2013.\n[21] S. Hanneke and L. Yang. Surrogate losses in passive and active learning. CoRR, abs/1207.3772, 2012.\n[22] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.\n[23] M. Kääriäinen. Active learning in the non-realizable case. In ALT, 2006.\n[24] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. JMLR, 2010.\n[25] L. Li, M. L. Littman, and T. J. Walsh. Knows what it knows: a framework for self-aware learning. In ICML, 2008.\n[26] M. Naghshvar, T. Javidi, and K. Chaudhuri. Noisy Bayesian active learning. In Allerton, 2013.\n[27] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.\n[28] B. Settles. Active learning literature survey.
Technical report, University of Wisconsin-Madison, 2010.\n[29] G. Shafer and V. Vovk. A tutorial on conformal prediction. JMLR, 2008.\n[30] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.\n[31] R. Urner, S. Wulff, and S. Ben-David. PLAL: Cluster-based active learning. In COLT, 2013.\n", "award": [], "sourceid": 287, "authors": [{"given_name": "Chicheng", "family_name": "Zhang", "institution": "University of California San Diego"}, {"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": "UC San Diego"}]}