{"title": "Analysis of a greedy active learning strategy", "book": "Advances in Neural Information Processing Systems", "page_first": 337, "page_last": 344, "abstract": null, "full_text": "Analysis of a greedy active learning strategy\n\nSanjoy Dasgupta\nUniversity of California, San Diego\ndasgupta@cs.ucsd.edu\n\nAbstract\n\nWe abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels.\n\n1 Introduction\n\nAn increasingly common phenomenon in classification tasks is that unlabeled data is abundant, whereas labels are considerably harder to come by. Genome sequencing projects, for instance, are producing vast numbers of peptide sequences, but reliably labeling even one of these with structural information requires time and close attention.\n\nThis distinction between labeled and unlabeled data is not captured in standard models like the PAC framework, and has motivated the field of active learning, in which the learner is able to ask for the labels of specific points, but is charged for each label. These query points are typically chosen from an unlabeled data set, a practice called pool-based learning [10]. There has also been some work on creating query points synthetically, including a rich body of theoretical results [1, 2], but this approach suffers from two problems: first, from a practical viewpoint, the queries thus produced can be quite unnatural and therefore bewildering for a human to classify [3]; second, since these queries are not picked from the underlying data distribution, they might have limited value in terms of generalization. 
In this paper, we focus on pool-based learning.\n\nWe are interested in active learning with generalization guarantees. Suppose the hypothesis class has VC dimension d and we want a classifier whose error rate on distribution P over the joint (input, label) space is less than ε > 0. The theory tells us that in a supervised setting, we need some m = m(ε, d) labeled points drawn from P (for a fixed level of confidence, which we will henceforth ignore). Can we get away with substantially fewer than m labels if we are given unlabeled points from P and are able to adaptively choose which points to label? How many fewer, and what querying strategies should we follow?\n\nHere is a toy example illustrating the potential of active learning. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h_w : w ∈ R}:\n\nh_w(x) = 1(x ≥ w).\n\nVC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then it is enough to draw m = O(1/ε) random labeled examples from P, and to return any classifier consistent with them. But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0's followed by a sequence of 1's, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log m labels. Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them.\n\nGeneralized binary search?\n\nSo far we have only looked at an extremely simple learning problem. For more complicated hypothesis classes H, is a sort of generalized binary search possible? What would the search space look like? 
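Before moving on, the 1-d binary search from the toy example can be sketched concretely (a minimal sketch in Python; the function and variable names are ours, not from the paper):

```python
def active_learn_threshold(xs, query_label):
    # Binary-search a sorted pool for the 0-to-1 transition point,
    # querying O(log m) labels instead of all m.
    # query_label(x) returns the hidden label of point x.
    xs = sorted(xs)
    lo, hi = 0, len(xs)              # transition index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(xs[mid]) == 1:
            hi = mid                 # transition is at or before mid
        else:
            lo = mid + 1             # transition is strictly after mid
    return lo                        # points xs[lo:] are the positives
```

On a pool of m = 1000 points this asks for about 10 labels, after which every remaining label is inferred.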
For supervised learning, in the realizable case, the usual bounds specify a sample complexity of (very roughly) m ≈ d/ε labeled points if the target error rate is ε. So let's pick this many unlabeled points, and then try to find a hypothesis consistent with all the hidden labels by adaptively querying just a few of them. We know via Sauer's lemma that H can classify these m points (considered jointly) in at most O(m^d) different ways; in effect, the size of H is reduced to O(m^d). This finite set is the effective hypothesis class Ĥ. (In the 1-d example, Ĥ has size m + 1, corresponding to the intervals into which the points x_i split the real line.) The most we can possibly learn about the target hypothesis, even if all labels are revealed, is to narrow it down to one of these regions. Is it possible to pick among these O(m^d) possibilities using o(m) labels? If binary search were possible, just O(d log m) labels would be needed.\n\nUnfortunately, we cannot hope for a generic positive result of this kind. The toy example above is a 1-d linear separator. We show that for d ≥ 2, the situation is very different:\n\nPick any collection of m (unlabeled) points on the unit sphere in R^d, for d ≥ 2, and assume their hidden labels correspond perfectly to some linear separator. Then there are target hypotheses in H which cannot be identified without querying all the labels.\n\n(What if the active learner is not required to identify exactly the right hypothesis, but something close to it? This and other little variations don't help much.) Therefore, even in the most benign situations, we cannot expect that every target hypothesis will be identifiable using o(m) labels. To put it differently, in the worst case over target hypotheses, active learning gives no improvement in sample complexity.\n\nBut hopefully, on average (with respect to some distribution over target hypotheses), the number of labels needed is small. 
For instance, when d = 2 in the bad case above, a target hypothesis chosen uniformly at random from Ĥ can be identified by querying just O(log m) labels in expectation. This motivates the main model of this paper.\n\nAn average-case model\n\nWe will count the expected number of labels queried when the target hypothesis is chosen from some distribution π over Ĥ. This can be interpreted as a Bayesian setting, but it is more accurate to think of π merely as a device for averaging query counts, which has no bearing on the final generalization bound. A natural choice is to make π uniform over Ĥ.\n\nMost existing active learning schemes work with H rather than Ĥ; but Ĥ reflects the underlying combinatorial structure of the problem, and it can't hurt to deal with it directly. Often π can be chosen to mask the structure of Ĥ; for instance, if H is the set of linear separators, then Ĥ is a set of convex regions of H, and π can be made proportional to the volume of each region. This makes the problem continuous rather than combinatorial.\n\nWhat is the expected number of labels needed to identify a target hypothesis chosen from π? In this average-case setting, is it always possible to get away with o(m) labels, where m is the sample complexity of the supervised learning problem as defined above? We show that the answer, once again, is sadly no. Thus the benefit of active learning is really a function of the specific hypothesis class and the particular pool of unlabeled data. Depending on these, the expected number of labels needed lies in the following range (within constants):\n\nideal case: d log m (perfect binary search)\nworst case: m (all labels, or randomly chosen queries)\n\nNotice the exponential gap between the top and bottom of this range. 
Is there some simple querying strategy which always achieves close to the minimum (expected) number of labels, whatever this minimum number might be?\n\nOur main result is that this property holds for a variant of a popular greedy scheme: always ask for the label which most evenly divides the current effective version space weighted by π. This doesn't necessarily minimize the number of queries, just as a greedy decision tree algorithm need not produce trees of minimum size. However:\n\nWhen π is uniform over Ĥ, the expected number of labels needed by this greedy strategy is at most O(ln |Ĥ|) times that of any other strategy.\n\nWe also give a bound for arbitrary π, and show corresponding lower bounds in both the uniform and non-uniform cases.\n\nVariants of this greedy scheme underlie many active learning heuristics, and are often described as optimal in the literature. This is the first rigorous validation of the scheme in a general setting. The performance guarantee is significant: recall log |Ĥ| = O(d log m), the minimum number of queries possible.\n\n2 Preliminaries\n\nLet X be the input space, Y = {0, 1} the space of labels, and P an unknown underlying distribution over X × Y. We want to select a hypothesis (a function X → Y) from some class H of VC dimension d < ∞, which will accurately predict labels of points in X. We will assume that the problem is realizable, that is, there is some hypothesis in H which gives a correct prediction on every point. Suppose that points (x_1, y_1), . . . , (x_m, y_m) are drawn randomly from P. Standard bounds give us a function m(ε, d) such that if we want a hypothesis of error ε (on P, modulo some fixed confidence level), and if m ≥ m(ε, d), then we need only pick a hypothesis h ∈ H consistent with these labeled points [9].\n\nNow suppose just the pool of unlabeled data x_1, . . . , x_m is available. 
The possible labelings of these points form a subset of {0, 1}^m, the effective hypothesis class\n\nĤ = {(h(x_1), . . . , h(x_m)) : h ∈ H}.\n\nSauer's lemma [9] tells us |Ĥ| = O(m^d). We want to pick the unique h ∈ Ĥ which is consistent with all the hidden labels, by querying just a few of them.\n\nAny deterministic search strategy can be represented as a binary tree whose internal nodes are queries (\"what is x_i's label?\"), and whose leaves are elements of Ĥ. We can also accommodate randomization, for instance to allow a random choice of query point, by letting internal nodes of the tree be random coin flips. Our main result, Theorem 3, is unaffected by this generalization.\n\nFigure 1: To identify target hypotheses like L3, we need to see all the labels.\n\n3 Some bad news\n\nClaim 1 Let H be the hypothesis class of linear separators in R^2. For any set of m distinct data points on the perimeter of the unit circle, there are always some target hypotheses in H which cannot be identified without querying all m labels.\n\nProof. To see this, consider the following realizable labelings (Figure 1):\n\nLabeling L_0: all points are negative.\nLabeling L_i (1 ≤ i ≤ m): all points are negative except x_i.\n\nIt is impossible to distinguish these cases without seeing all the labels.¹\n\nRemark. To rephrase this example in terms of learning a linear separator with error ε, suppose the input distribution P(X) is a density over the perimeter of the unit circle. No matter what this density is, there are always target hypotheses in H which force us to ask for Ω(1/ε) labels: no improvement over the sample complexity of supervised learning.\n\nIn this example, the bad target hypotheses have a large imbalance in probability mass between their positive and negative regions. 
By adding an extra dimension and an extra point, exactly the same example can be modified to make the bad hypotheses balanced.\n\nLet's return to the original 2-d case. Some hypotheses must lie at depth m in any query tree; but what about the rest? Well, suppose for convenience that x_1, . . . , x_m are in clockwise order around the unit circle. Then Ĥ = {h_ij : 1 ≤ i ≠ j ≤ m} ∪ {h_0, h_1}, where h_ij labels x_i, . . . , x_{j-1} positive (if j < i it wraps around) and the remaining points negative, and h_0, h_1 are everywhere negative/positive. It is possible to construct a query tree in which each h_ij lies at depth 2(m/|j - i| + log |j - i|). Thus, if the target hypothesis is chosen uniformly from Ĥ, the expected number of labels queried is at most\n\n(1/(m(m-1)+2)) [2m + Σ_{i≠j} 2(m/|j - i| + log |j - i|)] = O(log m).\n\nThis is why we place our hopes in an average-case analysis.\n\n¹ What if the final hypothesis, considered as a point in {0, 1}^m, doesn't have to be exactly right, but within Hamming distance k of the correct one? Then a similar example forces Ω(m/k) queries.\n\n4 Main result\n\nLet π be any distribution over Ĥ; we will analyze search strategies according to the number of labels they require, averaged over target hypotheses drawn from π. In terms of query trees, this is the average depth of a leaf chosen according to π. Specifically, let T be any tree whose leaves include the support of π. The quality of this tree is\n\nQ(T, π) = Σ_{h∈Ĥ} π(h) (# labels needed for h) = Σ_{h∈Ĥ} π(h) leaf-depth(h).\n\nIs there always a tree of average depth o(m)? The answer, once again, is sadly no.\n\nClaim 2 Pick any d ≥ 2 and any m ≥ 2d. There is an input space X of size m and a hypothesis class H of VC dimension d, defined on domain X, with the following property: if π is chosen to be uniform over Ĥ = H, then any query tree T has Q(T, π) ≥ m/8.\n\nProof. Let X consist of any m points x_1, . . . , x_m, and let H consist of all hypotheses h : X → {0, 1} which are positive on exactly d inputs. 
In order to identify a particular element h ∈ H, any querying method must discover exactly the d points x_i on which h is nonzero. By construction, the order in which queries are asked is irrelevant: it might as well be x_1, x_2, . . .. The rest is a simple probability calculation.\n\nIn our average-case model, we have seen one example in which intelligent querying results in an exponential improvement in the number of labels required, and one in which it is no help at all. Is there some generic scheme which always comes close to minimizing the number of queries, whatever the minimum number might be? Here's a natural candidate:\n\nGreedy strategy. Let S ⊆ Ĥ be the current version space. For each unlabeled x_i, let S_i^+ be the hypotheses which label x_i positive and S_i^- the ones which label it negative. Pick the x_i for which these sets are most nearly equal in π-mass, that is, for which min{π(S_i^+), π(S_i^-)} is largest.\n\nWe show this is almost as good at minimizing queries as any other strategy.\n\nTheorem 3 Let π be any distribution over Ĥ. Suppose that the optimal query tree requires Q* labels in expectation, for target hypotheses chosen according to π. Then the expected number of labels needed by the greedy strategy is at most 4Q* ln(1/min_h π(h)).\n\nFor the case of uniform π, the approximation ratio is thus at most 4 ln |Ĥ|. We also show almost-matching lower bounds in both the uniform and non-uniform cases.\n\n5 Analysis of the greedy active learner\n\n5.1 Lower bounds on the greedy scheme\n\nThe greedy approach is not optimal because it doesn't take into account the way in which a query reshapes the search space, specifically the effect of a query on the quality of other queries. For instance, Ĥ might consist of several dense clusters, each of which permits rapid binary search. 
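The greedy rule itself is easy to state in code once the effective hypothesis class is represented explicitly, say as 0/1 label vectors over the pool with a mass pi[h] for each (a hedged sketch in Python; this brute-force form is meant only to mirror the definition, not to be efficient):

```python
def greedy_query(version_space, pi, num_points):
    # Return the index i maximizing min{pi(S_i^+), pi(S_i^-)}, where
    # S_i^+ / S_i^- are the hypotheses in the current version space
    # that label point i positive / negative.
    best_i, best_score = None, -1.0
    for i in range(num_points):
        mass_pos = sum(pi[h] for h in version_space if h[i] == 1)
        mass_neg = sum(pi[h] for h in version_space if h[i] == 0)
        score = min(mass_pos, mass_neg)
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```

For the 1-d threshold class on four points, the effective class has the five labelings (0,0,0,0) through (1,1,1,1); under uniform mass the rule picks one of the two middle points, splitting the version space 2-vs-3.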
However, the version space must first be whittled down to one of these subregions, and this process, though ultimately optimal, might initially be slower at shrinking the hypothesis space than more shortsighted alternatives. A concrete example of this type gives rise to the following lower bound.\n\nClaim 4 For any n ≥ 16 which is a power of two, there is a concept class H_n of size n such that: under uniform π, the optimal tree has average height at most q_n = O(log n), but the greedy active learning strategy produces a tree of average height Ω(q_n · log n / log log n).\n\nFor non-uniform π, the greedy scheme can deviate more substantially from optimality.\n\nClaim 5 For any n ≥ 2, there is a hypothesis class H with 2n + 1 elements and a distribution π over H, such that: (a) π ranges in value from 1/2 to 1/2^{n+1}; (b) the optimal tree has average depth less than 3; (c) the greedy tree has average depth at least n/2.\n\nProofs of these lower bounds appear in the full paper, available at the author's website.\n\n5.2 Upper bound\n\nOverview. The lower bounds on the quality of a greedy learner are sobering, but things cannot get too much worse than this. Here's the basic argument for uniform π: we show that if the optimal tree T* requires Q* queries in expectation, then some query must (again in expectation) \"cut off\" a chunk of Ĥ of π-mass Ω(1/Q*). Therefore, the root query of the greedy tree T_G is at least this good (cf. Johnson's set cover analysis [8]). Things get trickier when we try to show that the rest of T_G is also good, because although T* uses just Q* queries on average, it may need many more queries for certain hypotheses. Subtrees of T_G could correspond to version spaces for which more than Q* queries are needed, and the roots of these subtrees might not cut down the version space much...\n\nFor a worst-case model, a proof of approximate optimality is known in a related context [6]; as we saw in Claim 1, that model is trivial in our situation. 
The average-case model, and especially the use of arbitrary weights π, require more care.\n\nDetails. For want of space, we only discuss some issues that arise in proving the main theorem, and leave the actual proof to the full paper.\n\nThe key concept we have to define is the quality of a query, and it turns out that we need this to be monotonically decreasing, that is, it should only go down as active learning proceeds and the version space shrinks. This rules out some natural entropy-based notions.\n\nSuppose we are down to some version space S ⊆ Ĥ, and a possible next query is x_j. If S^+ is the subset of S which labels x_j positive, and S^- are the ones that label it negative, then on average the probability mass (measured by π) eliminated by x_j is\n\n(π(S^+)/π(S)) π(S^-) + (π(S^-)/π(S)) π(S^+) = 2 π(S^+) π(S^-) / π(S).\n\nWe say x_j shrinks (S, π) by this much, with the understanding that this is in expectation. Shrinkage is easily seen to have the monotonicity property we need.\n\nLemma 6 If x_j shrinks (Ĥ, π) by Δ, then it shrinks (S, π) by at most Δ for any S ⊆ Ĥ.\n\nWe would expect that if the optimal tree is short, there must be at least one query which shrinks (Ĥ, π) considerably. More concretely, the definition of shrinkage seems to suggest that if all queries provide shrinkage at most Δ, and the current version space has mass π(S), then at least about π(S)/Δ more queries are needed. This isn't entirely true, because of a second effect: if |S| = 2, then we need just one query, regardless of π(S).\n\nRoughly speaking, when there are lots of hypotheses with significant mass left in S, the first effect dominates; thereafter the second takes over. To smoothly incorporate both effects, we use the notion of collision probability. For a distribution ν over support Z, this is CP(ν) = Σ_{z∈Z} ν(z)^2, the chance that two random draws from ν are identical.\n\nLemma 7 Suppose every query shrinks (Ĥ, π) by at most Δ > 0. Pick any S ⊆ Ĥ, and any query tree T whose leaves include S. 
If π_S is the restriction of π to S (that is, π_S(h) = π(h)/π(S) for h ∈ S), then Q(T, π_S) ≥ (1 - CP(π_S)) π(S)/Δ.\n\nCorollary 8 Pick any S ⊆ Ĥ and any tree T whose leaves include all of S. Then there must exist a query which shrinks (S, π_S) by at least (1 - CP(π_S))/Q(T, π_S).\n\nSo if the current version space S ⊆ Ĥ is such that π_S has small collision probability, some query must split off a sizeable chunk of S. This can form the basis of a proof by induction.\n\nBut what if CP(π_S) is large, say greater than 1/2? In this case, the mass of some particular hypothesis h_0 ∈ S exceeds that of all the others combined, and S could shrink by just an insignificant amount during the subsequent greedy query, or even during the next few iterations of greedy queries. It turns out, however, that within roughly the number of iterations that the optimal tree needs for target h_0, the greedy procedure will either reject h_0 or identify it as the target. If it is rejected, then by that time S will have shrunk considerably.\n\nBy combining the two cases for CP(π_S), we get the following lemma, which is proved in the full paper and yields our main theorem as an immediate consequence.\n\nLemma 9 Let T* denote any particular query tree for π, and let T_G be the greedily-constructed query tree. For any S ⊆ Ĥ which corresponds to a subtree T_S of T_G,\n\nQ(T_S, π_S) ≤ 4 Q(T*, π_S) ln (π(S) / min_{h∈S} π(h)).\n\n6 Related work and promising directions\n\nRather than attempting to summarize the wide range of proposed active learning methods, for instance [5, 7, 10, 13, 14], we will discuss three basic techniques upon which they rely.\n\nGreedy search. This is the technique we have abstracted and rigorously validated in this paper. It is the foundation of most of the schemes cited above. Algorithmically, the main problem is that the query selection rule is not immediately tractable, so approximations are necessary. 
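One generic approximation, in the spirit of the sampling-based scheme of [7], is to estimate the split masses from hypotheses drawn from the version space (a hedged sketch; the sampler itself, which is the hard part for continuous classes, is assumed given):

```python
def approx_greedy_query(sample_hypothesis, pool, n_samples=2000):
    # Estimate min{pi(S_i^+), pi(S_i^-)} for each candidate query from
    # hypotheses sampled (approximately) from pi restricted to the
    # current version space, and return the most balanced query index.
    hs = [sample_hypothesis() for _ in range(n_samples)]
    best_i, best_score = None, -1.0
    for i, x in enumerate(pool):
        frac_pos = sum(h(x) for h in hs) / n_samples
        score = min(frac_pos, 1.0 - frac_pos)
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```

The estimation error of each split mass decays as 1/sqrt(n_samples), so a modest number of sampled hypotheses suffices to pick a nearly most-balanced query.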
For linear separators, Ĥ consists of convex sets, and if π is chosen to be proportional to volume, query selection involves estimating volumes of convex regions, which is tractable but (using present techniques) inconvenient. Tong and Koller [13] investigate margin-based approximations which are efficiently computable using SVM technology.\n\nOpportunistic priors. This is a trick in which the learner takes a look at the unlabeled data and then places bets on hypotheses. A uniform bet over all of H leads to standard generalization bounds. But if the algorithm places more weight on certain hypotheses (for instance, those with large margin), then its final error bound is excellent if it guessed right, and worse-than-usual if it guessed wrong. This technique is not specific to active learning, and has been analyzed elsewhere (e.g. [12]). One interesting line of work investigates a flexible family of priors specified by pairwise similarities between data points, e.g. [14].\n\nBayesian assumptions. In our analysis, although π can be seen as some sort of prior belief, there is no assumption that nature shares this belief; in particular, the generalization bound does not depend on it. A Bayesian assumption has an immediate benefit for active learning: if at any stage the remaining version space (weighted by prior π) is largely in agreement on the unlabeled data, it is legitimate to stop and output one of these remaining hypotheses [7]. In a non-Bayesian setting this is not legitimate.\n\nWhen the hypothesis class consists of probabilistic classifiers, the Bayesian assumption has also been used in another way: to approximate the greedy selection rule using the MAP estimate instead of an expensive summation over the posterior (e.g. [11]).\n\nIn terms of theoretical results, another work which considers the tradeoff between labels and generalization error is [7], in which a greedy scheme, realized using sampling, is analyzed in a Bayesian setting. 
The authors show that it is possible to achieve an exponential improvement in the number of labels needed to learn linear separators, when both data and target hypothesis are chosen uniformly from the unit sphere. It is an intriguing question whether this holds for more general data distributions.\n\nOther directions. We have looked at the case where the acceptable error rate is fixed and the goal is to minimize the number of queries. What about fixing the number of queries and asking for the best (average) error rate possible? In other words, the query tree has a fixed depth, and each leaf is annotated with its remaining version space S ⊆ Ĥ. Treating each element of S as a point in {0, 1}^m (its predictions on the pool of data), the error at this leaf depends on the Hamming diameter of S. What is a good querying strategy for producing low-diameter leaves?\n\nThe most widely-used classifiers are perhaps linear separators. Existing active learning schemes ignore the rich algebraic structure of Ĥ, an arrangement of hyperplanes [4].\n\nAcknowledgements. I am very grateful to the anonymous NIPS reviewers for their careful and detailed feedback.\n\nReferences\n\n[1] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.\n[2] D. Angluin. Queries revisited. Proceedings of the Twelfth International Conference on Algorithmic Learning Theory, pages 12-31, 2001.\n[3] E.B. Baum and K. Lang. Query learning can work poorly when a human oracle is used. International Joint Conference on Neural Networks, 1992.\n[4] A. Bjorner, M. Las Vergnas, B. Sturmfels, N. White, and G. Ziegler. Oriented matroids. Cambridge University Press, 1999.\n[5] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.\n[6] S. Dasgupta, P.M. Long, and W.S. Lee. A theoretical analysis of query selection for collaborative filtering. 
Machine Learning, 51:283-298, 2003.\n[7] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168, 1997.\n[8] D.S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.\n[9] M.J. Kearns and U.V. Vazirani. An introduction to computational learning theory. MIT Press, 1993.\n[10] A. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. Fifteenth International Conference on Machine Learning, 1998.\n[11] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. Twentieth International Conference on Machine Learning, 2003.\n[12] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 1998.\n[13] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2001.\n[14] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. ICML workshop, 2003.\n", "award": [], "sourceid": 2636, "authors": [{"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": null}]}