{"title": "Active Learning for Anomaly and Rare-Category Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1080, "abstract": null, "full_text": " Active Learning for Anomaly and\n Rare-Category Detection\n\n\n\n Dan Pelleg and Andrew Moore\n School of Computer Science\n Carnegie-Mellon University\n Pittsburgh, PA 15213 USA\n dpelleg@cs.cmu.edu, awm@cs.cmu.edu\n\n\n Abstract\n\n We introduce a novel active-learning scenario in which a user wants to\n work with a learning algorithm to identify useful anomalies. These are\n distinguished from the traditional statistical definition of anomalies as\n outliers or merely ill-modeled points. Our distinction is that the useful-\n ness of anomalies is categorized subjectively by the user. We make two\n additional assumptions. First, there exist extremely few useful anoma-\n lies to be hunted down within a massive dataset. Second, both useful\n and useless anomalies may sometimes exist within tiny classes of similar\n anomalies. The challenge is thus to identify \"rare category\" records in an\n unlabeled noisy set with help (in the form of class labels) from a human\n expert who has a small budget of datapoints that they are prepared to cat-\n egorize. We propose a technique to meet this challenge, which assumes\n a mixture model fit to the data, but otherwise makes no assumptions on\n the particular form of the mixture components. This property promises\n wide applicability in real-life scenarios and for various statistical mod-\n els. We give an overview of several alternative methods, highlighting\n their strengths and weaknesses, and conclude with a detailed empirical\n analysis. 
We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands.

1 Introduction

We begin with an example of a rare-category-detection problem: an astronomer needs to sift through a large set of sky survey images, each of which comes with many numerical parameters. Most of the objects (99.9%) are well explained by current theories and models. The remainder are anomalies, but 99% of these anomalies are uninteresting, and only 1% of them (0.001% of the full dataset) are useful. The first type of anomalies, called "boring anomalies", are records which are strange for uninteresting reasons such as sensor faults or problems in the image-processing software. The useful anomalies are extraordinary objects which are worthy of further research. For example, an astronomer might want to cross-check them in various databases and allocate telescope time to observe them in greater detail. The goal of our work is finding this set of rare and useful anomalies.

Figure 1: Anomalies in Sloan data: diffraction spikes (left); satellite trails (center). The active-learning loop is shown on the right: ask the expert to classify a set of records, build a model from the data and labels, run all the data through the model, spot "important" records, and return them to the expert for classification.

Although our example concerns astrophysics, this scenario is a promising general area for exploration wherever there is a very large amount of scientific, medical, business or intelligence data and a domain expert wants to find truly exotic rare events while not becoming swamped with uninteresting anomalies. Two rare categories of "boring" anomalies in our test astrophysics data are shown in Figure 1. The first, a well-known optical artifact, is the phenomenon of diffraction spikes.
The second consists of satellites that happened to be\nflying overhead as the photo was taken.\n\nAs a first step, we might try defining a statistical model for the data, and identifying objects\nwhich do not fit it well. At this point, objects flagged as \"anomalous\" can still be almost\nentirely of the uninteresting class of anomalies. The computational and statistical question\nis then how to use feedback from the human user to iteratively reorder the queue of anoma-\nlies to be shown to the user in order to increase the chance that the user will soon see an\nanomaly of a whole new category.\n\nWe do this in the familiar pool-based active learning framework1. In our setting, learning\nproceeds in rounds. Each round starts with the teacher labeling a small number of examples.\nThen the learner models the data, taking into account the labeled examples as well as the\nremainder of the data, which we assume to be much larger in volume. The learner then\nidentifies a small number of input records (\"hints\") which are important in the sense that\nobtaining labels for them would help it improve the model. These are shown to the teacher\n(in our scenario, a human expert) for labeling, and the cycle repeats. The model, which we\ncall \"irrelevance feedback\", is shown in Figure 1.\n\nIt may seem too demanding to ask the human expert to give class labels instead of a simple\n\"interesting\" or \"boring\" flag. But in practice, this is not an issue--it seems easier to place\nobjects into such \"mental bins\". For example, in the astronomical data we have seen a user\nplace most objects into previously-known categories: point sources, low-surface-brightness\ngalaxies, etc. This also holds for the negative examples: it is frustrating to have to label all\nanomalies as \"bad\" without being able to explain why. Often, the data is better understood\nas time goes by, and users wish to revise their old labels in light of new examples. 
Note that the statistical model does not care about the names of the labels. For all it cares, the label set can be utterly changed by the user from one round to another. Our tools allow that: the labels are unconstrained and the user can add, refine, and delete classes at will. It is trivial to accommodate the simpler "interesting or not" model in this richer framework.

Our work differs from traditional applications of active learning in that we assume the distribution of class sizes to be extremely skewed. For example, the smallest class may have just a few members whereas the largest may contain a few million. Generally in active learning, it is believed that, right from the start, examples from each class need to be presented to the oracle [1, 2, 3]. If the class frequencies were balanced, this could be achieved by random sampling. But in datasets with the rare-categories property, this no longer holds, and much of our effort is an attempt to remedy the situation.

Previous active-learning work tends to tie intimately to a particular model [4, 3]. We would like to be able to "plug in" different types of models or components and therefore propose model-independent criteria. The same reasoning also precludes us from directly using distances between data points, as is done in [5].

1 More precisely, we allow multiple queries and labels in each learning round -- the traditional presentation has just one.

Figure 2: Underlying data distribution for the example (a); behavior of the lowlik method (b-f). The original data distribution is in (a). The unsupervised model fit to it in (b). The anomalous points according to lowlik, given the model in (b), are shown in (c). Given labels for the points in (c), the model in (d) is fitted.
Given the new model, anomalous points according to lowlik are flagged (e). Given labels for the points in (c) and (e), this is the new fitted model (f).

Another desired property is resilience to noise. Noise can be inherent in the data (e.g., from measurement errors) or be an artifact of an ill-fitting model. In any case, we need to be able to identify query points in the presence of noise. This is not just a bonus feature: points which the model considers noisy could very well be the key to improvement if presented to the oracle. This is in contrast to the approach taken by some: a pre-assumption that the data is noiseless [6, 7].

2 Overview of Hint Selection Methods

In this section we survey several proposed methods for active learning as they apply to our setting. While the general tone is negative, what follows should not be construed as a general dismissal of these methods. Rather, it is meant to highlight specific problems with them when applied to a particular setting. Specifically, the rare-categories assumption (and in some cases, just having more than 2 classes) breaks the premises for some of them.

As an example, consider the data shown in Figure 2(a). It is a mixture of two classes. One is an X-shaped distribution, from which 2000 points are drawn. The other is a circle with 100 points. In this example, the classifier is a Gaussian Bayes classifier trained in a semi-supervised manner from labeled and unlabeled data, with one Gaussian per class. The model is learned with a standard EM procedure, with the following straightforward modification [8, 9] to enable semi-supervised learning.
Before each M step we clamp the class membership values for the hinted records to match the hints (i.e., one for the labeled class for this record, and zero elsewhere).

Given fully labeled data, our learner would perfectly predict class membership for this data (although it would be a poor generative model): one Gaussian centered on the circle, and another spherical Gaussian with high variance centered on the X. Now, suppose we plan to perform active learning in which we take the following steps:

 1. Start with entirely unlabeled data.

 2. Perform semi-supervised learning (which, on the first iteration, degenerates to unsupervised learning).

 3. Ask an expert to classify the 35 strangest records.

 4. Go to Step 2.

On the first iteration (when unsupervised) the algorithm will naturally use the two Gaussians to model the data as in Figure 2(b), with one Gaussian for each of the arms of the "X", and the points in the circle represented as members of one of them. What happens next all depends on the choice of the datapoints to show to the human expert. We now survey the methods for hint selection.

Figure 3: Behavior of the ambig (a-c) and interleave (d-e) methods. The unsupervised model and the points which ambig flags as anomalous, given this model (a). The model learned using labels for these points is (b), along with the point it flags. The last refinement, given both sets of labels (c).

Choosing Points with Low Likelihood: A rather intuitive approach is to select as hints the points which the model performs worst on. This can be viewed as model variance minimization [4] or as selection of points furthest away from any labeled points [5]. We do this by ranking each point in order of increasing model likelihood, and choosing the most anomalous items.

We show what this approach would flag in the given configuration in Figure 2.
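As an aside, the clamping modification our semi-supervised EM uses (described at the start of this section) is simple enough to sketch. This is a minimal illustration under our own naming (`Z` for the soft membership matrix from the E step, `hints` for the expert's labels), not the authors' actual code:

```python
def clamp_memberships(Z, hints):
    """Overwrite the soft membership row of each hinted record with a
    hard label: 1 for the labeled class, 0 elsewhere.  Applied after
    every E step, just before the M step, so labeled records pull the
    component parameters toward the expert's classes."""
    Z = [row[:] for row in Z]      # don't mutate the caller's matrix
    for i, c in hints.items():     # i: record index, c: labeled class
        Z[i] = [0.0] * len(Z[i])
        Z[i][c] = 1.0
    return Z

# Example: the expert labeled record 1 as class 0.
Z = clamp_memberships([[0.6, 0.4], [0.5, 0.5]], {1: 0})
# Z is now [[0.6, 0.4], [1.0, 0.0]]
```

Unhinted records keep their soft memberships, so the unlabeled bulk of the data still shapes the model as usual.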
It is derived from a screenshot of a running version of our code, redrawn by hand for clarity. Each subsequent drawing shows a model which EM converged to after including the new labels, and the hints it chooses under a particular scheme (here it is what we call lowlik). These hints affect the model shown for the next round. The underlying distribution is shown in gray shading. We use this same convention for the other methods below.

In the first round, the Mahalanobis distance for the points in the corners is greater than for those in the circle, and therefore they are flagged. Another effect we see is that one of the arms is represented more heavily. This is probably due to its lower variance. In any event, none of the points in the circle is flagged. The outcome is that the next round ends up in a similar local minimum. We can also see that another step will not result in the desired model. Only after obtaining labels for all of the "outlier" points (that is, those on the extremes of the distribution) will this approach go far enough down the list to hit a point in the circle. This means that in scenarios where there are more than a few hundred noisy data points, classification accuracy is likely to be very low.

Choosing Ambiguous Points: Another popular approach is to choose the points which the learner is least certain about. This is the spirit of "query by committee" [10] and "uncertainty sampling" [11]. In our setting this is implemented in the following way. For each data point, the EM algorithm maintains an estimate of the probability of its membership in every mixture component. For each point, we compute the entropy of the set of all such probabilities, and rank the points in decreasing order of the entropy. This way, the top of the list will have the objects which are "owned" by multiple components.

For our example, this would choose the points shown in Figure 3.
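To make the two criteria concrete, here is a hedged sketch (the function names are ours; the paper gives no code) of how lowlik and ambig each turn quantities EM already maintains into a ranked hint list:

```python
import math

def rank_lowlik(densities):
    """lowlik: sort point indices by increasing mixture likelihood,
    so the worst-modeled (most anomalous) points come first."""
    return sorted(range(len(densities)), key=lambda i: densities[i])

def rank_ambig(memberships):
    """ambig: sort point indices by decreasing entropy of their
    component-membership probabilities, so points "owned" by several
    components at once come first."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sorted(range(len(memberships)),
                  key=lambda i: -entropy(memberships[i]))

densities = [0.20, 0.01, 0.10]                      # mixture density per point
memberships = [[0.5, 0.5], [1.0, 0.0], [0.9, 0.1]]  # soft memberships per point
print(rank_lowlik(densities))    # [1, 2, 0]: least likely point first
print(rank_ambig(memberships))   # [0, 2, 1]: most ambiguous point first
```

The top of each ranked list is what gets shown to the expert; the hybrid method described below simply alternates between the two lists.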
As expected, points on the decision boundaries between classes are chosen. Here, the ambiguity sets are useless for the purpose of modeling the entire distribution. One might argue this only holds for this contrived distribution. However, in general this is a fairly common occurrence, in the sense that the ambiguity criterion works to nudge the decision surfaces so they better fit a relatively small set of labeled examples. It may help modeling the points very close to the boundaries, but it does not improve generalization accuracy in the general case. Indeed, we see that if we repeatedly apply this criterion we end up asking for labels for a great number of points in close proximity, to very little effect on the overall model. In the results section below, we call this method ambig.

Combining Unlikely and Ambiguous Points: Our next candidate is a hybrid method which tries to combine the hints from the two previous methods. Recall that they both produce a ranked list of all the points. We merge the lists into another ranked list in the following way. Alternate between the lists when picking items. For each list, pick the top item that has not already been placed in the output list. When all elements are taken, the output list is a ranked list as required. We now pick the top items from this list for hints.

As expected we get a good mix of points from both hint sets (not shown). But, since neither method identifies the small cluster, their union fails to find it as well. However, in general it is useful to combine different criteria in this way, as our empirical results below show. There, this method is called mix-ambig-lowlik.

Interleaving: We now present what we consider to be the logical conclusion of the observations above. To the best of our knowledge, the approach is novel.
The key insight is that our group of anomalies was, in fact, reasonably ordinary when analyzed on a global scale. In other words, the mixture density of the region we chose for the group of anomalies is not sufficiently low for them to rank high on the hint list. Recall that the mixture model sums up the weighted per-component densities. Therefore, a point that is "split" among several components approximately evenly, and scores reasonably high on at least some of them, will not be flagged as anomalous.

Another instance of the same problem occurs when a point is partly "owned" by a component with high mixture weight. Even if the small component that "owns" most of it predicts it is very unlikely, that term has very little effect on the overall density.

Therefore, our goal is to eliminate the mixture weights from the equation. Our idea is that if we restrict the focus to match the "point of view" of just one component, these anomalies will become more apparent. We do this by considering just the points that "belong" to one component, and by ranking them according to the PDF of this component. The hope is that given this restricted view, anomalies that do not fit the component's own model will stand out.

More precisely, let c be a component and i a data point. The EM algorithm maintains, for every c and i, an estimate z_i^c of the degree of "ownership" that c exerts over i. For each component c we create a list of all the points i for which c = arg max_{c'} z_i^{c'}, ranked by z_i^c.

Having constructed the sorted lists, we merge them in a generalization of the merge method described above. We cycle through the lists in some order. For each list, we pick the top item that has not already been placed in the output list, and place it at the next position in the output list.

This strategy is appealing intuitively, although we have no further theoretical justification for it.
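As a hedged sketch (our own names, not the authors' code), the interleave nomination can be written as follows, assuming `memberships[i][c]` holds EM's ownership estimate z_i^c for point i and component c:

```python
def interleave(memberships):
    """Assign each point to the component that owns it most (arg max
    over c of z_i^c); within each component, rank its points by
    increasing ownership, so the points that fit the component worst
    are nominated first; then merge the per-component lists
    round-robin into a single hint list."""
    k = len(memberships[0])
    owned = [[] for _ in range(k)]
    for i, z in enumerate(memberships):
        owned[max(range(k), key=lambda c: z[c])].append(i)
    for c in range(k):
        owned[c].sort(key=lambda i: memberships[i][c])   # worst fit first
    hints, pos = [], [0] * k
    while len(hints) < len(memberships):                 # cycle the lists
        for c in range(k):
            if pos[c] < len(owned[c]):
                hints.append(owned[c][pos[c]])
                pos[c] += 1
    return hints

z = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.45, 0.55]]
print(interleave(z))   # [2, 3, 0, 1]: each component nominates in turn
```

The background-oversampling refinement described below corresponds to taking several items from the uniform component's list per cycle instead of one.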
We show results for this strategy for our example in Figure 3, and in the experimental section below. We see it meets the requirement of representation for all true components. Most of the points are along the major axes of the two elongated Gaussians, but two of the points are inside the small circle. Correct labels for even just these two points result in perfect classification in the next EM run.

In our experiments, we found it beneficial to modify this method as follows. One of the components is a uniform-density "background". This modification lets it nominate hints more often than any other component. In terms of list merging, we take one element from each of the lists of standard components, and then several elements from the list produced for the background component. All of the results shown were obtained using an oversampling ratio of 20. In other words, if there are N components (excluding uniform), then the first cycle of hint nomination will result in 20 + N hints, 20 of which come from uniform.

3 Experimental Results

To establish the results hinted at by the intuition above, we conducted a series of experiments. The first one uses synthetic data. The data distribution is a mixture of components in 5, 10, 15 and 20 dimensions. The class size distribution is a geometric series, with the largest class owning half of the data and each subsequent class being half the size of the previous one.

The components are multivariate Gaussians whose covariance structure can be modeled

Figure 4: Learning curves for simulated data drawn from a mixture of dependency trees (left), and for the SHUTTLE set (right).
The Y axis shows the fraction of classes represented in queries sent to the teacher. For SHUTTLE and ABALONE below, mix-ambig-lowlik is omitted because it is so similar to lowlik.

Figure 5: Learning curves for the ABALONE (left) and KDD (right) sets.

with dependency trees. Each Gaussian component has its covariance generated in the following way. Random attribute pairs are chosen, and added to an undirected dependency tree structure unless they close a cycle. Each edge describes a linear dependency between nodes, with the coefficients drawn uniformly at random, and with random noise added to each value. Each data set contains 10,000 points. There are ten tree classes and a uniform background component. The number of "background" points ranges from 50 to 200. Only the results for 15 dimensions and 100 noisy points are shown, as they are representative of the other experiments. In each round of learning, the learner queries the teacher with a list of 50 points for labeling, and has access to all the queries and replies submitted previously.

This data generation scheme is still very close to the one which our tested model assumes. Note, however, that we do not require different components to be easily identifiable. The results of this experiment are shown in Figure 4. Also included are results for random, which is a baseline method choosing hints at random.

Our scoring function is driven by our application, and estimates the amount of effort the teacher has to expend before being presented with representatives of every single class.
The assumption is that the teacher can generalize from a single example (or a very few examples) to an entire class, and the valuable information is concentrated in the first queried member of each class. More precisely, if there are n classes, then the score under this metric is 1/n times the number of classes represented in the query set. In the query set we include all items queried in preceding rounds, as we do for other applicable metrics.

The best performer so far is interleave, taking five rounds or less to reveal all of the classes, including the very rare ones. Below we show it is superior in many of the real-life data sets. We can also see that ambig performs worse than random. This can be explained by the fact that ambig only chooses points that already have several existing components "competing" for them. Rarely do these points belong to a new, yet-undiscovered component.

Figure 6: Learning curves for the EDSGC (left) and SDSS (right) sets.

Table 1: Properties of the data sets used.

 NAME     DIMS  RECORDS  CLASSES  SMALLEST CLASS  LARGEST CLASS  SOURCE
 SHUTTLE     9    43500        7           0.01%          78.4%  [12]
 ABALONE     7     4177       20           0.34%            16%  [13]
 KDD        33    50000       19          0.002%          21.6%  [13]
 EDSGC      26  1439526        7          0.002%            76%  [14]
 SDSS       22   517371        3           0.05%          50.6%  [15]

We were concerned that the poor performance of lowlik was just a consequence of our choice of metric. After all, it does not measure the number of noise points (i.e., points from the uniform background component) found. These points are genuine anomalies, so it is possible that lowlik is being penalized unfairly for its focusing on the noise points.
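The class-discovery score defined above amounts to a one-liner; a minimal sketch with our own naming:

```python
def discovery_score(queried_labels, n_classes):
    """Score after any round: 1/n times the number of distinct classes
    that appear among all labels returned for queried points so far
    (the query set accumulates across rounds)."""
    return len(set(queried_labels)) / n_classes

# Queries so far revealed 3 of the 5 classes:
print(discovery_score(["a", "b", "a", "c"], 5))   # 0.6
```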
After examining the fraction of noise points (i.e., points drawn from the uniform background component) found by each algorithm, we discovered that lowlik actually scores worse than interleave on this metric as well.

The remaining experiments were run on various real data sets. Table 1 has a summary of their properties. They represent data and computational effort orders of magnitude larger than any active-learning result of which we are aware.

Results for the SHUTTLE set appear in Figure 4. We see that it takes the interleave algorithm five rounds to spot all classes, whereas the next best is lowlik, with 11. The ABALONE set (Figure 5) is a very noisy set, in which random seems to be the best long-term strategy. Again, note how ambig performs very poorly.

Due to resource limitations, results for KDD were obtained on a 50000-record random subsample of the original training set (which is roughly ten times bigger). This set has an extremely skewed distribution of class sizes, and a large number of classes. In Figure 5 we see that lowlik performs uncharacteristically poorly. Another surprise is that the combination of lowlik and ambig outperforms them both. It also matches interleave in performance, and this is the only case where we have seen it do so.

The EDSGC set, as distributed, is unlabeled. The class labels relate to the shape and size of the sky object. We see in Figure 6 that for the purpose of class discovery, we can do a good job in a small number of rounds: here, a human would have had to label just 250 objects before being presented with a member of the smallest class - comprising just 24 records out of a set of 1.4 million.

4 Conclusion

We have shown that some of the popular methods for active learning perform poorly in realistic active-learning scenarios where classes are imbalanced.
Working from the definition of a mixture model, we were able to propose methods which let each component "nominate" its favorite queries. These methods work well in the presence of noisy data and extremely rare classes and anomalies. Our simulations show that a human user only needs to label one or two hundred examples before being presented with very rare anomalies in huge data sets. In our experience, this kind of interaction takes just an hour or two of combined human and computer time [16].

We make no assumptions about the particular form a component takes. Consequently, we expect our results to apply to many different kinds of component models, including the case where components are not dependency trees, or even not all from the same distribution.

We are using lessons learned from our empirical comparison in an application for anomaly-hunting in the astrophysics domain. Our application presents multiple indicators to help a user spot anomalous data, as well as controls for labeling points and adding classes. The application will be described in a companion paper.

References

 [1] Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. Active semi-supervision for pairwise constrained clustering. Submitted for publication, February 2003.
 [2] M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural Computation, University of Edinburgh, 2000.
 [3] Klaus Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
 [4] David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 705-712. The MIT Press, 1995.
 [5] Nirmalie Wiratunga, Susan Craw, and Stewart Massie. Index driven selective sampling for CBR, 2003.
To appear in Proceedings of the Fifth International Conference on Case-Based Reasoning, Springer-Verlag, Trondheim, Norway, 23-26 June 2003.
 [6] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
 [7] Mark Plutowski and Halbert White. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2):305-318, March 1993.
 [8] Shahshahani and Landgrebe. The effect of unlabeled examples in reducing the small sample size problem. IEEE Transactions on Geoscience and Remote Sensing, 32(5):1087-1095, 1994.
 [9] Miller and Uyar. A mixture of experts classifier with learning based on both labeled and unlabelled data. In NIPS-9, 1997.
[10] H. S. Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Computational Learning Theory, pages 287-294, 1992.
[11] David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In William W. Cohen and Haym Hirsh, editors, Proceedings of ICML-94, 11th International Conference on Machine Learning, pages 148-156, New Brunswick, US, 1994. Morgan Kaufmann Publishers, San Francisco, US.
[12] P. Brazdil and J. Gama. StatLog, 1991. http://www.liacc.up.pt/ML/statlog.
[13] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/mlearn/MLRepository.html.
[14] R. C. Nichol, C. A. Collins, and S. L. Lumsden. The Edinburgh/Durham southern galaxy catalogue -- IX. Submitted to the Astrophysical Journal, 2000.
[15] SDSS. The Sloan Digital Sky Survey, 1998. www.sdss.org.
[16] Dan Pelleg. Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. PhD thesis, Carnegie-Mellon University, 2004. Tech Report CMU-CS-04-134.
[17] David MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.
[18] Fabio Gagliardi Cozman, Ira Cohen, and Marcelo Cesar Cirelo.
Semi-supervised learning of mixture models and Bayesian networks. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[19] Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online choice of active learning algorithms. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[20] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems 18, 2004.
", "award": [], "sourceid": 2554, "authors": [{"given_name": "Dan", "family_name": "Pelleg", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}