{"title": "Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 880, "abstract": null, "full_text": " Co-Validation: Using Model Disagreement on\n Unlabeled Data to Validate Classification\n Algorithms\n\n\n\n Omid Madani, David M. Pennock, Gary W. Flake\n Yahoo! Research Labs\n 3rd floor, Pasadena Ave.\n Pasadena, CA 91103\n {madani|pennockd|flakeg}@yahoo-inc.com\n\n Abstract\n\n In the context of binary classification, we define disagreement as a mea-\n sure of how often two independently-trained models differ in their clas-\n sification of unlabeled data. We explore the use of disagreement for error\n estimation and model selection. We call the procedure co-validation,\n since the two models effectively (in)validate one another by comparing\n results on unlabeled data, which we assume is relatively cheap and plen-\n tiful compared to labeled data. We show that per-instance disagreement\n is an unbiased estimate of the variance of error for that instance. We also\n show that disagreement provides a lower bound on the prediction (gen-\n eralization) error, and a tight upper bound on the \"variance of prediction\n error\", or the variance of the average error across instances, where vari-\n ance is measured across training sets. We present experimental results on\n several data sets exploring co-validation for error estimation and model\n selection. The procedure is especially effective in active learning set-\n tings, where training sets are not drawn at random and cross validation\n overestimates error.\n\n\n1 Introduction\n\nBalancing hypothesis-space generality with predictive power is one of the central tasks in\ninductive learning. 
The difficulties that arise in seeking an appropriate tradeoff go by a variety of names--overfitting, data snooping, memorization, no free lunch, bias-variance tradeoff, etc.--and lead to a number of known solution techniques or philosophies, including regularization, minimum description length, model complexity penalization (e.g., BIC, AIC), Ockham's razor, training with noise, ensemble methods (e.g., boosting), structural risk minimization (e.g., SVMs), cross validation, hold-out validation, etc.\n\nAll of these methods in some way attempt to estimate or control the prediction (generalization) error of an induced function on unseen data. In this paper, we explore a method of error estimation that we call co-validation. The method trains two independent functions that in a sense validate (or invalidate) one another by examining their mutual rate of disagreement across a set of unlabeled data. In Section 2, we formally define disagreement. The measure simultaneously reflects notions of algorithm stability, model capacity, and problem complexity. For example, empirically we find that disagreement goes down when we increase the training set size, reduce the model's capacity (complexity), or reduce the inherent difficulty of the learning problem. Intuitively, the higher the disagreement rate, the higher the average error rate of the learner, where the average is taken over both test instances and training subsets. Therefore disagreement is a measure of the fitness of the learner to the learning task. However, as researchers have noted in relation to various measures of learner stability in general [Kut02], while robust learners (i.e., algorithms with low prediction error) are stable, a stable learning algorithm does not necessarily have low prediction error. In the same vein, we show and explain that the disagreement measure provides only lower bounds on error. 
Still, our empirical results give evidence that\ndisagreement can be a useful estimate in certain circumstances.\n\nSince we require a source of unlabeled data--preferably a large source in order to accu-\nrately measure disagreement--we assume a semi-supervised setting where unlabeled data\nis relatively cheap and plentiful while labeled data is scarce or expensive. This scenario is\noften realistic, most notably for text classification. We focus on the binary classification\nsetting and analyze 0/1 error.\n\nIn practice, cross validation--especially leave-one-out cross validation--often provides an\naccurate and reliable error estimate. In fact, under the usual assumption that training and\ntest data both arise from the same distribution, k-fold cross validation provides an unbiased\nestimate of prediction error (for functions trained on m(1 - 1/k) many instances, m being\nthe total number of labeled instances). However, in many situations, training data may\nactually arise from a different distribution than test data. One extreme example of this is\nactive learning, where training samples are explicitly chosen to be maximally informative,\nusing a process that is neither independent nor reflective of the test distribution. Even\nbeyond active learning, in practice the process of gathering data and obtaining labels often\nmay bias the training set, for example because some inputs are cheaper or easier to label,\nor are more readily available or obvious to the data collector, etc. 
In these cases, the error estimate obtained from cross validation may not yield an accurate measure of the prediction error of the learned function, and model selection based on cross validation may suffer. Empirically we find that in active learning settings, disagreement often provides a more accurate estimate of prediction error and is more useful as a guide for model selection.\n\nRelated to the problem of (average) error estimation is the problem of error variance estimation: both variance across test instances and variance across functions (i.e., training sets). Even if a learning algorithm exhibits relatively low average error, if it exhibits high variance, the algorithm may be undesirable depending on the end-user's risk tolerance. Variance is also useful for algorithm comparison, to determine whether observed error differences are statistically significant. For variance estimation, cross validation is on much less solid footing: in fact, Bengio and Grandvalet [BG03] recently proved an impossibility result showing that no method exists for producing an unbiased estimate of the variance of cross validation error in a pure supervised setting with labeled training data only. In this work, we show how disagreement relates to certain measures of variance. First, the disagreement on a particular instance provides an unbiased estimate of the variance of error on that instance. Second, disagreement provides an upper bound on the variance of prediction error (the type of variance useful for algorithm comparison).\n\nThe paper is organized as follows. In Section 2 we formally define disagreement and prove how it lower-bounds prediction error and upper-bounds variance of prediction error. In Section 3 we empirically explore how error estimates and model selection strategies that we devise based on disagreement compare against cross validation in standard (iid) learning settings and in active learning settings. In Section 4 we discuss related work. 
We conclude in Section 5.\n\n2 Error, Variance, and Disagreement\n\nDenote a set of input instances by X. Each instance x ∈ X is a vector of feature attributes. Each instance has a unique true classification or label y_x ∈ {0, 1}, in general unknown to the learner. Let Z = {(x, y_x)}^m be a set of m labeled training instances provided to the learner. The learner is an algorithm A : Z → F that inputs labeled instances and outputs a function f ∈ F, where F is the set of all functions (classifiers) that A may output (the hypothesis space). Each f ∈ F is a function that maps instances x to labels {0, 1}. The goal of the algorithm is to choose f ∈ F to minimize 0/1 error (defined below) on future unlabeled test instances.\n\nWe assume the training set size is fixed at some m > 0, and we take expectations over one or both of two distributions: (1) the distribution X over instances in X, and (2) the distribution F induced over the functions F, when learner A is trained on training sets of size m obtained by sampling from X.\n\nThe 0/1 error e_{x,f} of a given function f on a given instance x equals 1 if and only if the function incorrectly classifies the instance, and equals 0 otherwise; that is, e_{x,f} = 1{f(x) ≠ y_x}. We define the expected prediction error e of algorithm A as e = E_{f,x} e_{f,x}, where the expectation is taken over instances drawn from X (x ∼ X) and functions drawn from F (f ∼ F). The variance of prediction error σ² is useful for comparing different learners (e.g., [BG03]). Let e_f denote the 0/1 error of function f (i.e., e_f = E_x e_{x,f}). Then σ² = E_f[(e_f - e)²] = E_f[e_f²] - e².\n\nDefine the disagreement between two classifiers f1 and f2 on instance x as 1{f1(x) ≠ f2(x)}. 
The disagreement rate of learner A is then:\n\n    d = E_{x,f1,f2} 1{f1(x) ≠ f2(x)},    (1)\n\nwhere recall that the expectation is taken over x ∼ X, f1 ∼ F, f2 ∼ F (with respect to training sets of some fixed size m).\n\nLet d_x be the (expected) disagreement at x when we sample functions from F: d_x = E_{f1,f2} 1{f1(x) ≠ f2(x)}. Similarly, let e_x and σ²_x denote respectively the error and variance at x: e_x = P(f(x) ≠ y_x) = E_f 1{f(x) ≠ y_x} = E_f e_{f,x}, and σ²_x = VAR(e_{f,x}) = E_f[(1{f(x) ≠ y_x} - e_x)²] = e_x(1 - e_x). (The last equality follows from the fact that e_{f,x} is a Bernoulli/binary random variable.) Now, we can establish the connection between disagreement and variance of error (of the learner) at instance x:\n\n    d_x = E_{f1,f2} 1{(f1(x) = y_x and f2(x) ≠ y_x) or (f1(x) ≠ y_x and f2(x) = y_x)}\n        = P((f1(x) = y_x and f2(x) ≠ y_x) or (f1(x) ≠ y_x and f2(x) = y_x))\n        = 2 P(f1(x) = y_x and f2(x) ≠ y_x) = 2 e_x(1 - e_x) = 2σ²_x,  so  σ²_x = d_x/2.    (2)\n\nThe derivations follow from the fact that the expectation of a Bernoulli random variable is the same as its probability of being 1, that the two events above (the event (f1(x) = y_x and f2(x) ≠ y_x) and the event (f1(x) ≠ y_x and f2(x) = y_x)) are mutually exclusive and have equal probability, and that the two events f1(x) = y_x and f2(x) ≠ y_x are conditionally independent (note that the two events are conditioned on x, and the two functions are picked independently of one another). Furthermore, d = E_x E_{f1,f2}[1{f1(x) ≠ f2(x)}] = E_x d_x = E_x(2σ²_x) = 2 E_x[e_x(1 - e_x)] = 2(e - E_x e_x²), and therefore:\n\n    d = 2(e - E_x e_x²).    (3)\n\n2.1 Bounds on Variance via Disagreement\n\nThe variance of prediction error σ² can be used to test the significance of the difference in two learners' error rates. Bengio and Grandvalet [BG03] show that there is no unbiased estimator of the variance of k-fold cross-validation in the supervised setting. 
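The per-instance identity d_x = 2 e_x (1 - e_x) = 2σ²_x is easy to check with a small Monte Carlo simulation. The sketch below is illustrative only (not from the paper): it models two independently trained classifiers as each misclassifying a fixed instance x independently with probability e_x, and measures how often they disagree.

```python
import random

random.seed(0)

def simulate_instance(error_prob, trials=200_000):
    """For a fixed instance x, sample pairs of independently trained
    classifiers, each wrong with probability e_x, and return the observed
    disagreement rate d_x."""
    disagree = 0
    for _ in range(trials):
        wrong1 = random.random() < error_prob  # f1 misclassifies x
        wrong2 = random.random() < error_prob  # f2 misclassifies x
        if wrong1 != wrong2:                   # exactly one wrong => f1(x) != f2(x)
            disagree += 1
    return disagree / trials

for e_x in (0.1, 0.3, 0.5):
    d_x = simulate_instance(e_x)
    print(f"e_x={e_x}: observed d_x={d_x:.3f}, "
          f"predicted 2*e_x*(1-e_x)={2 * e_x * (1 - e_x):.3f}")
```

The observed disagreement matches 2 e_x (1 - e_x) up to sampling noise, in agreement with Equation 2.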
We can see from Equation 2 that having access to disagreement at a given instance x (labeled or not) does yield the variance of error at that instance. Thus disagreement obtained via 2-fold training gives us an unbiased estimator of σ²_x, the variance of prediction error at instance x, for functions trained on m/2 instances. (Note that for unbiasedness, none of the functions should have been trained on the given instance.) Of course, to compare different algorithms on a given instance, one also needs the average error at that instance.\n\nIn terms of overall variance of prediction error σ² (where error is averaged across instances and variance taken across functions), there exist scenarios when σ² is 0 but d is not (when errors of the different functions learned are the same but negatively correlated), and scenarios when σ² = d/2 ≠ 0. In fact, disagreement yields an upper bound:\n\nTheorem 1  d ≥ 2σ².\n\nProof (sketch). We show that the result holds for any finite sampling of functions and instances: Consider the binary (0/1) matrix M where the rows correspond to instances and the columns correspond to functions, and the entries are the binary-valued errors (entry M_{i,j} = 1{f_j(x_i) ≠ y_{x_i}}). Thus the average error is the fraction of 1 entries when samplings of instances and functions are drawn from X and F respectively, and variances and disagreement can also be readily defined for the matrix. We show the inequality holds for any such n × n matrix for any n. This establishes the theorem (by using limiting arguments). Treat the 1 entries (matrix cells) as vertices in a graph, where an edge exists between two 1 entries if they share a column or a row. For a fixed number of 1 entries N (N ≤ n²), we show the difference between disagreement and variance is minimized when the number of edges is maximized. 
We establish that the configuration maximizing the number of edges occurs when all the 1 entries form a compact formation, that is, all the matrix entries in row i are filled before filling row i+1 with 1s. Finally, we show that for such a configuration minimizing the difference, the difference remains nonnegative. □\n\nIn typical small training sample size cases, when the errors are nonzero and not entirely correlated (the pattern of 1s in the matrix is basically scattered), d/2 can be significantly larger than σ². With increasing training size, the functions learned tend to make the same errors and d and σ² both approach 0.\n\n2.2 Bounds on Error via Disagreement\n\nFrom Jensen's inequality, we have that E_x e_x² ≥ (E_x e_x)² = e², therefore using Eq. 3, we conclude that d/2 ≤ e - e². This implies that\n\n    (1 - √(1 - 2d))/2 ≤ e ≤ (1 + √(1 - 2d))/2.    (4)\n\nThe upper bound derived is often not informative, as it is greater than 0.5, and often we know the error is less than 0.5. Let e_l = (1 - √(1 - 2d))/2. We next discuss whether/when e_l can be far from the actual error, and the related question of whether we can derive a good upper bound or just a good estimator on error using a measure based on disagreement.\n\nWhen functions generated by the learner make correlated and frequent mistakes, e_l can be far from true error. The extreme case of this is a learner that always outputs a constant function. In order to account for weak but stable learners, the error lower bound should be complemented with some measure that ensures that the learner is actually adapting (i.e., doing its job!). We explore using the training (empirical) error for this purpose. Let ẽ denote the average training error of the algorithm: ẽ = E_f ẽ_f = E_f (1/m) Σ_{x_i ∈ Z} 1{f(x_i) ≠ y_{x_i}}, where Z is the training set that yielded f. Define ê = max(ẽ, e_l). 
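The arithmetic of Equation 4 and the combined criterion ê = max(ẽ, e_l) can be made concrete in a few lines. This is a sketch with hypothetical function names, not the paper's code:

```python
import math

def error_lower_bound(d):
    """e_l from Eq. 4: the smaller root of e^2 - e + d/2 = 0.
    Well-defined since d = 2*E_x[e_x(1 - e_x)] <= 1/2, so 1 - 2d >= 0."""
    return (1.0 - math.sqrt(1.0 - 2.0 * d)) / 2.0

def covalidation_estimate(train_error, d):
    """e_hat = max(training error, disagreement-based lower bound on error)."""
    return max(train_error, error_lower_bound(d))

# If error were a constant e across instances, d = 2e(1-e) and the bound
# is tight: e.g., e = 0.1 gives d = 0.18 and e_l = 0.1.
print(covalidation_estimate(0.05, 0.18))
```

Taking the maximum with the training error guards against weak but stable learners: a constant classifier has d = 0 and hence e_l = 0, but its training error stays visibly large.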
We explore ê as a candidate criterion for model selection, which we compare against the cross-validation criterion in Section 3.\n\nNote that a learner can exhibit low disagreement and low training error, yet still have high prediction error. For example, the learner could memorize the training data and output a constant on all other instances. (Though when disagreement is exactly zero, the test error equals the training error.) A measure of self-disagreement within the labeled training set, defined by Lange et al. [LBRB02], in conjunction with the empirical training error does yield an upper bound. Still, we find empirically that, when using SVMs, naive Bayes, or logistic regression, disagreement on unlabeled data does not tend to wildly underestimate error, even though it is theoretically possible.\n\n3 Experiments\n\nWe conducted experiments on the \"20 Newsgroups\" and Reuters-21578 text categorization datasets, and the Votes, Chess, Adult, and Optics datasets from the UCI collection [BKM98].1 We chose two categorization tasks from the newsgroups sets: (1) identifying Baseball documents in a collection containing both Baseball and Hockey documents (2000 total documents), and (2) identifying alt.atheism documents from among the alt.atheism, soc.religion.christian, and talk.religion.misc collections (3000 documents). For the Reuters set, we chose documents belonging to one of the top 10 categories of the corpus (9410 documents), and we attempt to discriminate the \"Earn\" (3964) and \"Acq\" (2369) categories respectively from the remaining nine. These categories are large enough that 0/1 error remains a reasonable measure. We used the bow library for stemming and stop words, kept features up to 3-grams, and used l2-normalized frequency counts [McC96]. The Votes, Chess, Adult, and Optics datasets have respectively 435, 3197, 32561, and 1800 instances. These datasets give us some representation of the various types of learning problems. 
All our data sets are in a nonnegative feature value representation. We used support vector machines with polynomial kernels available from the libsvm library [CL01] in all our experiments.2 For the error estimation experiments, we used linear SVMs with a C value of 10. For the model selection experiments, we used polynomial degree as the model selection parameter.\n\n3.1 Error Estimation\n\nWe first examine the use of disagreement for error estimation both in the standard setting where training and test samples are uniformly iid, and in an active learning scenario.\n\nFor each of several training set sizes for each data set, we computed average results and standard deviation across thirty trials. In each trial, we first generate a training set, sampled either uniformly iid or actively, then set aside 20% of remaining instances as the test set. Next, we partition the training set into equal halves, train an SVM on each half, and compute the disagreement rate between the two SVMs across the set of (unlabeled) data that has not been designated for the training or test set (80% of total - m instances). We repeat this inner loop of partitioning, dual training, and disagreement computation thirty times and take averages.\n\nWe examined the utility of our disagreement bound (4) as an estimate of the true test error of the algorithm trained on the full data set (\"trueE\"). We also examined using the maximum of the training error (\"trainE\") and lower bound on error from our disagreement measure (\"disE\") as an estimate of trueE (\"maxDtE = max(trainE, disE)\"). Note that disE and trainE are respectively unbiased empirical estimates of the expected disagreement d and expected training error ẽ of Section 2 for the standard setting. 
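The inner loop just described (partition the labeled set into halves, train one model per half, average disagreement over the unlabeled pool) can be sketched as follows. This is a self-contained illustration rather than the paper's code: it substitutes a toy nearest-centroid learner on 1-D synthetic data for the SVMs.

```python
import random

random.seed(1)

def train_centroid(labeled):
    """Toy stand-in for the paper's SVM: a nearest-centroid classifier on
    1-D inputs (a hypothetical learner, chosen only to keep the sketch
    self-contained)."""
    c = {0: [], 1: []}
    for x, y in labeled:
        c[y].append(x)
    m0 = sum(c[0]) / len(c[0])
    m1 = sum(c[1]) / len(c[1])
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

def covalidation_disagreement(labeled, unlabeled, repeats=30):
    """Inner loop of Section 3.1: repeatedly split the labeled set into
    halves, train one model per half, and average their disagreement rate
    over the unlabeled pool."""
    rates = []
    for _ in range(repeats):
        random.shuffle(labeled)
        half = len(labeled) // 2
        f1 = train_centroid(labeled[:half])
        f2 = train_centroid(labeled[half:])
        n_dis = sum(f1(x) != f2(x) for x in unlabeled)
        rates.append(n_dis / len(unlabeled))
    return sum(rates) / len(rates)

# Synthetic binary task: class 0 near -1, class 1 near +1.
labeled = [(random.gauss(2 * y - 1, 1.0), y) for y in [0, 1] * 20]
unlabeled = [random.gauss(2 * (i % 2) - 1, 1.0) for i in range(500)]
print("estimated disagreement:", covalidation_disagreement(labeled, unlabeled))
```

With 40 labeled points the two half-trained models place their decision boundaries close together, so the measured disagreement is small, as the paper observes for well-separated tasks.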
Since our disagreement measure is actually a bound on half error (i.e., error averaged over training sets of size m/2), we also compare against two-fold cross-validation error (\"2cvE\"), and the true test error of the two functions obtained from training on the two halves (\"1/2trueE\").\n\n 1Available from http://www.ics.uci.edu/ and http://www.daviddlewis.com/resources/testcollections/\n 2We observed similar results in error estimation using linear logistic regression and naive Bayes learners in preliminary experiments.\n\nFigure 1: 0/1 error versus training set size for a linear SVM on the Baseball-vs-Hockey dataset, comparing trueE, 1/2trueE, 2cvE, disE, trainE, and maxDtE: (a) random training set; (b) actively picked.\n\nFigure 2: Plots of ratios when active learning: (a) (2cvE - trueE)/(disE - trueE), (b) disE/trueE, (c) disE/(1/2trueE).\n\nIn the standard scenario, when the training set is chosen uniformly at random from the corpus, leave-one-out cross validated error (\"looE\") is generally a very good estimate of trueE, while 2cvE is a good estimate for 1/2trueE. For all the data sets, as expected our error estimate maxDtE underestimates 1/2trueE. 
A representative example is shown in Figure 1(a).\n\nIn the active learning scenario, the training set is chosen in an attempt to maximize information, and the choice of each new instance depends on the set of previously chosen instances. Often this means that especially difficult instances are chosen (or at least instances whose labels are difficult to infer from the current training set). Thus cross validation naturally overestimates the difficulty of the learning task and so may greatly overestimate error. On the other hand, an approximate model of active learning is that the instances are iid sampled from a hard distribution. This ignores the sequential nature of active learning. Measuring disagreement on the easier test distribution via subsampling the training set may remain a good estimator of the actual test error.\n\nWe used linear SVMs as the basis for our active learning procedure. In each trial, we begin with a random training set of size 10, and then grow the labeled set using the uncertainty sampling technique. We computed the various error measures at regular intervals.3 A representative plot of errors during active learning is given in Fig. 1(b). In all the datasets experimented with, we have observed the same pattern: the error estimate using disagreement provides a much better estimate of 1/2trueE and trueE than does 2cvE (Fig. 2a), and can be used as an indication of the error and the progress of active learning. Note that while we have not computed looE in the error-estimation experiments, Fig. 1(b) indicates that 2cvE is not a good estimator of trueE at size m/2 either, and this has been the case in all our experiments. We have observed that disE estimates 1/2trueE best (Fig. 2c). The estimation performance may degrade towards the end of active learning when the learner converges (disagreement approaches 0). 
However, we have observed that both 1/2trueE (obtained via subsampling) and disE tend to overestimate the actual error of the active learner even at half the training size (e.g., Fig. 1(b)). This observation underlines the importance of taking the sequential nature of active learning into account.\n\n 3We could use a criterion based on disagreement for selective sampling, but we have not thoroughly explored this option.\n\nFigure 3: (a) An example where maxDtE performs particularly well as a model selection criterion, tracking the true error curve (trueE versus SVM polynomial degree) more closely than looE or 2cvE. (b) A summary of all experiments plotting looE versus maxDtE on a log-log scale: points above the diagonal indicate maxDtE outperforming looE.\n\n3.2 Model Selection\n\nWe explore various criteria for selecting the expected best among twenty SVMs, each trained using a different polynomial degree kernel. For each data set, we manually identify an interval of polynomial degrees that seems to include the error minimum4, then choose twenty degrees equally spaced within that interval. We compare our disagreement-based estimate maxDtE with the cross validation estimates looE and 2cvE as model selection criteria. In each trial, we identify the polynomial degree that is expected to be best according to each criterion, then train an SVM at that degree on the full training set. We compare trueE at the degree selected by each criterion against trueE at the actual optimal degree.\n\nIn the standard uniform iid scenario, though cross validation often does fail as a model selection criterion for regression problems, it seems that cross validation in general is hard to beat for classification problems [SS02]. 
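The selection rule amounts to picking the degree whose maxDtE = max(trainE, disE) is smallest. A minimal sketch (with a hypothetical `score` interface, not the paper's implementation):

```python
def select_model(degrees, score):
    """Pick the parameter whose maxDtE = max(trainE, disE) is smallest.
    `score(p)` returns the measured pair (trainE, disE) for parameter p."""
    return min(degrees, key=lambda p: max(score(p)))

# Hypothetical measurements: low degree underfits (high disE lower bound),
# high degree overfits (high trainE on held-out half); degree 2 balances both.
measured = {1: (0.00, 0.30), 2: (0.02, 0.10), 3: (0.15, 0.05)}
print(select_model([1, 2, 3], measured.get))
```

The same one-liner applies to the cross-validation criteria by passing a `score` that returns (looE, looE) or (2cvE, 2cvE).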
We find that both looE and 2cvE modestly outperform maxDtE as model selection criteria, though maxDtE is often competitive. We are exploring using the maximum of cross validation and maxDtE as an alternative, with preliminary evidence of a slight advantage over cross validation alone.\n\nIn an active learning setting, even though cross validation overestimates error, it is theoretically possible that cross validation would still function well to identify the best or near-best model. However, our experiments suggest that the performance of cross validation as a model selection criterion indeed degrades under active learning. In this situation, maxDtE serves as a consistently better model selection criterion. Figure 3(a) shows an example where maxDtE performs particularly well.\n\nThe active learning model selection experiments proceed as follows. For each data set, we use one run of active learning to identify 200 ordered and actively-picked instances. For each training size m ∈ {25, 50, 100, 200}, we run thirty experiments using a random shuffling of the size-m prefix of the 200 actively-picked instances. In each trial and for each of the twenty polynomial degrees, we measure trueE and looE, then run an inner loop of thirty random partitionings and dual trainings to measure average d, expE, 2cvE, and 1/2trueE. Disagreements and errors are measured across the full test set (total - m instances), so this is a transductive learning setting. Figure 3(b) summarizes the results. We observe that model selection based on disagreement often outperforms model selection based on cross-validation, and at times significantly so. 
Across 26 experiments, the win-loss-tie record of maxDtE versus 2cvE was 16-5-5, the record of maxDtE versus looE was 18-6-2, and the record of 2cvE versus looE was 15-9-2.\n\n 4Although for fractional degrees less than 1 the kernel matrix is not guaranteed to be positive semi-definite, we included such ranges whenever the range included the error minimum. Non-integral degrees greater than 1 do not pose a problem as the feature values in all our problem representations are nonnegative.\n\n4 Related Work\n\nPrevious work has already shown that using various measures of stability on unlabeled data is useful for ensemble learning, model selection, and regularization, both in supervised and unsupervised learning [KV95, Sch97, SS02, BC03, LBRB02, LRBB04]. Metric-based methods for model selection are complementary to our approach in that they are designed to prefer models/algorithms that behave similarly on the labeled and unlabeled data [Sch97, SS02, BC03], while disagreement is a measure of self-consistency on the same dataset (in this paper, unlabeled data only). Consequently, our method is also applicable to scenarios in which the test and training distributions are different. Lange et al. [LBRB02, LRBB04] also explore disagreement on unlabeled data, establishing robust model selection techniques based on disagreement for clustering. Theoretical work on algorithmic stability focuses on deriving generalization bounds given that the algorithm has certain inherent stability properties [KN02].\n\n5 Conclusions and Future Work\n\nTwo advantages of co-validation over traditional techniques are: (1) disagreement can be measured to almost arbitrary precision assuming unlabeled data is plentiful, and (2) disagreement is measured on unlabeled data drawn from the same distribution as test instances, the extreme case of which is in transductive learning where the unlabeled and test instances coincide. 
In this paper we derived bounds on certain measures of error and variance based on disagreement, then examined empirically when co-validation might be useful. We found co-validation particularly useful in active learning settings. Future goals include extending the theory to active learning, precision/recall, algorithm comparison (using variance), ensemble learning, and regression. We plan to compare semi-supervised and transductive learning, and consider procedures to generate fictitious unlabeled data.\n\nReferences\n\n[BC03] Y. Bengio and N. Chapados. Extensions to metric-based model selection. Journal of Machine Learning Research, 2003.\n[BG03] Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. In NIPS, 2003.\n[BKM98] C. L. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998.\n[CL01] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n[KN02] S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In UAI, 2002.\n[Kut02] S. Kutin. Algorithmic stability and ensemble-based learning. PhD thesis, University of Chicago, 2002.\n[KV95] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In NIPS, 1995.\n[LBRB02] T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. In NIPS, 2002.\n[LRBB04] T. Lange, V. Roth, M. Braun, and J. Buhmann. Stability based validation of clustering algorithms. Neural Computation, 16, 2004.\n[McC96] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.\n[Sch97] D. Schuurmans. A new metric-based approach to model selection. In AAAI, 1997.\n[SS02] D. Schuurmans and F. Southey. Metric-based methods for adaptive model selection and regularization. 
Machine Learning, pages 51-84, 2002.\n", "award": [], "sourceid": 2603, "authors": [{"given_name": "Omid", "family_name": "Madani", "institution": null}, {"given_name": "David", "family_name": "Pennock", "institution": null}, {"given_name": "Gary", "family_name": "Flake", "institution": null}]}