{"title": "Neural Network Ensembles, Cross Validation, and Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 231, "page_last": 238, "abstract": null, "full_text": "Neural Network Ensembles, Cross \nValidation, and Active Learning \n\nAnders Krogh\" \n\nNordita \n\nBlegdamsvej 17 \n\n2100 Copenhagen, Denmark \n\nJesper Vedelsby \n\nElectronics Institute, Building 349 \nTechnical University of Denmark \n\n2800 Lyngby, Denmark \n\nAbstract \n\nLearning of continuous valued functions using neural network en(cid:173)\nsembles (committees) can give improved accuracy, reliable estima(cid:173)\ntion of the generalization error, and active learning. The ambiguity \nis defined as the variation of the output of ensemble members aver(cid:173)\naged over unlabeled data, so it quantifies the disagreement among \nthe networks. It is discussed how to use the ambiguity in combina(cid:173)\ntion with cross-validation to give a reliable estimate of the ensemble \ngeneralization error, and how this type of ensemble cross-validation \ncan sometimes improve performance. It is shown how to estimate \nthe optimal weights of the ensemble members using unlabeled data. \nBy a generalization of query by committee, it is finally shown how \nthe ambiguity can be used to select new training data to be labeled \nin an active learning scheme. \n\n1 \n\nINTRODUCTION \n\nIt is well known that a combination of many different predictors can improve predic(cid:173)\ntions. In the neural networks community \"ensembles\" of neural networks has been \ninvestigated by several authors, see for instance [1, 2, 3]. Most often the networks \nin the ensemble are trained individually and then their predictions are combined. \nThis combination is usually done by majority (in classification) or by simple aver(cid:173)\naging (in regression), but one can also use a weighted combination of the networks . \n\n.. Author to whom correspondence should be addressed. 
Email: krogh@nordita.dk

At the workshop after the last NIPS conference (December, 1993) an entire session was devoted to ensembles of neural networks ("Putting it all together", chaired by Michael Perrone). Many interesting papers were given, and it showed that this area is getting a lot of attention.

A combination of the output of several networks (or other predictors) is only useful if they disagree on some inputs. Clearly, there is no more information to be gained from a million identical networks than there is from just one of them (see also [2]). By quantifying the disagreement in the ensemble it turns out to be possible to state this insight rigorously for an ensemble used for approximation of real-valued functions (regression). The simple and beautiful expression that relates the disagreement (called the ensemble ambiguity) and the generalization error is the basis for this paper, so we will derive it with no further delay.

2 THE BIAS-VARIANCE TRADEOFF

Assume the task is to learn a function f from R^N to R for which you have a sample of p examples, (x^\mu, y^\mu), where y^\mu = f(x^\mu) and \mu = 1, ..., p. These examples are assumed to be drawn randomly from the distribution p(x). Everything in the following is easy to generalize to several output variables.

The ensemble consists of N networks and the output of network \alpha on input x is called V^\alpha(x). A weighted ensemble average is denoted by a bar, like

    \bar{V}(x) = \sum_\alpha w_\alpha V^\alpha(x).    (1)

This is the final output of the ensemble. We think of the weight w_\alpha as our belief in network \alpha and therefore constrain the weights to be positive and sum to one. The constraint on the sum is crucial for some of the following results.

The ambiguity on input x of a single member of the ensemble is defined as a^\alpha(x) = (V^\alpha(x) - \bar{V}(x))^2. The ensemble ambiguity on input x is

    \bar{a}(x) = \sum_\alpha w_\alpha a^\alpha(x) = \sum_\alpha w_\alpha (V^\alpha(x) - \bar{V}(x))^2.    (2)

It is simply the variance of the weighted ensemble around the weighted mean, and it measures the disagreement among the networks on input x. The quadratic errors of network \alpha and of the ensemble are

    \epsilon^\alpha(x) = (f(x) - V^\alpha(x))^2    (3)
    e(x) = (f(x) - \bar{V}(x))^2    (4)

respectively. Adding and subtracting f(x) in (2) yields

    \bar{a}(x) = \sum_\alpha w_\alpha \epsilon^\alpha(x) - e(x)    (5)

(after a little algebra using that the weights sum to one). Calling the weighted average of the individual errors \bar{\epsilon}(x) = \sum_\alpha w_\alpha \epsilon^\alpha(x), this becomes

    e(x) = \bar{\epsilon}(x) - \bar{a}(x).    (6)

All these formulas can be averaged over the input distribution. Averages over the input distribution will be denoted by capital letters, so

    E^\alpha = \int dx\, p(x)\, \epsilon^\alpha(x)    (7)
    A^\alpha = \int dx\, p(x)\, a^\alpha(x)    (8)
    E = \int dx\, p(x)\, e(x).    (9)

The first two of these are the generalization error and the ambiguity respectively for network \alpha, and E is the generalization error for the ensemble. From (6) we then find for the ensemble generalization error

    E = \bar{E} - \bar{A}.    (10)

The first term on the right is the weighted average of the generalization errors of the individual networks (\bar{E} = \sum_\alpha w_\alpha E^\alpha), and the second is the weighted average of the ambiguities (\bar{A} = \sum_\alpha w_\alpha A^\alpha), which we refer to as the ensemble ambiguity.

The beauty of this equation is that it separates the generalization error into a term that depends on the generalization errors of the individual networks and another term that contains all correlations between the networks. Furthermore, the correlation term \bar{A} can be estimated entirely from unlabeled data, i.e., no knowledge is required of the real function to be approximated. The term "unlabeled example" is borrowed from classification problems, and in this context it means an input x for which the value of the target function f(x) is unknown.
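The decomposition (6) is easy to verify numerically at a single input. The following sketch is our own illustration, not code from the paper; the helper name `decompose` is ours:

```python
# Pointwise check of e(x) = eps_bar(x) - a_bar(x), i.e. equation (6).
# Toy illustration: three "network" outputs V^alpha(x) at one input x.

def decompose(outputs, weights, target):
    """Return (ensemble error e, weighted member error eps_bar, ambiguity a_bar)."""
    assert abs(sum(weights) - 1.0) < 1e-12, "weights must sum to one"
    vbar = sum(w * v for w, v in zip(weights, outputs))                 # eq. (1)
    e = (target - vbar) ** 2                                            # eq. (4)
    eps_bar = sum(w * (target - v) ** 2 for w, v in zip(weights, outputs))
    a_bar = sum(w * (v - vbar) ** 2 for w, v in zip(weights, outputs))  # eq. (2)
    return e, eps_bar, a_bar

outputs = [0.9, 1.2, 0.6]    # member predictions V^alpha(x)
weights = [0.5, 0.3, 0.2]    # positive, sum to one
e, eps_bar, a_bar = decompose(outputs, weights, target=1.0)
assert abs(e - (eps_bar - a_bar)) < 1e-9                                # eq. (6)
```

At this toy input the weighted member error (0.049) splits into the ensemble error (0.0049) plus the ambiguity (0.0441): the disagreement term is subtracted, so the ensemble is never worse than the weighted average of its members.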
Equation (10) expresses the tradeoff between bias and variance in the ensemble, but in a different way than the common bias-variance relation [4], in which the averages are over possible training sets instead of ensemble averages. If the ensemble is strongly biased the ambiguity will be small, because the networks implement very similar functions and thus agree on inputs even outside the training set. Therefore the generalization error will be essentially equal to the weighted average of the generalization errors of the individual networks. If, on the other hand, there is a large variance, the ambiguity is high and in this case the generalization error will be smaller than the average generalization error. See also [5].

From this equation one can immediately see that the generalization error of the ensemble is always smaller than the (weighted) average of the ensemble errors, E \leq \bar{E}. In particular for uniform weights:

    E \leq \frac{1}{N} \sum_\alpha E^\alpha    (11)

which has been noted by several authors, see e.g. [3].

3 THE CROSS-VALIDATION ENSEMBLE

From (10) it is obvious that increasing the ambiguity (while not increasing individual generalization errors) will improve the overall generalization. We want the networks to disagree! How can we increase the ambiguity of the ensemble? One way is to use different types of approximators, like a mixture of neural networks of different topologies or a mixture of completely different types of approximators. Another obvious way is to train the networks on different training sets. Furthermore, to be able to estimate the first term in (10) it would be desirable to have some kind of cross-validation. This suggests the following strategy.

Figure 1: An ensemble of five networks was trained to approximate the square wave target function f(x). The final ensemble output (solid smooth curve) and the outputs of the individual networks (dotted curves) are shown. Also the square root of the ambiguity is shown (dash-dot line). For training 200 random examples were used, but each network had a cross-validation set of size 40, so they were each trained on 160 examples.

Choose a number K \leq p. For each network in the ensemble hold out K examples for testing, where the N test sets should have minimal overlap, i.e., the N training sets should be as different as possible. If, for instance, K \leq p/N, it is possible to choose the K test sets with no overlap. This enables us to estimate the generalization error E^\alpha of the individual members of the ensemble, and at the same time make sure that the ambiguity increases. When holding out examples the generalization errors for the individual members of the ensemble, E^\alpha, will increase, but the conjecture is that for a good choice of the size of the ensemble (N) and the test set size (K), the ambiguity will increase more and thus one will get a decrease in overall generalization error.

This conjecture has been tested experimentally on a simple square wave function of one variable shown in Figure 1. Five identical feed-forward networks with one hidden layer of 20 units were trained independently by back-propagation using 200 random examples. For each network a cross-validation set of K examples was held out for testing as described above. The "true" generalization and the ambiguity were estimated from a set of 1000 random inputs.
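The hold-out construction just described can be sketched in a few lines. This is our illustration (the round-robin index assignment and the name `holdout_splits` are ours); when K \leq p/N the test sets come out disjoint, as in the 200/5/40 experiment above:

```python
def holdout_splits(p, n_nets, k):
    """For each of n_nets ensemble members, hold out k of the p example
    indices as a cross-validation set.  With k <= p / n_nets the test sets
    are pairwise disjoint; otherwise the round-robin assignment wraps
    around and keeps the overlap small (a minimal-overlap heuristic)."""
    splits = []
    for a in range(n_nets):
        test = {(a * k + i) % p for i in range(k)}
        train = [i for i in range(p) if i not in test]
        splits.append((train, sorted(test)))
    return splits

# 200 examples, 5 networks, K = 40: disjoint test sets, 160 training examples each.
splits = holdout_splits(200, 5, 40)
```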
The weights were uniform, w_\alpha = 1/5 (non-uniform weights are addressed later).

In Figure 2, average results over 12 independent runs are shown for some values of K (top solid line).

Figure 2: The solid line shows the generalization error for uniform weights as a function of K, where K is the size of the cross-validation sets. The dotted line is the error estimated from equation (10). The dashed line is for the optimal weights estimated by the use of the generalization errors for the individual networks estimated from the cross-validation sets as described in the text. The bottom solid line is the generalization error one would obtain if the individual generalization errors were known exactly (the best possible weights).

First, one should note that the generalization error is the same for a cross-validation set of size 40 as for size 0, although not lower, so it supports the conjecture in a weaker form. However, we have done many experiments, and depending on the experimental setup the curve can take on almost any form; sometimes the error is larger at zero than at 40, or vice versa. In the experiments shown, only ensembles with at least four converging networks out of five were used. If all the ensembles were kept, the error would have been significantly higher at K = 0 than for K > 0, because in about half of the runs none of the networks in the ensemble converged, something that seldom happened when a cross-validation set was used.
Thus it is still unclear under which circumstances one can expect a drop in generalization error when using cross-validation in this fashion.

The dotted line in Figure 2 is the error estimated from equation (10) using the cross-validation sets for each of the networks to estimate E^\alpha, and one notices a good agreement.

4 OPTIMAL WEIGHTS

The weights w_\alpha can be estimated as described in e.g. [3]. We suggest instead to use unlabeled data and estimate them in such a way that they minimize the generalization error given in (10).

There is no analytical solution for the weights, but something can be said about the minimum point of the generalization error. Calculating the derivative of E as given in (10) subject to the constraints on the weights and setting it equal to zero shows that

    E^\alpha - A^\alpha = E  or  w_\alpha = 0.    (12)

(The calculation is not shown because of space limitations, but it is easy to do.) That is, E^\alpha - A^\alpha has to be the same for all the networks. Notice that A^\alpha depends on the weights through the ensemble average of the outputs. It shows that the optimal weights have to be chosen such that each network contributes exactly w_\alpha E to the generalization error. Note, however, that a member of the ensemble can have such a poor generalization or be so correlated with the rest of the ensemble that it is optimal to set its weight to zero.

The weights can be "learned" from unlabeled examples, e.g. by gradient descent minimization of the estimate of the generalization error (10). A more efficient approach to finding the optimal weights is to turn it into a quadratic optimization problem. That problem is non-trivial only because of the constraints on the weights (\sum_\alpha w_\alpha = 1 and w_\alpha \geq 0). Define the correlation matrix,

    C^{\alpha\beta} = \int dx\, p(x)\, V^\alpha(x) V^\beta(x).    (13)

Then, using that the weights sum to one, equation (10) can be rewritten as

    E = \sum_\alpha w_\alpha E^\alpha + \sum_{\alpha\beta} w_\alpha C^{\alpha\beta} w_\beta - \sum_\alpha w_\alpha C^{\alpha\alpha}.    (14)

Having estimates of E^\alpha and C^{\alpha\beta}, the optimal weights can be found by linear programming or other optimization techniques. Just like the ambiguity, the correlation matrix can be estimated from unlabeled data to any accuracy needed (provided that the input distribution p is known).

In Figure 2 the results from an experiment with weight optimization are shown. The dashed curve shows the generalization error when the weights are optimized as described above using the estimates of E^\alpha from the cross-validation (on K examples). The lowest solid curve is for the idealized case, when it is assumed that the errors E^\alpha are known exactly, so it shows the lowest possible error. The performance improvement is quite convincing when the cross-validation estimates are used.

It is important to notice that any estimate of the generalization error of the individual networks can be used in equation (14). If one is certain that the individual networks do not overfit, one might even use the training errors as estimates for E^\alpha (see [3]). It is also possible to use some kind of regularization in (14), if the cross-validation sets are small.

5 ACTIVE LEARNING

In some neural network applications it is very time consuming and/or expensive to acquire training data, e.g., if a complicated measurement is required to find the value of the target function for a certain input. Therefore it is desirable to only use examples with maximal information about the function. Methods where the learner points out good examples are often called active learning.

We propose a query-based active learning scheme that applies to ensembles of networks with continuous-valued output.
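As an aside on the previous section before the active-learning scheme is developed: the quadratic form (14) is easy to minimize numerically. The sketch below is our own illustration, not the linear-programming implementation used in the paper; the multiplicative-update (exponentiated-gradient) scheme is our choice, and `E_est` and `C_est` are hypothetical numbers standing in for cross-validation and unlabeled-data estimates of E^\alpha and C^{\alpha\beta}:

```python
import math

def ensemble_error(w, E, C):
    """Eq. (14): E = sum_a w_a E^a + sum_ab w_a C^ab w_b - sum_a w_a C^aa."""
    n = len(w)
    quad = sum(w[a] * C[a][b] * w[b] for a in range(n) for b in range(n))
    return sum(w[a] * (E[a] - C[a][a]) for a in range(n)) + quad

def optimal_weights(E, C, eta=0.05, steps=5000):
    """Minimize eq. (14) over the simplex (w_a >= 0, sum_a w_a = 1).
    Multiplicative updates preserve both constraints at every step; for
    positive semi-definite C the objective is convex, so the iterates
    approach the constrained minimum."""
    n = len(E)
    w = [1.0 / n] * n
    for _ in range(steps):
        grad = [E[a] + 2.0 * sum(C[a][b] * w[b] for b in range(n)) - C[a][a]
                for a in range(n)]
        w = [w[a] * math.exp(-eta * grad[a]) for a in range(n)]
        z = sum(w)
        w = [wa / z for wa in w]
    return w

E_est = [0.30, 0.32, 0.50]          # hypothetical estimates of E^alpha
C_est = [[1.0, 0.9, 0.2],
         [0.9, 1.0, 0.3],
         [0.2, 0.3, 1.0]]           # hypothetical correlation matrix C^{alpha beta}
w_opt = optimal_weights(E_est, C_est)
```

Consistent with condition (12), a member that generalizes poorly or is highly correlated with the rest gets its weight driven toward zero rather than a negative value.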
It is essentially a generalization of query by committee [6, 7], which was developed for classification problems. Our basic assumption is that those patterns in the input space yielding the largest error are those points we would benefit the most from including in the training set.

Since the generalization error is always non-negative, we see from (6) that the weighted average of the individual network errors is always larger than or equal to the ensemble ambiguity,

    \bar{\epsilon}(x) \geq \bar{a}(x),    (15)

which tells us that the ambiguity is a lower bound for the weighted average of the squared error. An input pattern that yields a large ambiguity will always have a large average error. On the other hand, a low ambiguity does not necessarily imply a low error. If the individual networks are trained to a low training error on the same set of examples, then both the error and the ambiguity are low on the training points. This ensures that a pattern yielding a large ambiguity cannot be in the close neighborhood of a training example. The ambiguity will to some extent follow the fluctuations in the error. Since the ambiguity is calculated from unlabeled examples, the input space can be scanned for these areas to any detail.

Figure 3: In both plots the full line shows the average generalization for active learning, and the dashed line for passive learning, as a function of the number of training examples. The dots in the left plot show the results of the individual experiments contributing to the mean for the active learning. The dots in the right plot show the same for passive learning.
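The selection rule of this section (scan a pool of random unlabeled inputs and query the one on which the ensemble disagrees most) can be sketched as follows. This is our illustration of the idea, not code from the paper; the one-dimensional setup and function names are ours:

```python
import random

def ambiguity(x, members, weights):
    """Ensemble ambiguity a_bar(x) of eq. (2) at a single input x."""
    vbar = sum(w * m(x) for w, m in zip(weights, members))
    return sum(w * (m(x) - vbar) ** 2 for w, m in zip(weights, members))

def select_query(members, weights, n_candidates=800, lo=-2.0, hi=2.0, rng=random):
    """Draw random unlabeled candidate inputs and return the one with the
    largest ambiguity; this is the query-by-committee step for regression."""
    candidates = [rng.uniform(lo, hi) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: ambiguity(x, members, weights))
```

With members that fan out away from the origin (say V^1(x) = x, V^2(x) = -x, V^3(x) = 0 and uniform weights), the ambiguity grows like x^2 and the selected query lands near the ends of the interval, where the committee disagrees most.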
These ideas are well illustrated in Figure 1, where the correlation between error and ambiguity is quite strong, although not perfect.

The results of an experiment with the active learning scheme are shown in Figure 3. An ensemble of 5 networks was trained to approximate the square-wave function shown in Figure 1, but in this experiment the function was restricted to the interval from -2 to 2. The curves show the final generalization error of the ensemble in a passive (dashed line) and an active learning test (solid line). For each training set size 2x40 independent tests were made, all starting with the same initial training set of a single example. Examples were generated and added one at a time. In the passive test examples were generated at random, and in the active one each example was selected as the input that gave the largest ambiguity out of 800 random ones. Figure 3 also shows the distribution of the individual results of the active and passive learning tests. Not only do we obtain significantly better generalization by active learning, there is also less scatter in the results. It seems to be easier for the ensemble to learn from the actively generated set.

6 CONCLUSION

The central idea in this paper was to show that there is a lot to be gained from using unlabeled data when training ensembles. Although we dealt with neural networks, all the theory holds for any other type of method used as the individual members of the ensemble.

It was shown that apart from getting the individual members of the ensemble to generalize well, it is important for generalization that the individuals disagree as much as possible, and we discussed one method to make even identical networks disagree. This was done by training the individuals on different training sets by holding out some examples for each individual during training.
This had the added advantage that these examples could be used for testing, and thereby one could obtain good estimates of the generalization error.

It was discussed how to find the optimal weights for the individuals of the ensemble. For our simple test problem the weights found improved the performance of the ensemble significantly.

Finally a method for active learning was described, which was based on the method of query by committee developed for classification problems. The idea is that if the ensemble disagrees strongly on an input, it would be good to find the label for that input and include it in the training set for the ensemble. It was shown how active learning improves the learning curve a lot for a simple test problem.

Acknowledgements

We would like to thank Peter Salamon for numerous discussions and for his implementation of linear programming for optimization of the weights. We also thank Lars Kai Hansen for many discussions and great insights, and David Wolpert for valuable comments.

References

[1] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1001, Oct. 1990.
[2] D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241-259, 1992.
[3] Michael P. Perrone and Leon N. Cooper. When networks disagree: Ensemble method for neural networks. In R. J. Mammone, editor, Neural Networks for Speech and Image Processing. Chapman-Hall, 1993.
[4] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, Jan. 1992.
[5] Ronny Meir. Bias, variance and the combination of estimators; the case of linear least squares. Preprint (in Neuroprose), Technion, Haifa, Israel, 1994.
[6] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Workshop on Computational Learning Theory, pages 287-294, San Mateo, CA, 1992. Morgan Kaufmann.
[7] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems, volume 5, San Mateo, California, 1993. Morgan Kaufmann.
", "award": [], "sourceid": 1001, "authors": [{"given_name": "Anders", "family_name": "Krogh", "institution": null}, {"given_name": "Jesper", "family_name": "Vedelsby", "institution": null}]}