{"title": "Balancing Between Bagging and Bumping", "book": "Advances in Neural Information Processing Systems", "page_first": 466, "page_last": 472, "abstract": "", "full_text": "Balancing between bagging and bumping \n\nTom Heskes \n\nRWCP Novel Functions SNN Laboratory, University of Nijmegen \n\nGeert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands \n\ntom@mbfys.kun.nl \n\nAbstract \n\nWe compare different methods to combine predictions from neural networks trained on different bootstrap samples of a regression problem. One of these methods, introduced in [6] and which we here call balancing, is based on the decomposition of the ensemble generalization error into an ambiguity term and a term incorporating the generalization performances of the individual networks. We show how to estimate these individual errors from the residuals on validation patterns. Weighting factors for the different networks then follow from a quadratic programming problem. On a real-world problem concerning the prediction of sales figures and on the well-known Boston housing data set, balancing clearly outperforms other recently proposed alternatives such as bagging [1] and bumping [8]. \n\n1 EARLY STOPPING AND BOOTSTRAPPING \n\nStopped training is a popular strategy to prevent overfitting in neural networks. The complete data set is split up into a training and a validation set. During learning the weights are adapted in order to minimize the error on the training data. Training is stopped when the error on the validation data starts increasing. The final network depends on the accidental subdivision into training and validation set, and often also on the (usually random) initial weight configuration and the chosen minimization procedure. In other words, early stopped neural networks are highly unstable: small changes in the data or different initial conditions can produce large changes in the estimate. 
As argued in [1, 8], with unstable estimators it is advisable to resample, i.e., to apply the same procedure several times using different subdivisions into training and validation set and perhaps starting from different initial configurations. \n\nRWCP: Real World Computing Partnership; SNN: Foundation for Neural Networks. \n\nIn the neural network literature resampling is often referred to as training ensembles of neural networks [3, 6]. In this paper, we will discuss methods for combining the outputs of networks obtained through such a repetitive procedure. \n\nFirst, however, we have to choose how to generate the subdivisions into training and validation sets. Options are, among others, k-fold cross-validation, subsampling and bootstrapping. In this paper we will consider bootstrapping [2], which is based on the idea that the available data set is nothing but a particular realization of some probability distribution. In principle, one would like to do inference on this \"true\" yet unknown probability distribution. A natural thing to do is then to define an empirical distribution. With so-called naive bootstrapping the empirical distribution is a sum of delta peaks on the available data points, each with probability content 1/P_data, with P_data the number of patterns. A bootstrap sample is a collection of P_data patterns drawn with replacement from this empirical probability distribution. Some of the data points will occur once, some twice and some even more than twice in this bootstrap sample. The bootstrap sample is taken to be the training set; all patterns that do not occur in a particular bootstrap sample constitute the validation set. For large P_data, the probability that a pattern becomes part of the validation set is (1 - 1/P_data)^{P_data} ~ 1/e ~ 0.368. 
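As a hedged illustration (ours, not code from the paper), naive bootstrapping and the resulting out-of-bag validation set can be sketched as follows; the function name bootstrap_split is our own:

```python
import random

def bootstrap_split(p_data, rng):
    """Draw one naive bootstrap sample of the pattern indices 0..p_data-1.

    Returns the training indices (drawn with replacement) and the
    validation indices (the patterns that never made it into the sample)."""
    training = [rng.randrange(p_data) for _ in range(p_data)]
    validation = sorted(set(range(p_data)) - set(training))
    return training, validation

# For large p_data, roughly a fraction (1 - 1/p_data)^p_data ~ 1/e ~ 0.368
# of the patterns ends up in the validation set.
```

Repeating this n_run times yields the n_run subdivisions used below.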
An advantage of bootstrapping over other resampling techniques is that most statistical theory on resampling is nowadays based on the bootstrap. \n\nUsing naive bootstrapping we generate n_run training and validation sets out of our complete data set of P_data input-output combinations {x^mu, t^mu}. In this paper we will restrict ourselves to regression problems with, for notational convenience, just one output variable. We keep track of a matrix with components q_i^mu indicating whether pattern mu is part of the validation set for run i (q_i^mu = 1) or of the training set (q_i^mu = 0). On each subdivision we train and stop a neural network with one layer of n_hidden hidden units. The output o_i^mu of network i with weight vector w(i) on input x^mu reads \n\no_i^mu = sum_{k=1}^{n_hidden} w_k(i) g( sum_j w_{kj}(i) x_j^mu ) + w_0(i) , \n\nwith g(.) the hidden-unit transfer function and where we use the definition x_0^mu == 1. The validation error for run i can be written \n\nE_validation(i) == (1/p_i) sum_{mu=1}^{P_data} q_i^mu r_i^mu , \n\nwith p_i == sum_mu q_i^mu ~ 0.368 P_data the number of validation patterns in run i, and r_i^mu == (o_i^mu - t^mu)^2 / 2 the error of network i on pattern mu. \n\nAfter training we are left with n_run networks with, in practice, quite different performances on the complete data set. How should we combine all these outputs to get the best possible performance on new data? \n\n2 COMBINING ESTIMATORS \n\nSeveral methods have been proposed to combine estimators (see e.g. [5] for a review). In this paper we will only consider estimators with the same architecture but trained and stopped on different subdivisions of the data into training and validation sets. Recently, two such methods have been suggested for bootstrapped estimators: bagging [1], an acronym for bootstrap aggregating, and bumping [8], meaning bootstrap umbrella of model parameters. With bagging, the prediction on a newly arriving input vector is the average over all network predictions. 
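As a small sketch (ours, not from the paper), the two combination rules discussed next differ only in the weights they assign: bagging averages all outputs, while bumping keeps the single network with the lowest error on the available data. With synthetic outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_run, p_data = 5, 100
t = rng.normal(size=p_data)                     # targets t^mu on the complete data set
o = t + 0.5 * rng.normal(size=(n_run, p_data))  # outputs o_i^mu of the n_run networks

bagged = o.mean(axis=0)                         # bagging: average all predictions
kappa = int(np.argmin(np.mean((o - t) ** 2, axis=1)))
bumped = o[kappa]                               # bumping: keep the single best network
```

By convexity of the squared error, the bagged prediction can never be worse, on average, than a randomly picked individual network.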
Bagging completely disregards the performance of the individual networks on the data used for training and stopping. Bumping, on the other hand, throws away all networks except the one with the lowest error on the complete data set(1). In the following we will describe an intermediate form due to [6], which we here call balancing. A theoretical analysis of the implications of this idea can be found in [7]. \n\n(1) The idea behind bumping is more general and involved than discussed here. The interested reader is referred to [8]. In this paper we will only consider its naive version. \n\nSuppose that after training we receive a new set of P_test test patterns for which we do not know the true targets t^nu, but can calculate the network output o_i^nu for each network i. We give each network a weighting factor alpha_i and define the prediction of all networks on pattern nu as the weighted average \n\nm^nu = sum_{i=1}^{n_run} alpha_i o_i^nu . \n\nThe goal is to find the weighting factors alpha_i, subject to the constraints \n\nsum_{i=1}^{n_run} alpha_i = 1 and alpha_i >= 0 for all i , (1) \n\nyielding the smallest possible generalization error \n\nE_test = (1/P_test) sum_{nu=1}^{P_test} (m^nu - t^nu)^2 . \n\nThe problem, of course, is our ignorance about the targets t^nu. Bagging simply takes alpha_i = 1/n_run for all networks, whereas bumping implies alpha_i = delta_{i kappa} with \n\nkappa = argmin_j (1/P_data) sum_{mu=1}^{P_data} (o_j^mu - t^mu)^2 . \n\nAs in [6, 7] we write the generalization error in the form \n\nE_test = (1/P_test) sum_nu sum_{i,j} alpha_i alpha_j (o_i^nu - t^nu)(o_j^nu - t^nu) \n\n= (1/(2 P_test)) sum_nu sum_{i,j} alpha_i alpha_j [ (o_i^nu - t^nu)^2 + (o_j^nu - t^nu)^2 - (o_i^nu - o_j^nu)^2 ] \n\n= (1/2) sum_{i,j} alpha_i alpha_j [ E_test(i) + E_test(j) - (1/P_test) sum_nu (o_i^nu - o_j^nu)^2 ] , (2) \n\nwith E_test(i) == (1/P_test) sum_nu (o_i^nu - t^nu)^2 the generalization error of network i. The last term depends only on the network outputs and can thus be calculated. This \"ambiguity\" term favors networks with conflicting outputs. The first part, 
\n\ncontaining the generalization errors E_test(i) for the individual networks, depends on the targets t^nu and is thus unknown. It favors networks that by themselves already have a low generalization error. In the next section we will find reasonable estimates for these generalization errors based on the network performances on validation data. Once we have obtained these estimates, finding the optimal weighting factors alpha_i under the constraints (1) is a straightforward quadratic programming problem. \n\n3 ESTIMATING THE GENERALIZATION ERROR \n\nAt first sight, a good estimate for the generalization error of network i could be the performance on the validation data not included during training. However, the validation error E_validation(i) strongly depends on the accidental subdivision into training and validation set. For example, if there are a few outliers which, by pure coincidence, are part of the validation set, the validation error will be relatively large and the training error relatively small. To correct for this bias as a result of the random subdivision, we introduce the \"expected\" validation error for run i. First we define n^mu as the number of runs in which pattern mu is part of the validation set and E_validation^mu as the error averaged over these runs: \n\nn^mu == sum_{i=1}^{n_run} q_i^mu and E_validation^mu == (1/n^mu) sum_{i=1}^{n_run} q_i^mu r_i^mu . \n\nThe expected validation error then follows from \n\nEhat_validation(i) == (1/p_i) sum_{mu=1}^{P_data} q_i^mu E_validation^mu . \n\nThe ratio between the observed and the expected validation error indicates whether the validation error for network i is relatively high or low. Our estimate for the generalization error of network i is this ratio multiplied by an overall scaling factor, the estimated average generalization error: \n\nE_test(i) ~ [ E_validation(i) / Ehat_validation(i) ] x (1/P_data) sum_{mu=1}^{P_data} E_validation^mu . \n\nNote that we implicitly make the assumption that the bias introduced by stopping at the minimal error on the validation patterns is negligible, i.e., that the validation patterns used for stopping a network can be considered as new to this network as the completely independent test patterns. \n\n4 SIMULATIONS \n\nWe compare the following methods for combining neural network outputs. \n\nIndividual: the average individual generalization error, i.e., the generalization error we will get on average when we decide to perform only one run. It serves as a reference with which the other methods will be compared. \n\nBumping: the generalization error of the network with the lowest error on the data available for training and stopping. \n\n         | bumping | bagging | ambiguity | balancing | unfair bumping | unfair balancing \nstore 1  |    4%   |    9%   |    10%    |    17%    |      17%       |       24% \nstore 2  |    5%   |   15%   |    22%    |    23%    |      23%       |       34% \nstore 3  |   -7%   |   11%   |    18%    |    25%    |      25%       |       36% \nstore 4  |    6%   |   11%   |    17%    |    26%    |      26%       |       31% \nstore 5  |    6%   |   10%   |    22%    |    19%    |      22%       |       26% \nstore 6  |    1%   |    8%   |    14%    |    19%    |      16%       |       26% \nmean     |    3%   |   11%   |    17%    |    22%    |      22%       |       30% \n\nTable 1: Decrease in generalization error relative to the average individual generalization error as a result of several methods for combining neural networks trained to predict the sales figures for several stores. \n\nBagging: the generalization error when we take the average of all n_run network outputs as our prediction. \n\nAmbiguity: the generalization error when the weighting factors are chosen to maximize the ambiguity, i.e., taking identical estimates for the individual generalization errors of all networks in expression (2). 
\n\nBalancing: the generalization error when the weighting factors are chosen to minimize our estimate of the generalization error. \n\nUnfair bumping: the smallest generalization error of an individual network, i.e., the result of bumping if we had indeed chosen the network with the smallest generalization error. \n\nUnfair balancing: the lowest possible generalization error that we could obtain if we had perfect estimates of the individual generalization errors. \n\nThe last two methods, unfair bumping and unfair balancing, only serve as a kind of reference and can never be used in practice. \n\nWe applied these methods to a real-world problem concerning the prediction of sales figures for several department stores in the Netherlands. For each store, 100 networks with 4 hidden units were trained and stopped on bootstrap samples of about 500 patterns. The test set, on which the performances of the various methods for combination were measured, consists of about 100 patterns. Inputs include weather conditions, day of the week, previous sales figures, and season. The results are summarized in Table 1, where we give the decrease in the generalization error relative to the average individual generalization error. \n\nAs can be seen in Table 1, bumping hardly improves the performance. The reason is that the error on the data used for training and stopping is a lousy predictor of the generalization error, since some amount of overfitting is inevitable. The generalization performance obtained through bagging, i.e., first averaging over all outputs, can be proven to be always better than the average individual generalization error. \n\nFigure 1: Decrease of generalization error relative to the average individual generalization error as a function of the number of bootstrap replicates for different combination methods: bagging (dash-dot, star), ambiguity (dotted, star), bumping (dashed, star), balancing (solid, star), unfair bumping (dashed, circle), unfair balancing (solid, circle). Shown are the mean (left) and the standard deviation (right) of the decrease in percentages. Networks are trained and tested on the Boston housing database. \n\nOn these data bagging is definitely better than bumping, but worse than maximizing the ambiguity. In all cases, except for store 5 where maximization of the ambiguity is slightly better, balancing is a clear winner among the \"fair\" methods. The last column in Table 1 shows how much better we could do if we could find more accurate estimates of the generalization errors of the individual networks. \n\nThe method of balancing discards most of the networks, i.e., the solution to the quadratic programming problem (2) under constraints (1) yields just a few weighting factors different from zero (on average about 8 for this set of simulations). Balancing is thus indeed a compromise between bagging, taking all networks into account, and bumping, keeping just one network. \n\nWe also compared these methods on the well-known Boston housing data set, concerning the median housing price in several tracts based on 13 mainly socio-economic predictor variables (see e.g. [1] for more information). We left out 50 of the 506 available cases for assessment of the generalization performance. All other 456 cases were used for training and stopping neural networks with 4 hidden units. The average individual mean squared error over all 300 bootstrap runs is 16.2, which is comparable to the mean squared error reported in [1]. To study how the performance depends on the number of bootstrap replicates, we randomly drew sets of n = 5, 10, 20, 40 and 80 bootstrap replicates out of our ensemble of 300 replicates and applied the combination methods to these sets. For each n we did this 48 times. Figure 1 shows the mean decrease in the generalization error relative to the average individual generalization error and its standard deviation. \n\nAgain, balancing comes out best, especially for a larger number of bootstrap replicates. It seems that beyond, say, 20 replicates both bumping and bagging are hardly helped by more runs, whereas both maximization of the ambiguity and balancing still improve their performance. Bagging, fully taking into account all network predictions, yields the smallest variation; bumping, keeping just one of them, by far the largest. Balancing and maximization of the ambiguity combine several predictions and thus yield a variation that is somewhere in between. \n\n5 CONCLUSION AND DISCUSSION \n\nBalancing, a compromise between bagging and bumping, is an attempt to arrive at better performances on regression problems. The crux in all this is to obtain reasonable estimates of the quality of the different networks and to incorporate these estimates in the calculation of the proper weighting factors (see [5, 9] for similar ideas and related work in the context of stacked generalization). \n\nObtaining several estimators is computationally expensive. However, the notorious instability of feedforward neural networks hardly leaves us a choice. Furthermore, an ensemble of bootstrapped neural networks can also be used to deduce (approximate) confidence and prediction intervals (see e.g. [4]), to estimate the relevance of input fields, and so on. 
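Putting the pieces together, the balancing computation itself is compact. The following is our own sketch (helper names ours): given network outputs and estimates of the individual generalization errors, it minimizes the quadratic form of expression (2) under constraints (1), using projected gradient descent on the simplex instead of a dedicated quadratic programming solver.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {alpha : sum alpha = 1, alpha >= 0}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def balance(outputs, est_errors, n_steps=5000):
    """Minimize alpha' C alpha with C_ij = (E_i + E_j)/2 - d_ij/2 under (1).

    outputs:    array (n_run, p) of network outputs o_i^nu
    est_errors: estimates of the individual generalization errors E_test(i)
    d_ij is the mean squared difference between the outputs of networks i and j."""
    n_run = outputs.shape[0]
    d = np.mean((outputs[:, None, :] - outputs[None, :, :]) ** 2, axis=2)
    c = 0.5 * (est_errors[:, None] + est_errors[None, :] - d)
    lr = 1.0 / (2.0 * np.linalg.norm(c, 2) + 1e-12)  # step size <= 1/L
    alpha = np.full(n_run, 1.0 / n_run)              # start from the bagging weights
    for _ in range(n_steps):
        alpha = project_to_simplex(alpha - lr * 2.0 * c @ alpha)
    return alpha
```

With exact individual errors the matrix C is positive semi-definite (it is then the Gram matrix of the residuals) and the program is convex; with the estimated errors it is solved in the same way. In line with the observation above, the solution typically sets most weighting factors to zero.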
It has also been argued that combination of several estimators destroys the structure that may be present in a single estimator [8]. Having hardly any interpretable structure, neural networks do not seem to have a lot to lose. It is a challenge to show that an ensemble of neural networks not only gives more accurate predictions, but also reveals more information than a single network. \n\nReferences \n\n[1] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996. \n\n[2] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, London, 1993. \n\n[3] L. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993-1001, 1990. \n\n[4] T. Heskes. Practical confidence and prediction intervals. These proceedings, 1997. \n\n[5] R. Jacobs. Methods for combining experts' probability assessments. Neural Computation, 7:867-888, 1995. \n\n[6] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 231-238, Cambridge, 1995. MIT Press. \n\n[7] P. Sollich and A. Krogh. Learning with ensembles: How over-fitting can be useful. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 190-196, San Mateo, 1996. Morgan Kaufmann. \n\n[8] R. Tibshirani and K. Knight. Model search and inference by bootstrap \"bumping\". Technical report, University of Toronto, 1995. \n\n[9] D. Wolpert and W. Macready. Combining stacking with bagging to improve a learning algorithm. Technical report, Santa Fe Institute, Santa Fe, 1996. \n", "award": [], "sourceid": 1182, "authors": [{"given_name": "Tom", "family_name": "Heskes", "institution": null}]}