{"title": "Improving Committee Diagnosis with Resampling Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 882, "page_last": 888, "abstract": null, "full_text": "Improving Committee Diagnosis with \n\nResampling Techniques \n\nBambang Parmanto \n\nDepartment of Information Science \n\nUniversity of Pittsburgh \n\nPittsburgh, PA 15260 \nparmanto@li6.pitt. edu \n\nPaul W. Munro \n\nDepartment of Information Science \n\nUniversity of Pittsburgh \n\nPittsburgh, PA 15260 \n\nmunro@li6.pitt. edu \n\nHoward R. Doyle \n\nPittsburgh Transplantation Institute \n3601 Fifth Ave, Pittsburgh, PA 15213 \n\ndoyle@vesaliw. tu. med. pitt. edu \n\nAbstract \n\nCentral to the performance improvement of a committee relative to \nindividual networks is the error correlation between networks in the \ncommittee. We investigated methods of achieving error indepen(cid:173)\ndence between the networks by training the networks with different \nresampling sets from the original training set. The methods were \ntested on the sinwave artificial task and the real-world problems of \nhepatoma (liver cancer) and breast cancer diagnoses. \n\n1 \n\nINTRODUCTION \n\nThe idea of a neural net committee is to combine several neural net predictors \nto perform collective decision making, instead of using a single network (Perrone, \n1993). The potential of a committee in improving classification performance has \nbeen well documented. Central to this improvement is the extent to which the \nerrors tend to coincide. Committee errors occur where the misclassification sets of \nindividual networks overlap. On the one hand, if all errors of committee members \ncoincide, using a committee does not improve performance. On the other hand, if \nerrors do not coincide, performance of the committee dramatically increases and \nasymptotically approaches perfect performance. 
Therefore, it is beneficial to make the errors among the networks in the committee less correlated in order to improve the committee performance. \n\nOne way of making the networks less correlated is to train them with different sets of data. Decreasing the error correlation by training members of the committee on different sets of data is intuitively appealing. Networks trained with different data sets have a higher probability of generalizing differently and tend to make errors in different places in the problem space. \n\nThe idea is to split the data used in training into several sets. The sets are not necessarily mutually exclusive; they may share part of the data (overlap). This idea resembles resampling methods such as cross-validation and the bootstrap, known in statistics for estimating the error of a predictor from limited sets of available data. In the committee framework, these techniques are recast to construct different training sets from the original training set. David Wolpert (1992) has put forward a general framework for training the committee using different partitions of the data, known as stacked generalization. This approach has been adapted to the regression setting and is called stacked regression (Breiman, 1992). Stacked regression uses cross-validation to construct different sets of regression functions. A similar idea of using a bootstrap method to construct different training sets has been proposed by Breiman (1994) for classification and regression tree predictors. \n\n2 THE ALGORITHMS \n\n2.1 BOOTSTRAP COMMITTEE (BOOTC) \n\nConsider a total of N items available for training. The approach is to generate K replicates of the original set, each containing the same number of items as the original set. The replicates are obtained from the original set by drawing at random with replacement. 
See Efron & Tibshirani (1993) for background on bootstrapping. Use each replicate to train one network in the committee. \n\nUsing this bootstrap procedure, each replicate is expected to include roughly 36% duplicates (due to replacement during sampling). Only the distinct fraction is used for training and the leftover fraction for early stopping, if necessary (note the slight difference from standard bootstrapping and from Breiman's bagging). Early stopping usually requires a fraction of the data to be set aside from the original training set, which might degrade the performance of the neural network. The advantage of BOOTC is that the leftover sample is already available. \n\nAlgorithm: \n\n1. Generate bootstrap replicates L1, ..., LK from the original set. \n2. For each bootstrap replicate, collect the unsampled items into leftover sample sets, giving: l*1, ..., l*K. \n3. For each Lk, train a network, using the leftover set l*k as a validation stopping criterion if necessary. This gives K neural net predictors: f(x; Lk). \n4. Build a committee from the bootstrap networks using a simple averaging procedure: f_com(x) = (1/K) sum_{k=1}^{K} f(x; Lk). \n\nThere is no rule as to how many bootstrap replicates should be used to achieve good performance. In error estimation, the number ranges from 20 to 200. It is beneficial to keep the number of replicates, hence the number of networks, small to reduce training time. Unless the networks are trained on a parallel machine, training time increases proportionally to the number of networks in the committee. In this experiment, 20 bootstrap training replicates were constructed for 20 networks in the committee. Twenty replicates were chosen since beyond this number there is no significant improvement in performance. 
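The replicate construction and averaging steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `committee_predict` assumes trained predictors are given as plain callables, and the leftover-set early stopping is only shown as data preparation.

```python
import numpy as np

def bootstrap_replicates(n_items, k, rng):
    """Draw K bootstrap index replicates of size n_items (with replacement),
    plus the leftover (never-sampled) indices available for early stopping."""
    replicates, leftovers = [], []
    for _ in range(k):
        idx = rng.integers(0, n_items, size=n_items)   # sample with replacement
        replicates.append(idx)
        leftovers.append(np.setdiff1d(np.arange(n_items), idx))
    return replicates, leftovers

def committee_predict(predictors, x):
    """Simple averaging: f_com(x) = (1/K) * sum_k f(x; Lk)."""
    return float(np.mean([f(x) for f in predictors]))

rng = np.random.default_rng(0)
reps, lefts = bootstrap_replicates(n_items=1000, k=20, rng=rng)

# On average about 1/e (roughly 37%) of the items land in each leftover set.
mean_leftover_frac = np.mean([len(l) for l in lefts]) / 1000
```

Each `reps[k]` would be used to train network k, with `lefts[k]` as its validation-stopping set, so no extra data needs to be carved out of the original training set.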
\n\n2.2 CROSS-VALIDATION COMMITTEE (CVC) \n\nThe algorithm is quite similar to the procedure used in prediction error estimation. First, generate replicates from the original training set by removing a fraction of the data. Let D denote the original data, and D-v denote the data with subset v removed. The procedure rotates the removed fraction so that each item is removed at least once. Generate replicates D-v1, ..., D-vK and train each network in the committee with one replicate. \n\nAn important issue in the CVC is the degree of data overlap between the replicates. The degree of overlap depends on the number of replicates and the size of the fraction removed from the original sample. For example, if the committee consists of 5 networks and 0.5 of the data are removed for each replicate, the minimum fraction of overlap is 0 (calculation: (v x 2) - 1.0, where v = 0.5 is the retained fraction) and the maximum is 1/2 (calculation: 1.0 - k, where k = 0.5 is the removed fraction). \n\nAlgorithm: \n\n1. Divide the data into v fractions d1, ..., dv. \n2. Leave out one fraction dk and train network fk with the rest of the data (D - dk). \n3. Use dk as a validation stopping criterion, if necessary. \n4. Build a committee from the networks using a simple averaging procedure. \n\nThe fraction of data overlap determines the trade-off between individual network performance and the error correlation between the networks. Lower correlation can be expected if the networks are trained on less overlapping data, which means a larger removed fraction and a smaller fraction for training. The smaller the training set, the lower the individual network performance that can be expected. \n\nWe investigated the effect of data overlap on the error correlations between the networks and on the committee performance. We also studied the effect of training set size on individual performance. The goal was to find an optimal combination of data overlap and individual training size. 
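A minimal sketch of the CVC replicate construction, assuming the leave-one-fold-out variant of the algorithm above (the fold count and data size here are illustrative): each network trains on D minus one fraction, and the resulting overlap between two replicates can be checked directly.

```python
import numpy as np

def cv_replicates(n_items, v, rng):
    """Split the data indices into v fractions d_1..d_v; replicate k is D minus d_k.
    Returns (training indices, held-out fold) for each of the v networks."""
    folds = np.array_split(rng.permutation(n_items), v)
    return [(np.concatenate([folds[j] for j in range(v) if j != k]), folds[k])
            for k in range(v)]

def overlap_fraction(a, b, n_items):
    """Fraction of the original data shared by two training replicates."""
    return len(np.intersect1d(a, b)) / n_items

rng = np.random.default_rng(0)
reps = cv_replicates(n_items=100, v=5, rng=rng)

# With v = 5 equal folds, each replicate holds 80 of 100 items, and two
# replicates share the 60 items missing from neither removed fold.
shared = overlap_fraction(reps[0][0], reps[1][0], 100)
```

Shrinking the folds (smaller removed fraction) raises this overlap toward 1 and, per the trade-off described above, raises the error correlation between the resulting networks.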
\n\n3 THE BASELINE & PERFORMANCE EVALUATION \n\nTo evaluate the improvement of the proposed methods on committee performance, they should be compared with an existing method as the baseline. The common method for constructing a committee is to train an ensemble of networks independently. The networks in the committee are initialized with different sets of weights. This type of committee has been reported as achieving significant improvement over individual network performance in regression (Hashem, 1993) and classification tasks (Perrone, 1993; Parmanto et al., 1994). \n\nThe baseline, BOOTC, and CVC were compared using exactly the same architecture and the same pairs of training-test sets. Performance evaluation was conducted using 4-fold exhaustive cross-validation, where a 0.25 fraction of the original data is used for the test set and the remainder for the training set. The procedure was repeated 4 times so that every item appeared once in a test set. The performance was calculated by averaging the results over the 4 test sets. The simulations were conducted several times using different initial weights to exclude the possibility that the improvement was caused by chance. \n\n4 EXPERIMENTS \n\n4.1 SYNTHETIC DATA: SINWAVE CLASSIFICATION \n\nThe sinwave task is a classification problem with two classes, a negative class represented as 0 and a positive class represented as 1. The data consist of two input variables, x = (x1, x2). The entire space is divided equally into two classes with the separation line determined by the curve x2 = sin(2 pi x1). The upper half of the rectangle is the positive class, while the lower half is the negative one (see Fig. 1). Gaussian noise with variance 0.1 is introduced along the ideal boundary of the clean data, and the result is presented in Fig. 1 (middle). 
Let z be a noise term drawn from a Gaussian distribution with variance 0.1; the classification rule is then given by the equation: \n\ny(x) = 1 if x2 >= sin(2 pi x1) + z, and y(x) = 0 otherwise. \n\n(1) \n\nA similar artificial problem is used to analyze the bias-variance trade-off by Geman et al. (1992). \n\nFigure 1: Complete and clean data without noise (top), complete data with noise (middle), and a small fraction used for training (bottom). \n\nThe population contains 3030 data items, since a grid of 0.1 is used for both x1 and x2. In the real world, we usually have no access to the entire population. To mimic this situation, the training set contained only a small fraction of the population. Fig. 1 (bottom) visualizes a training set that contains 200 items, with 100 items for each class. The training set is constructed by randomly sampling the population. The performance of the predictor is measured with respect to the test set. The population (3030 items) is used as the test set. \n\n4.2 HEPATOMA DETECTION \n\nHepatoma is a very important clinical problem in patients who are being considered for liver transplantation because of its high probability of recurrence. Early hepatoma detection may improve the ultimate outlook of these patients, since special treatment can be carried out. Unfortunately, early detection using non-invasive procedures can be difficult, especially in the presence of cirrhosis. We have been developing neural network classifiers as a detection system requiring minimal imaging or invasive studies (Parmanto et al., 1994). \n\nThe task is to detect the presence or absence (binary output) of a hepatoma given variables taken from an individual patient. Each data item consists of 16 variables, 7 of which are continuous and the rest binary, primarily blood measurements. \n\nFor this experiment, 1172 data items with their associated diagnoses are available. 
\nOut of the 1172 items, 693 are free from missing values, 309 contain missing values only on the categorical variables, and 170 contain missing values on both types of variables. For this experiment, only the fraction without missing values and the fraction with missing values on the categorical variables were used, giving a total of 1002 items. Of these, 874 have negative diagnoses and the remaining 128 have positive diagnoses. \n\n4.3 BREAST CANCER \n\nThe task is to diagnose whether a breast cytology is benign or malignant based on cytological characteristics. Nine input variables have been established to differentiate between benign and malignant samples, including clump thickness, marginal adhesion, the uniformity of cell size and shape, etc. \n\nThe data set was originally obtained from the University of Wisconsin Hospitals and is currently stored at the UCI repository for machine learning (Murphy & Aha, 1994). The current size of the data set is 699 examples. \n\n5 THE RESULTS \n\nFigure 2: Results on the sinwave classification task. Performances of individual nets and the committee (top); error correlation and committee improvement (bottom), both plotted against the number of hidden units. \n\nFigure 2 (top) and Table 1 
show that the performance of the committee is always better than the average performance of the individual networks in all three committees. \n\nTask               Method    Indiv. % error  Corr  Committee % error  Improv. to Indiv.  Improv. to baseline \nSinwave (2 vars)   Baseline  13.31           .87   11.8               11 %               - \n                   BOOTC     12.85           .57   8.36               35 %               29 % \n                   CVC       15.72           .33   9.79               38 %               17 % \nCancer (9 vars)    Baseline  2.7             .96   2.5                5 %                - \n                   BOOTC     3.14            .83   2.0                34 %               20 % \n                   CVC       3.2             .80   1.63               49 %               35 % \nHepatoma (16 vars) Baseline  25.95           .89   23.25              10.5 %             - \n                   BOOTC     26.00           .70   19.72              24 %               15.2 % \n                   CVC       26.90           .55   19.05              29 %               18 % \n\nTable 1: Error rate, correlation, and performance improvement, calculated for the best architecture for each method. The last column gives the reduction of misclassification rate compared to the baseline committee. \n\nFigure 3: Error correlation vs. fraction of overlap in the training data (results from the sinwave classification task). \n\nThe CVC and BOOTC are always better than the baseline, even when the individual network performance is worse. Figure 2 (bottom) and the table show that the improvement of a committee over individual networks is proportional to the error correlation between the networks in the committee. The CVC consistently produces significant improvement over its individual network performance due to the low error correlation, while the baseline committee only produces modest improvement. This result confirms the basic assumption of this research: committee performance can be improved by decorrelating the errors made by the networks. 
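The effect of decorrelation can be illustrated with a toy simulation (the numbers here are illustrative, not the paper's): two 20-network committees with the same individual error rate, one whose errors are independent and one whose errors fully coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_nets, p_err = 10_000, 20, 0.15

# Independent errors: each network errs on its own random 15% of cases.
indep = rng.random((n_nets, n_cases)) < p_err
err_indep = (indep.mean(axis=0) > 0.5).mean()   # committee wrong when majority wrong

# Coincident errors: every network errs on the same 15% of cases.
same = rng.random(n_cases) < p_err
err_corr = (np.tile(same, (n_nets, 1)).mean(axis=0) > 0.5).mean()
```

With coincident errors the committee simply inherits the 15% individual error rate, while with independent errors a majority of 20 networks is almost never wrong on the same case, which is the asymptotic behavior described in the introduction.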
\n\nThe performance of a committee depends on two factors: individual performance of \nthe networks and error correlation between the networks. The gain of using BOOTC \nor CVC depends on how the algorithms can reduce the error correlations while still \nmaintaining the individual performance as good as the individual performance of the \nbaseline. The BOOTC produced impressive improvement (29 %) over the baseline \non the sinwave task due to the lower correlation and good individual performance. \nThe performances of the BOOTC on the other two tasks were not as impressive \ndue to the modest reduction of error correlation and slight decrease in individual \nperformance. The performances were still significantly better than the baseline \ncommittee. The CVC, on the other hand, consistently reduced the correlation and \n\n\f888 \n\nB. PARMANTO, P. W. MUNRO, H. R. DOYLE \n\nimproved the committee performance. The improvement on the sinwave task was \nnot as good as the BOOTC due to the low individual performance. \n\nThe individual performance of the CVC and BOOTC in general are worse than the \nbaseline. The individual performance of CVC is 18 % and 19 % lower than the \nbaseline on the sinwave and cancer tasks respectively, while the BOOTC suffered \nsignificant reduction of individual performance only on the cancer task (16 %). The \ndegradation of individual performance is due to the smaller training set for each \nnetwork on the CVC and the BOOTC. The detrimental effect of a small training \nset, however, is compensated by low correlation between the networks. The effect \nof a smaller training set depends on the size of the original training set. If the data \nsize is large, using a smaller set may not be harmful. On the contrary, if the data set \nis small, using an even smaller data set can significantly degrade the performance. 
\n\nAnother interesting finding of this experiment is the relationship between the error \ncorrelation and the overlap fraction in the training set. Figure 3 shows that small \ndata overlap causes the networks to have low correlation to each other. \n\n6 SUMMARY \n\nTraining committees of networks using different set of data resampled from the \noriginal training set can improve committee performance by reducing the error cor(cid:173)\nrelation among the networks in the committee. Even when the individual network \nperformances of the BOOTC and CVC degrade from the baseline networks, the \ncommittee performance is still better due to the lower correlation. \n\nAcknowledgement \n\nThis study is supported in part by Project Grant DK 29961 from the National \nInstitutes of Health, Bethesda, MD. We would like to thank the Pittsburgh Trans(cid:173)\nplantation Institute for providing the data for this study. \n\nReferences \n\nBreiman, L, (1992) Stacked Regressions, TR 367, Dept. of Statistics., UC. Berkeley. \n\nBreiman, L, (1994) Bagging Predictors, TR 421, Dept. of Statistics, UC. Berkeley. \n\nEfron, B., & Tibshirani, R.J. (1993) An Introd. to the Bootstrap. Chapman & Hall. \n\nHashem, S. (1994). Optimal Linear Combinations of Neural Networks. PhD Thesis, \nPurdue University. \n\nGeman, S., Bienenstock, E., and Doursat, R. (1992) Neural networks and the \nbias/variance dilemma. Neural Computation, 4(1), 1-58. \n\nMurphy, P. M., &. Aha, D. W. (1994). UCI Repository of machine learning databases \n[ftp: ics.uci.edu/pub/machine-Iearning-databases/] \n\nParmanto, B., Munro, P.W., Doyle, H.R., Doria, C., Aldrighetti, 1., Marino, I.R., \nMitchel, S., and Fung, J.J. (1994) Neural network classifier for hepatoma detectipn. \nProceedings of the World Congress of Neural Networks 1994 San Diego, June 4-9. \n\nPerrone, M.P. (1993) Improving Regression Estimation: Averaging Methods for \nVariance Reduction with Eztension to General Convez Measure Optimization. 
PhD Thesis, Department of Physics, Brown University. \n\nWolpert, D. (1992) Stacked generalization. Neural Networks, 5, 241-259. \n", "award": [], "sourceid": 1075, "authors": [{"given_name": "Bambang", "family_name": "Parmanto", "institution": null}, {"given_name": "Paul", "family_name": "Munro", "institution": null}, {"given_name": "Howard", "family_name": "Doyle", "institution": null}]}