{"title": "Generating Accurate and Diverse Members of a Neural-Network Ensemble", "book": "Advances in Neural Information Processing Systems", "page_first": 535, "page_last": 541, "abstract": null, "full_text": "Generating Accurate and Diverse \n\nMembers of a Neural-Network Ensemble \n\nDavid w. Opitz \n\nComputer Science Department \n\nUniversity of Minnesota \n\nDuluth, MN 55812 \nopitz@d.umn.edu \n\nJude W. Shavlik \n\nComputer Sciences Department \n\nUniversity of Wisconsin \n\nMadison, WI 53706 \nshavlik@cs.wisc.edu \n\nAbstract \n\nNeural-network ensembles have been shown to be very accurate \nclassification techniques. Previous work has shown that an effec(cid:173)\ntive ensemble should consist of networks that are not only highly \ncorrect, but ones that make their errors on different parts of the \ninput space as well. Most existing techniques, however, only in(cid:173)\ndirectly address the problem of creating such a set of networks. \nIn this paper we present a technique called ADDEMUP that uses \ngenetic algorithms to directly search for an accurate and diverse \nset of trained networks. ADDEMUP works by first creating an ini(cid:173)\ntial population, then uses genetic operators to continually create \nnew networks, keeping the set of networks that are as accurate as \npossible while disagreeing with each other as much as possible. Ex(cid:173)\nperiments on three DNA problems show that ADDEMUP is able to \ngenerate a set of trained networks that is more accurate than sev(cid:173)\neral existing approaches. Experiments also show that ADDEMUP \nis able to effectively incorporate prior knowledge, if available, to \nimprove the quality of its ensemble. \n\n1 \n\nIntroduction \n\nMany researchers have shown that simply combining the output of many classifiers \ncan generate more accurate predictions than that of any of the individual classi(cid:173)\nfiers (Clemen, 1989; Wolpert, 1992). 
In particular, combining separately trained neural networks (commonly referred to as a neural-network ensemble) has been demonstrated to be particularly successful (Alpaydin, 1993; Drucker et al., 1994; Hansen and Salamon, 1990; Hashem et al., 1994; Krogh and Vedelsby, 1995; Maclin and Shavlik, 1995; Perrone, 1992). Both theoretical (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995) and empirical (Hashem et al., 1994; Maclin and Shavlik, 1995) work has shown that a good ensemble is one whose individual networks are both accurate and make their errors on different parts of the input space; however, most previous work has either focused on combining the output of multiple trained networks or only indirectly addressed how we should generate a good set of networks. We present an algorithm, ADDEMUP (Accurate anD Diverse Ensemble-Maker giving United Predictions), that uses genetic algorithms to generate a population of neural networks that are highly accurate, while at the same time having minimal overlap on where they make their errors. \n\nTraditional ensemble techniques generate their networks by randomly trying different topologies, initial weight settings, or parameter settings, or by using only part of the training set, in the hope of producing networks that disagree on where they make their errors (we henceforth refer to diversity as the measure of this disagreement). We propose instead to actively search for a good set of networks. The key idea behind our approach is to consider many networks and keep the subset of networks that best optimizes an objective function consisting of both an accuracy term and a diversity term. In many domains we care more about generalization performance than we do about generating a solution quickly. 
This, coupled with the fact that computing power is rapidly growing, motivates us to effectively utilize available CPU cycles by continually considering networks to possibly place in our ensemble. \n\nADDEMUP proceeds by first creating an initial set of networks, then continually produces new individuals by using the genetic operators of crossover and mutation. It defines the overall fitness of an individual to be a combination of accuracy and diversity. Thus ADDEMUP keeps as its population a set of highly fit individuals that are highly accurate while making their mistakes in different parts of the input space. Also, it actively tries to generate good candidates by emphasizing the current population's erroneous examples during backpropagation training. Experiments reported herein demonstrate that ADDEMUP is able to generate an effective set of networks for an ensemble. \n\n2 The Importance of an Accurate and Diverse Ensemble \n\nFigure 1 illustrates the basic framework of a neural-network ensemble. Each network in the ensemble (network 1 through network N in this case) is first trained using the training instances. Then, for each example, the predicted output of each of these networks (o_i in Figure 1) is combined to produce the output of the ensemble (ō in Figure 1). Many researchers (Alpaydin, 1993; Hashem et al., 1994; Krogh and Vedelsby, 1995; Mani, 1991) have demonstrated the effectiveness of combining schemes that are simply the weighted average of the networks (i.e., ō = Σ_{i∈N} w_i·o_i with Σ_{i∈N} w_i = 1), and this is the type of ensemble we focus on in this paper. 
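To make the weighted-average combination concrete, here is a minimal Python sketch; the member outputs, weights, and targets are invented for illustration. The final assertion previews the Krogh and Vedelsby (1995) decomposition discussed next: for squared error, the ensemble's error equals the weighted average of the members' errors minus their weighted disagreement.

```python
# Weighted-average ensemble sketch (all numbers are hypothetical).
# outputs[i][x] is network i's output on example x; the weights sum to 1.

def ensemble_output(outputs, weights):
    """Weighted average of the member outputs, one value per example."""
    n_examples = len(outputs[0])
    return [sum(w * o[x] for w, o in zip(weights, outputs))
            for x in range(n_examples)]

def sq_error(preds, targets):
    """Summed squared error over the examples."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))

outputs = [[0.9, 0.2, 0.7, 0.1],   # network 1
           [0.6, 0.4, 0.8, 0.3],   # network 2
           [0.8, 0.1, 0.5, 0.2]]   # network 3
weights = [0.5, 0.3, 0.2]
targets = [1.0, 0.0, 1.0, 0.0]

o_bar = ensemble_output(outputs, weights)

# Krogh-Vedelsby decomposition, E = E_bar - D_bar, holds exactly here:
E = sq_error(o_bar, targets)                       # ensemble error
E_bar = sum(w * sq_error(o, targets)               # weighted member error
            for w, o in zip(weights, outputs))
D_bar = sum(w * sq_error(o, o_bar)                 # weighted disagreement
            for w, o in zip(weights, outputs))
assert abs(E - (E_bar - D_bar)) < 1e-12
```

Because the disagreement term is subtracted, diversity that does not raise the members' individual errors directly lowers the ensemble error, which motivates searching for diverse members.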
\n\nHansen and Salamon (1990) proved that for a neural-network ensemble, if the average error rate for a pattern is less than 50% and the networks in the ensemble are independent in the production of their errors, the expected error for that pattern can be reduced to zero as the number of networks combined goes to infinity; however, such assumptions rarely hold in practice. Krogh and Vedelsby (1995) later proved that if the diversity(1) D_i of network i is measured by: \n\nD_i = Σ_x [o_i(x) − ō(x)]²,    (1) \n\nthen the ensemble generalization error (E) consists of two distinct portions: \n\nE = Ē − D̄,    (2) \n\nwhere D̄ = Σ_i w_i·D_i and Ē = Σ_i w_i·E_i (E_i is the error rate of network i and the w_i's sum to 1). What the equation shows, then, is that we want our ensemble to consist of highly correct networks that disagree as much as possible. Creating such a set of networks is the focus of this paper. \n\n(1) Krogh and Vedelsby referred to this term as ambiguity. \n\n[Figure 1: A neural-network ensemble. Networks 1 through N each receive the input; their outputs o_1, ..., o_N are combined into the ensemble output ō.] \n\n3 The ADDEMUP Algorithm \n\nTable 1 summarizes our new algorithm, ADDEMUP, which uses genetic algorithms to generate a set of neural networks that are accurate and diverse in their classifications. (Although ADDEMUP currently uses neural networks, it could easily be extended to incorporate other types of learning algorithms as well.) ADDEMUP starts by creating and training its initial population of networks. It then creates new networks by using standard genetic operators, such as crossover and mutation. ADDEMUP trains these new individuals, emphasizing examples that are misclassified by the current population, as explained below. 
ADDEMUP adds these new networks to the population, then scores each population member with the fitness function: \n\nFitness_i = Accuracy_i + λ·Diversity_i = (1 − E_i) + λ·D_i,    (3) \n\nwhere λ defines the tradeoff between accuracy and diversity. Finally, ADDEMUP prunes the population to the N most-fit members, which it defines to be its current ensemble, then repeats this process. \n\nWe define our accuracy term, 1 − E_i, to be network i's validation-set accuracy (or training-set accuracy if a validation set is not used), and we use Equation 1 over this validation set to calculate our diversity term D_i. We then separately normalize each term so that the values range from 0 to 1. Normalizing both terms allows λ to have the same meaning across domains. Since it is not always clear at what value one should set λ, we have developed some rules for automatically setting it. First, we never change λ if the ensemble error E is decreasing while we consider new networks; otherwise we change λ if one of the following two things happens: (1) the population error Ē is not increasing and the population diversity D̄ is decreasing, in which case diversity seems to be under-emphasized and we increase λ; or (2) Ē is increasing and D̄ is not decreasing, in which case diversity seems to be over-emphasized and we decrease λ. (We started λ at 0.1 for the results in this paper.) \n\nTable 1: The ADDEMUP algorithm. \n\nGOAL: Genetically create an accurate and diverse ensemble of networks. \n1. Create and train the initial population of networks. \n2. Until a stopping criterion is reached: \n   (a) Use genetic operators to create new networks. \n   (b) Train the new networks using Equation 4 and add them to the population. \n   (c) Measure the diversity of each network with respect to the current population (see Equation 1). \n   (d) Normalize the accuracy scores and the diversity scores of the individual networks. \n   (e) Calculate the fitness of each population member (see Equation 3). \n   (f) Prune the population to the N fittest networks. \n   (g) Adjust λ (see the text for an explanation). \n   (h) Report the current population of networks as the ensemble. Combine the output of the networks according to Equation 5. \n\nA useful network to add to an ensemble is one that correctly classifies as many examples as possible while making its mistakes primarily on examples that most of the current population members correctly classify. We address this during backpropagation training by multiplying the usual cost function by a term that measures the combined population error on that example: \n\nCost = Σ_{k∈T} ( |t(k) − ō(k)| / Ē )^(λ/(λ+1)) · [t(k) − a(k)]²,    (4) \n\nwhere t(k) is the target and a(k) is the network activation for example k in the training set T. Notice that since our network is not yet a member of the ensemble, ō(k) and Ē are not dependent on our network; our new term is thus a constant when calculating the derivatives during backpropagation. We normalize |t(k) − ō(k)| by the ensemble error Ē so that the average value of our new term is around 1 regardless of the correctness of the ensemble. This is especially important with highly accurate populations, since t(k) − ō(k) will be close to 0 for most examples, and the network would otherwise only get trained on a few examples. The exponent λ/(λ+1) represents the ratio of importance of the diversity term in the fitness function. For instance, if λ is close to 0, diversity is not considered important and the network is trained with the usual cost function; however, if λ is large, diversity is considered important and our new term in the cost function takes on more importance. 
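The effect of Equation 4 can be sketched as a per-example weight applied to the usual squared-error cost; since the multiplier does not depend on the new network, ordinary backpropagation applies unchanged. This is an illustrative sketch, not the authors' implementation, and all numbers below are invented.

```python
# Per-example multiplier from Equation 4 (hypothetical values):
# weight(k) = (|t(k) - o_bar(k)| / E_bar) ** (lam / (lam + 1)).

def example_weights(targets, ensemble_outputs, ensemble_error, lam):
    """Emphasis factors for examples the current ensemble gets wrong."""
    exponent = lam / (lam + 1.0)
    return [(abs(t - o) / ensemble_error) ** exponent
            for t, o in zip(targets, ensemble_outputs)]

targets          = [1.0, 0.0, 1.0, 0.0]   # t(k)
ensemble_outputs = [0.8, 0.4, 0.3, 0.1]   # o_bar(k) from the current population
ensemble_error   = 0.35                   # E_bar, the normalizing constant
lam = 0.1                                 # the paper's initial lambda

weights = example_weights(targets, ensemble_outputs, ensemble_error, lam)

# Weighted cost for a candidate network with activations a(k):
activations = [0.7, 0.2, 0.6, 0.2]
cost = sum(w * (t - a) ** 2
           for w, t, a in zip(weights, targets, activations))
```

With lam = 0 every weight is exactly 1 and the usual cost function is recovered; as lam grows, examples the ensemble misclassifies (here the third example) receive proportionally more emphasis.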
\n\nWe combine the predictions of the networks by taking a weighted sum of the output of each network, where each weight is based on the validation-set accuracy of the network. Thus we define our weights for combining the networks as follows: \n\nw_i = (1 − E_i) / Σ_j (1 − E_j).    (5) \n\nWhile simply averaging the outputs generates a good composite model (Clemen, 1989), we include the predicted accuracy in our weights since one should believe accurate models more than inaccurate ones. \n\n4 Experimental Study \n\nThe genetic algorithm we use for generating new network topologies is the REGENT algorithm (Opitz and Shavlik, 1994). REGENT uses genetic algorithms to search through the space of knowledge-based neural network (KNN) topologies. KNNs are networks whose topologies are determined as a result of the direct mapping of a set of background rules that represent what we currently know about our task. KBANN (Towell and Shavlik, 1994), for instance, translates a set of propositional rules into a neural network, then refines the resulting network's weights using backpropagation. Trained KNNs, such as KBANN's networks, have been shown to frequently generalize better than many other inductive-learning techniques such as standard neural networks (Opitz, 1995; Towell and Shavlik, 1994). Using KNNs allows us to have highly correct networks in our ensemble; however, since each network in our ensemble is initialized with the same set of domain-specific rules, we do not expect there to be much disagreement among the networks. An alternative we consider in our experiments is to randomly generate our initial population of network topologies, since domain-specific rules are sometimes not available. 
\n\nWe ran ADDEMUP on NYNEX's MAX problem set and on three problems from the Human Genome Project that aid in locating genes in DNA sequences (recognizing promoters, splice-junctions, and ribosome-binding sites - RBS). Each of these domains is accompanied by a set of approximately correct rules describing what is currently known about the task (see Opitz, 1995 or Opitz and Shavlik, 1994 for more details). Our experiments measure the test-set error of ADDEMUP on these tasks. Each ensemble consists of 20 networks, and the REGENT and ADDEMUP algorithms considered 250 networks during their genetic search. \n\nTable 2a presents the results from the case where the learners randomly create the topology of their networks (i.e., they do not use the domain-specific knowledge). Table 2a's first row, best-network, results from a single-hidden-layer neural network where, for each fold, we trained 20 networks containing between 0 and 100 (uniformly chosen) hidden nodes and used a validation set to choose the best network. The next row, bagging, contains the results of running Breiman's (1994) bagging algorithm on standard, single-hidden-layer networks, where the number of hidden nodes is randomly set between 0 and 100 for each network.(2) Bagging is a \"bootstrap\" ensemble method that trains each network in the ensemble with a different partition of the training set. It generates each partition by randomly drawing, with replacement, N examples from the training set, where N is the size of the training set. Breiman (1994) showed that bagging is effective on \"unstable\" learning algorithms, such as neural networks, where small changes in the training set result in large changes in predictions. The bottom row of Table 2a, ADDEMUP, contains the results of a run of ADDEMUP where its initial population (of size 20) is randomly generated. 
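The bootstrap resampling that bagging uses can be sketched in a few lines; this is a generic illustration, not Breiman's code, and the 100-example training set is a stand-in for real (input, label) pairs.

```python
import random

def bootstrap_sample(training_set, rng):
    """Draw N examples with replacement from a training set of size N."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)                # fixed seed for reproducibility
training_set = list(range(100))       # stand-in for (input, label) pairs

# One resampled partition per ensemble member, as in the 20-network ensembles.
partitions = [bootstrap_sample(training_set, rng) for _ in range(20)]

# Each partition has N examples but only about 63% distinct ones (1 - 1/e),
# so the member networks see genuinely different training sets.
distinct_counts = [len(set(p)) for p in partitions]
```

The overlap between partitions is what makes every member competent on most of the data, while the ~37% of examples each member never sees is one source of the disagreement an ensemble needs.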
\n\nThe results show that on these domains combining the output of multiple trained networks generalizes better than trying to pick the single-best network. \n\nTable 2: Test-set error from a ten-fold cross validation. Table (a) shows the results from running three learners without the domain-specific knowledge; Table (b) shows the results of running three learners with this knowledge. Pairwise, one-tailed t-tests indicate that ADDEMUP in Table (b) differs from the other algorithms in both tables at the 95% confidence level, except with REGENT in the splice-junction domain. \n\n(a) Standard neural networks (no domain-specific knowledge used) \n\n                  Promoters   Splice Junction   RBS     MAX \nbest-network      10.7%       6.6%              7.8%    37.0% \nbagging            9.5%       4.6%              4.5%    35.7% \nADDEMUP            9.0%       4.6%              4.9%    34.9% \n\n(b) Knowledge-based neural networks (domain-specific knowledge used) \n\n                  Promoters   Splice Junction   RBS     MAX \nKBANN              9.4%       5.3%              6.2%    35.8% \nKBANN-bagging      8.5%       4.5%              4.2%    35.6% \nREGENT-Combined    8.2%       3.9%              3.9%    35.6% \nADDEMUP            7.5%       3.6%              2.9%    34.7% \n\nWhile Table 2a shows the power of neural-network ensembles, Table 2b demonstrates ADDEMUP's ability to utilize prior knowledge. The first row of Table 2b contains the generalization results of the KBANN algorithm, while the next row, KBANN-bagging, contains the results of the ensemble where each individual network in the ensemble is the KBANN network trained on a different partition of the training set. Even though each of these networks starts with the same topology and \"large\" initial weight settings (i.e., the weights resulting from the domain-specific knowledge), small changes in the training set still produce significant changes in predictions. Also notice that on all datasets, KBANN-bagging is as good as or better than running bagging on randomly generated networks (i.e., bagging in Table 2a). The next row, REGENT-Combined, contains the results of simply combining, using Equation 5, the networks in REGENT's final population. ADDEMUP, the final row of Table 2b, mainly differs from REGENT-Combined in two ways: (a) its fitness function (i.e., Equation 3) takes into account diversity rather than just network accuracy, and (b) it trains new networks by emphasizing the erroneous examples of the current ensemble. Therefore, comparing ADDEMUP with REGENT-Combined helps directly test ADDEMUP's diversity-achieving heuristics, though additional results reported in Opitz (1995) show ADDEMUP gets most of its improvement from its fitness function. \n\n(2) We also tried other ensemble approaches, such as randomly creating varying multi-layer network topologies and initial weight settings, but bagging did significantly better on all datasets (by 15-25% on all three DNA domains). 
\n\nThere are two main reasons why we think the results of ADDEMUP in Table 2b are especially encouraging: (a) by comparing ADDEMUP with REGENT-Combined, we explicitly test the quality of our heuristics and demonstrate their effectiveness, and (b) ADDEMUP is able to effectively utilize background knowledge to decrease the error of the individual networks in its ensemble, while still being able to create enough diversity among them so as to improve the overall quality of the ensemble. \n\n5 Conclusions \n\nPrevious work with neural-network ensembles has shown them to be an effective technique if the classifiers in the ensemble are both highly correct and disagree with each other as much as possible. Our new algorithm, ADDEMUP, uses genetic algorithms to search for a correct and diverse population of neural networks to be used in the ensemble. It does this by collecting the set of networks that best fits an objective function that measures both the accuracy of the network and the disagreement of that network with respect to the other members of the set. ADDEMUP tries to actively generate quality networks during its search by emphasizing the current ensemble's erroneous examples during backpropagation training. \n\nExperiments demonstrate that our method is able to find an effective set of networks for our ensemble. Experiments also show that ADDEMUP is able to effectively incorporate prior knowledge, if available, to improve the quality of this ensemble. In fact, when using domain-specific rules, our algorithm showed statistically significant improvements over (a) the single best network seen during the search, (b) a previously proposed ensemble method called bagging (Breiman, 1994), and (c) a similar algorithm whose objective function is simply the validation-set correctness of the network. 
In summary, ADDEMUP is successful in generating a set of neural networks that work well together in producing an accurate prediction. \n\nAcknowledgements \n\nThis work was supported by Office of Naval Research grant N00014-93-1-0998. \n\nReferences \n\nAlpaydin, E. (1993). Multiple networks for function learning. In Proceedings of the 1993 IEEE International Conference on Neural Networks, vol I, pages 27-32, San Francisco, CA. \nBreiman, L. (1994). Bagging predictors. Technical Report 421, Department of Statistics, University of California, Berkeley. \nClemen, R. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5:559-583. \nDrucker, H., Cortes, C., Jackel, L., LeCun, Y., and Vapnik, V. (1994). Boosting and other machine learning algorithms. In Proceedings of the Eleventh International Conference on Machine Learning, pages 53-61, New Brunswick, NJ. Morgan Kaufmann. \nHansen, L. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993-1001. \nHashem, S., Schmeiser, B., and Yih, Y. (1994). Optimal linear combinations of neural networks: An overview. In Proceedings of the 1994 IEEE International Conference on Neural Networks, Orlando, FL. \nKrogh, A. and Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems, vol 7, Cambridge, MA. MIT Press. \nMaclin, R. and Shavlik, J. (1995). Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montreal, Canada. \nMani, G. (1991). Lowering variance of decisions by using artificial neural network portfolios. Neural Computation, 3:484-486. \nOpitz, D. (1995). 
An Anytime Approach to Connectionist Theory Refinement: Refining the Topologies of Knowledge-Based Neural Networks. PhD thesis, Computer Sciences Department, University of Wisconsin, Madison, WI. \nOpitz, D. and Shavlik, J. (1994). Using genetic search to refine knowledge-based neural networks. In Proceedings of the Eleventh International Conference on Machine Learning, pages 208-216, New Brunswick, NJ. Morgan Kaufmann. \nPerrone, M. (1992). A soft-competitive splitting rule for adaptive tree-structured neural networks. In Proceedings of the International Joint Conference on Neural Networks, pages 689-693, Baltimore, MD. \nTowell, G. and Shavlik, J. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70(1-2):119-165. \nWolpert, D. (1992). Stacked generalization. Neural Networks, 5:241-259. \n", "award": [], "sourceid": 1175, "authors": [{"given_name": "David", "family_name": "Opitz", "institution": null}, {"given_name": "Jude", "family_name": "Shavlik", "institution": null}]}