{"title": "The Cascade-Correlation Learning Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 524, "page_last": 532, "abstract": null, "full_text": "524 \n\nFablman and Lebiere \n\nThe Cascade-Correlation Learning Architecture \n\nScott E. Fahlman and Christian Lebiere \n\nSchool of Computer Science \nCarnegie-Mellon University \n\nPittsburgh, PA 15213 \n\nABSTRACT \n\nCascade-Correlation is a new architecture and supervised learning algo(cid:173)\nrithm for artificial neural networks. Instead of just adjusting the weights \nin a network of fixed topology. Cascade-Correlation begins with a min(cid:173)\nimal network, then automatically trains and adds new hidden units one \nby one, creating a multi-layer structure. Once a new hidden unit has \nbeen added to the network, its input-side weights are frozen. This unit \nthen becomes a permanent feature-detector in the network, available for \nproducing outputs or for creating other, more complex feature detec(cid:173)\ntors. The Cascade-Correlation architecture has several advantages over \nexisting algorithms: it learns very quickly, the network . determines its \nown size and topology, it retains the structures it has built even if the \ntraining set changes, and it requires no back-propagation of error signals \nthrough the connections of the network. \n\n1 DESCRIPTION OF CASCADE\u00b7CORRELATION \nThe most important problem preventing the widespread application of artificial neural \nnetworks to real-world problems is the slowness of existing learning algorithms such as \nback-propagation (or \"backprop\"). One factor contributing to that slowness is what we \ncall the moving target problem: because all of the weights in the network are changing \nat once, each hidden units sees a constantly changing environment. Instead of moving \nquickly to assume useful roles in the overall problem solution, the hidden units engage in \na complex dance with much wasted motion. 
The Cascade-Correlation learning algorithm was developed in an attempt to solve that problem. In the problems we have examined, it learns much faster than back-propagation and solves some other problems as well. \n\nFigure 1: The Cascade architecture, after two hidden units have been added. The vertical lines sum all incoming activation. Boxed connections are frozen, X connections are trained repeatedly. \n\nCascade-Correlation combines two key ideas: the first is the cascade architecture, in which hidden units are added to the network one at a time and do not change after they have been added. The second is the learning algorithm, which creates and installs the new hidden units. For each new hidden unit, we attempt to maximize the magnitude of the correlation between the new unit's output and the residual error signal we are trying to eliminate. \n\nThe cascade architecture is illustrated in Figure 1. It begins with some inputs and one or more output units, but with no hidden units. The number of inputs and outputs is dictated by the problem and by the I/O representation the experimenter has chosen. Every input is connected to every output unit by a connection with an adjustable weight. There is also a bias input, permanently set to +1. \n\nThe output units may just produce a linear sum of their weighted inputs, or they may employ some non-linear activation function. In the experiments we have run so far, we use a symmetric sigmoidal activation function (hyperbolic tangent) whose output range is -1.0 to +1.0. 
For problems in which a precise analog output is desired, instead of a binary classification, linear output units might be the best choice, but we have not yet studied any problems of this kind. \n\nWe add hidden units to the network one by one. Each new hidden unit receives a connection from each of the network's original inputs and also from every pre-existing hidden unit. The hidden unit's input weights are frozen at the time the unit is added to the net; only the output connections are trained repeatedly. Each new unit therefore adds a new one-unit \"layer\" to the network, unless some of its incoming weights happen to be zero. This leads to the creation of very powerful high-order feature detectors; it also may lead to very deep networks and high fan-in to the hidden units. There are a number of possible strategies for minimizing the network depth and fan-in as new units are added, but we have not yet explored these strategies. \n\nThe learning algorithm begins with no hidden units. The direct input-output connections are trained as well as possible over the entire training set. With no need to back-propagate through hidden units, we can use the Widrow-Hoff or \"delta\" rule, the Perceptron learning algorithm, or any of the other well-known learning algorithms for single-layer networks. In our simulations, we use Fahlman's \"quickprop\" algorithm [Fahlman, 1988] to train the output weights. With no hidden units, this acts essentially like the delta rule, except that it converges much faster. \n\nAt some point, this training will approach an asymptote. When no significant error reduction has occurred after a certain number of training cycles (controlled by a \"patience\" parameter set by the operator), we run the network one last time over the entire training set to measure the error. 
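The \"patience\" stagnation test can be sketched as follows. This is a minimal illustration with hypothetical names and thresholds; the paper does not specify the exact criterion: \n\n
```python
def has_stagnated(error_history, patience=8, threshold=0.01):
    """Return True when the most recent error is not at least `threshold`
    (as a fraction) below the error recorded `patience` epochs ago."""
    if len(error_history) <= patience:
        return False          # not enough history to judge stagnation
    past = error_history[-patience - 1]
    recent = error_history[-1]
    return (past - recent) < threshold * past
```
\nIn such a sketch, the outer loop trains the output weights until `has_stagnated` fires, measures the error over the whole training set, and then either stops or adds a hidden unit. \n\n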
If we are satisfied with the network's performance, we stop; if not, we attempt to reduce the residual errors further by adding a new hidden unit to the network. The unit-creation algorithm is described below. The new unit is added to the net, its input weights are frozen, and all the output weights are once again trained using quickprop. This cycle repeats until the error is acceptably small (or until we give up). \n\nTo create a new hidden unit, we begin with a candidate unit that receives trainable input connections from all of the network's external inputs and from all pre-existing hidden units. The output of this candidate unit is not yet connected to the active network. We run a number of passes over the examples of the training set, adjusting the candidate unit's input weights after each pass. The goal of this adjustment is to maximize S, the sum over all output units o of the magnitude of the correlation (or, more precisely, the covariance) between V, the candidate unit's value, and E_o, the residual output error observed at unit o. We define S as \n\nS = \sum_{o} \left| \sum_{p} (V_p - \bar{V}) (E_{p,o} - \bar{E}_o) \right| \n\nwhere o is the network output at which the error is measured and p is the training pattern. The quantities \bar{V} and \bar{E}_o are the values of V and E_o averaged over all patterns. \n\nIn order to maximize S, we must compute \partial S / \partial w_i, the partial derivative of S with respect to each of the candidate unit's incoming weights, w_i. 
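The score S transcribes directly into array code. A sketch (the array names are ours, not the paper's): \n\n
```python
import numpy as np

def correlation_score(V, E):
    """S = sum over outputs o of |sum over patterns p of
    (V_p - Vbar)(E_{p,o} - Ebar_o)|.
    V: shape (P,), the candidate unit's value on each pattern.
    E: shape (P, O), the residual error at each output on each pattern."""
    Vc = V - V.mean()             # V_p - Vbar
    Ec = E - E.mean(axis=0)       # E_{p,o} - Ebar_o
    return np.abs(Vc @ Ec).sum()  # covariance per output, magnitude, sum
```
\n\n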
In a manner very similar to the derivation of the back-propagation rule in [Rumelhart, 1986], we can expand and differentiate the formula for S to get \n\n\partial S / \partial w_i = \sum_{p,o} \sigma_o (E_{p,o} - \bar{E}_o) f'_p I_{i,p} \n\nwhere \sigma_o is the sign of the correlation between the candidate's value and output o, f'_p is the derivative for pattern p of the candidate unit's activation function with respect to the sum of its inputs, and I_{i,p} is the input the candidate unit receives from unit i for pattern p. \n\nAfter computing \partial S / \partial w_i for each incoming connection, we can perform a gradient ascent to maximize S. Once again we are training only a single layer of weights. Once again we use the quickprop update rule for faster convergence. When S stops improving, we install the new candidate as a unit in the active network, freeze its input weights, and continue the cycle as described above. \n\nBecause of the absolute value in the formula for S, a candidate unit cares only about the magnitude of its correlation with the error at a given output, and not about the sign of the correlation. As a rule, if a hidden unit correlates positively with the error at a given unit, it will develop a negative connection weight to that unit, attempting to cancel some of the error; if the correlation is negative, the output weight will be positive. Since a unit's weights to different outputs may be of mixed sign, a unit can sometimes serve two purposes by developing a positive correlation with the error at one output and a negative correlation with the error at another. \n\nInstead of a single candidate unit, it is possible to use a pool of candidate units, each with a different set of random initial weights. All receive the same input signals and see the same residual error for each pattern and each output. 
Because they do not interact with one another or affect the active network during training, all of these candidate units can be trained in parallel; whenever we decide that no further progress is being made, we install the candidate whose correlation score is the best. The use of this pool of candidates is beneficial in two ways: it greatly reduces the chance that a useless unit will be permanently installed because an individual candidate got stuck during training, and (on a parallel machine) it can speed up the training because many parts of weight-space can be explored simultaneously. \n\nThe hidden and candidate units may all be of the same type, for example with a sigmoid activation function. Alternatively, we might create a pool of candidate units with a mixture of nonlinear activation functions (some sigmoid, some Gaussian, some with radial activation functions, and so on) and let them compete to be chosen for addition to the active network. To date, we have explored the all-sigmoid and all-Gaussian cases, but we do not yet have extensive simulation data on networks with mixed unit-types. \n\nOne final note on the implementation of this algorithm: While the weights in the output layer are being trained, the other weights in the active network are frozen. While the candidate weights are being trained, none of the weights in the active network are changed. In a machine with plenty of memory, it is possible to record the unit-values and the output errors for an entire epoch, and then to use these cached values repeatedly during training, rather than recomputing them repeatedly for each training case. This can result in a tremendous speedup as the active network grows large. \n\nFigure 2: Training points for the two-spirals problem, and output pattern for one network trained with Cascade-Correlation. 
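The candidate-training step described above can be sketched as below, using tanh candidates and plain gradient ascent in place of quickprop; all names are illustrative, not from the paper: \n\n
```python
import numpy as np

def candidate_gradient(w, X, E):
    """dS/dw_i = sum_{p,o} sigma_o (E_{p,o} - Ebar_o) f'_p I_{i,p}
    for one tanh candidate unit.
    w: (I,) candidate input weights; X: (P, I) inputs per pattern
    (external inputs plus pre-existing hidden-unit outputs);
    E: (P, O) residual errors per pattern and output."""
    V = np.tanh(X @ w)                 # candidate value per pattern
    fprime = 1.0 - V ** 2              # tanh derivative, f'_p
    Vc = V - V.mean()
    Ec = E - E.mean(axis=0)
    sigma = np.sign(Vc @ Ec)           # sign of the correlation per output
    coeff = (Ec * sigma).sum(axis=1) * fprime   # per-pattern coefficient
    return X.T @ coeff                 # sum over patterns

def best_candidate(X, E, pool=8, steps=200, lr=0.1, seed=0):
    """Train a pool of randomly initialized candidates independently
    (they never interact) and return the weights with the best score S."""
    rng = np.random.default_rng(seed)
    def score(w):
        V = np.tanh(X @ w)
        return np.abs((V - V.mean()) @ (E - E.mean(axis=0))).sum()
    best_w, best_s = None, -1.0
    for _ in range(pool):
        w = rng.normal(scale=0.5, size=X.shape[1])
        for _ in range(steps):
            w += lr * candidate_gradient(w, X, E)   # gradient *ascent* on S
        if score(w) > best_s:
            best_w, best_s = w, score(w)
    return best_w
```
\nNote that, as in the text, the pool members share the same inputs and residual errors and only compete at the end, which is what makes the loop trivially parallelizable. \n\n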
\n\n2 BENCHMARK RESULTS \n\n2.1 THE TWO-SPIRALS PROBLEM \n\nThe \"two-spirals\" benchmark was chosen as the primary benchmark for this study because it is an extremely hard problem for algorithms of the back-propagation family to solve. It was first proposed by Alexis Wieland of MITRE Corp. The net has two continuous-valued inputs and a single output. The training set consists of 194 X-Y values, half of which are to produce a +1 output and half a -1 output. These training points are arranged in two interlocking spirals that go around the origin three times, as shown in Figure 2a. The goal is to develop a feed-forward network with sigmoid units that properly classifies all 194 training cases. Some hidden units are obviously needed, since a single linear separator cannot divide two sets twisted together in this way. \n\nWieland (unpublished) reported that a modified version of backprop in use at MITRE required 150,000 to 200,000 epochs to solve this problem, and that they had never obtained a solution using standard backprop. Lang and Witbrock [Lang, 1988] tried the problem using a 2-5-5-5-1 network (three hidden layers of five units each). Their network was unusual in that it provided \"shortcut\" connections: each unit received incoming connections from every unit in every earlier layer, not just from the immediately preceding layer. With this architecture, standard backprop was able to solve the problem in 20,000 epochs, backprop with a modified error function required 12,000 epochs, and quickprop required 8000. This was the best two-spirals performance reported to date. Lang and Witbrock also report obtaining a solution with a 2-5-5-1 net (only ten hidden units in all), but the solution required 60,000 quickprop epochs. \n\nWe ran the problem 100 times with the Cascade-Correlation algorithm, using a sigmoidal activation function for both the output and hidden units and a pool of 8 candidate units. 
\nAll trials were successful, requiring 1700 epochs on the average. (This number counts both the epochs used to train output weights and the epochs used to train candidate units.) The number of hidden units built into the net varied from 12 to 19, with an average of 15.2 and a median of 15. Here is a histogram of the number of hidden units created: \n\nHidden Units   Trials \n12             4    #### \n13             9    ######### \n14             24   ######################## \n15             19   ################### \n16             24   ######################## \n17             13   ############# \n18             5    ##### \n19             2    ## \n\nIn terms of training epochs, Cascade-Correlation beats quickprop by a factor of 5 and standard backprop by a factor of 10, while building a network of about the same complexity (15 hidden units). In terms of actual computation on a serial machine, however, the speedup is much greater than these numbers suggest. In backprop and quickprop, each training case requires a forward and a backward pass through all the connections in the network; Cascade-Correlation requires only a forward pass. In addition, many of the Cascade-Correlation epochs are run while the network is much smaller than its final size. Finally, the caching strategy described above makes it possible to avoid re-computing the unit values for parts of the network that are not changing. \n\nSuppose that instead of epochs, we measure learning time in connection crossings, defined as the number of multiply-accumulate steps necessary to propagate activation values forward through the network and error values backward. This measure leaves out some computational steps, but it is a more accurate measure of computational complexity than comparing epochs of different sizes or comparing runtimes on different machines. The Lang and Witbrock result of 20,000 backprop epochs requires about 1.1 billion connection crossings. 
Their solution using 8000 quickprop epochs on the same network requires about 438 million crossings. An average Cascade-Correlation run with a pool of 8 candidate units requires about 19 million crossings: a 23-fold speedup over quickprop and a 50-fold speedup over standard backprop. With a smaller pool of candidate units the speedup (on a serial machine) would be even greater, but the resulting networks might be somewhat larger. \n\nFigure 2b shows the output of a 12-hidden-unit network built by Cascade-Correlation as the input is scanned over the X-Y field. This network properly classifies all 194 training points. We can see that it interpolates smoothly for about the first 1.5 turns of the spiral, but becomes a bit lumpy farther out, where the training points are farther apart. This \"receptive field\" diagram is similar to that obtained by Lang and Witbrock using backprop, but is somewhat smoother. \n\n2.2 N-INPUT PARITY \n\nSince parity has been a popular benchmark among other researchers, we ran Cascade-Correlation on N-input parity problems with N ranging from 2 to 8. The best results were obtained with a sigmoid output unit and hidden units whose output is a Gaussian function of the sum of weighted inputs. Based on five trials for each value of N, our results were as follows: \n\nN   Cases   Hidden Units   Average Epochs \n2   4       1              24 \n3   8       1              32 \n4   16      2              66 \n5   32      2-3            142 \n6   64      3              161 \n7   128     4-5            292 \n8   256     4-5            357 \n\nFor a rough comparison, Tesauro and Janssens [Tesauro, 1988] report that standard backprop takes about 2000 epochs for 8-input parity. In their study, they used 2N hidden units. Cascade-Correlation can solve the problem with fewer than N hidden units because it uses short-cut connections. 
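The Gaussian hidden units used for parity compute, as stated above, a Gaussian function of the summed weighted input; the text does not give the width or normalization, so the constants below are our assumption: \n\n
```python
import numpy as np

def gaussian_unit(w, x):
    """Gaussian-of-net-input hidden unit (assumed unit width): responds
    maximally (1.0) when the weighted input sum is zero and falls toward
    0.0 on either side, i.e. it is active in a band around a hyperplane."""
    net = np.dot(w, x)
    return np.exp(-net ** 2)
```
\nA unit that is active in a band, rather than a half-space as a sigmoid is, plausibly suits parity-like problems, where the class alternates repeatedly along one direction of input space. \n\n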
\nAs a test of generalization, we ran a few trials of Cascade-Correlation on the 10-input parity problem, training on either 50% or 25% of the 1024 patterns and testing on the rest. The number of hidden units built varied from 4 to 7, and training time varied from 276 epochs to 551. When trained on half of the patterns, performance on the test set averaged 96% correct; when trained on one quarter of the patterns, test-set performance averaged 90% correct. Note that the nearest-neighbor algorithm would get almost all of the test-set cases wrong. \n\n3 DISCUSSION \n\nWe believe that the Cascade-Correlation algorithm offers the following advantages over network learning algorithms currently in use: \n\n\u2022 There is no need to guess the size, depth, and connectivity pattern of the network in advance. A reasonably small (though not optimal) net is built automatically, perhaps with a mixture of unit-types. \n\n\u2022 Cascade-Correlation learns fast. In backprop, the hidden units engage in a complex dance before they settle into distinct useful roles; in Cascade-Correlation, each unit sees a fixed problem and can move decisively to solve that problem. For the problems we have investigated to date, the learning time in epochs grows roughly as N log N, where N is the number of hidden units ultimately needed to solve the problem. \n\n\u2022 Cascade-Correlation can build deep nets (high-order feature detectors) without the dramatic slowdown we see in deep back-propagation networks. \n\n\u2022 Cascade-Correlation is useful for incremental learning, in which new information is added to an already-trained net. Once built, a feature detector is never cannibalized. It is available from that time on for producing outputs or more complex features. \n\n\u2022 At any given time, we train only one layer of weights in the network. The rest of the network is constant, 
so results can be cached. \n\n\u2022 There is never any need to propagate error signals backwards through network connections. A single residual error signal can be broadcast to all candidates. The weighted connections transmit signals in only one direction, eliminating one difference between these networks and biological synapses. \n\n\u2022 The candidate units do not interact, except to pick a winner. Each candidate sees the same inputs and error signals. This limited communication makes the architecture attractive for parallel implementation. \n\n4 RELATION TO OTHER WORK \n\nThe principal differences between Cascade-Correlation and older learning architectures are the dynamic creation of hidden units, the way we stack the new units in multiple layers (with a fixed output layer), the freezing of units as we add them to the net, and the way we train new units by hill-climbing to maximize the unit's correlation with the residual error. The most interesting discovery is that by training one unit at a time instead of training the whole network at once, we can speed up the learning process considerably, while still creating a reasonably small net that generalizes well. \n\nA number of researchers [Ash, 1989; Moody, 1989] have investigated networks that add new units or receptive fields within a single layer in the course of learning. While single-layer systems are well-suited for some problems, these systems are incapable of creating higher-order feature detectors that combine the outputs of existing units. The idea of building feature detectors and then freezing them was inspired in part by the work of Waibel on modular networks [Waibel, 1989], but in his model the structure of the sub-networks must be fixed before learning begins. \n\nWe know of only a few attempts to build up multi-layer networks as the learning progresses. 
Our decision to look at models in which each unit can see all pre-existing units was inspired to some extent by work on progressively deepening threshold-logic models by Merrick Furst and Jeff Jackson at Carnegie Mellon. (They are not actively pursuing this line at present.) Gallant [Gallant, 1986] briefly mentions a progressively deepening perceptron model (his \"inverted pyramid\" model) in which units are frozen after being installed. However, he has concentrated most of his research effort on models in which new hidden units are generated at random rather than by a deliberate training process. The SONN model of Tenorio and Lee [Tenorio, 1989] builds a multiple-layer topology to suit the problem at hand. Their algorithm places new two-input units at randomly selected locations, using a simulated annealing search to keep only the most useful ones, a very different approach from ours. \n\nAcknowledgments \n\nWe would like to thank Merrick Furst, Paul Gleichauf, and David Touretzky for asking good questions that helped to shape this work. This research was sponsored in part by the National Science Foundation (Contract EET-8716324) and in part by the Defense Advanced Research Projects Agency (Contract F33615-87-C-1499). \n\nReferences \n\n[Ash, 1989] Ash, T. (1989) \"Dynamic Node Creation in Back-Propagation Networks\", Technical Report 8901, Institute for Cognitive Science, University of California, San Diego. \n\n[Fahlman, 1988] Fahlman, S. E. (1988) \"Faster-Learning Variations on Back-Propagation: An Empirical Study\" in Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann. \n\n[Gallant, 1986] Gallant, S. I. (1986) \"Three Constructive Algorithms for Network Learning\" in Proceedings, 8th Annual Conference of the Cognitive Science Society. \n\n[Lang, 1988] Lang, K. J. and Witbrock, M. J. 
(1988) \"Learning to Tell Two Spirals \nApart\" in Proceedings of the 1988 Connectionist Models Summer \nSchool, Morgan Kaufmann. \n\nMoody, J. (1989) \"Fast Learning in Multi-Resolution Hierarchies\" in \nD. S. Touretzky (ed.), Advances in Neural Information Processing \nSystems 1, Morgan Kaufmann. \n\n[Rumelhart, 1986] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986) \"Learning \nInternal Representations by Error Propagation\" in Rumelhart, D. E. \nand McClelland, J. L.,Parallel Distributed Processing: Explorations \nin the Microstructure of Cognition, MIT Press. \n\n[Tenorio, 1989] \n\nTenorio, M. E, and Lee, W. T. (1989) \"Self-Organizing Neural Nets \nfor the Identification Problem\" in D. S. Touretzky (ed.), Advances in \nNeural Information Processing Systems 1, Morgan Kaufmann. \n\n[Tesauro, 1988] \n\nTesauro, G. and Janssens, B. (1988) \"Scaling Relations in Back(cid:173)\nPropagation Learning\" in Complex Systems 2 39-44. \n\n[Waibel, 1989] Waibel, A. (1989) \"Consonant Recognition by Modular Construction \nof Large Phonemic Time-Delay Neural Networks\" in D. S. TouretzlcY \n(ed.), Advances in Neural Information Processing Systt ms 1, Morgan \nKaufmann. \n\n\f", "award": [], "sourceid": 207, "authors": [{"given_name": "Scott", "family_name": "Fahlman", "institution": null}, {"given_name": "Christian", "family_name": "Lebiere", "institution": null}]}