{"title": "The Recurrent Cascade-Correlation Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 196, "abstract": null, "full_text": "The Recurrent Cascade-Correlation Architecture \n\nScott E. Fahlman \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nAbstract \n\nRecurrent Cascade-Correlation CRCC) is a recurrent version of the Cascade(cid:173)\nCorrelation learning architecture of Fah I man and Lebiere [Fahlman, 1990]. RCC \ncan learn from examples to map a sequence of inputs into a desired sequence of \noutputs. New hidden units with recurrent connections are added to the network \nas needed during training. In effect, the network builds up a finite-state machine \ntailored specifically for the current problem. RCC retains the advantages of \nCascade-Correlation: fast learning, good generalization, automatic construction \nof a near-minimal multi-layered network, and incremental training. \n\n1 THE ARCHITECTURE \n\nCascade-Correlation [Fahlman, 1990] is a supervised learning architecture that builds a \nnear-minimal multi-layer network topology in the course of training. Initially the network \ncontains only inputs, output units, and the connections between them. This single layer of \nconnections is trained (using the Quickprop algorithm [Fahlman, 1988]) to minimize the \nerror. When no further improvement is seen in the level of error, the network's performance \nis evaluated. If the error is small enough, we stop. Otherwise we add a new hidden unit to \nthe network in an attempt to reduce the residual error. \nTo create a new hidden unit, we begin with a pool of candidate units, each of which receives \nweighted connections from the network's inputs and from any hidden units already present \nin the net. The outputs of these candidate units are not yet connected into the active network. 
\nMultiple passes through the training set are run, and each candidate unit adjusts its incoming weights to maximize the correlation between its output and the residual error in the active net. When the correlation scores stop improving, we choose the best candidate, freeze its incoming weights, and add it to the network. This process is called \"tenure.\" After tenure, a unit becomes a permanent new feature detector in the net. We then re-train all the weights going to the output units, including those from the new hidden unit. This process of adding a new hidden unit and re-training the output layer is repeated until the error is negligible or we give up. Since each new hidden unit receives connections from the old ones, each effectively adds a new layer to the net. \n\nCascade-Correlation eliminates the need for the user to guess in advance the network's size, depth, and topology. A reasonably small (though not minimal) network is built automatically. Because a hidden-unit feature detector, once built, is never altered or cannibalized, the network can be trained incrementally. A large data set can be broken up into smaller \"lessons,\" and feature-building will be cumulative. Cascade-Correlation learns much faster than backprop for several reasons. First, only a single layer of weights is being trained at any given time. There is never any need to propagate error information backwards through the connections, and we avoid the dramatic slowdown that is typical when training backprop nets with many layers. Second, this is a \"greedy\" algorithm: each new unit grabs as much of the remaining error as it can. In a standard backprop net, all the hidden units are changing at once, competing for the various jobs that must be done, a slow and sometimes unreliable process. 
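As a sketch of the candidate training objective, the covariance-style score from [Fahlman, 1990] that each candidate maximizes can be written as follows; the array shapes and the function name are my own assumptions, not from the paper:

```python
import numpy as np

def candidate_score(v, errors):
    """Covariance-style correlation score for one candidate unit.

    v      : (P,) candidate outputs over the P training patterns
    errors : (P, O) residual error at each of the O network outputs
    Returns S = sum_o | sum_p (v_p - v_bar) * (e_po - e_bar_o) |.
    """
    v_centered = v - v.mean()              # center candidate output
    e_centered = errors - errors.mean(axis=0)  # center each error column
    return np.abs(v_centered @ e_centered).sum()
```

A candidate whose output tracks the residual error scores high and is the one chosen at tenure; an uncorrelated candidate scores near zero.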
\n\nCascade-Correlation, like back-propagation and other feed-forward architectures, has no short-term memory in the network. The outputs at any given time are a function only of the current inputs and the network's weights. Of course, many real-world tasks require the recognition of a sequence of inputs and, in some cases, the corresponding production of a sequence of outputs. A number of recurrent architectures have been proposed in response to this need. Perhaps the most widely used, at present, is the Elman model [Elman, 1988], which assumes that the network operates in discrete time-steps. The outputs of the network's hidden units at time t are fed back for use as additional network inputs at time-step t+1. These additional inputs can be thought of as state variables whose contents and interpretation are determined by the evolving weights of the network. In effect, the network is free to choose its own representation of past history in the course of learning. \n\nRecurrent Cascade-Correlation (RCC) is an architecture that adds Elman-style recurrent operation to the Cascade-Correlation architecture. However, some changes were needed in order to make the two models fit together. In the original Elman architecture there is total connectivity between the state variables (previous outputs of hidden units) and the hidden-unit layer. In Cascade-Correlation, new hidden units are added one by one and are frozen once they are added to the network. It would violate this concept to insert the outputs from new hidden units back into existing hidden units as new inputs. On the other hand, the network must be able to form recurrent loops if it is to retain state for an indefinite time. \n\nThe solution we have adopted in RCC is to augment each candidate unit with a single weighted self-recurrent input that feeds back that unit's own output on the previous time-step. 
That self-recurrent link is trained along with the unit's other input weights to maximize the correlation of the candidate with the residual error. If the recurrent link adopts a strongly positive value, the unit will function as a flip-flop, retaining its previous state unless the other inputs force it to change; if the recurrent link adopts a negative value, the unit will tend to oscillate between positive and negative outputs on each time-step unless the other inputs hold it in place; if the recurrent weight is near zero, then the unit will act as a gate of some kind. When a candidate unit is added to the active network as a new hidden unit, the self-recurrent weight is frozen, along with all the other weights. Each new hidden unit is in effect a single state variable in a finite-state machine that is built specifically for the task at hand. In this use of self-recurrent connections only, the RCC model resembles the \"Focused Back-Propagation\" algorithm of Mozer [Mozer, 1988]. \n\nThe output, V(t), of each self-recurrent unit is computed as follows: \n\nV(t) = σ( Σ_i I_i(t) w_i + V(t-1) w_s ) \n\nwhere σ is some non-linear squashing function applied to the weighted sum of inputs I plus the self-weight, w_s, times the previous output. In the studies described here, σ is always the hyperbolic tangent or \"symmetric sigmoid\" function, with a range from -1 to +1. During the candidate training phase, we adjust the weights w_i and w_s for each unit so as to maximize its correlation score. This requires computing the derivative of V(t) with respect to these weights: \n\n∂V(t)/∂w_i = σ′(t) ( I_i(t) + w_s ∂V(t-1)/∂w_i ) \n∂V(t)/∂w_s = σ′(t) ( V(t-1) + w_s ∂V(t-1)/∂w_s ) \n\nThe rightmost term reflects the influence of the weight in question on the unit's previous state. Since we computed ∂V(t-1)/∂w on the previous time-step, we can just save this value and use it in the current step. 
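Assuming tanh units, the forward pass and the two derivative recursions for a single self-recurrent candidate can be sketched in NumPy; the function and variable names here are mine, not the paper's:

```python
import numpy as np

def run_candidate(inputs, w, ws):
    """Forward pass and weight derivatives for one self-recurrent unit.

    inputs : (T, N) input vectors over T time-steps
    w      : (N,) ordinary input weights
    ws     : scalar self-recurrent weight
    Returns V(t) for each t, plus dV/dw and dV/dws for the final step,
    carried forward with one stored value per weight.
    """
    v_prev = 0.0                  # V(t-1); zero at t = 0
    dv_dw = np.zeros_like(w)      # dV(t-1)/dw_i; zero at t = 0
    dv_dws = 0.0                  # dV(t-1)/dw_s; zero at t = 0
    outputs = []
    for x in inputs:
        v = np.tanh(x @ w + v_prev * ws)   # weighted sum plus self-loop
        sp = 1.0 - v * v                   # tanh'(sum)
        # Derivative recursions; old dv values and old v_prev are used
        # before they are overwritten, as in the equations in the text.
        dv_dw = sp * (x + ws * dv_dw)
        dv_dws = sp * (v_prev + ws * dv_dws)
        v_prev = v
        outputs.append(v)
    return np.array(outputs), dv_dw, dv_dws
```

Only one extra number per weight plus the previous output is stored between steps, which is what keeps the recurrent extension cheap.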
So the recurrent version of the learning algorithm requires us to store a single additional number for each candidate weight, plus V(t-1) for each unit. At t = 0 we assume that the unit's previous value and previous derivatives are all zero. \n\nAs an aside, the usual formulation for Elman networks treats the hidden units' previous values as independent inputs, ignoring the dependence of these previous values on the weights being adjusted. In effect, the rightmost terms in the above equations are being dropped, though they are not negligible in general. This rough approximation apparently causes little trouble in practice, but it might explain the instability that some researchers have reported when Elman nets are run with aggressive second-order learning procedures such as Quickprop. The Mozer algorithm does take these extra terms into account. \n\n2 EMPIRICAL RESULTS: FINITE-STATE GRAMMAR \n\nFigure 1a shows the state-transition diagram for a simple finite-state grammar, called the Reber grammar, that has been used by other researchers to investigate learning and generalization in recurrent neural networks. To generate a \"legal\" string of tokens from this grammar, we begin at the left side of the graph and move from state to state, following the directed edges. When an edge is traversed, the associated letter is added to the string. Where two paths leave a single node, we choose one at random with equal probability. The resulting string always begins with a \"B\" and ends with an \"E\". Because there are loops in the graph, there is no bound on the length of the strings; the average length is about eight letters. An example of a legal string would be \"BTSSXXVPSE\". 
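As an illustration, the generation procedure can be sketched in Python. The transition table below is the standard Reber grammar; it is not printed in the text, but it is consistent with the example string given above:

```python
import random

# Transition table for the Reber grammar of Figure 1a (standard state
# numbering; state 5 is the final state before "E").
REBER = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}

def generate_string(rng=random):
    """Walk the graph from the start state, picking one of the two
    outgoing edges with equal probability, until the final state."""
    out, state = ["B"], 0
    while state != 5:
        letter, state = rng.choice(REBER[state])
        out.append(letter)
    out.append("E")
    return "".join(out)
```

Because states 1 and 2 loop on themselves, generated strings have no length bound, matching the description in the text.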
\n\nCleeremans, Servan-Schreiber, and McClelland [Cleeremans, 1989] showed that an Elman network can learn this grammar if it is shown many different strings produced by the grammar. \n\nFigure 1: State transition diagram for the Reber grammar (left) and for the embedded Reber grammar (right). \n\nThe internal state of the network is zeroed at the start of each string. The letters in the string are then presented sequentially at the inputs of the network, with a separate input connection for each of the seven letters. The network is trained to predict the next character in the string by turning on one of the seven outputs. The output is compared to the true successor, and the network attempts to minimize the resulting errors. \n\nWhen there are two legal successors from a given state, the network will never be able to do a perfect job of prediction. During training, the net will see contradictory examples, sometimes with one successor and sometimes the other. In such cases, the net will eventually learn to partially activate both legal outputs. During testing, a prediction is considered correct if the two desired outputs are the two with the largest values. \n\nThis task requires generalization in the presence of considerable noise. The rules defining the grammar are never presented, only examples of the grammar's output. Note that if the network can perform the prediction task perfectly, it can also be used to determine whether a string is a legal output of the grammar. Note also that the successor letter(s) cannot be determined from the current input alone; some memory of the network's state or past inputs is essential. \n\nCleeremans et al. report that a fixed-topology Elman net with three hidden units can learn this task after 60,000 distinct training strings have been presented, each used only once. 
A larger network with 15 hidden units required only 20,000 training strings. These were the best results obtained, not averages over a number of runs. \n\nRCC was given the same problem, but using a fixed set of 128 training strings, presented repeatedly. (Smaller string sets had too many statistical irregularities for reliable training.) Ten trials were run using different training sets. In nine cases, RCC achieved perfect performance after building two hidden units; in the tenth, three hidden units were built. Average training time was 195.5 epochs, or about 25,000 string presentations. (An epoch is defined as a single pass through a fixed training set.) In every case, the trained network achieved a perfect score on a set of 128 new strings not used in training. This study used a pool of 8 candidate units. \n\nCleeremans et al. also explored the \"embedded Reber grammar\" shown in Figure 1b. Each of the boxes in the figure is a transition graph identical to the original Reber grammar. In this much harder task, the network must learn to predict the final T or P correctly. To accomplish this, the network must note the initial T or P and must retain this information while processing an \"embedded clause\" of arbitrary length. It is difficult to discover this rule from example strings, since the embedded clauses may also contain many T's and P's, but only the initial T or P correlates reliably with the final prediction. The \"signal-to-noise ratio\" in this problem is very poor. \n\nThe standard Elman net was unable to learn this task, even with 15 hidden units and 250,000 training strings. However, the task could be learned partially (correct prediction in about 70% of test strings) if the two copies of the embedded grammar were differentiated by giving them slightly different transition probabilities. \n\nRCC was run six times on the more difficult symmetrical form of this problem. 
A candidate pool of 8 units was used. Each trial used a different set of 256 training strings, and the resulting network was tested on a separate set of 256 strings. As shown in the table below, perfect performance was achieved in about half the trial runs, requiring 7-9 hidden units and an average of 713 epochs (182K string presentations). Two of the remaining networks perform at the 99+% level, and one got stuck. (Trial 6 is a successful second run on the same test set used in trial 5.) \n\nTrial  Hidden Units  Epochs Needed  Train Set Errors  Test Set Errors \n  1         9             831              0                 0 \n  2         7             602              0                 0 \n  3        15            1256              0                 2 \n  4        11             910              0                 1 \n  5        13            1063             11                16 \n  6         9             707              0                 0 \n\nSmith and Zipser [Smith, 1989] have studied the same grammar-learning tasks using the time-continuous \"Real-Time Recurrent Learning\" (or \"RTRL\") architecture developed by Williams and Zipser [Williams, 1989]. They report that a network with seven visible (combined input-output) units, two hidden units, and full inter-unit connectivity is able to learn the simple Reber grammar task after presentation of 19,000 to 63,000 distinct training strings. On the more difficult embedded grammar task, Smith and Zipser report that RTRL learned the task perfectly in some (unspecified) fraction of attempts. Successful runs ranged from 3 hidden units (173K distinct training strings) to 12 hidden units (25K strings). RTRL is able to deal with discrete or time-continuous problems, while RCC deals only in discrete events. On the other hand, RTRL requires more computation than RCC in processing each training example, and RTRL scales up poorly as network size increases. \n\n3 EMPIRICAL RESULTS: LEARNING MORSE CODE \n\nAnother series of experiments tested the ability of an RCC network to learn the Morse code patterns for the 26 English letters. 
While this task requires no generalization, it does demonstrate the ability of this architecture to recognize a long, rather complex set of patterns. It also provides an opportunity to demonstrate RCC's ability to learn a new task in small increments. This study assumes that the dots and dashes arrive at precise times; it does not address the problem of variable timing. \n\nThe network has one input and 27 outputs: one for each letter and a \"strobe\" output signalling that a complete letter has been recognized. A dot is represented as a logical one (positive input) followed by a logical zero (negative); a dash is two ones followed by a zero. A second consecutive zero marks the end of the letter. When the second zero is seen, the network must raise the strobe output and one of the other 26; at all other times, the outputs are zero. For example, the \"...-\" pattern for the letter V would be encoded as the input sequence \"1010101100\". The letter patterns vary considerably in length, from 3 to 12 time-steps, with an average of 8. During training, the network's state is zeroed at the start of each new letter; once the network is trained, the strobe output could be used to reset the network. \n\nIn one series of trials, the training set included the codes for all 26 letters at once (226 time-steps in all). In ten trials, the network learned the task perfectly in every case, building an average of 10.5 hidden units and requiring an average of 1321 passes through the entire training set. Note that the system does not require a distinct hidden unit for each letter or for each time-slice in the longest sequence. \n\nIn a second experiment, we divided the training into a series of short \"lessons\" of increasing difficulty. The network was first trained to produce the strobe output and to recognize the two shortest letters, E and T. 
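The dot/dash input encoding described above can be sketched as follows; only a handful of letters are listed, and the dictionary and function names are my own, not the paper's:

```python
# Morse patterns for a few letters (the full task uses all 26).
MORSE = {"E": ".", "T": "-", "A": ".-", "I": "..", "N": "-.", "V": "...-"}

def encode(letter):
    """Encode a letter's Morse pattern as the network's input sequence:
    a dot is 1,0; a dash is 1,1,0; a second trailing zero ends the
    letter, which is when the strobe output must be raised."""
    bits = []
    for symbol in MORSE[letter]:
        bits.extend([1, 0] if symbol == "." else [1, 1, 0])
    bits.append(0)   # second consecutive zero: end-of-letter marker
    return bits
```

With this encoding, E (".") spans 3 time-steps and V ("...-") spans 10, matching the 3-to-12-step range quoted above.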
This task was learned perfectly, usually with the creation of 2 hidden units. We then set aside the \"ET\" set and trained successively on the following sets: \"AIN\", \"DGHKRUW\", \"BFLOV\", and \"CJPQXYZ\". As a rule, each of these lessons adds one or two new hidden units, building upon those already present. Finally we train on all 26 characters at once, which generally adds 2-3 more units to the existing set. \n\nIn ten trials, the incremental version learned the task perfectly every time, requiring an average total of 1427 epochs and 9.6 hidden units, slightly fewer than the number of units added in block training. While the epoch count is slightly greater than in the block-training experiment, most of these epochs are run on very small training sets. The incremental training required only about half as much total runtime as the block training. For learning of even more complex temporal sequences, incremental training of this kind may prove essential. \n\nOur approach to incremental training was inspired to some degree by the work reported in [Waibel, 1989], in which small network modules were trained separately, frozen, and then combined into a composite network with the addition of some \"glue\" units. However, in RCC only the partitioning of the training set is chosen by the user; the network itself builds the appropriate internal structure, and new units are able to build upon hidden units created during some earlier lesson. \n\n4 CONCLUSIONS \n\nRCC adds sequential processing to Cascade-Correlation, while retaining the advantages of the original version: fast learning, good generalization, automatic choice of network topology, ability to create complex high-order feature detectors, and incremental learning. The grammar-learning experiments suggest that RCC is more powerful than standard Elman networks in learning to recognize subtle patterns in sequential data. 
The RTRL scheme of Williams and Zipser may be equally powerful, but RTRL is more complex and does not scale up well when larger networks are needed. \n\nOn the negative side, RCC deals in discrete time-steps and not in continuous time. An interesting direction for future research is to explore the use of an RCC-like structure with units whose memory of past state is time-continuous rather than discrete. \n\nAcknowledgments \n\nI would like to thank Paul Gleichauf, Dave Touretzky, and Alex Waibel for their help and useful suggestions. This research was sponsored in part by the National Science Foundation (Contract EET-87 16324) and the Defense Advanced Research Projects Agency (Contract F33615-90-C-1465). \n\nReferences \n\n[Cleeremans, 1989] Cleeremans, A., D. Servan-Schreiber, and J. L. McClelland (1989) \"Finite-State Automata and Simple Recurrent Networks\" in Neural Computation 1, 372-381. \n[Elman, 1988] Elman, J. L. (1988) \"Finding Structure in Time,\" CRL Tech Report 8801, Univ. of California at San Diego, Center for Research in Language. \n[Fahlman, 1988] Fahlman, S. E. (1988) \"Faster-Learning Variations on Back-Propagation: An Empirical Study\" in Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann. \n[Fahlman, 1990] Fahlman, S. E. and C. Lebiere (1990) \"The Cascade-Correlation Learning Architecture\" in D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann. \n[Mozer, 1988] Mozer, M. C. (1988) \"A Focused Back-Propagation Algorithm for Temporal Pattern Recognition,\" Tech Report CRG-TR-88-3, Univ. of Toronto, Dept. of Psychology and Computer Science. \n[Smith, 1989] Smith, A. W. and D. Zipser (1989) \"Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm\" in International Journal of Neural Systems, Vol. 1, No. 2, 125-131. \n[Waibel, 1989] Waibel, A. (1989) \"Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks\" in D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, Morgan Kaufmann. \n[Williams, 1989] Williams, R. J. and D. Zipser (1989) \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,\" Neural Computation 1, 270-280.", "award": [], "sourceid": 330, "authors": [{"given_name": "Scott", "family_name": "Fahlman", "institution": null}]}