{"title": "Constructive Learning Using Internal Representation Conflicts", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 284, "abstract": null, "full_text": "Constructive Learning Using Internal Representation Conflicts \n\nLaurens R. Leerink and Marwan A. Jabri \n\nSystems Engineering & Design Automation Laboratory \nDepartment of Electrical Engineering \nThe University of Sydney \nSydney, NSW 2006, Australia \n\nAbstract \n\nWe present an algorithm for the training of feedforward and recurrent neural networks. It detects internal representation conflicts and uses these conflicts in a constructive manner to add new neurons to the network. The advantages are twofold: (1) starting from a small network, neurons are allocated only when required; (2) by detecting and resolving internal conflicts at an early stage, learning time is reduced. Empirical results on two real-world problems substantiate the faster learning speed; when applied to the training of a recurrent network on a well-researched sequence recognition task (the Reber grammar), training times are significantly less than previously reported. \n\n1 Introduction \n\nSelecting the optimal network architecture for a specific application is a nontrivial task, and several algorithms have been proposed to automate this process. The first class of network adaptation algorithms starts out with a redundant architecture and proceeds by pruning away seemingly unimportant weights (Sietsma and Dow, 1988; Le Cun et al., 1990). A second class of algorithms starts off with a sparse architecture and grows the network to the complexity required by the problem. Several algorithms have been proposed for growing feedforward networks; the upstart algorithm of Frean (1990) and the cascade-correlation algorithm of Fahlman (1990) are examples of this approach. 
\n\nThe cascade-correlation algorithm has also been extended to recurrent networks (Fahlman, 1991), and has been shown to produce good results. The recurrent cascade-correlation (RCC) algorithm adds a fully connected layer to the network after every step, in the process attempting to correlate the output of the additional layer with the error. In contrast, our proposed algorithm uses the statistical properties of the weight adjustments produced during batch learning to add additional units. \n\nThe RCC algorithm will be used as a baseline against which the performance of our method will be compared. In a recent paper, Chen et al. (1993) presented an algorithm which adds one recurrent neuron with small weights every N epochs. However, no significant improvement in training speed was reported over training the corresponding fixed-size network, and the algorithm will not be analyzed further. To the authors' knowledge, little work besides the two papers mentioned has applied constructive algorithms to recurrent networks. \n\nIn the majority of our empirical studies we have used partially recurrent neural networks, and in this paper we will focus our attention on such networks. The motivation for the development of this algorithm partly stemmed from the long training times experienced with the problems of phoneme and word recognition from continuous speech. However, the algorithm is directly applicable to feedforward networks: the same criteria and method used to add recurrent neurons to a recurrent network can be used for adding neurons to any hidden layer of a feedforward network. \n\n2 Architecture \n\nIn a standard feedforward network, the outputs depend only on the current inputs, the network architecture and the weights in the network. 
However, because of the temporal nature of several applications, in particular speech recognition, it might be necessary for the network to have a short-term memory. \n\nPartially recurrent networks, often referred to as Jordan (1989) or Elman (1990) networks, are well suited to these problems. The architecture examined in this paper is based on the work of Robinson and Fallside (1991), who have applied their recurrent error propagation network to continuous speech recognition. \n\nA common feature of all partially recurrent networks is a special set of neurons called context units which receive feedback signals from a previous time step. Let the values of the context units at time t be represented by C(t). During normal operation the input vector at time t is applied to the input nodes I(t), and during the feedforward calculation values are produced at both the output nodes O(t + 1) and the context units C(t + 1). The values of the context units are then copied back to the input layer for use as input in the following time step. \n\nSeveral training algorithms exist for training partially recurrent neural networks, but for tasks with large training sets back-propagation through time (Werbos, 1990) is often used. This method is computationally efficient and does not use any approximations in following the gradient. For an application where the time information is spread over T input patterns, the algorithm simply duplicates the network T times, which results in a feedforward network that can be trained by a variation of the standard backpropagation algorithm. 
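The context-unit mechanism described in this section can be sketched in a few lines of NumPy. This is a minimal illustration only; the layer sizes, the single shared weight matrix and all names are our own assumptions, not details of Robinson and Fallside's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_ctx, n_out = 4, 3, 2   # illustrative sizes, not from the paper

# One weight matrix maps [input, context] to [output, context]:
# the context units act as extra inputs and extra outputs.
W = rng.normal(scale=0.1, size=(n_out + n_ctx, n_in + n_ctx))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(I_t, C_t):
    """One feedforward pass: produce O(t+1) and the new context C(t+1)."""
    x = np.concatenate([I_t, C_t])       # context fed back as input
    y = sigmoid(W @ x)
    O_next, C_next = y[:n_out], y[n_out:]
    return O_next, C_next                # C_next is copied back next step

# Run a short sequence: the context vector carries state between steps.
C = np.zeros(n_ctx)
for I in rng.normal(size=(5, n_in)):
    O, C = step(I, C)
```

Unrolling this loop over T steps and backpropagating through the copies is the network duplication that back-propagation through time performs.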
\n\n3 The Algorithm \n\nFor partially recurrent networks consisting of input, output and context neurons, the following assertions can be made: \n\n\u2022 The role of the context units in the network is to extract and store all relevant prior information from the sequence pertaining to the classification problem. \n\u2022 For weights entering context units, the weight update values accumulated during batch learning will eventually determine what context information is stored in the unit (the sum of the weight update values is larger than the initial random weights). \n\u2022 We assume that initially the number of context units in the network is insufficient to implement this extraction and storage of information (we start training with a small network). Then, at different moments in time during the recognition of long temporal sequences, a context unit could be required to preserve several different contexts. \n\u2022 These conflicts are manifested as distinct peaks in the distribution of the weight update values during the epoch. \n\nAll but the last of these assertions follow directly from the network architecture and require no further elaboration. The peaks in the distribution of the weight update values are a result of the training algorithm attempting to adjust the value of the context units in order to provide a context value that will resolve short-term memory requirements. \n\nAfter the algorithm had been developed, it was discovered that this aspect of the weight update values had been used in the past by Wynne-Jones (1992) and in the Meiosis Networks of Hanson (1990). The method of Wynne-Jones (1992) in particular is very closely related; in that case principal component analysis of the weight updates and the Hessian matrix is used to detect oscillating nodes in fully trained feedforward networks. 
This aspect of backpropagation training is fully discussed in Wynne-Jones (1992), to which the reader is referred for further details. \n\nThe above assertions lead to the proposed training algorithm, which states that if there are distinct maxima in the distribution of weight update values of the weights entering a context unit, then this is an indication that the batch learning algorithm requires this context unit for the storage of more than one context. \n\nIf this conflict can be resolved, the network can effectively store all the contexts required, leading to a reduction in training time and potentially an increase in performance. \n\nThe training algorithm is given below (the modality of the distribution is defined as the number of distinct maxima): \n\nFor all context units { \n    Set N = modality of the distribution of weight update values; \n    If N > 1 then { \n        Add N-1 new context units to the network which are identical \n        (in terms of weighted inputs) to the current context unit. \n        Adjust each of these N context units (including the original) \n        by the weight update value determined by each maximum \n        (the average value of the mode). \n        Adjust all weights leaving these N context units so that the \n        addition of the new units does not affect any subsequent layers \n        (division by N). This ensures that the network retains all \n        previously acquired knowledge. \n    } \n} \n\nThe main problem in the implementation of the above algorithm is the automatic detection of significant maxima in the distribution of weight updates. A standard statistical approach for determining the modality (the number of maxima) of a distribution of noisy data is to fit a curve of a certain predetermined order to the data. The maxima (and minima) are then found by setting the derivative to zero. 
This method was found to be unsuitable, mainly because after curve fitting it was difficult to determine the significance of the detected peaks. It was decided that only instances of bi-modality and tri-modality were to be identified, each corresponding to the addition of one or two context units. The following heuristic was constructed: \n\n\u2022 Calculate the mean and standard deviation of the weight update values. \n\u2022 Obtain the maximum value in the distribution. \n\u2022 If there are any peaks larger than 60% of the maximum outside one standard deviation of the mean, regard this as significant. \n\nThis heuristic provided adequate identification of the modalities. The distribution was divided into three areas using the mean \u00b1 the standard deviation as boundaries. Depending on the number of maxima detected, the average within each area is used to adjust the weights. \n\n4 Discussion \n\nIt follows from our algorithm that if at least one weight entering a context unit has a multi-modal distribution, then that context unit is duplicated. In the case where multi-modality is detected in more than one weight, context units are added according to the highest modality. \n\nAlthough this algorithm increases the computational load during training, the standard deviation of the weight updates decreases rapidly as the network converges. The narrowing of the distribution makes it more difficult to determine the modality. In practice the algorithm was found useful only during the initial training epochs, typically the first 20. \n\nDuring simulations in which strong multi-modalities were detected in certain nodes, the multi-modalities would frequently persist in the newly created nodes. In this manner a strong bi-modality would cause one node to split into two, the two nodes to grow to four, and so on. 
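The modality heuristic and the unit-duplication step described above might be implemented along the following lines. This is a rough sketch: the histogram-based peak counting, the function names and the bin settings are our own assumptions, not the authors' code.

```python
import numpy as np

def detect_modality(updates, peak_frac=0.6, bins=20):
    """Estimate the modality of a set of weight-update values.

    A peak is a run of histogram bins reaching at least peak_frac
    (60% in the text) of the tallest bin while lying outside one
    standard deviation of the mean; each such peak adds one to the
    modality, capped at tri-modality as in the paper."""
    mean, std = updates.mean(), updates.std()
    hist, edges = np.histogram(updates, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = (hist >= peak_frac * hist.max()) & (np.abs(centers - mean) > std)
    # Count contiguous runs of qualifying bins as single peaks.
    runs = int(mask[0]) + int(np.count_nonzero(mask[1:] & ~mask[:-1]))
    return min(1 + runs, 3)

def split_context_unit(w_in, w_out, mode_updates):
    """Duplicate a conflicted context unit into N = len(mode_updates)
    copies: each copy's incoming weights receive one mode's average
    update, and the outgoing weights are divided by N so that
    subsequent layers are unaffected and previously acquired
    knowledge is retained."""
    n = len(mode_updates)
    new_in = [w_in + du for du in mode_updates]
    new_out = [w_out / n for _ in range(n)]
    return new_in, new_out
```

Counting contiguous above-threshold runs rather than individual bins keeps a single wide mode from being counted more than once.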
This cascading was prevented by disabling the splitting of a node for a variable number of epochs after a multi-modality had been detected; disabling splitting for two epochs provided good results. \n\n5 Simulation Results \n\nThe algorithm was evaluated empirically on two different tasks: \n\n\u2022 Phoneme recognition from continuous multi-speaker speech using the TIMIT (Garofolo, 1988) acoustic-phonetic database. \n\u2022 Sequence recognition: learning a finite-state grammar from examples of valid sequences. \n\nFor the phoneme recognition task the algorithm decreased training times by a factor of 2 to 10, depending on the size of the network and the size of the training set. \n\nThe sequence recognition task has been studied by other researchers in the past, notably Fahlman (1991). Fahlman compared the performance of the recurrent cascade-correlation (RCC) network with previous results of Cleeremans et al. (1989), who used an Elman (1990) network. It was concluded that the RCC algorithm provides the same or better performance than the Elman network with fewer training cycles on a smaller training set. Our simulations have shown that the recurrent error propagation network of Robinson and Fallside (1991), when trained with our constructive algorithm and a learning rate adaptation heuristic, can provide the same performance as the RCC architecture in 40% fewer training epochs using a training set of the same size. The resulting network has the same number of weights as the minimum-size RCC network which correctly solves this problem. \n\nConstructive algorithms are often criticized in terms of efficiency, i.e. \"Is the increase in learning speed due to the algorithm or just the additional degrees of freedom resulting from the added neuron and associated weights?\". 
To address this question, several simulations were conducted on the speech recognition task, comparing the performance and learning time of a network with N fixed context units to that of a network started with a small number of context units and grown to a maximum of N context units. Results indicate that the constructive algorithm consistently trains faster, even though both networks often reach the same final performance. \n\n6 Summary \n\nIn this paper the statistical properties of the weight update values obtained during the training of a simple recurrent network using back-propagation through time have been examined. An algorithm has been presented which uses these properties to detect internal representation conflicts during training and to add recurrent units to the network. Simulation results show that the algorithm decreases training time compared to networks which have a fixed number of context units. The algorithm has not been applied to feedforward networks, but can in principle be added to all training algorithms that operate in batch mode. \n\nReferences \n\nChen, D., Giles, C.L., Sun, G.Z., Chen, H.H., Lee, Y.C., and Goudreau, M.W. (1993). Constructive Learning of Recurrent Neural Networks. In 1993 IEEE International Conference on Neural Networks, III:1196-1201. Piscataway, NJ: IEEE Press. \n\nCleeremans, A., Servan-Schreiber, D., and McClelland, J.L. (1989). Finite State Automata and Simple Recurrent Networks. Neural Computation 1:372-381. \n\nElman, J.L. (1990). Finding Structure in Time. Cognitive Science 14:179-211. \n\nFahlman, S.E. and Lebiere, C. (1990). The Cascade Correlation Learning Architecture. In D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, 524-532. San Mateo, CA: Morgan Kaufmann. \n\nFahlman, S.E. (1991). The Recurrent Cascade Correlation Architecture. Technical Report CMU-CS-91-100. 
School of Computer Science, Carnegie Mellon University. \n\nFrean, M. (1990). The Upstart Algorithm: A Method for Constructing and Training Feedforward Neural Networks. Neural Computation 2:198-209. \n\nGarofolo, J.S. (1988). Getting Started with the DARPA TIMIT CD-ROM: An Acoustic Phonetic Continuous Speech Database. National Institute of Standards and Technology (NIST), Gaithersburg, Maryland. \n\nHanson, S.J. (1990). Meiosis Networks. In D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, 533-541. San Mateo, CA: Morgan Kaufmann. \n\nJordan, M.I. (1989). Serial Order: A Parallel, Distributed Processing Approach. In J.L. Elman and D.E. Rumelhart (eds.), Advances in Connectionist Theory: Speech. Hillsdale, NJ: Erlbaum. \n\nLe Cun, Y., Denker, J.S., and Solla, S.A. (1990). Optimal Brain Damage. In D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, 598-605. San Mateo, CA: Morgan Kaufmann. \n\nReber, A.S. (1967). Implicit Learning of Artificial Grammars. Journal of Verbal Learning and Verbal Behavior 6:855-863. \n\nRobinson, A.J. and Fallside, F. (1991). A Recurrent Error Propagation Network Speech Recognition System. Computer Speech and Language 5:259-274. \n\nSietsma, J. and Dow, R.J.F. (1988). Neural Net Pruning - Why and How. In IEEE International Conference on Neural Networks (San Diego, 1988), 1:325-333. \n\nWerbos, P.J. (1990). Backpropagation Through Time: What It Does and How to Do It. Proceedings of the IEEE 78:1550-1560. \n\nWynne-Jones, M. (1992). Node Splitting: A Constructive Algorithm for Feed-Forward Neural Networks. In D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 4, 1072-1079. San Mateo, CA: Morgan Kaufmann. \n", "award": [], "sourceid": 802, "authors": [{"given_name": "Laurens", "family_name": "Leerink", "institution": null}, {"given_name": "Marwan", "family_name": "Jabri", "institution": null}]}