{"title": "Learning Sequential Tasks by Incrementally Adding Higher Orders", "book": "Advances in Neural Information Processing Systems", "page_first": 115, "page_last": 122, "abstract": null, "full_text": "Learning Sequential Tasks by \n\nIncrementally Adding Higher Orders \n\nMark Ring \n\nDepartment of Computer Sciences, Taylor 2.124 \n\nUniversity of Texas at Austin \n\nAustin, Texas 78712 \n(ring@cs. utexas.edu) \n\nAbstract \n\nAn incremental, higher-order, non-recurrent network combines two \nproperties found to be useful for learning sequential tasks: higher(cid:173)\norder connections and incremental introduction of new units. The \nnetwork adds higher orders when needed by adding new units that \ndynamically modify connection weights. Since the new units mod(cid:173)\nify the weights at the next time-step with information from the \nprevious step, temporal tasks can be learned without the use of \nfeedback, thereby greatly simplifying training. Furthermore, a the(cid:173)\noretically unlimited number of units can be added to reach into \nthe arbitrarily distant past. Experiments with the Reber gram(cid:173)\nmar have demonstrated speedups of two orders of magnitude over \nrecurrent networks. \n\n1 \n\nINTRODUCTION \n\nSecond-order recurrent networks have proven to be very powerful [8], especially \nwhen trained using complete back propagation through time [1, 6, 14]. It has also \nbeen demonstrated by Fahlman that a recurrent network that incrementally adds \nnodes during training-his Recurrent Cascade-Correlation algorithm [5]-can be \nsuperior to non-incremental, recurrent networks [2,4, 11, 12, 15]. \n\nThe incremental, higher-order network presented here combines advantages of both \nof these approaches in a non-recurrent network. This network (a simplified, con-\n\n115 \n\n\f116 \n\nRing \n\ntinuous version of that introduced in [9]), adds higher orders when they are needed \nby the system to solve its task. 
This is done by adding new units that dynamically modify connection weights. The new units modify the weights at the next time-step with information from the last, which allows temporal tasks to be learned without the use of feedback. \n\n2 GENERAL FORMULATION \n\nEach unit (U) in the network is either an input (I), output (O), or high-level (L) unit. \n\nU_i(t): value of the ith unit at time t. \nI_i(t): U_i(t) where i is an input unit. \nO_i(t): U_i(t) where i is an output unit. \nT_i(t): target value for O_i(t) at time t. \nL^i_xy(t): U_i(t) where i is the higher-order unit that modifies weight w_xy at time t (footnote 1). \n\nThe output and high-level units are collectively referred to as non-input (N) units: \n\nN_i(t) = { O_i(t) if U_i = O_i; L^i_xy(t) if U_i = L^i_xy }. \n\nIn a given time-step, the output and high-level units receive a summed input from the input units: \n\nN_i(t) = sum_j I_j(t) g(i, j, t).    (1) \n\ng is a gating function representing the weight of a particular connection at a particular time-step. If there is a higher-order unit assigned to that connection, then the input value of that unit is added to the connection's weight at that time-step (footnote 2): \n\ng(i, j, t) = { w_ij(t) + L_ij(t - 1) if L_ij exists; w_ij(t) otherwise }.    (2) \n\nAt each time-step, the values of the output units are calculated from the input units and the weights (possibly modified by the activations of the high-level units from the previous time-step). The values of the high-level units are calculated at the same time in the same way. The output units generate the output of the network. The high-level units simply alter the weights at the next time-step. All unit activations can be computed simultaneously, since the activations of the L units are not required \n\nFootnote 1: A connection may be modified by at most one L unit. Therefore L_i, L_xy, and L^i_xy are identical but used as appropriate for notational convenience. 
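As an illustration, one time-step of equations 1 and 2 can be sketched as follows. This is a minimal sketch, not the paper's implementation; the data layout and names such as `step` and `L_units` are illustrative.

```python
# Sketch of one time-step of the network in equations 1 and 2.
# Each non-input unit i sums the inputs through gated weights; if a
# higher-order unit L_ij is assigned to connection (i, j), that unit's
# activation from the previous time-step is added to the weight.

def step(inputs, weights, L_units, prev_values):
    # inputs:      input-unit values I_j(t)
    # weights:     weights[i][j] = w_ij(t)
    # L_units:     L_units[i][j] = index of the L unit assigned to (i, j),
    #              or None if the connection has no higher-order unit
    # prev_values: activations N_k(t - 1) of all non-input units
    values = []
    for i, row in enumerate(weights):
        total = 0.0
        for j, x in enumerate(inputs):
            g = row[j]                       # w_ij(t)
            k = L_units[i][j]
            if k is not None:                # equation 2: add L_ij(t - 1)
                g += prev_values[k]
            total += x * g                   # equation 1
        values.append(total)
    return values                            # N_i(t), linear activation
```

Note that the activation is linear in the gated inputs; the only non-linearity comes from the product of an input with a dynamically modified weight.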
Footnote 2: It can be seen that this is a higher-order connection in the usual sense if one substitutes the right-hand side of equation 1 for L^i_xy in equation 2 and then replaces g in equation 1 with the result. In fact, as the network increases in height, ever higher orders are introduced, while lower orders are preserved. \n\nuntil the following time-step. The network is arranged hierarchically in that every higher-order unit is always higher in the hierarchy than the units on either side of the weight it affects. Since higher-order units have no outgoing connections, the network is not recurrent. It is therefore impossible for a high-level unit to affect, directly or indirectly, its own input. \n\nThere are no hidden units in the traditional sense, and all units have a linear activation function. (This does not imply that non-linear functions cannot be represented, since non-linearities do result from the multiplication of higher-level and input units in equations 1 and 2.) \n\nLearning is done through gradient descent to reduce the sum-squared error: \n\nE(t) = (1/2) sum_i (T_i(t) - O_i(t))^2.    (3) \n\nSince it may take several time-steps for the value of a weight to affect the network's output and therefore the error, the gradient is taken with respect to the weight's value several steps in the past: \n\nDelta w_ij(t) = -eta dE(t)/dw_ij(t - tau^i),    (4) \n\nwhere eta is the learning rate and \n\ntau^i = { 0 if U_i = O_i; 1 + tau^x if U_i = L^i_xy }. \n\nThe value tau^i is constant for any given unit i and specifies how \"high\" in the hierarchy unit i is. It therefore also specifies how many time-steps it takes for a change in unit i's activation to affect the network's output. \n\nDue to space limitations, the derivation of the gradient is not shown, but it is given elsewhere [10]. The resulting weight change rule, however, is: \n\nDelta w_ij(t) = I_j(t - tau^i) * { eta (T_i(t) - O_i(t)) if U_i = O_i; Delta w_xy(t) if U_i = L^i_xy }.    (5) \n\nThe weights are changed after error values for the output units have been collected. Since each high-level unit is higher in the hierarchy than the units on either side of the weight it affects, weight changes are made bottom up, and the Delta w_xy(t) in equation 5 will already have been calculated by the time Delta w_ij(t) is computed. \n\nThe intuition behind the learning rule is that each high-level unit learns to utilize the context from the previous time-step for adjusting the connection it influences at the next time-step, so that it can minimize the connection's error in that context. Therefore, if the information necessary to decide the correct value of a connection at one time-step is available at the previous time-step, then that information is used by the higher-order unit assigned to that connection. If the needed information is not available at the previous time-step, then new units may be built to look for the information at still earlier steps. This method of concentrating on unexpected events is similar to the \"hierarchy of decisions\" of Dawkins [3] and the \"history compression\" of Schmidhuber [13]. \n\n3 WHEN TO ADD NEW UNITS \n\nA unit is added whenever a weight is being pulled strongly in opposite directions (i.e., when learning is forcing the weight to increase and to decrease at the same time). The unit is created to determine the contexts in which the weight is pulled in each direction. This is done in the following way. Two long-term averages are kept for each connection. The first records the average change made to the weight: \n\navg(Delta w_ij)(t) = alpha Delta w_ij(t) + (1 - alpha) avg(Delta w_ij)(t - 1),    0 <= alpha <= 1. \n\nThe second is the long-term mean absolute deviation, kept in the same way: \n\nmad(Delta w_ij)(t) = alpha |Delta w_ij(t)| + (1 - alpha) mad(Delta w_ij)(t - 1). \n\nThe parameter alpha specifies the duration of the long-term average. 
A lower value of alpha means that the average is kept for a longer period of time. When avg(Delta w_ij)(t) is small but mad(Delta w_ij)(t) is large, the weight is being pulled strongly in conflicting directions, and a new unit is built: \n\nif mad(Delta w_ij)(t) / (epsilon + |avg(Delta w_ij)(t)|) > theta, then build unit L^{N+1}_ij, \n\nwhere epsilon is a small constant that keeps the denominator from being zero, theta is a threshold value, and N is the number of units in the network. A related method for adding new units in feed-forward networks was introduced by Wynne-Jones [16]. \n\nWhen a new unit is added, its incoming weights are initially zero. It has no output weights but simply learns to anticipate and reduce, at each time-step, the error of the weight it modifies. In order to keep the number of new units low, whenever a unit L_ij is created, the statistics for all connections into the destination unit (U_i) are reset: mad(Delta w_ij)(t) <- 0.0 and avg(Delta w_ij)(t) <- 1.0. \n\n4 RESULTS \n\nThe Reber grammar is a small finite-state grammar of the following form: \n\n[Figure: state diagram of the Reber grammar. Transitions between states are arcs labeled B, T, S, X, V, P, and E.] \n\nTransitions from one node to the next are made by way of the labeled arcs. The task of the network is: given as input the label of the arc just traversed, predict the arc that will be traversed next. \n\nNetwork                             Sequences Seen (Mean / Best)    \"Hidden\" Units \nElman Network                       20,000 / --                     15 \nRTRL                                19,000 / --                     2 \nRecurrent Cascade-Correlation       25,000 / --                     2-3 \nIncremental Higher-Order Network    206 / 176                       40 \n\nTable 1: The incremental higher-order network is compared against recurrent networks on the Reber grammar. The results for the recurrent networks are quoted from other sources [2, 5]. The mean and/or best performance is shown when available. RTRL is the real-time recurrent learning algorithm [15]. 
A training sequence, or string, is generated by starting with a B transition and then randomly choosing an arc leading away from the current state until the final state is reached. Both inputs and outputs are encoded locally, so that there are seven output units (one each for B, T, S, X, V, P, and E) and eight input units (the same seven plus one bias unit). The network is considered correct if its most strongly activated outputs correspond to the arcs that can be traversed from the current state. Note that the current state cannot be determined from the current input alone. \n\nAn Elman-type recurrent network was able to learn this task after 20,000 string presentations using 15 hidden units [2]. (The correctness criterion for the Elman net was slightly more stringent than that described in the previous paragraph.) Recurrent Cascade-Correlation (RCC) was able to learn this task using only two or three hidden units in an average of 25,000 string presentations [5]. \n\nThe incremental, higher-order network was trained on a continuous stream of input: the network was not reset before beginning a new string. Training was considered complete only after the network had correctly classified 100 strings in a row. Using this criterion, the network completed training after an average of 206.3 string presentations with a standard deviation of 16.7. It achieved perfect generalization on test sets of 128 randomly generated strings in all ten runs. Because the Reber grammar is stochastic, a ceiling of 40 higher-order units was imposed on the network to prevent it from continually creating new units in an attempt to outguess the random number generator. \n\nComplete results for the network on the Reber grammar task are given in table 1. The parameter settings were: eta = 0.04, alpha = 0.08, theta = 1.0, epsilon = 0.1, and Bias = 0.0. (The network seemed to perform better with no bias unit.) 
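For concreteness, training strings of the kind described above can be generated directly from the grammar's transition table. This is a sketch; the state numbering is illustrative, and the transition table is assumed to be the standard Reber grammar used in [2].

```python
import random

# Assumed standard Reber grammar: each state maps to the (symbol, next
# state) choices available from it; None marks the final state.
TRANSITIONS = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
    5: [('E', None)],
}

def reber_string(rng=random):
    # A string starts with a B transition, then follows randomly chosen
    # arcs until the final state is reached (emitting E).
    symbols, state = ['B'], 0
    while state is not None:
        symbol, state = rng.choice(TRANSITIONS[state])
        symbols.append(symbol)
    return ''.join(symbols)
```

At each time-step the network would receive the current symbol (plus bias) as input and should activate the outputs for every symbol that could legally come next.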
\n\nThe network has also been tested on the \"variable gap\" tasks introduced by Mozer [7], as shown in figure 1. These tasks were intended to test the performance of networks over long time-delays. Two sequences are alternately presented to the network. Each sequence begins with an X or a Y and is followed by a fixed string of characters with an X or a Y inserted some number of time-steps from the beginning. In figure 1 the number of time-steps, or \"gap\", is 2. The only difference between the two sequences is that the first begins with an X and repeats the X after the gap, while the second begins with a Y and repeats the Y after the gap. The network must learn to predict the next item in the sequence given the current item as input (where all inputs are locally encoded). In order for the network to predict the second occurrence of the X or Y, it must remember how the sequence began. The length of the gap can be increased in order to create tasks of greater difficulty. \n\nTime-step:  0  1  2  3  4  5  6  7  8  9  10  11  12 \nSequence 1: X  a  b  X  c  d  e  f  g  h  i   j   k \nSequence 2: Y  a  b  Y  c  d  e  f  g  h  i   j   k \n\nFigure 1: An example of a \"variable gap\" training sequence [7]. One item is presented to the network at each time-step. The target is the next item in the sequence. Here the \"gap\" is two, because there are two items in the sequence between the first X or Y and the second X or Y. In order to correctly predict the second X or Y, the network must remember how the sequence began. \n\nResults of the \"gap\" tasks are given in table 2. The values for the standard recurrent network and for Mozer's own variation are quoted from Mozer's paper [7]. The incremental higher-order net had no difficulty with gaps up to 24, which was the largest gap I tested. 
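The pair of training sequences in figure 1 can be constructed for any gap size roughly as follows. This is a sketch; the filler alphabet and the nine trailing items are illustrative, and it assumes a gap small enough that no filler character repeats.

```python
def gap_sequences(gap):
    # Builds the two 'variable gap' training sequences of figure 1: each
    # begins with X or Y and repeats that marker after `gap` filler items.
    # Illustrative filler; assumes gap <= 17 so no character repeats.
    filler = 'abcdefghijklmnopqrstuvwxyz'
    sequences = []
    for marker in ('X', 'Y'):
        seq = [marker] + list(filler[:gap]) + [marker] + list(filler[gap:gap + 9])
        sequences.append(seq)
    return sequences
```

With `gap = 2` this reproduces the two sequences of figure 1; the two sequences differ only in their markers, so only the repeated marker forces the network to remember the beginning.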
The same string was used for all tasks (except for the position of the second X or Y), and it had no repeated characters (again with the exception of the X and Y). The network continued to scale linearly with every gap size, both in terms of units and epochs required for training. Because these tasks are not stochastic, the network always stopped building units as soon as it had created those necessary to solve each task. \n\nThe parameter settings were: eta = 1.5, alpha = 0.2, theta = 1.0, epsilon = 0.1, and Bias = 0.0. The network was considered to have correctly predicted an element in the sequence if the most strongly activated output unit was the unit representing the correct prediction. The sequence was considered correctly predicted if all elements (other than the initial X or Y) were correctly predicted. \n\nGap    Standard         Mozer      Incremental         Units \n       Recurrent Net    Network    Higher-Order Net    Created \n 2        468             328          4                 10 \n 4       7406             584          6                 15 \n 6       9830             992          8                 19 \n 8     > 10000           1312         10                 23 \n10     > 10000           1630         12                 27 \n24        --               --         26                 49 \n\nTable 2: A comparison on the \"gap\" tasks of a standard recurrent network and a network devised specifically for long time-delays (both quoted from Mozer [7], who reported results for gaps up to ten) against the incremental higher-order network. The first three result columns give the mean number of training sets required; the last column is the number of units created by the incremental higher-order net. \n\n5 CONCLUSIONS \n\nThe incremental higher-order network performed much better than the networks it was compared against on these tiny tests. A few caveats are in order, however. First, the parameters given for the tasks above were customized for those tasks. Second, the network may add a large number of new units if the task contains many context-dependent events or if it is inherently stochastic. 
Third, though the network can in principle build an ever larger hierarchy that searches further and further back in time for a context that will predict what a connection's weight should be, many units may be needed to bridge a long time-gap. Finally, once a bridge across a time-delay is created, it does not generalize to other time-delays. \n\nOn the other hand, the network learns very fast due to its simple structure, which adds high-level units only when needed. Since there is no feedback (i.e., no unit ever produces a signal that will ever feed back to itself), learning can be done without back propagation through time. Also, since the outputs and high-level units have a fan-in equal only to the number of inputs, the number of connections in the system is much smaller than in a traditional network with the same number of hidden units. \n\nFinally, the network can be thought of as a system of continuous-valued condition-action rules that are inserted or removed depending on another set of such rules, which are in turn inserted or removed depending on another set, etc. When new rules (new units) are added, they are initially invisible to the system (i.e., they have no effect), but they gradually learn to have an effect as the opportunity to decrease error presents itself. \n\nAcknowledgements \n\nThis work was supported by a NASA Johnson Space Center Graduate Student Researchers Program training grant, NGT 50594. I would like to thank Eric Hartman, Kadir Liano, and my advisor Robert Simmons for useful discussions and helpful comments on drafts of this paper. I would also like to thank Pavilion Technologies, Inc. for their generous contribution of computer time and office space required to complete much of this work. \n\nReferences \n\n[1] Jonathan Richard Bachrach. Connectionist Modeling and Control of Finite State Environments. 
PhD thesis, Department of Computer and Information Sciences, University of Massachusetts, February 1992. \n\n[2] Axel Cleeremans, David Servan-Schreiber, and James L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372-381, 1989. \n\n[3] Richard Dawkins. Hierarchical organisation: a candidate principle for ethology. In P. P. G. Bateson and R. A. Hinde, editors, Growing Points in Ethology, pages 7-54, Cambridge, 1976. Cambridge University Press. \n\n[4] Jeffrey L. Elman. Finding structure in time. CRL Technical Report 8801, University of California, San Diego, Center for Research in Language, April 1988. \n\n[5] Scott E. Fahlman. The recurrent cascade-correlation architecture. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 190-196, San Mateo, California, 1991. Morgan Kaufmann Publishers. \n\n[6] C. L. Giles, C. B. Miller, D. Chen, G. Z. Sun, H. H. Chen, and Y. C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317-324, San Mateo, California, 1992. Morgan Kaufmann Publishers. \n\n[7] Michael C. Mozer. Induction of multiscale temporal structure. In John E. Moody, Steven J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 275-282, San Mateo, California, 1992. Morgan Kaufmann Publishers. \n\n[8] Jordan B. Pollack. The induction of dynamical recognizers. Machine Learning, 7:227-252, 1991. \n\n[9] Mark B. Ring. Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In Lawrence A. Birnbaum and Gregg C. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop (ML91), pages 343-347. 
Morgan Kaufmann Publishers, June 1991. \n\n[10] Mark B. Ring. Sequence learning with incremental higher-order neural networks. Technical Report AI 93-193, Artificial Intelligence Laboratory, University of Texas at Austin, January 1993. \n\n[11] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987. \n\n[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, 1986. \n\n[13] Jürgen Schmidhuber. Learning unambiguous reduced sequence descriptions. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 291-298, San Mateo, California, 1992. Morgan Kaufmann Publishers. \n\n[14] Raymond L. Watrous and Gary M. Kuhn. Induction of finite-state languages using second-order recurrent networks. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309-316, San Mateo, California, 1992. Morgan Kaufmann Publishers. \n\n[15] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989. \n\n[16] Mike Wynne-Jones. Node splitting: A constructive algorithm for feed-forward neural networks. Neural Computing and Applications, 1(1):17-22, 1993. \n", "award": [], "sourceid": 660, "authors": [{"given_name": "Mark", "family_name": "Ring", "institution": null}]}