{"title": "The \"Moving Targets\" Training Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 558, "page_last": 565, "abstract": null, "full_text": "558 \n\nRohwer \n\nThe 'Moving Targets' Training Algorithm \n\nRichard Rohwer \n\nCentre for Speech Technology Research \n\nEdinburgh University \n\n80, South Bridge \n\nEdinburgh EH1 1HN SCOTLAND \n\nABSTRACT \n\nA simple method for training the dynamical behavior of a neu(cid:173)\nral network is derived. It is applicable to any training problem \nin discrete-time networks with arbitrary feedback. The algorithm \nresembles back-propagation in that an error function is minimized \nusing a gradient-based method, but the optimization is carried out \nin the hidden part of state space either instead of, or in addition to \nweight space. Computational results are presented for some simple \ndynamical training problems, one of which requires response to a \nsignal 100 time steps in the past. \n\nINTRODUCTION \n\n1 \nThis paper presents a minimization-based algorithm for training the dynamical be(cid:173)\nhavior of a discrete-time neural network model. The central idea is to treat hidden \nnodes as target nodes with variable training data. These \"moving targets\" are \nvaried during the minimization process. Werbos (Werbos, 1983) used the term \n\"moving targets\" to describe the qualitative idea that a network should set itself \nintermediate objectives, and vary these objectives as information is accumulated on \ntheir attainability and their usefulness for achieving overall objectives. The (coin(cid:173)\ncidentally) like-named algorithm presented here can be regarded as a quantitative \nrealization of this qualitative idea. \n\nThe literature contains several temporal training algorithms based on minimization \nof an error measure with respect to the weights. 
This type of method includes the straightforward extension of the back-propagation method to back-propagation through time (Rumelhart, 1986), the methods of Rohwer and Forrest (Rohwer, 1987), Pearlmutter (Pearlmutter, 1989), and the forward propagation of derivatives (Robinson, 1988; Williams, 1989a; Williams, 1989b; Kuhn, 1990). A careful comparison of moving targets with back-propagation in time and teacher forcing appears in (Rohwer, 1989b). Although applicable only to fixed-point training, the algorithms of Almeida (Almeida, 1989) and Pineda (Pineda, 1988) have much in common with these dynamical training algorithms. The formal relationship between these and the method of Rohwer and Forrest is spelled out in (Rohwer, 1989a). \n\n2 NOTATION AND STATEMENT OF THE TRAINING PROBLEM \n\nConsider a neural network model with arbitrary feedback as a dynamical system in which the dynamical variables x_{it} change with time according to a dynamical law given by the mapping \n\nx_{it} = \\sum_j w_{ij} f(x_{j,t-1}), \\qquad x_{0t} = \\mbox{bias constant}, \\qquad (1) \n\nunless specified otherwise. The weights w_{ij} are arbitrary parameters representing the connection strength from node j to node i. f is an arbitrary differentiable function. Let us call any given variable x_{it} the \"activation\" on node i at time t. It represents the total input into node i at time t. Let the \"output\" of each node be denoted by y_{it} = f(x_{it}). Let node 0 be a \"bias node\", assigned a positive constant activation so that the weights w_{i0} can be interpreted as activation thresholds. \n\nIn normal back-propagation, a network architecture is defined which divides the network into input, hidden, and target nodes. The moving targets algorithm makes itself applicable to arbitrary training problems by defining analogous concepts in a manner dependent upon the training data, but independent of the network architecture. 
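For concreteness, the dynamical law (1) can be sketched in a few lines of NumPy. The network size, the random weight values, and the choice f = tanh are illustrative assumptions of this sketch, not details taken from the paper:

```python
import numpy as np

def step(W, x_prev, f=np.tanh, bias=1.0):
    """One application of the dynamical law (1):
    x_it = sum_j W_ij * f(x_{j,t-1}), with node 0 held at a constant
    positive activation so that W_i0 acts as a threshold."""
    x = W @ f(x_prev)
    x[0] = bias  # bias node: overruled to its constant activation
    return x

# Tiny illustrative run: 4 nodes (node 0 is the bias node).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
x = np.zeros(4)
x[0] = 1.0
trajectory = [x]
for t in range(5):
    x = step(W, x)
    trajectory.append(x)
print(np.round(trajectory[-1], 3))
```

Training data for input events would simply overwrite the corresponding entries of x at each step, overruling the law in the same way the bias node is overruled here.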
Let us call a node-time pair an \"event\". To define a training problem, the set of all events must be divided into three disjoint sets, the input events I, the target events T, and the hidden events H. A node may participate in different types of event at different times. For every input event (it) \\in I, we require training data X_{it} with which to overrule the dynamical law (1) using \n\nx_{it} = X_{it}, \\qquad (it) \\in I. \\qquad (2) \n\n(The bias events (0t) can be regarded as a special case of input events.) For each target event (it) \\in T, we require training data X_{it} to specify a desired activation value for event (it). No notational ambiguity arises from referring to input and target data with the same symbol X because I and T are required to be disjoint sets. The training data says nothing about the hidden events in H. There is no restriction on how the initial events (i0) are classified. \n\n3 THE \"MOVING TARGETS\" METHOD \n\nLike back-propagation, the moving targets training method uses (arbitrary) gradient-based minimization techniques to minimize an \"error\" function such as the \"output deficit\" \n\nE_{od} = \\frac{1}{2} \\sum_{(it) \\in T} \\{y_{it} - Y_{it}\\}^2, \\qquad (3) \n\nwhere y_{it} = f(x_{it}) and Y_{it} = f(X_{it}). A modification of the output deficit error gave the best results in numerical experiments. However, the most elegant formalism follows from an \"activation deficit\" error function: \n\nE_{ad} = \\frac{1}{2} \\sum_{(it) \\in T} \\{x_{it} - X_{it}\\}^2, \\qquad (4) \n\nso this is what we shall use to present the formalism. \n\nThe basic idea is to treat the hidden node activations as variable target activations. Therefore let us denote these variables as X_{it}, just as the (fixed) targets and inputs are denoted. Let us write the computed activation values x_{it} of the hidden and target events in terms of the inputs and (fixed and moving) targets of the previous time step. 
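The output deficit (3) and activation deficit (4) can be sketched as follows; the [node, time] array layout and the boolean mask standing in for the target set T are bookkeeping assumptions of this sketch, not notation from the paper:

```python
import numpy as np

f = np.tanh  # an arbitrary differentiable output function

def output_deficit(x, X, target_mask):
    """Eq. (3): E_od = 1/2 * sum over target events of (y_it - Y_it)^2,
    where y = f(x) is computed and Y = f(X) is desired."""
    d = f(x) - f(X)
    return 0.5 * np.sum(d[target_mask] ** 2)

def activation_deficit(x, X, target_mask):
    """Eq. (4): E_ad = 1/2 * sum over target events of (x_it - X_it)^2."""
    d = x - X
    return 0.5 * np.sum(d[target_mask] ** 2)

# Arrays indexed [node, time]; the mask marks a single target event.
x = np.array([[0.2, 0.8], [0.5, -0.3]])   # computed activations
X = np.array([[0.0, 1.0], [0.0,  0.0]])   # training data
mask = np.array([[False, True], [False, False]])
print(output_deficit(x, X, mask), activation_deficit(x, X, mask))
```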
Then let us extend the sum in (4) to include the hidden events, so the error becomes \n\nE = \\frac{1}{2} \\sum_{(it) \\in T \\cup H} \\left\\{ \\sum_j w_{ij} f(X_{j,t-1}) - X_{it} \\right\\}^2. \\qquad (5) \n\nThis is a function of the weights w_{ij}, and because there are no x's present, the full dependence on w_{ij} is explicitly displayed. We do not actually have desired values for the X_{it} with (it) \\in H. But any values for which weights can be found which make (5) vanish would be suitable, because this would imply not only that the desired targets are attained, but also that the dynamical law is followed on both the hidden and target nodes. Therefore let us regard E as a function of both the weights and the \"moving targets\" X_{it}, (it) \\in H. This is the essence of the method. The derivatives with respect to all of the independent variables can be computed and plugged into a standard minimization algorithm. \n\nThe reason for preferring the activation deficit form of the error (4) to the output deficit form (3) is that the activation deficit form makes (5) purely quadratic in the weights. Therefore the equations for the minimum, \n\n\\partial E / \\partial w_{ij} = 0, \\qquad (6) \n\nform a linear system, the solution of which provides the optimal weights for any given set of moving targets. Therefore these equations might as well be used to define the weights as functions of the moving targets, thereby making the error (5) a function of the moving targets alone. \n\nThe derivation of the derivatives with respect to the moving targets is spelled out in (Rohwer, 1989b). The result is: \n\n\\frac{dE}{dX_{it}} = -\\gamma_{it} e_{it} + f'_{it} \\sum_k \\gamma_{k,t+1} w_{ki} e_{k,t+1}, \\qquad (7) \n\nwhere \n\n\\gamma_{it} = 1 \\mbox{ if } (it) \\in T \\cup H, \\qquad \\gamma_{it} = 0 \\mbox{ if } (it) \\notin T \\cup H, \\qquad (8) \n\ne_{it} = \\sum_j w_{ij} f(X_{j,t-1}) - X_{it}, \\qquad (9) \n\nf'_{it} = \\left. \\frac{df(x)}{dx} \\right|_{x = X_{it}}, \\qquad (10) \n\nand \n\nw_{ij} = \\sum_k \\left( \\sum_t \\gamma_{it} X_{it} Y_{k,t-1} \\right) M(i)^{-1}_{kj}, \\qquad (11) \n\nwhere M(a)^{-1} is the inverse of M(a), the correlation matrix of the node outputs defined by \n\nM(a)_{ij} = \\sum_t \\gamma_{at} Y_{i,t-1} Y_{j,t-1}. \\qquad (12) \n\nIn the event that any of the matrices M are singular, a pseudo-inversion method such as singular value decomposition (Press, 1988) can be used to define a unique solution among the infinite number available. \n\nNote also that (11) calls for a separate matrix inversion for each node. However, if the set of input nodes remains fixed for all time, then all these matrices are equal. \n\n3.1 FEEDFORWARD VERSION \n\nThe basic ideas used in the moving targets algorithm can be applied to feedforward networks to provide an alternative method to back-propagation. The hidden node activations for each training example become the moving target variables. Further details appear in (Rohwer, 1989b). The moving targets method for feedforward nets is analogous to the method of Grossman, Meir, and Domany (Grossman, 1990a, 1990b) for networks with discrete node values. Birmiwal, Sarwal, and Sinha (Birmiwal, 1989) have developed an algorithm for feedforward networks which incorporates the use of hidden node values as fundamental variables and a linear system of equations for obtaining the weight matrix. Their algorithm differs from the feedforward version of moving targets mainly in the (inessential) use of a specific minimization algorithm which discards most of the gradient information except for the signs of the various derivatives. Heileman, Georgiopoulos, and Brown (Heileman, 1989) also have an algorithm which bears some resemblance to the feedforward version of moving targets. Another similar algorithm has been developed by Krogh, Hertz, and Thorbergsson (Krogh, 1989, 1990). \n\n4 COMPUTATIONAL RESULTS \n\nA set of numerical experiments performed with the activation deficit form of the algorithm (4) is reported in (Rohwer, 1989b). 
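Because (5) is quadratic in the weights, the optimal weights for a given set of moving targets follow from the linear system (11)-(12). A minimal sketch, solving the system row by row with a pseudo-inverse; the toy data, array shapes, and f = tanh are this sketch's assumptions:

```python
import numpy as np

f = np.tanh

def optimal_weights(Xs, gamma):
    """Eq. (11): w_i = [sum_t gamma_it X_it Y_{k,t-1}] M(i)^{-1}, with
    M(i) from eq. (12) the correlation matrix of previous-step outputs.
    Xs: activations/targets indexed [node, time]; gamma: boolean mask
    of the events in T u H."""
    Y = f(Xs)
    n = Xs.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        g = gamma[i, 1:]                # events (i,t) with t >= 1
        Yprev = Y[:, :-1][:, g]         # Y_{k,t-1} for those events
        M = Yprev @ Yprev.T             # eq. (12)
        b = Yprev @ Xs[i, 1:][g]        # sum_t X_it Y_{k,t-1}
        W[i] = np.linalg.pinv(M) @ b    # pseudo-inverse handles singular M
    return W

def moving_targets_error(W, Xs, gamma):
    """Eq. (5): half the sum over (it) in T u H of
    (sum_j w_ij f(X_{j,t-1}) - X_it)^2."""
    e = W @ f(Xs[:, :-1]) - Xs[:, 1:]
    return 0.5 * np.sum(e[gamma[:, 1:]] ** 2)

rng = np.random.default_rng(1)
Xs = rng.normal(size=(3, 8))          # 3 nodes, 8 time steps (toy data)
gamma = np.ones((3, 8), dtype=bool)   # pretend every non-initial event is in T u H
W = optimal_weights(Xs, gamma)
print(moving_targets_error(W, Xs, gamma))
```

In a full implementation, the gradient (7) with respect to the hidden X_it would then be handed to a general-purpose minimizer, with W recomputed from (11) as the targets move.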
Some success was attained, but greater progress was made after changing to a quartic output deficit error function with temporal weighting of errors: \n\nE_{quartic} = \\frac{1}{4} \\sum_{(it) \\in T} (1 + \\alpha t) \\{y_{it} - Y_{it}\\}^4. \\qquad (13) \n\nHere \\alpha is a small positive constant. The quartic function is dominated by the terms with the greatest error. This combats a tendency to fail on a few infrequently seen state transitions in order to gain unneeded accuracy on a large number of similar, low-error state transitions. The temporal weighting encourages the algorithm to focus first on late-time errors, and then work back in time. In some cases this helped with local minimum difficulties. A difficulty with convergence to chaotic attractors reported in (Rohwer, 1989b) appears to have mysteriously disappeared with the adoption of this error measure. \n\n4.1 MINIMIZATION ALGORITHM \n\nFurther progress was made by altering the minimization algorithm. Originally the conjugate gradient algorithm (Press, 1988) was used, with a linesearch algorithm from Fletcher (Fletcher, 1980). The new algorithm might be called \"curvature avoidance\". The change in the gradient with each linesearch is used to update a moving average estimate of the absolute value of the diagonal components of the Hessian. The linesearch direction is taken to be the component-by-component quotient of the gradient with these curvature averages. Were it not for the absolute values, this would be an unusual way of estimating the conjugate gradient. The absolute values are used to discourage exploration of directions which show any hint of being highly curved. The philosophy is that by exploring low-curvature directions first, narrow canyons are entered only when necessary. \n\n4.2 SIMULATIONS \n\nSeveral simulations have been done using fully connected networks. Figure 1 plots the node outputs of a network trained to switch between different limit cycles under input control. 
There are two input nodes, one target node, and two hidden nodes, as indicated in the left margin. Time proceeds from left to right. The oscillation period of the target node increases with the binary number represented by the two input nodes. The network was trained on one period of each of the four frequencies. \n\nFigure 1: Controlled switching between limit cycles \n\nFigure 2 shows the operation of a network trained to detect whether an even or odd number of pulses have been presented to the input; a temporal version of parity detection. The network was trained on the data preceding the third input pulse. \n\nFigure 2: Parity detection \n\nFigure 3 shows the behavior of a network trained to respond to the second of two input pulses separated by 100 time steps. This demonstrates a unique (in the author's knowledge) capability of this method, an ability to utilize very distant temporal correlations when there is no other way to solve the problem. This network was trained and tested on the same data, the point being merely to show that training is possible in this type of problem. More complex problems of this type frequently get stuck in local minima. 
\n\nlog file: lu6Isiplrr/rmndir/movinglargelslworlclcx1l1ogslcx100.1r \n\nH r \n\nT \n\nI I \n\nr \n\n{ \n\nr \n\nI \n\nJ \n\nI \n\nFigure 3: Responding to temporally distant input \n\n5 CONCLUDING REMARKS \nThe simulations show that this method works, and show in particular that distant \ntemporal correlations can be discovered. Some practical difficulties have emerged, \nhowever, which are currently limiting the application of this technique to 'toy' \nproblems. The most serious are local minima and long training times. Problems \ninvolving large amounts of training data may present the minimization problem \nwith an impractically large number of variables. Variations of the algorithm are \nbeing studied in hopes of overcomming these difficulties. \n\nAcknowledgements \n\nThis work was supported by ESPRIT Basic Research Action 3207 ACTS. \n\nReferences \n\nL. Almeida, (1989), \"Backpropagation in Non-Feedforward Networks\", in Neural \nComputing Architecture!, I. Aleksander, ed., North Oxford Academic. \n\nK. Birmiwal, P. Sarwal, and S. Sinha, (1989), \"A new Gradient-Free Learning \nAlgorithm\", Tech. report, Dept. of EE, Southern Illinois U., Carbondale. \n\nR. Fletcher, (1980), Practical Methods of Optimization, v1, Wiley. \n\nT. Grossman, (1990a), \"The CHIR Algorithm: A Generalization for Multiple Out(cid:173)\nput and Multilayered Networks\" , to appear in Complex Systems. \n\n\fThe 'Moving Targets' Training Algorithm \n\n565 \n\nT. Grossman, (1990bL this volume. \nG. L. Heileman, M. Georgiopoulos, and A. K. Brown, (1989), \"The Minimal Dis(cid:173)\nturbance Back Propagation Algorithm\", Tech. report, Dept. of EE, U. of Central \nFlorida, Orlando. \nA. Krogh, J. A. Hertz, and G.1. Thorbergsson, (1989), \"A Cost Function for Internal \nRepresentations\", NORDITA preprint 89/37 S. \nA. Krogh, J. A. Hertz, and G. I. Thorbergsson, (1990), this volume. \nG. Kuhn, (1990) \"Connected Recognition with a Recurrent Network\", to appear in \nProc. 
NEUROSPEECH, 18 May 1989, as special issue of Speech Communication, 9, no. 2. \n\nB. Pearlmutter, (1989), \"Learning State Space Trajectories in Recurrent Neural Networks\", Proc. IEEE IJCNN 89, Washington D. C., II-365. \n\nF. Pineda, (1988), \"Dynamics and Architecture for Neural Computation\", J. Complexity 4, 216. \n\nW. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, (1988), Numerical Recipes in C, The Art of Scientific Computing, Cambridge. \n\nA. J. Robinson and F. Fallside, (1988), \"Static and Dynamic Error Propagation Networks with Applications to Speech Coding\", Neural Information Processing Systems, D. Z. Anderson, Ed., AIP, New York. \n\nR. Rohwer and B. Forrest, (1987), \"Training Time Dependence in Neural Networks\", Proc. IEEE ICNN, San Diego, II-701. \n\nR. Rohwer and S. Renals, (1989a), \"Training Recurrent Networks\", in Neural Networks from Models to Applications, L. Personnaz and G. Dreyfus, eds., I.D.S.E.T., Paris, 207. \n\nR. Rohwer, (1989b), \"The 'Moving Targets' Training Algorithm\", to appear in Proc. DANIP, GMD Bonn, J. Kindermann and A. Linden, Eds. \n\nD. Rumelhart, G. Hinton and R. Williams, (1986), \"Learning Internal Representations by Error Propagation\", in Parallel Distributed Processing, v. 1, MIT. \n\nP. Werbos, (1983), Energy Models and Studies, B. Lev, Ed., North Holland. \n\nR. Williams and D. Zipser, (1989a), \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\", Neural Computation 1, 270. \n\nR. Williams and D. Zipser, (1989b), \"Experimental Analysis of the Real-time Recurrent Learning Algorithm\", Connection Science 1, 87. \n", "award": [], "sourceid": 233, "authors": [{"given_name": "Richard", "family_name": "Rohwer", "institution": null}]}