{"title": "Fixed Point Analysis for Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 149, "page_last": 159, "abstract": null, "full_text": "149 \n\nFIXED POINT ANALYSIS FOR RECURRENT NETWORKS \n\nMary B. Ottaway \nDana H. Ballard \nPatrice Y. Simard \n\nDept. of Computer Science \nUniversity of Rochester \nRochester NY 14627 \n\nABSTRACT \n\nThis paper provides a systematic analysis of the recurrent backpropagation (RBP) algorithm, introducing a number of new results. The main limitation of the RBP algorithm is that it assumes the convergence of the network to a stable fixed point in order to backpropagate the error signals. We show by experiment and eigenvalue analysis that this condition can be violated and that chaotic behavior can be avoided. Next we examine the advantages of RBP over the standard backpropagation algorithm. RBP is shown to build stable fixed points corresponding to the input patterns. This makes it an appropriate tool for content addressable memories, one-to-many function learning, and inverse problems. \n\nINTRODUCTION \n\nIn the last few years there has been a great resurgence of interest in neural network learning algorithms. One of the most successful of these is the backpropagation learning algorithm of [Rumelhart 86], which has shown its usefulness in a number of applications. This algorithm is representative of others that exploit internal units to represent very nonlinear decision surfaces [Lippman 87] and thus overcomes the limits of the classical perceptron [Rosenblatt 62]. \n\nFor all its enormous advantages, the backpropagation algorithm has a number of disadvantages. Two of these are the inability to fill in patterns and the inability to solve one-to-many inverse problems [Jordan 88]. These limitations follow from the fact that the algorithm is only defined for a feedforward network. 
Thus if part of the pattern is missing or corrupted in the input, this error will be propagated through to the output and the original pattern will not be restored. In one-to-many problems, several solutions are possible for a given input. On a feedforward net, the competing targets for a given input introduce contradictory error signals and learning is unsuccessful. \n\nVery recently, these limitations have been removed with the specification of a recurrent backpropagation algorithm [Pineda 87]. This algorithm effectively extends the backpropagation idea to networks of arbitrary connection topologies. This advantage, however, does not come without some risk. Since the connections in the network are not symmetric, the stability of the network is not guaranteed. For some choices of weights, the state of the units may oscillate indefinitely. \n\nThis paper provides a systematic analysis of the recurrent backpropagation (RBP) algorithm, introducing a number of new results. The main limitation of the RBP algorithm is that it assumes the convergence of the network to a stable fixed point in order to backpropagate the error signals. We show by experiment and eigenvalue analysis that this condition can be violated and that chaotic behavior can be avoided. \n\nNext we examine the advantage in convergence speed of RBP over the standard backpropagation algorithm. RBP is shown to build stable fixed points corresponding to the input patterns. This makes it an appropriate tool for content addressable memories, one-to-many function learning, and inverse problems. \n\nMODEL DESCRIPTION \n\nThe simulations have been done on a recurrent backpropagation network with first order units. Using the same formalism as [Pineda 87], the state vector x is updated according to the equation: \n\ndx_i/dt = -x_i + g(u_i) + I_i    (1) \n\nwhere \n\nu_i = sum_j w_ij x_j    for i = 1, 2, ..., N    (2) \n\nThe activation function g is the logistic function \n\ng(u) = 1 / (1 + e^(-u))    (3) \n\nThe networks we will consider are organized in modules (or sets) of units that perform similar functions. For example, we talk about a fully connected module if each unit in the module is connected to each of the others. An input module is a set of units where each unit has a non-zero input function I_i. Note that a single unit can belong to more than one module at a time. The performance of the network is measured through the energy function: \n\nE = (1/2) sum_{i=1}^{N} J_i^2    (4) \n\nwhere \n\nJ_i = T_i - x_i    (5) \n\nAn output module is a set of units i such that J_i != 0. Units that do not belong to any input or output modules are called hidden units. A unit (resp. module) can be clamped or unclamped. When the unit (resp. module) is unclamped, I_i = J_i = 0 for the unit (resp. the module). If the unit is clamped, it behaves according to the pattern presented to the network. Unclamping a unit makes it hidden. Clamping and unclamping are handy concepts for the study of content addressable memory or generalization. \n\nThe goal for the network is to minimize the energy function by changing the weights accordingly. One way is to perform a gradient descent in E using the delta rule: \n\ndw_ij/dt = -eta dE/dw_ij    (6) \n\nwhere eta is a learning rate constant. The weight variation as a function of the error is given by the formula [Pineda 87, Almeida 87] \n\ndw_ij/dt = eta y_i^inf x_j^inf    (7) \n\nwhere y^inf is a solution of the dynamical system \n\ndy_i/dt = -y_i + sum_j g'(u_j) w_ji y_j + J_i    (8) \n\nThe above discussion assumes that the input function I and the target T are constant over time. In our simulations, however, we have a set of patterns P_alpha presented to the network. A pattern is a tuple in ([0,1] U {U})^N, where N is the total number of units and U stands for unclamped. 
The ith value of the tuple is the value assigned to I_i and T_i when the pattern is presented to the network (if the value is U, the unit is unclamped for the duration of the presentation of the pattern). This definition of a pattern does not allow I_i(alpha) and T_i(alpha) to have different values. This is not an important restriction, however, since we can always simulate such an (inconsistent) unit with two units. The energy function to be minimized over all the patterns is defined by the equation: \n\nE_total = sum_alpha E(alpha)    (9) \n\nThe gradient of E_total is simply the sum of the gradients of E(alpha), and hence the updating equation has the form: \n\ndw_ij/dt = eta sum_alpha y_i(alpha) x_j(alpha)    (10) \n\nWhen a pattern P_alpha is presented to the network, an approximation of x_j^inf(alpha) is first computed by doing a few iterations using equation 1 (propagation). Then, an approximation of y^inf(alpha) is evaluated by iterating equation 8 (backpropagation). The weights are finally updated using equation 10. If we assume the number of iterations needed to evaluate x_j^inf(alpha) and y_j^inf(alpha) to be constant, the total number of operations required to update the weights is O(N^2). The validity of this assumption will be discussed in a later section. \n\nCONVERGENCE OF THE NETWORK \n\nThe learning algorithm for our network assumes a correct approximation of x^inf. This value is computed by recursively propagating the activation signals according to equation 1. The effect of varying the number of propagations can be illustrated with a simple experiment. Consider a fully connected network of eight units (a directed anti-reflexive graph). Four of them are auto-associative units which are presented various patterns of zeroes and ones. An auto-associative unit is best viewed as two visible units, one having all of the incoming connections and one having all of the outgoing connections. When the auto-associative unit is not clamped, it is viewed as a hidden unit. 
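The propagation, backpropagation, and weight update cycle described above can be sketched as follows. This is a minimal NumPy illustration of equations 1, 5, 8 and 10 under our own choice of step size and iteration counts; the function and variable names are ours, not taken from the original simulator.

```python
import numpy as np

def logistic(u):
    # equation 3
    return 1.0 / (1.0 + np.exp(-u))

def rbp_step(W, I, T, is_output, eta=0.2, n_prop=20, n_back=20, dt=0.1):
    # One RBP weight update for a single pattern.
    # W: (N, N) weight matrix; I: input function; T: target values;
    # is_output: boolean mask of output units (J_i = T_i - x_i there).
    N = len(I)
    x = np.zeros(N)
    for _ in range(n_prop):                 # relax toward x^inf (equation 1)
        x += dt * (-x + logistic(W @ x) + I)
    J = np.where(is_output, T - x, 0.0)     # error signal (equation 5)
    g_prime = x * (1.0 - x)                 # logistic derivative at the fixed point
    y = np.zeros(N)
    for _ in range(n_back):                 # relax the adjoint system (equation 8)
        y += dt * (-y + W.T @ (g_prime * y) + J)
    W = W + eta * np.outer(y, x)            # delta rule update (equation 10)
    return W, x
```

Sweeping this step once over all patterns P_alpha corresponds to one epoch in the experiments below.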
The four remaining units are hidden. The error is measured by the differences between the activations (from the incoming connections) of the auto-associative units and the corresponding target values T_i for each pattern. In running the experiment, eight patterns were presented to the network, performing 1 to 5 propagations of the activations using Equation 1, 20 backpropagations of the error signals according to Equation 8, and one update (Equation 10) of the weights per presentation. We define an epoch to be a sweep through the eight patterns using the above formula of execution on each. The corresponding results using a learning rate of 0.2 are shown in figure 1. It can easily be seen that using one or two propagations does not suffice to set the hidden units to their correct values. However, the network does learn correctly how to reproduce the eight patterns when 3 or more propagations are used. \n\n[Figure 1: error as a function of the number of epochs for 1 to 5 propagations; plot not recoverable.] \n\nFigure 2: top: Maximum eigenvalues for the unstable fixed point as a function of the number of epochs. bottom: Error as a function of the number of epochs. \n\n... since it only has four hidden units. To increase the probability of getting a fixed point that is unstable, we make the initial weights range from -3 to 3 and set the thresholds so that [0.5] is a fixed point for one of the patterns. 
This fixed point is more likely to be unstable since the partial derivatives of the activation functions (which equal dg(u_i)/dx_j = w_ij x_i(1 - x_i) at the fixed point) are maximized at [x_i] = [0.5], and therefore the Jacobian is more likely to have large eigenvalues. Figure 2 shows the stability of that particular fixed point and the error as a function of the number of epochs. Three different simulations were done with different sets of random initial weights. As clearly shown in the figure, the network learns despite the absence of stable fixed points. Moreover, the observed fixed point(s) become stable as learning progresses. In the absence of stable fixed points, the weights are modified after a fixed number of propagations and backpropagations. Even though the state vector of the network is not precisely defined, the state space trajectory lies in a delimited volume. As learning progresses, the projection of this volume on the visible units diminishes to a single point (stable) and moves toward a target point that corresponds to the presented pattern on the visible units. Note that our energy function does not impose constraints on the state space trajectories projected on the hidden units [Pearlmutter 88]. \n\n\"RUNAWAY\" SIMULATIONS \n\nThe next question that arises is whether a recurrent network goes to the same fixed point at successive epochs (for a given input) and what happens if it does not. To answer this question, we construct two networks, one with only feedforward connections and one with feedback connections. Both networks have 3 modules (input, hidden and output) of 4 units each. The connections of the feedforward network are between the input and the hidden module and between the hidden and the output module. The connections of the recurrent network are identical except that there are also connections between the units of the hidden module. 
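The eigenvalue test used in these experiments can be sketched as follows: a small NumPy routine (ours, for illustration) that linearizes equation 1 about a fixed point and reports the largest real part among the eigenvalues of the matrix with entries w_ij x_i(1 - x_i); for the dynamics dx/dt = -x + g(Wx) + I, the fixed point is stable when this value stays below 1.

```python
import numpy as np

def max_eigenvalue(W, x):
    # Largest real part among the eigenvalues of M_ij = w_ij * x_i * (1 - x_i),
    # the non-identity part of the Jacobian of equation 1 at the fixed point x.
    # Stability of dx/dt = -x + g(Wx) + I requires this value to be below 1.
    g_prime = x * (1.0 - x)        # logistic derivative at the fixed point
    M = g_prime[:, None] * W       # row i scaled by g'(u_i)
    return float(np.max(np.linalg.eigvals(M).real))

# Hypothetical two-unit example at the maximally sensitive state [0.5, 0.5]:
W = np.array([[0.0, 0.5], [0.5, 0.0]])
x = np.array([0.5, 0.5])
print(max_eigenvalue(W, x))        # eigenvalues of M are +/-0.125: stable
```

This is the quantity reported as the maximum eigenvalue in Figure 2 and Table 1.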
The rationale behind this layout is to ensure fairness of comparison between feedforward and feedback backpropagation. Each network is presented sixteen distinct patterns on the input with sixteen different random patterns on the output. The patterns consist of zeros and ones. This task is purposely chosen to be fairly difficult (16 fixed points on the four hidden units for the recurrent net) and will make the evaluation of x^inf difficult. The learning curves for the networks are shown in Figure 3 for a learning rate of 0.2. We can see that the network with recurrent connections learns slightly faster than the feedforward network. However, a more careful analysis reveals that when the learning rate is increased, the recurrent network doesn't always learn properly. The success of the learning depends on the number of iterations we use in the computation of x^inf. As clearly shown in Figure 3, if we use 30 iterations for x^inf the network fails to learn, although 40 iterations yields reasonable results. The two cases only differ by the value of x^inf used when the error signals are backpropagated. \n\nAccording to our interpretation, recurrent backpropagation learns by moving the fixed points (or small volume state trajectories) toward target values (determined by the output). As learning progresses, the distances between the fixed points and the target values diminish, causing the error signals to become smaller and the learning to slow down. However, if the network doesn't come close enough to the fixed point (or the small volume state trajectory), the new error (the distance between the current state and the target) can suddenly be very large (relative to the distance between the fixed point and the target). Large incorrect error signals are then introduced into the system. 
There are two cases: if the learning rate is small, a near miss has little effect on the learning curve and RBP learns faster than the feedforward network. \n\nFigure 3: Error as a function of the number of epochs for a feedforward net (dotted) and a recurrent net (solid or dashed). top: The learning rate is set to 0.2. center: The learning rate is set to 1.0. The solid and the dashed lines are for the recurrent net with 30 and 40 iterations of x^inf per epoch respectively. bottom: The learning rate is variable. The recurrent network has a variable number of iterations of x^inf per epoch. \n\nFigure 4: State space and fixed points. x1 and x2 are the activations of two units of a fully connected network. left: Before learning, there is one stable fixed point. center: After learning a few patterns, there are two desired stable fixed points. right: After learning several patterns, there are two desired stable fixed points and one undesired stable fixed point. \n\nIf, on the other hand, the learning rate 
\n\nis big, a near miss will induce important incorrect error signals into the system which in \nturn makes the next miss more dramatic. This runaway situation is depicted on the center \nof Figure 3. To circumvent this problem we vary the number of propagations as needed \nuntil successive states on the state trajectory are sufficiently close. The resulting learning \ncurves for feed forward and recurrent nets are plotted at the bottom of Figure 3. In these \nsimulations the learning rates are adjusted dynamically so that successive error vectors \nare almost colinear, that is: \n\n-\n\n0.7 < cos(~w:j' ~W:1l) < 0.9 \n\n(12) \n\nAs can be seen recurrent and feed forward nets learn at the same speed. It is interesting to \nmention that the average learning rate for the recurrent net is significantly smaller (:::::: 0.65) \nthan for the feed forward net (:::::: 0.80). Surprisingly, this doesn't affect the learning speed. \n\nCONTENT ADDRESSABLE MEMORIBS \n\nAn interesting property of recurrent networks is their ability to generate fixed points that \ncan be used to perform content addressable memory [Lapedes 86, Pineda 87]. Initially, \na fully connected network usually has only one stable fixed point (all units undamped) \n(see Figure 4, left). By clamping a few (autoassociative) units to given patterns, it is \npossible, by learning, to create stable fixed points for the undamped network (Figure 4, \ncenter). 
To illustrate this property, we build a network of 6 units: 3 auto-associative units and 3 hidden units. The three auto-associative units are presented patterns with an odd number of ones in them (there are 4 such patterns on 3 units: 1 0 0, 0 1 0, 0 0 1 and 1 1 1). The network is fully connected. After 5000 epochs, the auto-associative units are unclamped for testing. All the fixed points found for the network of 6 (unclamped) units are given in table 1. \n\nauto-associative units | hidden units | maximum eigenvalue \n0.0402 0.0395 0.9800 | 0.8699 0.0763 0.0478 | 0.4419 \n0.9649 0.0176 0.0450 | 0.0724 0.8803 0.4596 | 0.6939 \n0.0830 0.9662 0.0658 | 0.2136 0.0880 0.8832 | 0.8470 \n0.9400 0.9619 0.9252 | 0.1142 0.1692 0.5164 | 0.8941 \n0.9076 0.5201 0.0391 | 0.0448 0.6909 0.7431 | 1.2702 \n\nTable 1: Fixed points for content addressable memory \n\nAs can be seen, the four stable fixed points are exactly the four patterns presented to the network. Moreover their stability guarantees that the network can be used for CAM (content addressable memory) or for one-to-many function learning. Indeed, if the network is presented incomplete or corrupted patterns (sufficiently close to a previously learned pattern), it will restore the pattern as soon as the incorrect or missing units are unclamped, by converging to a stable fixed point. If there are several correct pattern completions for the clamped units, the network will converge to one of the patterns depending on the initial conditions of the unclamped units (which determine the state space trajectory). These highly desirable properties are the main advantages of having feedback connections. We note from table 1 that a fifth (incorrect) fixed point has also been found. 
However, this fixed point is unstable (maximum eigenvalue = 1.27) and will therefore never be found during recursive searches. \n\nIn the previous example, there are no undesired stable fixed points. They are, however, likely to appear if the learning task becomes more complex (Figure 4, right). The reason why they are difficult to avoid is that unless the units are unclamped (the learning is stopped), the network cannot reach them. Algorithms which eliminate spurious fixed points are presently under study. \n\nCONCLUSION \n\nIn this paper, we have studied the effect of introducing feedback connections into feedforward networks. We have shown that the potential disadvantages of the algorithm, such as the absence of stable fixed points and chaotic behavior, can be overcome. The resulting systems have several interesting properties. First, allowing arbitrary connections makes a network more physiologically plausible by removing structural constraints on the topology. Second, the increased number of connections diminishes the sensitivity to noise and slightly improves the speed of learning. Finally, feedback connections allow the network to restore incomplete or corrupted patterns by following the state space trajectory to a stable fixed point. This property can also be used for one-to-many function learning. A limitation of the algorithm, however, is that spurious stable fixed points can lead to incorrect pattern completion. \n\nReferences \n\n[Almeida 87] Luis B. Almeida. In Proceedings of the IEEE First Annual International Conference on Neural Networks, San Diego, California, June 1987. \n\n[Lapedes 86] Alan S. Lapedes and Robert M. Farber. A self-optimizing nonsymmetrical neural net for content addressable memory and pattern recognition. Physica D22, 247-259, 1986. \n\n[Lippman 87] Richard P. 
Lippman. An introduction to computing with neural networks. IEEE ASSP Magazine, April 1987. \n\n[Jordan 88] Michael I. Jordan. Supervised learning and systems with excess degrees of freedom. COINS Technical Report 88-27. Massachusetts Institute of Technology, 1988. \n\n[Pearlmutter 88] Barak A. Pearlmutter. Learning state space trajectories in recurrent neural networks. In Proceedings of the Connectionist Models Summer School, pp. 113-117, 1988. \n\n[Pineda 87] Fernando J. Pineda. Generalization of backpropagation to recurrent and higher order neural networks. In Neural Information Processing Systems, New York, 1987. \n\n[Pineda 88] Fernando J. Pineda. Dynamics and architecture in neural computation. Journal of Complexity, special issue on neural networks, September 1988. \n\n[Simard 88] Patrice Y. Simard, Mary B. Ottaway and Dana H. Ballard. Analysis of recurrent backpropagation. Technical Report 253, Computer Science, University of Rochester, 1988. \n\n[Rosenblatt 62] F. Rosenblatt. Principles of Neurodynamics. New York: Spartan Books, 1962. \n\n[Rumelhart 86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by back-propagating errors. Nature, 323, 533-536, 1986. \n", "award": [], "sourceid": 181, "authors": [{"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "Mary", "family_name": "Ottaway", "institution": null}, {"given_name": "Dana", "family_name": "Ballard", "institution": null}]}