{"title": "Gradient and Hamiltonian Dynamics Applied to Learning in Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 274, "page_last": 280, "abstract": null, "full_text": "Gradient and Hamiltonian Dynamics Applied to Learning in Neural Networks \n\nJames W. Howse \n\nChaouki T. Abdallah \n\nGregory L. Heileman \n\nDepartment of Electrical and Computer Engineering \nUniversity of New Mexico \nAlbuquerque, NM 87131 \n\nAbstract \n\nThe process of machine learning can be considered in two stages: model selection and parameter estimation. In this paper a technique is presented for constructing dynamical systems with desired qualitative properties. The approach is based on the fact that an n-dimensional nonlinear dynamical system can be decomposed into one gradient and (n - 1) Hamiltonian systems. Thus, the model selection stage consists of choosing the gradient and Hamiltonian portions appropriately so that a certain behavior is obtainable. To estimate the parameters, a stably convergent learning rule is presented. This algorithm has been proven to converge to the desired system trajectory for all initial conditions and system inputs. This technique can be used to design neural network models which are guaranteed to solve the trajectory learning problem. \n\n1 Introduction \n\nA fundamental problem in mathematical systems theory is the identification of dynamical systems. System identification is a dynamic analogue of the functional approximation problem. A set of input-output pairs {u(t), y(t)} is given over some time interval t ∈ [T_i, T_f]. The problem is to find a model which, for the given input sequence, returns an approximation of the given output sequence. Broadly speaking, solving an identification problem involves two steps. The first is choosing a class of identification models which are capable of emulating the behavior of the actual system. 
The second is selecting a method to determine which member of this class of models best emulates the actual system. In this paper we present a class of nonlinear models and a learning algorithm for these models which are guaranteed to learn the trajectories of an example system. Algorithms to learn given trajectories of a continuous-time system have been proposed in [6], [8], and [7], to name only a few. To our knowledge, no one has ever proven that the error between the learned and desired trajectories vanishes for any of these algorithms. In our trajectory learning system this error is guaranteed to vanish. Our models extend the work in [1] by showing that Cohen's systems are one instance of the class of models generated by decomposing the dynamics into a component normal to some surface and a set of components tangent to the same surface. Conceptually this formalism can be used to design dynamical systems with a variety of desired qualitative properties. Furthermore, we propose a provably convergent learning algorithm which allows the parameters of Cohen's models to be learned from examples rather than being programmed in advance. The algorithm is convergent in the sense that the error between the model trajectories and the desired trajectories is guaranteed to vanish. This learning procedure is related to one discussed in [5] for use in linear system identification. \n\n2 Constructing the Model \n\nFirst some terminology will be defined. For a system of n first order ordinary differential equations, the phase space of the system is the n-dimensional space of all state components. A solution trajectory is a curve in phase space described by the differential equations for one specific starting point. At every point on a trajectory there exists a tangent vector. 
The space of all such tangent vectors for all possible solution trajectories constitutes the vector field for this system of differential equations. \n\nThe trajectory learning models in this paper are systems of first order ordinary differential equations. The form of these equations will be obtained by considering the system dynamics as motion relative to some surface. At each point in the state space an arbitrary system trajectory will be decomposed into a component normal to this surface and a set of components tangent to this surface. This approach was suggested to us by the results in [4], where it is shown that an arbitrary n-dimensional vector field can be decomposed locally into the sum of one gradient vector field and (n - 1) Hamiltonian vector fields. The concept of a potential function will be used to define these surfaces. A potential function V(x) is any scalar valued function of the system states x = [x_1, x_2, ..., x_n]^t which is at least twice continuously differentiable (i.e. V(x) ∈ C^r, r ≥ 2). The operation [·]^t denotes the transpose of the vector. If there are n components in the system state, the function V(x), when plotted with respect to all of the state components, defines a surface in an (n + 1)-dimensional space. There are two curves passing through every point on this potential surface which are of interest in this discussion; they are illustrated in Figure 1(a). The dashed curve is \n\nFigure 1: (a) The potential function V(x) = x_1² (x_1 - 1)² + x_2² plotted versus its two dependent variables x_1 and x_2. The dashed curve is called a level surface and is given by V(x) = 0.5. The solid curve follows the path of steepest descent through x_0. 
\n(b) The partitioning of a 3-dimensional vector field at the point x_0 into a 1-dimensional portion which is normal to the surface V(x) = K and a 2-dimensional portion which is tangent to V(x) = K. The vector -∇_x V(x)|_{x_0} is the normal vector to the surface V(x) = K at the point x_0. The plane (x - x_0)^t ∇_x V(x)|_{x_0} = 0 contains all of the vectors which are tangent to V(x) = K at x_0. Two linearly independent vectors are needed to form a basis for this tangent space; the pair Q_2(x) ∇_x V(x)|_{x_0} and Q_3(x) ∇_x V(x)|_{x_0} that are shown are just one possibility. \n\nreferred to as a level surface; it is a surface along which V(x) = K for some constant K. Note that this level surface is in general an (n - 1)-dimensional object. The solid curve moves downhill along V(x) following the path of steepest descent through the point x_0. The vector which is tangent to this curve at x_0 is normal to the level surface at x_0. The system dynamics will be designed as motion relative to the level surfaces of V(x). The results in [4] require n different local potential functions to achieve arbitrary dynamics. However, the results in [1] suggest that a considerable number of dynamical systems can be achieved using only a single global potential function. \n\nA system which is capable of traversing any downhill path along a given potential surface V(x) can be constructed by decomposing each element of the vector field into a vector normal to the level surface of V(x) which passes through each point and a set of vectors tangent to the level surface of V(x) which passes through the same point. So the potential function V(x) is used to partition the n-dimensional phase space into two subspaces. The first contains a vector field normal to some level surface V(x) = K for K ∈ ℝ, while the second subspace holds a vector field tangent to V(x) = K. 
The subspace containing all possible normal vectors to the (n - 1)-dimensional level surface at a given point has dimension one. This is equivalent to the statement that every point on a smooth surface has a unique normal direction. Similarly, the subspace containing all possible tangent vectors to the level surface at a given point has dimension (n - 1). An example of this partition in the case of a 3-dimensional system is shown in Figure 1(b). Since the space of all tangent vectors at each point on a level surface is (n - 1)-dimensional, (n - 1) linearly independent vectors are required to form a basis for this space. \n\nMathematically, there is a straightforward way to construct dynamical systems which either move downhill along V(x) or remain at a constant height on V(x). In this paper, dynamical systems which always move downhill along some potential surface are called gradient-like systems. These systems are defined by differential equations of the form \n\nẋ = -P(x) ∇_x V(x),     (1) \n\nwhere P(x) is a matrix function which is symmetric (i.e. P^t = P) and positive definite at every point x, and where ∇_x V(x) = [∂V/∂x_1, ∂V/∂x_2, ..., ∂V/∂x_n]^t. These systems are similar to the gradient flows discussed in [2]. The trajectories of the system formed by Equation (1) always move downhill along the potential surface defined by V(x). This can be shown by taking the time derivative of V(x), which is V̇(x) = -[∇_x V(x)]^t P(x) [∇_x V(x)] ≤ 0. Because P(x) is positive definite, V̇(x) can only be zero where ∇_x V(x) = 0; elsewhere V̇(x) is negative. This means that the trajectories of Equation (1) always move toward a level surface of V(x) formed by \"slicing\" V(x) at a lower height, as pointed out in [2]. It is also easy to design systems which remain at a constant height on V(x). Such systems will be denoted Hamiltonian-like systems. 
\nThey are specified by the equation \n\nẋ = Q(x) ∇_x V(x),     (2) \n\nwhere Q(x) is a matrix function which is skew-symmetric (i.e. Q^t = -Q) at every point x. These systems are similar to the Hamiltonian systems defined in [2]. The elements of the vector field defined by Equation (2) are always tangent to some level surface of V(x). Hence the trajectories of this system remain at a constant height on the potential surface given by V(x). Again this is indicated by the time derivative of V(x), which in this case is V̇(x) = [∇_x V(x)]^t Q(x) [∇_x V(x)] = 0. This indicates that the trajectories of Equation (2) always remain on the level surface on which the system starts. \n\nSo a model which can follow an arbitrary downhill path along the potential surface V(x) can be designed by combining the dynamics of Equations (1) and (2). The dynamics in the subspace normal to the level surfaces of V(x) can be defined using one equation of the form in Equation (1). Similarly, the dynamics in the subspace tangent to the level surfaces of V(x) can be defined using (n - 1) equations of the form in Equation (2). Hence the total dynamics for the model are \n\nẋ = -P(x) ∇_x V(x) + Σ_{i=2}^{n} Q_i(x) ∇_x V(x).     (3) \n\nFor this model the number and location of equilibria is determined by the function V(x), while the manner in which the equilibria are approached is determined by the matrices P(x) and Q_i(x). \n\nIf the potential function V(x) is bounded below (i.e. V(x) > B_l ∀ x ∈ ℝ^n, where B_l is a constant), eventually increasing (i.e. lim_{||x||→∞} V(x) → ∞), and has only a finite number of isolated local maxima and minima (i.e. in some neighborhood of every point where ∇_x V(x) = 0 there are no other points where the gradient vanishes), then the system in Equation (3) satisfies the conditions of Theorem 10 in [1]. 
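The two time-derivative identities above are easy to check numerically. The following sketch (illustrative, not from the paper) integrates flows of the form of Equations (1) and (2) for the example potential of Figure 1(a), using arbitrarily chosen constant matrices P (symmetric positive definite) and Q (skew-symmetric); it confirms that V decreases monotonically along the gradient-like flow and stays essentially constant along the Hamiltonian-like flow:

```python
import numpy as np

# Example potential from Figure 1(a): V(x) = x1^2 (x1 - 1)^2 + x2^2.
def V(x):
    return x[0]**2 * (x[0] - 1.0)**2 + x[1]**2

def grad_V(x):
    return np.array([2.0 * x[0] * (x[0] - 1.0) * (2.0 * x[0] - 1.0),
                     2.0 * x[1]])

# Illustrative constant choices: P symmetric positive definite, Q skew-symmetric.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
Q = np.array([[0.0, -1.0],
              [1.0,  0.0]])

def heights(field, x0, dt, steps):
    """Forward-Euler integration, recording V along the trajectory."""
    x = np.array(x0, dtype=float)
    out = [V(x)]
    for _ in range(steps):
        x = x + dt * field(x)
        out.append(V(x))
    return np.array(out)

# Gradient-like flow, Eq. (1): Vdot = -grad(V)^t P grad(V) <= 0.
h_grad = heights(lambda x: -P @ grad_V(x), [0.8, 1.0], dt=1e-3, steps=5000)
assert np.all(np.diff(h_grad) <= 1e-9)      # V never increases

# Hamiltonian-like flow, Eq. (2): Vdot = grad(V)^t Q grad(V) = 0.
h_ham = heights(lambda x: Q @ grad_V(x), [0.8, 1.0], dt=1e-4, steps=2000)
assert abs(h_ham[-1] - h_ham[0]) < 1e-2     # V constant up to Euler error
```

Any symmetric positive definite P(x) and skew-symmetric Q(x) give the same qualitative behavior; constant matrices are simply the smallest example.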
Therefore the system will converge, for all initial conditions, to one of the points where ∇_x V(x) = 0, called the critical points of V(x). Note that this system is capable of all downhill trajectories along the potential surface only if the (n - 1) vectors Q_i(x) ∇_x V(x), i = 2, ..., n, are linearly independent at every point x. It is shown in [1] that the potential function \n\nV(x) = C ( ∫_{X̄_1}^{x_1} ℒ_1(γ) dγ + Σ_{i=2}^{n} [ ½ (x_i - ℒ_i(x_1))² + ½ ∫_{X̄_i}^{x_1} ℒ_i(γ) [ℒ'_i(γ)]² dγ ] )     (4) \n\nsatisfies these three criteria. In this equation ℒ_i(x_1), i = 1, ..., n, are interpolation polynomials, C is a real positive constant, X̄_i, i = 1, ..., n, are real constants chosen so that the integrals are positive valued, and ℒ'_i(x_1) ≡ dℒ_i/dx_1. \n\n3 The Learning Rule \n\nIn Equation (3) the number and location of equilibria can be controlled using the potential function V(x), while the manner in which the equilibria are approached can be controlled with the matrices P(x) and Q_i(x). If it is assumed that the locations of the equilibria are known, then a potential function which has local minima and maxima at these points can be constructed using Equation (4). The problem of trajectory learning is thereby reduced to the problem of parameterizing the matrices P(x) and Q_i(x) and finding the parameter values which cause this model to best emulate the actual system. If the elements of P(x) and Q_i(x) are correctly chosen, then a learning rule can be designed which makes the model dynamics converge to that of the actual system. Assume that the dynamics given by Equation (3) are a parameterized model of the actual dynamics. Using this model and samples of the actual system states, an estimator for the states of the actual system can be designed. The behavior of the model is altered by changing its parameters, so a parameter estimator must also be constructed. 
The following theorem provides a form for both the state and parameter estimators which guarantees convergence to a set of parameters for which the error between the estimated and target trajectories vanishes. \n\nTheorem 3.1. Given the model system \n\nẋ = Σ_{i=1}^{k} A_i f_i(x) + B g(u)     (5) \n\nwhere A_i ∈ ℝ^{n×n} and B ∈ ℝ^{n×m} are unknown, and f_i(·) and g(·) are known smooth functions such that the system has bounded solutions for bounded inputs u(t). Choose a state estimator of the form \n\ndx̂/dt = R_B (x̂ - x) + Σ_{i=1}^{k} Â_i f_i(x) + B̂ g(u)     (6) \n\nwhere R_B is an (n × n) matrix of real constants whose eigenvalues must all be in the left half plane, and Â_i and B̂ are the estimates of the actual parameters. Choose parameter estimators of the form \n\ndÂ_i/dt = -R_P (x̂ - x) [f_i(x)]^t ∀ i = 1, ..., k,     dB̂/dt = -R_P (x̂ - x) [g(u)]^t     (7) \n\nwhere R_P is an (n × n) matrix of real constants which is symmetric and positive definite, and (x̂ - x) [·]^t denotes an outer product. For these choices of state and parameter estimators, lim_{t→∞} (x̂(t) - x(t)) = 0 for all initial conditions. Furthermore, this remains true if any of the elements of Â_i or B̂ are set to 0, or if any of these matrices are restricted to being symmetric or skew-symmetric. \n\nThe proof of this theorem appears in [3]. Note that convergence of the parameter estimates to the actual parameter values is not guaranteed by this theorem. The model dynamics in Equation (3) can be cast in the form of Equation (5) by choosing each element of P(x) and Q_i(x) to have the form \n\nP_rs = Σ_{j=1}^{n} Σ_{k=0}^{l-1} Ξ_rsjk ϑ_k(x_j)     and     Q_rs = Σ_{j=1}^{n} Σ_{k=0}^{l-1} Λ_rsjk ξ_k(x_j),     (8) \n\nwhere {ϑ_0(x_j), ϑ_1(x_j), ..., ϑ_{l-1}(x_j)} and {ξ_0(x_j), ξ_1(x_j), ..., ξ_{l-1}(x_j)} are a set of l orthogonal polynomials which depend on the state x_j. There is a set of such polynomials for every state x_j, j = 1, 2, ..., n. 
The constants Ξ_rsjk and Λ_rsjk determine the contribution of the kth polynomial, which depends on the jth state, to the value of P_rs and Q_rs respectively. In this case the dynamics in Equation (3) become \n\nẋ = Σ_{j=1}^{n} Σ_{k=0}^{l-1} { Ξ_jk [ϑ_k(x_j) ∇_x V(x)] + Σ_{i=2}^{n} Λ_ijk [ξ_k(x_j) ∇_x V(x)] } + Υ g(u(t))     (9) \n\nwhere Ξ_jk is the (n × n) matrix of all values Ξ_rsjk which have the same value of j and k. Likewise Λ_ijk is the (n × n) matrix of all values Λ_rsjk, having the same value of j and k, which are associated with the ith matrix Q_i(x). This system has m inputs, which may explicitly depend on time, represented by the m-element vector function u(t). The m-element vector function g(·) is a smooth, possibly nonlinear, transformation of the input function. The matrix Υ is an (n × m) parameter matrix which determines how much of input s ∈ {1, ..., m} affects state r ∈ {1, ..., n}. Appropriate state and parameter estimators can be designed based on Equations (6) and (7) respectively. \n\n4 Simulation Results \n\nNow an example is presented in which the parameters of the model in Equation (9) are trained, using the learning rule in Equations (6) and (7), on one input signal and then tested on a different input signal. The actual system has three equilibrium points: two stable points located at (1, 3) and (3, 5), and a saddle point between them. In this example the dynamics of both the actual system and the model are given by \n\nẋ = [ φ_1 + φ_2 x_1² + φ_3 x_2² , 0 ; 0 , φ_4 + φ_5 x_1 + φ_6 x_2 ] ∇_x V(x) + [ 0 , -(φ_7 + φ_8 x_1 + φ_9 x_2) ; φ_7 + φ_8 x_1 + φ_9 x_2 , 0 ] ∇_x V(x) + [ φ_10 ; 0 ] u(t)     (10) \n\nwhere V(x) is defined in Equation (4), u(t) is a time varying input, and semicolons separate matrix rows. For the actual system the parameter values were φ_1 = φ_4 = -4, φ_2 = φ_5 = -2, φ_3 = φ_6 = -1, φ_7 = 1, φ_8 = 3, φ_9 = 5, and φ_10 = 1. 
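The estimator of Theorem 3.1 can be exercised on a minimal scalar instance of Equation (5). This is a sketch, not the two-dimensional experiment of this section: the gains rB and rP, the initial conditions, and the two-tone input are all illustrative choices.

```python
import numpy as np

# Scalar instance of Eq. (5): xdot = a f(x) + b g(u), with a and b unknown.
a_true, b_true = -2.0, 1.0
f = lambda x: x                                  # known smooth function
g = lambda u: u                                  # known input transformation
u = lambda t: np.sin(t) + np.sin(0.5 * t)        # bounded two-tone input

rB = -5.0    # state-estimator gain: eigenvalue in the left half plane
rP = 10.0    # adaptation gain: symmetric positive definite (scalar here)

dt, T = 1e-3, 40.0
x, xh, ah, bh = 0.5, 0.0, 0.0, 0.0               # plant state and estimates
for k in range(int(T / dt)):
    uk = g(u(k * dt))
    # plant (forward-Euler step)
    x = x + dt * (a_true * f(x) + b_true * uk)
    # state estimator, Eq. (6)
    xh = xh + dt * (rB * (xh - x) + ah * f(x) + bh * uk)
    # parameter estimators, Eq. (7)
    err = xh - x
    ah = ah + dt * (-rP * err * f(x))
    bh = bh + dt * (-rP * err * uk)

state_error = abs(xh - x)    # driven to zero by Theorem 3.1
```

As the theorem warns, only the state error is guaranteed to vanish; the estimates ah and bh approach a_true and b_true here only because the two-tone input happens to excite the system persistently.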
\nIn the model the 10 elements φ_i are treated as the unknown parameters which must be learned. Note that the first matrix function is the -P(x) term of Equation (3), so P(x) is positive definite when the parameters φ_1-φ_6 are all negative valued. The second matrix function is skew-symmetric for all values of φ_7-φ_9. The two input signals used for training and testing were u_1 = 10000 (sin ! 1000t + sin ~ 1000t) and u_2 = 5000 sin 1000t. The phase space responses of the actual system to the inputs u_1 and u_2 are shown by the solid curves in Figures 3(b) and 3(a) respectively. Notice that both of these inputs produce a periodic attractor in the phase space of Equation (10). In order to evaluate the effectiveness of the learning algorithm, the Euclidean distance between the actual and learned state and parameter values was computed and plotted versus time. The results are shown in Figure 2. Figure 2(a) shows these statistics when \n\nFigure 2: (a) The state and parameter errors for training using input signal u_1. The solid curve is the Euclidean distance between the state estimates and the actual states as a function of time. The dashed curve shows the distance between the estimated and actual parameter values versus time. \n(b) The state and parameter errors for training using input signal u_2. \n\ntraining with input u_1, while Figure 2(b) shows the same statistics for input u_2. The solid curves are the Euclidean distance between the learned and actual system states, and the dashed curves are the distance between the learned and actual parameter values. These statistics have two noteworthy features. 
First, the error between the learned and desired states quickly converges to very small values, regardless of how well the actual parameters are learned. This result was guaranteed by Theorem 3.1. Second, the final error between the learned and desired parameters is much lower when the system is trained with input u_1. Intuitively this is because input u_1 excites more frequency modes of the system than input u_2. Recall that in a nonlinear system the frequency modes excited by a given input do not depend solely on the input, because the system can generate frequencies not present in the input. The quality of the learned parameters can be qualitatively judged by comparing the phase plots using the learned and actual parameters for each input, as shown in Figure 3. In Figure 3(a) the system was trained using input u_1 and tested with input u_2, while in Figure 3(b) the situation was reversed. The solid curves are the system response using the actual parameter values, and the dashed curves are the response for the learned parameters. The Euclidean distance between the target and test trajectories in Figure 3(a) is in the range (0, 0.64) with a mean distance of 0.21 and a standard deviation of 0.14. The distance between the target and test trajectories in Figure 3(b) is in the range (0, 4.53) with a mean distance of 0.98 and a standard deviation of 1.35. Qualitatively, both sets of learned parameters give an accurate response for non-training inputs. \n\nFigure 3: (a) A phase plot of the system response when trained with input u_1 and tested with input u_2. The solid line is the response to the test input using the actual parameters. The dotted line is the system response using the learned parameters. 
\n(b) A phase plot of the system response when trained with input u_2 and tested with input u_1. \n\nNote that even when the error between the learned and actual parameters is large, the periodic attractor resulting from the learned parameters appears to have the same \"shape\" as that for the actual parameters. \n\n5 Conclusion \n\nWe have presented a conceptual framework for designing dynamical systems with specific qualitative properties by decomposing the dynamics into a component normal to some surface and a set of components tangent to the same surface. We have presented a specific instance of this class of systems which converges to one of a finite number of equilibrium points. By parameterizing these systems, the manner in which these equilibrium points are approached can be fitted to an arbitrary data set. We present a learning algorithm to estimate these parameters which is guaranteed to converge to a set of parameter values for which the error between the learned and desired trajectories vanishes. \n\nAcknowledgments \n\nThis research was supported by a grant from Boeing Computer Services under Contract W-300445. The authors would like to thank Vangelis Coutsias, Tom Caudell, and Bill Horne for stimulating discussions and insightful suggestions. \n\nReferences \n\n[1] M.A. Cohen. The construction of arbitrary stable dynamics in nonlinear neural networks. Neural Networks, 5(1):83-103, 1992. \n\n[2] M.W. Hirsch and S. Smale. Differential equations, dynamical systems, and linear algebra, volume 60 of Pure and Applied Mathematics. Academic Press, Inc., San Diego, CA, 1974. \n\n[3] J.W. Howse, C.T. Abdallah, and G.L. Heileman. A gradient-Hamiltonian decomposition for designing and learning dynamical systems. Submitted to Neural Computation, 1995. \n\n[4] R.V. Mendes and J.T. Duarte. Decomposition of vector fields and mixed dynamics. Journal of Mathematical Physics, 22(7):1420-1422, 1981. \n\n[5] K.S. Narendra and A.M. Annaswamy. 
Stable adaptive systems. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1989. \n\n[6] B.A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263-269, 1989. \n\n[7] D. Saad. Training recurrent neural networks via trajectory modification. Complex Systems, 6(2):213-236, 1992. \n\n[8] M.-A. Sato. A real time learning algorithm for recurrent analog neural networks. Biological Cybernetics, 62(2):237-241, 1990. \n", "award": [], "sourceid": 1033, "authors": [{"given_name": "James", "family_name": "Howse", "institution": null}, {"given_name": "Chaouki", "family_name": "Abdallah", "institution": null}, {"given_name": "Gregory", "family_name": "Heileman", "institution": null}]}