{"title": "Adjoint-Functions and Temporal Learning Algorithms in Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 113, "page_last": 120, "abstract": null, "full_text": "Adjoint-Functions and Temporal Learning Algorithms in Neural Networks

N. Toomarian and J. Barhen
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, CA 91109

Abstract

The development of learning algorithms is generally based upon the minimization of an energy function. It is a fundamental requirement to compute the gradient of this energy function with respect to the various parameters of the neural architecture, e.g., synaptic weights, neural gains, etc. In principle, this requires solving a system of nonlinear equations for each parameter of the model, which is computationally very expensive. A new methodology for neural learning of time-dependent nonlinear mappings is presented. It exploits the concept of adjoint operators to enable a fast global computation of the network's response to perturbations in all of the system's parameters. The importance of the time boundary conditions of the adjoint functions is discussed. An algorithm is presented in which the adjoint sensitivity equations are solved simultaneously (i.e., forward in time) along with the nonlinear dynamics of the neural networks. This methodology makes real-time applications and hardware implementation of temporal learning feasible.

1 INTRODUCTION

Early efforts in the area of training artificial neural networks have largely focused on the study of schemes for encoding nonlinear mappings characterized by time-independent inputs and outputs. The most widely used approach in this context has been the error backpropagation algorithm (Werbos, 1974), which involves either static, i.e., \"feedforward\" (Rumelhart, 1986), or dynamic, i.e., \"recurrent\" (Pineda, 1988) networks.
In this context, Barhen et al. (1989, 1990a, 1990b) have exploited the concepts of adjoint operators and terminal attractors. These concepts provide a firm mathematical foundation for learning such mappings with dynamical neural networks, while achieving a considerable reduction in the overall computational costs (Barhen et al., 1991).

Recently, there has been wide interest in developing learning algorithms capable of modeling time-dependent phenomena (Hirsch, 1989). In a more restricted, application-oriented domain, attention has focused on learning temporal sequences. The problem can be formulated as minimization, over an arbitrary but finite time interval, of an appropriate error functional. Thus, the gradients of the functional with respect to the various parameters of the neural architecture, e.g., synaptic weights, neural gains, etc., must be computed.

A number of methods have been proposed for carrying out this task, a recent survey of which can be found in (Pearlmutter, 1990). Here, we will briefly mention only those which are relevant to our work. Williams and Zipser (1989) discuss a scheme similar to the well known \"Forward Sensitivity Equations\" of sensitivity theory (Cacuci, 1981 and Toomarian et al., 1987), in which the same set of sensitivity equations has to be solved again and again for each network parameter of interest. Clearly, this is computationally very expensive, and scales poorly to large systems. Pearlmutter (1989), on the other hand, describes a variational approach which yields a set of equations similar to the \"Adjoint Sensitivity Equations\" (Cacuci, 1981 and Toomarian et al., 1987). These equations must be solved backwards in time and involve storage of the state variables from the activation network dynamics, which is impractical.
These authors (Toomarian and Barhen, 1990) have suggested a new method which, in contradistinction to previous approaches, solves the adjoint system of equations forward in time, concomitantly with the neural activation dynamics. A potential drawback of this method lies in the fact that these adjoint equations have to be treated in terms of distributions, which precludes straightforward numerical implementation. Finally, Pineda (1990) suggests combining the existence of disparate time scales with a heuristic gradient computation. However, the underlying adiabatic assumptions and highly \"approximate\" gradient evaluation technique place severe limits on the applicability of his approach.

In this paper we introduce a rigorous derivation of two novel systems of adjoint equations, which can be solved simultaneously (i.e., forward in time) with the network dynamics, and thereby enable the implementation of temporal learning algorithms in a computationally efficient manner. Numerical simulations and comparisons with previously available results will be presented elsewhere (Toomarian and Barhen, 1991).

2 TEMPORAL LEARNING

We formalize a neural network as an adaptive dynamical system whose temporal evolution is governed by the following set of coupled nonlinear differential equations:

\dot{u}_n + \kappa_n u_n = g_n\big[\gamma_n\big(\textstyle\sum_m T_{nm} u_m + I_n\big)\big] \qquad t > 0    (1)

where u_n represents the output of the n-th neuron [u_n(0) being the initial state], and T_{nm} denotes the synaptic coupling from the m-th to the n-th neuron. The constant \kappa_n characterizes the decay of neuron activity. The sigmoidal function g(\cdot) modulates the neural response, with gain given by \gamma_n; typically, g(\gamma x) = \tanh(\gamma x).
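As an illustration, the activation dynamics of Eq. (1) can be integrated with a simple explicit Euler scheme. The network size, time step, synaptic matrix, and input pattern below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical sizes and parameters; the paper leaves these unspecified.
N, dt, L = 8, 0.01, 500
rng = np.random.default_rng(0)
T = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))  # synaptic couplings T_nm
kappa = np.ones(N)                                   # decay constants kappa_n
gamma = np.ones(N)                                   # neural gains gamma_n

def source(t, n_inputs=3):
    """Hypothetical input term I_n(t): a pattern on the input partition, zero elsewhere."""
    I = np.zeros(N)
    I[:n_inputs] = np.sin(t + np.arange(n_inputs))
    return I

u = np.zeros(N)  # initial state u_n(0) = 0
for l in range(L):
    # Euler step of  du_n/dt + kappa_n u_n = g_n[gamma_n (sum_m T_nm u_m + I_n)]
    g = np.tanh(gamma * (T @ u + source(l * dt)))
    u = u + dt * (-kappa * u + g)
```

With kappa_n = 1 and the bounded sigmoid, the trajectory stays within the unit cube, which is why such dynamics are well suited to fixed-step forward integration.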
The time-dependent \"source\" term, I_n(t), encodes the component contributions of the target temporal pattern a(t) via the expression

I_n(t) = \begin{cases} a_n(t) & \text{if } n \in S_X \\ 0 & \text{if } n \in S_H \cup S_Y \end{cases}    (2)

The topographic input, output, and hidden network partitions S_X, S_Y and S_H, respectively, are architectural requirements related to the encoding of mapping-type problems. Details are given in Barhen et al. (1989).

To proceed formally with the development of a temporal learning algorithm, we consider an approach based upon the minimization of a \"neuromorphic\" energy functional E, given by the following expression

E(u, p) = \int_{t_0}^{t_f} \tfrac{1}{2} \textstyle\sum_n r_n^2 \, dt = \int_{t_0}^{t_f} F \, dt    (3)

where

r_n(t) = \begin{cases} a_n(t) - u_n(t) & \text{if } n \in S_Y \\ 0 & \text{if } n \in S_X \cup S_H \end{cases}    (4)

In our model the internal dynamical parameters of interest are the synaptic strengths T_{nm} of the interconnection topology, the characteristic decay constants \kappa_n, and the gain parameters \gamma_n. Therefore, the vector of system parameters (Barhen et al., 1990b) should be

p = \{T_{11}, \ldots, T_{NN};\; \kappa_1, \ldots, \kappa_N;\; \gamma_1, \ldots, \gamma_N\}    (5a)

In this paper, however, for illustration purposes and simplicity, we will limit ourselves in terms of parameters to the synaptic interconnections only. Hence, the vector of system parameters will have M = N^2 elements

p = \{T_{11}, \ldots, T_{NN}\}    (5b)

We will assume that elements of p are, in principle, independent. Furthermore, we will also assume that, for a specific choice of parameters and set of initial conditions, a unique solution of Eq. (1) exists. Hence, u is an implicit function of p.

Lyapunov stability requires the energy functional to be monotonically decreasing during learning time, \tau. This translates into

\frac{dE}{d\tau} = \sum_{\mu=1}^{M} \frac{dE}{dp_\mu} \cdot \frac{dp_\mu}{d\tau} < 0    (6)

Thus, one can always choose, with \eta > 0,

\frac{dp_\mu}{d\tau} = -\eta \, \frac{dE}{dp_\mu}    (7)

Integrating the above dynamical system over the interval [\tau, \tau + \Delta\tau], one obtains

p_\mu(\tau + \Delta\tau) = p_\mu(\tau) - \eta \int_\tau^{\tau+\Delta\tau} \frac{dE}{dp_\mu} \, d\tau    (8)

Equation (8) implies that, in order to update a system parameter p_\mu, one must evaluate the gradient of E with respect to p_\mu in the interval [\tau, \tau + \Delta\tau]. Furthermore, using Eq. (3) and observing that the time integral and the derivative with respect to p_\mu permute, one can write

\frac{dE}{dp_\mu} = \int_t \frac{dF}{dp_\mu} \, dt = \int_t \frac{\partial F}{\partial p_\mu} \, dt + \int_t \frac{\partial F}{\partial \bar{u}} \cdot \frac{\partial \bar{u}}{\partial p_\mu} \, dt    (9)

Since F is known analytically [viz. Eq. (3)], computation of \partial F / \partial u_n and \partial F / \partial p_\mu is straightforward:

\partial F / \partial u_n = -r_n    (10a)

\partial F / \partial p_\mu = 0    (10b)

Thus, the quantity that needs to be determined is the vector \partial \bar{u} / \partial p_\mu. Differentiating the activation dynamics, Eq. (1), with respect to p_\mu, we observe that the time derivative and the partial derivative with respect to p_\mu commute. Using the shorthand notation \partial(\cdots)/\partial p_\mu = (\cdots)_{,\mu}, we obtain a set of equations to be referred to as \"Forward Sensitivity Equations\" (FSE):

\dot{u}_{n,\mu} + \sum_m A_{nm} \, u_{m,\mu} = S_{n,\mu} \quad t > 0, \qquad u_{n,\mu} = 0 \quad t = 0    (12)

in which

A_{nm} = \kappa_n \delta_{nm} - \gamma_n \dot{g}_n T_{nm}    (13)

S_{n,\mu} = \gamma_n \dot{g}_n \sum_m u_m \, \delta_{p_\mu, T_{nm}}    (14)

where \dot{g}_n represents the derivative of g_n with respect to its argument, and \delta denotes the Kronecker symbol. Since the initial conditions of the activation dynamics, Eq. (1), are excluded from the system parameter vector p, the initial conditions of the forward sensitivity equations will be taken as zero. Computation of the gradients, via Eq. (9), using the forward sensitivity scheme as proposed by Williams and Zipser (1989), would require solving Eq. (12) N^2 times, since the source term explicitly depends on p_\mu.
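A minimal sketch of the forward-sensitivity scheme for a single synaptic weight T_ij follows; all sizes, weights, and the initial state are hypothetical, and the full scheme would repeat this propagation for every one of the N^2 weights:

```python
import numpy as np

# Forward Sensitivity Equations for one parameter p_mu = T_ij (hypothetical setup).
N, dt, L = 6, 0.01, 400
i, j = 1, 2                          # the single weight T_ij tracked here
rng = np.random.default_rng(1)
T = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))
kappa = np.ones(N)
gamma = np.ones(N)
I = np.zeros(N)                      # constant (zero) input, for simplicity

u = rng.uniform(-0.5, 0.5, N)        # illustrative initial state u_n(0)
w = np.zeros(N)                      # w_n = du_n/dT_ij, zero initial condition

for l in range(L):
    x = gamma * (T @ u + I)
    gdot = 1.0 - np.tanh(x) ** 2                      # g'_n at the current state
    A = np.diag(kappa) - (gamma * gdot)[:, None] * T   # A_nm = kappa_n d_nm - gamma_n g'_n T_nm
    S = np.zeros(N)
    S[i] = gamma[i] * gdot[i] * u[j]                   # source: nonzero only in row i
    w = w + dt * (S - A @ w)                           # Euler step of the sensitivity ODE
    u = u + dt * (-kappa * u + np.tanh(x))             # Euler step of the activation dynamics
```

Since each of the N^2 weights needs its own copy of this N-dimensional linear system, the per-step cost is of order N^4 multiply-accumulates, which is the scaling bottleneck of the forward-sensitivity approach.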
The system of equations (12) has N equations, each of which requires summation over all N neurons. Hence, the amount of computation (measured in multiply-accumulates) scales like N^4 per time step. We assume that the interval between t_0 and t_f is divided into L time steps. Therefore, the total number of multiply-accumulates scales like N^4 L. Clearly, the scaling properties of this approach are very poor, and it cannot be practically applied to very large networks. On the other hand, this method also has inherent advantages. The FSE are solved forward in time, along with the nonlinear dynamics of the neural networks. Therefore, there is no need for a large amount of memory: since u_{n,\mu} has N^3 components, that is all that needs to be stored.

In order to reduce the computational costs, an alternative approach can be considered. It is based upon the concept of adjoint operators, and eliminates the need for the explicit appearance of u_{,\mu} in Eq. (9). A vector of adjoint functions, \bar{v}, is obtained, which contains all the information required for computing all the \"sensitivities\" dE/dp_\mu. The necessary and sufficient conditions for constructing adjoint equations are discussed elsewhere (Toomarian et al., 1987 and references therein).

It can be shown that an Adjoint System of Equations (ASE), pertaining to the forward system of equations (12), can be formally written as

-\dot{v}_n + \sum_m A_{mn} v_m = \bar{s}_n \qquad t > 0    (15)

In order to specify Eq. (15) in closed mathematical form, we must define the source term \bar{s}_n and the time-boundary conditions for the system. Both should be independent of p_\mu and its derivatives.

By identifying \bar{s}_n with \partial F / \partial u_n and selecting the final time condition v(t = t_f) = 0, a system of equations is obtained which is similar to those proposed by Pearlmutter. The method requires that the neural activation dynamics, i.e., Eq.
(1), be solved first forward in time, followed by the ASE, Eq. (15), integrated backwards in time. The computational requirement of this approach scales as N^2 L. However, a major drawback to date has resided with the necessity to store quantities such as \dot{g}, \bar{s} and S_{,\mu} at each time step. Thus, the memory requirements for this method scale as N^2 L.

By selecting \bar{s} = \partial F / \partial \bar{u} - \bar{v} \, \delta(t - t_f) and initial conditions v(t = 0) = 0, these authors (Toomarian and Barhen, 1990) have suggested a method which, in contradistinction to previous approaches, enables the ASE to be integrated forward in time, i.e., concomitantly with the neural activation dynamics. This approach saves a large amount of storage, which scales only as N^2. The computational complexity of this method is similar to that of backward integration, and scales as N^2 L. A potential drawback lies in the fact that Eq. (15) must then be treated in terms of distributions, which precludes straightforward numerical implementation.

At this stage, we introduce a new paradigm which will enable us to evolve the adjoint dynamics, Eq. (15), forward in time, but without the difficulties associated with solutions in the sense of distributions. We multiply the FSE, Eq. (12), by \bar{v} and the ASE, Eq. (15), by \bar{u}_{,\mu}, subtract the two resulting equations, and integrate over the time interval (t_0, t_f). This procedure yields the bilinear form:

(\bar{v} \cdot \bar{u}_{,\mu})_{t_f} - (\bar{v} \cdot \bar{u}_{,\mu})_{t_0} = \int_{t_0}^{t_f} \big[(\bar{v} \cdot \bar{S}_{,\mu}) - (\bar{u}_{,\mu} \cdot \bar{s})\big] \, dt    (16)

To proceed, we select

\bar{s} = \partial F / \partial \bar{u}, \qquad \bar{v}(t = 0) = 0    (17)

Thus, Eq. (16) can be rewritten as:

\int_{t_0}^{t_f} \frac{\partial F}{\partial \bar{u}} \cdot \bar{u}_{,\mu} \, dt = \int_{t_0}^{t_f} \bar{s} \cdot \bar{u}_{,\mu} \, dt = \int_{t_0}^{t_f} \bar{v} \cdot \bar{S}_{,\mu} \, dt - [\bar{v} \cdot \bar{u}_{,\mu}]_{t_f}    (18)

The first term in the RHS of Eq. (18) can be computed by using the values of \bar{v} obtained by solving the ASE, Eqs. (15) and (17), forward in time. The main difficulty resides in the evaluation of the second term in the RHS of Eq.
(18), i.e., [\bar{v} \cdot \bar{u}_{,\mu}]_{t_f}. To compute it, we now introduce an auxiliary adjoint system:

-\dot{z}_n + \sum_m A_{mn} z_m = \tilde{s}_n \qquad t > 0    (19)

in which we select

\tilde{s} = \bar{v} \, \delta(t - t_f), \qquad \bar{z}(t_f) = 0    (20)

Note that, even though we selected z(t_f) = 0, we are also interested in solving this auxiliary adjoint system forward in time. Thus, the critical issue is how to select the initial condition, i.e., z(t_0), that would result in z(t_f) = 0. The bilinear form associated with the dynamical systems \bar{u}_{,\mu} and \bar{z} can be derived in a similar fashion to Eq. (16). Its expression is:

(\bar{z} \cdot \bar{u}_{,\mu})_{t_f} - (\bar{z} \cdot \bar{u}_{,\mu})_{t_0} = \int_{t_0}^{t_f} \big[(\bar{z} \cdot \bar{S}_{,\mu}) - (\bar{u}_{,\mu} \cdot \tilde{s})\big] \, dt    (21)

Incorporating \tilde{s}, z(t_f), and the initial condition of Eq. (12) into Eq. (21), we obtain:

\int_{t_0}^{t_f} (\bar{u}_{,\mu} \cdot \tilde{s}) \, dt = [\bar{v} \cdot \bar{u}_{,\mu}]_{t_f} = \int_{t_0}^{t_f} (\bar{z} \cdot \bar{S}_{,\mu}) \, dt    (22)

In order to provide a simple illustration of how the problem of selecting the initial conditions for the z-dynamics can be addressed, we assume, for a moment, that the matrix A in Eq. (19) is time independent. Hence, the formal solution of Eq. (19) can be written as:

z(t) = e^{A^T (t - t_0)} \, z(t_0)    (23a)

z(t_f) = e^{A^T (t_f - t_0)} \, z(t_0) - \bar{v}(t_f)    (23b)

Therefore, in principle, Eq. (22) can be expressed in terms of z(t_0), using Eq. (23a). At time t_f, where \bar{v}(t_f) is known from the solution of Eq. (15), one can calculate the vector z(t_0) from Eq. (23b), with z(t_f) = 0.

In the problem under consideration, however, the matrix A in Eq. (19) is time dependent (viz. Eq. (13)). Thus the auxiliary adjoint equations will be solved by means of finite differences. Usually, the same numerical scheme that is used for Eqs. (1) and (15) will be adopted.
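The finite-difference treatment that follows can be sketched numerically: propagate the one-step kernels forward, solve a linear system for the z(t_0) that yields the required terminal value, and confirm that forward integration of the z-dynamics recovers v(t_f). The step matrices A^l and the terminal vector v(t_f) below are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical discretization: L steps of size dt, with time-dependent matrices A^l.
N, dt, L = 5, 0.01, 100
rng = np.random.default_rng(2)
A_steps = [rng.normal(scale=0.5, size=(N, N)) for _ in range(L)]  # A^l, l = 0..L-1
v_tf = rng.normal(size=N)                                         # v(t_f) from the ASE

# Kernel propagation: accumulate the ordered product B^{L-1} ... B^1 B^0,
# with one-step kernels B^l = I + dt * A^l.
B_prod = np.eye(N)
for A in A_steps:
    B_prod = (np.eye(N) + dt * A) @ B_prod

# Solve the algebraic system  B_prod z(t_0) = v(t_f)  for the initial condition.
z0 = np.linalg.solve(B_prod, v_tf)

# Forward first-order integration from z(t_0) reproduces v(t_f) at the last step.
z = z0.copy()
for A in A_steps:
    z = z + dt * (A @ z)
```

The two loops apply the same ordered product of near-identity kernels, which is why the forward sweep lands on v(t_f); the expensive part is the matrix-matrix product in the first loop, of order N^3 per step.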
For illustrative purposes, we limit the discussion in \nthe sequel to the first order approximation i.e.; \n\n( -1+1 \nZ \n\n-\ndt \n\n-I) \nZ \n\n+ \n\nA' -1 \n\nz = \n0 \n\no < I < L \n\n(24) \n\n\fAdjoint-Functions and Temporal Learning Algorithms in Neural Networks \n\n119 \n\nFrom this equation one can easily show that \n\nzl+1 = B' . B'- 1 ... Bl . BOz(t o ) \n\nin which \n\nB' = I + ilt A' \n\n(25) \n\n(26) \n\nwhere I is the identity matrix. Thus, the RHS of Eq. (22) can be rewritten as: \n\n[v u.~]tJ = [LB(l-I)!S.~]z(to) ilt \n\n(27) \n\nI \n\nThe initial conditions z( to) can easily be found at time t\" i.e., at iteration stop L I \nby solving the algebraic equation: \n\nB(L-l)!z(to ) = vet,) \n\n(28) \n\nIn summary, the computation of the gradients i.e. Eq. (8) involves two stages, \ncorresponding to the two terms in the RHS of Eq. (18). The first term is calcu(cid:173)\nlated using the adjoint functions v obtained from Eq. (15). The computational \ncomplexity is N 2 L. The second term is calculated via Eq. (27), and involves two \nsteps: a) kernel propagation, which requires multiplication of two matrices B' and \nB(l-I) at each time step; the computational complexity scales as N 3 L; b) numerical \nintegration via Eq. (24) which requires a matrix vector multiplication at each time \nstep; hence, it scales as N2 L. Thus, the overall computational complexity of this \napproach is of the order N 3 L. Notice, however, that here the storage needed is \nminimal and equal to N 2 . \n\n3 CONCLUSIONS \n\nA new methodology for neural learning of time-dependent nonlinear mappings is \npresented. It exploits the concept of adjoint operators. The resulting algorithm \nenables computation of the gradient of an energy function with respect to various \nparameters of the network architecture in a highly efficient manner. 
Specifically, it combines the advantage of the dramatic reductions in computational complexity inherent in adjoint methods with the ability to solve the equations forward in time. Not only is a large amount of computation and storage saved, but the handling of real-time applications also becomes possible. This methodology also makes the hardware implementation of temporal learning attractive.

Acknowledgments

This research was carried out at the Center for Space Microelectronics Technology, Jet Propulsion Laboratory, California Institute of Technology. Support for the work came from Agencies of the U.S. Department of Defense, including the Naval Weapons Center (China Lake, CA), and from the Office of Basic Energy Sciences of the Department of Energy, through an agreement with the National Aeronautics and Space Administration. The authors acknowledge helpful discussions with J. Martin and D. Andes from the Naval Weapons Center.

References

Barhen, J., Gulati, S., and Zak, M., 1989, \"Neural Learning of Constrained Nonlinear Transformations\", IEEE Computer, 22(6), 67-76.

Barhen, J., Toomarian, N., and Gulati, S., 1990a, \"Adjoint Operator Algorithms for Faster Learning in Dynamical Neural Networks\", Adv. Neur. Inf. Proc. Sys., 2, 498-508.

Barhen, J., Toomarian, N., and Gulati, S., 1990b, \"Application of Adjoint Operators to Neural Learning\", Appl. Math. Lett., 3(3), 13-18.

Barhen, J., Toomarian, N., and Gulati, S., 1991, \"Fast Neural Learning Algorithms Using Adjoint Operators\", submitted to IEEE Trans. on Neural Networks.

Cacuci, D. G., 1981, \"Sensitivity Theory for Nonlinear Systems\", J. Math. Phys., 22(12), 2794-2802.

Hirsch, M. W., 1989, \"Convergent Activation Dynamics in Continuous Time Networks\", Neural Networks, 2(5), 331-349.

Pearlmutter, B.
A., 1989, \"Learning State Space Trajectories in Recurrent Neural Networks\", Neural Computation, 1(2), 263-269.

Pearlmutter, B. A., 1990, \"Dynamic Recurrent Neural Networks\", Technical Report CMU-CS-90-196, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Pineda, F., 1988, \"Dynamics and Architecture in Neural Computation\", J. of Complexity, 4, 216-245.

Pineda, F., 1990, \"Time Dependent Adaptive Neural Networks\", Adv. Neur. Inf. Proc. Sys., 2, 710-718.

Rumelhart, D. E., and McClelland, J. L., 1986, Parallel and Distributed Processing, MIT Press.

Toomarian, N., Wacholder, E., and Kaizerman, S., 1987, \"Sensitivity Analysis of Two-Phase Flow Problems\", Nucl. Sci. Eng., 99(1), 53-81.

Toomarian, N. and Barhen, J., 1990, \"Adjoint Operators and Non-Adiabatic Algorithms in Neural Networks\", Appl. Math. Lett., (in press).

Toomarian, N. and Barhen, J., 1991, \"Learning a Trajectory Using Adjoint Functions\", submitted to Neural Networks.

Werbos, P., 1974, \"Beyond Regression: New Tools for Prediction and Analysis in The Behavioral Sciences\", Ph.D. Thesis, Harvard Univ.

Williams, R. J., and Zipser, D., 1989, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\", Neural Computation, 1(2), 270-280.
", "award": [], "sourceid": 425, "authors": [{"given_name": "N.", "family_name": "Toomarian", "institution": null}, {"given_name": "J.", "family_name": "Barhen", "institution": null}]}