{"title": "Universal Approximation and Learning of Trajectories Using Oscillators", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 457, "abstract": "", "full_text": "Universal Approximation and Learning \n\nof Trajectories Using Oscillators \n\nPierre Baldi* \nDivision of Biology \n\nCalifornia Institute of Technology \n\nPasadena, CA 91125 \n\npfbaldi@juliet.caltech.edu \n\nKurt Hornik \n\nTechnische Universitat Wien \n\nWiedner Hauptstra8e 8-10/1071 \n\nA-1040 Wien, Austria \n\nKurt.Hornik@tuwien.ac.at \n\nAbstract \n\nNatural and artificial neural circuits must be capable of travers(cid:173)\ning specific state space trajectories. A natural approach to this \nproblem is to learn the relevant trajectories from examples. Un(cid:173)\nfortunately, gradient descent learning of complex trajectories in \namorphous networks is unsuccessful. We suggest a possible ap(cid:173)\nproach where trajectories are realized by combining simple oscil(cid:173)\nlators, in various modular ways. We contrast two regimes of fast \nand slow oscillations. In all cases, we show that banks of oscillators \nwith bounded frequencies have universal approximation properties. \nOpen questions are also discussed briefly. \n\n1 \n\nINTRODUCTION: TRAJECTORY LEARNING \n\nThe design of artificial neural systems, in robotics applications and others, often \nleads to the problem of constructing a recurrent neural network capable of producing \na particular trajectory, in the state space of its visible units. Throughout evolution, \nbiological neural systems, such as central pattern generators, have also been faced \nwith similar challenges. A natural approach to tackle this problem is to try to \n\"learn\" the desired trajectory, for instance through a process of trial and error \nand subsequent optimization. Unfortunately, gradient descent learning of complex \ntrajectories in amorphous networks is unsuccessful. 
Here, we suggest a possible approach where trajectories are realized, in a modular and hierarchical fashion, by combining simple oscillators. In particular, we show that banks of oscillators have universal approximation properties.

* Also with the Jet Propulsion Laboratory, California Institute of Technology.

To begin with, we can restrict ourselves to the simple case of a network with one visible linear unit^1 and consider the problem of adjusting the network parameters so that the output unit activity u(t) equals a target function f(t) over an interval of time [0, T]. The hidden units of the network may be non-linear and satisfy, for instance, one of the usual neural network charging equations, such as

du_i/dt = -u_i/T_i + \sum_j w_{ij} f_j(u_j(t - T_{ij})),   (1)

where T_i is the time constant of the unit, the T_{ij} represent interaction delays, and the functions f_j are non-linear input/output functions, sigmoidal or otherwise. In the next section, we briefly review three possible approaches to this problem, and some of their limitations. In particular, we suggest that complex trajectories can be synthesized by proper combination of simple oscillatory components.

2 THREE DIFFERENT APPROACHES TO TRAJECTORY LEARNING

2.1 GRADIENT DESCENT APPROACHES

One obvious approach is to use a form of gradient descent for recurrent networks (see [2] for a review), such as back-propagation through time, in order to modify any adjustable parameters of the network (time constants, delays, synaptic weights and/or gains) so as to reduce a certain error measure, constructed by comparing the output u(t) with its target f(t). While conceptually simple, gradient descent applied to amorphous networks is not a successful approach, except on the simplest trajectories.
Although intuitively clear, the exact reasons for this are not entirely understood; they overlap in part with the problems that can be encountered with gradient descent in simple feed-forward networks on regression or classification tasks.

There is an additional set of difficulties with gradient descent learning of fixed points or trajectories that is specific to recurrent networks, and that has to do with the bifurcations of the system being considered. In the case of a recurrent network^2, as the parameters are varied, the system may or may not undergo a series of bifurcations, i.e., abrupt changes in the structure of its trajectories and, in particular, of its attractors (fixed points, limit cycles, ...). This in turn may translate into abrupt discontinuities, oscillations, or non-convergence in the corresponding learning curve. At each bifurcation, the error function is usually discontinuous, and therefore the gradient is not defined. Learning can be disrupted in two ways: when unwanted abrupt changes occur in the flow of the dynamical system, or when desirable bifurcations are prevented from occurring. A classical example of the second type is the case of a neural network with very small initial weights being trained to oscillate, in a symmetric and stable fashion, around the origin. With small initial weights, the network in general converges to its unique fixed point at the origin, with a large error. If we slightly perturb the weights, remaining away from any bifurcation, the network continues to converge to its unique fixed point, which now may be slightly displaced from the origin and yield an even greater error, so that learning by gradient descent becomes impossible (the starting configuration of zero weights is a local minimum of the error function).

^1 All the results to be derived can be extended immediately to the case of higher-dimensional trajectories.
^2 In a feed-forward network, where the transfer functions of the units are continuous, the output is a continuous function of the parameters and therefore there are no bifurcations.

Figure 1: A schematic representation of a 3-layer oscillator network for the double figure eight. Oscillators with period T in a given layer gate the corresponding oscillators, with period T/2, in the previous layer.

2.2 DYNAMICAL SYSTEM APPROACH

In the dynamical system approach, the function f(t) is approximated in time, over [0, T], by a sequence of points y_0, y_1, .... These points are associated with the iterates of a dynamical system, i.e., y_{n+1} = F(y_n), so that y_n = F^n(y_0), for some function F. Thus the network implementation requires mainly a feed-forward circuit that computes the function F. It has a simple overall recursive structure where, at time n, the output F(y_n) is calculated and fed back into the input for the next iteration. While this approach is entirely general, it leaves open the problem of constructing the function F. Of course, F can be learned from examples in a usual feed-forward connectionist network. But, as usual, the complexity and architecture of such a network are difficult to determine in general. Another interesting issue in trajectory learning is how time is represented in the network, and whether some sort of clock is needed. Although certain authors in the literature have occasionally advocated the introduction of an input unit whose output is the time t, this explicit representation is clearly not suitable, since the problem of trajectory learning then reduces entirely to a regression problem. The dynamical system approach relies on one basic clock to calculate F and recycle its output to the input layer. In the next approach, an implicit representation of time is provided by the periods of the oscillators.
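As a toy illustration of this recursive structure (our own sketch, not from the paper: the map F is taken here in closed form as a plane rotation, whereas in general it would be learned by a feed-forward network), a circular trajectory sampled at N points is reproduced exactly by iterating y_{n+1} = F(y_n):

```python
import numpy as np

# Dynamical-system approach, minimal sketch: realize a sampled circular
# trajectory as the orbit of a map F.  Here F is a rotation by 2*pi/N;
# in the approach described above it would be a learned network.
N = 16
theta = 2 * np.pi / N
F = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

y = np.array([1.0, 0.0])      # y_0: starting point on the unit circle
orbit = [y]
for _ in range(N):
    y = F @ y                 # y_{n+1} = F(y_n), fed back into the input
    orbit.append(y)
orbit = np.array(orbit)       # after N steps the orbit closes on itself
```

The same loop structure applies unchanged when F is replaced by a trained feed-forward network; only the construction of F is then non-trivial.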
2.3 OSCILLATOR APPROACH

A different approach was suggested in [1] where, loosely speaking, complex trajectories are realized using weakly pre-structured networks, consisting of shallow hierarchical combinations of simple oscillatory modules. The oscillatory modules can consist, for instance, of simple oscillator rings of units satisfying Eq. 1, with two or three high-gain neurons and an odd number of inhibitory connections ([3]).

To fix ideas, consider the typical test problem of constructing a network capable of producing a trajectory associated with a double figure eight curve (i.e., a set of four loops joined at one point); see Fig. 1. In this example, the first level of the hierarchy could contain four oscillator rings, one for each loop of the target trajectory. The parameters in each of these four modules can be adjusted, for instance by gradient descent, to match each of the loops in the target trajectory.

The second level of the pyramid should contain two control modules. Each of these modules controls a distinct pair of oscillator networks from the first level, so that each control network in the second level ends up producing a simple figure eight. Again, the control networks in level two can be oscillator rings and their parameters can be adjusted. In particular, after the learning process is completed, they should be operating in their high-gain regimes and have a period equal to the sum of the periods of the circuits each one controls.

Finally, the third layer consists of another oscillatory and adjustable module which controls the two modules in the second level, so as to produce a double figure eight. The third-layer module must also end up operating in its high-gain regime, with a period equal to four times the period of the oscillators in the first layer.
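Such an oscillator ring is easy to simulate. The sketch below integrates Eq. 1 by forward Euler for three high-gain tanh units connected in a ring with one inhibitory link and unit interaction delays; all parameter values are illustrative assumptions on our part, not taken from [3]. With an odd number of inhibitory connections and sufficient gain, the origin is unstable and the ring settles into sustained oscillation.

```python
import numpy as np

# Three-unit oscillator ring obeying Eq. 1 (forward Euler; illustrative values):
#   du_i/dt = -u_i / tau + w_i * tanh(gain * u_{i-1}(t - delay))
tau, gain, delay, dt, t_end = 1.0, 10.0, 1.0, 0.01, 100.0
w = np.array([-1.0, 1.0, 1.0])        # one inhibitory connection (odd number)
steps = int(t_end / dt)
d_steps = int(delay / dt)

u = np.zeros((steps + 1, 3))
u[0] = [0.5, 0.0, 0.0]                # perturbation off the unstable origin
for n in range(steps):
    past = u[max(n - d_steps, 0)]     # delayed activities u_j(t - T_ij)
    drive = w * np.tanh(gain * past[[2, 0, 1]])   # unit i driven by unit i-1
    u[n + 1] = u[n] + dt * (-u[n] / tau + drive)
# u now holds a bounded, oscillating trajectory for the three units
```

Adjusting tau, gain, and delay moves the period within a bounded range, which is the knob the learning procedure above would turn for each loop of the target trajectory.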
In general, the final output trajectory is also a limit cycle, because it is obtained by superposition of limit cycles in the various modules. If the various oscillators relax to their limit cycles independently of one another, it is essential to provide adjustable delays between the various modules in order to get the proper phase adjustments. In this way, a sparse network with 20 units or so can be constructed that successfully executes a double figure eight.

There are actually different possible neural network realizations, depending on how the action of the control modules is implemented. For instance, if the control units gate the connections between corresponding layers, this amounts to using higher-order units in the network. If one high-gain oscillatory unit, with activity c(t) always close to 0 or 1, gates the oscillatory activities of two units u_1(t) and u_2(t) in the previous layer, then the overall output can be written as

out(t) = c(t) u_1(t) + (1 - c(t)) u_2(t).   (2)

The number of layers in the network then becomes a function of the order of the units one is willing to use. This approach can also be described in terms of a dynamic mixture-of-experts architecture in its high-gain regime. Alternatively, one could assume the existence of a fast weight dynamics on certain connections, governed by a corresponding set of differential equations. Although we believe that oscillators with limit cycles present several attractive properties (stability, short transients, biological relevance, ...), one can conceivably use completely different circuits as building blocks in each module.

3 GENERALIZATION AND UNIVERSAL APPROXIMATION

We have just described an approach that combines a modular hierarchical architecture with a simple form of learning, enabling the synthesis of a neural circuit suitable for the production of a double figure eight trajectory.
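The gating equation Eq. 2 can be made concrete with a small numerical sketch (illustrative signals of our own choosing, not the paper's trained network): a 0/1 gate c(t) of period T alternates between two oscillators of period T/2, so the output traces one component in each half-period.

```python
import numpy as np

# Eq. 2 in action: a high-gain control signal c(t), close to 0 or 1,
# gates two oscillators u1, u2 of period T/2 (values are illustrative).
T = 2.0
t = np.linspace(0.0, T, 1000, endpoint=False)
u1 = np.sin(2 * np.pi * t)            # first oscillator (period T/2 = 1)
u2 = -np.sin(2 * np.pi * t)           # second oscillator, opposite phase
c = (t % T < T / 2).astype(float)     # idealized square-wave gate

out = c * u1 + (1.0 - c) * u2         # out(t) = c(t) u1(t) + (1 - c(t)) u2(t)
```

With u1 and u2 realized as the two loop generators of Fig. 1, this is exactly the mechanism by which a figure eight is assembled from two simple loops.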
It is clear that the same approach can be extended to the triple figure eight or, for that matter, to any trajectory curve consisting of an arbitrary number of simple loops with a common period and one common point. In fact, it can be extended to any arbitrary trajectory. To see this, we can subdivide the time interval [0, T] into n equal intervals of duration \epsilon = T/n. Given a certain level of required precision, we can always find n oscillator networks with period T (or a fraction of T) and visible trajectories u_i(t), such that for each i, the i-th portion of the trajectory u(t), with i\epsilon \le t \le (i+1)\epsilon, can be well approximated by a portion of u_i(t), the trajectory of the i-th oscillator. The target trajectory can then be approximated as

u(t) \approx \sum_{i=0}^{n-1} c_i(t) u_i(t).   (3)

As usual, the control coefficient c_i(t) must also have period T and be equal to 1 for i\epsilon \le t \le (i+1)\epsilon, and 0 otherwise. The control can be realized with one large high-gain oscillator or, as in the case described above, by a hierarchy of control oscillators arranged, for instance, as a binary tree of depth m if n = 2^m, with the corresponding multiple frequencies.

We can now turn to a slightly different oscillator approach, where trajectories are to be approximated by linear combinations of oscillators with constant coefficients. What we would like to show, again, is that oscillators are universal approximators for trajectories. In a sense, this is already a well-known result of Fourier theory since, for instance, any reasonable function f with period T can be expanded in the form^3

f(t) = \sum_{k=-\infty}^{\infty} a_k e^{2\pi i \lambda_k t},   \lambda_k = k/T.   (4)

For sufficiently smooth target functions, without high frequencies in their spectrum, it is well known that the series in Eq. 4 can be truncated. Notice, however, that both Eqs.
3 and 4 require having component oscillators with relatively high frequencies, compared to the final trajectory. This is not implausible in biological motor control, where trajectories have typical time scales of a fraction of a second, and single control neurons operate in the millisecond range. A rather different situation arises if the component oscillators are \"slow\" with respect to the final product.

The Fourier representation requires in principle oscillations with arbitrarily large frequencies (0, 1/T, 2/T, ..., n/T, ...). Most likely, relatively small variations in the parameters (for instance gains, delays and/or synaptic weights) of an oscillator circuit can only lead to relatively small but continuous variations of the overall frequency. For instance, in [3] it is shown that the period T of an oscillator ring with n units obeying Eq. 1 must satisfy explicit bounds, so that only frequencies in a limited range are attainable.

Thus, we need to show that a decomposition similar in flavor to Eq. 4 is possible, but using oscillators with frequencies in a bounded interval. Notice that, by varying the parameters of a basic oscillator, any frequency in the allowable frequency range can be realized; see [3]. Such a linear combination is slightly different in spirit from Eq. 2, since the coefficients are independent of time, and can be seen as a soft mixture of experts. We have the following result.

Theorem 1 Let a < b be two arbitrary real numbers and let f be a continuous function on [0, T]. Then for any error level \epsilon > 0, there exist n and a function g_n of the form

g_n(t) = \sum_{k=1}^{n} \alpha_k e^{2\pi i \lambda_k t},   a \le \lambda_k \le b,

such that the uniform distance ||f - g_n||_\infty is less than \epsilon.

In fact, it is not even necessary to vary the frequencies \lambda_k over a continuous band [a, b]. We have the following.

Theorem 2 Let {\lambda_k} be an infinite sequence with a finite accumulation point, and let f be a continuous function on [0, T]. Then for any error level \epsilon > 0, there exist n and a function g_n(t) = \sum_{k=1}^{n} \alpha_k e^{2\pi i \lambda_k t} such that ||f - g_n||_\infty < \epsilon.
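A numerical sanity check in the spirit of Theorem 2 (our own illustration; the target function, grid, and number of terms are arbitrary assumptions): freeze the frequencies at \lambda_k = 1/k, a sequence accumulating at 0, and fit only the mixing coefficients, here by least squares over a time grid, using the real cos/sin form of the complex exponentials.

```python
import numpy as np

# Fixed-frequency approximation: lambda_k = 1/k are frozen, only the
# mixing coefficients alpha_k are fit (least squares on a grid).
T = 1.0
t = np.linspace(0.0, T, 400)
f = t * (1.0 - t)                     # a smooth continuous target (illustrative)

lams = 1.0 / np.arange(1, 13)         # lambda_k = 1/k, k = 1..12
phases = 2 * np.pi * np.outer(t, lams)
A = np.hstack([np.cos(phases), np.sin(phases)])   # real form of alpha_k e^{2 pi i lambda_k t}

coef, *_ = np.linalg.lstsq(A, f, rcond=None)
fit = A @ coef
rel_err = np.linalg.norm(f - fit) / np.linalg.norm(f)
```

Note that all the frequencies used lie in (0, 1], i.e., none exceeds the fundamental frequency of the target, which is precisely the "slow oscillator" regime the theorems address.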
^3 In what follows, we use the complex form for notational convenience.

Thus, we may even fix the oscillator frequencies, e.g. as \lambda_k = 1/k, without losing universal approximation capabilities. Similar statements can be made about mean-square approximation or, more generally, approximation in p-norm L^p(\mu), where 1 \le p < \infty and \mu is a finite measure on [0, T]:

Theorem 3 For all p and f in L^p(\mu) and for all \epsilon > 0, we can always find n and g_n as above such that ||f - g_n||_{L^p(\mu)} < \epsilon.

The proof of these results is surprisingly simple. Following the proofs in [4], if one of the above statements were not true, there would exist a nonzero, signed finite measure \sigma with support in [0, T] such that \int_{[0,T]} e^{2\pi i \lambda t} d\sigma(t) = 0 for all \"allowed\" frequencies \lambda. Now the function z \mapsto \int_{[0,T]} e^{2\pi i z t} d\sigma(t) is clearly analytic on the whole complex plane. Hence, by a well-known result from complex variables, if it vanishes along an infinite sequence with a finite accumulation point, it is identically zero. But then in particular the Fourier transform of \sigma vanishes, which in turn implies that \sigma is identically zero by the uniqueness theorem for Fourier transforms, contradicting the initial assumption.

Notice that the above results do not imply that f can be represented exactly as, e.g., f(t) = \int_a^b e^{2\pi i \lambda t} d\nu(\lambda) for some signed finite measure \nu: such functions are not only band-limited, but also extremely smooth (they have an analytic extension to the whole complex plane).

Hence, one might even conjecture that the above approximations are rather poor, in the sense that unrealistically many terms are needed for the approximation. However, this is not true: one can easily show that the rates of approximation cannot be worse than those for approximation with polynomials.
Let us briefly sketch the argument, because it also shows how bounded-frequency oscillators could be constructed.

Following an idea essentially due to Stinchcombe & White [5], let, more generally, g be an analytic function in a neighborhood of the real line for which no derivative vanishes at the origin (above, we had g(t) = e^{2\pi i t}). Pick a nonnegative integer n and a polynomial p of degree not greater than n - 1 arbitrarily. Let us show that for any \epsilon > 0, we can always find a g_n of the form g_n(t) = \sum_{k=1}^{n} \alpha_k g(\lambda_k t), with the \lambda_k arbitrarily small, such that ||p - g_n||_\infty < \epsilon. To do so, note that we can write

p(t) = \sum_{l=0}^{n-1} \beta_l t^l,   g(\lambda t) = \sum_{l=0}^{n-1} \gamma_l (\lambda t)^l + r_n(\lambda t),

where \gamma_l = g^{(l)}(0)/l! and r_n(\lambda t) is of the order of \lambda^n, as \lambda \to 0, uniformly for t in [0, T]. Hence,

\sum_{k=1}^{n} \alpha_k g(\lambda_k t) = \sum_{k=1}^{n} \alpha_k \left( \sum_{l=0}^{n-1} \gamma_l (\lambda_k t)^l + r_n(\lambda_k t) \right) = \sum_{l=0}^{n-1} \left( \sum_{k=1}^{n} \alpha_k \lambda_k^l \right) \gamma_l t^l + \sum_{k=1}^{n} \alpha_k r_n(\lambda_k t).

Now fix n distinct numbers \xi_1, ..., \xi_n, let \lambda_k = \lambda_k(\rho) = \rho \xi_k, and choose the \alpha_k = \alpha_k(\rho) such that \sum_{k=1}^{n} \alpha_k(\rho) \lambda_k(\rho)^l = \beta_l / \gamma_l for l = 0, ..., n - 1. (This is possible because, by assumption, all \gamma_l are non-zero.) It is readily seen that \alpha_k(\rho) is of the order of \rho^{1-n} as \rho \to 0 (in fact, the j-th row of the inverse of the coefficient matrix of the linear system is given by the coefficients of the polynomial \prod_{k \ne j} (\lambda - \lambda_k)/(\lambda_j - \lambda_k)). Hence, as \rho \to 0, the remainder term \sum_{k=1}^{n} \alpha_k(\rho) r_n(\lambda_k(\rho) t) is of the order of \rho, and thus \sum_{k=1}^{n} \alpha_k(\rho) g(\lambda_k(\rho) t) \to \sum_{l=0}^{n-1} \beta_l t^l = p(t) uniformly on [0, T].

Note that, using the above method, the coefficients in the approximation grow quite rapidly when the approximation error tends to 0. In some sense, this was to be expected from the observation that the classes of small-band-limited functions are rather \"small\". There is a fundamental tradeoff between the size of the frequencies and the size of the mixing coefficients.
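The construction can be checked numerically. In the sketch below (our own illustration; g(t) = e^t so that \gamma_l = 1/l! never vanishes, target p(t) = 1 + 2t + 3t^2, n = 3, \xi = (1, 2, 3), and \rho = 10^{-3} are all assumed values), the three frequencies \lambda_k = \rho \xi_k are three orders of magnitude below the scale of [0, 1], yet g_n matches p closely while the coefficients \alpha_k blow up like \rho^{1-n}, exactly the tradeoff described above.

```python
import numpy as np

# Constructive small-frequency approximation of a polynomial (sketch).
# g(t) = exp(t): analytic, g^(l)(0) = 1, so gamma_l = 1/l! is never zero.
n = 3
beta = np.array([1.0, 2.0, 3.0])            # p(t) = 1 + 2 t + 3 t^2
gamma = 1.0 / np.array([1.0, 1.0, 2.0])     # gamma_l = g^(l)(0) / l!
xi = np.array([1.0, 2.0, 3.0])              # n distinct numbers xi_k
rho = 1e-3
lam = rho * xi                              # lambda_k = rho * xi_k (tiny)

V = np.vander(lam, n, increasing=True).T    # V[l, k] = lambda_k ** l
alpha = np.linalg.solve(V, beta / gamma)    # sum_k alpha_k lambda_k^l = beta_l / gamma_l

t = np.linspace(0.0, 1.0, 201)
g_n = sum(a * np.exp(l * t) for a, l in zip(alpha, lam))
p = beta[0] + beta[1] * t + beta[2] * t ** 2
max_err = np.max(np.abs(p - g_n))           # remainder is of the order of rho
```

Shrinking rho further reduces the error at the price of still larger alpha_k, mirroring the frequency/coefficient tradeoff.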
How exactly the coefficients scale with the width of the allowed frequency band is currently being investigated.

4 CONCLUSION

The modular oscillator approach leads to trajectory architectures which are more structured than fully interconnected networks, with a general feed-forward flow of information and sparse recurrent connections to achieve dynamical effects. The sparsity of units and connections is an attractive feature for hardware design, as are the modular organization and the fact that learning is much more circumscribed than in fully interconnected systems. We have shown in different ways that such architectures have universal approximation properties. In these architectures, however, some form of learning remains essential, for instance to fine-tune each of the modules. This, in itself, is a much easier task than the one a fully interconnected and random network would have been faced with. It can be solved by gradient or random descent or other methods. Yet fundamental open problems remain in the overall organization of learning across modules, and in the origin of the decomposition. In particular, can the modular architecture be the outcome of a simple internal organizational process rather than an external imposition, and how should learning be coordinated in time and across modules (other than the obvious: modules in the first level learn first, modules in the second level second, ...)? How successful is a global gradient descent strategy applied across modules? How can the same modular architecture be used for different trajectories, with short switching times between trajectories and proper phases along each trajectory?

Acknowledgments

The work of PB is supported in part by grants from the ONR and the AFOSR.

References

[1] Pierre Baldi. A modular hierarchical approach to learning.
In Proceedings of the 2nd International Conference on Fuzzy Logic and Neural Networks, volume II, pages 985-988, Iizuka, Japan, 1992.

[2] Pierre F. Baldi. Gradient descent learning algorithm overview: a general dynamic systems perspective. IEEE Transactions on Neural Networks, 6(1):182-195, January 1995.

[3] Pierre F. Baldi and Amir F. Atiya. How delays affect neural dynamics and learning. IEEE Transactions on Neural Networks, 5(4):612-621, July 1994.

[4] Kurt Hornik. Some new results on neural network approximation. Neural Networks, 6:1069-1072, 1993.

[5] Maxwell B. Stinchcombe and Halbert White. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. In International Joint Conference on Neural Networks, volume III, pages 7-16, Washington, 1990. Lawrence Erlbaum, Hillsdale.
", "award": [], "sourceid": 1062, "authors": [{"given_name": "Pierre", "family_name": "Baldi", "institution": null}, {"given_name": "Kurt", "family_name": "Hornik", "institution": null}]}