{"title": "Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 151, "abstract": null, "full_text": "Processing of Time Series by Neural Circuits \nwith Biologically Realistic Synaptic Dynamics \n\nThomas Natschläger & Wolfgang Maass \nInstitute for Theoretical Computer Science \nTechnische Universität Graz, Austria \n{tnatschl,maass}@igi.tu-graz.ac.at \n\nEduardo D. Sontag \nDept. of Mathematics \nRutgers University \nNew Brunswick, NJ 08903, USA \nsontag@hilbert.rutgers.edu \n\nAnthony Zador \nCold Spring Harbor Laboratory \n1 Bungtown Rd \nCold Spring Harbor, NY 11724 \nzador@cshl.org \n\nAbstract \n\nExperimental data show that biological synapses behave quite differently from the static synapses in common artificial neural network models. Biological synapses are dynamic, i.e., their \"weight\" changes on a short time scale by several hundred percent in dependence of the past input to the synapse. In this article we explore the consequences that these synaptic dynamics entail for the computational power of feedforward neural networks. We show that gradient descent suffices to approximate a given (quadratic) filter by a rather small neural system with dynamic synapses. We also compare our network model to artificial neural networks designed for time series processing. Our numerical results are complemented by theoretical analysis which shows that even with just a single hidden layer such networks can approximate a surprisingly large class of nonlinear filters: all filters that can be characterized by Volterra series. This result is robust with regard to various changes in the model for synaptic dynamics. 
\n\n1 Introduction \n\nMore than two decades of research on artificial neural networks has emphasized the central role of synapses in neural computation. In a conventional artificial neural network, all units (\"neurons\") are assumed to be identical, so that the computation is completely specified by the synaptic \"weights\", i.e., by the strengths of the connections between the units. Synapses in common artificial neural network models are static: the value Wij of a synaptic weight is assumed to change only during \"learning\". In contrast to that, the \"weight\" wij(t) of a biological synapse at time t is known to be strongly dependent on the inputs xj(t - τ) that this synapse has received from the presynaptic neuron j at previous time steps t - τ, see e.g. [1]. We will focus in this article on mean-field models for populations of neurons connected by dynamic synapses. \n\n[Figure 1: synaptic output over 200 time steps for three parameter settings: A pure facilitation, B pure depression, C facilitation and depression.] \n\nFigure 1: A dynamic synapse can produce quite different outputs for the same input. The response of a single synapse to a step increase in input activity applied at time step 0 is compared for three different parameter settings. \n\nSeveral models have been proposed for the dynamic changes in the efficacy of single synapses. In [2] the model of [3] is extended to populations of neurons, where the current synaptic efficacy wij(t) between a population j and a population i at time t is modeled as the product of a facilitation term fij(t) and a depression term dij(t), scaled by a factor Wij. We consider a time discrete version of this model defined as follows: \n\nwij(t) = Wij · f'ij(t) · dij(t)   (1) \nfij(t + 1) = fij(t) - fij(t)/Fij + Uij · (1 - fij(t)) · xj(t)   (2) \ndij(t + 1) = dij(t) + (1 - dij(t))/Dij - f'ij(t) · dij(t) · xj(t)   (3) \nf'ij(t) = fij(t) · (1 - Uij) + Uij   (4) \n\nwith dij(0) = 1 and fij(0) = 0. Equation (2) models facilitation (with time constant Fij), whereas equation (3) models the combined effects of synaptic depression (with time constant Dij) and facilitation. Depending on the values of the characteristic parameters Uij, Dij, Fij a synaptic connection (ij) maps an input function xj(t) into the corresponding time varying synaptic output wij(t) · xj(t). The same input xj(t) can yield markedly different outputs wij(t) · xj(t) for different values of the characteristic parameters Uij, Dij, Fij. Fig. 1 compares the output for three different sets of values for these parameters. These examples illustrate just three of the range of input-output behaviors that a single synapse can achieve. \n\nIn this article we will consider feedforward networks coupled by dynamic synapses. One should think of the computational units in such a network as populations of spiking neurons. We refer to such networks as \"dynamic networks\"; see Fig. 2 for details. \n\n[Figure 2: network schematic with input, hidden units, dynamic synapses, and output unit.] \n\nFigure 2: The dynamic network model. The output xi(t) of the ith unit is given by xi(t) = σ(sum_j wij(t) · xj(t)), where σ is either the sigmoid function σ(u) = 1/(1 + exp(-u)) (in the hidden layers) or just the identity function σ(u) = u (in the output layer), and wij(t) is modeled according to Equations (1) to (4). \n\nIn Sections 2 and 3 we demonstrate (by employing gradient descent to find appropriate values for the parameters Uij, Dij, Fij and Wij) that even small dynamic networks can compute complex quadratic filters. In Section 4 we address the question which synaptic parameters are important for a dynamic network to learn a given filter. In Section 5 we give a precise mathematical characterization of the computational power of such dynamic networks. 
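The discrete-time synapse model of Equations (1) to (4) is straightforward to simulate. The following sketch is our own illustration, not the authors' code; the parameter values are assumptions chosen to qualitatively reproduce the facilitating and depressing responses of Fig. 1:

```python
# Sketch of a single dynamic synapse, Equations (1)-(4).
# Parameter values below are illustrative assumptions, not taken from the paper.

def simulate_synapse(x, W=1.0, U=0.1, D=50.0, F=50.0):
    """Return the synaptic output w(t) * x(t) for an input time series x."""
    f, d = 0.0, 1.0                        # initial conditions f(0) = 0, d(0) = 1
    out = []
    for xt in x:
        f_bar = f * (1.0 - U) + U          # Eq. (4): effective facilitation f'
        out.append(W * f_bar * d * xt)     # Eq. (1): w(t) = W * f' * d
        f_next = f - f / F + U * (1.0 - f) * xt        # Eq. (2): facilitation
        d_next = d + (1.0 - d) / D - f_bar * d * xt    # Eq. (3): depression
        f, d = f_next, d_next
    return out

# Step input switched on at time step 0, as in Fig. 1:
step = [0.5] * 200
facilitating = simulate_synapse(step, U=0.05, D=2.0, F=100.0)
depressing = simulate_synapse(step, U=0.5, D=100.0, F=2.0)
```

With these assumed settings the first trace grows under repeated activation while the second decays, echoing the qualitative behaviors sketched in Fig. 1.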
\n\n2 Learning Arbitrary Quadratic Filters by Dynamic Networks \n\nIn order to analyze which filters can be approximated by small dynamic networks we investigate the task of learning a quadratic filter Q randomly chosen from a class Qm. The class Qm consists of all quadratic filters Q whose output (Qx)(t) in response to the input time series x(t) is defined by some symmetric m x m matrix HQ = [hkl] of filter coefficients hkl ∈ R, k = 1, ..., m, l = 1, ..., m, through the equation (Qx)(t) = sum_{k=1}^{m} sum_{l=1}^{m} hkl x(t - k) x(t - l). An example of the input and the corresponding output for one choice of quadratic parameters (m = 10) is shown in Figs. 3B and 3C, respectively. We view such a filter Q as an example of the kinds of complex transformations that are important to an organism's survival, such as those required for motor control and the processing of time-varying sensory inputs. For example, the spectrotemporal receptive field of a neuron in the auditory cortex [4] reflects some complex transformation of sound pressure to neuronal activity. The real transformations actually required may be very complex, but the simple filter Q provides a useful starting point for assessing the capacity of this architecture to transform one time-varying signal into another. \n\nCan a network of units coupled by dynamic synapses implement the filter Q? We tested the approximation capabilities of a rather small dynamic network with just 10 hidden units (5 excitatory and 5 inhibitory ones), and one output (Fig. 3A). The dynamics of inhibitory synapses is described by the same model as that for excitatory synapses. For any particular temporal pattern applied at the input and any particular choice of the synaptic parameters, this network generates a temporal pattern as output. 
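As a concrete illustration of the class Qm, the defining equation above can be sketched directly in Python (our own sketch, not the authors' code; inputs before t = 0 are assumed to be zero):

```python
# Sketch of a quadratic filter from the class Q_m:
#   (Qx)(t) = sum_{k=1}^{m} sum_{l=1}^{m} h_kl * x(t-k) * x(t-l)

def quadratic_filter(H, x):
    """Apply the quadratic filter given by the symmetric m x m matrix H = [h_kl]."""
    m = len(H)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(1, m + 1):
            for l in range(1, m + 1):
                # inputs before time 0 are taken to be zero (an assumption)
                xk = x[t - k] if t - k >= 0 else 0.0
                xl = x[t - l] if t - l >= 0 else 0.0
                acc += H[k - 1][l - 1] * xk * xl
        out.append(acc)
    return out
```

For instance, the identity matrix H with m = 2 gives (Qx)(t) = x(t-1)^2 + x(t-2)^2.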
This output can be thought of, for example, as the activity of a particular population of neurons in the cortex, and the target function as the time series generated for the same input by some unknown quadratic filter Q. The synaptic parameters Wij, Dij, Fij and Uij are chosen so that, for each input in the training set, the network minimizes the mean-square error E[z, zQ] = (1/T) sum_{t=0}^{T-1} (z(t) - zQ(t))^2 between its output z(t) and the desired output zQ(t) specified by the filter Q. To achieve this minimization, we used a conjugate gradient algorithm.1 The training inputs were random signals, an example of which is shown in Fig. 3B. The test inputs were drawn from the same random distribution as the training inputs, but were not actually used during training. This test of generalization ensured that the observed performance represented more than simple \"memorization\" of the training set. Fig. 3C compares the network performance before and after training. Prior to training, the output is nearly flat, while after training the network output tracks the filter output closely (E[z, zQ] = 0.0032). \n\nFig. 3D shows the performance after training for different randomly chosen quadratic filters Q ∈ Qm for m = 4, ..., 16. Even for larger values of m the relatively small network with 10 hidden units performs rather well. Note that a quadratic filter of dimension m has m(m + 1)/2 free parameters, whereas the dynamic network has a constant number of 80 adjustable parameters. This shows clearly that dynamic synapses enable a small network to mimic a wide range of possible quadratic target filters. \n\n1 In order to apply such a conjugate gradient algorithm one has to calculate the partial derivatives dE[z, zQ]/dWij, dE[z, zQ]/dUij, dE[z, zQ]/dDij, and dE[z, zQ]/dFij for all synapses (ij) in the network. For more details about conjugate gradient algorithms see e.g. [5]. \n\n[Figure 3: A network schematic; B input time series (200 time steps); C target and network output before and after training; D test error as a function of m.] \n\nFigure 3: A network with units coupled by dynamic synapses can approximate randomly drawn quadratic filters. A Network architecture. The network had one input unit, 10 hidden units (5 excitatory, 5 inhibitory), and one output unit, see Fig. 2 for details. B One of the input patterns used in the training ensemble. For clarity, only a portion of the actual input is shown. C Output of the network prior to training, with random initialization of the parameters, and the output of the dynamic network after learning. The target was the output of a quadratic filter Q ∈ Q10. The filter coefficients hkl (1 ≤ k, l ≤ 10) were generated randomly by subtracting μ/2 from a random number generated from an exponential distribution with mean μ = 3. D Performance after network training. For different sizes of HQ (HQ is a symmetric m x m matrix) we plotted the average performance (mse measured on a test set) over 20 different filters Q, i.e. 20 randomly generated matrices HQ. \n\n3 Comparison with the model of Back and Tsoi \n\nOur dynamic network model is not the first to incorporate temporal dynamics via dynamic synapses. Perhaps the earliest suggestion of a role for synaptic dynamics in network computation was made in [7]. More recently, a number of networks have been proposed in which synapses implement linear filters; see in particular [6]. \n\nTo assess the performance of our network model in relation to the model proposed in [6] we have analyzed the performance of our dynamic network model on the same system identification task that was employed as a benchmark in [6]. 
The goal of this task is to learn a filter F with (Fx)(t) = sin(u(t)), where u(t) is the output of a linear filter applied to the input time series x(t).2 \n\nThe result is summarized in Fig. 4. It can clearly be seen that our network model (see Fig. 3A for the network architecture) is able to learn this particular filter. The mean square error (mse) on the test data is 0.0010, which is slightly smaller than the mse of 0.0013 reported in [6]. Note that the network Back and Tsoi used to learn the task had 130 adjustable parameters (13 parameters per IIR synapse, 10 hidden units) whereas our network model had only 80 adjustable parameters (all parameters Uij, Fij, Dij and Wij were adjusted during learning). \n\n2 u(t) is the solution to the difference equation u(t) - 1.99 u(t-1) + 1.572 u(t-2) - 0.4583 u(t-3) = 0.0154 x(t) + 0.0462 x(t-1) + 0.0462 x(t-2) + 0.0154 x(t-3). Hence, u(t) is the output of a linear filter applied to the input x(t). \n\n[Figure 4: A input time series; B network output and target after learning; C test mse of BT vs. DN; D number of adjustable parameters of BT vs. DN.] \n\nFigure 4: Performance of our model on the system identification task used in [6]. The network architecture is the same as in Fig. 3. A One of the input patterns used in the training ensemble. B Output of the network after learning and the target. C Comparison of the mean square error (in units of 10^-3) achieved on test data by the model of Back and Tsoi (BT) and by the dynamic network (DN). D Comparison of the number of adjustable parameters. The network model of Back and Tsoi (BT) utilizes slightly more adjustable parameters than the dynamic network (DN). \n\n[Figure 5: A impact of single parameters and of pairs among W, U, D, F; B impact of parameter triples (labeled by the omitted parameter); C, D the same analyses for the filter used in [6].] \n\nFigure 5: Impact of different synaptic parameters on the learning capabilities of a dynamic network. The size of a square (the \"impact\") is proportional to the inverse of the mean squared error averaged over N trials. A In each trial (N = 100) a different quadratic filter matrix HQ (m = 6) was randomly generated as described in Fig. 3. Along the diagonal one can see the impact of a single parameter, whereas the off-diagonal elements (which are symmetric) represent the impact of changing pairs of parameters. B The impact of subsets of size three is shown, where the labels indicate which parameter is not included. C Same interpretation as for panel A, but the results shown (N = 20) are for the filter used in [6]. D Same interpretation as for panel B, but the results shown (N = 20) are for the same filter as in panel C. \n\nThis shows that a very simple feedforward network with biologically realistic synaptic dynamics yields performance comparable to that of artificial networks that were previously designed to yield good performance in the time series domain without any claims of biological realism. \n\n4 Which Parameters Matter? \n\nIt remains an open experimental question which synaptic parameters are subject to use-dependent plasticity, and under what conditions. For example, long term potentiation appears to change synaptic dynamics between pairs of layer 5 cortical neurons [8] but not in the hippocampus [9]. We therefore wondered whether plasticity in the synaptic dynamics is essential for a dynamic network to be able to learn a particular target filter. 
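Such subset experiments can be mimicked in miniature. The sketch below is our own illustration with assumed parameter values, and it substitutes plain finite-difference gradient descent for the conjugate-gradient method used in the paper: only a chosen subset of the parameters W, U, D, F of a single dynamic synapse is fitted to a target trace, while the rest stay fixed.

```python
def synapse_out(x, W, U, D, F):
    # Discrete-time dynamic synapse, Equations (1)-(4)
    f, d, out = 0.0, 1.0, []
    for xt in x:
        f_bar = f * (1.0 - U) + U
        out.append(W * f_bar * d * xt)
        f, d = (f - f / F + U * (1.0 - f) * xt,
                d + (1.0 - d) / D - f_bar * d * xt)
    return out

def mse(z, zq):
    return sum((a - b) ** 2 for a, b in zip(z, zq)) / len(z)

def fit_subset(x, target, params, subset, iters=60, eps=1e-4):
    """Greedy finite-difference descent on the parameters named in `subset`;
    steps that do not lower the error are rejected."""
    params = dict(params)
    loss = lambda p: mse(synapse_out(x, **p), target)
    for _ in range(iters):
        grad = {k: (loss({**params, k: params[k] + eps}) - loss(params)) / eps
                for k in subset}
        step = 0.5
        while step > 1e-7:
            trial = {k: params[k] - step * grad.get(k, 0.0) for k in params}
            if (min(trial.values()) > 1e-3 and trial["U"] < 1.0
                    and loss(trial) < loss(params)):
                params = trial
                break
            step /= 2.0
    return params, loss(params)

x = [0.5 if (t // 25) % 2 == 0 else 0.1 for t in range(100)]   # toy input
target = synapse_out(x, W=1.0, U=0.5, D=30.0, F=10.0)          # "true" synapse
start = {"W": 1.0, "U": 0.1, "D": 30.0, "F": 10.0}             # mismatched U
fitted, err = fit_subset(x, target, start, subset=("U", "W"))
```

On this toy target, fitting the pair {U, W} while holding D and F fixed already reduces the error; the paper performs the analogous comparison at network scale, for all parameter subsets.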
To address this question, we compared network performance when different parameter subsets were optimized using the conjugate gradient algorithm, while the other parameters were held fixed. In all experiments, the fixed parameters were chosen to ensure heterogeneity in presynaptic dynamics. \n\nFig. 5 shows that changing only the postsynaptic parameter W has comparable impact to changing only the presynaptic parameters U or D, whereas changing only F has little impact on the dynamics of these networks (see diagonal of Fig. 5A and Fig. 5C). However, to achieve good performance one has to change at least two different types of parameters, such as {W, U} or {W, D} (all other pairs yield worse performance). Hence, neither plasticity in the presynaptic dynamics (U, D, F) alone nor plasticity of the postsynaptic efficacy (W) alone was sufficient to achieve good performance in this model. \n\n5 A Universal Approximation Theorem for Dynamic Networks \n\nIn the preceding sections we presented empirical evidence for the approximation capabilities of our dynamic network model for computations in the time series domain. This raises the question of what the theoretical limits of these approximation capabilities are. The rigorous theoretical result presented in this section shows that basically there are no significant a priori limits. Furthermore, in spite of the rather complicated system of equations that defines dynamic networks, one can give a precise mathematical characterization of the class of filters that can be approximated by them. This characterization involves the following basic concepts. An arbitrary filter F is called time invariant if a shift of the input functions by a constant t0 just causes a shift of the output function by the same constant t0. Another essential property of filters is fading memory. A filter F has 
A filter F has \nfading memory if and only if the value of F;f(O) can be approximated arbitrarily closely \nby the value of F~(O) for functions ~ that approximate the functions ;f for sufficiently \nlong bounded intervals [-T, 0]. Interesting examples of linear and nonlinear time invariant \nfilters with fading memory can be generated with the help of representations of the form \n(Fx)(t) = Iooo ... Iooo x(t - Tt) ..... x(t - Tk)hh, . .. ,Tk)dTl ... dTk for measurable \nand essentially bounded functions x : R -+ R (with hELl). One refers to such an integral \nas a Volterra term of order k. Note that for k = 1 it yields the usual representation for a \nlinear time invariant filter. The class of filters that can be represented by Volterra series, \ni.e., by finite or infinite sums of Volterra terms of arbitrary order, has been investigated for \nquite some time in neurobiology and engineering. \n\nTheorem 1 Assume that X is the class of functions from R into [Bo, B l ] which satisfy \nIx(t) - x(s)1 ~ B2 \u00b7It - sl for all t,s E ffi, where B o,Bl ,B2 are arbitrary real-valued \nconstants with 0 < Bo < Bl and 0 < B 2. Let F be an arbitrary filter that maps vectors \nof functions ;f = (Xl, ... ,xn) E xn into functions from R into ~ Then the following are \nequivalent: \n\n(a) F can be approximated by dynamic networks' N defined in Fig. 2 (i.e., for any \nfor all \n\n\u20ac > 0 there exists such network N such that I (F;f)(t) - (N ;f)(t) I < \u20ac \n;f E xn and all t E R) \n\n(b) F can be approximated by dynamic networks (see Fig. 2) with just a single layer \n\nof sigmoidal neurons \n\n( c) F is time invariant and has fading memory \n\n(d) F can be approximated by a sequence of (finite or infinite) Volterra series. \n\nThe proof of Theorem 1 relies on the Stone-Weierstrass Theorem, and is contained as the \nproof of Theorem 3.4 in [10]. 
\n\nThe universal approximation result contained in Theorem 1 turns out to be rather robust with regard to changes in the definition of a dynamic network. Dynamic networks with just one layer of dynamic synapses and one subsequent layer of sigmoidal gates can approximate the same class of filters as dynamic networks with an arbitrary number of layers of dynamic synapses and sigmoidal neurons. It can also be shown that Theorem 1 remains valid if one considers networks which have depressing synapses only, or if one uses the model for synaptic dynamics proposed in [1]. \n\n6 Discussion \n\nOur central hypothesis is that rapid changes in synaptic strength, mediated by mechanisms such as facilitation and depression, are an integral part of neural processing. We have analyzed the computational power of such dynamic networks, which represent a new paradigm for neural computation on time series that is based on biologically realistic models for synaptic dynamics [11]. \n\nOur analytical results show that the class of nonlinear filters that can be approximated by dynamic networks, even with just a single hidden layer of sigmoidal neurons, is remarkably rich. It contains every time invariant filter with fading memory, hence arguably every filter that is potentially useful for a biological organism. \n\nThe computer simulations we performed show that rather small dynamic networks are not only able to perform interesting computations on time series, but their performance is comparable to that of previously considered artificial neural networks that were designed for the purpose of yielding efficient processing of temporal signals. We have tested dynamic networks on tasks such as the learning of a randomly chosen quadratic filter, as well as on the learning task used in [6], to illustrate the potential of this architecture. \n\nReferences \n\n[1] J. A. Varela, K. Sen, J. Gibson, J. Fost, L. F. Abbott, and S. B. 
Nelson. A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17:220-224, 1997. \n\n[2] M. V. Tsodyks, K. Pawelzik, and H. Markram. Neural networks with dynamic synapses. Neural Computation, 10:821-835, 1998. \n\n[3] H. Markram, Y. Wang, and M. Tsodyks. Differential signaling via the same axon of neocortical pyramidal neurons. PNAS, 95:5323-5328, 1998. \n\n[4] R. C. deCharms and M. M. Merzenich. Optimizing sound features for cortical neurons. Science, 280:1439-1443, 1998. \n\n[5] John Hertz, Anders Krogh, and Richard Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991. \n\n[6] A. D. Back and A. C. Tsoi. A simplified gradient algorithm for IIR synapse multilayer perceptrons. Neural Computation, 5:456-462, 1993. \n\n[7] W. A. Little and G. L. Shaw. A statistical theory of short and long term memory. Behavioural Biology, 14:115-133, 1975. \n\n[8] H. Markram and M. Tsodyks. Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382:807-810, 1996. \n\n[9] D. K. Selig, R. A. Nicoll, and R. C. Malenka. Hippocampal long-term potentiation preserves the fidelity of postsynaptic responses to presynaptic bursts. J. Neurosci., 19:1236-1246, 1999. \n\n[10] W. Maass and E. D. Sontag. Neural systems as nonlinear filters. Neural Computation, 12(8):1743-1772, 2000. \n\n[11] A. M. Zador. The basic unit of computation. Nature Neuroscience, 3(Supp):1167, 2000. \n", "award": [], "sourceid": 1903, "authors": [{"given_name": "Thomas", "family_name": "Natschl\u00e4ger", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}, {"given_name": "Eduardo", "family_name": "Sontag", "institution": null}, {"given_name": "Anthony", "family_name": "Zador", "institution": null}]}