{"title": "Adaptive Nonlinear System Identification with Echo State Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 609, "page_last": 616, "abstract": "", "full_text": "Adaptive Nonlinear System Identification \n\nwith Echo State Networks \n\nHerbert Jaeger \n\nInternational University Bremen \n\nD-28759 Bremen, Germany \n\nh.jaeger@iu-bremen. de \n\nAbstract \n\nEcho state networks (ESN) are a novel approach to recurrent neu(cid:173)\nral network training. An ESN consists of a large, fixed, recurrent \n\"reservoir\" network, from which the desired output is obtained by \ntraining suitable output connection weights. Determination of op(cid:173)\ntimal output weights becomes a linear, uniquely solvable task of \nMSE minimization. This article reviews the basic ideas and de(cid:173)\nscribes an online adaptation scheme based on the RLS algorithm \nknown from adaptive linear systems. As an example, a 10-th or(cid:173)\nder NARMA system is adaptively identified. The known benefits \nof the RLS algorithms carryover from linear systems to nonlinear \nones; specifically, the convergence rate and misadjustment can be \ndetermined at design time. \n\n1 \n\nIntroduction \n\nIt is fair to say that difficulties with existing algorithms have so far precluded su(cid:173)\npervised training techniques for recurrent neural networks (RNNs) from widespread \nuse. Echo state networks (ESNs) provide a novel and easier to manage approach \nto supervised training of RNNs. A large (order of 100s of units) RNN is used as a \n\"reservoir\" of dynamics which can be excited by suitably presented input and/or \nfed-back output. The connection weights of this reservoir network are not changed \nby training. In order to compute a desired output dynamics, only the weights of \nconnections from the reservoir to the output units are calculated. This boils down \nto a linear regression. The theory of ESNs, references and many examples can be \nfound in [5] [6]. 
A tutorial is [7]. A similar idea has recently been independently investigated in a more biologically oriented setting under the name of \"liquid state networks\" [8], [9].

In this article I describe how ESNs can be conjoined with the \"recursive least squares\" (RLS) algorithm, a method for fast online adaptation known from linear systems. The resulting RLS-ESN is capable of tracking a 10th-order nonlinear system with high quality in convergence speed and residual error. Furthermore, the approach yields a priori estimates of tracking performance parameters and thus allows one to design nonlinear trackers according to specifications.¹

¹ All algorithms and calculations described in this article are contained in a tutorial Mathematica notebook which can be fetched from http://www.ais.fraunhofer.de/INDY/ESNresources.html.

Article organization. Section 2 recalls the basic ideas and definitions of ESNs and introduces an augmentation of the basic technique. Section 3 demonstrates ESN offline learning on the 10th-order system identification task. Section 4 describes the principles of using the RLS algorithm with ESN networks and presents a simulation study. Section 5 wraps up.

2 Basic ideas of echo state networks

For the sake of a simple notation, in this article I address only single-input, single-output systems (general treatment in [5]). We consider a discrete-time \"reservoir\" RNN with N internal network units, a single extra input unit, and a single extra output unit. The input at time n >= 1 is u(n), activations of internal units are x(n) = (x_1(n), ..., x_N(n)), and the activation of the output unit is y(n). Internal connection weights are collected in an N x N matrix W = (w_ij), weights of connections going from the input unit into the network in an N-element (column) weight vector w^in = (w_i^in), and the N+1 (input-and-network)-to-output connection weights in an (N+1)-element (row) vector w^out = (w_i^out). The output weights w^out will be learned; the internal weights W and input weights w^in are fixed before learning, typically in a sparse random connectivity pattern. Figure 1 sketches the setup used in this article.

Figure 1: Basic setup of an ESN with N internal units. Solid arrows: fixed weights; dashed arrows: trainable weights.

The activations of the internal units and the output unit are updated according to

x(n+1) = f(W x(n) + w^in u(n+1) + v(n+1)),            (1)
y(n+1) = f^out(w^out (u(n+1), x(n+1))),               (2)

where f stands for an element-wise application of the unit nonlinearity, for which we here use tanh; v(n+1) is an optional noise vector; (u(n+1), x(n+1)) is a vector concatenated from u(n+1) and x(n+1); and f^out is the output unit's nonlinearity (tanh will be used here, too). Training data is a stationary I/O signal (u_teach(n), y_teach(n)). When the network is updated according to (1), then under certain conditions the network state becomes asymptotically independent of initial conditions. More precisely, if the network is started from two arbitrary states x(0), x'(0) and is run with the same input sequence in both cases, the resulting state sequences x(n), x'(n) converge to each other. If this condition holds, the reservoir network state will asymptotically depend only on the input history, and the network is said to be an echo state network (ESN). A sufficient condition for the echo state property is contractivity of W. In practice it was found that a weaker condition suffices, namely, to ensure that the spectral radius |λ_max| of W is less than unity. [5] gives a detailed account.

Consider the task of computing the output weights such that the teacher output is approximated by the network.
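Before turning to the output weights, the fixed reservoir part described above can be sketched in a few lines of code. The following is a minimal Python/NumPy illustration of building a sparse reservoir with spectral radius below unity and iterating the state update (1); the size, connectivity, scalings and seed are illustrative choices (similar values appear in Section 3), not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100  # reservoir size (illustrative)

# Sparse random reservoir matrix W, rescaled so that the spectral
# radius is 0.8 < 1, the practical condition for the echo state property.
W = rng.uniform(-1.0, 1.0, (N, N)) * (rng.random((N, N)) < 0.05)
W *= 0.8 / max(abs(np.linalg.eigvals(W)))

w_in = rng.uniform(-0.1, 0.1, N)  # fixed input weights

def update(x, u, noise=1e-4):
    # One step of Eq. (1): x(n+1) = tanh(W x(n) + w_in u(n+1) + v(n+1))
    v = rng.uniform(-noise, noise, N)
    return np.tanh(W @ x + w_in * u + v)

x = np.zeros(N)
for u in rng.uniform(0.0, 0.5, 200):  # wash out the initial transient
    x = update(x, u)
```

Because of the echo state property, two copies of this reservoir started from different states and driven by the same input sequence would end up in (numerically) identical states.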
In the ESN approach, this task is spelled out concretely as follows: compute w^out such that the training error

eps_train(n) = ((f^out)^{-1} y_teach(n) - w^out (u_teach(n), x(n)))^2        (3)

is minimized in the mean square sense. Note that the effect of the output nonlinearity is undone by (f^out)^{-1} in this error definition. We dub (f^out)^{-1} y_teach(n) the teacher pre-signal and w^out (u_teach(n), x(n)) the network's pre-output. The computation of w^out is a linear regression. Here is a sketch of an offline algorithm for the entire learning procedure:

1. Fix a RNN with a single input and a single output unit, scaling the weight matrix W such that |λ_max| < 1 obtains.

2. Run this RNN by driving it with the teaching input signal. Dismiss data from the initial transient and collect the remaining input+network states (u_teach(n), x_teach(n)) row-wise into a matrix M. Simultaneously, collect the remaining training pre-signals (f^out)^{-1} y_teach(n) into a column vector r.

3. Compute the pseudo-inverse M^{-1}, and put w^out = (M^{-1} r)^T (where T denotes transpose).

4. Write w^out into the output connections; the ESN is now trained.

The modeling power of an ESN grows with network size. A cheaper way to increase the power is to use additional nonlinear transformations of the network state x(n) for computing the network output in (2). We use here a squared version of the network state. Let w^out_squares denote a length-(2N+2) output weight vector and x_squares(n) the length-(2N+2) (column) vector (u(n), x_1(n), ..., x_N(n), u^2(n), x_1^2(n), ..., x_N^2(n)). Keep the network update (1) unchanged, but compute outputs with the following variant of (2):

y(n+1) = f^out(w^out_squares x_squares(n+1)).        (4)

The \"reservoir\" and the input are now tapped by linear and quadratic connections. The learning procedure remains linear and now goes like this:

1. (unchanged)

2. Drive the ESN with the training input. Dismiss the initial transient and collect the remaining augmented states x_squares(n) row-wise into M.
Simultaneously, collect the training pre-signals (f^out)^{-1} y_teach(n) into a column vector r.

3. Compute the pseudo-inverse M^{-1}, and put w^out_squares = (M^{-1} r)^T.

4. The ESN is now ready for exploitation, using output formula (4).

3 Identifying a 10th-order system: offline case

In this section the workings of the augmented algorithm will be demonstrated with a nonlinear system identification task. The system was introduced in a survey-and-unification paper [1]. It is a 10th-order NARMA system:

d(n+1) = 0.3 d(n) + 0.05 d(n) [sum_{i=0}^{9} d(n-i)] + 1.5 u(n-9) u(n) + 0.1.        (5)

Network setup. An N = 100 ESN was prepared by fixing a random, sparse connection weight matrix W (connectivity 5 %, non-zero weights sampled from the uniform distribution on [-1,1]; the resulting raw matrix was re-scaled to a spectral radius of 0.8, thus ensuring the echo state property). An input unit was attached with a random weight vector w^in sampled from the uniform distribution on [-0.1, 0.1].

Training data and training. An I/O training sequence was prepared by driving the system (5) with an i.i.d. input sequence sampled from the uniform distribution on [0, 0.5], as in [1]. The network was run according to (1) with the training input for 1200 time steps with uniform noise v(n) of size 0.0001. Data from the first 200 steps were discarded. The remaining 1000 network states were entered into the augmented training algorithm, and a length-202 augmented output weight vector w^out_squares was calculated.

Testing. The learnt output vector was installed and the network was run from a zero starting state with newly created testing input for 2200 steps, of which the first 200 were discarded. From the remaining 2000 steps, the test error NMSE_test = E[(y(n) - d(n))^2] / sigma^2(d(n)) was estimated. A value of NMSE_test ~ 0.032 was found.

Comments.
(1) The noise term v(n) functions as a regularizer, slightly compromising the training error but improving the test error. (2) Generally, the larger an ESN, the more training data is required and the more precise the learning. Set up exactly like the described 100-unit example, an augmented 20-unit ESN trained on 500 data points gave NMSE_test ~ 0.31, a 50-unit ESN trained on 1000 points gave NMSE_test ~ 0.084, and a 400-unit ESN trained on 4000 points gave NMSE_test ~ 0.0098.

Comparison. The best NMSE training [!] error obtained in [1] on a length-200 training sequence was NMSE_train ~ 0.241.² However, the levels of precision reported in [1] and in many other published papers about RNN training appear to be based on suboptimal training schemes. After submission of this paper I went into a friendly modeling competition with Danil Prokhorov, who expertly applied EKF-BPPT techniques [3] to the same tasks. His results improve on the results of [1] by an order of magnitude and reach a slightly better precision than the results reported here.

4 Online adaptation of ESN networks

Because the determination of optimal (augmented) output weights is a linear task, standard recursive algorithms for MSE minimization known from adaptive linear signal processing can be applied to online ESN estimation. I assume that the reader is familiar with the basic idea of FIR tap-weight (Wiener) filters: i.e., that N input signals x_1(n), ..., x_N(n) are transformed into an output signal y(n) by an inner product with a tap-weight vector (w_1, ..., w_N): y(n) = w_1 x_1(n) + ... + w_N x_N(n). In the ESN context, the input signals are the 2N+2 components of the augmented input+network state vector, the tap-weight vector is the augmented output weight vector, and the output signal is the network pre-output (f^out)^{-1} y(n).

² The authors miscalculated their NMSE because they used a formula for zero-mean signals.
I re-calculated the value NMSE_train ~ 0.241 from their reported best (miscalculated) NMSE of 0.015. The larger value agrees with the plots supplied in that paper.

4.1 A refresher on adaptive linear system identification

For a recursive online estimation of tap-weight vectors, \"recursive least squares\" (RLS) algorithms are widely used in linear signal processing when fast convergence is of prime importance. A good introduction to RLS is given in [2], whose notation I follow. An online algorithm in the augmented ESN setting should do the following: given an open-ended, typically non-stationary training I/O sequence (u_teach(n), y_teach(n)), at each time n >= 1 determine an augmented output weight vector w^out_squares(n) which yields a good model of the current teacher system.

Formally, an RLS algorithm for ESN output weight update minimizes the exponentially discounted square \"pre-error\"

sum_{k=1}^{n} λ^{n-k} ((f^out)^{-1} y_teach(k) - (f^out)^{-1} y_[n](k))^2,        (6)

where λ < 1 is the forgetting factor and y_[n](k) is the model output that would be obtained at time k if a network with the current output weights w^out_squares(n) were employed at all times k = 1, ..., n.

There are many variants of RLS algorithms minimizing (6), differing in their trade-offs between computational cost, simplicity, and numerical stability. I use a \"vanilla\" version, which is detailed in Table 12.1 in [2] and in the web tutorial package accompanying this paper.

Two parameters characterise the tracking performance of an RLS algorithm: the misadjustment M and the convergence time constant τ. The misadjustment gives the ratio between the excess MSE (or excess NMSE) incurred by the fluctuations of the adaptation process, and the optimal steady-state MSE that would be obtained in the limit of offline training on infinite stationary training data.
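For concreteness, one step of such a vanilla RLS variant, together with the design-time predictions of misadjustment and time constant used below, can be sketched as follows. This is a textbook-style Python illustration, not the notebook code accompanying the paper; the function names and the initialization convention are my own.

```python
import numpy as np

def rls_step(w, P, x, d, lam=0.995):
    # One vanilla RLS step minimizing the discounted squared error (6):
    # w : tap-weight vector, P : inverse correlation matrix,
    # x : regressor (augmented state) vector, d : desired pre-signal.
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    e = d - w @ x                    # a priori error
    w = w + k * e                    # weight update
    P = (P - np.outer(k, Px)) / lam  # Riccati update of P
    return w, P

def design(lam, n_taps):
    # Predicted misadjustment and time constant from standard RLS
    # analysis: M = N(1-lam)/(1+lam), tau = 1/(1-lam).
    return n_taps * (1 - lam) / (1 + lam), 1.0 / (1 - lam)
```

Initializing with w = 0 and P = delta^{-1} I for a small delta is the usual convention from [2]; for lam = 0.995 and a 202-tap vector, design() reproduces the M ~ 0.5 and τ ~ 200 used in the case study below.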
For instance, a misadjustment of M = 0.3 means that the tracking error of the adaptive algorithm in a steady-state situation exceeds the theoretically achievable optimum (with the same tap-weight vector length) by 30 %. The time constant τ associated with an RLS algorithm determines the exponent of the MSE convergence, e^{-n/τ}. For example, τ = 200 would imply an excess MSE reduction by 1/e every 200 steps. Misadjustment and convergence exponent are related to the forgetting factor λ and the tap-vector length N through

M = N (1 - λ) / (1 + λ)    and    τ ~ 1 / (1 - λ).        (7)

4.2 Case study: RLS-ESN for our 10th-order system

Eqns. (7) can be used to predict/design the tracking characteristics of an RLS-powered ESN. I will demonstrate this with the 10th-order system (5). I re-use the same augmented 100-unit ESN, but now determine its (2N+2)-element output weight vector online with RLS. Setting λ = 0.995, and considering a tap-vector length N = 202, Eqns. (7) yield a misadjustment of M ~ 0.5 and a time constant τ ~ 200. Since the asymptotically optimal NMSE is approximately the NMSE of the offline-trained network, namely NMSE ~ 0.032, the misadjustment M = 0.5 lets us expect a NMSE of 0.032 x 150 % ~ 0.048 for the online adaptation after convergence. The time constant τ ~ 200 makes us expect NMSE convergence toward the expected asymptotic NMSE by a factor of 1/e every 200 steps.

Training data. Experiments with the system (5) revealed that the system sometimes explodes when driven with i.i.d. input from [0, 0.5]. To bound outputs, I wrapped the r.h.s. of (5) with a tanh. Furthermore, I replaced the original constants 0.3, 0.05, 1.5, 0.1 by free parameters α, β, γ, δ, to obtain

d(n+1) = tanh(α d(n) + β d(n) [sum_{i=0}^{9} d(n-i)] + γ u(n-9) u(n) + δ).        (8)

This system was run for 10000 steps with an i.i.d. teacher input from [0, 0.5].
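The bounded teacher system (8) is easy to reproduce. Below is an illustrative Python version with the four constants exposed as parameters; the function name and seed are my own choices, and the episode-wise re-drawing of the parameters is described next in the text.

```python
import numpy as np

def narma10_bounded(u, alpha=0.3, beta=0.05, gamma=1.5, delta=0.1):
    # Eq. (8): the 10th-order NARMA system (5), wrapped in tanh
    # so that the output stays bounded for i.i.d. input in [0, 0.5].
    d = np.zeros(len(u))
    for n in range(9, len(u) - 1):
        d[n + 1] = np.tanh(alpha * d[n]
                           + beta * d[n] * d[n - 9:n + 1].sum()
                           + gamma * u[n - 9] * u[n]
                           + delta)
    return d

rng = np.random.default_rng(0)
d = narma10_bounded(rng.uniform(0.0, 0.5, 10000))
```

Note that d(n+1) depends on the last ten outputs d(n), ..., d(n-9) and on an input from ten steps back, which is what makes the identification task a genuine test of the network's short-term memory.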
Every 2000 steps, α, β, γ, δ were assigned new random values taken from a +/- 50 % interval around the respective original constants. Fig. 2A shows the resulting teacher output sequence, which clearly shows transitions between different \"episodes\" every 2000 steps.

Running the RLS-ESN algorithm. The ESN was started from a zero state and with a zero augmented output weight vector. It was driven by the teacher input, and a noise of size 0.0001 was inserted into the state update, as in the offline training. The RLS algorithm (with forgetting factor 0.995) was initialized according to the prescriptions given in [2] and then run together with the network updates, to compute from the augmented input+network states x_squares(n) = (u(n), x_1(n), ..., x_N(n), u^2(n), x_1^2(n), ..., x_N^2(n)) a sequence of augmented output weight vectors w^out_squares(n). These output weight vectors were used to calculate a network output y(n) = tanh(w^out_squares(n) x_squares(n)).

Results. From the resulting length-10000 sequences of desired outputs d(n) and network productions y(n), NMSEs were numerically estimated by averaging within subsequent length-100 blocks. Fig. 2B gives a logarithmic plot.

In the last three episodes, the exponential NMSE convergence after each episode-onset disruption is clearly recognizable. Also the convergence speed matches the predicted time constant, as revealed by the τ = 200 slope line inserted in Fig. 2B.

The dotted horizontal line in Fig. 2B marks the NMSE of the offline-trained ESN described in the previous section. Surprisingly, after convergence, the online NMSE is lower than the offline NMSE. This can be explained through the IIR (autoregressive) nature of the systems (5) and (8), which incurs long-term correlations in the signal d(n), or in other words, a nonstationarity of the signal on the timescale of the correlation lengths, even with fixed parameters α, β, γ, δ.
This medium-term nonstationarity compromises the performance of the offline algorithm, but the online adaptation can to a certain degree follow this nonstationarity.

Fig. 2C is a logarithmic plot of the development of the mean absolute output weight size. It is apparent that after starting from zero, there is an initial exponential growth of the absolute values of the output weights, until a stabilization at a size of about 1000, whereafter the NMSE develops a regular pattern (Fig. 2B).

Finally, Fig. 2D shows an overlay of d(n) (solid) with y(n) (dotted) of the last 100 steps in the experiment, visually demonstrating the precision after convergence.

A note on noise and stability. Standard offline training of ESNs yields output weights whose absolute size depends on the noise inserted into the network during training: the larger the noise, the smaller the mean output weights (extensive discussion in [5]). In online training, a similar inverse correlation between output weight size (after settling on a plateau) and noise size can be observed. When the online learning experiment was repeated otherwise identically but without noise insertion, the weights grew so large that the RLS algorithm entered a region of numerical instability. Thus, the noise term is crucial here for numerical stability, a condition familiar from EKF-based RNN training schemes [3], which are computationally closely related to RLS.

Figure 2: A. Teacher output. B. NMSE with predicted baseline and slope line. C. Development of weights. D. Last 100 steps: desired (solid) and network-predicted (dashed) signal. For details see text.
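Putting the pieces of Sections 2-4 together, the online experiment can be sketched compactly in executable form. The following is my own illustrative Python reconstruction, not the original notebook: the teacher parameters are held fixed rather than re-drawn every 2000 steps, the run is shortened to 5000 steps, and the seed and RLS initialization constant are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def teacher(u):
    # Tanh-bounded 10th-order NARMA teacher, Eq. (8), constants fixed.
    d = np.zeros(len(u))
    for n in range(9, len(u) - 1):
        d[n + 1] = np.tanh(0.3 * d[n]
                           + 0.05 * d[n] * d[n - 9:n + 1].sum()
                           + 1.5 * u[n - 9] * u[n] + 0.1)
    return d

# Fixed reservoir, sizes and scalings as in Section 3.
N = 100
W = rng.uniform(-1.0, 1.0, (N, N)) * (rng.random((N, N)) < 0.05)
W *= 0.8 / max(abs(np.linalg.eigvals(W)))
w_in = rng.uniform(-0.1, 0.1, N)

T = 5000
u = rng.uniform(0.0, 0.5, T)
d = teacher(u)

lam = 0.995              # forgetting factor
dim = 2 * N + 2          # augmented tap-vector length
w = np.zeros(dim)        # output weights, started from zero
P = np.eye(dim) * 100.0  # RLS inverse-correlation matrix
x = np.zeros(N)
y = np.zeros(T)

for n in range(T - 1):
    v = rng.uniform(-1e-4, 1e-4, N)           # stabilizing state noise
    x = np.tanh(W @ x + w_in * u[n + 1] + v)  # Eq. (1)
    z = np.concatenate(([u[n + 1]], x, [u[n + 1] ** 2], x ** 2))
    target = np.arctanh(d[n + 1])             # teacher pre-signal
    Pz = P @ z                                # vanilla RLS step on (6)
    k = Pz / (lam + z @ Pz)
    w += k * (target - w @ z)
    P = (P - np.outer(k, Pz)) / lam
    y[n + 1] = np.tanh(w @ z)                 # output as in Eq. (4)

err = y[1000:] - d[1000:]
nmse = np.mean(err ** 2) / np.var(d[1000:])
```

With a stationary teacher, the NMSE of such a run should settle near the misadjustment-corrected offline level discussed above, although the exact figure depends on the random reservoir.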
5 Discussion

Several of the well-known error-gradient-based RNN training algorithms can be used for online weight adaptation. The update costs per time step in the most efficient of those algorithms (overview in [1]) are O(N^2), where N is the network size. Typically, standard approaches train small networks (order of N = 20), whereas ESNs typically rely on large networks for precision (order of N = 100). Thus, the RLS-based ESN online learning algorithm is typically more expensive than standard techniques. However, this drawback might be compensated by the following properties of RLS-ESN:

- Simplicity of design and implementation; robust behavior with little need for hand-tuning of learning parameters.

- Custom design of RLS-ESNs with prescribed tracking parameters, transferring well-understood linear-systems methods to nonlinear systems.

- Systems with long-lasting short-term memory can be learnt. Exploitable ESN memory spans grow with network size (analysis in [6]). Consider the 30th-order system d(n+1) = tanh(0.2 d(n) + 0.04 d(n) [sum_{i=0}^{29} d(n-i)] + 1.5 u(n-29) u(n) + 0.001). It was learnt by a 400-unit augmented adaptive ESN with a test NMSE of 0.0081. The 51st (!) order system y(n+1) = u(n-10) u(n-50) was learnt offline by a 400-unit augmented ESN with a NMSE of 0.213.³

All in all, on the kind of tasks considered above, adaptive (augmented) ESNs reach a similar level of precision as today's most refined gradient-based techniques. A given level of precision is attained in ESN vs. gradient-based techniques with a similar number of trainable weights (D. Prokhorov, private communication). Because gradient-based techniques train every connection weight in the RNN, whereas ESNs train only the output weights, the numbers of units of similarly performing standard RNNs vs. ESNs relate as N to N^2.

³ See the Mathematica notebook for details.
Thus, RNNs are more compact than equivalent ESNs. However, when working with ESNs, for each new trained output signal one can re-use the same \"reservoir\", adding only N new connections and weights. This has for instance been exploited for robots at the AIS institute by simultaneously training multiple feature detectors from a single \"reservoir\" [4]. In this circumstance, with a growing number of simultaneously required outputs, the requisite net model sizes for ESNs vs. traditional RNNs become asymptotically equal. The size disadvantage of ESNs is further balanced by much faster offline training, greater simplicity, and the general possibility of exploiting linear-systems expertise for nonlinear adaptive modeling.

Acknowledgments. The results described in this paper were obtained while I worked at the Fraunhofer AIS Institute. I am greatly indebted to Thomas Christaller for unfaltering support. Wolfgang Maass and Danil Prokhorov contributed motivating discussions and valuable references. An international patent application for the ESN technique was filed on October 13, 2000 (PCT/EP01/11490).

References

[1] A.F. Atiya and A.G. Parlos. New results on recurrent network training: Unifying the algorithms and accelerating convergence. IEEE Trans. Neural Networks, 11(3):697-709, 2000.

[2] B. Farhang-Boroujeny. Adaptive Filters: Theory and Applications. Wiley, 1998.

[3] L.A. Feldkamp, D.V. Prokhorov, C.F. Eagen, and F. Yuan. Enhanced multi-stream Kalman filter training for recurrent neural networks. In J.A.K. Suykens and J. Vandewalle, editors, Nonlinear Modeling: Advanced Black-Box Techniques, pages 29-54. Kluwer, 1998.

[4] J. Hertzberg, H. Jaeger, and F. Schoenherr. Learning to ground fact symbols in behavior-based robots. In F. van Harmelen, editor, Proc. 15th Europ. Conf. on Art. Int. (ECAI 02), pages 708-712. IOS Press, Amsterdam, 2002.
\n\nThe \"echo state\" approach \n\nJaeger. \nrecurrent neural networks. \n\n[5] H. \ning \nman National Research \nhttp://www.gmd.de/People/Herbert.Jaeger/Publications.html. \n\nInstitute \n\nto analysing and \n-\n\nGMD Report 148, GMD \nScience, \n\nfor Computer \n\ntrain(cid:173)\nGer-\n2001. \n\n[6] H. Jaeger. Short term memory in echo state networks. GMD-Report 152, \nGMD - German National Research Institute for Computer Science, 2002. \nhttp://www.gmd.de/People/Herbert.Jaeger/Publications.html. \n\n[7] H. Jaeger. Tutorial on training recurrent neural networks, covering BPPT, \nRTRL, EKF and the echo state network approach. GMD Report 159, Fraunhofer \nInstitute AIS , 2002. \n\n[8] W. Maass, T. Natschlaeger, and H. Markram. Real-time computing without \nstable states: A new framework for neural computation based on perturbations. \nhttp://www.cis.tugraz.at/igi/maass/psfiles/LSM-vl06.pdf. 2002. \n\n[9] W. Maass, Th. NatschHiger, and H. Markram. A model for real-time compu(cid:173)\ntation in generic neural microcircuits. In S. Becker, S. Thrun, and K. Ober(cid:173)\nmayer, editors, Advances in Neural Information Processing System 15 (Proc. \nNIPS 2002). MIT Press, 2002. \n\n\f", "award": [], "sourceid": 2318, "authors": [{"given_name": "Herbert", "family_name": "Jaeger", "institution": null}]}