At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

Bertschinger, Nils; Natschläger, Thomas; Legenstein, Robert

At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

Nils Bertschinger, Thomas Natschläger, Robert A. Legenstein

Advances in Neural Information Processing Systems 17 (NIPS 2004)

Abstract

In this paper we analyze the relationship between the computational ca- pabilities of randomly connected networks of threshold gates in the time- series domain and their dynamical properties. In particular we propose a complexity measure which we find to assume its highest values near the edge of chaos, i.e. the transition from ordered to chaotic dynamics. Furthermore we show that the proposed complexity measure predicts the computational capabilities very well: only near the edge of chaos are such networks able to perform complex computations on time series. Ad- ditionally a simple synaptic scaling rule for self-organized criticality is presented and analyzed.

1 Introduction

It has been proposed that extensive computational capabilities are achieved by systems whose dynamics is neither chaotic nor ordered but somewhere in between order and chaos. This has led to the idea of "computation at the edge of chaos". Early evidence for this hypothesis has been reported e.g. in [1]. The results of numerous computer simulations carried out in these studies suggested that there is a sharp transition between ordered and chaotic dynamics. Later on this was confirmed by Derrida and others [2]. They used ideas from statistical physics to develop an accurate mean-field theory which allowed to determine the critical parameters analytically. Because of the physical background, this theory focused on the autonomous dynamics of the system, i.e. its relaxation from an initial state (the input) to some terminal state (the output) without any external influences. In contrast to such "off-line" computations, we will focus in this article on time-series computations, i.e. mappings, also called filters, from a time-varying input signal to a time- varying output signal. Such "online" or real-time computations describe more adequately the input to output relation of systems like animals or autonomous robots which must react in real-time to a continuously changing stream of sensory input.

The purpose of this paper is to analyze how the computational capabilities of randomly connected recurrent neural networks in the domain of real-time processing and the type of dynamics induced by the underlying distribution of synaptic weights are related to each other. In particular, we will show that for the types of neural networks considered in this pa- per (defined in Sec. 2) there also exists a transition from ordered to chaotic dynamics. This phase transition is determined using an extension of the mean-field approach described in [3] and [4] (Sec. 3). As the next step we propose a novel complexity measure (Sec. 4) which

                                                                     input


                                                            network activity


           10


           20

       neuron # 30

                      0     20          40             0           20           40       0                             20        40                                timesteps                           timesteps                              timesteps

                                        0.4

                                        0.2                                                   0.8

                                         0                                                                                                  0.6                                            -0.2                   critical

                                       -0.4     ordered                       chaotic         0.4     mean activity                                            -0.6

                                                0.1           1                10                                                                        2

Figure 1: Networks of randomly connected threshold gates can exhibit ordered, critical and chaotic dynamics. In the upper row examples of the temporal evolution of the network state xt are shown (black: xi,t = 1, white: xi,t = 0, input as indicated above) for three different networks with parameters taken from the ordered, critical and chaotic regime, respectively. Parameters: K = 5, N = 500, u = -0.5, r = 0.3 and and 2 as indicated in the phase plot below. The background of the phase plot shows the mean activity a (see Sec. 3) of the networks depending on the parameters and 2.

can be calculated using the mean-field theory developed in Sec. 3 and serves as a predic- tor for the computational capability of a network in the time-series domain. Employing a recently developed framework for analyzing real-time computations [5, 6] we investigate in Sec. 5 the relationship between network dynamics and the computational capabilities in the time-series domain. In Sec. 6 of this paper we propose and analyze a synaptic scaling rule for self-organized criticality (SOC) for the types of networks considered here. In con- trast to previous work [7], we do not only check that the proposed rule shows adaptation towards critical dynamics, but also show that the computational capabilities of the network are actually increased if the rule is applied.

Relation to previous work: In [5], the so-called liquid state machine (LSM) approach was proposed and used do analyze the computational capabilities in the time-series domain of randomly connected networks of biologically inspired network models (composed of leaky integrate-and-fire neurons). We will use that approach to demonstrate that only near the edge of chaos, complex computations can be performed (see Sec. 5). A similar analysis for a restricted case (zero mean of synaptic weights) of the network model considered in this paper can be found in [4].

2 The Network Model and its Dynamics We consider input driven recurrent networks consisting of N threshold gates with states xi {0, 1}. Each node i receives nonzero incoming weights wij from exactly K randomly chosen nodes j. Each nonzero connection weight wij is randomly drawn from a Gaussian distribution with mean and variance 2. Furthermore, the network is driven by an exter- nal input signal u() which is injected into each node. Hence, in summary, the update of the network state xt = (x1,t, . . . , xN,t) is given by xi,t = ( N w j=1 ij xj,t-1 + ut-1) which is applied to all neurons in parallel and where (h) = 1 if h 0 and (h) = 0 otherwise. In the following we consider a randomly drawn binary input signal u(): at each

time step ut assumes the value u + 1 with probability r and the value u with probability 1 - r. This network model is similar to the one we have considered in [4]. However it differs in two important aspects: a) By using states xi {0, 1} we emphasis the asymmet- ric information encoding by spikes prevalent in biological neural systems and b) it is more general in the sense that the Gaussian distribution from which the non-zero weights are drawn is allowed to have an arbitrary mean R. This implies that the network activity a N t = 1 x N i=1 i,t can vary considerably for different parameters (compare Fig. 1) and enters all the calculations discussed in the rest of the paper.

The top row of Fig. 1 shows typical examples of ordered, critical and chaotic dynamics (see the next section for a definition of order and chaos). The system parameters corresponding to each type of dynamics are indicated in the lower panel (phase plot). We refer to the (phase) transition from the ordered to the chaotic regime as the critical line (shown as the solid line in the phase plot). Note that increasing the variance 2 of the weights consistently leads to chaotic behavior.

3 The Critical Line: Order and Fading Memory versus Chaos

To define the chaotic and ordered phase of an input driven network we use an approach which is similar to that proposed by Derrida and Pomeau [2] for autonomous systems: consider two (initial) network states with a certain (normalized) Hamming distance. These states are mapped to their corresponding successor states (using the same weight matrix) with the same input in each case and the change in the Hamming distance is observed. If small distances tend to grow this is a sign of chaos whereas if the distance tends to decrease this is a signature of order.

Following closely the arguments in [4, 3] we developed a mean-field theory (see [8] for all details) which allows to calculate the update dt+1 = f (dt, at, ut) of the normalized Hamming distance dt = |{i : xi,t = ~ xi,t}|/N between two states xt and ~ xt as well as the update at+1 = A(at, ut) of the network activity in one time step. Note that dt+1 depends on the input ut (in contrast to [3]) and also on the activity at (in contrast to [4]). Hence the two-dimensional map Fu(dt, at) := (dt+1, at+1) = (f (dt, at, ut), A(at, ut)) describes the time evolution of dt and at given the input times series u().

Let us consider the steady state of the averaged Hamming distance f as well as the steady state of the averaged network activity a, i.e. (f , a) = limt F tu .1 If f = 0 we know that any state differences will eventually die out and the network is in the ordered phase. If on the other hand a small difference is amplified and never dies out we have f = 0 and the network is in the chaotic phase. Whether f = 0 or f = 0 can be decided by looking at the slope of the function f (, , ) at its fixed point f = 0. Since at does not depend on dt we calculate the averaged steady state activity a and determine the slope of the map rf (d, a, u + 1) + (1 - r)f (d, a, u) at the point (d, a) = (0, a). Accordingly we say that the network is in the ordered, critical or chaotic regime if < 1, = 1 or > 1 respectively. In [8] it is shown that the so called critical line = 1 where the phase transition from ordered to chaotic behavior occurs is given by

             K-1    K - 1                                                                           1  Pbf =                                 an(1 - a)K-1-n(rQ(1, n,                                                                       u + 1) + (1 - r)Q(1, n,                                                                                                 u)) =         (1)                                n                                                                        K                  n=0

where Pbf denotes the probability (averaged over the inputs and the network activity) that a node will change its output if a single out of its K input bits is flipped.2 Examples of

  1F tu denotes t-fold composition of the map Fu(, ) where in the k-th iteration the input uk is applied and  denotes the average over all possible initial conditions and all input signals with a given statistics determined by                                             u and r.       2The actual single bit-flip probability Q depends on the number n of inputs which are 1 and the



                     K = 5                                                                   K = 10        0.4                                                                        0.4                                                 0.1        0.3                                                                        0.3                                   0.07

   0.2                                                                        0.2                                                 0.08                                                                    0.06

   0.1                                                                        0.1                                                                                                                         0.05           0                                                                         0                                                 0.06       -0.1                                                                      -0.1                                  0.04

  -0.2                                      0.04                             -0.2                                   0.03                                                         NM-Separation                                                             NM-Separation       -0.3                                                                       -0.3                                                                                                                         0.02       -0.4                                      0.02                             -0.4                                                                                                                         0.01       -0.5                                                                       -0.5

  -0.6                                      0                                -0.6                                   0                   0.1      1          10                                                  0.1      1          10                           2                                                                       2

Figure 2: N M -separation assumes high values on the critical line. The gray coded image shows the N M -separation in dependence on and 2 for K denoted in the panels, r = 0.3,

u = -0.5 and b = 0.1. The solid line marks the critical values for and 2.

critical lines that were calculated from this formula (marked by the solid lines) can be seen in Fig. 2 for K = 5 and K = 10.3

The ordered phase can also be described by using the notion of fading memory (see [5] and the references therein). Intuitively speaking in a network with fading memory a state xt is fully determined by a finite history ut-T , ut-T +1, . . . , ut-1, ut of the input u(). A slight reformulation of this property (see [6] and the references therein) shows that it is equivalent to the requirement that all state differences vanish, i.e. being in the ordered phase. Fading memory plays an important role in the "liquid state machine" framework [5] since together with the separation property (see below) it would in principle allow an appropriate readout function to deduce the recent input, or any function of it, from the network state. If on the other hand the network does not have fading memory (i.e. is in the chaotic regime) a given network state xt also contains "spurious" information about the initial conditions and hence it is hard or even impossible to deduce any features of the recent input.

4 NM-Separation as a Predictor for Computational Power

The already mentioned separation property [5] is especially important if a network is to be useful for computations on input time-series: only if different input signals separate the network state, i.e. different inputs result in different states, it is possible for a readout function to respond differently. Hence it is necessary that any two different input time series for which the readout function should produce different outputs drive the recurrent network into two sufficiently different states.

The mean field theory we have developed (see [8]) can be extended to describe the update dt+1 = s(dt, ...) of the Hamming distance that result from applying differ- ent inputs u() and ~ u() with a mean distance of b := Pr {ut = ~ ut}, i.e. the separa- tion. In summary the three-dimensional map Su,~u(dt, at, ~ at) := (dt+1, at+1, ~at+1) = (s(dt, at, ~at, ut, ~ ut), A(at, ut), A(~at, ~ ut)) fully describes the time evolution of the Ham- ming distance and the network activities. Again we consider the steady state of the averaged Hamming distance s and the network activities a, ~ a, i.e. (s, a, ~ a) = limt St . u,~ u

The overall separation for a given input statistics (determined by u, r, and b) is then given by s. However, this overall separation measure can not be directly related to the computa-

external input u and is given by Q(1, n, u) = -u (, n, n2) 1 - (-u - , , 2) d + -

     (, n, n2)(-u - , , 2)d where ,  denote the Gaussian density and cumulative den-  -u

sity respectively (see [8] for a detailed explanation). 3For each value of = -0.6 + k 0.01, k = 0 . . . 100 a search was conducted to find the value for 2 such that = 1. Numerical iterations of the function A were used to determine a.

      A                                                        B                                                                 C                          3bit parity (K = 5)                                          3bit parity (K = 10)                                5bit random boolean functions            0.4                                                               5                                                                                                                                                                                  0.8            0.2                                                                                                       4                                                               4

                                                                                                                                                                             0.6                0                                                                                                     3                                                               3                 -0.2                                                     MC (MI)                                           2    MC (MI)                                                0.4    mean MI                                                               2


      -0.4                                                                                                                                                                   0.2                                                               1                                                      1


      -0.6                                                0                                                      0                                                           0                0.01      0.1      1           10       100                    0.01    0.1      1       10     100                    0.01       0.1     1           10    100                                  2                                                           2                                                       2

Figure 3: Real-time computation at the edge of chaos. A The gray coded image (an in- terpolation between the data points marked with open diamonds) shows the performance of trained networks in dependence of the parameters and 2 for the delayed 3-bit par- ity task. Performance is measured as the memory capacity M C = I(v, y()) where I(v, y()) is the mutual information between the classifier output v() and the target func- tion y() = PARITY(u t t- , ut- -1, ut- -2) measured on a test set. B Same as panel A but for K = 10. C Same as panel A but for an average over 50 randomly drawn Boolean functions f of the last 5 time steps, i.e. yt = f (ut, ut-1, ..., ut-4).

tional power since chaotic networks separate even minor differences in the input to a very high degree. The part of this separation that is caused by the input distance b and not by the distance of some initial state is therefore given by s - f because f measures the state distance that is caused by differences in the initial states and remains even after long runs with the same inputs (see Sec. 3). Note that f is always zero in the ordered phase and non-zero in the chaotic phase.

Since we want the complexity measure, which we will call N M -separation, to be a predictor for computational power we correct s - f by a term which accounts for the separation due to an all-dominant input drive. A suitable measure for this "imme- diate separation" i is the average increase in the Hamming distance if the system is run for a long time (t ) with equal inputs u() = ~ u() and then a single step with an input pair (v, ~ v) with an average difference of b = Pr {v, = ~ v} is applied: i = lim 1 t rv(1-r)1-vb|v-~v|(1-b)1-|v-~v| s(, , , v, ~ v) St -f . Hence v,~ v=0 u,u a measure of the network mediated separation N Msep due to input differences is given by

                                                                          N Msep = s - f  - i                                                                             (2)

In Fig. 2 the N M -separation resulting from an input difference of b = 0.1 is shown in dependence of the network parameters and 2.4 Note that the N M -separation peaks very close to the critical line. Because of the computational importance of the separation property this also suggests that the computational capabilities of the networks will peak at the onset of chaos, which is confirmed in the next section.

5 Real-Time Computations at the Edge of Chaos

To access the computational power of a network we make use of the so called "liquid state machine" framework which was proposed by Maass et.al. [5] and independently by Jaeger [6]. They put forward the idea that any complex time-series computation can be imple- mented by composing a system which consists of two conceptually different parts: a) a

       4For each value of  = -0.6 + k  0.05, k = 0 . . . 20, 10 values for 2 where chosen near the critical line and 10 other values where equally spaced (on a logarithmic scale) over the interval [0.02,50]. For each such pair (, 2) extensive numerical iterations of the map S where performed to obtain accurate estimates of s, f  and i. Hopefully these numerical estimates can be replaced by analytic results in the future.

properly chosen general-purpose recurrent network with "rich" dynamics and b) a read- out function that is trained to map the network state to the desired outputs (see [5, 6, 4] for more details). This approach is potentially successful if the general-purpose network encodes the relevant features of the input signal in the network state in such a way that the readout function can easily extract it. We will show that near the critical line the net- works considered in this paper encode the input in such a way that a simple linear classifier C(xt) = (w xt + w0) suffices to implement a broad range of complex nonlinear fil- ters. Note that in order to train the network for a given task only the parameters w RN , w0 R of the linear classifier are adjusted such that the actual network output vt = C(xt) is as close as possible to the target values yt.

To access the computational power in a principled way networks with different parameters were tested on a delayed 3-bit parity task for increasing delays and on randomly drawn Boolean functions of the last 5 input bits. Note that these tasks are quite complex for the networks considered here since most of them are not linear separable (i.e. the parity function) and require memory. Hence to achieve good performance it is necessary that a state xt contains information about several input bits ut , t < t in a nonlinear transformed form such that a linear classifier C is sufficient to perform the nonlinear computations.

The results are summarized in Fig. 3 where the performance (measured in terms of mutual information) on a test set between the network output and the target signal is shown for various parameter settings (for details see [4]). The highest performance is clearly achieved for parameter values close to the critical line where the phase transition occurs. This has been noted before [1]. In contrast to these previous results the networks used here are not optimized for any specific task but their computational capabilities are assessed by evaluating them for many different tasks. Therefore a network that is specifically designed for a single task will not show a good performance in this setup. These considerations suggest the following hypotheses regarding the computational function of generic recurrent neural circuits: to serve as a general-purpose temporal integrator, and simultaneously as a kernel (i.e., nonlinear projection into a higher dimensional space) to facilitate subsequent (linear) readout of information whenever it is needed.

6 Self-Organized Criticality via Synaptic Scaling Since the computational capabilities of a network depend crucially on having almost critical dynamics an adaptive system should be able to adjust its dynamics accordingly.

Equ. (1) states that critical dynamics are achieved if the probability Pbf that a single bit- flip in the input shows up in the output should on average (over the external and internal input statistics given by u, r and a respectively) be equal to 1 . To allow for a rule that K can adjust the weights of each node a local estimate of Pbf must be available. This can be accomplished by estimating Pbf from the margin of each node, i.e. the distance of the internal activation from the firing threshold. Intuitively a node with an activation that is much higher or lower than its firing threshold is rather unlikely to change its output if a single bit in its input is flipped. Formally P i of node i is given by the average (over the bf internal and external input statistics) of the following quantity:

                1           N                                          (w                     K                           ij (1 - 2xj,t-1)(1 - 2xi,t) - mi,t)             (3)                           j=1,wij =0

where mi,t = N w j=1 ij xj,t-1 + ut-1 denotes the margin of node i (see [8] for details). Each node now applies synaptic scaling to adjust itself towards the critical line. Accord- ingly we arrive at the following SOC-rule:

                                            1     w                        (t) > 1                    w                       1+              ij     if P esti                                                                        bf             K                         ij (t + 1) =                                                            (4)                                            (1 + )  wij(t) if P esti(t) < 1                                                                        bf             K

                            50


                          100

        neuron #                               150


                          200                                                            100                 200            300                   400                 500            600            700                                                                                                      timesteps

B C 1.5 1.5

                           1                                                                                           1

                    bf                                                                                          bf

 K*P                                                                                                     K*P                               0.5                                                                                         0.5




                           0                                                                                           0                                 0           100     200      300      400       500    600    700                           0    100     200    300    400    500    600     700                                                                   timesteps                                                                      timesteps

Figure 4: Self-organized criticality. A Time evolution of the network state xt starting in a chaotic regime while the SOC-rule (4) is active (black: xi,t = 1, white: xi,t = 0). Parameters: N = 500, K = 5, u = -0.5, r = 0.3, = 0 and initial 2 = 100. B Estimated Pbf . The dotted line shows how the node averaged estimate of Pbf evolves over time for the network shown in A. The running average of this estimate (thick black line) as used by the SOC-rule clearly shows that Pbf approaches its critical value (dashed line). C Same as B but for K = 10 and initial 2 = 0.01 in the ordered regime.

where 0 < 1 is the learning rate and P esti(t) is a running average of the formula bf in Equ. (3) to estimate P i . Applying this rule in parallel to all nodes of the network is bf then able to adjust the network dynamics towards criticality as shown in Fig. 45. The upper row shows the time evolution of the network states xt while the SOC-rule (4) is running. It is clearly visible how the network dynamics changes from chaotic (the initial network had the parameters K = 5, = 0 and 2 = 100) to critical dynamics that respect the input signal. The lower row of Fig. 4 shows how the averaged estimated bit-flip probability 1 N P esti(t) approaches its critical value for the case of the above network and one N i=1 bf that started in the ordered regime (K = 10, = 0, 2 = 0.01). Since critical dynamics are better suited for information processing (see Fig. 3) it is ex- pected that the performance on the 3-bit parity task improves due to SOC. This is con- firmed in Fig. 5 which shows how the memory capacity M C (defined in Fig. 3) grows for networks that were initialized in the chaotic and ordered regime respectively. Note that the performance reached by these networks using the SOC-rule (4) is as high as for networks where the critical value for 2 is chosen apriori and stays at this level. This shows that rule (4) is stable in the sense that it keeps the dynamics critical and does not destroy the computational capabilities.

7 Discussion

We developed a mean-field theory for input-driven networks which allows to determine the position of the transition line between ordered and chaotic dynamics with respect to the

                          5Here a learning rate of  = 0.01 and an exponentially weighted running average with a time constant of 15 time steps were used.

A B K = 5, start 2 = 100 (chaotic) K = 10, start 2 = 0.01 (ordered)

MC [bits] 2 2 MC [bits]

  1                                                                               1


  0                                                                               0                 0         500         1000            1500      2000                       0       500         1000             1500       2000                                    SOC steps                                                                 SOC steps

Figure 5: Time evolution of the performance with activated SOC-rule. A The plot shows the memory capacity M C (see Fig. 3) on the 3-bit parity task averaged over 25 networks ( standard deviation as error-bars) evaluated at the indicated time steps. At each evaluation time step the network weights were fixed and the M C was measured as in Fig. 3 by training the corresponding readouts from scratch. The networks were initialized in the chaotic regime. B Same as in A but for K = 10 and networks initialized in the ordered regime.

parameters controlling the network connectivity and input statistics. Based on this theory we proposed a complexity measure (called N M -separation) which assumes its highest values at the critical line and shows a clear correlation with the computational power for real-time time-series processing. These results provide further evidence for the idea of "computation at the edge of chaos" [1] and support the hypothesis that dynamics near the critical line are expected to be a general property of input driven dynamical systems which support complex real-time computations. Therefore our analysis and the proposed complexity measure provide a new approach towards discovering dynamical principles that enable biological systems to do sophisticated information processing.

Furthermore we have shown that a local rule for synaptic scaling is able to adjust the weights of a network towards critical dynamics. Additionally networks adjusting them- selves by this rule have been found to exhibit enhanced computational capabilities. Thereby systems can combine task-specific optimization provided by (supervised) learning rules with self-organization of its dynamics towards criticality. This provides an explanation how specific information can be processed while still being able to react to incoming sig- nals in a flexible way.

Acknowledgement This work was supported in part by the PASCAL project #IST-2002-506778 of the European Community.

Abstract

Name Change Policy