{"title": "Learning Cellular Automaton Dynamics with Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 631, "page_last": 638, "abstract": "", "full_text": "Learning Cellular Automaton Dynamics with Neural Networks \n\nN H Wulff* and J A Hertz\u2020 \n\nCONNECT, the Niels Bohr Institute and Nordita \nBlegdamsvej 17, DK-2100 Copenhagen \u00d8, Denmark \n\n*Present address: NEuroTech A/S, Copenhagen, Denmark \n\u2020Address until October 1993: Laboratory of Neuropsychology, NIMH, Bethesda MD 20892. email: hertz@nordita.dk \n\nAbstract \n\nWe have trained networks of \u03a3-\u03a0 units with short-range connections to simulate simple cellular automata that exhibit complex or chaotic behaviour. Three levels of learning are possible (in decreasing order of difficulty): learning the underlying automaton rule, learning asymptotic dynamical behaviour, and learning to extrapolate the training history. The levels of learning achieved with and without weight sharing for different automata provide new insight into their dynamics. \n\n1 INTRODUCTION \n\nNeural networks have been shown to be capable of learning the dynamical behaviour exhibited by chaotic time series composed of measurements of a single variable among many in a complex system [1, 2, 3]. In this work we consider instead cellular automaton (CA) arrays [4], a class of many-degree-of-freedom systems which exhibits very complex dynamics, including universal computation. We would like to know whether neural nets can be taught to imitate these dynamics, both locally and globally. \n\nOne could say we are turning the usual paradigm for studying such systems on its head. Conventionally, one is given the rule by which each automaton updates its state, and the (nontrivial) problem is to find what kind of global dynamical behaviour results.
Here we suppose that we are given the history of some CA, and we would like, if possible, to find the rule that generated it. \n\nWe will see that a network can have different degrees of success in this task, depending on the constraints we place on the learning. Furthermore, we will be able to learn something about the dynamics of the automata themselves from knowing what level of learning is possible under what constraints. \n\nThis note reports some preliminary investigations of these questions. We study only the simplest automata that produce chaotic or complex dynamic behaviour. Nevertheless, we obtain some nontrivial results which lead to interesting conjectures for future investigation. \n\nA CA is a lattice of formal computing units, each of which is characterized by a state variable S_i(t), where i labels the site in the lattice and t is the (digital) time. Every such unit updates itself according to a particular rule or function f( ) of its own state and that of the other units in its local neighbourhood. The rule is the same for all units, and the updatings of all units are simultaneous. \n\nDifferent models are characterized by the nature of the state variable (e.g. binary, continuous, vector, etc), the dimensionality of the lattice, and the choice of neighbourhood. In the two cases we study here, the neighbourhoods are of size N = 3, consisting of the unit itself and its two immediate neighbours on a chain, and N = 9, consisting of the unit itself and its 8 nearest neighbours on a square lattice (the 'Moore neighbourhood'). We will consider only binary units, for which we take S_i(t) = \u00b11. Thus, if the neighbourhood (including the unit itself) includes N sites, f( ) is a Boolean function on the N-hypercube. There are 2^{2^N} such functions. \n\nWolfram [4] has divided the rules for such automata into four classes: \n\n1. Class 1: rules that lead to a uniform state. \n2. 
Class 2: rules that lead to simple stable or periodic patterns. \n3. Class 3: rules that lead to chaotic patterns. \n4. Class 4: rules that lead to complex, long-lived transient patterns. \n\nRules in the fourth class lie near (in a sense not yet fully understood [5]) a critical boundary between classes 2 and 3. They lead eventually to asymptotic behaviour in class 2 (or possibly 3); what distinguishes them is the length of the transient. It is classes 3 and 4 that we are interested in here. \n\nMore specifically, for class 3 we expect that after the (short) initial transients, the motion is confined to some sort of attractor. Different attractors may be reached for a given rule, depending on initial conditions. For such systems we will focus on the dynamics on these attractors, not on the short transients. We will want to know what we can learn from a given history about the attractor characterizing it, about the asymptotic dynamics of the system generally (i.e. about all attractors), and, if possible, about the underlying rule. \n\nFor class 4 CA, in contrast, only the transients are of interest. Different initial conditions will give rise to very different transient histories; indeed, this sensitivity is the dynamical basis for the capability for universal computation that has been proved for some of these systems. Here we will want to know what we can learn from a portion of such a history about its future, as well as about the underlying rule. \n\n2 REPRESENTING A CA AS A NETWORK \n\nAny Boolean function of N arguments can be implemented by a \u03a3-\u03a0 unit of order P \u2264 N with a threshold activation function, i.e. there exist weights w_{j_1 j_2 ... j_P} such that \n\nf(S_1, S_2, ..., S_N) = sgn[ \u2211_{j_1 j_2 ... j_P} w_{j_1 j_2 ... j_P} S_{j_1} S_{j_2} ... S_{j_P} ] . (1) \n\nThe indices j_k run over the sites in the neighbourhood (1 to N) and zero, which labels a constant formal bias unit S_0 = 1. Because the updating rule we are looking for is the same for the entire lattice, the weight w_{j_1 ... j_P} does not depend on i. Furthermore, because of the discrete nature of the outputs, the weights that implement a given rule are not unique; rather, there is a region of weight space for each rule. \n\nAlthough we could work with other architectures, it is natural to study networks with the same structure as the CA to be simulated. We therefore make a lattice of formal \u03a3-\u03a0 neurons with short-range connections, which update themselves according to \n\nV_i(t+1) = g[ \u2211_{j_1 ... j_P} w_{j_1 ... j_P} V_{j_1}(t) ... V_{j_P}(t) ] . (2) \n\nIn these investigations, we have assumed that we know a priori what the relevant neighbourhood size is, thereby fixing the connectivity of the network. At the end of the day, we will take the limit where the gain of the activation function g becomes infinite. However, during learning we use finite gain and continuous-valued units. \n\nWe know that the order P of our \u03a3-\u03a0 units need not be higher than the neighbourhood size N. However, in most cases a smaller P will do. More precisely, a network with any P > N/2 can in principle (i.e. given the right learning algorithm and sufficient training examples) implement almost all possible rules. This is an asymptotic result for large N but is already quite accurate for N = 3, where only two of the 256 possible rules are not implementable by a second-order unit, and N = 5, where we found from simple learning experiments that 99.87% of 10000 randomly-chosen rules could be implemented by a third-order unit.
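As a concrete illustration of the representation (1) (our own example, not from the paper): Wolfram rule 90 updates each site to -S_{i-1} S_{i+1} in the \u00b11 coding, so a single hand-picked second-order weight suffices. A minimal sketch:

```python
# A second-order Sigma-Pi unit realizing Wolfram rule 90; the weight value is
# a hand-picked assumption, and index 0 is the formal bias unit S_0 = 1.
from itertools import product

def sigma_pi(neigh, weights):
    # neigh = (S_0, S_1, S_2, S_3); sum the second-order terms, then threshold
    s = sum(w * neigh[j1] * neigh[j2] for (j1, j2), w in weights.items())
    return 1 if s > 0 else -1

def rule90(left, centre, right):
    return -left * right              # XOR of the two outer neighbours

weights = {(1, 3): -1.0}              # one product term implements the rule
for s1, s2, s3 in product((-1, 1), repeat=3):
    assert sigma_pi((1, s1, s2, s3), weights) == rule90(s1, s2, s3)
```

The product term S_1 S_3 expresses the XOR that no first-order (perceptron-like) unit could, which is why order P > 1 matters here.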
\n\n3 LEARNING \n\nHaving chosen a suitable value of P, we can begin our main task: training the network to simulate a CA, with the training examples {S_i(t) \u2192 S_i(t+1)} taken from a particular known history. \n\nThe translational invariance of the CA suggests that weight sharing is appropriate in the learning algorithm. On the other hand, we can imagine situations in which we did not possess a priori knowledge that the CA rule was the same for all units, or where we only had access to the automaton state in one neighbourhood. This case is analogous to the conventional time-series extrapolation paradigm, where we typically only have access to a few variables in a large system. The difference is that here the accessible variables are binary rather than continuous. In such situations we must learn without each unit having access to error information at other units. In what follows we will perform the training both with and without weight sharing. The differences in what can be learned in the two cases will give interesting information about the CA dynamics being simulated. \n\nMost of our results are for chaotic (class 3) CA. For these systems, the training history is taken after initial transients have died out. Thus many of the 2^N possible examples necessary to specify the rule at each site may be missing from the training set, and it is possible that our training procedure will not result in the network learning the underlying rule of the original system. It might instead learn another rule that coincides with the true one on the training examples. This is even more likely if we are not using weight sharing, because then a unit at one site does not have access to examples from the training history at other sites.
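To make the possibility of missing examples concrete, the following sketch (our own construction; rule 90, the chain length and the seed are arbitrary choices) generates a post-transient history and counts how many of the 2^3 = 8 neighbourhood configurations it actually contains:

```python
# Count which of the 8 possible N = 3 neighbourhood configurations occur in a
# post-transient history of rule 90; rule choice, lattice size and seed are
# illustrative assumptions, not taken from the paper.
import random

def step(state, rule_table):
    n = len(state)
    return [rule_table[(state[i - 1], state[i], state[(i + 1) % n])]
            for i in range(n)]

# Wolfram rule 90 as a lookup table in the +-1 coding: S_i(t+1) = -S_{i-1} S_{i+1}
rule_table = {(a, b, c): -a * c
              for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)}

random.seed(0)
state = [random.choice((-1, 1)) for _ in range(60)]
for _ in range(100):                      # discard the initial transient
    state = step(state, rule_table)

seen = set()
for _ in range(1000):                     # scan a 1000-step training history
    seen.update((state[i - 1], state[i], state[(i + 1) % len(state)])
                for i in range(len(state)))
    state = step(state, rule_table)
```

Any configuration absent from `seen` is one on which the training data place no constraint at all, so a learned rule is free to disagree with the true one there.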
\n\nHowever, we may relax our demand on the network, asking only that it evolve exactly like the original system when it is started in a configuration the original system could be in after transients have died out (i.e. on an attractor of the original system). Thus we are restricting the test set in a way that is \"fairer\" to the network, given the instruction it has received. \n\nOf course, if the CA has more than one attractor, several rules which yield the same evolution on one attractor need not do so on another one. It is therefore possible that a network can learn the attractor of the training history (i.e. it will simulate the original system correctly on a part of the history subsequent to the training sequence) but will not be found to evolve correctly when tested on data from another attractor. \n\nFor class 4 automata, we cannot formulate the distinctions between different levels of learning meaningfully in terms of attractors, since the object of interest is the transient portion of the history. Nevertheless, we can still ask whether a network trained on part of the transient can learn the full rule, whether it can simulate the dynamics for other initial conditions, or whether it can extrapolate the training history. \n\nWe therefore distinguish three degrees of successful learning: \n\n1. Learning the rule, where the network evolves exactly like the original system from any initial configuration. \n\n2. Learning the dynamics, the intermediate case where the network can simulate the original system exactly after transients, irrespective of initial conditions, despite not having learned the full rule. \n\n3. Learning to continue the dynamics, where the successful simulation of the original system is only achieved for the particular initial condition used to generate the training history.
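Before turning to the algorithm itself, a toy error-driven version of the training task may help fix ideas. Everything here is our own simplified construction: rule 90 as the target, a second-order shared-weight \u03a3-\u03a0 chain, a finite-gain tanh activation, and arbitrary learning rate, gain and seed.

```python
# Toy error-driven training of a shared-weight, second-order Sigma-Pi chain on
# a rule-90 history; eta, the gain and the seed are illustrative assumptions.
import math
import random

pairs = [(a, b) for a in range(4) for b in range(a, 4)]   # index 0 = bias unit
eta, gain = 0.1, 4.0
random.seed(1)
w = {p: random.uniform(-0.1, 0.1) for p in pairs}         # shared by all sites

def neigh(state, i):
    return (1, state[i - 1], state[i], state[(i + 1) % len(state)])

def v_out(nb):
    return math.tanh(gain * sum(w[p] * nb[p[0]] * nb[p[1]] for p in pairs))

state = [random.choice((-1, 1)) for _ in range(60)]
seen = set()
for _ in range(200):                                      # 200-step history
    target = [-state[i - 1] * state[(i + 1) % 60] for i in range(60)]  # rule 90
    for i in range(60):
        nb = neigh(state, i)
        seen.add(nb[1:])
        v = v_out(nb)
        if target[i] * v < 0:        # update only at misclassified sites
            for p in pairs:
                w[p] += eta * (target[i] - v) * nb[p[0]] * nb[p[1]]
    state = target

# count mismatches on the neighbourhood configurations seen during training
errs = sum(1 for cfg in seen
           if (1 if v_out((1,) + cfg) > 0 else -1) != -cfg[0] * cfg[2])
```

After training, the network reproduces the rule on every configuration that occurred in the history; whether it has learned the full rule depends on whether all 8 configurations appeared there.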
\n\nOur networks are recurrent, but because they have no hidden units, they can be trained by a simple variant of the delta-rule algorithm. It can be obtained formally from gradient descent on a modified cross entropy \n\nE = (1/2) \u2211_{i,t} [ (1 + S_i(t)) log((1 + S_i(t))/(1 + V_i(t))) + (1 - S_i(t)) log((1 - S_i(t))/(1 - V_i(t))) ] \u03b8[-S_i(t)V_i(t)] . (3) \n\nWe used the online version: \n\n\u0394w_{i j_1 ... j_P} = \u03b7 \u03b8[-S_i(t+1)V_i(t+1)] [S_i(t+1) - V_i(t+1)] V_{j_1}(t) ... V_{j_P}(t) . (4) \n\nThis is like an extension of the Adatron algorithm [6] to \u03a3-\u03a0 units, but with the added feature that we are using a nonlinear activation function. \n\nThe one-dimensional N = 3 automata we simulated were the 9 legal chaotic ones identified by Wolfram [4]. Using his system for labeling the rules, these are rules 18, 22, 54, 90, 122, 126, 146, 150, and 182. We used networks of order P = 3 so that all rules were learnable. (Rule 150 would not have been learnable by a second-order net.) Each network was a chain 60 units long, subjected to periodic boundary conditions. \n\nThe training histories {S_i(t)} were 1000 steps long, beginning 100 steps after randomly chosen initial configurations. To test for learning the rules, all neighbourhood configurations were checked at every site. To test for learning the dynamics, the CA were reinitialized with different random starting configurations and run 100 steps to eliminate transients, after which new test histories of length 100 steps were constructed. Networks were then tested on 100 such histories. The test set for continuing the dynamics was made simply by allowing the CA that had generated the training set to continue for 100 more steps. \n\nThere are no class 4 CA among the one-dimensional N = 3 systems.
As an example of such a rule, we chose the Game of Life, which is defined on a square lattice with a neighbourhood size N = 9 and has been proved capable of universal computation (see, e.g. [7, 8]). We worked with a lattice of 60 x 60 units. \n\nThe training history for the Game of Life consisted of 200 steps in the transient. The trained networks were tested, as in the case of the chaotic one-dimensional systems, on all possible configurations at every site (learning the rule), on other transient histories generated from different initial conditions (learning the dynamics), and on the evolution of the original system immediately following the training history (learning to continue the dynamics). \n\n4 RESULTS \n\nWith weight sharing, it proved possible to learn the dynamics for all 9 of the one-dimensional chaotic rules very easily. In fact, it took no more than 10 steps of the training history to achieve this. \n\nLearning the underlying rules proved harder. After training on the histories of 1000 steps, the networks were able to do so in only 4 of the 9 cases. No qualitative difference between the two groups of patterns is evident to us from looking at their histories (Fig. 1). Nevertheless, we conclude that their ergodic properties must be different, at least quantitatively. \n\nLife was also easy with weight sharing. Our network succeeded in learning the underlying rule starting almost anywhere in the long transient. \n\nFigure 1: Histories of the 4 one-dimensional rules that could be learned (top) and the 5 that could not (bottom). (Learning with weight sharing.) \n\nWithout weight sharing, all learning naturally proved more difficult. While it was possible to learn to continue the dynamics for all the one-dimensional chaotic rules, it proved impossible except in one case (rule 22) to learn the dynamics within the training history of 1000 steps.
The networks failed on about 25% of the test histories. It was never possible to learn the underlying rule. Thus, apparently these chaotic states are not as homogeneous as they appear (at least on the time scale of the training period). \n\nLife is also difficult without weight sharing. Our network was unable even to continue the dynamics from histories of several hundred steps in the transient (Fig. 2). \n\n5 DISCUSSION \n\nIn previous studies of learning chaotic behaviour in single-variable time series (e.g. [1, 2, 3]), the test to which networks have been put has been to extrapolate the training series, i.e. to continue the dynamics. We have found that this is also possible in cellular automata for all the chaotic rules we have studied, even when only local information about the training history is available to the units. Thus, the CA evolution history at any site is rich enough to permit error-free extrapolation. \n\nHowever, local training data are not sufficient (except in one system, rule 22) to permit our networks to pass the more stringent test of learning the dynamics. Thus, viewed from any single site, the different attractors of these systems are dissimilar enough that data from one do not permit generalization to another. \n\nFigure 2: The original Game of Life CA (left) and the network (right), both 20 steps after the end of the training history. (Training done without weight sharing.) \n\nWith the access to training data from other sites implied by weight sharing, the situation changes dramatically.
Learning the dynamics is then very easy, implying that all possible asymptotic local dynamics that could occur for any initial condition actually do occur somewhere in the system in any given history. \n\nFurthermore, with weight sharing, not only the dynamics but also the underlying rule can be learned for some rules. This suggests that these rules are ergodic, in the sense that all configurations occur somewhere in the system at some time. This division of the chaotic rules into two classes according to this global ergodicity is a new finding. \n\nTurning to our class 4 example, Life proves to be impossible without weight sharing, even by our most lenient test, continuing the dynamics. Thus, although one might be tempted to think that the transient in Life is so long that it can be treated operationally as if it were a chaotic attractor, it cannot. For real chaotic attractors, both in the CA studied here and in continuous dynamical systems, networks can learn to continue the dynamics on the basis of local data, while in Life they cannot. \n\nOn the other hand, the result that the rule of Life is easy to learn with weight sharing implies that, looked at globally, the history of the transient is quite rich. Somewhere in the system, it contains sufficient information (together with the a priori knowledge that a second-order network is sufficient) to allow us to predict the evolution from any configuration correctly. \n\nThis study is a very preliminary one and raises more questions than it answers. We would like to know whether the results we have obtained for these few simple systems are generic to complex and chaotic CA. To answer this question we will have to study systems in higher dimensions and with larger updating neighbourhoods. Perhaps significant universal patterns will only begin to emerge for large neighbourhoods (cf [5]). However, we have identified some questions to ask about these problems.
\n\nReferences \n\n[1] A Lapedes and R Farber, Nonlinear Signal Processing Using Neural Networks: Prediction and System Modelling, Tech Rept LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos NM USA \n[2] A S Weigend, B A Huberman and D E Rumelhart, Int J Neural Systems 1 193-209 (1990) \n[3] K Stokbro, D K Umberger and J A Hertz, Complex Systems 4 603-622 (1991) \n[4] S Wolfram, Theory and Applications of Cellular Automata (World Scientific, 1986) \n[5] C G Langton, pp 12-37 in Emergent Computation (S Forrest, ed) MIT Press/North Holland, 1991 \n[6] J K Anlauf and M Biehl, Europhys Letters 10 687 (1989) \n[7] H V McIntosh, Physica D 45 105-121 (1990) \n[8] S Wolfram, Physica D 10 1-35 (1984) \n", "award": [], "sourceid": 703, "authors": [{"given_name": "N.", "family_name": "Wulff", "institution": null}, {"given_name": "J", "family_name": "Hertz", "institution": null}]}