{"title": "Fool's Gold: Extracting Finite State Machines from Recurrent Network Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 501, "page_last": 508, "abstract": null, "full_text": "Fool's Gold: Extracting Finite State Machines From Recurrent Network Dynamics \n\nJohn F. Kolen \n\nLaboratory for Artificial Intelligence Research \nDepartment of Computer and Information Science \nThe Ohio State University \nColumbus, OH 43210 \nkolen-j@cis.ohio-state.edu \n\nAbstract \n\nSeveral recurrent networks have been proposed as representations for the task of formal language learning. After training a recurrent network to recognize a formal language or predict the next symbol of a sequence, the next logical step is to understand the information processing carried out by the network. Some researchers have begun to extract finite state machines from the internal state trajectories of their recurrent networks. This paper describes how sensitivity to initial conditions and discrete measurements can trick these extraction methods into returning illusory finite state descriptions. \n\nINTRODUCTION \n\nFormal language learning (Gold, 1969) has been a topic of concern for cognitive science and artificial intelligence. It is the task of inducing a computational description of a formal language from a sequence of positive and negative examples of strings in the target language. Neural information processing approaches to this problem involve the use of recurrent networks that embody the internal state mechanisms underlying automata models (Cleeremans et al., 1989; Elman, 1990; Pollack, 1991; Giles et al., 1992; Watrous & Kuhn, 1992). 
Unlike traditional automata-based approaches, learning systems relying on recurrent networks have an additional burden: we are still unsure as to what these networks are doing. Some researchers have assumed that the networks are learning to simulate finite state machines (FSMs) in their state dynamics and have begun to extract FSMs from the networks' state transition dynamics (Cleeremans et al., 1989; Giles et al., 1992; Watrous & Kuhn, 1992). These extraction methods employ various clustering techniques to partition the internal state space of the recurrent network into a finite number of regions corresponding to the states of a finite state automaton. \n\nThis assumption of finite state behavior is dangerous on two accounts. First, these extraction techniques are based on a discretization of the state space which ignores the basic definition of information processing state. Second, discretization can give rise to incomplete computational explanations of systems operating over a continuous state space. \n\nSENSITIVITY TO INITIAL CONDITIONS \n\nIn this section, I will demonstrate how sensitivity to initial conditions can confuse an FSM extraction system. The basis of this claim rests upon the definition of information processing state. Information processing (IP) state is the foundation underlying automata theory. Two IP states are the same if and only if they generate the same output responses for all possible future inputs (Hopcroft & Ullman, 1979). This definition is the fulcrum for many proofs and techniques, including finite state machine minimization. Any FSM extraction technique should embrace this definition; in fact, it grounds the standard FSM minimization methods and the physical system modelling of Crutchfield and Young (Crutchfield & Young, 1989). 
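The IP-state definition above is what licenses standard FSM minimization: two states may be merged exactly when no future input can elicit different outputs from them. As an illustration, here is a minimal partition-refinement sketch over a hypothetical Moore machine (the machine, its state names, and its transitions are invented for this example, not taken from the paper):

```python
# Partition refinement over a toy Moore machine: two states share an IP state
# iff they produce identical outputs for all possible future inputs.

def minimize(states, alphabet, delta, out):
    """Return a partition of states into IP-equivalence classes."""
    # Start by separating states with different immediate outputs.
    by_output = {}
    for s in states:
        by_output.setdefault(out[s], []).append(s)
    partition = list(by_output.values())
    while True:
        block_of = {s: i for i, blk in enumerate(partition) for s in blk}
        # Refine: two states stay together only if every input symbol
        # sends them to the same block.
        refined = {}
        for s in states:
            sig = (block_of[s],) + tuple(block_of[delta[s][a]] for a in alphabet)
            refined.setdefault(sig, []).append(s)
        if len(refined) == len(partition):  # no block split: partition is stable
            return partition
        partition = list(refined.values())

# Hypothetical three-state machine in which B and C are indistinguishable.
states = ["A", "B", "C"]
alphabet = ["0", "1"]
delta = {"A": {"0": "B", "1": "C"},
         "B": {"0": "A", "1": "A"},
         "C": {"0": "A", "1": "A"}}
out = {"A": 0, "B": 1, "C": 1}
partition = minimize(states, alphabet, delta, out)
```

Here B and C land in the same block: no input string can tell them apart, so a minimal machine merges them. This is the criterion any extraction method implicitly appeals to when it treats a cluster of network states as a single FSM state.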
\n\nSome dynamical systems exhibit exponential divergence for nearby state vectors, yet remain confined within an attractor. This is known as sensitivity to initial conditions. If this divergent behavior is quantized, it appears as nondeterministic symbol sequences (Crutchfield & Young, 1989) even though the underlying dynamical system is completely deterministic (Figure 1). \n\nConsider a recurrent network with one output and three recurrent state units. The output unit performs a threshold at zero activation for state unit one: when the activation of the first state unit of the current state is less than zero, the output is A; otherwise, the output is B. Equation 1 presents a mathematical description, where S(t) is the current state of the system and O(t) is the current output. \n\nS(t+1) = tanh( W S(t) + b ),  W = [ 0  0  2 ; 0  0  2 ; 2  -2  0 ],  b = ( 1, -1, 1 )^T   (1) \n\nFigure 2 illustrates what happens when you run this network for many iterations. The point in the upper left hand state space is actually a thousand individual points, all within a ball of radius 0.01. In one iteration these points migrate down to the lower corner of the state space. Notice that the ball has elongated along one dimension. After ten iterations the original ball shape is no longer visible. After seventeen, the points are beginning to spread along a two-dimensional sheet within state space. And by fifty iterations, we see the network reaching its full extent in state space. This behavior is known as sensitivity to initial conditions and is one of three conditions which have been used to characterize chaotic dynamical systems (Devaney, 1989). 
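The same phenomenon can be reproduced with the simpler of the Figure 1 maps. The sketch below iterates the logistic map x -> 4x(1-x) and quantizes each state into A (x < 0.5) or B, for two starting points separated by 10^-9; the starting value and iteration horizon are arbitrary choices for illustration, not values from the paper:

```python
# Quantized observation of a deterministic map (the x -> 4x(1-x) system of
# Figure 1). Initial conditions and horizon are illustrative choices.

def symbols(x, n):
    """Iterate the logistic map n times, emitting 'A' if x < 0.5 else 'B'."""
    seq = []
    for _ in range(n):
        seq.append("A" if x < 0.5 else "B")
        x = 4.0 * x * (1.0 - x)
    return "".join(seq)

# Two starting states closer together than any finite-precision measurement.
a = symbols(0.3, 60)
b = symbols(0.3 + 1e-9, 60)
```

The two symbol streams agree at first and then diverge, so an observer who lumps the two starting states into one discrete state must describe the later disagreement as a nondeterministic transition, even though the map is completely deterministic.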
[Figure 1: Examples of deterministic dynamical systems whose discretized trajectories appear nondeterministic: the map x -> 4x(1-x), with output O(x) = A when x < 0.5 and O(x) = B otherwise, and the map x -> 2x mod 1, with outputs A, B, and C on x < 1/3, 1/3 <= x < 2/3, and x >= 2/3.] \n\nIn short, sensitivity to initial conditions implies that any epsilon ball on the attractor of the dynamical system will exponentially diverge, yet still be contained within the locus of the attractor. The rate of this divergence is illustrated in Figure 3, where the maximum distance between two points is plotted with respect to the number of iterations. Note the exponential growth before saturation. Saturation occurs as the point cloud envelops the attractor. \n\nNo matter how small one partitions the state space, sensitivity to initial conditions will eventually force the extracted state to split into multiple trajectories, independent of the future input sequence. This is characteristic of a nondeterministic state transition. Unfortunately, it is very difficult, and probably intractable, to differentiate between a nondeterministic system with a small number of states and a deterministic system with a large number of states. In certain cases, however, it is possible to analytically ascertain this distinction (Crutchfield & Young, 1989). \n\nTHE OBSERVERS' PARADOX \n\nOne response to this problem is to invoke more computationally complex models such as push-down or linear-bounded automata. Unfortunately, the act of quantization can actually introduce complexity into the resulting symbol sequence. Pollack and I have focused on a well-hidden problem with the symbol system approach to understanding the computational powers of physical systems (Kolen & Pollack, 1993; in press). Consider a point rotating about a circle at a fixed rate, observed through a measurement device that reports which of two disjoint regions, r or l, currently contains the point (Figure 4, left side). Sampled at a fixed rate, the point generates sequences of the form rr...rll...lrr...r, which may be segmented into sentences of r's followed by l's. 
(The notation r^n indicates a sequence of n r's.) For a fixed sampling rate, each rotational velocity specifies up to three sentences whose numbers of r's and l's differ by at most one. These sentences repeat in an arbitrary manner. Thus, a typical subsequence of a rotator which produces the sentences r^n l^n, r^n l^{n+1}, and r^{n+1} l^n would look like \n\nr^n l^{n+1} r^n l^n r^n l^{n+1} r^{n+1} l^n r^n l^n r^n l^{n+1}. \n\nA language of sentences may be constructed by examining the families of sentences generated by a large collection of individuals, much as a natural language is induced from the abilities of its individual speakers. In this context, a language could be induced from a population of rotators with different rotational velocities, where individuals generate sentences of the form {r^n l^n, r^n l^{n+1}, r^{n+1} l^n}, n > 0. The resulting language can be described by a context-free grammar and has unbounded dependencies: the number of l's is a function of the number of preceding r's. These two constraints on the language imply that the induced language is context free. \n\nTo show that this complexity class assignment is an artifact of the observational mechanism, consider a mechanism which reports three disjoint regions: l, c, and r (Figure 4, right side). Now the same rotating point will generate sequences of the form \n\n... rr...rrcc...ccll...llrr...rrcc...ccll...ll .... \n\nFor a fixed sampling rate, each rotational velocity specifies up to seven sentences r^n c^m l^k, where n, m, and k differ by no more than one. Again, a language of sentences may be constructed containing all sentences in which the numbers of r's, c's, and l's differ by no more than one. The resulting language is context sensitive: it can be described by a context-sensitive grammar, and it cannot be context free, as it is the finite union of several context-sensitive languages related to r^n c^n l^n. 
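The rotator and its two observers can be sketched directly. In the code below, the angular velocity (0.7 radians per sample) and the region boundaries are arbitrary stand-ins, not values from the paper; the point is only that the same trajectory reads as two-symbol or three-symbol sentences depending on the chosen partition:

```python
import math

# One rotating point, read through two different quantizations of the circle.
# Velocity and boundaries are illustrative values, not from the paper.

def observe(omega, steps, boundaries, labels):
    """Advance an angle by omega per step; emit the label of the arc that
    contains it. boundaries are cumulative upper edges ending at 2*pi."""
    theta, seq = 0.0, []
    for _ in range(steps):
        phase = theta % (2.0 * math.pi)
        for edge, label in zip(boundaries, labels):
            if phase < edge:
                seq.append(label)
                break
        theta += omega
    return "".join(seq)

# Same dynamical system, two different measurement schemes.
two_region = observe(0.7, 40, [math.pi, 2.0 * math.pi], "rl")
three_region = observe(0.7, 40,
                       [2.0 * math.pi / 3, 4.0 * math.pi / 3, 2.0 * math.pi],
                       "rcl")
```

Each full pass through an arc of length L contributes either floor(L/omega) or ceil(L/omega) samples, which is why the symbol counts within a sentence differ by at most one; switching from two regions to three changes only the observer, yet moves the induced language from context free to context sensitive.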
\n\nCONCLUSION \n\nUsing recurrent neural networks as the representation underlying the language learning task has revealed some inherent problems with the concept of this task. While formal languages have mathematical validity, looking for language induction in physical systems is questionable, especially if that system operates with continuous internal states. As I have shown, there are two major problems with the extraction of a learned automaton from our models. \n\nFirst, sensitivity to initial conditions produces nondeterministic machines whose trajectories are specified by both the initial state of the network and the dynamics of the state transformation. The dynamics provide the shape of the eventual attractor; the initial conditions specify the allowable trajectories toward that attractor. While clustering methods work in the analysis of feed-forward networks because of neighborhood preservation (each layer is a homeomorphism), they may fail when applied to recurrent network state space transformations. FSM construction methods which look for single transitions between regions will not help in this case, because the network eventually separates initially nearby states across several FSM state regions. \n\nThe second problem with the extraction of a learned automaton from a recurrent network is that trivial changes in observation strategies can cause one to induce behavioral descriptions from a wide range of computational complexity classes for a single system. It is the researcher's bias which determines that a dynamical system is equivalent to a finite state automaton. \n\nOne response to the first problem described above has been to eliminate the sources of nondeterminism from the mechanisms. Zeng et al. (1993) corrected the second-order recurrent network model by replacing the continuous internal state transformation with a discrete step function. (The continuous activation remained for training purposes.) This move was justified by their focus on regular language learning, as these languages can be recognized by finite state machines. This work is questionable on two points, however. First, tractable algorithms already exist for solving this problem (e.g., Angluin, 1987). Second, they claim that the network is self-clustering the internal states. Self-clustering occurs only at the corners of the state space hypercube because of the discrete activation function, in the same manner as a digital sequential circuit \"clusters\" its states. Das and Mozer (1994), on the other hand, have relocated the clustering algorithm. Their work focused on recurrent networks that perform internal clustering during training. These networks operate much like competitive learning in feed-forward networks (e.g., Rumelhart and Zipser, 1986), as the dynamics of the learning rules constrain the state representations such that stable clusters emerge. \n\nThe shortcomings of finite state machine extraction must be understood with respect to the task at hand. The actual dynamics of the network may be inconsequential to the final product if one is using the recurrent network as a pathway for designing a finite state machine. In this engineering situation, the network is thrown away once the FSM is extracted. Neural network training can be viewed as an \"interior\" method of finding discrete solutions. It is interior in the same sense as linear programming algorithms can be classified as either edge or interior methods. The former follow the edges of the simplex, much like traditional FSM learning algorithms search the space of FSMs. Interior methods, on the other hand, explore search spaces which can embed the target spaces. 
Linear programming algorithms employing interior methods move through the interior of the defined simplex. Likewise, recurrent neural network learning methods swim through mechanisms with multiple finite state interpretations. Some researchers, specifically those discussed above, have begun to bias recurrent network learning to walk the edges (Zeng et al., 1993) or to internally cluster states (Das & Mozer, 1994). \n\nIn order to understand the behavior of recurrent networks, these devices should be regarded as dynamical systems (Kolen, 1994). In particular, most common recurrent networks are actually iterated mappings, nonlinear versions of Barnsley's iterated function systems (Barnsley, 1988). While automata also fall into this class, they are a specialization of dynamical systems, namely discrete time and state systems. Unfortunately, information processing abstractions are only applicable within this domain and do not make any sense in the broader domains of continuous time or continuous space dynamical systems. \n\nAcknowledgments \n\nThe research reported in this paper has been supported by Office of Naval Research grant number N00014-92-J-1195. I thank all those who have made comments and suggestions for improvement of this paper, especially Greg Saunders and Lee Giles. \n\nReferences \n\nAngluin, D. (1987). Learning Regular Sets from Queries and Counterexamples. Information and Computation, 75, 87-106. \nBarnsley, M. (1988). Fractals Everywhere. Academic Press: San Diego, CA. \nChomsky, N. (1957). Syntactic Structures. The Hague: Mouton & Co. \nChomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press. \nCleeremans, A., Servan-Schreiber, D. & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372-381. \nCrutchfield, J. & Young, K. (1989). Computation at the Onset of Chaos. In W. 
Zurek (Ed.), Entropy, Complexity, and the Physics of Information. Reading: Addison-Wesley. \nDas, R. & Mozer, M. (1994). A Hybrid Gradient-Descent/Clustering Technique for Finite State Machine Induction. In Jack D. Cowan, Gerald Tesauro & Joshua Alspector (Eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann: San Francisco. \nDevaney, R. L. (1989). An Introduction to Chaotic Dynamical Systems. Addison-Wesley. \nElman, J. (1990). Finding structure in time. Cognitive Science, 14, 179-211. \nGiles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H. & Lee, Y. C. (1992). Extracting and Learning an Unknown Grammar with Recurrent Neural Networks. In John E. Moody, Steven J. Hanson & Richard P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann. \nGold, E. M. (1969). Language identification in the limit. Information and Control, 10, 447-474. \nHopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. \nKolen, J. F. (1994). Recurrent Networks: State Machines or Iterated Function Systems? In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 203-210). Hillsdale, NJ: Lawrence Erlbaum Associates. \nKolen, J. F. & Pollack, J. B. (1993). The Apparent Computational Complexity of Physical Systems. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society. Lawrence Erlbaum. \nKolen, J. F. & Pollack, J. B. (In press). The Observers' Paradox: The Apparent Computational Complexity of Physical Systems. Journal of Experimental and Theoretical Artificial Intelligence. \nPollack, J. B. (1991). The Induction of Dynamical Recognizers. Machine Learning, 7, 227-252. \nNewell, A. & Simon, H. A. (1976). Computer science as empirical inquiry: symbols and search. 
Communications of the Association for Computing Machinery, 19, 113-126. \nRumelhart, D. E. & Zipser, D. (1986). Feature Discovery by Competitive Learning. In D. E. Rumelhart, J. L. McClelland & the PDP Research Group (Eds.), Parallel Distributed Processing, Volume 1 (pp. 151-193). MIT Press: Cambridge, MA. \nWatrous, R. L. & Kuhn, G. M. (1992). Induction of Finite-State Automata Using Second-Order Recurrent Networks. In John E. Moody, Steven J. Hanson & Richard P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann. \nZeng, Z., Goodman, R. M. & Smyth, P. (1993). Learning Finite State Machines With Self-Clustering Recurrent Networks. Neural Computation, 5, 976-990.", "award": [], "sourceid": 757, "authors": [{"given_name": "John", "family_name": "Kolen", "institution": null}]}