{"title": "Fool's Gold: Extracting Finite State Machines from Recurrent Network Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 501, "page_last": 508, "abstract": null, "full_text": "Fool's Gold: Extracting Finite State Machines From Recurrent Network Dynamics\n\nJohn F. Kolen\n\nLaboratory for Artificial Intelligence Research\nDepartment of Computer and Information Science\nThe Ohio State University\nColumbus, OH 43210\nkolen-j@cis.ohio-state.edu\n\nAbstract\n\nSeveral recurrent networks have been proposed as representations for the task of formal language learning. After training a recurrent network to recognize a formal language or predict the next symbol of a sequence, the next logical step is to understand the information processing carried out by the network. Some researchers have begun extracting finite state machines from the internal state trajectories of their recurrent networks. This paper describes how sensitivity to initial conditions and discrete measurements can trick these extraction methods into returning illusory finite state descriptions.\n\nINTRODUCTION\n\nFormal language learning (Gold, 1969) has been a topic of concern for cognitive science and artificial intelligence. It is the task of inducing a computational description of a formal language from a sequence of positive and negative examples of strings in the target language. Neural information processing approaches to this problem involve the use of recurrent networks that embody the internal state mechanisms underlying automata models (Cleeremans et al., 1989; Elman, 1990; Pollack, 1991; Giles et al., 1992; Watrous & Kuhn, 1992). 
Unlike traditional automata-based approaches, learning systems relying on recurrent networks have an additional burden: we are still unsure as to what these networks are doing. Some researchers have assumed that the networks are learning to simulate finite state machines (FSMs) in their state dynamics and have begun to extract FSMs from the networks' state transition dynamics (Cleeremans et al., 1989; Giles et al., 1992; Watrous & Kuhn, 1992). These extraction methods employ various clustering techniques to partition the internal state space of the recurrent network into a finite number of regions corresponding to the states of a finite state automaton.\n\nThis assumption of finite state behavior is dangerous on two accounts. First, these extraction techniques are based on a discretization of the state space which ignores the basic definition of information processing state. Second, discretization can give rise to incomplete computational explanations of systems operating over a continuous state space.\n\nSENSITIVITY TO INITIAL CONDITIONS\n\nIn this section, I will demonstrate how sensitivity to initial conditions can confuse an FSM extraction system. The basis of this claim rests upon the definition of information processing state. Information processing (IP) state is the foundation underlying automata theory. Two IP states are the same if and only if they generate the same output responses for all possible future inputs (Hopcroft & Ullman, 1979). This definition is the fulcrum for many proofs and techniques, including finite state machine minimization. Any FSM extraction technique should embrace this definition; in fact, it grounds the standard FSM minimization methods and the physical system modelling of Crutchfield and Young (Crutchfield & Young, 1989). 
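The definition of IP state above can be made concrete with a small sketch. The Python below is only an illustration and is not taken from the paper: it assumes a Moore-style machine (one output per state) and uses the textbook partition-refinement idea to group states that no future input sequence can tell apart. The function name, the four states q0 to q3, and the transition table are all hypothetical.

```python
def equivalent_state_classes(states, alphabet, delta, output):
    # delta maps (state, symbol) -> next state; output maps state -> output symbol.
    # Start from the coarsest split: states grouped by their immediate output.
    block_of = {s: output[s] for s in states}
    while True:
        # A state's signature is its own block plus the blocks it moves to.
        signature = {
            s: (block_of[s],) + tuple(block_of[delta[(s, a)]] for a in alphabet)
            for s in states
        }
        if len(set(signature.values())) == len(set(block_of.values())):
            break  # no block was split, so the partition is stable
        block_of = signature
    classes = {}
    for s in states:
        classes.setdefault(block_of[s], []).append(s)
    return sorted(sorted(group) for group in classes.values())


# Hypothetical machine: q1 and q2 respond identically to every future input,
# while q3 shares their output but behaves differently one step later.
states = ['q0', 'q1', 'q2', 'q3']
alphabet = ['a', 'b']
output = {'q0': 'A', 'q1': 'B', 'q2': 'B', 'q3': 'B'}
delta = {
    ('q0', 'a'): 'q1', ('q0', 'b'): 'q2',
    ('q1', 'a'): 'q0', ('q1', 'b'): 'q1',
    ('q2', 'a'): 'q0', ('q2', 'b'): 'q2',
    ('q3', 'a'): 'q3', ('q3', 'b'): 'q3',
}
print(equivalent_state_classes(states, alphabet, delta, output))
```

In this toy machine q1 and q2 fall into one class and q3 is split off after a single refinement step. The check relies on exact, deterministic transitions, which is precisely what a clustered continuous state space does not provide.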
\n\nSome dynamical systems exhibit exponential divergence for nearby state vectors, yet remain confined within an attractor. This is known as sensitivity to initial conditions. If this divergent behavior is quantized, it appears as nondeterministic symbol sequences (Crutchfield & Young, 1989) even though the underlying dynamical system is completely deterministic (Figure 1).\n\nConsider a recurrent network with one output and three recurrent state units. The output unit performs a threshold at zero activation for state unit one. That is, when the activation of the first state unit of the current state is less than zero, the output is A. Otherwise, the output is B. Equation 1 presents a mathematical description. S(t) is the current state of the system, and O(t) is the current output.\n\nS(t+1) = \\tanh\\left( \\begin{bmatrix} 0 & 0 & 2 & 1 \\\\ 0 & 0 & 2 & -1 \\\\ 2 & -2 & 0 & -1 \\end{bmatrix} \\cdot \\begin{bmatrix} S(t) \\\\ 1 \\end{bmatrix} \\right) \\qquad (1)\n\nFigure 2 illustrates what happens when you run this network for many iterations. The point in the upper left-hand state space is actually a thousand individual points all within a ball of radius 0.01. In one iteration these points migrate down to the lower corner of the state space. Notice that the ball has elongated along one dimension. After ten iterations the original ball shape is no longer visible. After seventeen, the points are beginning to spread along a two-dimensional sheet within state space. And by fifty iterations, we see the network reaching its full extent in state space. This behavior is known as sensitivity to initial conditions and is one of three conditions which have been used to characterize chaotic dynamical systems (Devaney, 1989). 
In short, sensitivity to initial conditions implies that any epsilon ball on the attractor of the dynamical system will exponentially diverge, yet still be contained within the locus of the attractor. The rate of this divergence is illustrated in Figure 3, where the maximum distance between two points is plotted with respect to the number of iterations. Note the exponential growth before saturation. Saturation occurs as the point cloud envelops the attractor.\n\n[Figure 1 panels: x \u2192 4x(1-x) with output regions split at x = 0.5; x \u2192 2x mod 1 with output regions split at x = 1/3 and x = 2/3; x \u2192 3.68x(1-x) with output regions split at x = 0.5.]\n\nFigure 1: Examples of deterministic dynamical systems whose discretized trajectories appear nondeterministic.\n\nNo matter how finely one partitions the state space, sensitivity to initial conditions will eventually force the extracted state to split into multiple trajectories independent of the future input sequence. This is characteristic of a nondeterministic state transition. Unfortunately, it is very difficult, and probably intractable, to differentiate between a nondeterministic system with a small number of states and a deterministic system with a large number of states. In certain cases, however, it is possible to analytically ascertain this distinction (Crutchfield & Young, 1989).\n\nTHE OBSERVERS' PARADOX\n\nOne response to this problem is to invoke more computationally complex models such as push-down or linear-bounded automata. Unfortunately, the act of quantization can actually introduce both complexion and complexity in the resulting symbol sequence. Pollack and I have focused on a well-hidden problem with the symbol system approach to understanding the computational powers of physical systems. 
This work (Kolen & Pollack, 1993; Kolen & Pollack, In press) demonstrated that computational complexity, in terms of Chomsky's hierarchy of formal languages (Chomsky, 1957; Chomsky, 1965) and Newell and Simon's physical symbol systems (Newell & Simon, 1976), is not intrinsic to physical systems. The demonstration below shows how apparently trivial changes in the partitioning of state space can produce symbol sequences from varying complexity classes.\n\n[Figure 2 panels: start (\u03b5 < 0.01), 1 iteration, 10 iterations, 17 iterations, 25 iterations, 50 iterations; panels are labeled with the output or outputs (A, B) produced.]\n\nFigure 2: The state space of a recurrent network whose next state transitions are sensitive to initial conditions. The initial epsilon ball contains 1000 points. These points first straddle the output decision boundary at iteration seven.\n\nConsider a point moving in a circular orbit with a fixed rotational velocity, such as the end of a rotating rod spinning around a fixed center, or imagine watching a white dot on a spinning bicycle wheel. We measure the location of the dot by periodically sampling the location with a single decision boundary (Figure 4, left side). If the point is to the left of the boundary at the time of the sample, we write down an \"l\". Likewise, we write down an \"r\" when the point is on the other side. (The probability of the point landing on the boundary is zero and can arbitrarily be assigned to either category without affecting the results below.) In the limit, we will have recorded an infinite sequence of symbols containing long sequences of r's and l's. 
\n\n[Figure 3 plot: maximum distance (vertical axis) against iteration number (horizontal axis).]\n\nFigure 3: Spread of initial points across the attractor as measured by maximum distance.\n\nFigure 4: On the left, two decision regions which induce a context free language. \u03b8 is the current angle of rotation. At the time of sampling, if the point is to the left (right) of the dividing line, an l (r) is generated. On the right, three decision regions which induce a context sensitive language.\n\nThe specific ordering of symbols observed in a long sequence of multiple rotations is dependent upon the initial rotational angle of the system. However, the sequence does possess a number of recurring structural regularities, which we call sentences: a run of r's followed by a run of l's. For a fixed rotational velocity (rotations per time unit) and sampling rate, the observed system will generate sentences of the form r^n l^m (n, m > 0). (The notation r^n indicates a sequence of n r's.) For a fixed sampling rate, each rotational velocity specifies up to three sentences whose number of r's and l's differ by at most one. These sentences repeat in an arbitrary manner. Thus, a typical subsequence of a rotator which produces sentences r^n l^n, r^n l^(n+1), r^(n+1) l^n would look like\n\nr^n l^(n+1) r^n l^n r^n l^(n+1) r^(n+1) l^n r^n l^n r^n l^(n+1). 
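The rotator and its sentences can be simulated in a few lines. The Python sketch below is an illustration under stated assumptions rather than the procedure used in the paper: angles are measured in turns, the half-turn [0, 1/2) is arbitrarily reported as r and the other half as l, and the velocity 137/1000 and starting angle 1/7 are made-up values chosen so that no sample lands exactly on a boundary. Exact fractions are used so that rounding cannot move a sample across the decision boundary.

```python
from fractions import Fraction


def rotator_symbols(velocity, start, steps):
    # Sample a point on a circle once per time step; angles are in turns.
    # Assumption: [0, 1/2) is reported as 'r' and [1/2, 1) as 'l'.
    angle = start % 1
    symbols = []
    for _ in range(steps):
        symbols.append('r' if angle < Fraction(1, 2) else 'l')
        angle = (angle + velocity) % 1
    return ''.join(symbols)


def sentences(symbols):
    # Return the set of (n, m) pairs for every complete sentence r^n l^m.
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    runs = runs[1:-1]  # the first and last runs may be truncated
    if runs and runs[0][0] == 'l':
        runs = runs[1:]  # make sure the first kept run is a run of r's
    pairs = set()
    for i in range(0, len(runs) - 1, 2):
        pairs.add((runs[i][1], runs[i + 1][1]))
    return pairs


symbols = rotator_symbols(Fraction(137, 1000), Fraction(1, 7), 2000)
found = sentences(symbols)
print(sorted(found))
```

With this velocity a half turn takes about 3.65 samples, so every complete run has length 3 or 4, and the set of observed sentences stays within the bound of three sentences whose r and l counts differ by at most one.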
\n\nA language of sentences may be constructed by examining the families of sentences generated by a large collection of individuals, much like a natural language is induced from the abilities of its individual speakers. In this context, a language could be induced from a population of rotators with different rotational velocities where individuals generate sentences of the form {r^n l^n, r^n l^(n+1), r^(n+1) l^n}, n > 0. The resulting language can be described by a context free grammar and has unbounded dependencies; the number of l's is a function of the number of preceding r's. These two constraints on the language imply that the induced language is context free.\n\nTo show that this complexity class assignment is an artifact of the observational mechanism, consider the mechanism which reports three disjoint regions: l, c, and r (Figure 4, right side). Now the same rotating point will generate sequences of the form\n\n... rr ... rrcc ... ccll ... llrr ... rrcc ... ccll ... ll ....\n\nFor a fixed sampling rate, each rotational velocity specifies up to seven sentences, r^n c^m l^k, where n, m, and k can differ by no more than one. Again, a language of sentences may be constructed containing all sentences in which the number of r's, c's, and l's differs by no more than one. The resulting language is context sensitive since it can be described by a context sensitive grammar and cannot be context free, as it is the finite union of several context sensitive languages related to r^n c^n l^n.\n\nCONCLUSION\n\nUsing recurrent neural networks as the representation underlying the language learning task has revealed some inherent problems with the concept of this task. While formal languages have mathematical validity, looking for language induction in physical systems is questionable, especially if that system operates with continuous internal states. 
As I have shown, there are two major problems with the extraction of a learned automaton from our models.\n\nFirst, sensitivity to initial conditions produces nondeterministic machines whose trajectories are specified by both the initial state of the network and the dynamics of the state transformation. The dynamics provide the shape of the eventual attractor. The initial conditions specify the allowable trajectories toward that attractor. While clustering methods work in the analysis of feed-forward networks because of neighborhood preservation (as each layer is a homeomorphism), they may fail when applied to recurrent network state space transformations. FSM construction methods which look for single transitions between regions will not help in this case because the network eventually separates initially nearby states across several FSM state regions.\n\nThe second problem with the extraction of a learned automaton from a recurrent network is that trivial changes in observation strategies can cause one to induce behavioral descriptions from a wide range of computational complexity classes for a single system. It is the researcher's bias which determines that a dynamical system is equivalent to a finite state automaton.\n\nOne response to the first problem described above has been to eliminate the sources of nondeterminism from the mechanisms. Zeng et al. (1993) corrected the second-order recurrent network model by replacing the continuous internal state transformation with a discrete step function. (The continuous activation remained for training purposes.) This move was justified by their focus on regular language learning, as these languages can be recognized by finite state machines. 
This work is questionable on two points, however. First, tractable algorithms already exist for solving this problem (e.g. Angluin, 1987). Second, they claim that the network is self-clustering the internal states. Self-clustering occurs only at the corners of the state space hypercube because of the discrete activation function, in the same manner as a digital sequential circuit \"clusters\" its states. Das and Mozer (1994), on the other hand, have relocated the clustering algorithm. Their work focused on recurrent networks that perform internal clustering during training. These networks operate much like competitive learning in feed-forward networks (e.g. Rumelhart and Zipser, 1986) as the dynamics of the learning rules constrain the state representations such that stable clusters emerge.\n\nThe shortcomings of finite state machine extraction must be understood with respect to the task at hand. The actual dynamics of the network may be inconsequential to the final product if one is using the recurrent network as a pathway for designing a finite state machine. In this engineering situation, the network is thrown away once the FSM is extracted. Neural network training can be viewed as an \"interior\" method for finding discrete solutions. It is interior in the same sense as linear programming algorithms can be classified as either edge or interior methods. The former follows the edges of the simplex, much like traditional FSM learning algorithms search the space of FSMs. Interior methods, on the other hand, explore search spaces which can embed the target spaces. Linear programming algorithms employing interior methods move through the interior of the defined simplex. Likewise, recurrent neural network learning methods swim through mechanisms with multiple finite state interpretations. 
Some researchers, specifically those discussed above, have begun to bias recurrent network learning to walk the edges (Zeng et al., 1993) or to internally cluster states (Das & Mozer, 1994).\n\nIn order to understand the behavior of recurrent networks, these devices should be regarded as dynamical systems (Kolen, 1994). In particular, most common recurrent networks are actually iterated mappings, nonlinear versions of Barnsley's iterated function systems (Barnsley, 1988). While automata also fall into this class, they are a specialization of dynamical systems, namely discrete time and state systems. Unfortunately, information processing abstractions are only applicable within this domain and do not make any sense in the broader domains of continuous time or continuous space dynamical systems.\n\nAcknowledgments\n\nThe research reported in this paper has been supported by Office of Naval Research grant number N00014-92-J-1195. I thank all those who have made comments and suggestions for improvement of this paper, especially Greg Saunders and Lee Giles.\n\nReferences\n\nAngluin, D. (1987). Learning Regular Sets from Queries and Counterexamples. Information and Computation, 75, 87-106.\nBarnsley, M. (1988). Fractals Everywhere. Academic Press: San Diego, CA.\nChomsky, N. (1957). Syntactic Structures. The Hague: Mouton & Co.\nChomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, Mass.: MIT Press.\nCleeremans, A., Servan-Schreiber, D. & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372-381.\nCrutchfield, J. & Young, K. (1989). Computation at the Onset of Chaos. In W. Zurek, (Ed.), Entropy, Complexity, and the Physics of Information. Reading: Addison-Wesley.\nDas, R. & Mozer, M. (1994). A Hybrid Gradient-Descent/Clustering Technique for Finite State Machine Induction. In Jack D. 
Cowan, Gerald Tesauro, and Joshua Alspector, (Eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann: San Francisco.\nDevaney, R. L. (1989). An Introduction to Chaotic Dynamical Systems. Addison-Wesley.\nElman, J. (1990). Finding structure in time. Cognitive Science, 14, 179-211.\nGiles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H. & Lee, Y. C. (1992). Extracting and Learning an Unknown Grammar with Recurrent Neural Networks. In John E. Moody, Steven J. Hanson & Richard P. Lippmann, (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann.\nGold, E. M. (1969). Language identification in the limit. Information and Control, 10, 372-381.\nHopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.\nKolen, J. F. (1994). Recurrent Networks: State Machines or Iterated Function Systems? In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 203-210). Hillsdale, NJ: Erlbaum Associates.\nKolen, J. F. & Pollack, J. B. (1993). The Apparent Computational Complexity of Physical Systems. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society. Lawrence Erlbaum.\nKolen, J. F. & Pollack, J. B. (In press). The Observers' Paradox: The Apparent Computational Complexity of Physical Systems. Journal of Experimental and Theoretical Artificial Intelligence.\nPollack, J. B. (1991). The Induction of Dynamical Recognizers. Machine Learning, 7, 227-252.\nNewell, A. & Simon, H. A. (1976). Computer science as empirical inquiry: symbols and search. Communications of the Association for Computing Machinery, 19, 113-126.\nRumelhart, D. E., and Zipser, D. (1986). Feature Discovery by Competitive Learning. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, (Eds.), Parallel Distributed Processing. 
Volume 1. 151-193. MIT Press: Cambridge, MA.\nWatrous, R. L. & Kuhn, G. M. (1992). Induction of Finite-State Automata Using Second-Order Recurrent Networks. In John E. Moody, Steven J. Hanson & Richard P. Lippmann, (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann.\nZeng, Z., Goodman, R. M., Smyth, P. (1993). Learning Finite State Machines With Self-Clustering Recurrent Networks. Neural Computation, 5, 976-990.\n", "award": [], "sourceid": 757, "authors": [{"given_name": "John", "family_name": "Kolen", "institution": null}]}