{"title": "Higher Order Recurrent Networks and Grammatical Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 380, "page_last": 387, "abstract": null, "full_text": "380 \n\nGiles, Sun, Chen, Lee and Chen \n\nHIGHER ORDER RECURRENT NETWORKS \n\n& GRAMMATICAL INFERENCE \n\nC. L. Giles\u00b7, G. Z. Sun, H. H. Chen, Y. C. Lee, D. Chen \n\nDepartment of Physics and Astronomy \n\nInstitute for Advanced Computer Studies \n\nand \n\nUniversity of Maryland. College Park. MD 20742 \n\n* NEC Research Institute \n\n4 Independence Way. Princeton. NJ. 08540 \n\nABSTRACT \n\nA higher order single layer recursive network \neasily learns to \nsimulate a deterministic finite state machine and recognize regular \ngrammars. When an enhanced version of this neural net state machine \nis connected through a common error term to an external analog stack \nmemory, the combination can be interpreted as a neural net pushdown \nautomata. The neural net finite state machine is given the primitives, \npush and POP. and is able to read the top of the stack. Through a \ngradient descent learning rule derived from \nthe common error \nfunction, the hybrid network learns to effectively use the stack \nactions to manipUlate the stack memory and to learn simple context(cid:173)\nfree grammars. \nINTRODUCTION \nBiological networks readily and easily process temporal information; artificial neural \nnetworks should do the same. Recurrent neural network models permit the encoding \nand learning of temporal sequences. There are many recurrent neural net models. for ex(cid:173)\nample see [Jordan 1986. Pineda 1987, Williams & Zipser 1988]. Nearly all encode the \ncurrent state representation of the models in the activity of the neuron and the next \nstate is determined by the current state and input. From an automata perspective, this \ndynamical structure is a state machine. 
One formal model of sequences, and of the machines that generate and recognize them, is formal grammars and their respective automata. These models formalize some of the foundations of computer science. In the Chomsky hierarchy of formal grammars [Hopcroft & Ullman 1979], the simplest level of complexity is defined by the finite state machine and its regular grammars. (All machines and grammars described here are deterministic.) The next level of complexity is described by pushdown automata and their associated context-free grammars. The pushdown automaton is a finite state machine with the added power to use a stack memory. Neural networks should be able to perform the same type of computation and thus solve such learning problems as grammatical inference [Fu 1982]. \n\nSimple grammatical inference is defined as the problem of finding (learning) a grammar from a finite set of strings, often called the teaching sample. Recall that a (phrase-structured) grammar is defined as a 4-tuple (N, V, P, S), where N and V are the nonterminal and terminal vocabularies, P is a finite set of production rules, and S is the start symbol. Here grammatical inference is also defined as the learning of the machine that recognizes the teaching and testing samples. Potential applications of grammatical inference include such varied areas as pattern recognition, information retrieval, programming language design, translation and compiling, and graphics languages [Fu 1982]. \n\nThere has been a great deal of interest in teaching neural nets to recognize grammars and simulate automata [Allen 1989, Jordan 1986, Pollack 1989, Servan-Schreiber et al. 1989, Williams & Zipser 1988]. Some important extensions of that work are discussed here. 
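As a concrete illustration of the 4-tuple (N, V, P, S) just defined, the sketch below represents a small right-linear (regular) grammar in Python and enumerates the strings it derives. The names `Grammar` and `strings_up_to`, and the example grammar itself, are ours for illustration, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Grammar:
    nonterminals: set   # N
    terminals: set      # V
    productions: dict   # P: nonterminal -> list of right-hand-side strings
    start: str          # S

def strings_up_to(g, max_len):
    """Enumerate all terminal strings derivable from g.start with length <= max_len."""
    results, frontier = set(), {g.start}
    while frontier:
        nxt = set()
        for form in frontier:
            i = next((k for k, c in enumerate(form) if c in g.nonterminals), None)
            if i is None:                       # all terminals: a derived string
                if len(form) <= max_len:
                    results.add(form)
                continue
            for rhs in g.productions[form[i]]:  # expand the leftmost nonterminal
                new = form[:i] + rhs + form[i + 1:]
                if sum(1 for c in new if c in g.terminals) <= max_len:
                    nxt.add(new)                # prune over-long derivations
        frontier = nxt
    return results

# A right-linear grammar generating all binary strings of even length
even_len = Grammar(nonterminals={"S", "A"}, terminals={"0", "1"},
                   productions={"S": ["0A", "1A", ""], "A": ["0S", "1S"]},
                   start="S")
```

Running `strings_up_to(even_len, 4)` yields the 21 even-length binary strings of length at most 4, which can serve as a teaching sample in the sense above.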
In particular, we construct recurrent higher order neural net state machines which have no hidden layers and seem to be at least as powerful as any neural net multilayer state machine discussed so far. For example, the learning time and training sample size are significantly reduced. In addition, we integrate this neural net finite state machine with an external stack memory and inform the network, through a common objective function, that it has at its disposal the symbol at the top of the stack and the operation primitives push and pop. By devising a common error function which integrates the stack and the neural net state machine, this hybrid structure learns to effectively use the stack to recognize context-free grammars. In the interesting work of [Williams & Zipser 1988], a recurrent net learns only the state machine part of a Turing Machine, since the associated move, read, and write operations for each input string are known and are given as part of the training set. In contrast, the model we present learns how to manipulate the push, pop, and read primitives of an external stack memory, and also learns the additional necessary state operations and structure. \n\nHIGHER ORDER RECURRENT NETWORK \n\nThe recurrent neural network utilized can be considered as a higher order modification of the network model developed by [Williams & Zipser 1988]. Recall that in a recurrent net the activation state S of the neurons at time (t+1) is defined as in a state machine automaton: \n\nS(t+1) = F( S(t), I(t); W )     (1) \n\nwhere F maps the state S and the input I at time t to the next state. The weight matrix W forms the mapping and is usually learned. We use a higher order form for this mapping: \n\nS_i(t+1) = g( Σ_jk W_ijk S_j(t) I_k(t) )     (2) \n\nwhere the range of i, j is the number of state neurons and k the number of input neurons; g is defined as g(x) = 1/(1+exp(-x)). 
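The second-order transition of eq. (2) can be sketched in a few lines. This is our illustrative code, not from the paper; the weight tensor W has shape (states, states, inputs) and g is the logistic function given in the text.

```python
import numpy as np

def g(x):
    """Logistic activation g(x) = 1/(1+exp(-x)) from the text."""
    return 1.0 / (1.0 + np.exp(-x))

def step(W, S, I):
    """One second-order state transition, eq. (2):
    S_i(t+1) = g( sum over j,k of W[i,j,k] * S_j(t) * I_k(t) )."""
    return g(np.einsum("ijk,j,k->i", W, S, I))

# example sizes: 3 state neurons, 3 input neurons (unary-coded symbol)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3, 3))
S = np.full(3, 0.5)             # initial state activities
I = np.array([1.0, 0.0, 0.0])   # one-hot input symbol
S_next = step(W, S, I)
```

Because each weight multiplies a state-input product S_j(t) I_k(t), the input effectively selects which state-to-state map is applied at each step, which is what makes the single layer behave like a state machine.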
In order to use the net for grammatical inference, a learning rule must be devised. To learn the mapping F and the weight matrix W, given a sample set of P strings of the grammar, we construct the following error function E: \n\nE = Σ_r E_r^2 = Σ_r ( T_r - S_o(t_r) )^2     (3) \n\nwhere the sum is over the P samples. The error function is evaluated at the end of a presented sequence of length t_r, and S_o is the activity of the output neuron. For a recurrent net, the output neuron is a designated member of the state neurons. The target value T_r of any pattern is 1 for a legal string and 0 for an illegal one. Using a gradient descent procedure, we minimize the error function E for only the rth pattern. The weight update rule becomes \n\nΔW_ijk = η ( T_r - S_o(t_r) ) ∂S_o(t_r)/∂W_ijk     (4) \n\nwhere η is the learning rate. Using eq. (2), ∂S_o(t_r)/∂W_ijk is easily calculated using the recursion relationship \n\n∂S_l(t+1)/∂W_ijk = h( S_l(t+1) ) ( δ_li S_j(t) I_k(t) + Σ_mn W_lmn I_n(t) ∂S_m(t)/∂W_ijk )     (5) \n\nand the choice of an initial value for ∂S_i(t=0)/∂W_ijk, where h(x) = dg/dx. Note that this requires ∂S_i(t)/∂W_ijk to be updated as each element of each string is presented and to have a known initial value. Given an adequate network topology, the above neural net state machine should be capable of learning any regular grammar of arbitrary string length or a more complex grammar of finite length. \n\nFINITE STATE MACHINE SIMULATION \n\nIn order to see how such a net performs, we trained the net on a regular grammar, the dual parity grammar. An arbitrary length string of 0's and 1's has dual parity if the string contains an even number of 0's and an even number of 1's. The network architecture was 3 input neurons and either 3, 4, or 5 state neurons with fully connected second order interconnection weights. The string vocabulary 0, 1, e (end symbol) used a unary representation. 
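The dual parity criterion and the unary (one-hot) coding of the vocabulary 0, 1, e can be stated directly. The sketch below, which generates labeled training strings, is ours; the helper names are illustrative, not from the paper.

```python
from itertools import product

def has_dual_parity(s):
    """A 0/1 string has dual parity iff it contains an even number
    of 0's and an even number of 1's."""
    return s.count("0") % 2 == 0 and s.count("1") % 2 == 0

# unary (one-hot) representation of the three-symbol vocabulary
ONE_HOT = {"0": (1, 0, 0), "1": (0, 1, 0), "e": (0, 0, 1)}

def labeled_strings(max_len):
    """All binary strings up to max_len, each with its target value:
    1 for a legal (dual parity) string, 0 for an illegal one."""
    out = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            out.append((s, 1 if has_dual_parity(s) else 0))
    return out
```

A training set like the one described in the text would present each string symbol by symbol (followed by e) and compare S_o at the end of the string with the 0/1 target.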
The initial training set consisted of 30 positive and negative strings of increasing string length up to length 4. After including in the training set all strings up to length 10 which resulted in misclassification (about 30 strings), the neural net state machine recognized perfectly all strings up to length 20. Total training time was usually 500 epochs or less. \n\nBy looking closely at the dynamics of learning, it was discovered that for different inputs the states of the network tended to cluster around three values plus the initial state. These four states can be considered as possible states of an actual finite state machine, and the movement between these states as a function of input can be interpreted as the state transitions of a state machine. Constructing a state machine yields a perfect four state machine which will recognize any dual parity grammar. Using minimization procedures [Fu 1982], the extraneous state transitions can be reduced to the minimal 4-state machine. The extracted state machine is shown in Fig. 1. However, for more complicated grammars and different initial conditions, it might be difficult to extract the finite state machine. When different initial weights were chosen, different extraneous transition diagrams with more states resulted. What is interesting is that the neural net finite state machine learned this simple grammar perfectly. A first order net can also learn this problem; the higher order net learns it much faster. It is easy to prove that there are finite state machines that cannot be represented by first order, single layer recurrent nets [Minsky 1967]. For further discussion of higher order state machines, see [Liu et al. 1990]. 
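The extraction procedure described above, quantizing the analog state vectors to nearby cluster centers and reading off the induced transitions, can be sketched as follows. The function name and the (state, symbol, next-state) triple format are our assumptions; the paper gives no algorithm listing.

```python
import numpy as np

def extract_transitions(trajectory, centers):
    """Map each analog state vector to its nearest cluster center (a discrete
    state index) and record the induced transition table
    {(state, symbol): next_state}."""
    def nearest(s):
        return int(np.argmin(np.linalg.norm(centers - np.asarray(s), axis=1)))
    table = {}
    for s, symbol, s_next in trajectory:
        table[(nearest(s), symbol)] = nearest(s_next)
    return table

# hypothetical example: two cluster centers in a 2-neuron state space
centers = np.array([[0.1, 0.1], [0.9, 0.9]])
traj = [([0.15, 0.05], "0", [0.88, 0.92]),   # state 0 --0--> state 1
        ([0.88, 0.92], "0", [0.12, 0.08])]   # state 1 --0--> state 0
table = extract_transitions(traj, centers)
```

Standard minimization of the resulting transition table (merging equivalent states) then yields the minimal machine, as the text notes for the 4-state dual parity recognizer.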
\n\n[Figure 1: a four-state transition diagram with edges labeled 0 and 1.] \n\nFIGURE 1: A learned four state machine; state 1 is both the start and the final state. \n\nNEURAL NET PUSHDOWN AUTOMATA \n\nIn order to easily learn more complex deterministic grammars, the neural net must somehow develop and/or learn to use some type of memory, the simplest being a stack memory. Two approaches easily come to mind: teach the additional weight structure in a multilayer neural network to serve as memory [Pollack 1989], or teach the neural net to use an external memory source. The latter is appealing because it is well known from formal language theory that a finite stack machine requires significantly fewer resources than a finite state machine for bounded problems such as recognizing a finite length context-free grammar. To teach a neural net to use a stack memory poses at least three problems: 1) how to construct the stack memory, 2) how to couple the stack memory to the neural net state machine, and 3) how to formulate the objective function such that its optimization will yield effective learning rules. \n\nMost straightforward is formulating the objective function so that the stack is coupled to the neural net state machine. The most stringent condition for a pushdown automaton to accept a context-free grammar is that the pushdown automaton be in a final state and the stack be empty. Thus, the error function of eq. (3) above is modified to include both final state and stack length terms: \n\nE_r = ( T_r - S_o(t_r) )^2 + L(t_r)^2     (6) \n\nwhere L(t_r) is the final stack length at time t_r, i.e. the time at which the last symbol of the string is presented. Therefore, for legal strings E_r = 0 if the pushdown automaton is in a final state and the stack is empty. \n\nNow consider how the stack can be connected to the neural net state machine. Recall that for a pushdown automaton [Fu 1982], the state transition mapping of eq. 
(1) includes an additional argument, the symbol R(t) read from the top of the stack, together with an additional stack action mapping. An obvious approach to connecting the stack to the neural net is to let the activity level of certain neurons represent the symbol at the top of the stack and others represent the action on the stack. The pushdown automaton has an additional stack action of reading or writing to the top of the stack based on the current state, input, and top stack symbol. One interpretation of these mappings would be extensions of eq. (2): \n\nS_i(t+1) = g( Σ_jk Ws_ijk S_j(t) V_k(t) )     (7) \n\nA_i(t+1) = f( Σ_jk Wa_ijk S_j(t) V_k(t) )     (8) \n\nwhere A_i(t) are output neurons controlling the action of the stack; V_k(t) is either the input neuron value I_k(t) or the connected stack memory neuron value R_k(t), depending on the index k; and f = 2g - 1. The current values S_j(t), I_k(t), and R_k(t) are all fully connected through 2nd order weights with no hidden neurons. The mappings of eqs. (7) and (8) define the recursive network and can be implemented concurrently and in parallel. Let A(t=0) = R(t=0) = 0. The neuron state values range continuously from 0 to 1 while the neuron action values range from -1 to 1. The neural network part of the architecture is depicted in Fig. 2. The number of read neurons is equal to the coding representation of the stack. For most applications, one action neuron suffices. \n\nFIGURE 2: Single layer higher order recursive neural network that is connected to a stack memory. A represents action neurons connected to the stack; R represents memory buffer neurons which read the top of the stack. The activation proceeds upward from states, input, and stack top at time t to states and action at time t+1. The recursion replaces the states in the bottom layer with the states in the top layer. 
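The coupled updates of eqs. (7) and (8) can be sketched as follows (our illustrative code, not from the paper). V concatenates the input and read neurons, and f = 2g - 1 keeps the action value in (-1, 1) so its sign can later select push versus pop.

```python
import numpy as np

def g(x):
    """Logistic activation from the text."""
    return 1.0 / (1.0 + np.exp(-x))

def pda_step(Ws, Wa, S, I, R):
    """State and action updates, eqs. (7)-(8):
    S_i(t+1) = g( sum_jk Ws[i,j,k] S_j(t) V_k(t) )
    A_i(t+1) = f( sum_jk Wa[i,j,k] S_j(t) V_k(t) ),  f = 2g - 1."""
    V = np.concatenate([I, R])   # input neurons followed by read neurons
    S_next = g(np.einsum("ijk,j,k->i", Ws, S, V))
    A_next = 2.0 * g(np.einsum("ijk,j,k->i", Wa, S, V)) - 1.0
    return S_next, A_next

# example sizes: 3 state, 3 input, 1 read, 1 action neuron (hypothetical)
rng = np.random.default_rng(1)
Ws = rng.normal(size=(3, 3, 4))
Wa = rng.normal(size=(1, 3, 4))
S, I, R = np.full(3, 0.5), np.array([0.0, 1.0, 0.0]), np.zeros(1)
S_next, A_next = pda_step(Ws, Wa, S, I, R)
```

Both updates read the same S_j(t) V_k(t) products, so the state and action mappings can indeed be evaluated concurrently, as the text observes.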
\nIn order to use the gradient descent learning rule described in eq. (4), the stack length must have continuous values. (Other types of learning algorithms may not require a continuous stack.) We now explain how a continuous stack is used and connected to the action and read neurons. Interpret the stack actions as follows: push (A>0), pop (A