{"title": "Using Prior Knowledge in a NNPDA to Learn Context-Free Languages", "book": "Advances in Neural Information Processing Systems", "page_first": 65, "page_last": 72, "abstract": "", "full_text": "Using Prior Knowledge in a NNPDA to Learn \n\nContext-Free Languages \n\nSreerupa Das \n\nDept. of Compo Sc. & \nInst. of Cognitive Sc. \nUniversity of Colorado \n\nBoulder, CO 80309 \n\nc. Lee Giles\u00b7 \n\nNEC Research Inst. \n4 Independence Way \nPrinceton, NJ 08540 \n\nGuo-Zheng SUD \n\n\"'lnst. for Adv. Compo Studies \n\nUniversity of Maryland \nCollege Park, MD 20742 \n\nAbstract \n\nAlthough considerable interest has been shown in language inference and \nautomata induction using recurrent neural networks, success of these \nmodels has mostly been limited to regular languages. We have previ(cid:173)\nously demonstrated that Neural Network Pushdown Automaton (NNPDA) \nmodel is capable of learning deterministic context-free languages (e.g., \nanbn and parenthesis languages) from examples. However, the learning \ntask is computationally intensive. In this paper we discus some ways in \nwhich a priori knowledge about the task and data could be used for efficient \nlearning. We also observe that such knowledge is often an experimental \nprerequisite for learning nontrivial languages (eg. anbncbmam ). \n\n1 \n\nINTRODUCTION \n\nLanguage inference and automata induction using recurrent neural networks has \ngained considerable interest in the recent years. Nevertheless, success of these mod(cid:173)\nels has been mostly limited to regular languages. Additional information in form of \na priori knowledge has proved important and at times necessary for learning com(cid:173)\nplex languages (Abu-Mostafa 1990; AI-Mashouq and Reed, 1991; Omlin and Giles, \n1992; Towell, 1990). They have demonstrated that partial information incorporated \nin a connectionist model guides the learning process through constraints for efficient \nlearning and better generalization. 
\n\nWe have previously shown that the NNPDA model can learn Deterministic Context Free Languages (DCFLs) from a finite set of examples. However, the learning task requires a considerable amount of time and computational resources. In this paper we discuss methods by which a priori knowledge may be incorporated in a Neural Network Pushdown Automaton (NNPDA) described in (Das, Giles and Sun, 1992; Giles et al., 1990; Sun et al., 1990). \n\nFigure 1: The architecture of a third-order NNPDA. Each weight relates the product of Input(t), State(t) and Top-of-Stack information to State(t+1). Depending on the activation of the Action Neuron, a stack action (namely push, pop or no operation) is taken and the Top-of-Stack (i.e., the value of the Read Neurons) is updated. \n\n2 THE NEURAL NETWORK PUSHDOWN AUTOMATA \n\n2.1 ARCHITECTURE \n\nThe description of the network architecture is necessarily brief; for further details see the references above. The network consists of a set of recurrent units, called state neurons, and an external stack memory. One state neuron is designated as the output neuron. The state neurons get input (at every time step) from three sources: from their own recurrent connections, from the input neurons and from the read neurons. The input neurons register external inputs, which consist of strings of characters presented one at a time. The read neurons keep track of the symbol(s) on top of the stack. 
One non-recurrent state neuron, called the action neuron, indicates the stack action (push, pop or no-op) at any instance. The architecture is shown in Figure 1. \n\nThe stack used in this model is continuous. Unlike a usual discrete stack, where an element is either present or absent, elements in a continuous stack may be present in varying degrees (values in [0, 1]). A continuous stack is essential in order to permit the use of a continuous optimization method during learning. The stack is manipulated by the continuous valued action neuron. A detailed discussion of the operations may be found in (Das, Giles and Sun, 1992). \n\n2.2 LEARNABLE CLASS OF LANGUAGES \n\nThe class of languages learnable by the NNPDA is a proper subset of the deterministic context-free languages. A formal description of a Pushdown Automaton (PDA) requires two distinct sets of symbols - one is the input symbol set and the other is the stack symbol set(1). We have reduced the complexity of this PDA model in the following ways: First, we use the same set of symbols for the input and the stack. Second, when a push operation is performed, the symbol pushed on the stack is the one that is available as the current input. Third, no epsilon transitions are allowed in the NNPDA. An epsilon transition is one that performs a state transition and stack action without reading in a new input symbol. Unlike a deterministic finite state automaton, a deterministic PDA can make epsilon transitions under certain restrictions(1). Although these simplifications reduce the language class learnable by the NNPDA, the languages in this class nevertheless retain essential properties of CFLs and are therefore more complex than any regular language. 
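The continuous stack described above can be sketched in code. The following is a minimal illustrative reconstruction, not the authors' implementation: it assumes each stack element carries a thickness in [0, 1], that pop removes a continuous amount of total thickness from the top, and that reading returns the symbols visible within unit depth of the top.

```python
# Sketch of a continuous stack (illustrative reconstruction, not the
# authors' code).  Elements are (symbol, thickness) pairs, top at the end.

class ContinuousStack:
    def __init__(self):
        self.items = []  # list of [symbol, thickness]

    def push(self, symbol, amount):
        # Push a continuous amount of a symbol.
        if amount > 0.0:
            self.items.append([symbol, amount])

    def pop(self, amount):
        # Remove 'amount' of total thickness from the top, possibly
        # thinning the topmost element rather than deleting it outright.
        remaining = amount
        while remaining > 0.0 and self.items:
            sym, thick = self.items[-1]
            if thick <= remaining:
                self.items.pop()
                remaining -= thick
            else:
                self.items[-1][1] = thick - remaining
                remaining = 0.0

    def read(self):
        # Weighted view of the symbols lying within depth 1.0 of the top;
        # this is what the read neurons would see.
        weights, depth = {}, 0.0
        for sym, thick in reversed(self.items):
            if depth >= 1.0:
                break
            visible = min(thick, 1.0 - depth)
            weights[sym] = weights.get(sym, 0.0) + visible
            depth += visible
        return weights

    def length(self):
        # Total stack length L, used by the objective function.
        return sum(thick for _, thick in self.items)
```

For instance, pushing 0.6 of symbol a and then 0.8 of b leaves a read window containing 0.8 of b and 0.2 of a; popping 1.0 removes all of b and thins a to 0.4.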
\n\n2.3 TRAINING \n\nThe activation of the state neurons s at time step t+1 may be formulated as follows (we will only consider the third-order NNPDA in this paper): \n\ns_j(t+1) = g( Sum_{k,l,m} W_{jklm} s_k(t) i_l(t) r_m(t) )     (1) \n\nwhere g(x) = 1/(1 + exp(-x)), i is the activation of the input neurons, r is the activation of the read neurons and W is the weight matrix of the network. We use a localized representation for the input and the read symbols. During training, input sequences are presented one at a time and activations are allowed to propagate until the end of the string is reached. Once the end is reached, the activation of the output neuron is matched with the target (which is 1.0 for a positive string and 0.0 for a negative string). The learning rule used in the NNPDA is a significantly enhanced extension of Real Time Recurrent Learning (Williams and Zipser, 1989). \n\n2.4 OBJECTIVE FUNCTION \n\nThe objective function used to train the network consists of two error terms: one for positive strings and the other for negative strings. For positive strings we require that (a) the NNPDA must reach a final state and (b) the stack must be empty. This criterion can be reached by minimizing the error function: \n\nError = (1 - s_o(l)) + L(l)     (2) \n\nwhere s_o(l) is the activation of the output neuron and L(l) is the stack length, after a string of length l has been presented as input a character at a time. \n\n(1) For details refer to (Hopcroft, 1979). \n\nTable 1: The effect of Incremental Learning (IL). The average numbers of strings and characters required for learning each language are provided here. \n\navg. of total       parenthesis         postfix            a^n b^n \npresentations       w IL     w/o IL     w IL     w/o IL    w IL      w/o IL \n# of strings        2671     15912      5644     8326      108200    >200000 \n# of characters     10628    82002      29552    31171     358750    >700000 \n\nFor negative 
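The update of Eq. (1) contracts a fourth-order weight tensor against the state, input and read activations. A short sketch follows; the names and shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Sketch of the third-order state update of Eq. (1):
#   s_j(t+1) = g( sum_{k,l,m} W[j,k,l,m] * s_k(t) * i_l(t) * r_m(t) )
# with g the logistic sigmoid.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nnpda_step(W, state, inp, read):
    """One forward step of a third-order NNPDA layer.

    W     : weights, shape (n_state, n_state, n_input, n_read)
    state : state activations s(t), shape (n_state,)
    inp   : localized input activations i(t), shape (n_input,)
    read  : top-of-stack activations r(t), shape (n_read,)
    """
    # Contract the triple product s_k * i_l * r_m against W.
    net = np.einsum('jklm,k,l,m->j', W, state, inp, read)
    return sigmoid(net)
```

With all weights zero the net input is zero, so every state neuron settles at g(0) = 0.5, which is why the inferred PDAs in Figure 4 label the start state [.5 .5].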
strings, the error function is modified as: \n\nError = s_o(l) - L(l), if (s_o(l) - L(l)) > 0.0; otherwise Error = 0     (3) \n\nEquation (3) reflects the criterion that, for a negative pattern, we require either the final state s_o(l) = 0.0 or the stack length L(l) to be greater than 1.0 (the error is high only when s_o(l) is close to 1.0 and the stack length L(l) is close to zero). \n\nTable 2: Statistics on epochs, generalization and number of hidden units required for learning with and without selective string presentation (SSP). Entries marked *** indicate that the language was not learned. \n\n                   parenthesis           a^n b^n \n                   w SSP     w/o SSP     w SSP     w/o SSP \nepochs             50-80     150-250     50-80     150-250 \ngeneralization     100%      98.97%      100%      100% \nnumber of units    1+1       2           1+1       2 \n\n                   a^{n+m} b^n c^m       a^n b^n c b^m a^m \n                   w SSP     w/o SSP     w SSP     w/o SSP \nepochs             150       ***         150-250   *** \ngeneralization     96.02%    ***         100%      *** \nnumber of units    1+1       ***         1+1       *** \n\n3 BUILDING IN PRIOR KNOWLEDGE \n\nIn practical inference tasks it may be possible to obtain prior knowledge about the problem domain. In such cases it often helps to build knowledge into the system under study. There are at least two different types of knowledge available to a model: (a) knowledge that depends on the training data, with absolutely no knowledge about the automaton, and (b) partial knowledge about the automaton being inferred. Some of the ways in which knowledge can be provided to the model are discussed below. \n\n3.1 KNOWLEDGE FROM THE DATA \n\n3.1.1 Incremental Learning \n\nIncremental Learning has been suggested by many (Elman, 1991; Giles et al., 1990; Sun et al., 1990): the training examples are presented in order of increasing length. This model of learning starts with a training set containing short simple strings. Longer strings are added to the training set as learning proceeds. \n\nWe believe that incremental learning is very useful when (a) the data presented contains structure, and (b) the strings learned earlier embody simpler versions of the task being learned. Both these conditions are valid for context-free languages. Table 1 provides some results obtained when incremental learning was used. The figures are averages over several pairs of simulations, each of which was initialized with the same initial random weights. \n\nFigure 2: Faster convergence using selective string presentation (SSP) for the parenthesis language task. \n\n3.1.2 Selective Input Presentation \n\nOur training data contained both positive and negative examples. One problem with training on incorrect strings is that, once a symbol in the string is reached that makes it negative, no further information is gained by processing the rest of the string. For example, the fifth a in the string aaaaba... makes the string a negative example of the language a^n b^n, irrespective of what follows it. In order to incorporate this idea we have introduced the concept of a dead state. \n\nDuring training, we assume that there is a teacher or oracle who has knowledge of the grammar and is able to identify the first (leftmost) occurrence of an incorrect sequence of symbols in a negative string. When such a point is reached in the input string, further processing of the string is stopped and the network is trained so that one designated state neuron, called the dead state neuron, is active. 
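The oracle-driven truncation described above can be sketched as follows. The membership test `is_viable_prefix` is a hypothetical stand-in for the teacher, instantiated here for the language a^n b^n; the network would then be trained on the truncated prefix with the dead-state neuron as the target.

```python
# Sketch of selective input presentation: an oracle truncates each negative
# string at the leftmost symbol that makes it unrecoverable, and that point
# becomes a dead-state training target.  (Illustrative, for a^n b^n only.)

def is_viable_prefix(prefix):
    """True if 'prefix' can still be extended to a string in a^n b^n."""
    n_a, n_b, seen_b = 0, 0, False
    for ch in prefix:
        if ch == 'a':
            if seen_b:
                return False        # an 'a' after a 'b' is fatal
            n_a += 1
        elif ch == 'b':
            seen_b = True
            n_b += 1
            if n_b > n_a:
                return False        # more b's than a's is fatal
        else:
            return False            # symbol outside the alphabet
    return True

def truncate_negative(string):
    """Return (prefix, dead): the portion of the string the network sees,
    and whether processing stopped at a dead-state target."""
    for k in range(1, len(string) + 1):
        if not is_viable_prefix(string[:k]):
            return string[:k], True
    return string, False
```

Note the distinction this draws: `aaaaba` is truncated at the offending `a` with a dead-state target, while `aab` (negative only because it ends too early) is processed in full under the ordinary objective.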
To accommodate the idea of a dead state in the learning rule, the following change is made: if the network is being trained on negative strings that end in a dead state, then the length L(l) in the error function in Equation (3) is ignored and the error simply becomes \n\nError = (1/2)(1 - s_dead(l))^2 \n\nSince such strings contain a negative subsequence, they cannot be a prefix of any positive string; therefore at this point we do not care about the length of the stack. For strings that are either positive, or negative but do not go to a dead state (an example would be a prefix of a positive string), the objective function remains the same as described earlier in Equations (2) and (3). \n\nSuch additional information provided during training resulted in efficient learning, helped in learning exact pushdown automata and led to better generalization of the trained network. Information in this form was often a prerequisite for successfully learning certain languages. Figure 2 shows a typical plot of the improvement in learning when such knowledge is used. Table 2 shows the improvements in generalization, number of units needed and number of epochs required for learning. The numbers in the tables are averages over several simulations; changing the initial conditions resulted in values of similar orders of magnitude. \n\nFigure 3: Learning curves when none, one or more initial weights (IW) were set for the postfix language learning task. 
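Putting the error terms together, the per-string error selection might look like the sketch below. The positive-string term follows the stated criterion (output driven toward 1.0, stack driven toward empty); the exact functional forms are reconstructions from the text, not the authors' code, and the names are illustrative.

```python
# Sketch of the NNPDA objective: Eq. (2) for positive strings, Eq. (3) for
# ordinary negative strings, and the dead-state variant for negative
# strings truncated by the oracle.  (Illustrative reconstruction.)

def nnpda_error(s_out, stack_len, label, s_dead=0.0, hits_dead_state=False):
    if label == 'positive':
        # Eq. (2): require final output 1.0 and an empty stack.
        return (1.0 - s_out) + stack_len
    if hits_dead_state:
        # Dead-state variant: stack length is irrelevant once the string
        # contains an invalid prefix; activate the dead-state neuron.
        return 0.5 * (1.0 - s_dead) ** 2
    # Eq. (3): penalize only when the net accepts (s_out near 1.0) with a
    # near-empty stack; otherwise the string is already rejected.
    diff = s_out - stack_len
    return diff if diff > 0.0 else 0.0
```

The asymmetry is the point: a negative string incurs no error as long as either the output neuron is low or the stack is long, mirroring the either/or criterion of Equation (3).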
\n\n3.2 KNOWLEDGE ABOUT THE TASK \n\n3.2.1 Knowledge About The Target PDA's Dynamics \n\nOne way in which knowledge about the target PDA can be built into a system is by biasing the initial conditions of the network. This may be done by assigning predetermined initial values to a selected set of weights (or biases). For example, a third-order NNPDA has dynamics that map well onto the theoretical model of a PDA: both allow a three-to-two mapping of a similar kind. This is because in the third-order NNPDA, the product of the activations of the input neurons, the read neurons and the state neurons determines the next state and the next action to be taken. It may be possible to determine some of the weights in a third-order network if certain information about the automaton is known. Typical improvement in learning is shown in Figure 3 for a postfix language learning task. \n\nFigure 4: Some of the PDAs inferred by the NNPDA: (a) a PDA for the parenthesis language; (b) a PDA for a^n b^n. In the figure the nodes in the graph represent states inferred by the NNPDA and the numbers in \"[]\" indicate the state representations. Every transition is indicated by an arrow and is labeled \"x/y/z\", where \"x\" corresponds to the current input symbol, \"y\" corresponds to the symbol on top of the stack and \"z\" corresponds to the action taken. \n\n3.2.2 Using Structured Examples \n\nStructured examples from a grammar are a set of strings where the order of letter generation is indicated by brackets. 
An example would be the string ((ab)c) generated by the rules S -> Xc; X -> ab. Under the current dynamics and limitations of the model, this information can be interpreted as providing the stack actions (push and pop) to the NNPDA. Learning the palindrome language is a hard task because it necessitates remembering a precise history over a long period of time. The NNPDA was able to learn the palindrome language for two symbols when structured examples were presented. \n\n4 AUTOMATON EXTRACTION FROM NNPDA \n\nOnce the network performs well on the training set, the transition rules of the inferred PDA can be deduced. Since the languages learned by the NNPDA so far corresponded to PDAs with few states, the state representations in the induced PDA could be inferred by looking at the state neuron activations when presented with all possible character sequences. For larger PDAs, clustering techniques could be used to infer the state representations. Various clustering techniques for similar tasks have been discussed in (Das and Das, 1992; Giles et al., 1992). Figure 4 shows some of the PDAs inferred by the NNPDA. \n\n5 CONCLUSION \n\nThis paper has described some of the ways in which prior knowledge can be used to learn DCFLs in an NNPDA. Such knowledge is valuable to the learning process in two ways: it may reduce the solution space, and as a consequence may speed up the learning process. Having the right restrictions on a given representation can make learning simple, which reconfirms an old truism in Artificial Intelligence. \n\nReferences \n\nY.S. Abu-Mostafa. (1990) Learning from hints in neural networks. Journal of Complexity, 6:192-198. \n\nK.A. Al-Mashouq and I.S. Reed. (1991) Including hints in training neural networks. Neural Computation, 3(3):418-427. \n\nS. Das and R. Das. 
(1992) Induction of discrete state-machines by stabilizing a continuous recurrent network using clustering. To appear in CSI Journal of Computer Science and Informatics, Special Issue on Neural Computing. \n\nS. Das, C.L. Giles, and G.Z. Sun. (1992) Learning context free grammars: capabilities and limitations of a neural network with an external stack memory. Proc. of the Fourteenth Annual Conf. of the Cognitive Science Society, pp. 791-795. Morgan Kaufmann, San Mateo, CA. \n\nJ.L. Elman. (1991) Incremental learning, or the importance of starting small. CRL Tech Report 9101, Center for Research in Language, UCSD, La Jolla, CA. \n\nC.L. Giles, G.Z. Sun, H.H. Chen, Y.C. Lee and D. Chen. (1990) Higher Order Recurrent Networks & Grammatical Inference. Advances in Neural Information Processing Systems 2, pp. 380-387, ed. D.S. Touretzky, Morgan Kaufmann, San Mateo, CA. \n\nC.L. Giles, C.B. Miller, H.H. Chen, G.Z. Sun, and Y.C. Lee. (1992) Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393-405. \n\nJ.E. Hopcroft and J.D. Ullman. (1979) Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA. \n\nC.W. Omlin and C.L. Giles. (1992) Training second-order recurrent neural networks using hints. Proceedings of the Ninth Int. Conf. on Machine Learning, pp. 363-368. D. Sleeman and P. Edwards (eds). Morgan Kaufmann, San Mateo, CA. \n\nG.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee and D. Chen. (1991) Neural networks with external memory stack that learn context-free grammars from examples. Proc. of the Conf. on Information Science and Systems, Princeton U., Vol. II, pp. 649-653. \n\nG.G. Towell, J.W. Shavlik and M.O. Noordewier. (1990) Refinement of approximately correct domain theories by knowledge-based neural networks. In Proc. of the Eighth National Conf. on Artificial Intelligence, Boston, MA, p. 861. \n\nR.J. 
Williams and D. Zipser. (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280.", "award": [], "sourceid": 587, "authors": [{"given_name": "Sreerupa", "family_name": "Das", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}, {"given_name": "Guo-Zheng", "family_name": "Sun", "institution": null}]}