{"title": "Representation and Induction of Finite State Machines using Time-Delay Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 403, "page_last": 409, "abstract": null, "full_text": "Representation and Induction of Finite State Machines using Time-Delay Neural Networks \n\nDaniel S. Clouse \nComputer Science & Engineering Dept. \nUniversity of California, San Diego \nLa Jolla, CA 92093-0114 \ndclouse@ucsd.edu \n\nC. Lee Giles \nNEC Research Institute \n4 Independence Way \nPrinceton, NJ 08540 \ngiles@research.nj.nec.com \n\nBill G. Horne \nNEC Research Institute \n4 Independence Way \nPrinceton, NJ 08540 \nhorne@research.nj.nec.com \n\nGarrison W. Cottrell \nComputer Science & Engineering Dept. \nUniversity of California, San Diego \nLa Jolla, CA 92093-0114 \ngcottrell@ucsd.edu \n\nAbstract \n\nThis work investigates the representational and inductive capabilities of time-delay neural networks (TDNNs) in general, and of two subclasses of TDNN: those with delays only on the inputs (IDNNs), and those which include delays on hidden units (HDNNs). Both architectures are capable of representing the same class of languages, the definite memory machine (DMM) languages, but the delays on the hidden units help the HDNN outperform the IDNN on problems composed of repeated features over short time windows. \n\n1 Introduction \n\nIn this paper we consider the representational and inductive capabilities of time-delay neural networks (TDNNs) [Waibel et al., 1989] [Lang et al., 1990], also known as NNFIR [Wan, 1993]. A TDNN is a feed-forward network in which the set of inputs to any node i may include the output from previous layers not only in the current time step t, but from d earlier time steps as well. The activation function for node i at time t in such a network is given by equation 1: \n\ny_i^t = h( Σ_{j=1}^{i-1} Σ_{k=0}^{d} y_j^{t-k} w_{ijk} )   (1) \n\nwhere y_i^t is the activation of node i at time t, w_{ijk} is the connection strength from node j to node i at delay k, and h is the squashing function. \n\nTDNNs have been used in speech recognition [Waibel et al., 1989] and time series prediction [Wan, 1993]. In this paper we concentrate on the language induction problem. A training set of variable-length strings taken from a discrete alphabet {0, 1} is generated. Each string is labeled as to whether or not it is in some language L. The network must learn to discriminate strings which are in the language from those which are not, not only for the training set strings, but for strings the network has never seen before. The language induction problem provides a simple, familiar domain in which to gain insight into the capabilities of different network architectures. \n\nSpecifically, in this paper, we will look at the representational and inductive capabilities of the general class of TDNNs versus a subclass of TDNNs, the input-delay neural networks (IDNNs). An IDNN is a TDNN in which delays are limited to the network inputs. In section 2, we will show that the classes of functions representable by general TDNNs and IDNNs are equivalent. In section 3, we will show that the class of languages representable by the TDNNs are the definite memory machine (DMM) languages. In section 4, we will demonstrate the inductive capability of the TDNNs in a simulation in which a large DMM is learned using a small percentage of the possible, short training examples. In section 5, a second set of simulations will show the difference between representational and inductive bias, and will demonstrate the utility of internal delays in a TDNN network. 
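As a concrete illustration (ours, not code from the paper), equation 1 can be sketched in Python. We assume a tanh squashing function h, 0-based node indexing, a nested weight array w[i][j][k] for the connection from node j to node i at delay k, and tap values before time 0 cleared to zero, as in the simulations later in the paper: \n

```python
import math

def tdnn_activation(history, w, i, t, d):
    """Activation y_i^t of node i at time t (equation 1).

    history[j][s] holds y_j^s for each lower-numbered node j;
    taps reaching before time 0 are taken to be 0 (cleared delay line).
    """
    total = 0.0
    for j in range(i):              # nodes j < i feed node i
        for k in range(d + 1):      # delays k = 0 .. d
            y = history[j][t - k] if t - k >= 0 else 0.0
            total += y * w[i][j][k]
    return math.tanh(total)         # h: the squashing function
```

For example, with a single input node whose recent activations are [1.0, 0.5, 0.25] and weights w[1][0] = [1.0, 2.0], node 1 at t = 2 with d = 1 computes tanh(0.25·1.0 + 0.5·2.0).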
\n\n2 TDNNs and IDNNs Are Functionally Equivalent \n\nSince every IDNN is also a TDNN, the set of functions computable by any TDNN includes all those computable by the IDNNs. [Wan, 1993] also shows that the IDNNs can compute any function computable by the TDNNs, making these two classes of network architectures functionally equivalent. For completeness, here we include a description of how to construct an equivalent IDNN from a TDNN. \n\nFigure 1a shows a TDNN with a single input u at the current time (u_t) and at four earlier time steps (u_{t-1} ... u_{t-4}). The inputs to node R consist of the outputs of nodes P and Q at the current time step along with one or two previous time steps. At time t, node P computes f_P(u_t, ..., u_{t-4}), a function of the current input and four delays. At time t-1, node P computes f_P(u_{t-1}, ..., u_{t-5}). This serves as one of the delayed inputs to node R. This value could also be computed by sliding node P over one step in the input tap-delay line along with its incoming weights, as shown in figure 1b. Using this construction, all the internal delays can be removed and replaced by copies of the original nodes P and Q, along with their incoming weights. This method can be applied recursively to remove any internal delay in any TDNN network. Thus, for any function computable by a TDNN, we can construct an IDNN which computes the same function. \n\n3 TDNNs Can Represent the DMM Languages \n\nIn this section, we show that the set of languages which are representable by some TDNN are exactly those languages representable by the definite memory machines (DMMs). \n\nFigure 1: Constructing an IDNN equivalent to a given TDNN. (a) General TDNN; (b) equivalent IDNN, in which copies of P compute f_P(u_t, ..., u_{t-4}), f_P(u_{t-1}, ..., u_{t-5}), and f_P(u_{t-2}, ..., u_{t-6}) over the input tap-delay line u_t ... u_{t-6}. \n\n
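The sliding construction of section 2 (figure 1) can be made concrete with a small Python sketch (ours, not from the paper). node_P below is a hypothetical first-layer feature detector over a 5-tap window; the two functions compute the same delayed outputs of P, once as an internal delay line on P's output would supply them, and once via copies of P slid along a lengthened input tap-delay line: \n

```python
def node_P(window):
    # hypothetical feature detector over 5 input taps;
    # any fixed function of the window would do
    return float(sum(window) % 2)

def internal_delay_outputs(u, t, delays):
    """f_P(u_t ... u_{t-4}), f_P(u_{t-1} ... u_{t-5}), ... as a delay
    line on node P's output would provide them (figure 1a)."""
    return [node_P([u[t - d - k] if t - d - k >= 0 else 0
                    for k in range(5)]) for d in range(delays)]

def slid_copies_outputs(u, t, delays):
    """The same values from copies of P slid along a longer input
    tap-delay line, with no internal delays (figure 1b)."""
    taps = [u[t - k] if t - k >= 0 else 0 for k in range(5 + delays - 1)]
    return [node_P(taps[d:d + 5]) for d in range(delays)]
```

On any input sequence the two lists agree, which is the functional equivalence the construction establishes.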
According to Kohavi (1978), a DMM of order d is a finite state machine (FSM) whose present state can always be determined uniquely from knowledge of the most recent d inputs. We equivalently define a DMM of order d as an FSM whose accepting/rejecting behavior is a function of only the most recent d inputs. \n\nTo fit TDNNs and IDNNs into the language induction framework, we consider only networks with a single 0/1 input. Since any boolean function can be represented by a feed-forward network with enough hidden units [Horne and Hush, 1994], an IDNN exists which can perform the mapping from the d most recent inputs to any accepting/rejecting behavior. Therefore, any DMM language can be represented by some IDNN. Since every IDNN computes a function of its most recent d inputs, by the definition of DMM, there is no boolean-output IDNN which represents a non-DMM language. Therefore, the IDNNs represent exactly the DMM languages. Since the TDNN and IDNN classes are functionally equivalent, TDNNs implement exactly the DMM languages as well. \n\nThe shift-register behavior of the input tap-delay line in an IDNN completely determines the state transition behavior of any machine represented by the network. This state transition behavior is fixed by the architecture. For example, figure 2a shows the state transition diagram for any machine representable by an IDNN with two input delays. The mapping from the current state to \"accept\" or \"reject\" is all that can be changed with training. Clouse et al. (1994) describes the conditions under which such a mapping results in a minimal FSM. All mappings used in the subsequent simulations are minimal FSM mappings. \n\n4 Simulation 1: Large DMM \n\nTo demonstrate the close relationship between TDNNs and DMMs, here we present the results of a simulation in which we trained an IDNN to reproduce the behavior of a DMM of order 11. The mapping function for the DMM is given in equation 2. 
\nFigure 2b shows the minimal 2048-state transition diagram required to represent the DMM. The symbol ↔ in equation 2 represents the if-and-only-if function. The overbar notation, ū_k, represents the negation of u_k, the input at time k. y_k is the network output at time k. y_k > 0.5 is interpreted as \"accept the string seen so far.\" y_k ≤ 0.5 means \"reject.\" \n\ny_k = u_{k-10} ↔ (u_k ū_{k-1} ū_{k-2} + u_{k-2} u_{k-3} + u_{k-1} u_{k-2})   (2) \n\nFigure 2: Transition diagrams for two DMMs. (a) DMM of order 3; (b) DMM of order 11. \n\nFigure 3: Generalization error on the 2048-state DMM (mean error versus percent of total samples (4094) used in training). \n\nTo create training and test sets, we randomly split in two the set of all 4094 strings of length 11 or less. We will report results using various percentages of possible strings for the training set. The IDNN had 10 input tap-delays and seven hidden units. All tap-delays were cleared to 0 before introduction of a new input string. Weights were trained using online back-propagation with learning rate 0.25 and momentum 0.25. To speed up the algorithm, weights were updated only if the absolute error on an example was greater than 0.2. Training was stopped when weight updates were required for no examples in the training set. This generally required 200 epochs or fewer, though there were trials which required almost 4000 epochs. \n\nEach point in figure 3 represents the mean classification error on the test set across 20 trials. Error bars indicate one standard deviation on each side of the mean. Each trial consists of a different randomly-chosen training set. The graph plots error at various training set sizes. 
Note that with training sets as small as 12 percent of possible strings, the network generalizes perfectly to the remaining 88 percent. This kind of performance is possible because of the close match between the representational bias of the IDNN and this specific problem. \n\n5 Simulation 2: Inductive biases of IDNNs and HDNNs \n\nIn section 2, we showed that the IDNNs and general TDNNs can represent the same class of functions. It does not follow that these two architectures are equally capable of learning the same functions. In this section, we show that the inductive biases are indeed different. We will present our intuitions about the kinds of problems each architecture is well suited to learning, then back up our intuitions with supporting simulations. \n\nIn the following simulations, we compare two specific networks. The network representing the general TDNNs includes delays on hidden layer outputs. We'll refer to this as the hidden-delay neural network, or HDNN. All delays in the second network are confined to the network inputs, and so we call this the IDNN. \n\nWe have been careful to design the two networks to be comparable in size. Each of the networks contains two hidden layers. The first hidden layer of the IDNN has four units, and the second five. The IDNN has eight input delays. Each of the two hidden layers of the HDNN has three units. The HDNN has three input delays, and five delays on the output of each node of the first hidden layer. Note that in each network the longest path from input to output requires eight delays. The numbers of weights, including bias weights, are also similar: 76 for the HDNN, and 79 for the IDNN. \n\nIn order for the size of the two networks to be similar, the HDNN must have fewer delays on the network inputs. 
If we think of each unit in the first hidden layer as a feature detector, the feature detectors in the HDNN will span a smaller time window than those of the IDNN. On the other hand, the HDNN has a second set of delays which saves the output of the feature detectors over several time steps. If some narrow feature repeats over time, this second set of delays should help the HDNN to pick up this regularity. The IDNN, lacking the internal delays, should find it more difficult to detect this kind of repeated regularity. \n\nTo test these ideas, we generated four DMM problems. We call equation 3 the narrow-repeated problem because it contains a number of identical terms shifted in time, and because each of these terms is narrow enough to fit in the time window of the HDNN first-layer feature detectors. \n\ny_k = u_{k-8} ↔ (u_k ū_{k-2} ū_{k-3} + u_{k-1} ū_{k-3} ū_{k-4} + u_{k-3} ū_{k-5} ū_{k-6} + u_{k-4} ū_{k-6} ū_{k-7})   (3) \n\nThe wide-repeated problem, represented by equation 4, is identical to the narrow-repeated problem except that each term has been stretched so that it will no longer fit in the HDNN feature detector time window. \n\ny_k = u_{k-8} ↔ (u_k ū_{k-2} ū_{k-4} + u_{k-1} ū_{k-3} ū_{k-5} + u_{k-2} ū_{k-4} ū_{k-6} + u_{k-3} ū_{k-5} ū_{k-7})   (4) \n\nThe narrow-unrepeated problem, represented by equation 5, is composed of narrow terms, but none of these terms is simply a shifted reproduction of another. \n\ny_k = u_{k-8} ↔ (u_k ū_{k-2} ū_{k-3} + u_{k-1} u_{k-3} ū_{k-4} + u_{k-3} ū_{k-5} ū_{k-6} + u_{k-4} ū_{k-6} ū_{k-7})   (5) \n\nLastly, the wide-unrepeated problem of equation 6 contains wide terms which do not repeat. \n\ny_k = u_{k-8} ↔ (u_k ū_{k-3} ū_{k-4} + u_{k-1} u_{k-4} u_{k-5} + u_{k-2} ū_{k-5} ū_{k-6} + u_{k-3} ū_{k-6} ū_{k-7})   (6) \n\nEach problem in this section requires a minimum of 512 states to represent. \n\nSimilar to the simulation of section 4, we trained both networks on subsets of all possible strings of length 9 or less. 
Since these problems were more difficult than that of section 4, the networks were often unable to find a solution which performed perfectly on the training set. In this case, training was stopped after 8000 epochs. The results reported later include these trials as well as trials in which training ended because of perfect performance on the training set. Training for the HDNN was identical to that of the IDNN except that error was propagated back across the internal delays as in Wan (1993). \n\nFigure 4: Generalization of an HDNN and an IDNN on four DMM problems (generalization error versus percent of total samples used in training): (a) narrow-repeated; (b) narrow-unrepeated; (c) wide-repeated; (d) wide-unrepeated. \n\nFigure 4 plots generalization error versus percentage of possible strings used in training for the two networks on each of the four DMM problems. If our intuitions were correct, we would expect to see evidence here that wider terms and lack of repetition have a stronger adverse effect on the HDNN network than on the IDNN. This is exactly what we see. The position of the curve for the IDNN network is stable compared to that of the HDNN when changes are made to the width and repetition factors. \n\nStatistical analysis supports this conclusion. 
We ran an ANOVA test [Rice, 1988] with four factors (network, term width, term repetition, and training set size) on the data summarized by the graphs of figure 4. The test detected a significant interaction between the network and width factors (MS_net×wid = 0.3430, F(1, 1824) = 234.4), and between the network and repetition factors (MS_net×rep = 0.1181, F(1, 1824) = 80.694). These two interactions are significant at p < 0.001, agreeing with our conclusion that width and repetition each has a stronger effect on the performance of the HDNN network. \n\nFurther planned tests reveal that the effects of width and repetition are strong enough to change which network generalizes better. We ran a one-way ANOVA test on each problem individually to see which network performs better across the entire curve. The tests reveal that the HDNN performs with significantly less error than the IDNN on the narrow-repeated problem (MS_error = 0.0015, MS_net = 0.5400, F(1, 1824) = 369.0) and on the narrow-unrepeated problem (MS_net = 0.0683, F(1, 1824) = 46.7). Performance of the IDNN is significantly better on the wide-unrepeated problem (MS_net = 0.0378, F(1, 1824) = 25.83). All of these comparisons are significant at p < 0.001. The test on the wide-repeated problem finds no significant difference in the performance of the two networks (MS_net = 0.0004, F(1, 1824) = 0.273, p > 0.05). \n\nIn addition to confirming our intuitions about the kinds of problems that internal delays should be helpful in solving, this set of simulations demonstrates the difference between representational and inductive bias. For all DMM problems except the wide-unrepeated one, we were able to find, for each network, at least one set of weights which solves the problem perfectly. 
Despite the fact that the two networks are both capable of representing the problems, the differing ways in which they respond to the width and repetition factors demonstrate a difference in their learning biases. \n\n6 Conclusions \n\nThis paper presents a number of interesting ideas concerning TDNNs using both theoretical and empirical techniques. On the theoretical side, we have precisely defined the subclass of FSMs which can be represented by TDNNs: the DMM languages. It is interesting to note that this network architecture, which has no recurrent connections, is capable of representing languages whose transition diagrams require loops. \n\nOther ideas were demonstrated using empirical techniques. First, we have shown that the number of states required to represent an FSM may be a poor predictor of how difficult the language is to learn. We were able to learn a 2048-state FSM using a small percentage of the possible training examples. This is possible because of the close match between the representational bias of the network and the language learned. \n\nSecond, we presented a set of simulations which demonstrated the utility of internal delays in a TDNN. These delays were shown to improve generalization on problems composed of features over short time intervals which reappear repeatedly. \n\nThird, that same set of simulations highlights the difference between representational bias and inductive bias. Though these two terms are sometimes used interchangeably in the theoretical literature, this work shows that the two concepts are, in fact, separable. \n\nReferences \n\n[Clouse et al., 1994] Clouse, D. S., Giles, C. L., Horne, B. G., and Cottrell, G. W. (1994). Learning large de Bruijn automata with feed-forward neural networks. Technical Report CS94-398, University of California, San Diego, Computer Science and Engineering Dept. \n\n[Horne and Hush, 1994] Horne, B. G. and Hush, D. R. (1994). 
On the node complexity of neural networks. Neural Networks, 7(9):1413-1426. \n\n[Kohavi, 1978] Kohavi, Z. (1978). Switching and Finite Automata Theory. McGraw-Hill, Inc., New York, NY, second edition. \n\n[Lang et al., 1990] Lang, K., Waibel, A., and Hinton, G. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23-44. \n\n[Rice, 1988] Rice, J. A. (1988). Mathematical Statistics and Data Analysis. Brooks/Cole Publishing Company, Monterey, California. \n\n[Waibel et al., 1989] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328-339. \n\n[Wan, 1993] Wan, E. A. (1993). Time series prediction by using a connectionist network with internal delay lines. In Weigend, A. S. and Gershenfeld, N. A., editors, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison Wesley. \n", "award": [], "sourceid": 1275, "authors": [{"given_name": "Daniel", "family_name": "Clouse", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}, {"given_name": "Bill", "family_name": "Horne", "institution": null}, {"given_name": "Garrison", "family_name": "Cottrell", "institution": null}]}