{"title": "An experimental comparison of recurrent neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 697, "page_last": 704, "abstract": null, "full_text": "An experimental comparison \nof recurrent neural networks \n\nBill G. Horne and C. Lee Giles* \n\nNEC Research Institute \n\n4 Independence Way \nPrinceton, NJ 08540 \n\n{horne,giles}@research.nj.nec.com \n\nAbstract \n\nMany different discrete-time recurrent neural network architectures \nhave been proposed. However, there has been virtually no \neffort to compare these architectures experimentally. In this paper \nwe review and categorize many of these architectures and compare \nhow they perform on various classes of simple problems including \ngrammatical inference and nonlinear system identification. \n\n1 Introduction \n\nIn the past few years several recurrent neural network architectures have emerged. \nIn this paper we categorize various discrete-time recurrent neural network architectures, \nand perform a quantitative comparison of these architectures on two problems: \ngrammatical inference and nonlinear system identification. \n\n2 RNN Architectures \n\nWe broadly divide these networks into two groups depending on whether or not the \nstates of the network are guaranteed to be observable. A network with observable \nstates has the property that the states of the system can always be determined from \nobservations of the input and output alone. The archetypical model in this class \n\n* Also with UMIACS, University of Maryland, College Park, MD 20742 \n\nTable 1: Terms that are weighted in various single layer network architectures. Ui \nrepresents the ith input at the current time step, Zi represents the value of the ith \nnode at the previous time step. 
\n\nArchitecture bias Ui \nFirst order \nx \nHigh order \n\nx \n\nBilinear \nQuadratic \n\nx \nx \n\nx \n\nZi \nx \n\nx \nx \n\nUiUj \n\nZiUj ZiZj \n\nx \nx \nx \n\nx \n\nx \n\nwas proposed by Narendra and Parthasarathy [9]. In their most general model, the \noutput of the network is computed by a multilayer perceptron (MLP) whose inputs \nare a window of past inputs and outputs, as shown in Figure 1a. A special case of \nthis network is the Time Delay Neural Network (TDNN), which is simply a tapped \ndelay line (TDL) followed by an MLP [7]. This network is not recurrent since there \nis no feedback; however, the TDL does provide a simple form of dynamics that \ngives the network the ability to model a limited class of nonlinear dynamic systems. \nA variation on the TDNN, called the Gamma network, has been proposed in which \nthe TDL is replaced by a set of cascaded filters [2]. Specifically, if the output of \none of the filters is denoted $x_j(k)$, and the output of filter i connects to the input \nof filter j, the output of filter j is given by \n\n$$x_j(k + 1) = \\mu x_i(k) + (1 - \\mu) x_j(k).$$ \n\nIn this paper we only consider the case where $\\mu$ is fixed, although better results can \nbe obtained if it is adaptive. \n\nNetworks that have hidden dynamics have states which are not directly accessible \nto observation. In fact, it may be impossible to determine the states of a system \nfrom observations of its inputs and outputs alone. We divide networks with hidden \ndynamics into three classes: single layer networks, multilayer networks, and \nnetworks with local feedback. \n\nSingle layer networks are perhaps the most popular of the recurrent neural network \nmodels. In a single layer network, every node depends on the previous output of \nall of the other nodes. The function performed by each node distinguishes the \ntypes of recurrent networks in this class. 
In each of the networks, nodes can be \ncharacterized as a nonlinear function of a weighted sum of inputs, previous node \noutputs, or products of these values. A bias term may also be included. In this \npaper we consider first-order networks, high-order networks [5], bilinear networks, \nand quadratic networks [12]. The terms that are weighted in each of these networks \nare summarized in Table 1. \n\nMultilayer networks consist of a feedforward network coupled with a finite set of \ndelays as shown in Figure 1b. One network in this class is an architecture proposed \nby Robinson and Fallside [11], in which the feedforward network is an MLP. Another \npopular network that fits into this class is Elman's Simple Recurrent Network \n(SRN) [3]. An Elman network can be thought of as a single layer network with an \nextra layer of nodes that compute the output function, as shown in Figure 1c. \n\nIn locally recurrent networks the feedback is provided locally within each individual \nnode, but the nodes are connected together in a feedforward architecture. Specifically, \nwe consider nodes that have local output feedback in which each node weights \na window of its own past outputs and windows of node outputs from previous layers. \nNetworks with local recurrence have been proposed in [1, 4, 10]. \n\nFigure 1: Network architectures: (a) Narendra and Parthasarathy's Recurrent Neural \nNetwork, (b) Multilayer network and (c) an Elman network. \n\n3 Experimental Results \n\n3.1 Experimental methodology \n\nIn order to make the comparison as fair as possible we have adopted the following \nmethodology. \n\u2022 Resources. We shall perform two fundamental comparisons: one in which the \nnumber of weights is roughly the same for all networks, another in which the \nnumber of states is equivalent. 
In either case, we shall make these numbers large \nenough that most of the networks can achieve interesting performance levels. \nNumber of weights. For static networks it is well known that the generalization \nperformance is related to the number of weights in the network. Although this \ntheory has never been extended to recurrent neural networks, it seems reasonable \nthat a similar result might apply. Therefore, in some experiments we shall try \nto keep the number of weights approximately equal across all networks. \nNumber of states. It can be argued that for dynamic problems the size of the \nstate space is a more relevant measure for comparison than the number of \nweights. Therefore, in some experiments we shall keep the number of states \nequal across all networks. \n\n\u2022 Vanilla learning. Several heuristics have been proposed to help speed learning \nand improve generalization of gradient descent learning algorithms. However, \nsuch heuristics may favor certain architectures. In order to avoid these issues, \nwe have chosen simple gradient descent learning algorithms. \n\n\u2022 Number of simulations. Due to random initial conditions, the recurrent \nneural network solutions can vary widely. Thus, to try to achieve a statistically \nsignificant estimation of the generalization of these networks, a large number of \nexperiments were run. \n\nFigure 2: A randomly generated six state finite state machine. \n\n3.2 Finite state machines \n\nWe chose two finite state machine (FSM) problems for a comparison of the ability of \nthe various recurrent networks to perform grammatical inference. The first problem \nis to learn the minimal, randomly generated six state machine shown in Figure 2. 
\nThe second problem is to infer a sixty-four state finite memory machine [6] described \nby the logic function \n\ny(k) = u(k - 3)u(k) + u(k - 3)y(k - 3) + u(k)u(k - 3)Y(k - 3) \n\nwhere u(k) and y(k) represent the input and output respectively at time k and $\\bar{x}$ \nrepresents the complement of x. \nTwo experiments were run. In the first experiment all of the networks were designed \nsuch that the number of weights was less than, but as close as possible to, 60. In the \nsecond experiment, each network was restricted to six state variables, and if possible, \nthe networks were designed to have approximately 75 weights. Several alternative \narchitectures were tried when it was possible to configure the architecture differently \nand yield the same number of weights, but those used gave the best results. \n\nA complete set of 254 strings consisting of all strings of length one through seven is \nsufficient to uniquely identify both of these FSMs. For each simulation, we randomly \npartitioned the data into a training and testing set consisting of 127 strings each. \nThe strings were ordered lexicographically in the training set. \nFor each architecture 100 runs were performed on each problem. The on-line Back \nPropagation Through Time (BPTT) algorithm was used to train the networks. \nVanilla learning was used with a learning rate of 0.5. Training was stopped at 1000 \nepochs. The weights of all networks were initialized to random values uniformly \ndistributed in the range [-0.1,0.1]. All states were initialized to zero at the beginning \nof each string except for the High Order net in which one state was arbitrarily \ninitialized to a value of 1. \n\nTable 2 summarizes the statistics for each experiment. From these results we draw \nthe following conclusions. \n\u2022 The bilinear and high-order networks do best on the small randomly generated \nmachine, but poorly on the finite memory machine. 
Thus, it would appear that \nthere is benefit to having second order terms in the network, at least for small \nfinite state machine problems. \n\n\u2022 Narendra and Parthasarathy's model and the network with local recurrence do \nfar better than the other networks on the problem of inferring the finite memory \n\nTable 2: Percentage classification error on the FSM experiment for (a) networks with \napproximately the same number of weights, (b) networks with the same number of \nstate variables. %P = the percentage of trials in which the training set was learned \nperfectly, #W = the number of weights, and #S = the number of states. \n\n(a) \nFSM  Architecture\u2020  Train mean (std)  Test mean (std)   %P   #W   #S \nRND  N&P            2.8 (?)           16.9 (8.6)        22   56   8 \nRND  TDNN           12.5 (2.1)        33.8 (?)          0    56   8 \nRND  Gamma          19.6 (?)          24.8 (3.2)        0    56   8 \nRND  First Order    12.9 (6.9)        26.5 (9.0)        0    48   6 \nRND  High Order     0.8 (1.5)         6.2 (6.1)         60   50   5 \nRND  Bilinear       1.3 (2.7)         5.7 (6.1)         46   55   5 \nRND  Quadratic      12.9 (13.4)       17.7 (14.1)       12   45   3 \nRND  Multilayer     19.4 (13.6)       23.4 (13.5)       6    54   4 \nRND  Elman          3.5 (5.5)         12.7 (9.1)        27   55   6 \nRND  Local          2.8 (1.5)         26.7 (7.6)        4    60   20 \nFMM  N&P            0.0 (0.2)         0.1 (1.?)         99   56   8 \nFMM  TDNN           6.9 (2.1)         15.8 (3.2)        0    56   8 \nFMM  Gamma          7.7 (2.2)         15.7 (3.3)        0    56   8 \nFMM  First Order    4.8 (3.0)         16.0 (6.5)        1    48   6 \nFMM  High Order     5.3 (4.0)         26.0 (5.1)        1    50   5 \nFMM  Bilinear       9.5 (10.4)        25.8 (7.0)        0    55   5 \nFMM  Quadratic      32.5 (10.8)       40.5 (7.3)        0    45   3 \nFMM  Multilayer     36.7 (11.9)       43.5 (8.5)        0    54   4 \nFMM  Elman          12.0 (12.5)       24.9 (7.9)        5    55   6 \nFMM  Local          0.1 (0.3)         1.0 (3.0)         97   60   20 \n\n(b) \nFSM  Architecture\u2020\u2020 Train mean (std)  Test mean (std)   %P   #W   #S \nRND  N&P            4.6 (8.?)         14.1 (11.3)       38   73   6 \nRND  TDNN           11.7 (2.0)        34.3 (3.9)        0    73   6 \nRND  Gamma          19.0 (?)          25.2 (3.1)        0    73   6 \nRND  First Order    12.9 (6.9)        26.5 (9.0)        0    48   6 \nRND  High Order     0.3 (0.5)         4.6 (5.1)         79   ?    6 \nRND  Bilinear       0.6 (0.9)         4.4 (?)           55   78   6 \nRND  Quadratic      0.2 (0.5)         3.2 (2.6)         83   216  6 \nRND  Multilayer     15.4 (14.1)       19.9 (?)          16   76   6 \nRND  Elman          3.5 (5.5)         12.7 (9.1)        27   55   6 \nRND  Local          13.9 (4.5)        20.2 (5.7)        0    26   6 \nFMM  N&P            0.1 (0.8)         0.3 (1.4)         97   73   6 \nFMM  TDNN           6.8 (1.7)         16.2 (2.9)        0    73   6 \nFMM  Gamma          9.0 (2.9)         14.9 (2.8)        0    73   6 \nFMM  First Order    4.8 (3.0)         16.0 (6.5)        1    48   6 \nFMM  High Order     1.2 (1.7)         25.1 (5.1)        31   ?    6 \nFMM  Bilinear       2.6 (4.2)         20.3 (7.2)        21   78   6 \nFMM  Quadratic      12.6 (17.3)       26.1 (12.8)       13   216  6 \nFMM  Multilayer     38.1 (12.6)       42.8 (9.2)        0    76   6 \nFMM  Elman          12.8 (?)          27.6 (10.7)       8    55   6 \nFMM  Local          15.3 (3.8)        22.2 (4.9)        0    26   6 \n\n\u2020The TDNN and Gamma network both had 8 input taps and 4 hidden layer nodes. For \nthe Gamma network, $\\mu = 0.3$ (RND) and $\\mu = 0.7$ (FMM). Narendra and Parthasarathy's \nnetwork had 4 input and output taps and 5 hidden layer nodes. The High-order network \nused a \"one-hot\" encoding of the input values [5]. The multilayer network had 4 hidden \nand output layer nodes. 
The locally recurrent net had 4 hidden layer nodes with 5 input \nand 3 output taps, and one output node with 3 input and output taps. \n\u2020\u2020The TDNN, Gamma network, and Narendra and Parthasarathy's network all had 8 \nhidden layer nodes. For the Gamma network, $\\mu = 0.3$ (RND) and $\\mu = 0.7$ (FMM). The \nHigh-order network again used a \"one-hot\" encoding of the input values. The multilayer \nnetwork had 5 hidden and 6 output layer nodes. The locally recurrent net had 3 hidden \nlayer nodes and one output layer node, all with only one input and output tap. \n\nmachine when the number of states is not constrained. It is not surprising that \nthe former network did so well since the sequential machine implementation of \na finite memory machine is similar to this architecture [6]. However, the result \nfor the locally recurrent network was unexpected. \n\n\u2022 All of the recurrent networks do better than the TDNN on the small random \nmachine. However, on the finite memory machine the TDNN does surprisingly \nwell, perhaps because its structure is similar to Narendra and Parthasarathy's \nnetwork, which was well suited for this problem. \n\n\u2022 Gradient-based learning algorithms are not adequate for many of these architectures. \nIn many cases a network is capable of representing a solution to a \nproblem that the algorithm was not able to find. This seems particularly true \nfor the Multilayer network. \n\n\u2022 Not surprisingly, an increase in the number of weights typically leads to overtraining. \nHowever, the quadratic network, which has 216 weights, can consistently \nfind solutions for the random machine that generalize well even though \nthere are only 127 training samples. 
\n\n\u2022 Although the performance on the training set is not always a good indicator of \ngeneralization performance on the testing set, we find that if a network is able \nto frequently find perfect solutions for the training data, then it also does well \non the testing data. \n\n3.3 Nonlinear system identification \n\nIn this problem, we train the network to learn the dynamics of the following set of \nequations proposed in [8] \n\n$$z_1(k+1) = \\frac{z_1(k) + 2 z_2(k)}{1 + z_2^2(k)} + u(k)$$ \n$$z_2(k+1) = \\frac{z_1(k) z_2(k)}{1 + z_2^2(k)} + u(k)$$ \n$$y(k) = z_1(k) + z_2(k)$$ \n\nbased on observations of u(k) and y(k) alone. \nThe same networks that were used for the finite state machine problems were used \nhere, except that the output node was changed to be linear instead of sigmoidal \nto allow the network to have an appropriate dynamic range. We found that this \ncaused some stability problems in the quadratic and locally recurrent networks. For \nthe fixed number of weights comparison, we added an extra node to the quadratic \nnetwork, and dropped any second order terms involving the fed back output. This \ngave a network with 64 weights and 4 states. For the fixed state comparison, \ndropping the second order terms gave a network with 174 weights. The locally \nrecurrent network presented stability problems only for the fixed number of weights \ncomparison. Here, we used a network that had 6 hidden layer nodes and one output \nnode with 2 taps on the inputs and outputs each, giving a network with 57 weights \nand 16 states. In the Gamma network a value of $\\mu = 0.8$ gave the best results. \nThe networks were trained with 100 uniform random noise sequences of length 50. \nEach experiment used a different randomly generated training set. The noise was \n\nTable 3: Normalized mean squared error on a sinusoidal test signal for the nonlinear \nsystem identification experiment. 
\n\nArchitecture  Fixed # weights  Fixed # states \nN&P           0.101            0.067 \nTDNN          0.160            0.165 \nGamma         0.157            0.151 \nFirst Order   0.105            0.105 \nHigh Order    1.034            1.050 \nBilinear      0.118            0.111 \nQuadratic     0.108            0.096 \nMultilayer    0.096            0.084 \nElman         0.115            0.115 \nLocal         0.117            0.123 \n\nuniformly distributed in the range [-2.0,2.0], and each sequence started with an \ninitial value of $x_1(0) = x_2(0) = 0$. The networks were tested on the response to \na sine wave of frequency 0.04 radians/second. This is an interesting test signal \nbecause it is fundamentally different from the training data. \nFifty runs were performed for each network. BPTT was used for 500 epochs with a \nlearning rate of 0.002. The weights of all networks were initialized to random values \nuniformly distributed in the range [-0.1,0.1]. \nTable 3 shows the normalized mean squared error averaged over the 50 runs on the \ntesting set. From these results we draw the following conclusions. \n\u2022 The high order network could not seem to match the dynamic range of its output \nto the target; as a result it performed much worse than the other networks. It is \nclear that there is benefit to adding first order terms since the bilinear network \nperformed so much better. \n\n\u2022 Aside from the high order network, all of the other recurrent networks performed \nbetter than the TDNN, although in most cases not significantly better. \n\n\u2022 The multilayer network performed exceptionally well on this problem, unlike the \nfinite state machine experiments. We speculate that the existence of target output \nat every point along the sequence (unlike the finite state machine problems) \nis important for the multilayer network to be successful. \n\n\u2022 Narendra and Parthasarathy's architecture did exceptionally well, even though \nit is not clear that its structure is well matched to the problem. 
\n\n4 Conclusions \n\nWe have reviewed many discrete-time recurrent neural network architectures and \ncompared them on two different problem domains, although we make no claim that \nany of these results will necessarily extend to other problems. \n\nNarendra and Parthasarathy's model performed exceptionally well on the problems \nwe explored. In general, single layer networks did fairly well; however, it is important \nto include terms besides simple state/input products for nonlinear system identification. \nAll of the recurrent networks usually did better than the TDNN except \non the finite memory machine problem. In these experiments, the use of averaging \nfilters as a substitute for taps in the TDNN did not seem to offer any distinct advantages \nin performance, although better results might be obtained if the value of \n$\\mu$ is adapted. \n\nWe found that the relative comparison of the networks did not significantly change \nwhether the number of weights or the number of states was held constant. In fact, holding \none of these values constant meant that in some networks the other value varied \nwildly, yet there appeared to be little correlation with generalization. \n\nFinally, it is interesting to note that though some are much better than others, \nmany of these networks are capable of providing adequate solutions to two seemingly \ndisparate problems. \n\nAcknowledgements \n\nWe would like to thank Leon Personnaz and Isabelle Rivals for suggesting we perform \nthe experiments with a fixed number of states. \n\nReferences \n[1] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture \nfor time series modeling. Neural Computation, 3(3):375-385, 1991. \n\n[2] B. de Vries and J.C. Principe. The gamma model: A new neural model for \ntemporal processing. Neural Networks, 5:565-576, 1992. \n\n[3] J.L. Elman. Finding structure in time. 
Cognitive Science, 14:179-211, 1990. \n\n[4] P. Frasconi, M. Gori, and G. Soda. Local feedback multilayered networks. \nNeural Computation, 4:120-130, 1992. \n\n[5] C.L. Giles, C.B. Miller, et al. Learning and extracting finite state automata \nwith second-order recurrent neural networks. Neural Computation, 4:393-405, \n1992. \n\n[6] Z. Kohavi. Switching and finite automata theory. McGraw-Hill, NY, 1978. \n\n[7] K.J. Lang, A.H. Waibel, and G.E. Hinton. A time-delay neural network architecture \nfor isolated word recognition. Neural Networks, 3:23-44, 1990. \n\n[8] K.S. Narendra. Adaptive control of dynamical systems using neural networks. \nIn Handbook of Intelligent Control, pages 141-183. Van Nostrand Reinhold, \nNY, 1992. \n\n[9] K.S. Narendra and K. Parthasarathy. Identification and control of dynamical \nsystems using neural networks. IEEE Trans. on Neural Networks, 1:4-27, 1990. \n\n[10] P. Poddar and K.P. Unnikrishnan. Non-linear prediction of speech signals \nusing memory neuron networks. In Proc. 1991 IEEE Work. Neural Networks \nfor Sig. Proc., pages 1-10. IEEE Press, 1991. \n\n[11] A.J. Robinson and F. Fallside. Static and dynamic error propagation networks \nwith application to speech coding. In NIPS, pages 632-641, NY, 1988. AIP. \n\n[12] R.L. Watrous and G.M. Kuhn. Induction of finite-state automata using \nsecond-order recurrent networks. In NIPS4, pages 309-316, 1992. \n\n", "award": [], "sourceid": 1009, "authors": [{"given_name": "Bill", "family_name": "Horne", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}]}