{"title": "Training and Analysing Deep Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 198, "abstract": "Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this pa- per we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hi- erarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modelling when trained with sim- ple stochastic gradient descent. We also offer an analysis of the different emergent time scales.", "full_text": "Training and Analyzing Deep Recurrent Neural\n\nNetworks\n\nMichiel Hermans, Benjamin Schrauwen\n\nGhent University, ELIS departement\n\nmichiel.hermans@ugent.be\n\nSint Pietersnieuwstraat 41,\n\n9000 Ghent, Belgium\n\nAbstract\n\nTime series often have a temporal hierarchy, with information that is spread out\nover multiple time scales. Common recurrent neural networks, however, do not\nexplicitly accommodate such a hierarchy, and most research on them has been\nfocusing on training algorithms rather than on their basic architecture. In this pa-\nper we study the effect of a hierarchy of recurrent neural networks on processing\ntime series. Here, each layer is a recurrent network which receives the hidden\nstate of the previous layer as input. This architecture allows us to perform hi-\nerarchical processing on dif\ufb01cult temporal tasks, and more naturally capture the\nstructure of time series. We show that they reach state-of-the-art performance for\nrecurrent networks in character-level language modeling when trained with sim-\nple stochastic gradient descent. We also offer an analysis of the different emergent\ntime scales.\n\n1\n\nIntroduction\n\nThe last decade, machine learning has seen the rise of neural networks composed of multiple layers,\nwhich are often termed deep neural networks (DNN). In a multitude of forms, DNNs have shown to\nbe powerful models for tasks such as speech recognition [17] and handwritten digit recognition [4].\nTheir success is commonly attributed to the hierarchy that is introduced due to the several layers.\nEach layer processes some part of the task we wish to solve, and passes it on to the next. In this\nsense, the DNN can be seen as a processing pipeline, in which each layer solves a part of the task\nbefore passing it on to the next, until \ufb01nally the last layer provides the output.\nOne type of network that debatably falls into the category of deep networks is the recurrent neural\nnetwork (RNN). When folded out in time, it can be considered as a DNN with inde\ufb01nitely many\nlayers. The comparison to common deep networks falls short, however, when we consider the func-\ntionality of the network architecture. For RNNs, the primary function of the layers is to introduce\nmemory, not hierarchical processing. 
New information is added in every \u2018layer\u2019 (every network it-\neration), and the network can pass this information on for an inde\ufb01nite number of network updates,\nessentially providing the RNN with unlimited memory depth. Whereas in DNNs input is only pre-\nsented at the bottom layer, and output is only produced at the highest layer, RNNs generally receive\ninput and produce output at each time step. As such, the network updates do not provide hierarchi-\ncal processing of the information per se, only in the respect that older data (provided several time\nsteps ago) passes through the recursion more often. There is no compelling reason why older data\nwould require more processing steps (network iterations) than newly received data. More likely, the\nrecurrent weights in an RNN learn during the training phase to select what information they need to\npass onwards, and what they need to discard. Indeed, this quality forms the core motivation of the\nso-called Long Short-term memory (LSTM) architecture [11], a special form of RNN.\n\n1\n\n\fFigure 1: Schematic illustration of a DRNN. Arrows represent connection matrices, and white,\nblack and grey circles represent input frames, hidden states, and output frames respectively. Left:\nStandard RNN, folded out in time. Middle: DRNN of 3 layers folded out in time. Each layer can\nbe interpreted as an RNN that receives the time series of the previous layer as input. Right: The two\nalternative architectures that we study in this paper, where the looped arrows represent the recurrent\nweights. Either only the top layer connects to the output (DRNN-1O), or all layers do (DRNN-AO).\n\nOne potential weakness of a common RNN is that we may need complex, hierarchical processing of\nthe current network input, but this information only passes through one layer of processing before\ngoing to the output. Secondly, we may need to process the time series at several time scales. If\nwe consider for example speech, at the lowest level it is built up of phonemes, which exist on a\nvery short time-scale. Next, on increasingly longer time scales, there are syllables, words, phrases,\nclauses, sentences, and at the highest level for instance a full conversation. Common RNNs do not\nexplicitly support multiple time scales, and any temporal hierarchy that is present in the input signal\nneeds to be embedded implicitly in the network dynamics.\nIn past research, some hierarchical architectures employing RNNs have been proposed [3, 5, 6].\nEspecially [5] is interesting in the sense that they construct a hierarchy of RNNs, which all oper-\nate on different time-scales (using subsampling). The authors limit themselves to arti\ufb01cial tasks,\nhowever. The architecture we study in this paper has been used in [8]. Here, the authors employ\nstacked bi-directional LSTM networks, and train it on the TIMIT phoneme dataset [7] in which they\nobtain state-of-the-art performance. Their paper is strongly focused on reaching good performance,\nhowever, and little analysis on the actual contribution of the network architecture is provided.\nThe architecture we study in this paper is essentially a common DNN (a multilayer perceptron) with\ntemporal feedback loops in each layer, which we call a deep recurrent neural network (DRNN).\nEach network update, new information travels up the hierarchy, and temporal context is added in\neach layer (see Figure 1). This basically combines the concept of DNNs with RNNs. 
Each layer in the hierarchy is a recurrent neural network, and each subsequent layer receives the hidden state of the previous layer as its input time series. As we will show, stacking RNNs automatically creates different time scales at different levels, and therefore a temporal hierarchy.
In this paper we study character-based language modelling and provide a more in-depth analysis of how the network architecture relates to the nature of the task. We suspect that DRNNs are well-suited to capture temporal hierarchies, and character-based language modelling is an excellent real-world task to validate this claim, as the distribution of characters is highly nonlinear and covers both short- and long-term dependencies. As we will show, DRNNs embed these different time scales directly in their structure, and they are able to model long-term dependencies. Using only stochastic gradient descent (SGD) we are able to reach state-of-the-art performance for recurrent networks on a Wikipedia-based text corpus, which was previously only obtained using the far more advanced Hessian-free training algorithm [19].

2 Deep RNNs

2.1 Hidden state evolution

We define a DRNN with L layers and N neurons per layer. Suppose we have an input time series s(t) of dimensionality N_in, and a target time series y*(t). In order to simplify notation we do not explicitly write out bias terms, but augment the corresponding variables with an element equal to one. We use the notation \bar{x} = [x; 1].
We denote the hidden state of the i-th layer by a_i(t). Its update equation is given by:

a_i(t) = \tanh\left( W_i a_i(t-1) + Z_i \bar{a}_{i-1}(t) \right)  \quad \text{if } i > 1,
a_i(t) = \tanh\left( W_i a_i(t-1) + Z_i \bar{s}(t) \right)        \quad \text{if } i = 1.

Here, W_i and Z_i are the recurrent connections and the connections from the lower layer or the input time series, respectively. A schematic drawing of the DRNN is presented in Figure 1.
Note that the network structure inherently offers different time scales. The bottom layer has a fading memory of the input signal. The next layer has a fading memory of the hidden state of the bottom layer, and consequently a fading memory of the input which reaches further into the past, and so on for each additional layer.

2.2 Generating output

The task we consider in this paper is a classification task, and we use a softmax function to generate output. The DRNN generates an output which we denote by y(t). We consider two scenarios: one where only the highest layer in the hierarchy couples to the output (DRNN-1O), and one where all layers do (DRNN-AO). In the two respective cases, y(t) is given by

y(t) = \mathrm{softmax}\left( U \bar{a}_L(t) \right),   (1)

where U is the matrix with the output weights, and

y(t) = \mathrm{softmax}\left( \sum_{i=1}^{L} U_i \bar{a}_i(t) \right),   (2)

where U_i corresponds to the output weights of the i-th layer. The two resulting architectures are depicted in the right part of Figure 1.
The reason that we use output connections at each layer is twofold. First, like any deep architecture, DRNNs suffer from a pathological curvature in the cost function. If we use backpropagation through time, the error will propagate from the top layer down the hierarchy, but it will have diminished in magnitude once it reaches the lower layers, such that they are not trained effectively.
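As a concrete illustration of the state update and of the two output schemes defined above, the following minimal NumPy sketch implements one forward step of a DRNN. It is only a sketch under our own naming conventions (augment, drnn_step, drnn_output, W, Z, U are not from the paper); biases are handled by the augmentation \bar{x} = [x; 1], as in the text.

import numpy as np

def augment(x):
    # x_bar = [x; 1]: append a constant one so that bias terms live inside the weight matrices
    return np.concatenate([np.asarray(x, dtype=float), [1.0]])

def drnn_step(a_prev, s_t, W, Z):
    # One time step of the DRNN. a_prev: list of L hidden states at t-1,
    # s_t: input frame, W[i]: recurrent weights of layer i, Z[i]: bottom-up weights of layer i.
    a_new = []
    inp = augment(s_t)                      # layer 1 receives the (augmented) input frame
    for i in range(len(W)):
        a_i = np.tanh(W[i] @ a_prev[i] + Z[i] @ inp)
        a_new.append(a_i)
        inp = augment(a_i)                  # layer i+1 receives the (augmented) state of layer i
    return a_new

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def drnn_output(a, U, all_layers=True):
    # DRNN-AO (Eq. 2): softmax of the summed per-layer projections.
    # DRNN-1O (Eq. 1): softmax of the top-layer projection only.
    if all_layers:
        logits = sum(U[i] @ augment(a[i]) for i in range(len(a)))
    else:
        logits = U[-1] @ augment(a[-1])
    return softmax(logits)

In this sketch W[i] is N x N, Z[0] is N x (N_in + 1), Z[i > 0] is N x (N + 1), and each U_i is N_out x (N + 1).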
Adding output\nconnections at each layer amends this problem to some degree as the training error reaches all layers\ndirectly.\nSecondly, having output connections at each layer provides us with a crude measure of its role in\nsolving the task. We can for instance measure the decay of performance by leaving out an individ-\nual layer\u2019s contribution, or study which layer contributes most to predicting characters in speci\ufb01c\ninstances.\n\n2.3 Training setup\n\nIn all experiments we used stochastic gradient descent. To avoid extremely large gradients near\nbifurcations, we applied the often-used trick of normalizing the gradient before using it for weight\nupdates. This simple heuristic seems to be effective to prevent gradient explosions and sudden jumps\nof the parameters, while not diminishing the end performance. We write the number of batches we\ntrain on as T . The learning rate is set at an initial value \u03b70, and drops linearly with each subsequent\nweight update. Suppose \u03b8(j) is the set of all trainable parameters after j updates, and \u2207\u03b8(j) is the\ngradient of a cost function w.r.t. this parameter set, as computed on a randomly sampled part of the\ntraining set. Parameter updates are given by:\n\n\u03b8(j + 1) = \u03b8(j) \u2212 \u03b70\n\n1 \u2212 j\nT\n\n(cid:18)\n\n(cid:19) \u2207\u03b8(j)\n\n||\u2207\u03b8(j)|| .\n\n(3)\n\nIn the case where we use output connections at the top layer only, we use an incremental layer-wise\nmethod to train the network, which was necessary to reach good performance. We add layers one\nby one and at all times an output layer only exists at the current top layer. When adding a layer, the\nprevious output weights are discarded and new output weights are initialised connecting from the\nnew top layer. In this way each layer has at least some time during training in which it is directly\n\n3\n\n\fcoupled to the output, and as such can be trained effectively. Over the course of each of these training\nstages we used the same training strategy as described before: training the full network with BPTT\nand linearly reducing the learning rate to zero before a new layer is added. Notice the difference to\ncommon layer-wise training schemes where only a single layer is trained at a time. We always train\nthe full network after each layer is added.\n\n3 Text prediction\n\nIn this paper we consider next character prediction on a Wikipedia text-corpus [19] which was\nmade publicly available1. The total set is about 1.4 billion characters long, of which the \ufb01nal 10\nmillion is used for testing. Each character is represented by one-out-of-N coding. We used 95 of\nthe most common characters2 (including small letters, capitals, numbers and punctuation), and one\n\u2018unknown\u2019 character, used to map any character not part of the 95 common ones, e.g. Cyrillic and\nChinese characters. We need time in the order of 10 days to train a single network, largely due to\nthe dif\ufb01culty of exploiting massively parallel computing for SGD. Therefore we only tested three\nnetwork instantiations3. Each experiment was run on a single GPU (NVIDIA GeForce GTX 680,\n4GB RAM).\nThe task is as follows: given a sequence of text, predict the probability distribution of the next\ncharacter. 
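For reference, the normalized-gradient update of Eq. (3) in Section 2.3 is straightforward to write down. The sketch below assumes the trainable parameters are flattened into a single vector and that grad_fn returns the BPTT gradient of the cost on a sampled batch; function and variable names are ours, not the paper's.

import numpy as np

def train_sgd_normalized(theta, grad_fn, sample_batch, eta0, T):
    # Eq. (3): theta(j+1) = theta(j) - eta0 * (1 - j/T) * grad / ||grad||,
    # with the learning rate decaying linearly to zero over T updates.
    for j in range(T):
        grad = grad_fn(theta, sample_batch())     # gradient on a randomly sampled batch (BPTT)
        norm = np.linalg.norm(grad)
        if norm > 0.0:                            # normalising the gradient avoids huge jumps near bifurcations
            theta = theta - eta0 * (1.0 - j / T) * grad / norm
    return theta

For the DRNN-1O, the same loop would be run once per added layer with its own T, discarding and re-initialising the output weights each time a layer is added, as described above.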
The performance metric used is the average number of bits-per-character (BPC), given by

\mathrm{BPC} = -\langle \log_2 p_c \rangle,

where p_c is the probability the network assigns to the correct next character.

3.1 Network setups

The challenge in character-level language modelling lies in the great diversity and sheer number of words that are used. In the case of Wikipedia this difficulty is exacerbated by the large number of names of persons and places, scientific jargon, etc. In order to capture this diversity we need large models with many trainable parameters.
All our networks have a number of neurons selected such that in total they each had approximately 4.9 million trainable parameters, which allowed us to make a comparison to other published work [19]. We considered three networks: a common RNN (2119 units), a 5-layer DRNN-1O (727 units per layer), and a 5-layer DRNN-AO (706 units per layer)4. Initial learning rates η0 were chosen at 0.5, except for the top layer of the DRNN-1O, where we picked η0 = 0.25 (as we observed that the nodes started to saturate if we used too high a learning rate).
The RNN and the DRNN-AO were trained over T = 5 × 10^5 parameter updates. The network with output connections only at the top layer had a different number of parameter updates per training stage, T = {0.5, 1, 1.5, 2, 2.5} × 10^5 for the 5 layers respectively. As such, for each additional layer the network is trained for more iterations. All gradients are computed using backpropagation through time (BPTT) on 75 randomly sampled sequences in parallel, drawn from the training set. All sequences were 250 characters long, and the first 50 characters were disregarded during the backwards pass, as they may have insufficient temporal context. In the end the DRNN-AO sees the full training set about 7 times in total, and the DRNN-1O about 10 times.
The matrices W_i and Z_{i>1} were initialised with elements drawn from N(0, N^{-1/2}). The input weights Z_1 were drawn from N(0, 1). We chose to have the same number of neurons for every layer, mostly to reduce the number of parameters that need to be optimised. Output weights were always initialised to zero.

1 http://www.cs.toronto.edu/~ilya/mrnns.tar.gz
2 In [19] only 86 characters are used, but most of the additional characters in our set are exceedingly rare, such that cross-entropy is not affected meaningfully by this difference.
3 In our experience the networks are so large that there is very little difference in performance for different initialisations.
4 The decision for 5 layers is based on a previous set of experiments (results not shown).

Model                                 BPC test
RNN                                   1.610
DRNN-AO                               1.557
DRNN-1O                               1.541
MRNN                                  1.55
PAQ                                   1.51
Hutter Prize (current record) [12]    1.276
Human level (estimated) [18]          0.6 – 1.3

Table 1: Results on the Wikipedia character prediction task. The first three numbers are our measurements, the next two the results on the same dataset found in [19]. The bottom two numbers were not measured on the same text corpus.

Figure 2: Increase in BPC on the test set from removing the output contribution of a single layer of the DRNN-AO.

3.2 Results

Performance and text generation
The resulting BPCs for our models and comparative results in the literature are shown in Table 1. The common RNN performs worst, and the DRNN-1O the best, with the DRNN-AO slightly worse.
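As a small illustrative sketch (our naming, not the paper's), the BPC values reported in Table 1 can be computed directly from the per-step output distributions:

import numpy as np

def bits_per_character(probs, targets):
    # probs: array of shape (T, n_chars) with the predicted distribution at each step,
    # targets: integer index of the correct next character at each step.
    # BPC = -<log2 p_c>: the average negative log2 probability of the correct character.
    p_correct = probs[np.arange(len(targets)), targets]
    return float(np.mean(-np.log2(p_correct)))

For scale, a model that predicts uniformly over the 96-character set would score log2(96) ≈ 6.6 BPC.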
Both\nDRNNs perform well and are roughly similar to the state-of-the-art for recurrent networks with the\nsame number of trainable parameters5, which was established with a multiplicative RNN (MRNN),\ntrained with Hessian-free optimization in the course of 5 days on a cluster of 8 GPUs6. The same\nauthors also used the PAQ compression algorithm [14] as a comparison, which we included in the\nlist. In the table we also included two results which were not measured on the same dataset (or even\nusing the same criteria), but which give an estimation of the true number of BPC for natural text.\nTo check how each layer in\ufb02uences performance in the case of the DRNN-AO, we performed tests\nin which the output of a single layer is set to zero. This can serve as a sanity check to ensure\nthat the model is ef\ufb01ciently trained.\nIf for instance removing the top layer output contribution\ndoes not signi\ufb01cantly harm performance, this essentially means that it is redundant (as it does no\npreprocessing for higher layers). Furthermore we can use this test to get an overall indication of\nwhich role a particular layer has in producing output. Note that these experiments only have a limited\ninterpretability, as the individual layer contributions are likely not independent. Perhaps some layers\nprovide strong negative output bias which compensates for strong positive bias of another, or strong\nsynergies might exists between them.\nFirst we measure the increase in test BPC by removing a single layer\u2019s output contribution, which\ncan then be used as an indicator for the importance of this layer for directly generating output. In\nFigure 2 we show the result. The contribution of the top layer is the most important, and that of the\nbottom layer second important. The intermediate layers contribute less to the direct output and seem\nto be more important in preprocessing the data for the top layer.\nAs in [19], we also used the networks in a generative mode, where we use the output probabilities\nof the DRNN-AO to recursively sample a new input character in order to complete a given sentence.\nWe too used the phrase \u201cThe meaning of life is \u201d. We performed three tests: \ufb01rst we generated\ntext with an intact network, next we see how the text quality deteriorates when we leave out the\ncontributions of the bottom and top layer respectively7 (by setting it equal to zero before adding up\n\n5This similarity might re\ufb02ect limitations caused by the network size. We also performed a long-term ex-\nperiment with a DRNN-AO with 9.6 million trainable parameters, which resulted in a test BPC of 1.472 after\n1,000,000 weight updates (training for over a month). More parameters offer more raw storage power, and\nhence provide a straightforward manner in which to increase performance.\n\n6This would suggest a computational cost of roughly 4 times ours, but an honest comparison is hard to make\nas the authors did not specify explicitly how much data their training algorithm went through in total. Likely\nthe cost ratio is smaller than 4, as we use a more modern GPU.\n\n7Leaving out the contributions of intermediate layers only has a minimal effect on the subjective quality of\n\nthe produced text.\n\n5\n\n1234500.511.52Removed layerIncrease in BPC test  \fThe meaning of life is the \u201ddecorator of\nRose\u201d. 
The Ju along with its perspec-\ntive character survive, which coincides\nwith his eromine, water and colorful\nart called \u201dCharles VIII\u201d.??In \u201dInferno\u201d\n(also 220:\n\u201dThe second Note Game\nMagazine\u201d, a comic at the Old Boys\nat the Earl of Jerusalem for two years)\nfocused on expanded domestic differ-\nences from 60 mm Oregon launching,\nand are permitted to exchange guid-\nance.\n\nof\n\nis\n\nto\n\nlife\nunprecede\n\nimpos-\nThe meaning\n?Pok.{*\nsible\nPRER)!\u2014KGOREMFHEAZ CTX=R M\n\u2014S=6 5?&+\u2014\u2014=7xp*= 5FJ4\u201413/TxI\nJX=\u2014b28O=&4+E9F=&Z26 \u2014R&N==\nZ8&A=58=84&T=RESTZINA=L&95Y\n2O59&FP85=&&#=&H=S=Z IO =T\n@\u2014CBOM=6&9Y1= 9 5\n\nThe meaning of life is man sistastered-\nsteris bus and nuster eril\u201dn ton nis our\nousNmachiselle here hereds?d topp-\nstes impedred wisv.\u201d-hor ens htls be-\ntwez rese, and Intantored wren in\nthoug and elit\ntoren on the marcel,\ngos infand foldedsamps que help sase-\ncre hon Roser and ens in respoted\nwe frequen enctuivat herde pitched\npitchugismissedre and lose\ufb02owered\n\nTable 2: Three examples of text, generated by the DRNN-AO. The left one is generated by the intact\nnetwork, the middle one by leaving out the contribution of the \ufb01rst layer, and the right one by leaving\nout the contribution of the top layer.\n\nFigure 3: Left panel: normalised average distance between hidden states of a perturbed and unper-\nturbed network as a function of presented characters. The perturbation is a single typo at the \ufb01rst\ncharacter. The coloured full lines are for the individual layers of the DRNN-1O, and the coloured\ndashed lines are those of the layers of the DRNN-AO. Distances are normalised on the distance of\nthe occurrence of the typo. Right panel: Average increase in BPC between a perturbed and un-\nperturbed network as a function of presented characters. The perturbation is by replacing the initial\ncontext (see text), and the result is shown for the text having switched back to the correct context.\nColoured lines correspond to the individual contributions of the layers in the DRNN-AO.\n\nlayer contributions and applying the softmax function). Resulting text samples are shown in Table\n2. The text sample of the intact network shows short-term correct grammar, phrases, punctuation\nand mostly existing words. The text sample with the bottom layer output contribution disabled very\nrapidly becomes \u2018unstable\u2019, and starts to produce long strings of rare characters, indicating that the\ncontribution of the bottom layer is essential in modeling some of the most basic statistics of the\nWikipedia text corpus. We veri\ufb01ed this further by using such a random string of characters as ini-\ntialization of the intact network, and observed that it consistently fell back to producing \u2018normal\u2019\ntext. The text sample with the top layer disabled is interesting in the sense that it produces roughly\nword-length strings of common characters (letters and spaces), of which substrings resemble com-\nmon syllables. This suggests that the top layer output contribution captures text statistics longer than\nword-length sequences.\nTime scales\nIn order to gauge at what time scale each individual layer operates, we have performed several\nexperiments on the models. 
First of all, we considered an experiment in which we run the DRNN on two identical text sequences from the test set, but after 100 characters we introduce a typo in one of them (by replacing it with a character randomly sampled from the full set). We record the hidden states of both the perturbed and unperturbed network after the typo, and measure the Euclidean distance between them as a function of time, to see how long the effect of the typo remains present in each layer.
Next we measured the length of the context that the DRNNs effectively employ. In order to do so we measured the average difference in BPC between normal text and a perturbed copy, in which we replaced the first 100 characters by text randomly sampled from elsewhere in the test set. This gives an indication of how long the lack of correct context lingers after the text sequence has switched. All measurements were averaged over 50,000 instances. Results are shown in Figure 3. The left panel shows how fast each individual layer in the DRNNs forgets the typo-perturbation. It appears that the layer-wise time scales behave quite differently in the case of the DRNN-1O and the DRNN-AO. The DRNN-AO has very short time scales in the three bottom layers and longer memory only appears for the two top ones, whereas in the DRNN-1O the bottom two layers have relatively short time scales, but the top three layers have virtually the same, very long time scale. This is almost certainly caused by the way in which we trained the DRNN-1O, such that intermediate layers already assumed long memory when they were at the top of the hierarchy. The effect of the perturbation on the normal RNN is also shown. Even though it decays faster at the start, the effect of the perturbation remains present in the network for a long period as well.
The right panel of Figure 3 depicts the effect of switching the context on the actual prediction accuracy, which gives some insight into the actual length of the context used by the networks. Both DRNNs seem to recover more slowly from the context switch than the RNN, indicating that they employ a longer context for prediction. The time scales of the individual layers of the DRNN-AO are also depicted (by using the perturbed hidden states of an individual layer and the unperturbed states of the other layers for generating output), which largely confirms the result from the typo-perturbation test.
The results shown here verify that a temporal hierarchy develops when training a DRNN.
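A minimal sketch of the typo-perturbation measurement just described (our naming; step_fn stands for any function that consumes one character and returns the updated list of per-layer hidden states, e.g. a trained drnn_step closed over its weights):

import numpy as np

def typo_layer_distances(step_fn, init_states, char_ids, typo_pos, typo_id, horizon=100):
    # Run a clean and a perturbed copy of the network on the same character sequence;
    # the perturbed copy sees a random replacement character at position typo_pos.
    # Returns, for each time step after the typo, the Euclidean distance per layer.
    clean = [s.copy() for s in init_states]
    pert = [s.copy() for s in init_states]
    distances = []
    for t, c in enumerate(char_ids[:typo_pos + horizon]):
        clean = step_fn(clean, c)
        pert = step_fn(pert, typo_id if t == typo_pos else c)
        if t >= typo_pos:
            distances.append([np.linalg.norm(a - b) for a, b in zip(clean, pert)])
    return np.array(distances)      # shape (horizon, n_layers)

In the paper these curves are normalised by the distance at the moment the typo occurs and averaged over 50,000 instances.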
We have also performed a test to see what the time scales of an untrained DRNN are (by performing the typo test), which showed that the differences in time scales between layers were far smaller (results not shown). The big differences we see in the trained DRNNs are hence a learned property.

Long-term interactions: parentheses

In order to get a clearer picture of some of the long-term dependencies the DRNNs have learned, we look at their capability of closing parentheses, even when the phrase between parentheses is long. To see how well the networks remember the opening of a parenthesis, we observe the DRNN-AO output for the closing-parenthesis character (results on the DRNN-1O are qualitatively similar). In Figure 4 we show an example for an especially long phrase between parentheses. We show both the output probability and the individual layers' output contributions for the closing parenthesis (before they are added up and sent to the softmax function). The output of the top layer for the closing parenthesis is increased strongly for the whole duration of the phrase, and is reduced immediately after it is closed.

Figure 4: Network output example for a particularly long phrase between parentheses (296 characters), sampled from the test set. The vertical dashed lines indicate the opening and closing parentheses in the input text sequence. Top panel: output traces for the closing parenthesis character for each layer in the DRNN-AO. Coloring is identical to that of Figure 3. Bottom panel: total predicted output probability of the closing parenthesis sign of the DRNN-AO.

The total output probability shows a similar pattern, showing momentary high probabilities for the closing parenthesis only during the parenthesized phrase, and extremely low probabilities elsewhere. These results are quite consistent over the test set, with some notable exceptions. When several sentences appear between parentheses (which occasionally happens in the text corpus), the network reduces the closing bracket probability (i.e., essentially 'forgets' it) as soon as a full stop appears9. Similarly, if a sentence starts with an opening bracket it will not increase the closing parenthesis probability at all, essentially ignoring it. Furthermore, the model seems unable to cope with nested parentheses (perhaps because they are quite rare). The fact that the DRNN is able to remember the opening parenthesis for sequences longer than it has been trained on indicates that it has learned to model parentheses as a pseudo-stable attractor-like state, rather than memorizing parenthesized phrases of different lengths.
In order to see how well the networks can close parentheses when they operate in the generative mode, we performed a test in which we initialize the network with a 100-character phrase drawn from the test set ending in an opening bracket, and observe in how many cases the network generates a closing bracket. A test is deemed unsuccessful if the closing parenthesis does not appear within 500 characters, or if the network produces a second opening parenthesis. We averaged the results over 2000 initializations. The DRNN-AO performs best in this test, failing in only 12% of the cases. The DRNN-1O fails in 16%, and the RNN in 28%.
The results presented in this section hint at the fact that DRNNs might find it easier than common RNNs to learn long-term relations between input characters. This suggests testing DRNNs on the tasks introduced in [11], which are challenging in the sense that they require retaining a very long memory of past input while being driven by so-called distractor input. It has been shown that LSTMs, and later common RNNs trained with Hessian-free methods [16] and Echo State Networks [13], are able to model such long-term dependencies.
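Returning briefly to the generative parenthesis-closing test described above, a compact sketch of how such a test could be run (sample_next is an assumed helper that samples the next character from the DRNN-AO output distribution and returns it together with the updated network state; the naming is ours):

def closes_parenthesis(sample_next, state, max_chars=500):
    # Generate text after a 100-character context ending in '(' and check whether
    # the network closes the parenthesis before opening a second one.
    prev_char = '('
    for _ in range(max_chars):
        prev_char, state = sample_next(state, prev_char)
        if prev_char == ')':
            return True        # success: parenthesis closed
        if prev_char == '(':
            return False       # failure: a second opening parenthesis appeared first
    return False               # failure: not closed within max_chars

# Failure rate over many sampled contexts (the paper reports 12% / 16% / 28% for the
# DRNN-AO / DRNN-1O / RNN, averaged over 2000 initializations):
# failure_rate = 1.0 - sum(closes_parenthesis(sample_next, s) for s in seed_states) / len(seed_states)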
These tasks, however, purely focus on memory\ndepth, and very little additional processing is required, let alone hierarchical processing. Therefore\nwe do not suspect that DRNNs pose a strong advantage over common RNNs for these tasks in\nparticular.\n\n4 Conclusions and Future Work\n\nWe have shown that using a deep recurrent neural network (DRNN) is bene\ufb01cial for character-\nlevel language modeling, reaching state-of-the-art performance for recurrent neural networks on a\nWikipedia text corpus, con\ufb01rming the observation that deep recurrent architectures can boost perfor-\nmance [8]. We also present experimental evidence for the appearance of a hierarchy of time-scales\npresent in the layers of the DRNNs. Finally we have demonstrated that in certain cases the DRNNs\ncan have extensive memory of several hundred characters long.\nThe training method we obtained on the DRNN-1O indicates that supervised pre-training for deep\narchitectures is helpful, which on its own can provide an interesting line of future research. Another\none is to extend common pre-training schemes, such as the deep belief network approach [9] and\ndeep auto-encoders [10, 20] for DRNNs. The results in this paper can potentially contribute to the\nongoing debate on training algorithms, especially whether SGD or second order methods are more\nsuited for large-scale machine learning problems [2]. Therefore, applying second order techniques\nsuch as Hessian-free training [15] on DRNNs seems an attractive line of future research in order to\nobtain a solid comparison.\n\nAcknowledgments\n\nThis work is partially supported by the interuniversity attraction pole (IAP) Photonics@be of the\nBelgian Science Policy Of\ufb01ce and the ERC NaResCo Starting grant. We would like to thank Sander\nDieleman and Philemon Brakel for helping with implementations. All experiments were performed\nusing Theano [1].\n\n9It is consistently resilient against points appearing in abbreviations such as \u2018e.g.,\u2019 and \u2018dr.\u2019 though.\n\n8\n\n\fReferences\n[1] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley,\nand Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for\nScienti\ufb01c Computing Conference (SciPy), June 2010.\n\n[2] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning,\n\npage 351, 2011.\n\n[3] W.-Y. Chen, Y.-F. Liao, and S.-H. Chen. Speech recognition with hierarchical recurrent neural networks.\n\nPattern Recognition, 28(6):795 \u2013 805, 1995.\n\n[4] D. Ciresan, U. Meier, L. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten\n\ndigit recognition. Neural computation, 22(12):3207\u20133220, 2010.\n\n[5] S. El Hihi and Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. Advances\n\nin Neural Information Processing Systems, 8:493\u2013499, 1996.\n\n[6] S. Fern\u00b4andez, A. Graves, and J. Schmidhuber. Sequence labelling in structured domains with hierarchi-\nIn Proceedings of the 20th International Joint Conference on Arti\ufb01cial\n\ncal recurrent neural networks.\nIntelligence, IJCAI 2007, Hyderabad, India, January 2007.\n\n[7] J. Garofolo, N. I. of Standards, T. (US, L. D. Consortium, I. Science, T. Of\ufb01ce, U. States, and D. A. R. P.\n\nAgency. TIMIT Acoustic-phonetic Continuous Speech Corpus. Linguistic Data Consortium, 1993.\n\n[8] A. Graves, A. Mohamed, and G. Hinton. 
Speech recognition with deep recurrent neural networks. In To\n\nappear in ICASSP 2013, 2013.\n\n[9] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural computation,\n\n18(7):1527\u20131554, 2006.\n\n[10] G. E. Hinton. Reducing the dimensionality of data with neural networks. Science, 313:504\u2013507, 2006.\n[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n[12] M. Hutter. The human knowledge compression prize, 2006.\n[13] H. Jaeger. Long short-term memory in echo state networks: Details of a simulation study. Technical\n\nreport, Jacobs University, 2012.\n\n[14] M. Mahoney. Adaptive weighing of context models for lossless data compression. Florida Tech., Mel-\n\nbourne, USA, Tech. Rep, 2005.\n\n[15] J. Martens. Deep learning via hessian-free optimization. In Proceedings of the 27th International Con-\n\nference on Machine Learning, pages 735\u2013742, 2010.\n\n[16] J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. In Pro-\nceedings of the 28th International Conference on Machine Learning, volume 46, page 68. Omnipress\nMadison, WI, 2011.\n\n[17] A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. Audio, Speech,\n\nand Language Processing, IEEE Transactions on, 20(1):14\u201322, 2012.\n\n[18] C. E. Shannon. Prediction and entropy of printed english. Bell system technical journal, 30(1):50\u201364,\n\n1951.\n\n[19] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In Proceedings\n\nof the 28th International Conference on Machine Learning, pages 1017\u20131024, 2011.\n\n[20] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with\ndenoising autoencoders. In Proceedings of the 25th International Conference on Machine learning, pages\n1096\u20131103, 2008.\n\n9\n\n\f", "award": [], "sourceid": 172, "authors": [{"given_name": "Michiel", "family_name": "Hermans", "institution": "Ghent University"}, {"given_name": "Benjamin", "family_name": "Schrauwen", "institution": "Ghent University"}]}