{"title": "Phoneme Recognition with Large Hierarchical Reservoirs", "book": "Advances in Neural Information Processing Systems", "page_first": 2307, "page_last": 2315, "abstract": "Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.", "full_text": "Phoneme Recognition with Large Hierarchical\n\nReservoirs\n\nFabian Triefenbach\n\nAzarakhsh Jalalvand\n\nBenjamin Schrauwen\n\nJean-Pierre Martens\n\nDepartment of Electronics and Information Systems\n\nGhent University\n\nSint-Pietersnieuwstraat 41, 9000 Gent, Belgium\nfabian.triefenbach@elis.ugent.be\n\nAbstract\n\nAutomatic speech recognition has gradually improved over the years, but the re-\nliable recognition of unconstrained speech is still not within reach. In order to\nachieve a breakthrough, many research groups are now investigating new method-\nologies that have potential to outperform the Hidden Markov Model technology\nthat is at the core of all present commercial systems. In this paper, it is shown\nthat the recently introduced concept of Reservoir Computing might form the basis\nof such a methodology. In a limited amount of time, a reservoir system that can\nrecognize the elementary sounds of continuous speech has been built. 
The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.

1 Introduction

Thanks to a sustained world-wide effort, modern automatic speech recognition technology has now reached a level of performance that makes it suitable as an enabling technology for novel applications such as automated dictation, speech-based car navigation, multimedia information retrieval, etc. Basically all state-of-the-art systems utilize Hidden Markov Models (HMMs) to compose an acoustic model that captures the relations between the acoustic signal and the phonemes, defined as the basic contrastive units of the sound system of a spoken language. The HMM theory has not changed that much over the years, and the performance growth is slow and for a large part owed to the availability of more training data and computing resources.

Many researchers advocate the need for alternative learning methodologies that can supplement or even totally replace the present HMM methodology. In the nineties for instance, very promising results were obtained with Recurrent Neural Networks (RNNs) [1] and with hybrid systems comprising both neural networks and HMMs [2], but these systems have been more or less abandoned since then. More recently, there has been a renewed interest in applying new results originating from the Machine Learning community. Two techniques, namely Deep Belief Networks (DBNs) [3, 4] and Long Short-Term Memory (LSTM) recurrent neural networks [5], have already been used with great success for phoneme recognition. 
In this paper we present the first (to our knowledge) phoneme recognizer that employs Reservoir Computing (RC) [6, 7, 8] as its core technology.

The basic idea of Reservoir Computing (RC) is that complex classifications can be performed by means of a set of simple linear units that 'read out' the outputs of a pool of fixed (not trained) nonlinear interacting neurons. The RC concept has already been successfully applied to time series generation [6], robot navigation [9], signal classification [8], audio prediction [10] and isolated spoken digit recognition [11, 12, 13]. In this contribution we envisage an RC system that can recognize the English phonemes in continuous speech. In a short period (a couple of months) we have been able to design a hierarchical system of large reservoirs that can already compete with many state-of-the-art HMM systems that are the product of several decades of research.

The rest of this paper is organized as follows: in Section 2 we describe the speech corpus we are going to work on, in Section 3 we recall the basic principles of Reservoir Computing, in Section 4 we discuss the architecture of the reservoir system which we propose for performing Large Vocabulary Continuous Speech Recognition (LVCSR), and in Section 5 we demonstrate the potential of this architecture for phoneme recognition.

2 The speech corpus

Since the main aim of this paper is to demonstrate that reservoir computing can yield a good acoustic model, we will conduct experiments on TIMIT, an internationally renowned corpus [14] that was specifically designed to support the development and evaluation of such a model.

The TIMIT corpus contains 5040 English sentences spoken by 630 different speakers representing eight dialect groups. About 70% of the speakers are male, the others are female. 
The corpus documentation defines a training set of 462 speakers and a test set of 168 different speakers: a main test set of 144 speakers and a core test set of 24 speakers. Each speaker has uttered 10 sentences: two SA sentences which are the same for all speakers, five SX sentences from a list of 450 sentences (each one thus appearing 7 times in the corpus) and three SI sentences from a set of 1890 sentences (each one thus appearing only once in the corpus). To avoid a biased result, the SA sentences will be excluded from training and testing.

For each utterance there is a manual acoustic-phonetic segmentation. It indicates where the phones, defined as the atomic units of the acoustic realizations of the phonemes, begin and end. There are 61 distinct phones, which, for evaluation purposes, are usually reduced to an inventory of 39 symbols, as proposed by [15]. Two types of error rates can be reported for the TIMIT corpus. One is the Classification Error Rate (CER), defined as the percentage of the time for which the top hypothesis of the tested acoustic model is incorrect. The second one is the Recognition Error Rate (RER), defined as the ratio between the number of edit operations needed to convert the recognized symbol sequence into the reference sequence, and the number of symbols in that reference sequence. The edit operations are symbol deletions, insertions and substitutions. Both classification and recognition can be performed at the phone and the phoneme level.

3 The basics of Reservoir Computing

In this paper, a Reservoir Computing network (see Figure 1) is an Echo State Network [6, 7, 8] consisting of a fixed dynamical system (the reservoir) composed of nonlinear recurrently connected neurons which are left untrained, and a set of linear output nodes (read-out nodes). Each output node is trained to recognize one class (one-vs-all classification). 
The number of connections between and within layers can be varied from sparsely connected to fully connected. The reservoir neurons have an activation function f(x) = logistic(x).

Figure 1: A reservoir computing network consists of a reservoir of fixed recurrently connected nonlinear neurons which are stimulated by the inputs, and an output layer of trainable linear units.

The RC approach avoids the back-propagation-through-time learning which can be very time consuming and which suffers from the problem of vanishing gradients [6]. Instead, it employs a simple and efficient linear regression learning of the output weights. The latter tries to minimize the mean squared error between the computed and the desired outputs at all time steps.

Based on its recurrent connections, the reservoir can capture the long-term dynamics of the human articulatory system to perform speech sound classification. This property should give it an advantage over HMMs that rely on the assumption that subsequent acoustical input vectors are conditionally independent.

Besides the 'memory' introduced through the recurrent connections, the neurons themselves can also integrate information over time. Typical neurons that can accomplish this are Leaky Integrator Neurons (LINs) [16]. With such neurons the reservoir state at time k+1 can be computed as follows:

x[k + 1] = (1 − λ) x[k] + λ f(Wres x[k] + Win u[k])   (1)

with u[k] and x[k] representing the inputs and the reservoir state at time k. The W matrices contain the input and recurrent connection weights. It is common to include a constant bias in u[k]. As long as the leak rate λ < 1, the integration function provides an additional fading memory of the reservoir state.

To perform a classification task, the RC network computes the outputs at time k by means of the following linear equation:

y[k] = Wout x[k]   (2)

The reservoir state in this equation is augmented with a constant bias. If the reservoir states at the different time instants form the rows of a large state matrix X and if the corresponding desired outputs form the rows of a matrix D, the optimal Wout emerges from the following equations:

Wout = arg min over W of (1/N) ( ||X W − D||² + ε ||W||² )   (3)

Wout = (Xᵀ X + ε I)⁻¹ (Xᵀ D)   (4)

with N being the number of frames. The regularization constant ε aims to limit the norm of the output weights (this is the so-called Tikhonov or ridge regression). For large training sets, as common in speech processing, the matrices XᵀX and XᵀD are updated on-line in order to suppress the need for huge storage capacity. In this paper, the regularization parameter ε was fixed to 10⁻⁸. This regularization is equivalent to adding Gaussian noise with a variance of 10⁻⁸ to the reservoir state variables.

4 System architecture

The main objective of our research is to build an RC-based LVCSR system that can retrieve the words from a spoken utterance. The general architecture we propose for such a system is depicted in Figure 2.

Figure 2: Hierarchical reservoir architecture with multiple layers.

The preprocessing stage converts the speech waveform into a sequence of acoustic feature vectors representing the acoustic properties in subsequent speech frames. This sequence is supplied to a hierarchical system of RC networks. Each reservoir is composed of LINs which are fully connected to the inputs and to the 41 outputs. 
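To make the reservoir equations concrete, the following minimal NumPy sketch implements one such layer of leaky-integrator neurons with a ridge-regression readout (Eqs. 1-4). It is an illustration, not the OGER-based implementation used in the paper: the sizes, the random seed and the toy data are merely placeholders, while the leak rate, input scale factor and spectral radius values follow Section 4.3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 39 input features, a small reservoir, 41 phoneme outputs.
n_in, n_res, n_out = 39, 500, 41
leak = 0.4           # leak rate (lambda)
isf, sr = 0.4, 0.4   # input scale factor and spectral radius (cf. Section 4.3)
eps = 1e-8           # Tikhonov regularization constant

# Fixed, untrained weights: uniform input weights, Gaussian recurrent
# weights rescaled to the desired spectral radius.
W_in = rng.uniform(-isf, isf, (n_res, n_in))
W_res = rng.normal(size=(n_res, n_res))
W_res *= sr / np.max(np.abs(np.linalg.eigvals(W_res)))

def run_reservoir(U):
    """Eq. (1): leaky-integrator update over an input sequence U (T x n_in)."""
    x = np.zeros(n_res)
    X = np.empty((len(U), n_res + 1))
    for k, u in enumerate(U):
        x = (1 - leak) * x + leak / (1 + np.exp(-(W_res @ x + W_in @ u)))
        X[k] = np.concatenate([x, [1.0]])  # state augmented with a constant bias
    return X

# Toy data: random 'features' and one-hot phoneme targets.
U = rng.normal(size=(200, n_in))
D = np.eye(n_out)[rng.integers(0, n_out, 200)]

# Eqs. (3)-(4): ridge-regression readout, then Eq. (2): linear outputs.
X = run_reservoir(U)
W_out = np.linalg.solve(X.T @ X + eps * np.eye(n_res + 1), X.T @ D)
Y = X @ W_out
```

In the full system the same recipe is applied to real MFCC sequences, the states of all training utterances are accumulated on-line into XᵀX and XᵀD, and the 41 output nodes feed the decoder described below.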
These 41 outputs represent the distinct phonemes of the language. The outputs of the last RC network are supplied to a decoder which retrieves the most likely linguistic interpretation of the speech input, given the information computed by the RC networks and given some prior knowledge of the spoken language. In this paper, the decoder is a phoneme recognizer just accommodating a bigram phoneme language model. In a later stage it will be extended with other components: (1) a phonetic dictionary comprising all the words of the system's vocabulary and their common pronunciations, expressed as phoneme sequences, and (2) an n-gram language model describing the probabilities of each word, given the preceding (n-1) words.

We conjecture that the integration time of the LINs in the first reservoir should ideally be long enough to capture the co-articulations between successive phonemes emerging from the dynamical constraints of the articulatory system. On the other hand, it has to remain short enough to prevent information pointing to the presence of a short phoneme from being blurred too much by the left phonetic context. Furthermore, we argue that additional reservoirs can correct some of the errors made by the first reservoir. Indeed, such an error-correcting reservoir can guess the correct labels from its inputs, and take the past phonetic context into account in an implicit way to refine the decision. This is in contrast to an HMM system which adopts an explicit approach, involving separate models for several thousands of context-dependent phonemes.

In the next subsections we provide more details about the different parts of our recognizer, and we also discuss the tuning of some of its control parameters.

4.1 Preprocessing

The preprocessor utilizes the standard Mel Frequency Cepstral Coefficient (MFCC) analysis [17] encountered in most state-of-the-art LVCSR systems. 
The analysis is performed on 25 ms Hamming-windowed speech frames, and subsequent speech frames are shifted over 10 ms with respect to each other. Every 10 ms a 39-dimensional feature vector is generated. It consists of 13 static parameters, namely the log-energy and the first 12 MFCC coefficients, their first order derivatives (the velocity or Δ parameters), and their second order derivatives (the acceleration or ΔΔ parameters).

In HMM systems, the training is insensitive to a linear rescaling of the individual features. In RC systems however, the input and recurrent weights are not trained but drawn from predefined statistical distributions. Consequently, by rescaling the features, the impact of the inputs on the activations of the reservoir neurons is changed as well, which makes it compulsory to employ an appropriate input scaling [8].

To establish a proper input scaling the acoustic feature vector is split into six sub-vectors according to the dimensions (energy, cepstrum) and (static, velocity, acceleration). Then, each feature ai (i = 1, ..., 39) is normalized to zi = αs (ai − āi), with āi being the mean of ai and s (s = 1, ..., 6) referring to the sub-vector (group) the feature belongs to. The aim of αs is to ensure that the norm of each sub-vector is one. If the zi were supplied to the reservoir, each sub-vector would on average have the same impact on the reservoir neuron activations. Therefore, in a second stage, the zi are rescaled to ui = βs zi, with βs representing the relative importance of sub-vector s in the reservoir neuron activations. 
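In code, this two-stage rescaling might look as follows. The sketch makes two assumptions that go beyond the text: a particular ordering of the 39 features into the six groups, and the estimation of αs from sample statistics of a toy feature matrix (in the paper the αs are fixed constants derived from a corpus-wide analysis); the βs values are those listed in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature matrix (frames x 39); assumed layout: [logE, c1..c12] repeated
# for the static, velocity (delta) and acceleration (delta-delta) blocks.
A = rng.normal(size=(1000, 39))
groups = [[0], [13], [26],             # log(E), delta log(E), delta-delta log(E)
          list(range(1, 13)),          # static cepstra c1..c12
          list(range(14, 26)),         # delta cepstra
          list(range(27, 39))]         # delta-delta cepstra
beta = [1.75, 1.00, 1.25, 1.25, 0.50, 0.25]  # scale factors beta_s (Table 1)

U = np.empty_like(A)
for s, idx in enumerate(groups):
    Z = A[:, idx] - A[:, idx].mean(axis=0)          # z_i = a_i minus its mean
    alpha = 1.0 / np.linalg.norm(Z, axis=1).mean()  # unit average sub-vector norm
    U[:, idx] = beta[s] * alpha * Z                 # u_i = beta_s z_i
```

After this rescaling the average norm of sub-vector s equals βs, so the six groups enter the reservoir with the intended relative weights.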
The normalization constants αs follow directly from a statistical analysis of the acoustic feature vectors. The factors βs are free parameters that were selected such that the phoneme classification error of a single reservoir system of 1000 neurons is minimized on the validation set. The obtained factors (see Table 1) confirm that the static features are more important than the velocity and the acceleration features.

Table 1: Different types of acoustic information in the input features and their optimal scale factors.

                      Energy features                    Cepstral features
group name        log(E)  Δlog(E)  ΔΔlog(E)      c1..12  Δc1..12  ΔΔc1..12
norm factor α      0.27    4.97     1.77          0.10    0.61     1.75
scale factor β     1.75    1.00     1.25          1.25    0.50     0.25

The proposed rescaling has the following advantages: it preserves the relative importance of the individual features within a sub-vector, it is fully defined by six scaling parameters αs βs, it takes only a minimal computational effort, and it is actually supposed to work well for any speech corpus.

4.2 Sequence decoding

The decoder in our present system performs a Viterbi search for the most likely phonemic sequence given the acoustic inputs and a bigram phoneme language model. The search is driven by a simple model for the conditional likelihood p(y|m) that the reservoir output vector y is observed during the acoustical realization of phoneme m. The model is based on the cosine similarity between y + 1 and a template vector tm = [0, .., 0, 1, 0, .., 0], with its nonzero element appearing at position m. Since the template vector is a unit vector, we compute p(y|m) as

p(y|m) = ( max[ 0, <y + 1, tm> / √<y + 1, y + 1> ] )^κ ,   (5)

with <x, y> denoting the dot product of vectors x and y. 
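Eq. (5) translates directly into code; the κ value below is only an illustrative choice.

```python
import numpy as np

def phoneme_likelihood(y, m, kappa=0.7):
    """Eq. (5): clipped, exponentiated cosine similarity between the offset
    output vector y + 1 and the one-hot template for phoneme m."""
    v = y + 1.0                    # the offset of Eq. (5)
    cos = v[m] / np.sqrt(v @ v)    # <v, t_m> / ||v||, since ||t_m|| = 1
    return max(0.0, cos) ** kappa  # clip at 0, then apply the exponent kappa

# One clearly winning output among 41 weak competitors.
y = np.full(41, -0.9)
y[7] = 0.8
p = phoneme_likelihood(y, 7)
```

Inside the Viterbi search these frame likelihoods are combined with the bigram phoneme probabilities.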
Due to the offset, the components of y + 1 are between 0 and 1 most of the time. The maximum operator prevents the likelihoods from occasionally becoming negative. The exponent κ is a free parameter that will be tuned experimentally. It controls the relative importance of the acoustic model and the bigram phoneme language model.

4.3 Reservoir optimization

The training of the reservoir output nodes is based on Equations (3) and (4), and the desired phoneme labels emerge from a time-synchronized phonemic transcription. The latter was derived from the available acoustic-phonetic segmentation of TIMIT. For all experiments reported in this paper, we have used the modular RC toolkit OGER1 developed at Ghent University.

The recurrent weights of the reservoir are not trained but randomly drawn from statistical distributions. The input weights emerge from a uniform distribution between −U and +U, the recurrent weights from a zero-mean Gaussian distribution with a variance V. The value of U controls the relative importance of the inputs in the activation of the reservoir neurons and is often called the input scale factor (ISF). The variance V directly determines the spectral radius (SR), defined as the largest absolute eigenvalue of the recurrent weight matrix. The SR describes the dynamical excitability of the reservoir [6, 8]. The SR and the ISF must be jointly optimized. To do so, we used 1000-neuron reservoirs, supplied with inputs that were normalized according to the procedure reviewed in the previous section. We found that SR = 0.4 and ISF = 0.4 yield the best performance, but for SR ∈ (0.3...0.8) and for ISF ∈ (0.2...1.0), the performance is quite stable.

Another parameter that must be optimized is the leak rate, denoted as λ. It determines the integration time of the neurons. 
If the nonlinear function is ignored and the time between frames is Tf, the reservoir neurons represent a first-order leaky integrator with a time constant τ that is related to λ by λ = 1 − e^(−Tf/τ). As stated before, the integration time should be long enough to capture the relevant co-articulation effects and short enough to constrain the information blurring over subsequent phonemes. This is confirmed by Figure 3, which shows how the phoneme CER of a single reservoir system changes as a function of the integrator time constant. The optimal value is 40 ms, and completely in line with psychophysical data concerning the post- and pre-masking properties of the human auditory system. In [18] for instance, it is shown that these properties can be explained by means of a second order low-pass filter with real poles corresponding to time constants of 8 and 40 ms respectively (it is the largest constant that determines the integration time here).

Figure 3: The phoneme Classification Error Rate (CER) as a function of the integration time (in ms).

It has been reported [19] that one can easily reduce the number of recurrent connections in an RC network without much affecting its performance. We have found that limiting the number of connections to 50 per neuron does not harm the performance while it dramatically reduces the required computational resources (memory and computation time).

1 http://reservoir-computing.org/organic/engine

5 Experiments

Since our ultimate goal is to perform LVCSR, and since LVCSR systems work with a dictionary of phonemic transcriptions, we have worked with phonemes rather than with phones. As in [20] we consider the 41 phoneme symbols one encounters in a typical phonetic dictionary like COMLEX [21]. 
The 41 symbols are very similar to the 39 symbols of the reduced phone set proposed by [15], but with one major difference, namely that a phoneme string does not contain any silences referring to closures of plosive sounds (e.g. the closure /kcl/ of phoneme /k/). By ignoring confusions between /sh/ and /zh/ and between /ao/ and /aa/ we finally measure phoneme error rates for 39 classes, in order to make them more compliant with the phone error rates for 39 classes reported in other papers. Nevertheless, we will see later that phoneme recognition is harder to accomplish than phone recognition. This is because the closures are easy to recognize and contribute to a low phone error rate. In phoneme recognition there are no closures anymore.

In what follows, all parameter tuning is performed on the TIMIT training set (divided into independent training and development sets), and all error rates are measured on the main test set. The bigram phoneme language model used for the sequence decoding step is created from the phonemic transcriptions of the training utterances.

5.1 Single reservoir systems

In a first experiment we assess the performance of a single reservoir system as a function of the reservoir size, defined as the number of neurons in the reservoir. The phoneme targets during training are derived from the manual acoustic-phonetic segmentation, as explained in Section 4.3. We increase the number of neurons from 500 to 20000. The corresponding number of trainable parameters then changes from 20K to 800K. The latter figure corresponds to the number of trainable parameters in an HMM system comprising 1200 independent Gaussian mixture distributions of 8 mixtures each.

Figure 4 shows that the phoneme CER on the training set drops by about 4% every time the reservoir size is doubled. 
The phoneme CER on the test set shows a similar trend, but the slope decreases from 4% at low reservoir sizes to 2% at 20000 neurons (nodes). At that point the CER on the test set is 30.6% and the corresponding RER (not shown) is 31.4%. The difference between the test and the training error is about 8%.

Figure 4: The Classification Error Rate (CER) at the phoneme level for the training and test set as a function of the reservoir size.

Although the figures show that an even larger reservoir will perform better, we stopped at 20000 nodes because the storage and the inversion of the large matrix XᵀX are getting problematic. Before starting to investigate even larger reservoirs, we first want to verify our hypothesis that adding a second (equally large) layer can lead to a better performance.

5.2 Multilayer reservoir systems

Usually, a single reservoir system produces a number of competing outputs at all time steps, and this hampers the identification of the correct phoneme sequence. The left panel of Figure 5 shows the outputs of a reservoir of 8000 nodes in a time interval of 350 ms. Our hypothesis was that the observed confusions are not arbitrary, and that a second reservoir operating on the outputs of the first reservoir system may be able to discover regularities in the error patterns. And indeed, the outputs of this second reservoir happen to exhibit a larger margin between the winner and the competition, as illustrated in the right panel of Figure 5.

Figure 5: The outputs of the first (left) and the second (right) layer of a two-layer system composed of two 8000 node reservoirs. The shown interval is 350 ms long.

In Figure 6, we have plotted the phoneme CER and RER as a function of the number of reservoirs (layers) and the size of these reservoirs. We have thus far only tested systems with equally large reservoirs at every layer. 
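The layer stacking evaluated in this section can be sketched as follows. The sketch reuses the single-layer recipe of Section 3 with toy sizes and random data, so the absolute numbers are meaningless; only the wiring of the two layers is the point: layer 2 reads the 41 phoneme outputs of layer 1 instead of the acoustic features, and is trained on the same targets.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_layer(n_in, n_res=300, leak=0.4, sr=0.4, isf=0.4):
    """One fixed reservoir layer; hyper-parameter values as in Section 4.3."""
    W_in = rng.uniform(-isf, isf, (n_res, n_in))
    W = rng.normal(size=(n_res, n_res))
    W *= sr / np.max(np.abs(np.linalg.eigvals(W)))
    def run(U):
        x = np.zeros(n_res)
        X = np.empty((len(U), n_res + 1))
        for k, u in enumerate(U):
            x = (1 - leak) * x + leak / (1 + np.exp(-(W @ x + W_in @ u)))
            X[k] = np.concatenate([x, [1.0]])
        return X
    return run

def train_readout(X, D, eps=1e-8):
    """Ridge-regression readout, Eq. (4)."""
    return np.linalg.solve(X.T @ X + eps * np.eye(X.shape[1]), X.T @ D)

# Toy features and one-hot targets for 41 phoneme classes.
U = rng.normal(size=(300, 39))
D = np.eye(41)[rng.integers(0, 41, 300)]

# Layer 1 reads the acoustic features ...
run1 = make_layer(39)
X1 = run1(U)
Y1 = X1 @ train_readout(X1, D)

# ... layer 2 reads layer 1's 41 outputs and is trained on the same targets,
# so it can learn to correct systematic confusions of the first layer.
run2 = make_layer(41)
X2 = run2(Y1)
Y2 = X2 @ train_readout(X2, D)
```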
For the exponent κ, we have just tried κ = 0.5, 0.7 and 1, and we have selected the value yielding the best balance between insertions and deletions.

Figure 6: The phoneme CERs and RERs for different combinations of number of nodes and layers.

For all reservoir sizes, the second layer induces a significant improvement of the CER by 3-4% absolute. The corresponding improvements of the recognition error rates are a little bit smaller but still significant. The best RER obtained with a two-layer system comprising reservoirs of 20000 nodes is 29.1%. Both plots demonstrate that a third layer does not cause any additional gain when the reservoir size is large enough. However, this might also be caused by the fact that we did not systematically optimize the parameters (SR, leak rate, regularization parameter, etc.) for each large system configuration we investigated. We just chose sensible values which were retrieved from tests with smaller systems.

5.3 Comparison with the state-of-the-art

In Table 2 we have listed some published results obtained on TIMIT with state-of-the-art HMM systems and other recently proposed research systems. We have also included the results of our own experiments conducted with SPRAAK2 [22], a recently launched HMM-based speech recognition toolkit. In order to provide an easier comparison, we also built a phone recognition system based on the same design parameters that were optimized for phoneme recognition. All phone RERs are calculated on the core test set, while the phoneme RERs were measured on the main test set. We do this because most figures in speech community papers apply to these experimental settings. Our final results were obtained with systems that were trained on the full training data (including the development set).

2 http://www.spraak.org

Before discussing our figures in detail we emphasize that the two figures for SPRAAK confirm our earlier statement that phoneme recognition is harder than phone recognition.

Table 2: Phoneme and Phone Recognition Error Rates (in %) obtained with state-of-the-art systems.

System description                  Phone RER    Phoneme RER
used test set                       core test    main test
Reservoir Computing (this paper)      26.8         29.1
CD-HMM (SPRAAK Toolkit)               25.6         28.1
CD-HMM [20]                             -          28.7
Recurrent Neural Networks [1]         26.1           -
LSTM+CTC [5]                         (24.6)          -
Bayesian Triphone HMM [23]            24.4           -
Deep Belief Networks [4]              23.0           -
Hierarchical HMM + MLPs [20]         (23.4)          -

Given the fact that SPRAAK seems to achieve state-of-the-art performance, it is fair to conclude from the figures in Table 2 that our present system is already competitive with other modern HMM systems. It is also fair to say that better systems do exist, like the Deep Belief Network system [4] and the hierarchical HMM system with multiple Multi-Layer Perceptrons (MLPs) on top of an HMM system [20]. Note however that the latter system also employs complex temporal patterns (TRAPs) as input features. These patterns are much more powerful than the simple MFCC vectors used in all other systems we cite. Furthermore, the LSTM+CTC [5] results too must be considered with some care since they were obtained with a bidirectional system. Such a system is impractical in many applications since it has to wait until the end of a speech utterance to start the recognition. We therefore put the results of the latter two systems between brackets in Table 2.

To conclude this discussion, we also want to mention some training and execution times. 
The training of our two-layer 20K reservoir systems takes about 100 hours on a single-core 3.0 GHz PC, while recognition takes about two seconds of decoding per second of speech.

6 Conclusion and future work

In this paper we showed for the first time that good phoneme recognition on TIMIT can be achieved with a system based on Reservoir Computing. We demonstrated that in order to achieve this, we need large reservoirs (at least 20000 nodes) which are configured in a hierarchical way. By stacking two reservoir layers, we were able to achieve error rates that are competitive with what is attainable using state-of-the-art HMM technology. Our results support the idea that reservoirs can exploit long-term dynamic properties of the articulatory system in continuous speech recognition.

It is acknowledged though that other techniques such as Deep Belief Networks still outperform our present system, but the plots and the discussions presented in the course of this paper clearly show a significant margin for further improvement of our system in the near future.

To achieve this improvement we will investigate even larger reservoirs with 50000 and more nodes and we will more thoroughly optimize the parameters of the different reservoirs. Furthermore, we will explore the use of sparsely connected outputs and multi-frame inputs in combination with PCA-based dimensionality reduction. Finally, we will develop an embedded training scheme that permits the training of reservoirs on much larger speech corpora for which only orthographic transcriptions are distributed together with the speech data.

Acknowledgement

The work presented in this paper is funded by the EC FP7 project ORGANIC (FP7-231267).

References

[1] A. Robinson. An application of recurrent neural nets to phone probability estimation. IEEE Trans. on Neural Networks, 5:298-305, 1994.

[2] H. Bourlard and N. Morgan. 
Continuous speech recognition by connectionist statistical methods. IEEE Trans. on Neural Networks, 4:893-909, 1993.

[3] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.

[4] A. Mohamed, G. Dahl, and G. Hinton. Deep belief networks for phone recognition. In NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.

[5] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602-610, 2005.

[6] H. Jaeger. Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the echo state network approach (48 pp). Technical report, German National Research Center for Information Technology, 2002.

[7] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531-2560, 2002.

[8] D. Verstraeten, B. Schrauwen, M. D'Haene, and D. Stroobandt. An experimental unification of reservoir computing methods. Neural Networks, 20:391-403, 2007.

[9] E. Antonelo, B. Schrauwen, and J. Van Campenhout. Generative modeling of autonomous robots and their environments using reservoir computing. Neural Processing Letters, 26(3):233-249, 2007.

[10] G. Holzmann and H. Hauser. Echo state networks with filter neurons and a delay & sum readout. Neural Networks, 23:244-256, 2010.

[11] D. Verstraeten, B. Schrauwen, and D. Stroobandt. Isolated word recognition using a liquid state machine. In Proceedings of the 13th European Symposium on Artificial Neural Networks (ESANN), pages 435-440, 2005.

[12] M. Skowronski and J. Harris. Automatic speech recognition using a predictive echo state network classifier. Neural Networks, 20(3):414-423, 2007.

[13] B. 
Schrauwen. A hierarchy of recurrent networks for speech recognition. In NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.

[14] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren. The DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. Technical report, National Institute of Standards and Technology, 1993.

[15] K.F. Lee and H.-W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. on Acoustics, Speech and Signal Processing, 37:1641-1648, 1989.

[16] H. Jaeger, M. Lukosevicius, D. Popovici, and U. Siewert. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20:335-352, 2007.

[17] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics, Speech and Signal Processing, 28:357-366, 1980.

[18] L. Van Immerseel and J.P. Martens. Pitch and voiced/unvoiced determination with an auditory model. Journal of the Acoustical Society of America, 91(6):3511-3526, June 1992.

[19] B. Schrauwen, L. Buesing, and R. Legenstein. Computational power and the order-chaos phase transition in reservoir computing. In Proc. Advances in Neural Information Processing Systems (NIPS), volume 21, pages 1425-1432, 2008.

[20] P. Schwarz, P. Matejka, and J. Cernocky. Hierarchical structures of neural networks for phoneme recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing, pages 325-328, 2006.

[21] Linguistic Data Consortium. COMLEX English pronunciation lexicon, 2009.

[22] K. Demuynck, J. Roelens, D. Van Compernolle, and P. Wambacq. SPRAAK: An open source speech recognition and automatic annotation kit. In Proc. Interspeech 2008, page 495, 2008.

[23] J. Ming and F.J. Smith. 
Improved phone recognition using Bayesian triphone models. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 409-412, 1998.
", "award": [], "sourceid": 760, "authors": [{"given_name": "Fabian", "family_name": "Triefenbach", "institution": null}, {"given_name": "Azarakhsh", "family_name": "Jalalvand", "institution": null}, {"given_name": "Benjamin", "family_name": "Schrauwen", "institution": null}, {"given_name": "Jean-pierre", "family_name": "Martens", "institution": null}]}