{"title": "An Online Sequence-to-Sequence Model Using Partial Conditioning", "book": "Advances in Neural Information Processing Systems", "page_first": 5067, "page_last": 5075, "abstract": "Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.", "full_text": "An Online Sequence-to-Sequence Model Using Partial\n\nConditioning\n\nNavdeep Jaitly\nGoogle Brain\n\nDavid Sussillo\nGoogle Brain\n\nQuoc V. 
Le\nGoogle Brain\n\nndjaitly@google.com\n\nsussillo@google.com\n\nqvl@google.com\n\nOriol Vinyals\n\nGoogle DeepMind\n\nIlya Sutskever\n\nOpen AI\u2217\n\nSamy Bengio\nGoogle Brain\n\nvinyals@google.com\n\nilyasu@openai.com\n\nbengio@google.com\n\nAbstract\n\nSequence-to-sequence models have achieved impressive results on various tasks.\nHowever, they are unsuitable for tasks that require incremental predictions to be\nmade as more data arrives or tasks that have long input sequences and output\nsequences. This is because they generate an output sequence conditioned on an\nentire input sequence. In this paper, we present a Neural Transducer that can make\nincremental predictions as more input arrives, without redoing the entire computa-\ntion. Unlike sequence-to-sequence models, the Neural Transducer computes the\nnext-step distribution conditioned on the partially observed input sequence and\nthe partially generated sequence. At each time step, the transducer can decide to\nemit zero to many output symbols. The data can be processed using an encoder\nand presented as input to the transducer. The discrete decision to emit a symbol\nat every time step makes it dif\ufb01cult to learn with conventional backpropagation.\nIt is however possible to train the transducer by using a dynamic programming\nalgorithm to generate target discrete decisions. Our experiments show that the\nNeural Transducer works well in settings where it is required to produce output\npredictions as data come in. We also \ufb01nd that the Neural Transducer performs well\nfor long sequences even when attention mechanisms are not used.\n\n1\n\nIntroduction\n\nThe recently introduced sequence-to-sequence model has shown success in many tasks that map\nsequences to sequences, e.g., translation, speech recognition, image captioning and dialogue model-\ning [17, 4, 1, 6, 3, 20, 18, 15, 19]. 
However, this method is unsuitable for tasks where it is important\nto produce outputs as the input sequence arrives. Speech recognition is an example of such an online\ntask \u2013 users prefer seeing an ongoing transcription of speech over receiving it at the \u201cend\u201d of an\nutterance. Similarly, instant translation systems would be much more effective if audio was translated\nonline, rather than after entire utterances. This limitation of the sequence-to-sequence model is due\nto the fact that output predictions are conditioned on the entire input sequence.\nIn this paper, we present a Neural Transducer, a more general class of sequence-to-sequence learning\nmodels. Neural Transducer can produce chunks of outputs (possibly of zero length) as blocks of inputs\narrive - thus satisfying the condition of being \u201conline\u201d (see Figure 1(b) for an overview). The model\ngenerates outputs for each block by using a transducer RNN that implements a sequence-to-sequence\nmodel. The inputs to the transducer RNN come from two sources: the encoder RNN and its own\nrecurrent state. In other words, the transducer RNN generates local extensions to the output sequence,\nconditioned on the features computed for the block by an encoder RNN and the recurrent state of the\ntransducer RNN at the last step of the previous block.\n\n\u2217Work done at Google Brain\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: High-level comparison of our method with sequence-to-sequence models. (a) Sequence-to-sequence\nmodel (seq2seq) [17]. (b) The Neural Transducer (this paper) which emits output symbols as data\ncome in (per block) and transfers the hidden state across blocks.\n\nDuring training, alignments of output symbols to the input sequence are unavailable. 
One way\nof overcoming this limitation is to treat the alignment as a latent variable and to marginalize over\nall possible values of this alignment variable. Another approach is to generate alignments from a\ndifferent algorithm, and train our model to maximize the probability of these alignments. Connectionist\nTemporal Classification (CTC) [7] follows the former strategy using a dynamic programming\nalgorithm that allows for easy marginalization over the unary potentials produced by a recurrent\nneural network (RNN). However, this is not possible in our model, since the neural network makes\nnext-step predictions that are conditioned not just on the input data, but also on the alignment and the\ntargets produced until the current step. In this paper, we show how a dynamic programming algorithm\ncan be used to compute \"approximate\" best alignments from this model. We show that training our\nmodel on these alignments leads to strong results.\nOn the TIMIT phoneme recognition task, a Neural Transducer (with a 3-layer unidirectional LSTM\nencoder and a 3-layer unidirectional LSTM transducer) achieves a phoneme error rate (PER) of 20.8%,\nwhich is close to state-of-the-art for unidirectional models. We also show that if good\nalignments are made available (e.g., from a GMM-HMM system), the model can achieve 19.8% PER.\n\n2 Related Work\n\nIn the past few years, many proposals have been made to add more power or flexibility to neural\nnetworks, especially via the concept of augmented memory [10, 16, 21] or augmented arithmetic\nunits [13, 14]. Our work is not concerned with memory or arithmetic components but it allows more\nflexibility in the model so that it can dynamically produce outputs as data come in.\nOur work is related to traditional structured prediction methods, commonplace in speech recognition.\nThe work bears similarity to HMM-DNN [11] and CTC [7] systems. 
An important aspect of these\napproaches is that the model makes predictions at every input time step. A weakness of these models\nis that they typically assume conditional independence between the predictions at each output step.\nSequence-to-sequence models represent a breakthrough where no such assumptions are made \u2013 the\noutput sequence is generated by next step prediction, conditioning on the entire input sequence and\nthe partial output sequence generated so far [5, 6, 3]. Figure 1(a) shows the high-level picture of this\narchitecture. However, as can be seen from the figure, these models have a limitation in that they have\nto wait until the end of the speech utterance to start decoding. This property makes them unattractive\nfor real time speech recognition and online translation. Bahdanau et al. [2] attempt to rectify this for\nspeech recognition by using a moving windowed attention, but they do not provide a mechanism to\naddress the situation that arises when no output can be produced from the windowed segment of data.\nFigure 1(b) shows the difference between our method and sequence-to-sequence models.\nA strongly related model is the sequence transducer [8, 9]. This model augments the CTC model\nby combining the transcription model with a prediction model. The prediction model is akin to a\nlanguage model and operates only on the output tokens, as a next step prediction model. This gives\nthe model more expressiveness compared to CTC which makes independent predictions at every time\nstep. However, unlike the model presented in this paper, the two models in the sequence transducer\noperate independently \u2013 the model does not provide a mechanism by which the prediction network\nfeatures at one time step would change the transcription network features in the future, and vice versa.\nOur model, in effect, generalizes both this model and the sequence-to-sequence model.\nOur formulation requires inferring alignments during training. However, our results indicate that this\ncan be done relatively fast, and with little loss of accuracy, even on a small dataset where no effort\nwas made at regularization. Further, if alignments are given, as is easily done offline for various tasks,\nthe model is able to train relatively fast, without this inference step.\n\nFigure 2: An overview of the Neural Transducer architecture for speech. The input acoustic sequence is\nprocessed by the encoder to produce hidden state vectors hi at each time step i, i = 1\u00b7\u00b7\u00b7 L. The transducer\nreceives a block of inputs at each step and produces up to M output tokens using the sequence-to-sequence model\nover this input. The transducer maintains its state across the blocks through the use of recurrent connections\nto the previous output time steps. 
The figure shows the transducer producing tokens for block b. The\nsubsequence emitted in this block is ym ym+1 ym+2.\n\n3 Methods\n\nIn this section we describe the model in more detail. Please refer to Figure 2 for an overview.\n\n3.1 Model\n\nLet x1\u00b7\u00b7\u00b7L be the input data that is L time steps long, where xi represents the features at input time\nstep i. Let W be the block size, i.e., the periodicity with which the transducer emits output tokens,\nand N = \u2308L/W\u2309 be the number of blocks.\nLet \u02dcy1\u00b7\u00b7\u00b7S be the target sequence, corresponding to the input sequence. Further, let the transducer\nproduce a sequence of k outputs, \u02dcyi\u00b7\u00b7\u00b7(i+k), where 0 \u2264 k < M, for any input block. Each such\nsequence is padded with the <e> symbol, which is added to the vocabulary. It signifies that the\ntransducer may proceed and consume data from the next block. 
When no symbols are produced for a\nblock, this symbol is akin to the blank symbol of CTC.\nThe sequence \u02dcy1\u00b7\u00b7\u00b7S can be transduced from the input through various alignments. Let Y be the set\nof all alignments of the output sequence \u02dcy1\u00b7\u00b7\u00b7S to the input blocks. Let y1\u00b7\u00b7\u00b7(S+B) \u2208 Y be any\nsuch alignment. Note that the length of y is B more than the length of \u02dcy, since there are B end\nof block symbols, <e>, in y. However, the number of sequences y matching to \u02dcy is much larger,\ncorresponding to all possible alignments of \u02dcy to the blocks. The block that element yi is aligned\nto can be inferred simply by counting the number of <e> symbols that came before index i. Let\neb, b \u2208 1\u00b7\u00b7\u00b7 N be the index of the last token in y emitted in the bth block. Note that e0 = 0 and\neN = (S + B). Thus yeb = <e> for each block b.\n\nIn this section, we show how to compute p(y1\u00b7\u00b7\u00b7(S+B)|x1\u00b7\u00b7\u00b7L). 
Later, in section 3.5 we show how\nto compute, and maximize p(\u02dcy1\u00b7\u00b7\u00b7S|x1\u00b7\u00b7\u00b7L).\nWe first compute the probability of seeing output sequence y1\u00b7\u00b7\u00b7eb by the\nend of block b as follows:\n\np(y_{1 \\cdots e_b} \\mid x_{1 \\cdots bW}) = p(y_{1 \\cdots e_1} \\mid x_{1 \\cdots W}) \\prod_{b'=2}^{b} p(y_{(e_{b'-1}+1) \\cdots e_{b'}} \\mid x_{1 \\cdots b'W}, y_{1 \\cdots e_{b'-1}}) \\qquad (1)\n\nEach of the terms in this equation is itself computed by the chain rule decomposition, i.e., for any\nblock b,\n\np(y_{(e_{b-1}+1) \\cdots e_b} \\mid x_{1 \\cdots bW}, y_{1 \\cdots e_{b-1}}) = \\prod_{m=e_{b-1}+1}^{e_b} p(y_m \\mid x_{1 \\cdots bW}, y_{1 \\cdots (m-1)}) \\qquad (2)\n\nThe next step probability terms, p(ym|x1\u00b7\u00b7\u00b7bW, y1\u00b7\u00b7\u00b7(m\u22121)), in Equation 2 are computed by the\ntransducer using the encoding of the input x1\u00b7\u00b7\u00b7bW computed by the encoder, and the label prefix\ny1\u00b7\u00b7\u00b7(m\u22121) that was input into the transducer, at previous emission steps. We describe this in more\ndetail in the next subsection.\n\n3.2 Next Step Prediction\n\nWe again refer the reader to Figure 2 for this discussion. The example shows a transducer with two\nhidden layers, with units sm and h'm at output step m. In the figure, the next step prediction is shown\nfor block b. For this block, the index of the first output symbol is m = eb\u22121 + 1, and the index of the\nlast output symbol is m + 2 (i.e. 
eb = m + 2).\nThe transducer computes the next step prediction, using parameters, \u03b8, of the neural network through\nthe following sequence of steps:\n\ns_m = f_{RNN}(s_{m-1}, [c_{m-1}; y_{m-1}]; \\theta) \\qquad (3)\nc_m = f_{context}(s_m, h_{((b-1)W+1) \\cdots bW}; \\theta) \\qquad (4)\nh'_m = f_{RNN}(h'_{m-1}, [c_m; s_m]; \\theta) \\qquad (5)\np(y_m \\mid x_{1 \\cdots bW}, y_{1 \\cdots (m-1)}) = f_{softmax}(y_m; h'_m, \\theta) \\qquad (6)\n\nwhere fRNN(am\u22121, bm; \u03b8) is the recurrent neural network function (such as an LSTM or\na sigmoid or tanh RNN) that computes the state vector am for a layer at a step using the\nrecurrent state vector am\u22121 at the last time step, and input bm at the current time step;2\nfsoftmax(\u00b7; am; \u03b8) is the softmax distribution computed by a softmax layer, with input vector am;\nand fcontext(sm, h((b\u22121)W+1)\u00b7\u00b7\u00b7bW; \u03b8) is the context function, that computes the input to the transducer\nat output step m from the state sm at the current time step, and the features h((b\u22121)W+1)\u00b7\u00b7\u00b7bW\nof the encoder for the current input block, b. We experimented with different ways of computing\nthe context vector \u2013 with and without an attention mechanism. These are described subsequently in\nsection 3.3.\nNote that since the encoder is an RNN, h(b\u22121)W\u00b7\u00b7\u00b7bW is actually a function of the entire input, x1\u00b7\u00b7\u00b7bW\nso far. Correspondingly, sm is a function of the labels emitted so far, and the entire input seen so far.3\nSimilarly, h'm is a function of the labels emitted so far and the entire input seen so far.\n\n3.3 Computing fcontext\n\nWe first describe how the context vector is computed by an attention model similar to earlier\nwork [5, 1, 3]. 
We call this model the MLP-attention model.\n\n2Note that for LSTM, we would have to additionally factor in cell states from the previous states - we have\nignored this in the notation for purpose of clarity. The exact details are easily worked out.\n\n3For the first output step of a block it includes only the input seen until the end of the last block.\n\nIn this model the context vector cm is computed in two steps - first a normalized attention vector\n\u03b1m is computed from the state sm of the transducer and next the hidden states h(b\u22121)W+1\u00b7\u00b7\u00b7bW\nof the encoder for the current block are linearly combined using \u03b1 and used as the context vector.\nTo compute \u03b1m, a multi-layer perceptron computes a scalar value, e^m_j, for each pair of transducer\nstate sm and encoder h(b\u22121)W+j. The attention vector is computed from the scalar values, e^m_j,\nj = 1\u00b7\u00b7\u00b7 W. Formally:\n\ne^m_j = f_{attention}(s_m, h_{(b-1)W+j}; \\theta) \\qquad (7)\n\\alpha^m = \\mathrm{softmax}([e^m_1; e^m_2; \\cdots e^m_W]) \\qquad (8)\nc_m = \\sum_{j=1}^{W} \\alpha^m_j h_{(b-1)W+j} \\qquad (9)\n\nWe also experimented with using a simpler model for fattention that computed e^m_j = s_m^T h_{(b-1)W+j}.\nWe refer to this model as DOT-attention model.\nBoth of these attention models have two shortcomings. Firstly there is no explicit mechanism\nthat requires the attention model to move its focus forward, from one output time step to the next.\nSecondly, the energies computed as inputs to the softmax function, for different input frames j are\nindependent of each other at each time step, and thus cannot modulate (e.g., enhance or suppress)\neach other, other than through the softmax function. Chorowski et al. 
[6] ameliorate the second\nproblem by using a convolutional operator that affects the attention at one time step using the attention\nat the last time step.\nWe attempt to address these two shortcomings using a new attention mechanism. In this model,\ninstead of feeding [e^m_1; e^m_2; \u00b7\u00b7\u00b7 e^m_W] into a softmax, we feed them into a recurrent neural network with\none hidden layer that outputs the softmax attention vector at each time step. Thus the model should\nbe able to modulate the attention vector both within a time step and across time steps. This attention\nmodel is thus more general than the convolutional operator of Chorowski et al. [6], but it can\nonly be applied to the case where the context window size is constant. We refer to this model as\nLSTM-attention.\n\n3.4 Addressing End of Blocks\n\nSince the model only produces a small sequence of output tokens in each block, we have to address the\nmechanism for shifting the transducer from one block to the next. We experimented with three distinct\nways of doing this. In the first approach, we introduced no explicit mechanism for end-of-blocks,\nhoping that the transducer neural network would implicitly learn a model from the training data. In\nthe second approach we added end-of-block symbols, <e>, to the label sequence to demarcate the end\nof blocks, and we added this symbol to the target dictionary. Thus the softmax function in Equation 6\nimplicitly learns to either emit a token, or to move the transducer forward to the next block. In the\nthird approach, we model moving the transducer forward, using a separate logistic function of the\nattention vector. 
The target of the logistic function is 0 or 1 depending on whether the current step is\nthe last step in the block or not.\n\n3.5 Training\n\nIn this section we show how the Neural Transducer model can be trained.\nThe probability of the output sequence \u02dcy1\u00b7\u00b7\u00b7S, given x1\u00b7\u00b7\u00b7L is as follows4:\n\np(\\tilde{y}_{1 \\cdots S} \\mid x_{1 \\cdots L}) = \\sum_{y \\in Y} p(y_{1 \\cdots (S+B)} \\mid x_{1 \\cdots L}) \\qquad (10)\n\nIn theory, we can train the model by maximizing the log of equation 10. The gradient for the log\nlikelihood can easily be expressed as follows:\n\n\\frac{d}{d\\theta} \\log p(\\tilde{y}_{1 \\cdots S} \\mid x_{1 \\cdots L}) = \\sum_{y \\in Y} p(y_{1 \\cdots (S+B)} \\mid x_{1 \\cdots L}, \\tilde{y}_{1 \\cdots S}) \\frac{d}{d\\theta} \\log p(y_{1 \\cdots (S+B)} \\mid x_{1 \\cdots L}) \\qquad (11)\n\n4Note that this equation implicitly incorporates the prior for alignments within the equation\n\nEach of the latter terms in the sum on the right hand side can be computed, by backpropagation, using\ny as the target of the model. However, the marginalization is intractable because of the sum over a\ncombinatorial number of alignments. Alternatively, the gradient can be approximated by sampling\nfrom the posterior distribution (i.e. p(y1\u00b7\u00b7\u00b7(S+B)|x1\u00b7\u00b7\u00b7L, \u02dcy1\u00b7\u00b7\u00b7S)). However, we found this had very\nlarge noise in the learning and the gradients were often too biased, leading to models that rarely\nachieved decent accuracy.\nInstead, we attempted to maximize the probability in equation 10 by computing the sum over only one\nterm - corresponding to the y1\u00b7\u00b7\u00b7S with the highest posterior probability. 
Unfortunately, even doing this\nexactly is computationally infeasible because the number of possible alignments is combinatorially\nlarge and the problem of \ufb01nding the best alignment cannot be decomposed to easier subproblems. So\nwe use an algorithm that \ufb01nds the approximate best alignment with a dynamic programming-like\nalgorithm that we describe in the next paragraph.\nAt each block, b, for each output position j, this algorithm keeps track of the approximate best\nhypothesis h(j, b) that represents the best partial alignment of the input sequence \u02dcy1\u00b7\u00b7\u00b7j to the partial\ninput x1\u00b7\u00b7\u00b7bW . Each hypothesis, keeps track of the best alignment y1\u00b7\u00b7\u00b7(j+b) that it represents, and\nthe recurrent states of the decoder at the last time step, corresponding to this alignment. At block\nb + 1, all hypotheses h(j, b), j <= min (b (M \u2212 1) , S) are extended by at most M tokens using\ntheir recurrent states, to compute h(j, b + 1), h(j + 1, b + 1)\u00b7\u00b7\u00b7 h(j + M, b + 1)5. For each position\nj(cid:48), j(cid:48) <= min ((b + 1) (M \u2212 1) , S) the highest log probability hypothesis h(j(cid:48), b + 1) is kept6. The\nalignment from the best hypothesis h(S, B) at the last block is used for training.\nIn theory, we need to compute the alignment for each sequence when it is trained, using the model\nparameters at that time. In practice, we batch the alignment inference steps, using parallel tasks,\nand cache these alignments. Thus alignments are computed less frequently than the model updates -\ntypically every 100-300 sequences. 
This procedure has the flavor of experience replay from Deep\nReinforcement learning work [12].\n\n3.6 Inference\n\nFor inference, given the input acoustics x1\u00b7\u00b7\u00b7L, and the model parameters, \u03b8, we find the sequence of\nlabels y1\u00b7\u00b7\u00b7M that maximizes the probability of the labels, conditioned on the data, i.e.,\n\n\\tilde{y}_{1 \\cdots S} = \\arg\\max_{y_{1 \\cdots S'}, e_{1 \\cdots N}} \\sum_{b=1}^{N} \\log p(y_{(e_{b-1}+1) \\cdots e_b} \\mid x_{1 \\cdots bW}, y_{1 \\cdots e_{b-1}}) \\qquad (12)\n\nExact inference in this scheme is computationally expensive because the expression for log probability\ndoes not permit decomposition into smaller terms that can be independently computed. Instead, each\ncandidate, y1\u00b7\u00b7\u00b7S', would have to be tested independently, and the best sequence over an exponentially\nlarge number of sequences would have to be discovered. Hence, we use a beam search heuristic to\nfind the \u201cbest\u201d set of candidates. To do this, at each output step m, we keep a heap of alternative n\nbest prefixes, and extend each one by one symbol, trying out all the possible alternative extensions,\nkeeping only the best n extensions. Included in the beam search is the act of moving the attention to\nthe next input block. The beam search ends either when the sequence is longer than a pre-specified\nthreshold, or when the end of token symbol is produced at the last block.\n\n4 Experiments and Results\n\n4.1 Addition Toy Task\n\nWe experimented with the Neural Transducer on the toy task of adding two three-digit decimal\nnumbers. The second number is presented in the reverse order, and so is the target output. Thus the\nmodel can produce the first output as soon as the first digit of the second number is observed. 
The\nmodel is able to achieve 0% error on this task with a very small number of units (both encoder and\ntransducer are 1 layer unidirectional LSTM RNNs with 100 units).\n\n5Note the minutiae that each of these extensions ends with <e> symbol.\n6We also experimented with sampling from the extensions in proportion to the probabilities, but this did not\nalways improve results.\n\nAs can be seen below, the model learns to output the digits as soon as the required information is\navailable. Occasionally the model waits an extra step to output its target symbol. We show results\n(blue) for four different examples (red). A block window size of W=1 was used, with M=8.\n\n[Figure: four addition examples, with the input symbols (red) and the transducer outputs per block (blue).]\n\n4.2 TIMIT\n\nWe used TIMIT, a standard benchmark for speech recognition, for our larger experiments. Log Mel\nfilterbanks were computed every 10ms as inputs to the system. The targets were the 60 phones\ndefined for the TIMIT dataset (h# were relabelled as pau).\nWe used stochastic gradient descent with momentum with a batch size of one utterance per training\nstep. An initial learning rate of 0.05, and momentum of 0.9 was used. The learning rate was reduced\nby a factor of 0.5 every time the average log prob over the validation set decreased 7. The decrease\nwas applied for a maximum of 4 times. The models were trained for 50 epochs and the parameters\nfrom the epochs with the best dev set log prob were used for decoding.\nWe trained a Neural Transducer with three layer LSTM RNN coupled to a three LSTM layer\nunidirectional encoder RNN, and achieved a PER of 20.8% on the TIMIT test set. This model\nused the LSTM attention mechanism. 
Alignments were generated from a model that was updated\nafter every 300 steps of Momentum updates. Interestingly, the alignments generated by the model\nare very similar to the alignments produced by a Gaussian Mixture Model-Hidden Markov Model\n(GMM-HMM) system that we trained using the Kaldi toolkit \u2013 even though the model was trained\nentirely discriminatively. The small differences in alignment correspond to an occasional phoneme\nemitted slightly later by our model, compared to the GMM-HMM system.\nWe also trained models using alignments generated from the GMM-HMM model trained on Kaldi.\nThe frame level alignments from Kaldi were converted into block level alignments by assigning each\nphone in the sequence to the block it was last observed in. The same architecture model described\nabove achieved an accuracy of 19.8% with these alignments.\nFor further exploratory experiments, we used the GMM-HMM alignments as given to avoid computing\nthe best alignments. Table 1 shows a comparison of our method against a basic implementation of a\nsequence-to-sequence model that produces outputs for each block independent of the other blocks,\nand concatenates the produced sequences. Here, the sequence-to-sequence model produces the output\nconditioned on the state of the encoder at the end of the block. Both models used an encoder with\ntwo layers of 250 LSTM cells, without attention. The standard sequence-to-sequence model performs\nsigni\ufb01cantly worse than our model \u2013 the recurrent connections of the transducer across blocks are\nclearly helpful in improving the accuracy of the model.\n\nTable 1: Impact of maintaining recurrent state of transducer across blocks on the PER (median of 3 runs). 
This\ntable shows that maintaining the state of the transducer across blocks leads to much better results.\n\nW     BLOCK-RECURRENCE     PER\n15    No                   34.3\n15    Yes                  20.6\n\nFigure 3 shows the impact of block size on the accuracy of the different transducer variants that we\nused. See Section 3.3 for a description of the {DOT,MLP,LSTM}-attention models. All models used\na two LSTM layer encoder and a two LSTM layer transducer. The model is sensitive to the choice of\nthe block size, when no attention is used. However, it can be seen that with an appropriate choice of\nwindow size (W=8), the Neural Transducer without attention can match the accuracy of the attention\nbased Neural Transducers. Further exploration of this configuration should lead to improved results.\nWhen attention is used in the transducer, the precise value of the block size becomes less important.\nThe LSTM-based attention model seems to be more consistent compared to the other attention\nmechanisms we explored. Since this model performed best with W=25, we used this configuration\nfor subsequent experiments.\n\n7Note that TIMIT provides a validation set, called the dev set. We use these terms interchangeably.\n\nFigure 3: Impact of the number of frames (W) in a block and attention mechanism on PER. Each number is the\nmedian value from three experiments.\n\nTable 2 explores the impact of the number of layers in the transducer and the encoder on the PER.\nA three layer encoder coupled to a three layer transducer performs best on average. Four layer\ntransducers produced results with higher spread in accuracy \u2013 possibly because of the more difficult\noptimization involved. Thus, the best average PER we achieved (over 3 runs) was 19.8% on the\nTIMIT test set. 
These results could probably be improved with other regularization techniques, as\nreported by [6] but we did not pursue those avenues in this paper.\n\nTable 2: Impact of depth of encoder and transducer on PER.\n\n# of layers in encoder / transducer     1       2       3       4\n2                                       -       19.2    18.9    18.8\n3                                       -       18.5    18.2    19.4\n\nFor a comparison with previously published sequence-to-sequence models on this task, we used a\nthree layer bidirectional LSTM encoder with 250 LSTM cells in each direction and achieved a PER\nof 18.7%. By contrast, the best reported results using previous sequence-to-sequence models are\n17.6% [6]. However, this requires controlling overfitting carefully.\n\n5 Discussion\n\nOne of the important side-effects of our model using partial conditioning with a blocked transducer\nis that it naturally alleviates the problem of \u201closing attention\u201d suffered by sequence-to-sequence\nmodels. Because of this, sequence-to-sequence models perform worse on longer utterances [6, 3].\nThis problem is automatically tackled in our model because each new block automatically shifts the\nattention monotonically forward. Within a block, the model learns to move attention forward from\none step to the next, and the attention mechanism rarely suffers, because both the size of a block,\nand the number of output steps for a block are relatively small. As a result, error in attention in one\nblock, has minimal impact on the predictions at subsequent blocks. Finally, we note that increasing\nthe block size, W , so that it is as large as the input utterance makes the model similar to vanilla\nend-to-end models [5, 3].\n\n6 Conclusion\n\nWe have introduced a new model that uses partial conditioning on inputs to generate output sequences.\nThis allows the model to produce output as input arrives. This is useful for speech recognition\nsystems and will also be crucial for future generations of online speech translation systems. 
Further, it\ncan be useful for performing transduction over long sequences \u2013 something that is possibly difficult\nfor sequence-to-sequence models. We applied the model to a toy task of addition and to a phone\nrecognition task, and showed that it can produce results comparable to the state of the art from\nsequence-to-sequence models.\n\n[Figure 3 plot: window size (W) on the x-axis, Phone Error Rate (PER) on the y-axis; series: no-attention, DOT-ATTENTION, MLP-ATTENTION, LSTM-ATTENTION]\n\nReferences\n\n[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.\n\n[2] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. In http://arxiv.org/abs/1508.04395, 2015.\n\n[3] William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.\n\n[4] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwen, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 2014.\n\n[5] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. In Neural Information Processing Systems: Workshop Deep Learning and Representation Learning Workshop, 2014.\n\n[6] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-Based Models for Speech Recognition. In Neural Information Processing Systems, 2015.\n\n[7] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. 
In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645\u20136649. IEEE, 2013.\n\n[8] Alex Graves. Sequence Transduction with Recurrent Neural Networks. In International Conference on Machine Learning: Representation Learning Workshop, 2012.\n\n[9] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.\n\n[10] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.\n\n[11] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82\u201397, 2012.\n\n[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.\n\n[13] Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.\n\n[14] Scott Reed and Nando de Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.\n\n[15] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714, 2015.\n\n[16] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431\u20132439, 2015.\n\n[17] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 
Sequence to Sequence Learning with Neural Networks. In Neural Information Processing Systems, 2014.\n\n[18] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In NIPS, 2015.\n\n[19] Oriol Vinyals and Quoc V. Le. A neural conversational model. In ICML Deep Learning Workshop, 2015.\n\n[20] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.\n\n[21] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv preprint arXiv:1505.00521, 2015.\n", "award": [], "sourceid": 2025, "authors": [{"given_name": "Navdeep", "family_name": "Jaitly", "institution": "Google Brain"}, {"given_name": "Quoc", "family_name": "Le", "institution": "Google"}, {"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google"}, {"given_name": "Ilya", "family_name": "Sutskever", "institution": "Google"}, {"given_name": "David", "family_name": "Sussillo", "institution": "Google"}, {"given_name": "Samy", "family_name": "Bengio", "institution": "Google Brain"}]}