{"title": "Sequence to Sequence Learning with Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3104, "page_last": 3112, "abstract": "Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.", "full_text": "Sequence to Sequence Learning\n\nwith Neural Networks\n\nIlya Sutskever\n\nGoogle\n\nOriol Vinyals\n\nGoogle\n\nQuoc V. 
Le\n\nGoogle\n\nilyasu@google.com\n\nvinyals@google.com\n\nqvl@google.com\n\nAbstract\n\nDeep Neural Networks (DNNs) are powerful models that have achieved excel-\nlent performance on dif\ufb01cult learning tasks. Although DNNs work well whenever\nlarge labeled training sets are available, they cannot be used to map sequences to\nsequences. In this paper, we present a general end-to-end approach to sequence\nlearning that makes minimal assumptions on the sequence structure. Our method\nuses a multilayered Long Short-Term Memory (LSTM) to map the input sequence\nto a vector of a \ufb01xed dimensionality, and then another deep LSTM to decode the\ntarget sequence from the vector. Our main result is that on an English to French\ntranslation task from the WMT\u201914 dataset, the translations produced by the LSTM\nachieve a BLEU score of 34.8 on the entire test set, where the LSTM\u2019s BLEU\nscore was penalized on out-of-vocabulary words. Additionally, the LSTM did not\nhave dif\ufb01culty on long sentences. For comparison, a phrase-based SMT system\nachieves a BLEU score of 33.3 on the same dataset. When we used the LSTM\nto rerank the 1000 hypotheses produced by the aforementioned SMT system, its\nBLEU score increases to 36.5, which is close to the previous best result on this\ntask. The LSTM also learned sensible phrase and sentence representations that\nare sensitive to word order and are relatively invariant to the active and the pas-\nsive voice. 
Finally, we found that reversing the order of the words in all source\nsentences (but not target sentences) improved the LSTM\u2019s performance markedly,\nbecause doing so introduced many short term dependencies between the source\nand the target sentence which made the optimization problem easier.\n\n1 Introduction\n\nDeep Neural Networks (DNNs) are extremely powerful machine learning models that achieve\nexcellent performance on difficult problems such as speech recognition [13, 7] and visual object\nrecognition [19, 6, 21, 20]. DNNs are powerful because they can perform arbitrary parallel computation\nfor a modest number of steps. A surprising example of the power of DNNs is their ability to sort\nN N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are\nrelated to conventional statistical models, they learn an intricate computation. Furthermore, large\nDNNs can be trained with supervised backpropagation whenever the labeled training set has enough\ninformation to specify the network\u2019s parameters. Thus, if there exists a parameter setting of a large\nDNN that achieves good results (for example, because humans can solve the task very rapidly),\nsupervised backpropagation will find these parameters and solve the problem.\n\nDespite their flexibility and power, DNNs can only be applied to problems whose inputs and targets\ncan be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since\nmany important problems are best expressed with sequences whose lengths are not known a priori.\nFor example, speech recognition and machine translation are sequential problems. Likewise,\nquestion answering can also be seen as mapping a sequence of words representing the question to a\nsequence of words representing the answer. 
It is therefore clear that a domain-independent method\nthat learns to map sequences to sequences would be useful.\n\nSequences pose a challenge for DNNs because they require that the dimensionality of the inputs and\noutputs is known and fixed. In this paper, we show that a straightforward application of the Long\nShort-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems.\nThe idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large\nfixed-dimensional vector representation, and then to use another LSTM to extract the output sequence\nfrom that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model\n[28, 23, 30] except that it is conditioned on the input sequence. The LSTM\u2019s ability to successfully\nlearn on data with long range temporal dependencies makes it a natural choice for this application\ndue to the considerable time lag between the inputs and their corresponding outputs (fig. 1).\n\nThere have been a number of related attempts to address the general sequence to sequence learning\nproblem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18]\nwho were the first to map the entire input sentence to a vector, and is related to Cho et al. [5] although\nthe latter was used only for rescoring hypotheses produced by a phrase-based system. Graves [10]\nintroduced a novel differentiable attention mechanism that allows neural networks to focus on\ndifferent parts of their input, and an elegant variant of this idea was successfully applied to machine\ntranslation by Bahdanau et al. [2]. 
Connectionist Temporal Classification is another popular\ntechnique for mapping sequences to sequences with neural networks, but it assumes a monotonic\nalignment between the inputs and the outputs [11].\n\nFigure 1: Our model reads an input sentence \u201cABC\u201d and produces \u201cWXYZ\u201d as the output sentence. The\nmodel stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the\ninput sentence in reverse, because doing so introduces many short term dependencies in the data that make the\noptimization problem much easier.\n\nThe main result of this work is the following. On the WMT\u201914 English to French translation task,\nwe obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep\nLSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right\nbeam-search decoder. This is by far the best result achieved by direct translation with large neural\nnetworks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81\nBLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized\nwhenever the reference translation contained a word not covered by these 80k. This result shows\nthat a relatively unoptimized small-vocabulary neural network architecture which has much room\nfor improvement outperforms a phrase-based SMT system.\n\nFinally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on\nthe same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by\n3.2 BLEU points and is close to the previous best published result on this task (which is 37.0 [9]).\n\nSurprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other\nresearchers with related architectures [26]. 
We were able to do well on long sentences because we\nreversed the order of words in the source sentence but not the target sentences in the training and test\nset. By doing so, we introduced many short term dependencies that made the optimization problem\nmuch simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with\nlong sentences. The simple trick of reversing the words in the source sentence is one of the key\ntechnical contributions of this work.\n\nA useful property of the LSTM is that it learns to map an input sentence of variable length into\na fixed-dimensional vector representation. Given that translations tend to be paraphrases of the\nsource sentences, the translation objective encourages the LSTM to find sentence representations\nthat capture their meaning, as sentences with similar meanings are close to each other while different\nsentence meanings will be far apart. A qualitative evaluation supports this claim, showing that our model\nis aware of word order and is fairly invariant to the active and passive voice.\n\n2 The model\n\nThe Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural\nnetworks to sequences. Given a sequence of inputs $(x_1, \\ldots, x_T)$, a standard RNN computes a\nsequence of outputs $(y_1, \\ldots, y_T)$ by iterating the following equations:\n\n$$h_t = \\mathrm{sigm}\\left(W^{hx} x_t + W^{hh} h_{t-1}\\right)$$\n$$y_t = W^{yh} h_t$$\n\nThe RNN can easily map sequences to sequences whenever the alignment between the inputs and the\noutputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose\ninput and output sequences have different lengths with complicated and non-monotonic\nrelationships.\n\nThe simplest strategy for general sequence learning is to map the input sequence to a fixed-sized\nvector using one RNN, and then to map the vector to the target sequence with another RNN (this\napproach has also been taken by Cho et al. [5]). 
While it could work in principle since the RNN is\nprovided with all the relevant information, it would be difficult to train the RNNs due to the resulting\nlong term dependencies (figure 1) [14, 4, 16, 15]. However, the Long Short-Term Memory (LSTM)\n[16] is known to learn problems with long range temporal dependencies, so an LSTM may succeed\nin this setting.\n\nThe goal of the LSTM is to estimate the conditional probability $p(y_1, \\ldots, y_{T'} \\mid x_1, \\ldots, x_T)$ where\n$(x_1, \\ldots, x_T)$ is an input sequence and $y_1, \\ldots, y_{T'}$ is its corresponding output sequence whose length\n$T'$ may differ from $T$. The LSTM computes this conditional probability by first obtaining the\nfixed-dimensional representation $v$ of the input sequence $(x_1, \\ldots, x_T)$ given by the last hidden state of the\nLSTM, and then computing the probability of $y_1, \\ldots, y_{T'}$ with a standard LSTM-LM formulation\nwhose initial hidden state is set to the representation $v$ of $x_1, \\ldots, x_T$:\n\n$$p(y_1, \\ldots, y_{T'} \\mid x_1, \\ldots, x_T) = \\prod_{t=1}^{T'} p(y_t \\mid v, y_1, \\ldots, y_{t-1}) \\qquad (1)$$\n\nIn this equation, each $p(y_t \\mid v, y_1, \\ldots, y_{t-1})$ distribution is represented with a softmax over all the\nwords in the vocabulary. We use the LSTM formulation from Graves [10]. Note that we require that\neach sentence ends with a special end-of-sentence symbol \u201c<EOS>\u201d, which enables the model to\ndefine a distribution over sequences of all possible lengths. The overall scheme is outlined in figure\n1, where the shown LSTM computes the representation of \u201cA\u201d, \u201cB\u201d, \u201cC\u201d, \u201c<EOS>\u201d and then uses\nthis representation to compute the probability of \u201cW\u201d, \u201cX\u201d, \u201cY\u201d, \u201cZ\u201d, \u201c<EOS>\u201d.\n\nOur actual models differ from the above description in three important ways. 
First, we used two\ndifferent LSTMs: one for the input sequence and another for the output sequence, because doing\nso increases the number of model parameters at negligible computational cost and makes it natural to\ntrain the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs\nsignificantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found\nit extremely valuable to reverse the order of the words of the input sentence. So for example, instead\nof mapping the sentence a, b, c to the sentence \u03b1, \u03b2, \u03b3, the LSTM is asked to map c, b, a to \u03b1, \u03b2, \u03b3,\nwhere \u03b1, \u03b2, \u03b3 is the translation of a, b, c. This way, a is in close proximity to \u03b1, b is fairly close to \u03b2,\nand so on, a fact that makes it easy for SGD to \u201cestablish communication\u201d between the input and the\noutput. We found this simple data transformation to greatly improve the performance of the LSTM.\n\n3 Experiments\n\nWe applied our method to the WMT\u201914 English to French MT task in two ways. We used it to\ndirectly translate the input sentence without using a reference SMT system and we used it to rescore the\nn-best lists of an SMT baseline. We report the accuracy of these translation methods, present sample\ntranslations, and visualize the resulting sentence representation.\n\n3.1 Dataset details\n\nWe used the WMT\u201914 English to French dataset. We trained our models on a subset of 12M\nsentences consisting of 348M French words and 304M English words, which is a clean \u201cselected\u201d\nsubset from [29]. We chose this translation task and this specific training set subset because of the\npublic availability of a tokenized training and test set together with 1000-best lists from the baseline\nSMT [29].\n\nAs typical neural language models rely on a vector representation for each word, we used a fixed\nvocabulary for both languages. 
We used 160,000 of the most frequent words for the source language\nand 80,000 of the most frequent words for the target language. Every out-of-vocabulary word was\nreplaced with a special \u201cUNK\u201d token.\n\n3.2 Decoding and Rescoring\n\nThe core of our experiments involved training a large deep LSTM on many sentence pairs. We\ntrained it by maximizing the log probability of a correct translation $T$ given the source sentence $S$,\nso the training objective is\n\n$$\\frac{1}{|\\mathcal{S}|} \\sum_{(T,S) \\in \\mathcal{S}} \\log p(T \\mid S)$$\n\nwhere $\\mathcal{S}$ is the training set. Once training is complete, we produce translations by finding the most\nlikely translation according to the LSTM:\n\n$$\\hat{T} = \\arg\\max_{T} \\, p(T \\mid S) \\qquad (2)$$\n\nWe search for the most likely translation using a simple left-to-right beam search decoder which\nmaintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some\ntranslation. At each timestep we extend each partial hypothesis in the beam with every possible\nword in the vocabulary. This greatly increases the number of the hypotheses so we discard all but\nthe B most likely hypotheses according to the model\u2019s log probability. As soon as the \u201c<EOS>\u201d\nsymbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete\nhypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system\nperforms well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam\nsearch (Table 1).\n\nWe also used the LSTM to rescore the 1000-best lists produced by the baseline system [29]. 
To\nrescore an n-best list, we computed the log probability of every hypothesis with our LSTM and took\nan even average with their score and the LSTM\u2019s score.\n\n3.3 Reversing the Source Sentences\n\nWhile the LSTM is capable of solving problems with long term dependencies, we discovered that\nthe LSTM learns much better when the source sentences are reversed (the target sentences are not\nreversed). By doing so, the LSTM\u2019s test perplexity dropped from 5.8 to 4.7, and the test BLEU\nscores of its decoded translations increased from 25.9 to 30.6.\n\nWhile we do not have a complete explanation for this phenomenon, we believe that it is caused by\nthe introduction of many short term dependencies to the dataset. Normally, when we concatenate a\nsource sentence with a target sentence, each word in the source sentence is far from its corresponding\nword in the target sentence. As a result, the problem has a large \u201cminimal time lag\u201d [17]. By\nreversing the words in the source sentence, the average distance between corresponding words in\nthe source and target language is unchanged. However, the first few words in the source language\nare now very close to the first few words in the target language, so the problem\u2019s minimal time lag is\ngreatly reduced. Thus, backpropagation has an easier time \u201cestablishing communication\u201d between\nthe source sentence and the target sentence, which in turn results in substantially improved overall\nperformance.\n\nInitially, we believed that reversing the input sentences would only lead to more confident\npredictions in the early parts of the target sentence and to less confident predictions in the later parts.\nHowever, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs\ntrained on the raw source sentences (see sec. 
3.7), which suggests that reversing the input sentences\nresults in LSTMs with better memory utilization.\n\n3.4 Training details\n\nWe found that the LSTM models are fairly easy to train. We used deep LSTMs with 4 layers,\nwith 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary\nof 160,000 and an output vocabulary of 80,000. Thus the deep LSTM uses 8000 real numbers to\nrepresent a sentence. We found deep LSTMs to significantly outperform shallow LSTMs, where\neach additional layer reduced perplexity by nearly 10%, possibly due to their much larger hidden\nstate. We used a naive softmax over 80,000 words at each output. The resulting LSTM has 384M\nparameters of which 64M are pure recurrent connections (32M for the \u201cencoder\u201d LSTM and 32M\nfor the \u201cdecoder\u201d LSTM). The complete training details are given below:\n\n\u2022 We initialized all of the LSTM\u2019s parameters with the uniform distribution between -0.08\nand 0.08.\n\n\u2022 We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7.\nAfter 5 epochs, we began halving the learning rate every half epoch. We trained our models\nfor a total of 7.5 epochs.\n\n\u2022 We used batches of 128 sequences for the gradient and divided it by the size of the batch\n(namely, 128).\n\n\u2022 Although LSTMs tend to not suffer from the vanishing gradient problem, they can have\nexploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10,\n25] by scaling it when its norm exceeded a threshold. For each training batch, we compute\n$s = \\|g\\|_2$, where $g$ is the gradient divided by 128. If $s > 5$, we set $g = \\frac{5g}{s}$.\n\n\u2022 Different sentences have different lengths. 
Most sentences are short (e.g., length 20-30)\nbut some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen\ntraining sentences will have many short sentences and few long sentences, and as a result,\nmuch of the computation in the minibatch is wasted. To address this problem, we made sure\nthat all sentences in a minibatch are roughly of the same length, yielding a 2x speedup.\n\n3.5 Parallelization\n\nA C++ implementation of deep LSTM with the configuration from the previous section on a\nsingle GPU processes approximately 1,700 words per second. This was too slow for our\npurposes, so we parallelized our model using an 8-GPU machine. Each layer of the LSTM was\nexecuted on a different GPU and communicated its activations to the next GPU / layer as soon as\nthey were computed. Our models have 4 layers of LSTMs, each of which resides on a separate\nGPU. The remaining 4 GPUs were used to parallelize the softmax, so each GPU was responsible\nfor multiplying by a 1000 \u00d7 20000 matrix. The resulting implementation achieved a speed of 6,300\n(both English and French) words per second with a minibatch size of 128. Training took about ten\ndays with this implementation.\n\n3.6 Experimental Results\n\nWe used the cased BLEU score [24] to evaluate the quality of our translations. We computed our\nBLEU scores using multi-bleu.pl (see footnote 1) on the tokenized predictions and ground truth. This way\nof evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33.3 score of [29].\nHowever, if we evaluate the best WMT\u201914 system [9] (whose predictions can be downloaded from\nstatmt.org\\matrix) in this manner, we get 37.0, which is greater than the 35.8 reported by\nstatmt.org\\matrix.\n\nThe results are presented in tables 1 and 2. Our best results are obtained with an ensemble of LSTMs\nthat differ in their random initializations and in the random order of minibatches. 
While the decoded\ntranslations of the LSTM ensemble do not outperform the best WMT\u201914 system, it is the first time\nthat a pure neural translation system outperforms a phrase-based SMT baseline on a large scale MT\ntask by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is\nwithin 0.5 BLEU points of the best WMT\u201914 result if it is used to rescore the 1000-best list of the\nbaseline system.\n\nFootnote 1: There are several variants of the BLEU score, and each variant is defined with a perl script.\n\nMethod | test BLEU score (ntst14)\nBahdanau et al. [2] | 28.45\nBaseline System [29] | 33.30\nSingle forward LSTM, beam size 12 | 26.17\nSingle reversed LSTM, beam size 12 | 30.59\nEnsemble of 5 reversed LSTMs, beam size 1 | 33.00\nEnsemble of 2 reversed LSTMs, beam size 12 | 33.27\nEnsemble of 5 reversed LSTMs, beam size 2 | 34.50\nEnsemble of 5 reversed LSTMs, beam size 12 | 34.81\n\nTable 1: The performance of the LSTM on WMT\u201914 English to French test set (ntst14). Note that\nan ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of\nsize 12.\n\nMethod | test BLEU score (ntst14)\nBaseline System [29] | 33.30\nCho et al. [5] | 34.54\nBest WMT\u201914 result [9] | 37.0\nRescoring the baseline 1000-best with a single forward LSTM | 35.61\nRescoring the baseline 1000-best with a single reversed LSTM | 35.85\nRescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs | 36.5\nOracle Rescoring of the Baseline 1000-best lists | \u223c45\n\nTable 2: Methods that use neural networks together with an SMT system on the WMT\u201914 English\nto French test set (ntst14).\n\n3.7 Performance on long sentences\n\nWe were surprised to discover that the LSTM did well on long sentences, which is shown\nquantitatively in figure 3. 
Table 3 presents several examples of long sentences and their translations.\n\n3.8 Model Analysis\n\n[Figure 2: two 2-dimensional PCA scatter plots of phrase representations. First panel phrases: Mary admires John; Mary is in love with John; Mary respects John; John admires Mary; John is in love with Mary; John respects Mary. Second panel phrases: I was given a card by her in the garden; In the garden , she gave me a card; She gave me a card in the garden; She was given a card by me in the garden; In the garden , I gave her a card; I gave her a card in the garden.]\n\nFigure 2: The figure shows a 2-dimensional PCA projection of the LSTM hidden states that are obtained\nafter processing the phrases in the figures. The phrases are clustered by meaning, which in these examples is\nprimarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that\nboth clusters have similar internal structure.\n\nOne of the attractive features of our model is its ability to turn a sequence of words into a vector\nof fixed dimensionality. Figure 2 visualizes some of the learned representations. 
The figure clearly\nshows that the representations are sensitive to the order of words, while being fairly insensitive to the\nreplacement of an active voice with a passive voice. The two-dimensional projections are obtained\nusing PCA.\n\nType | Sentence\nOur model | Ulrich UNK , membre du conseil d\u2019 administration du constructeur automobile Audi , affirme qu\u2019 il s\u2019 agit d\u2019 une pratique courante depuis des années pour que les téléphones portables puissent être collectés avant les réunions du conseil d\u2019 administration afin qu\u2019 ils ne soient pas utilisés comme appareils d\u2019 écoute à distance .\nTruth | Ulrich Hackenberg , membre du conseil d\u2019 administration du constructeur automobile Audi , déclare que la collecte des téléphones portables avant les réunions du conseil , afin qu\u2019 ils ne puissent pas être utilisés comme appareils d\u2019 écoute à distance , est une pratique courante depuis des années .\nOur model | \u201c Les téléphones cellulaires , qui sont vraiment une question , non seulement parce qu\u2019 ils pourraient potentiellement causer des interférences avec les appareils de navigation , mais nous savons , selon la FCC , qu\u2019 ils pourraient interférer avec les tours de téléphone cellulaire lorsqu\u2019 ils sont dans l\u2019 air \u201d , dit UNK .\nTruth | \u201c Les téléphones portables sont véritablement un problème , non seulement parce qu\u2019 ils pourraient éventuellement créer des interférences avec les instruments de navigation , mais parce que nous savons , d\u2019 après la FCC , qu\u2019 ils pourraient perturber les antennes-relais de téléphonie mobile s\u2019 ils sont utilisés à bord \u201d , a déclaré Rosenker .\nOur model | Avec la crémation , il y a un \u201c sentiment de violence contre le corps d\u2019 un être cher \u201d , qui sera \u201c réduit à une pile de cendres \u201d en très peu de temps au lieu d\u2019 un processus de décomposition \u201c qui accompagnera les étapes du deuil \u201d .\nTruth | Il y a , avec la crémation , \u201c une violence faite au corps aimé \u201d , qui va être \u201c réduit à un tas de cendres \u201d en très peu de temps , et non après un processus de décomposition , qui \u201c accompagnerait les phases du deuil \u201d .\n\nTable 3: A few examples of long translations produced by the LSTM alongside the ground truth\ntranslations. The reader can verify that the translations are sensible using Google translate.\n\n[Figure 3: two plots of BLEU score (y-axis, 20 to 40) comparing the LSTM (34.8) with the baseline (33.3). Left x-axis: test sentences sorted by their length, marked at lengths 4, 7, 8, 12, 17, 22, 28, 35, 79. Right x-axis: test sentences sorted by average word frequency rank, from 0 to 3500.]\n\nFigure 3: The left plot shows the performance of our system as a function of sentence length, where the\nx-axis corresponds to the test sentences sorted by their length and is marked by the actual sequence lengths.\nThere is no degradation on sentences with less than 35 words, there is only a minor degradation on the longest\nsentences. The right plot shows the LSTM\u2019s performance on sentences with progressively more rare words,\nwhere the x-axis corresponds to the test sentences sorted by their \u201caverage word frequency rank\u201d.\n\n4 Related work\n\nThere is a large body of work on applications of neural networks to machine translation. 
So far,\nthe simplest and most effective way of applying an RNN-Language Model (RNNLM) [23] or a\nFeedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the\nn-best lists of a strong MT baseline [22], which reliably improves translation quality.\n\nMore recently, researchers have begun to look into ways of including information about the source\nlanguage into the NNLM. Examples of this work include Auli et al. [1], who combine an NNLM\nwith a topic model of the input sentence, which improves rescoring performance. Devlin et al. [8]\nfollowed a similar approach, but they incorporated their NNLM into the decoder of an MT system\nand used the decoder\u2019s alignment information to provide the NNLM with the most useful words in\nthe input sentence. Their approach was highly successful and it achieved large improvements over\ntheir baseline.\n\nOur work is closely related to Kalchbrenner and Blunsom [18], who were the first to map the input\nsentence into a vector and then back to a sentence, although they map sentences to vectors using\nconvolutional neural networks, which lose the ordering of the words. Similarly to this work, Cho et\nal. [5] used an LSTM-like RNN architecture to map sentences into vectors and back, although their\nprimary focus was on integrating their neural network into an SMT system. Bahdanau et al. [2] also\nattempted direct translations with a neural network that used an attention mechanism to overcome\nthe poor performance on long sentences experienced by Cho et al. [5] and achieved encouraging\nresults. Likewise, Pouget-Abadie et al. [26] attempted to address the memory problem of Cho et\nal. [5] by translating pieces of the source sentence in a way that produces smooth translations, which\nis similar to a phrase-based approach. 
We suspect that they could achieve similar improvements by\nsimply training their networks on reversed source sentences.\n\nEnd-to-end training is also the focus of Hermann et al. [12], whose model represents the inputs and\noutputs by feedforward networks, and maps them to similar points in space. However, their approach\ncannot generate translations directly: to get a translation, they need to look up the closest vector\nin the pre-computed database of sentences, or to rescore a sentence.\n\n5 Conclusion\n\nIn this work, we showed that a large deep LSTM that has a limited vocabulary and that makes\nalmost no assumption about problem structure can outperform a standard SMT-based system whose\nvocabulary is unlimited on a large-scale MT task. The success of our simple LSTM-based approach\non MT suggests that it should do well on many other sequence learning problems, provided they\nhave enough training data.\n\nWe were surprised by the extent of the improvement obtained by reversing the words in the source\nsentences. We conclude that it is important to find a problem encoding that has the greatest number\nof short term dependencies, as they make the learning problem much simpler. In particular, while\nwe were unable to train a standard RNN on the non-reversed translation problem (shown in fig. 1),\nwe believe that a standard RNN should be easily trainable when the source sentences are reversed\n(although we did not verify it experimentally).\n\nWe were also surprised by the ability of the LSTM to correctly translate very long sentences. We\nwere initially convinced that the LSTM would fail on long sentences due to its limited memory,\nand other researchers reported poor performance on long sentences with a model similar to ours\n[5, 2, 26]. 
And yet, LSTMs trained on the reversed dataset had little difficulty translating long\nsentences.\n\nMost importantly, we demonstrated that a simple, straightforward, and relatively unoptimized approach\ncan outperform an SMT system, so further work will likely lead to even greater translation\naccuracies. These results suggest that our approach will likely do well on other challenging sequence\nto sequence problems.\n\n6 Acknowledgments\n\nWe thank Samy Bengio, Jeff Dean, Matthieu Devin, Geoffrey Hinton, Nal Kalchbrenner, Thang Luong, Wolfgang\nMacherey, Rajat Monga, Vincent Vanhoucke, Peng Xu, Wojciech Zaremba, and the Google Brain team\nfor useful comments and discussions.\n\nReferences\n\n[1] M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent neural networks. In EMNLP, 2013.\n\n[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. In Journal of Machine Learning Research, pages 1137\u20131155, 2003.\n\n[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157\u2013166, 1994.\n\n[5] K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[6] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.\n\n[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. 
IEEE Transactions on Audio, Speech, and Language Processing - Special Issue on Deep Learning for Speech and Language Processing, 2012.\n\n[8] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network joint models for statistical machine translation. In ACL, 2014.\n\n[9] N. Durrani, B. Haddow, P. Koehn, and K. Heafield. Edinburgh\u2019s phrase-based machine translation systems for WMT-14. In WMT, 2014.\n\n[10] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.\n\n[11] A. Graves, S. Fern\u00e1ndez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.\n\n[12] K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In ICLR, 2014.\n\n[13] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.\n\n[14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Master\u2019s thesis, Institut fur Informatik, Technische Universitat, Munchen, 1991.\n\n[15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.\n\n[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.\n\n[17] S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. 1997.\n\n[18] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.\n\n[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.\n\n[20] Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, and A.Y. Ng. 
Building high-level features using large scale unsupervised learning. In ICML, 2012.\n\n[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.\n\n[22] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.\n\n[23] T. Mikolov, M. Karafi\u00e1t, L. Burget, J. \u010cernock\u00fd, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045\u20131048, 2010.\n\n[24] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.\n\n[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.\n\n[26] J. Pouget-Abadie, D. Bahdanau, B. van Merrienboer, K. Cho, and Y. Bengio. Overcoming the curse of sentence length for neural machine translation using automatic segmentation. arXiv preprint arXiv:1409.1257, 2014.\n\n[27] A. Razborov. On small depth threshold circuits. In Proc. 3rd Scandinavian Workshop on Algorithm Theory, 1992.\n\n[28] D. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533\u2013536, 1986.\n\n[29] H. Schwenk. University Le Mans. http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/, 2014. [Online; accessed 03-September-2014].\n\n[30] M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. In INTERSPEECH, 2010.\n\n[31] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of IEEE, 1990.", "award": [], "sourceid": 1610, "authors": [{"given_name": "Ilya", "family_name": "Sutskever", "institution": "Google"}, {"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google Research"}, {"given_name": "Quoc", "family_name": "Le", "institution": "Google"}]}