{"title": "Deliberation Networks: Sequence Generation Beyond One-Pass Decoding", "book": "Advances in Neural Information Processing Systems", "page_first": 1784, "page_last": 1794, "abstract": "The encoder-decoder framework has achieved promising progress for many sequence generation tasks, including machine translation, text summarization, dialog system, image captioning, etc. Such a framework adopts an one-pass forward process while decoding and generating a sequence, but lacks the deliberation process: A generated sequence is directly used as final output without further polishing. However, deliberation is a common behavior in human's daily life like reading news and writing papers/articles/books. In this work, we introduce the deliberation process into the encoder-decoder framework and propose deliberation networks for sequence generation. A deliberation network has two levels of decoders, where the first-pass decoder generates a raw sequence and the second-pass decoder polishes and refines the raw sentence with deliberation. Since the second-pass deliberation decoder has global information about what the sequence to be generated might be, it has the potential to generate a better sequence by looking into future words in the raw sentence. Experiments on neural machine translation and text summarization demonstrate the effectiveness of the proposed deliberation networks. 
On the WMT 2014 English-to-French translation task, our model establishes a new state-of-the-art BLEU score of 41.5.", "full_text": "Deliberation Networks: Sequence Generation\n\nBeyond One-Pass Decoding \u2217\n\n1Yingce Xia, 2Fei Tian, 3Lijun Wu, 1Jianxin Lin, 2Tao Qin, 1Nenghai Yu, 2Tie-Yan Liu\n\n1University of Science and Technology of China, Hefei, China\n\n2Microsoft Research, Beijing, China\n\n3Sun Yat-sen University, Guangzhou, China\n\n1yingce.xia@gmail.com, linjx@mail.ustc.edu.cn, ynh@ustc.edu.cn\n\n2{fetia,taoqin,tie-yan.liu}@microsoft.com, 3wulijun3@mail2.sysu.edu.cn\n\nAbstract\n\nThe encoder-decoder framework has achieved promising progress for many se-\nquence generation tasks, including machine translation, text summarization, dialog\nsystem, image captioning, etc. Such a framework adopts an one-pass forward pro-\ncess while decoding and generating a sequence, but lacks the deliberation process:\nA generated sequence is directly used as \ufb01nal output without further polishing.\nHowever, deliberation is a common behavior in human\u2019s daily life like reading\nnews and writing papers/articles/books. In this work, we introduce the deliberation\nprocess into the encoder-decoder framework and propose deliberation networks for\nsequence generation. A deliberation network has two levels of decoders, where the\n\ufb01rst-pass decoder generates a raw sequence and the second-pass decoder polishes\nand re\ufb01nes the raw sentence with deliberation. Since the second-pass deliberation\ndecoder has global information about what the sequence to be generated might\nbe, it has the potential to generate a better sequence by looking into future words\nin the raw sentence. Experiments on neural machine translation and text summa-\nrization demonstrate the effectiveness of the proposed deliberation networks. 
On\nthe WMT 2014 English-to-French translation task, our model establishes a new\nstate-of-the-art BLEU score of 41.5.\n\n1 Introduction\n\nThe neural network based encoder-decoder framework has been widely adopted for sequence generation\ntasks, including neural machine translation [1], text summarization [19], image captioning [27],\netc. In such a framework, the encoder encodes the source input x with length m into a sequence of\nvectors {h1, h2,\u00b7\u00b7\u00b7 , hm}. The decoder, which is typically an RNN, generates an output sequence\nword by word2 based on the source-side vector representations and previously generated words. The\nattention mechanism [1, 35], which dynamically attends to different parts of x while generating\neach target-side word, is integrated into the encoder-decoder framework to improve the quality of\ngenerating long sequences [1].\n\n\u2217This work was done when Yingce Xia, Lijun Wu and Jianxin Lin were interns at Microsoft Research.\n2Throughout this work, a word refers to the basic unit in a sequence.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nAlthough the framework has achieved great success, one concern is that while generating one word,\none can only leverage the generated words but not the future words un-generated so far. That is,\nwhen the decoder generates the t-th word yt, only y<t can be used, while the possible words y>t are\nnot explicitly considered. In contrast, in real-world human cognitive processes, global information,\nincluding both the past and the future parts, is leveraged in an iterative polishing process. Here are\ntwo examples: (1) Consider the situation that we are reading a sentence and meet an unknown word\nin the middle of the sentence. We do not stop here. Instead, we move forward until the end of the\nsentence. Then we go back to the unknown word and try to understand it using its context, including\nthe words both preceding and following it. 
(2) To write a good document (or paragraph, article), we\nusually first create a complete draft and then polish it based on global understanding of the whole\ndraft. When polishing a specific part, we take the whole picture of the draft into consideration to\nevaluate how well the local element fits into the global environment rather than only looking back to\nthe preceding parts.\nWe call such a polishing process deliberation. Motivated by such human cognitive behaviors, we\npropose the deliberation networks, which leverage the global information with both looking back\nand forward in sequence decoding through a deliberation process. Concretely speaking, to integrate\nsuch a process into the sequence generation framework, we carefully design our architecture, which\nconsists of two decoders, a first-pass decoder D1 and a second-pass/deliberation decoder D2, as\nwell as an encoder E. Given a source input x, E and D1 jointly work like the standard encoder-\ndecoder model to generate a coarse sequence \u02c6y as a draft and the corresponding representations\n\u02c6s = {\u02c6s1, \u02c6s2,\u00b7\u00b7\u00b7 , \u02c6sT\u02c6y} used to generate \u02c6y, where T\u02c6y is the length of \u02c6y. Afterwards, the deliberation\ndecoder D2 takes x, \u02c6y and \u02c6s as inputs and outputs the refined sequence y. When D2 generates the\nt-th word yt, an additional attention model is used to assign an adaptive weight \u03b2j to each \u02c6yj and \u02c6sj\nfor any j \u2208 [T\u02c6y], and $\\sum_j \\beta_j [\\hat{y}_j; \\hat{s}_j]$ is fed into D2.3 In this way, the global information of the target\nsequence can be utilized to refine the generation process. 
We propose a Monte Carlo based algorithm\nto overcome the dif\ufb01culty brought by the discrete property of \u02c6y in optimizing the deliberation network.\nTo verify the effectiveness of our model, we work on two representative sequence generation tasks.\n(1) Neural machine translation refers to using neural networks to translate sentences from a source\nlanguage to a target language [1, 33, 32, 34]. A standard NMT model consists of an encoder (used\nto encode source sentences) and a decoder (used to generate target sentences), and thus can be\nimproved by our proposed deliberation network. Experimental results show that based on a widely\nused single-layer GRU model [1], on the WMT\u201914 [29] English\u2192French dataset, we can improve the\nBLEU score [17], by 1.7 points compared to the model without deliberation. We also apply our model\non Chinese\u2192English translations and improve BLEU by an averaged 1.26 points on four different\ntest sets. Furthermore, on the WMT\u201914 English\u2192French translation task, by applying deliberation to\na deep LSTM model, we achieve a BLEU score 41.50, setting a new record for this task.\n(2) Text summarization is a task that summarizes a long article into a short abstract. The encoder-\ndecoder framework can also be used for such a task and thus could be re\ufb01ned by deliberation networks.\nExperimental results on Gigaword dataset [6] show that deliberation network can improve ROUGE-1,\nROUGE-2, and ROUGE-L by 3.45, 1.70 and 3.02 points.\n\n1.1 Related Work\n\nAlthough there exist many works to improve the attention based encoder-decoder framework for\nsequence generation, such as changing the training loss [28, 18, 22] or the decoding objective [14, 7],\nnot much attention has been paid to the structure of the encoder-decoder framework. 
Our work\nchanges the structure of the framework by introducing the second-pass decoder into it.\nThe idea of deliberation/refinement is not well explored for sequence generation tasks, especially\nfor the encoder-decoder based approaches [3, 23, 1] in neural machine translation. One related work\nis post-editing [16, 2]: a source sentence e is first translated to f\u2032, and then f\u2032 is refined by another\nmodel. Different from our deliberation network, the two processes (i.e., generating and refining) in\npost-editing are separated. As a comparison, what we build is a consistent model in which all the\ncomponents are coupled together and jointly optimized in an end-to-end way. As a result, deliberation\nnetworks lead to better accuracies. Another related work is the review network [36]. The idea is to\nreview all the information encoded by the encoder to obtain thought vectors that are more compact\nand abstractive. The thought vectors are then used in decoding. Different from our work, the review\nsteps are added on the encoder side, while the decoder side is unchanged and still adopts one-pass\ndecoding.\n\n3In this work, let [v1; v2;\u00b7\u00b7\u00b7 ; vn] denote the long vector formed by concatenating the input vectors v1,\u00b7\u00b7\u00b7 , vn. With\na slight abuse of notation, [m] with a single integer input m denotes the set {1, 2,\u00b7\u00b7\u00b7 , m}.\n\nThe rest of our paper is organized as follows. Our proposed deliberation network is introduced in\nSection 2, including the model structure and the optimization process. Applications to neural machine\ntranslation and text summarization are introduced in Section 3 and Section 4 respectively. 
Section 5\nconcludes the paper and discusses possible future directions.\n\n2 The Framework\n\nIn this section, we \ufb01rst introduce the overall architecture of deliberation networks, then the details of\nindividual components, and \ufb01nally propose an end-to-end Monte Carlo based algorithm to train the\ndeliberation networks.\n\n2.1 Structure of Deliberation Networks\nAs shown in Figure 1, a deliberation network consists of an encoder E, a \ufb01rst-pass decoder D1 and\na second-pass decoder D2. Deliberation happens at the second-pass decoder, which is also called\ndeliberation decoder alternatively. Brie\ufb02y speaking, E is used to encode the source sequence into a\nsequence of vector representations. D1 reads the encoder representations and generates a \ufb01rst-pass\ntarget sequence as a draft, which is further provided as input to the deliberation decoder D2 for the\nsecond-pass decoding. In the rest of this section, for simplicity of description, we use RNN as the\nbasic building block for both the encoder and decoders4. All the W \u2019s and v\u2019s in this section with\ndifferent superscripts or subscripts are the parameters to be learned. Besides, all the bias terms are\nomitted to increase readability.\n\nFigure 1: Framework of deliberation networks: Blue, yellow and green parts indicate encoder E,\n\ufb01rst-pass decoder D1 and the second-pass decoder D2 respectively. The E-to-D1 attention model is\nomitted for readability.\n\n2.2 Encoder and First-pass Decoder\nWhen an input sequence x is fed into the encoder E, it is encoded into Tx hidden states H =\n{h1, h2,\u00b7\u00b7\u00b7 , hTx} where Tx is the length of x. 
Specifically, hi = RNN(xi, hi\u22121), where xi acts as\nthe representation (e.g., word embedding vector) for the i-th word in x and h0 is a zero vector.\nThe first-pass decoder D1 will generate a series of hidden states \u02c6sj \u2200j \u2208 [T\u02c6y], and a first-pass\nsequence \u02c6yj \u2200j \u2208 [T\u02c6y], where T\u02c6y is the length of the generated sequence. Next we show how they\nare generated in detail.\n\n4The proposed deliberation networks are independent of the specific implementation of the recurrent units\nand can be applied to simple RNN or its variants such as LSTM [11] or GRU [3].\n\nSimilar to the conventional encoder-decoder model, an attention model is included in D1. At step j,\nthe attention model in D1 first generates a context ctxe defined as follows:\n\n$$\\mathrm{ctx}_e = \\sum_{i=1}^{T_x} \\alpha_i h_i; \\quad \\alpha_i \\propto \\exp\\big(v_\\alpha^T \\tanh(W^c_{att,h} h_i + W^c_{att,\\hat{s}} \\hat{s}_{j-1})\\big) \\;\\; \\forall i \\in [T_x]; \\quad \\sum_{i=1}^{T_x} \\alpha_i = 1. \\tag{1}$$\n\nBased on ctxe, \u02c6sj is calculated as \u02c6sj = RNN([\u02c6yj\u22121; ctxe], \u02c6sj\u22121). After obtaining \u02c6sj, another affine\ntransformation is applied on the concatenated vector [\u02c6sj; ctxe; \u02c6yj\u22121]. Finally, the results of the\ntransformation are fed into a softmax layer, and \u02c6yj is sampled from the obtained multinomial\ndistribution.\n\n2.3 Second-Pass Decoder\nOnce the first-pass target sequence \u02c6y is generated by the first-pass decoder D1, it is fed into the\nsecond-pass decoder D2 for further refinement. 
Based on the sequence \u02c6y and the hidden states \u02c6sj\n\u2200j \u2208 [T\u02c6y] provided by D1, D2 eventually outputs the second-pass sequence y via the deliberation\nprocess.\nSpecifically, at step t, D2 takes the previous hidden state st\u22121 generated by itself, the previously decoded\nword yt\u22121, the source contextual information ctx\u2032e and the first-pass contextual information ctxc as\ninputs. Two detailed points are: (1) The computation of ctx\u2032e is similar to that of ctxe shown in Eqn. (1)\nwith two differences: First, \u02c6sj\u22121 is replaced by st\u22121; second, the model parameters are different. (2)\nTo obtain ctxc, D2 has an attention model (i.e., the Ac in Figure 1) that can map the words \u02c6yj\u2019s and\nthe hidden states \u02c6sj\u2019s into a context vector. Mathematically speaking, in the refinement process at\nthe t-th time step, the first-pass contextual information vector ctxc is computed as:\n\n$$\\mathrm{ctx}_c = \\sum_{j=1}^{T_{\\hat{y}}} \\beta_j [\\hat{s}_j; \\hat{y}_j]; \\quad \\beta_j \\propto \\exp\\big(v_\\beta^T \\tanh(W^d_{att,\\hat{s}y} [\\hat{s}_j; \\hat{y}_j] + W^d_{att,s} s_{t-1})\\big) \\;\\; \\forall j \\in [T_{\\hat{y}}]; \\quad \\sum_{j=1}^{T_{\\hat{y}}} \\beta_j = 1.$$\n\nAs can be seen from the above computation, the deliberation process at time step t in the second-pass\ndecoding uses the whole sequence generated by the first-pass decoder, including both the words\npreceding and following the t-th step in the first-pass sequence. That is, the first-pass contextual vector ctxc\naggregates the global information extracted from the first-pass sequence \u02c6y.\nAfter receiving ctxc, we calculate st as st = RNN([yt\u22121; ctx\u2032e; ctxc], st\u22121). Similar to sampling \u02c6yt in\nD1, [st; ctx\u2032e; ctxc; yt\u22121] will be further transformed to generate yt.\n\n2.4 Algorithm\nLet $D_{XY} = \\{(x^{(i)}, y^{(i)})\\}_{i=1}^{n}$ denote the training corpus with n paired sequences5. 
Denote the\nparameters of E, D1 and D2 as \u03b8e, \u03b81 and \u03b82 respectively. The training of sequence-to-sequence\nlearning is usually to maximize the data log likelihood $(1/n)\\sum_{i=1}^{n} \\log P(y_i|x_i)$. Under our setting,\nthis rule can be specialized to maximize $(1/n)\\sum_{(x,y)\\in D_{XY}} \\mathcal{J}(x, y; \\theta_e, \\theta_1, \\theta_2)$, where\n\n$$\\mathcal{J}(x, y; \\theta_e, \\theta_1, \\theta_2) = \\log \\sum_{y' \\in \\mathcal{Y}} P(y|y', E(x; \\theta_e); \\theta_2)\\, P(y'|E(x; \\theta_e); \\theta_1). \\tag{2}$$\n\nIn Eqn. (2), Y is the collection of all possible target sequences and E(x; \u03b8e) indicates a function that\nmaps x to its corresponding hidden states given by the encoder. One can verify that the first-order\nderivative of J (x, y; \u03b8e, \u03b81, \u03b82) w.r.t. \u03b81 is:\n\n$$\\nabla_{\\theta_1} \\mathcal{J}(x, y; \\theta_e, \\theta_1, \\theta_2) = \\frac{\\sum_{y' \\in \\mathcal{Y}} P(y|y', E(x; \\theta_e); \\theta_2)\\, \\nabla_{\\theta_1} P(y'|E(x; \\theta_e); \\theta_1)}{\\sum_{y' \\in \\mathcal{Y}} P(y|y', E(x; \\theta_e); \\theta_2)\\, P(y'|E(x; \\theta_e); \\theta_1)},$$\n\nwhich is extremely hard to compute due to the large space of Y. Similarly, the gradients w.r.t. \u03b8e and\n\u03b82 are also computationally intractable. To overcome such difficulties, we propose a Monte Carlo\nbased method to optimize the lower bound of J (x, y; \u03b8e, \u03b81, \u03b82). Note that, by the concavity of the log\nfunction (Jensen\u2019s inequality), one can verify that J (x, y; \u03b8e, \u03b81, \u03b82) \u2265 \u02dcJ (x, y; \u03b8e, \u03b81, \u03b82), with the right-hand side acting as a\nlower bound and defined as\n\n$$\\tilde{\\mathcal{J}}(x, y; \\theta_e, \\theta_1, \\theta_2) = \\sum_{y' \\in \\mathcal{Y}} P(y'|E(x; \\theta_e); \\theta_1)\\, \\log P(y|y', E(x; \\theta_e); \\theta_2). \\tag{3}$$\n\n5Let x(i) and y(i) denote the i-th source input and target output in the training data. Let xi and yi denote the i-th\nword in x and y.\n\nDenote \u02dcJ (x, y; \u03b8e, \u03b81, \u03b82) as \u02dcJ. 
The gradients of \u02dcJ w.r.t. its parameters are:\n\n$$\\nabla_{\\theta_1} \\tilde{\\mathcal{J}} = \\sum_{y' \\in \\mathcal{Y}} P(y'|E(x; \\theta_e); \\theta_1)\\, \\underbrace{\\log P(y|y', E(x; \\theta_e); \\theta_2)\\, \\nabla_{\\theta_1} \\log P(y'|E(x; \\theta_e); \\theta_1)}_{G_1};$$\n\n$$\\nabla_{\\theta_2} \\tilde{\\mathcal{J}} = \\sum_{y' \\in \\mathcal{Y}} P(y'|E(x; \\theta_e); \\theta_1)\\, \\underbrace{\\nabla_{\\theta_2} \\log P(y|y', E(x; \\theta_e); \\theta_2)}_{G_2}; \\tag{4}$$\n\n$$\\nabla_{\\theta_e} \\tilde{\\mathcal{J}} = \\sum_{y' \\in \\mathcal{Y}} P(y'|E(x; \\theta_e); \\theta_1)\\, G_e(x, y, y'; \\theta_e, \\theta_1, \\theta_2),$$\n\nwhere Ge is defined as follows:\n\n$$G_e = \\nabla_{\\theta_e} \\log P(y|y', E(x; \\theta_e); \\theta_2) + \\log P(y|y', E(x; \\theta_e); \\theta_2)\\, \\nabla_{\\theta_e} \\log P(y'|E(x; \\theta_e); \\theta_1).$$\n\nLet \u0398 = [\u03b81; \u03b82; \u03b8e] and G(x, y, y\u2032; \u0398) = [G1; G2; Ge], where G1, G2 and Ge are defined in Eqn. (4).\n(For ease of reference, we assume that all the \u03b8\u00b7\u2019s and G\u00b7\u2019s are flattened.) Obviously, if y\u2032 is sampled\nfrom distribution P (y\u2032|E(x; \u03b8e); \u03b81), G(x, y, y\u2032; \u0398) is an unbiased estimator of the gradient of \u02dcJ\nw.r.t. all model parameters \u0398. Based on that we propose our algorithm in Algorithm 1.\n\nAlgorithm 1: Algorithm to train the deliberation network\nInput: Training data corpus DXY; minibatch size m; optimizer Opt(\u00b7\u00b7\u00b7) with gradients as input;\nwhile models not converged do\n    Randomly sample a mini-batch of m sequence pairs {x(i), y(i)} \u2200i \u2208 [m] from DXY;\n    For any x(i) where i \u2208 [m], sample y\u2032(i) according to distribution P (\u00b7|E(x(i); \u03b8e); \u03b81);\n    Perform parameter update: $\\Theta \\leftarrow \\Theta + Opt\\big(\\frac{1}{m} \\sum_{i=1}^{m} G(x^{(i)}, y^{(i)}, y'^{(i)}; \\Theta)\\big)$.\n\nDiscussions (1) The choice of Opt(...) is quite flexible. 
One can choose different optimizers such\nas Adadelta [37], Adam [13], or SGD for different tasks, depending on common practice in the\nspecific task. (2) The Y space is usually extremely large in sequence generation tasks. To obtain\nbetter sampled y\u2032, we can use beam search instead of random sampling.\n\n3 Application to Neural Machine Translation\n\nWe evaluate the deliberation networks with two different network structures: (1) the shallow model,\nwhich is based on a widely-used single-layer GRU model named RNNSearch [1, 12]; (2) the deep\nmodel, which is based on a deep LSTM model similar to GNMT [31]. Both kinds of\nmodels are implemented in Theano [24].\n\n3.1 Shallow Models\n\n3.1.1 Settings\nDatasets We work on two translation tasks, English-to-French translation (denoted as En\u2192Fr) and\nChinese-to-English translation (denoted as Zh\u2192En). For En\u2192Fr, we employ the standard filtered\nWMT\u201914 dataset6, which is widely used in NMT literature [1, 12]. There are 12M bilingual sentence\npairs in the dataset. We concatenate newstest2012 and newstest2013 together as the validation set and\nuse newstest2014 as the test set. For Zh\u2192En, we choose 1.25M bilingual sentence pairs from LDC\ndataset as training corpus, use NIST2003 as the validation set, and NIST2004, NIST2005, NIST2006,\nNIST2008 as the test sets. Following the common practice [1, 12], we remove the sentences with\nmore than 50 words for both translation tasks. Furthermore, we limit both the source and\ntarget vocabularies to the 30k most frequent words. The out-of-vocabulary words are replaced by a special token\n\u201cUNK\u201d.\nModel We choose the most widely adopted NMT model RNNSearch [1, 12, 25] as the basic structure\nto construct the deliberation network. To be specific, all of E, D1 and D2 are GRU networks [1] with\none hidden layer of 1000 neurons. The word embedding dimension is set as 620. 
For Zh\u2192En, we\napply 0.5 dropout rate to the layer before softmax and no dropout is used in En\u2192Fr translation.\n\n6http://www-lium.univ-lemans.fr/\u02dcschwenk/cslm_joint_paper/data/bitexts.tgz\n\n5\n\n\fOptimization All the models are trained on a single NVIDIA K40 GPU. We \ufb01rst pre-train two standard\nencoder-decoder based NMT models (i.e., RNNSearch) until convergence, which take about two\nweeks for En\u2192Fr and one week for Zh\u2192En using Adadelta [37]. For any deliberation network, (1)\nthe encoder is initialized by the encoder of the pre-trained RNNSearch model; (2) both the \ufb01rst-pass\nand second-pass decoders are initialized by the decoder of the pre-trained RNNSearch model; (3)\nthe attention model used to compute the \ufb01rst-pass context vector is randomly initialized from a\nuniform distribution on [\u22120.1, 0.1]. Then we train the deliberation networks by Algorithm 1 until\nconvergence, which takes roughly 5 days for both tasks. The minibatch size is \ufb01xed as 80 throughout\nthe optimization. Plain SGD is used as the optimizer in this process, with initial learning rate 0.2\nand halving according to validation accuracy. To sample the intermediate translation output by the\n\ufb01rst decoder, we use beam search with beam size 2, considering the tradeoff between accuracy and\nef\ufb01ciency.\nEvaluation We use BLEU [17] as the evaluation metric for translation qualities. BLEU is the geometric\nmean of n-gram precisions where n \u2208 {1, 2, 3, 4}, weighted by sentence lengths. Following the\ncommon practice in NMT, we use multi-bleu.pl7 to calculate case-sensitive BLEU scores for En\u2192Fr,\nwhile evaluating the translation qualities of Zh\u2192En by case-insensitive BLEU scores. The larger the\nBLEU score is, the better the translation quality is. 
For the baselines and deliberation networks, we\nuse beam search with beam size 12 to generate sentences.\nBaselines We compare our proposed algorithms with the following baselines: (i) The standard NMT\nalgorithm RNNSearch [1, 12], denoted as Mbase; (ii) The standard NMT model with two stacked\ndecoding layers, denoted as Mdec\u00d72; (iii) The review network proposed in [36]. We try both 4 and 8\nreviewers and find the 4-reviewer model is slightly better. The review network in our experiment is\ntherefore denoted as Mreviewer\u00d74. We refer to our proposed algorithm as Mdelib.\n\n3.1.2 Results\n\nTable 1 shows the results of En\u2192Fr\ntranslation. We have several observations:\n(1) Our proposed algorithm performs the best among all candidates, which validates the effectiveness\nof the deliberation process. (2) Our method Mdelib outperforms the baseline algorithm Mbase. This\nshows that further polishing the raw output indeed leads to better sequences. (3) Applying an\nadditional decoding layer, i.e., Mdec\u00d72, increases the translation quality, but it is still far behind\nthat of Mdelib. Clearly, the second decoder layer of Mdec\u00d72 can still only leverage the previously\ngenerated words but not unseen and un-generated future words, while the second-pass decoder of\nMdelib can leverage the richer information contained in all the words from the first-pass decoder. Such\na refinement process from the global view significantly improves the translation results. 
(4) Mdelib\noutperforms Mreviewer\u00d74 by 0.91 points, which shows that reviewing the possible future contextual\ninformation from the source side is not enough. The \u201cfuture\u201d information from the decoder side is\nalso very important, since it is directly related to the final output.\n\nTable 1: BLEU scores of En\u2192Fr translation\n\nAlgorithm | Mbase | Mdec\u00d72 | Mreviewer\u00d74 | Mdelib\nBLEU | 29.97 | 30.40 | 30.76 | 31.67\n\nThe translation results of Zh\u2192En are summarized in Table 2. We have similar observations as those\nfor En\u2192Fr translations: Mdelib outperforms all the baseline methods, particularly with an average\ngain of 1.26 points over Mbase.\nApart from the quantitative analysis, we list two examples in Table 3 to better understand how a\ndeliberation network works. Each example contains five sentences, which are the source sentence\nin Chinese, the reference sentence in English as ground truth translation, the translation generated\nby Mbase and the output translation by both the first-pass decoder and second-pass decoder (i.e., the\nfinal translation by deliberation network Mdelib).\n\n7https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl\n\nTable 2: BLEU scores of Zh\u2192En translation\n\nAlgorithm NIST04 NIST05 NIST06 NIST08\nMbase\n26.21\nMdelib\n27.13\n\n32.74\n33.90\n\n34.96\n36.90\n\n34.57\n35.57\n\nTable 3: Case studies of Zh\u2192En translations. 
Note the \u201c......\u201d in the second example represents a common\nsentence \u201cthe two sides will discuss how to improve the implementation of the cease-\ufb01re agreement\u201d.\n\n[Source] Aiji shuo, zhongdong heping xieyi yuqi jiang you yige xinde jiagou .\n[Reference] Egypt says a new framework is expected to come into being for the Middle East\npeace agreement .\n[Base] egypt \u2019s middle east peace agreement is expected to have a new framework , he said .\n[First-pass] egypt \u2019s middle east peace agreement is expected to have a new framework , egypt said .\n[Second-pass] egypt says the middle east peace agreement is expected to have a new framework .\n[Source] Nuowei dashiguan zhichu, \"shuangfang jiang taolun ruhe gaijin luoshi tinghuo xieyi, zhe\nyeshi san nian lai shuangfang shouci zai ruci gao de cengji shang jinxing mianduimian tanpan\"\n[Reference] The Norwegian embassy pointed out that , \" Both sides will discuss how to improve the\nimplementation of the cease-\ufb01re agreement , which is the \ufb01rst time for both sides to have\nface-to-face negotiations at such a high level . \"\n[Base] \" ...... , which is the \ufb01rst time for the two countries to conduct face-to-face talks on the basis\nof a high level of three years , \" it said .\n[First-pass] \" ...... , which is the \ufb01rst time for the two countries to conduct face-to-face talks on the\nbasis of a high level of three years , \" the norwegian embassy said in a statement .\n[Second-pass] \" ...... , which is the \ufb01rst time in three years for the two countries to conduct\nface-to-face talks at such high level , \" the norwegian embassy said .\n\nIn the \ufb01rst example, the translation from both base model and \ufb01rst-pass decoder contains the phrase\negypt\u2019s middle east peace agreement, which is odd and inaccurate, given that an agreement cannot\nbelong to a single country as Egypt. 
As a comparison, the second-pass decoder re\ufb01nes such phrase into\na more natural and accurate one. i.e., egypt says the middle east peace agreement, by looking forward\nto the future translation phrase \u201cegypt said\u201d output by the \ufb01rst-pass decoder. On the other hand, the\nsecond-pass decoder outputs a sentence with correct tense, i.e., egypt says ... is .... However, the\ntwo sentences output by Mbase and the \ufb01rst-pass decoder are inconsistent in tense, whose structures\nare \u201c... is ..., egypt said \u201d. This problem is well addressed by the deliberation network, since the\nsecond-pass decoder can access the global information contained in the draft sequence generated by\nthe \ufb01rst-pass decoder, and therefore output a more consistent sentence.\nIn the second example, as shown in bold fonts, the phrase \u201cconduct face-to-face talks on the basis\nof a high level of three years\u201d from both base model and \ufb01rst-pass decoder carries all necessary\ninformation of its corresponding source segments, but apparently it is out-of-order and seems to be a\nsimple combination of words. The second-pass decoder re\ufb01nes such translation into a correct, and\nmore \ufb02uent one, by forwarding the sub phrase in three years to the position right after the \ufb01rst time.\nAt last we compare the decoding time of deliberation network with that of the RNNSearch. Based on\nthe Theano implementation, to translate 3003 English sentences to French, RNNSearch takes 964\nseconds while the deliberation network takes 1924 seconds. Indeed, the deliberation network takes\nroughly 2 times decoding time of RNNSearch, but can bring 1.7 points improvements in BLEU.\n\n3.2 Deep Models\n\nWe work on a deep LSTM model to further evaluate deliberation networks through the WMT\u201914\nEn\u2192Fr translation task. 
Compared to the shallow model, there are several different aspects: (1) We\nuse 34M sentence pairs from WMT\u201914 as training data, apply the BPE [21] techniques to split the\ntraining sentences into sub-word units and restrict the source and target sentence lengths within 64\nsubwords. The encoder and decoder share a common vocabulary containing 36k subwords. (2) All of\nE, D1 and D2 are 4-layer LSTMs with residual connections [9, 10]. The word embedding dimension\nand hidden node dimension are 512 and 1024 respectively. The dropout rate is set as 0.1. (3) We train\nthe standard encoder-decoder based deep model for about 25 days until convergence. Furthermore,\nwe leverage our recently proposed dual learning techniques [8, 33] to improve the model, which\ntakes another 7 days. We initialize the deliberation network in the same way as in Section 3.1.1. Then,\nwe train the deliberation network by Algorithm 1 for 10 days. When generating translations, we use\nbeam search with beam size 8.\n\nTable 4: Comparison between deliberation network and different deep NMT systems (En\u2192Fr).\n\nSystem | Configurations | BLEU\nGNMT [31] | Stacked LSTM (8-layer encoder + 8 layer decoder) + RL finetune | 39.92\nFairSeq [4] | Convolution (15-layer) encoder and (15-layer) decoder | 40.51\nTransformer [26] | Self-Attention + 6-layer encoder + 6-layer decoder | 41.0\nthis work | Stack LSTM (4-layer encoder and 4-layer decoder) | 39.51\nthis work | Stack 4-layer NMT + Dual Learning | 40.53\nthis work | Stack 4-layer NMT + Dual Learning + Deliberation Network | 41.50\n\nThe experimental results of applying deliberation network to the deep LSTM model are shown\nin Table 4. On En\u2192Fr translation task, the baseline of our implemented NMT system is 39.51.\nWith dual learning, we achieve a 40.53 BLEU score. After applying deliberation techniques, the\nBLEU score can be further improved to 41.50, which as far as we know, is a new single-model\nstate-of-the-art result for this task. 
This not only illustrates the effectiveness of the deliberation network\nagain, but also shows that even a strong model can still benefit from the deliberation\nprocess.\n\n4 Application to Text Summarization\n\nWe further verify the effectiveness of deliberation networks on text summarization, another\nreal-world application in which the encoder-decoder framework has been successful [19].\n\n4.1 Settings\n\nText summarization refers to using a short, abstractive sentence to summarize the major points of\na sentence or paragraph, which is typically much longer. The training, validation and test sets for the\ntask are extracted from the Gigaword Corpus [6]: for each selected article, the first sentence is used as\nthe source-side input and the title is used as the target-side output. We process the data in the same way as\nproposed in [20, 30], and obtain training/validation/test sets with roughly 189k/18k/10k sentence\npairs respectively. There are roughly 42k unique words in the source input and 19k unique words in\nthe target output, and we retain all of them as the vocabulary in the encoder-decoder models.\nThe model structure is the same as that used in Section 3.1 except that both the word embedding\ndimension and the hidden node size are reduced to 128. We use the Adadelta algorithm with a gradient clipping\nvalue of 5.0 to optimize the deliberation network. The mini-batch size is fixed at 32.\nThe evaluation measures are ROUGE-1, ROUGE-2 and ROUGE-L, which are all widely\nadopted evaluation metrics for text summarization [15]. ROUGE-N (N = 1, 2 in our setting) is an\nN-gram recall between a candidate summary and a set of reference summaries. ROUGE-L is a\nstatistic similar to ROUGE-N but based on longest common subsequences. When generating the titles, we\nuse beam search with beam size 10. 
For a thorough comparison, as in NMT, we add\ntwo more baselines besides the basic encoder-decoder model: the stacked-decoder model with 2\nlayers (Mdec\u00d72) and the review net with 4 reviewers (Mreviewer\u00d74).\n\n4.2 Results\n\nThe experimental results of text summarization are listed in Table 5. Again, the deliberation network\nachieves clear improvements over all the baselines. For example, in terms of ROUGE-2, it is 1.12 and\n0.96 points better than the stacked-decoder model and the review net respectively. Furthermore,\none may note a significant difference between NMT and text summarization: in NMT, the\nlengths of the input and output sequences are very close, but in text summarization, the input is extremely\nlong while the output is very short. The better results brought by deliberation networks show that\neven if the output sentence is short, it is helpful to include the deliberation process, which refines the\nlow-level draft produced by the first-pass decoder.\n\nTable 5: ROUGE-{1, 2, L} scores of text summarization\n\nAlgorithm | ROUGE-1 | ROUGE-2 | ROUGE-L\nMbase | 27.45 | 10.51 | 26.07\nMdec\u00d72 | 27.93 | 11.09 | 26.50\nMreviewer\u00d74 | 28.26 | 11.25 | 27.28\nMdelib | 30.90 | 12.21 | 29.09\n\n5 Conclusions and Future Work\n\nIn this work, we have proposed deliberation networks for sequence generation tasks, in which the\nfirst-pass decoder generates a raw sequence and the second-pass decoder\npolishes it. Experiments show that our method achieves much better results than several\nbaseline methods in both machine translation and text summarization, and achieves a new single-model\nstate-of-the-art result on WMT\u201914 English-to-French translation.\nThere are multiple promising directions to explore in the future. 
First, we will study how to apply\nthe idea of deliberation to tasks beyond sequence generation, such as improving the quality of images\ngenerated by GANs [5]. Second, we will study how to refine/polish different levels of a neural network,\nlike the hidden states in an RNN or the feature maps in a CNN. Third, we are curious about whether\nbetter sequences can be generated with more passes of decoders, i.e., refining a generated sequence\nagain and again. Fourth, we will study how to speed up the inference of deliberation networks.\n\nAcknowledgments\n\nThe authors would like to thank Yang Fan and Kaitao Song for implementing the basic deep neural machine\ntranslation model. This work is partially supported by the National Natural Science Foundation\nof China (Grant No. 61371192).\n\nReferences\n\n[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.\n\n[2] R. Chatterjee, J. G. de Souza, M. Negri, and M. Turchi. The FBK participation in the WMT 2016 automatic post-editing shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016.\n\n[3] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder\u2013decoder for statistical machine translation. In EMNLP, pages 1724\u20131734, 2014.\n\n[4] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.\n\n[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672\u20132680, 2014.\n\n[6] D. Graff and C. Cieri. English Gigaword. Linguistic Data Consortium, 2003.\n\n[7] D. He, H. Lu, Y. Xia, T. Qin, L. Wang, and T. Liu. 
Decoding with value networks for neural machine translation. In 31st Annual Conference on Neural Information Processing Systems (NIPS), 2017.\n\n[8] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820\u2013828, 2016.\n\n[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.\n\n[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630\u2013645. Springer, 2016.\n\n[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735\u20131780, Nov. 1997.\n\n[12] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In the annual meeting of the Association for Computational Linguistics, 2015.\n\n[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[14] J. Li, W. Monroe, and D. Jurafsky. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562, 2016.\n\n[15] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8, 2004.\n\n[16] J. Niehues, E. Cho, T.-L. Ha, and A. Waibel. Pre-translation for neural machine translation. In COLING, 2016.\n\n[17] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In the annual meeting of the Association for Computational Linguistics, pages 311\u2013318, 2002.\n\n[18] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.\n\n[19] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. 
In EMNLP, pages 379\u2013389, 2015.\n\n[20] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. ACL, 2015.\n\n[21] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In the annual meeting of the Association for Computational Linguistics, 2016.\n\n[22] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. In the annual meeting of the Association for Computational Linguistics, 2016.\n\n[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104\u20133112, 2014.\n\n[24] T. D. Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.\n\n[25] Z. Tu, Y. Liu, L. Shang, X. Liu, and H. Li. Neural machine translation with reconstruction. In AAAI, 2017.\n\n[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.\n\n[27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3156\u20133164, 2015.\n\n[28] S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In ACL, 2016.\n\n[29] WMT\u201914. http://www.statmt.org/wmt14/translation-task.html.\n\n[30] L. Wu, L. Zhao, T. Qin, J. Lai, and T. Liu. Sequence prediction with unlabeled data by reward function learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 3098\u20133104, 2017.\n\n[31] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. 
Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.\n\n[32] Y. Xia, J. Bian, T. Qin, N. Yu, and T. Liu. Dual inference for machine learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 3112\u20133118, 2017.\n\n[33] Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T. Liu. Dual supervised learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3789\u20133798, 2017.\n\n[34] Y. Xia, F. Tian, T. Qin, N. Yu, and T. Liu. Sequence generation with target attention. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD), 2017.\n\n[35] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.\n\n[36] Z. Yang, Y. Yuan, Y. Wu, W. W. Cohen, and R. R. Salakhutdinov. Review networks for caption generation. In Advances in Neural Information Processing Systems, pages 2361\u20132369, 2016.\n\n[37] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.\n", "award": [], "sourceid": 1120, "authors": [{"given_name": "Yingce", "family_name": "Xia", "institution": "University of Science and Technology of China"}, {"given_name": "Fei", "family_name": "Tian", "institution": "Microsoft Research"}, {"given_name": "Lijun", "family_name": "Wu", "institution": "Sun Yat-sen University"}, {"given_name": "Jianxin", "family_name": "Lin", "institution": "USTC"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft Research"}, {"given_name": "Nenghai", "family_name": "Yu", "institution": "University of Science and Technology of China"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research"}]}