{"title": "Generative Neural Machine Translation", "book": "Advances in Neural Information Processing Systems", "page_first": 1346, "page_last": 1355, "abstract": "We introduce Generative Neural Machine Translation (GNMT), a latent variable architecture which is designed to model the semantics of the source and target sentences. We modify an encoder-decoder translation model by adding a latent variable as a language agnostic representation which is encouraged to learn the meaning of the sentence. GNMT achieves competitive BLEU scores on pure translation tasks, and is superior when there are missing words in the source sentence. We augment the model to facilitate multilingual translation and semi-supervised learning without adding parameters. This framework significantly reduces overfitting when there is limited paired data available, and is effective for translating between pairs of languages not seen during training.", "full_text": "Generative Neural Machine Translation\n\nHarshil Shah\u00b9\n\nDavid Barber\u00b9,\u00b2,\u00b3\n\n\u00b9University College London\n\n\u00b2Alan Turing Institute\n\n\u00b3reinfer.io\n\nAbstract\n\nWe introduce Generative Neural Machine Translation (GNMT), a latent variable\narchitecture which is designed to model the semantics of the source and target\nsentences. We modify an encoder-decoder translation model by adding a latent\nvariable as a language agnostic representation which is encouraged to learn the\nmeaning of the sentence. GNMT achieves competitive BLEU scores on pure translation\ntasks, and is superior when there are missing words in the source sentence.\nWe augment the model to facilitate multilingual translation and semi-supervised\nlearning without adding parameters. This framework significantly reduces overfitting\nwhen there is limited paired data available, and is effective for translating\nbetween pairs of languages not seen during training.\n\n1 Introduction\n\nData from multiple modalities (e.g. 
an image and a caption) can be used to learn representations with\nmore understanding about their environment compared to when only a single modality (an image\nor a caption alone) is available. Such representations can then be included as components in larger\nmodels which may be responsible for several tasks. However, it can often be expensive to acquire\nmulti-modal data even when large amounts of unsupervised data may be available.\n\nIn the machine translation context, the same sentence expressed in different languages offers the\npotential to learn a representation of the sentence\u2019s meaning. Variational Neural Machine Translation\n(VNMT) [Zhang et al., 2016] attempts to achieve this by augmenting a baseline model with\na latent variable intended to represent the underlying semantics of the source sentence, achieving\nhigher BLEU scores than the baseline. However, because the latent representation is dependent on\nthe source sentence, it can be argued that it doesn\u2019t serve a different purpose to the encoder hidden\nstates of the baseline model. As we demonstrate empirically, this model tends to ignore the latent\nvariable, so it is not clear that it learns a useful representation of the sentence\u2019s meaning.\n\nIn this paper, we introduce Generative Neural Machine Translation (GNMT), whose latent variable\nis more explicitly designed to learn the sentence\u2019s semantic meaning. Unlike the majority of neural\nmachine translation models (which model the conditional distribution of the target sentence given\nthe source sentence), GNMT models the joint distribution of the target sentence and the source\nsentence. To do this, it uses the latent variable as a language agnostic representation of the sentence,\nwhich generates text in both the source and target languages. By giving the latent representation\nresponsibility for generating the same sentence in multiple languages, it is encouraged to learn the\nsemantic meaning of the sentence. 
We show that GNMT achieves competitive BLEU scores on\ntranslation tasks, relies heavily on the latent variable and is particularly effective at translating long\nsentences. When there are missing words in the source sentence, GNMT is able to use its learned\nrepresentation to infer what those words may be and produce good translations accordingly.\n\nWe then extend GNMT to facilitate multilingual translation whilst sharing parameters across languages.\nThis is achieved by adding two categorical variables to the model in order to indicate the\nsource and target languages respectively. We show that this parameter sharing helps to reduce the\nimpact of overfitting when the amount of available paired data is limited, and proves to be effective\nfor translating between pairs of languages which were not seen during training. We also show that\nby setting the source and target languages to the same value, monolingual data can be leveraged to\nfurther reduce the impact of overfitting in the limited paired data context, and to provide significant\nimprovements for translating between previously unseen language pairs.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Model\n\nNotation: $x$ denotes the source sentence (with number of words $T_x$) and $y$ denotes the target\nsentence (with number of words $T_y$). $e(v)$ is the embedding of word $v$.\n\nGNMT models the joint probability of the target sentence and the source sentence, i.e. $p(x, y)$, by\nusing a latent variable $z$ as a language agnostic representation of the sentence. The factorization of\nthe joint distribution is shown in equation (1) and graphically in figure 1.\n\n$$p_\\theta(x, y, z) = p(z) \\, p_\\theta(x \\mid z) \\, p_\\theta(y \\mid z, x) \\tag{1}$$\n\nThis architecture means that $z$ models the commonality between the source and target sentences,\nwhich is the semantic meaning. We use a Gaussian prior: $p(z) = \\mathcal{N}(0, I)$. 
$\\theta$ represents the set of\nweights of the neural networks that govern the conditional distributions of $x$ and $y$.\n\nFigure 1: The GNMT graphical model.\n\n2.1 Generative process\n\n2.1.1 Source sentence\nTo compute $p_\\theta(x \\mid z)$, we use a model similar to that presented by Bowman et al. [2016]. The conditional\nprobabilities, for $t = 1, \\ldots, T_x$, are:\n\n$$p(x_t = v_x \\mid x_1, \\ldots, x_{t-1}, z) \\propto \\exp((W_x h_t^x) \\cdot e(v_x)) \\tag{2}$$\n\nwhere $h_t^x$ is computed as:\n\n$$h_t^x = \\mathrm{LSTM}(h_{t-1}^x, z, e(x_{t-1})) \\tag{3}$$\n\n2.1.2 Target sentence\nTo compute $p_\\theta(y \\mid x, z)$, we modify RNNSearch [Bahdanau et al., 2015] to accommodate the latent\nvariable $z$. Firstly, the source sentence is encoded using a bidirectional LSTM. The encoder hidden\nstates for $t = 1, \\ldots, T_x$ are computed as:\n\n$$h_t^{\\mathrm{enc}} = \\overleftrightarrow{\\mathrm{LSTM}}(h_{t \\pm 1}^{\\mathrm{enc}}, z, e(x_t)) \\tag{4}$$\n\nThen, the conditional probabilities, for $t = 1, \\ldots, T_y$, are:\n\n$$p(y_t = v_y \\mid y_1, \\ldots, y_{t-1}, x, z) \\propto \\exp((W_y h_t^{\\mathrm{dec}}) \\cdot e(v_y)) \\tag{5}$$\n\nwhere $h_t^{\\mathrm{dec}}$ is computed as:\n\n$$h_t^{\\mathrm{dec}} = \\mathrm{LSTM}(h_{t-1}^{\\mathrm{dec}}, z, e(y_{t-1}), c_t) \\tag{6}$$\n\n$$c_t = \\sum_{s=1}^{T_x} \\alpha_{s,t} \\, h_s^{\\mathrm{enc}} \\tag{7}$$\n\n$$\\alpha_{s,t} = \\frac{\\exp(W_\\alpha [h_{t-1}^{\\mathrm{dec}}, h_s^{\\mathrm{enc}}])}{\\sum_{r=1}^{T_x} \\exp(W_\\alpha [h_{t-1}^{\\mathrm{dec}}, h_r^{\\mathrm{enc}}])} \\tag{8}$$\n\nAlgorithm 1 Generating translations\n\nMake an initial \u2018guess\u2019 for the target sentence $y$. 
This is made randomly, according to a uniform\ndistribution.\nwhile not converged do\n  E-like step: Sample $\\{z^{(s)}\\}_{s=1}^{S}$ from the variational distribution $q_\\phi(z \\mid x, y)$, where $y$ is the\n  latest setting of the target sentence.\n  M-like step: Choose the words in $y$ to maximize $\\frac{1}{S} \\sum_{s=1}^{S} \\left[ \\log p(z^{(s)}) + \\log p(x \\mid z^{(s)}) + \\log p_\\theta(y \\mid z^{(s)}, x) \\right]$ using beam search.\nend while\n\n2.2 Training\n\nTo learn the parameters $\\theta$ of the model, we use stochastic gradient variational Bayes (SGVB)\nto perform approximate maximum likelihood estimation [Kingma and Welling, 2014, Rezende\net al., 2014]. To do this, we parameterize a Gaussian inference distribution $q_\\phi(z \\mid x, y) = \\mathcal{N}(\\mu_\\phi(x, y), \\Sigma_\\phi(x, y))$.\nThis allows us to maximize the following lower bound on the log likelihood:\n\n$$\\log p(x, y) \\geq \\mathbb{E}_{q_\\phi(z \\mid x, y)} \\left[ \\log \\frac{p_\\theta(x, y, z)}{q_\\phi(z \\mid x, y)} \\right] \\equiv \\mathcal{L}(x, y) \\tag{9}$$\n\nAs per VNMT, for the functions $\\mu_\\phi(x, y)$ and $\\Sigma_\\phi(x, y)$, we first encode the source and target\nsentences using a bidirectional LSTM. 
For $i \\in \\{x, y\\}$ and $t = 1, \\ldots, T_i$:\n\n$$h_t^{\\mathrm{inf},i} = \\overleftrightarrow{\\mathrm{LSTM}}(h_{t \\pm 1}^{\\mathrm{inf},i}, e(i_t)) \\quad \\text{where} \\quad i_t = \\begin{cases} x_t & \\text{if } i = x \\\\ y_t & \\text{if } i = y \\end{cases} \\tag{10}$$\n\nWe then concatenate the averages of the two sets of hidden states, and use this vector to compute the\nmean and variance of the Gaussian distribution:\n\n$$h^{\\mathrm{inf}} = \\left[ \\frac{1}{T_x} \\sum_{t=1}^{T_x} h_t^{\\mathrm{inf},x}, \\; \\frac{1}{T_y} \\sum_{t=1}^{T_y} h_t^{\\mathrm{inf},y} \\right] \\tag{11}$$\n\n$$q_\\phi(z \\mid x, y) = \\mathcal{N}(W_\\mu h^{\\mathrm{inf}}, \\mathrm{diag}(\\exp(W_\\Sigma h^{\\mathrm{inf}}))) \\tag{12}$$\n\n2.3 Generating translations\n\nOnce the model has been trained (i.e. $\\theta$ and $\\phi$ are fixed), given a source sentence $x$, we want to\nfind the target sentence $y$ which maximizes $p(y \\mid x) = \\int p_\\theta(y \\mid z, x) \\, p(z \\mid x) \\, dz$. However, this integral\nis intractable and so $p(y \\mid x)$ cannot be easily computed. 
Instead, because $\\arg\\max_y p(y \\mid x) = \\arg\\max_y p(x, y)$,\nwe can perform approximate maximization by using a procedure inspired by the\nEM algorithm [Neal and Hinton, 1998]. We increase a lower bound on $\\log p(x, y)$ by iterating\nbetween an E-like step and an M-like step, as described in algorithm 1.\n\n2.4 Translating with missing words\n\nUnlike architectures which model the conditional probability of the target sentence given the source\nsentence, $p(y \\mid x)$, GNMT is naturally suited to performing translation when there are missing words\nin the source sentence, because it can use the latent representation to infer what those missing words\nmay be.\n\nGiven a source sentence with visible words $x_{\\mathrm{vis}}$ and missing words $x_{\\mathrm{miss}}$, we want to find the\nsettings of $x_{\\mathrm{miss}}$ and $y$ which maximize $p(x_{\\mathrm{miss}}, y \\mid x_{\\mathrm{vis}})$. However, this quantity is intractable as it\nsuffers from a similar issue to that described in section 2.3. Therefore, we use a procedure similar\nto algorithm 1, increasing a lower bound on $\\log p(x_{\\mathrm{vis}}, x_{\\mathrm{miss}}, y)$, as described in algorithm 2.\n\n2.5 Multilingual translation\n\nWe extend GNMT to facilitate multilingual translation, referring to this version of the model as\nGNMT-MULTI. We add two categorical variables to GNMT, $l_x$ and $l_y$ (encoded as one-hot vectors),\n\nAlgorithm 2 Translating when there are missing words\n\nMake an initial \u2018guess\u2019 for the target sentence $y$ and the missing words in the source sentence\n$x_{\\mathrm{miss}}$. 
These are made randomly, according to a uniform distribution.\nwhile not converged do\n  E-like step: Sample $\\{z^{(s)}\\}_{s=1}^{S}$ from the variational distribution $q_\\phi(z \\mid x, y)$, where $x$ is the latest\n  setting of the source sentence and $y$ is the latest setting of the target sentence.\n  M-like step (1): Choose the missing words in the source sentence $x_{\\mathrm{miss}}$ to maximize\n  $\\frac{1}{S} \\sum_{s=1}^{S} \\left[ \\log p(z^{(s)}) + \\log p_\\theta(x_{\\mathrm{vis}}, x_{\\mathrm{miss}} \\mid z^{(s)}) \\right]$ using beam search.\n  M-like step (2): Choose the words in $y$ to maximize $\\frac{1}{S} \\sum_{s=1}^{S} \\left[ \\log p(z^{(s)}) + \\log p(x \\mid z^{(s)}) + \\log p_\\theta(y \\mid z^{(s)}, x) \\right]$\n  using beam search, where $x$ is the latest setting of the source sentence.\nend while\n\nwhich indicate what the source and target languages are respectively. The joint distribution is:\n\n$$p_\\theta(x, y, z \\mid l_x, l_y) = p(z) \\, p_\\theta(x \\mid z, l_x) \\, p_\\theta(y \\mid z, x, l_x, l_y) \\tag{13}$$\n\nThis structure allows for parameters to be shared regardless of the input and output languages, and\nwhen the amount of available paired translation data is limited, this parameter sharing can significantly\nmitigate the risk of overfitting. The forms of the neural networks in GNMT-MULTI are\nidentical to those in GNMT (as described in sections 2.1 and 2.2), except that $l_x$ and $l_y$ are now\nconcatenated with the embeddings $e(x_t)$ and $e(y_t)$ respectively.\n\n2.6 Semi-supervised learning\n\nMonolingual data can be used within the GNMT-MULTI framework to perform semi-supervised\nlearning. This is simply done by setting the source and target language variables $l_x$ and $l_y$ to the\nsame value, in which case the model must attempt to reconstruct the input sentence, rather than\ntranslate it. In section 4, we show that when the amount of available paired translation data is\nlimited, using monolingual data in this way further reduces overfitting compared to cross-language\nparameter sharing alone. 
Note that we are not concerned about the encoder simply copying the\nsentence across to the decoder, because the cross-language parameter sharing prevents this.\n\n3 Related work\n\nWhilst there have been many attempts at designing generative models of text [Bowman et al.,\n2016, Dieng et al., 2017, Yang et al., 2017], their usage for translation has been limited. Most\nclosely related to our work is Variational Neural Machine Translation (VNMT) [Zhang et al.,\n2016], which introduces a latent variable $z$ with the aim of capturing the source sentence\u2019s\nsemantics. It models the conditional probability of the target sentence given the source sentence as\n$p(y \\mid x) = \\int p_\\theta(y \\mid z, x) \\, p_\\theta(z \\mid x) \\, dz$. The authors find that VNMT achieves improvements over\nmodeling $p(y \\mid x)$ directly (i.e. without a latent variable). The primary difference compared to our work\nis that VNMT does not model the probability distribution of the source sentence. We believe that\nlearning the joint distribution is a more difficult task than learning the conditional; however, this is\nnot without benefit because when learning the joint distribution, the latent variable is more explicitly\nencouraged to learn the semantic meaning, as shown in the examples in section 4. In addition,\nbecause the latent representation in VNMT is dependent on the source sentence, it is not clear that it serves a\ndifferent purpose to the encoder hidden states.\n\nAlso related is the work of Shu et al. [2017], which presents an approach for using unlabeled data for\nconditional density estimation. The authors propose a hybrid framework that regularizes the conditionally\ntrained parameters towards the jointly trained parameters. Experiments on image modeling\ntasks show improvements over conditional training alone.\n\nIn work similar to GNMT-MULTI, Johnson et al. 
[2017] perform multilingual translation whilst\nsharing parameters by prepending, to the source sentence, a string indicating the target language.\nUnlike GNMT-MULTI, this approach does not indicate the source language.\n\nThere have also been various attempts to leverage monolingual data to improve translation models.\nZhang and Zong [2016] use source language monolingual data and Sennrich et al. [2016] use\ntarget language monolingual data to create a synthetic dataset with which to augment the original\npaired dataset. This is done by passing the monolingual data through a pre-trained translation model\n(trained using the original paired data), thus creating a new synthetic dataset of paired data. This is\ncombined with the original paired data to create a new, larger dataset which is used to train a new\nmodel. In both papers, the authors find that their methods obtain improvements over using paired\ndata alone. However, these procedures do not directly integrate monolingual data into a single,\nunified model.\n\n4 Experiments\n\nIn this section we evaluate the effectiveness of GNMT and GNMT-MULTI on the 6 permutations of\nlanguage pairs between English (EN), Spanish (ES) and French (FR), i.e. EN \u2192 ES, ES \u2192 EN,\nEN \u2192 FR, etc. We also train GNMT-MULTI in a semi-supervised manner, as described in section 2.6,\nand refer to this as GNMT-MULTI-SSL. We compare the performance of GNMT, GNMT-MULTI,\nand GNMT-MULTI-SSL against that of VNMT, which we believe to be the most closely related\nmodel to our work.\n\n4.1 Dataset\n\nWe use paired data provided by the Multi UN corpus\u00b9 [Tiedemann, 2012]. We train each model\nwith a small, medium and large amount of paired data, corresponding to 40K, 400K and 4M paired\nsentences respectively. For each language pair, we create validation sets of size 5K and test sets of\nsize 10K paired sentences respectively. 
For the monolingual data used to train GNMT-MULTI-SSL,\nwe use the News Crawl articles from 2009 to 2012, provided for the WMT\u201913 translation task. There\nare 20.9M, 2.7M and 4.5M monolingual sentences for EN, ES and FR respectively.\n\nPreprocessing: For each language, we convert all characters into lowercase and use a vocabulary\nof the 20,000 most common words from the paired data, replacing words outside of this vocabulary\nwith an unknown word token. We exclude sentences which have a proportion of unknown words\ngreater than 10% and which are longer than 50 words.\n\n4.2 Training\n\nWe optimize the ELBO, shown in equation (9), using stochastic gradient ascent. For all models, the\nlatent representation $z$ has 100 units, each of the RNN hidden states has 1,000 units, and the word\nembeddings are 300-dimensional. To ensure training is fast, we use only a single sample $z$ per data\npoint from the variational distribution at each iteration. We perform early stopping by evaluating the\nELBO on the validation set every 1,000 iterations. We implement both models in Python, using the\nTheano [Theano Development Team, 2016] and Lasagne [Dieleman et al., 2015] libraries.\n\n4.2.1 Optimization challenges\n\nThe ELBO from equation (9) can be expressed as:\n\n$$\\mathcal{L}(x, y) = \\mathbb{E}_{q_\\phi(z \\mid x, y)} [\\log p(x, y \\mid z)] - D_{\\mathrm{KL}} [q_\\phi(z \\mid x, y) \\,\\|\\, p(z)] \\tag{14}$$\n\nAs pointed out by Bowman et al. [2016], when training latent variable language models such as the\none described in section 2.1.1, the objective function encourages the model to set $q_\\phi(z \\mid x, y)$ equal\nto the prior $p(z)$. As a result, the KL divergence term in equation (14) collapses to 0 and the model\nignores the latent variable altogether. 
To address this, we use the following two techniques:\n\nKL divergence annealing: We multiply the KL divergence term by a constant weight, which we\nlinearly anneal from 0 to 1 over the first 50,000 iterations of training [Bowman et al., 2016, S\u00f8nderby\net al., 2016].\n\nFootnote 1: Whilst the Multi UN corpus forms part of the WMT\u201913 corpus, we did not use the full WMT\u201913 corpus\nsince it only provides translations between EN & ES and EN & FR, but not between ES & FR.\n\nTable 1: Test set BLEU scores on pure translation for models trained with varying amounts of paired\nsentences.\n\nPAIRED DATA | SYSTEM | EN\u2192ES | ES\u2192EN | EN\u2192FR | FR\u2192EN | ES\u2192FR | FR\u2192ES\n40K | VNMT | 12.45 | 12.30 | 12.20 | 12.98 | 12.19 | 13.44\n40K | GNMT | 13.55 | 12.84 | 12.47 | 13.84 | 13.26 | 14.95\n40K | GNMT-MULTI | 16.32 | 15.36 | 15.99 | 16.92 | 16.80 | 18.21\n40K | GNMT-MULTI-SSL | 23.44 | 22.25 | 20.88 | 20.99 | 22.65 | 24.51\n400K | VNMT | 33.27 | 31.96 | 27.71 | 27.69 | 28.76 | 31.22\n400K | GNMT | 33.87 | 32.75 | 28.55 | 28.98 | 29.41 | 31.33\n400K | GNMT-MULTI | 40.08 | 38.56 | 35.55 | 37.28 | 36.31 | 38.68\n400K | GNMT-MULTI-SSL | 43.96 | 41.63 | 37.37 | 39.66 | 38.09 | 40.79\n4M | VNMT | 44.10 | 43.03 | 38.06 | 38.56 | 35.28 | 40.27\n4M | GNMT | 44.52 | 43.83 | 37.97 | 38.44 | 35.96 | 40.55\n4M | GNMT-MULTI | 44.43 | 43.91 | 38.02 | 38.67 | 35.57 | 40.79\n4M | GNMT-MULTI-SSL | 45.94 | 45.08 | 39.41 | 40.69 | 38.97 | 42.05\n\nTable 2: Test set KL divergence values ($D_{\\mathrm{KL}} [q_\\phi(z \\mid x, y) \\,\\|\\, p(z)]$) for the model trained with 4M\npaired sentences, averaged across languages.\n\nSYSTEM | VNMT | GNMT | GNMT-MULTI | GNMT-MULTI-SSL\nD_KL | 1.104 | 5.581 | 9.661 | 10.915\n\nWord dropout: In equation (3), the dependence of the hidden state on the previous word means that\nthe RNN can often afford to ignore the latent variable whilst still maintaining syntactic consistency\nbetween words. To prevent this from happening, during training we randomly replace the word\nbeing passed to the next RNN hidden state with the unknown word token, as suggested by Bowman\net al. [2016]. 
This is parameterized by a drop rate, which we set to 30%.\nWord dropout significantly weakens translation performance for VNMT; therefore, we use KL divergence\nannealing when training both models, but only use word dropout when training GNMT.\n\n4.3 Results\n\n4.3.1 Translation\n\nThe procedure for generating translations using GNMT is described in algorithm 1. For VNMT, the\nconditional likelihood is $p(y \\mid x) = \\int p_\\theta(y \\mid z, x) \\, p_\\theta(z \\mid x) \\, dz$. This can be maximized by drawing a\nset of samples $\\{z^{(s)}\\}_{s=1}^{S}$ from $p_\\theta(z \\mid x)$ and then maximizing $\\frac{1}{S} \\sum_{s=1}^{S} p_\\theta(y \\mid z^{(s)}, x)$. This is done\napproximately, using beam search.\n\nWe report results on translation tasks in table 1. When trained with 40K and 400K paired sentences,\nGNMT has a small advantage over VNMT in terms of BLEU scores across all language\npairs. However, both models tend to overfit on these relatively small amounts of paired data. As\na result, GNMT-MULTI achieves much higher BLEU scores with both 40K and 400K paired sentences,\ndue to the parameter sharing between languages. Adding monolingual data produces yet\nanother significant increase in BLEU scores. In fact, GNMT-MULTI-SSL trained with only 400K\npaired sentences achieves performance comparable with GNMT trained with 4M paired sentences.\nEven with 4M paired sentences, adding monolingual data is helpful, with GNMT-MULTI-SSL\noutperforming the other models.\n\nIn table 2, we report the values of the KL divergence term $D_{\\mathrm{KL}} [q_\\phi(z \\mid x, y) \\,\\|\\, p(z)]$ for the model\ntrained with 4M paired sentences. The higher values for GNMT, GNMT-MULTI and\nGNMT-MULTI-SSL clearly indicate that these models are placing higher reliance on the latent variable\nthan is VNMT.\n\nBLEU by sentence length: It is argued by Tu et al. 
[2016] that attention based translation models\nsuffer \u2018coverage\u2019 issues, particularly on long sentences, because they do not keep track of the number\nof times each source word is translated to a target word.\n\nFigure 2: Test set BLEU scores on pure translation, by sentence length, evaluated using the model\nparameters trained with 4M paired sentences.\n\nTable 3: An example of a long sentence translation, showing the ability of GNMT to capture long\nrange semantics.\n\nSOURCE: DANS CE D\u00c9CRET, IL MET EN LUMI\u00c8RE LES PRINCIPALES R\u00c9ALISATIONS DE LA\nR\u00c9PUBLIQUE D\u2019OUZB\u00c9KISTAN DANS LE DOMAINE DE LA PROTECTION ET DE LA PROMOTION\nDES DROITS DE L\u2019HOMME ET APPROUVE LE PROGRAMME D\u2019ACTIVIT\u00c9S MARQUANT\nLE SOIXANTI\u00c8ME ANNIVERSAIRE DE LA D\u00c9CLARATION UNIVERSELLE DES DROITS DE\nL\u2019HOMME.\n\nTARGET: THE DECREE HIGHLIGHTS MAJOR ACHIEVEMENTS BY THE REPUBLIC OF UZBEKISTAN IN\nTHE FIELD OF PROTECTION AND PROMOTION OF HUMAN RIGHTS AND APPROVES THE PROGRAMME\nOF ACTIVITIES DEVOTED TO THE SIXTIETH ANNIVERSARY OF THE UNIVERSAL\nDECLARATION OF HUMAN RIGHTS.\n\nVNMT: IN THIS REGARD, THE DECREE HIGHLIGHTS THE MAIN ACHIEVEMENTS OF THE UZBEK\nREPUBLIC IN THE PROMOTION AND PROMOTION AND PROTECTION OF HUMAN RIGHTS AND\nSUPPORTS THE ACTIVITIES OF THE SIXTIETH ANNIVERSARY OF THE HUMAN RIGHTS OF\nHUMAN RIGHTS.\n\nGNMT: IN THIS DECREE, IT HIGHLIGHTS THE MAIN ACHIEVEMENTS OF THE REPUBLIC OF UZBEKISTAN\nON THE PROTECTION AND PROMOTION OF HUMAN RIGHTS AND APPROVES THE ACTIVITIES\nOF THE SIXTIETH ANNIVERSARY OF THE UNIVERSAL DECLARATION OF HUMAN\nRIGHTS.\n\n
However, because the latent variable in\nGNMT is explicitly encouraged to model the sentence\u2019s semantics, it helps to alleviate these issues.\nThis is demonstrated in \ufb01gure 2 and in the example in table 3, which show that GNMT tends to\nperform better than VNMT on long sentences, reducing the problems of translating the same phrase\nmultiple times and neglecting to translate others at all.\n\n4.3.2 Missing word translation\n\nIn order to demonstrate that GNMT does indeed learn a useful representation of the sentence\u2019s\nsemantic meaning, we perform missing word translation (i.e. where the source sentence has missing\nwords). The model is forced to rely on its learned representation in order to infer what the missing\nwords could be, and then to produce a good translation accordingly.\nTo produce the missing word data, for each sentence we randomly sample a missing word mask\nwhere each word (independently) has a 30% chance of being missing. The procedure for generating\ntranslations using GNMT is described in algorithm 2. To generate translations using VNMT, we\nreplace the missing words in the source sentence with the unknown word token and then conduct\nthe same conditional likelihood maximization described in section 4.3.1. The results are reported in\ntable 4. From the BLEU scores, it is evident that GNMT has a signi\ufb01cant advantage over VNMT\nin this scenario, thanks to the quality of its learned representations. We show an example missing\nword translation in table 5, where the difference in quality between GNMT and VNMT is clear.\n\n4.3.3 Unseen language pair translation\n\nBecause GNMT-MULTI shares parameters across languages, it should be naturally suited to per-\nforming translations between pairs of languages that it never saw during training. 
For both VNMT\nand GNMT, to translate, say, from English to Spanish, we first translate from English to French then\nfrom French to Spanish (because we assume the English to Spanish parameters are not available).\nFor GNMT-MULTI and GNMT-MULTI-SSL, we train new models where the respective language\n\nTable 4: Test set BLEU scores for missing word translation. We use the model parameters trained\nwith 4M paired sentences.\n\nSYSTEM EN \u2192 ES\n26.99\nVNMT\n33.23\nGNMT\n\nES \u2192 EN EN \u2192 FR\n23.79\n27.39\n29.84\n33.46\n\nFR \u2192 EN ES \u2192 FR\n22.46\n23.51\n29.83\n28.27\n\nFR \u2192 ES\n25.75\n33.09\n\nTable 5: A randomly sampled test set missing word translation from English to Spanish. The\nstruck-through words in the source sentence are considered missing.\n\nSOURCE: WE LOOK FORWARD AT THIS SESSION TO FURTHER MEASURES TO STRENGTHEN THE BEIJING\nDECLARATION AND PLATFORM FOR ACTION.\n\nTARGET: ESPERAMOS QUE EN ESTE PER\u00cdODO DE SESIONES SE ADOPTEN NUEVAS MEDIDAS PARA\nCONSOLIDAR LA DECLARACI\u00d3N Y LA PLATAFORMA DE ACCI\u00d3N DE BEIJING.\n\nVNMT: CONSIDERAMOS QUE EL PER\u00cdODO SE REFIERE A LAS MEDIDAS DE FORTALECIMIENTO DE\nLA PLATAFORMA DE ACCI\u00d3N DE BEIJING.\n\nGNMT: ESPERAMOS CON INTER\u00c9S EN ESTE PER\u00cdODO DE SESIONES UN EXAMEN DE MEDIDAS PARA\nFORTALECER LA DECLARACI\u00d3N Y LA PLATAFORMA DE ACCI\u00d3N DE BEIJING.\n\nTable 6: Test set BLEU scores for unseen pair translation. We use the VNMT and GNMT parameters\ntrained with 4M paired sentences. For GNMT-MULTI and GNMT-MULTI-SSL, we train new\nmodels with 4M paired sentences, but with the respective language pairs excluded during training.\n\nSYSTEM\nVNMT\nGNMT\nGNMT-MULTI\nGNMT-MULTI-SSL\n\nEN \u2192 ES\n35.58\n35.35\n36.72\n38.80\n\n(EN, ES) UNSEEN\n\n(EN, FR) UNSEEN\n\n(ES, FR) UNSEEN\n\nES \u2192 EN EN \u2192 FR\n31.34\n33.59\n31.55\n33.76\n32.81\n35.05\n37.43\n34.79\n\nFR \u2192 EN ES \u2192 FR\n32.31\n31.95\n32.39\n31.38\n32.94\n32.62\n34.98\n33.57\n\nFR \u2192 
ES\n35.86\n35.85\n36.77\n38.11\n\npairs are excluded during training. Once trained, we can directly translate from the source to the\ntarget language without having to translate to an intermediate language first.\n\nIn table 6, we report results on translation between previously unseen language pairs. In this context,\nVNMT and GNMT perform similarly in terms of BLEU scores. However, both models are consistently\noutperformed by GNMT-MULTI (albeit only by a small amount). Using monolingual data is\nvery effective in this context, with GNMT-MULTI-SSL outperforming all other models.\n\n5 Discussion and future work\n\nIn this paper, we have introduced Generative Neural Machine Translation (GNMT), a latent variable\narchitecture which aims to model the semantic meaning of the source and target sentences. For\npure translation tasks, GNMT performs competitively with a comparable conditional model, places\nhigher reliance on the latent variable and achieves higher BLEU scores when translating long sentences.\nWhen there are missing words in the source sentence, GNMT has superior performance.\n\nWe extend GNMT to facilitate multilingual translation without adding parameters to the model.\nThis parameter sharing reduces the impact of overfitting when the amount of available paired data\nis limited, and proves to be effective for translating between pairs of languages which were not seen\nduring training. We also show that this architecture can be used to leverage monolingual data, which\nfurther reduces the impact of overfitting in the limited paired data context, and provides significant\nimprovements for translating between previously unseen language pairs.\n\nWhilst we chose to factorize the joint distribution as per equation (1), this was not the only option we\nconsidered. 
The primary alternative was to use the factorization $p_\\theta(x, y, z) = p(z) \\, p_\\theta(x \\mid z) \\, p_\\theta(y \\mid z)$;\none could argue that this is in fact more natural for learning sentence semantics, since $z$ wouldn\u2019t\nbe able to rely on knowing $x$ explicitly to help generate $y$ through $p_\\theta(y \\mid x, z)$. However, experimentally\nwe found that the model struggled to generate grammatically coherent translations which also\nretained the source sentence\u2019s meaning.\n\nWe have shown that the idea of using the same sentence in different languages allows for a useful\nlatent representation to be learned. Using these sentence representations could be very promising for\nuse in downstream tasks where \u2018understanding\u2019 of the environment would be helpful, e.g. question\nanswering, dialog generation, etc.\n\nAcknowledgments\n\nThis work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.\n\nReferences\n\nD. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.\n\nS. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating Sentences from a Continuous Space. In Conference on Computational Natural Language Learning (CoNLL), 2016.\n\nS. Dieleman, J. Schl\u00fcter, C. Raffel, E. Olson, S. K. S\u00f8nderby, et al. Lasagne: First release, 2015.\n\nA. B. Dieng, C. Wang, J. Gao, and J. Paisley. TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency. In International Conference on Learning Representations, 2017.\n\nM. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Vi\u00e9gas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google\u2019s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5:339\u2013351, 2017.\n\nD. P. Kingma and M. Welling. 
Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.\n\nR. M. Neal and G. E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants. In Learning in Graphical Models, pages 355\u2013368. Springer Netherlands, 1998.\n\nD. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1278\u20131286. PMLR, 2014.\n\nR. Sennrich, B. Haddow, and A. Birch. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 86\u201396. Association for Computational Linguistics, 2016.\n\nR. Shu, H. H. Bui, and M. Ghavamzadeh. Bottleneck Conditional Density Estimation. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3164\u20133172. PMLR, 2017.\n\nC. K. S\u00f8nderby, T. Raiko, L. Maal\u00f8e, S. K. S\u00f8nderby, and O. Winther. Ladder Variational Autoencoders. In Advances in Neural Information Processing Systems 29, pages 3738\u20133746, 2016.\n\nTheano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint, arXiv:1605.02688, 2016.\n\nJ. Tiedemann. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), 2012.\n\nZ. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li. Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 76\u201385. Association for Computational Linguistics, 2016.\n\nZ. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. 
Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3881\u20133890. PMLR, 2017.\n\nB. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang. Variational Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521\u2013530. Association for Computational Linguistics, 2016.\n\nJ. Zhang and C. Zong. Exploiting Source-side Monolingual Data in Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1535\u20131545. Association for Computational Linguistics, 2016.", "award": [], "sourceid": 695, "authors": [{"given_name": "Harshil", "family_name": "Shah", "institution": "UCL"}, {"given_name": "David", "family_name": "Barber", "institution": "University College London"}]}