{"title": "Levenshtein Transformer", "book": "Advances in Neural Information Processing Systems", "page_first": 11181, "page_last": 11191, "abstract": "Modern neural sequence generation models are built to either generate tokens step-by-step from scratch or (iteratively) modify a sequence of tokens bounded by a fixed length. In this work, we develop Levenshtein Transformer, a new partially autoregressive model devised for more flexible and amenable sequence generation. Unlike previous approaches, the basic operations of our model are insertion and deletion. The combination of them facilitates not only generation but also sequence refinement allowing dynamic length changes. We also propose a set of new training techniques dedicated at them, effectively exploiting one as the other's learning signal thanks to their complementary nature. Experiments applying the proposed model achieve comparable or even better performance with much-improved efficiency on both generation (e.g. machine translation, text summarization) and refinement tasks (e.g. automatic post-editing). We further confirm the flexibility of our model by showing a Levenshtein Transformer trained by machine translation can straightforwardly be used for automatic post-editing.", "full_text": "Levenshtein Transformer\n\nJiatao Gu\u2020, Changhan Wang\u2020, and Jake Zhao (Junbo)\u2021(cid:5)\n\n\u2020{jgu, changhan}@fb.com \u2021jakezhao@cs.nyu.edu\n\n\u2020Facebook AI Research\n\n\u2021New York University (cid:5)Tigerobo Inc.\n\nAbstract\n\nModern neural sequence generation models are built to either generate tokens\nstep-by-step from scratch or (iteratively) modify a sequence of tokens bounded\nby a \ufb01xed length.\nIn this work, we develop Levenshtein Transformer, a new\npartially autoregressive model devised for more \ufb02exible and amenable sequence\ngeneration. Unlike previous approaches, the basic operations of our model are\ninsertion and deletion. 
The combination of them facilitates not only generation\nbut also sequence refinement allowing dynamic length changes. We also propose\na set of new training techniques dedicated to them, effectively exploiting one as\nthe other\u2019s learning signal thanks to their complementary nature. Experiments\napplying the proposed model achieve comparable or even better performance\nwith much-improved efficiency on both generation (e.g. machine translation, text\nsummarization) and refinement tasks (e.g. automatic post-editing). We further\nconfirm the flexibility of our model by showing a Levenshtein Transformer trained\nby machine translation can straightforwardly be used for automatic post-editing. 1\n\n1 Introduction\n\nNeural sequence generation models are widely developed and deployed in tasks such as machine\ntranslation (Bahdanau et al., 2015; Vaswani et al., 2017). As we examine the current frameworks,\nthe most popular autoregressive models generate tokens step-by-step. If not better, recent non-\nautoregressive approaches (Gu et al., 2018; Kaiser et al., 2018; Lee et al., 2018) have proved it\npossible to perform generation within a much smaller number of decoding iterations.\nIn this paper, we propose Levenshtein Transformer (LevT), aiming to address the lack of flexibility of\nthe current decoding models. Notably, in the existing frameworks, the length of generated sequences\nis either fixed or monotonically increased as the decoding proceeds. This remains incompatible\nwith human-level intelligence where humans can revise, replace, revoke or delete any part of their\ngenerated text. Hence, LevT is proposed to bridge this gap by breaking the in-so-far standardized\ndecoding mechanism and replacing it with two basic operations: insertion and deletion.\nWe train the LevT using imitation learning. The resulting model contains two policies that are\nexecuted alternately. 
Empirically, we show that LevT achieves comparable or better results\nthan a standard Transformer model on machine translation and summarization, while maintaining\nthe efficiency advantages of parallel decoding, similarly to Lee et al. (2018). With this\nmodel, we argue that the decoding becomes more flexible. For example, when the decoder is given an\nempty token, it falls back to a normal sequence generation model. On the other hand, the decoder acts\nas a refinement model when the initial state is a low-quality generated sequence. Indeed, we show\nthat a LevT trained from machine translation is directly applicable to translation post-editing without\nany change. This would not be possible with any framework in the literature because generation and\nrefinement are treated as two different tasks due to the model\u2019s inductive bias.\n\n1 Code for reproducing this paper is released at https://github.com/pytorch/fairseq/tree/master/examples/nonautoregressive_translation\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOne crucial component in the LevT framework is the learning algorithm. We leverage the characteristics\nof insertion and deletion: they are complementary but also adversarial. The algorithm we propose\nis called \u201cdual policy learning\u201d. The idea is that when training one policy (insertion or deletion),\nwe use the output from its adversary at the previous iteration as input. An expert policy, on the\nother hand, is drawn to provide a correction signal. 
Although, in theory, this learning algorithm is\napplicable to other imitation learning scenarios where a dual adversarial policy exists, in this work we\nprimarily focus on a proof of concept: applying the algorithm to train the proposed LevT model.\nTo this end, we summarize the contributions as follows:\n\u2022 We propose Levenshtein Transformer (LevT), a new sequence generation model composed of the\ninsertion and deletion operations. This model achieves comparable or even better results than a\nstrong Transformer baseline in both machine translation and text summarization, but with much\nbetter efficiency (up to \u00d75 speed-up in terms of actual machine execution time);\n\u2022 We propose a corresponding learning algorithm under the theoretical framework of imitation\nlearning, tackling the complementary and adversarial nature of the dual policies;\n\u2022 We recognize our model as a pioneering attempt to unify sequence generation and refinement, thanks\nto its built-in flexibility. With this unification, we empirically validate the feasibility of applying a\nLevT model trained by machine translation directly to translation post-editing, without any change.\n\n2 Problem Formulation\n\n2.1 Sequence Generation and Refinement\n\nWe unify the general problems of sequence generation and refinement by casting them as a Markov\nDecision Process (MDP) defined by a tuple (Y, A, E, R, y0). We consider a setup consisting of an\nagent interacting with an environment E which receives the agent\u2019s editing actions and returns the\nmodified sequence. We define Y = V^{Nmax} as a set of discrete sequences up to length Nmax where V\nis a vocabulary of symbols. At every decoding iteration, the agent receives an input y, either drawn from\nscratch or an uncompleted generation, chooses an action a and gets a reward r. We use A to denote\nthe set of actions and R for the reward function. 
Generally the reward function R measures the\ndistance between the generation and the ground-truth sequence, R(y) = \u2212D(y, y\u2217), where D can be\nany distance measurement such as the Levenshtein distance (Levenshtein, 1965). It is crucial to\nincorporate y0 \u2208 Y into our formulation. y0 is the initial sequence the agent receives: when y0 is\nan already generated sequence from another system, the agent essentially learns to do refinement,\nwhile it falls back to generation if y0 is an empty sequence. The agent is modeled by a policy, \u03c0, that\nmaps the current generation to a probability distribution over A. That is, \u03c0 : Y \u2192 P(A).\n\n2.2 Actions: Deletion & Insertion\n\nFollowing the above MDP formulation, with a subsequence y^k = (y1, y2, ..., yn), the two basic\nactions, deletion and insertion, are called to generate y^{k+1} = E(y^k, a^{k+1}). Here we let y1 and yn\nbe special boundary symbols marking the beginning and the end of the sequence (<s> and </s>), respectively. Since we mainly focus on the policy of a single round\nof generation, the superscripts are omitted in this section for simplicity. For conditional generation like\nMT, our policy also includes an input of source information x, which is also omitted here.\n\nDeletion The deletion policy reads the input sequence y, and for every token yi \u2208 y, the deletion\npolicy \u03c0del(d|i, y) makes a binary decision which is 1 (delete this token) or 0 (keep it). We additionally\nconstrain \u03c0del(0|1, y) = \u03c0del(0|n, y) = 1 to avoid breaking the sequence boundaries. The deletion\nclassifier can also be seen as a fine-grained discriminator as used in GANs (Goodfellow et al., 2014),\nwhere we predict \u201cfake\u201d or \u201creal\u201d labels for every predicted token.\n\nInsertion In this work, it is slightly more complex to build the atomic insertion operation because it involves\ntwo phases, placeholder prediction and token prediction, so that it is able to insert multiple tokens\nat the same slot. 
First, among all the possible insertion slots (yi, yi+1) in y, \u03c0plh(p|i, y) predicts the\npossibility of adding one or several placeholders. Then, for every placeholder predicted as\nabove, a token prediction policy \u03c0tok(t|i, y) replaces the placeholders with actual tokens in the\nvocabulary. The two-stage insertion process can also be viewed as a hybrid of Insertion Transformer (Stern\net al., 2019) and masked language model (MLM, Devlin et al., 2018; Ghazvininejad et al., 2019).\n\nFigure 1: The illustration of the proposed Levenshtein Transformer decoder for one refinement\niteration. The same architecture can be applied for three different tasks with specific classifiers. For\nsimplicity, the encoder-decoder attention is omitted within each Transformer-Block.\n\nPolicy combination Recall that our two operations are complementary. Hence we combine them\nin an alternating fashion. For example, in sequence generation from the empty sequence, the insertion policy is first\ncalled, followed by deletion, and the two are then repeated until a stopping condition is fulfilled.\nIndeed, it is possible to leverage the parallelism in this combination. We essentially decompose\none iteration of our sequence generator into three phases: \u201cdelete tokens, insert placeholders,\nreplace placeholders with new tokens\u201d. Within each stage, all operations are performed in parallel.\nMore precisely, given the current sequence y = (y0, . . . , yn), suppose the action to predict is\n\na = { d_0, ..., d_n ; p_0, ..., p_{n-1} ; t_0^1, ..., t_0^{p_0}, ..., t_{n-1}^{p_{n-1}} },\n\nwhose three groups are the deletion decisions d, the placeholder counts p and the tokens t. The policy for one iteration is:\n\n\u03c0(a|y) = \u220f_{d_i \u2208 d} \u03c0^{del}(d_i|i, y) \u00b7 \u220f_{p_i \u2208 p} \u03c0^{plh}(p_i|i, y') \u00b7 \u220f_{t_i \u2208 t} \u03c0^{tok}(t_i|i, y''),   (1)\n\nwhere y' = E(y, d) and y'' = E(y', p). We parallelize the computation within each sub-task.\n\n3 Levenshtein Transformer\n\nIn this section, we cover the specs of Levenshtein Transformer and the dual-policy learning algorithm.\nOverall our model takes a sequence of tokens (or none) as the input and then iteratively modifies it by\nalternating between insertion and deletion, until the two policies combined converge. We describe\nthe detailed learning and inference algorithms in the Appendix.\n\n3.1 Model\n\nWe use Transformer (Vaswani et al., 2017) as the basic building block. For conditional generation,\nthe source x is included in each TransformerBlock. The states from the l-th block are:\n\nh_0^{(l+1)}, h_1^{(l+1)}, ..., h_n^{(l+1)} =\n  E_{y_0} + P_0, E_{y_1} + P_1, ..., E_{y_n} + P_n,   if l = 0,\n  TransformerBlock_l(h_0^{(l)}, h_1^{(l)}, ..., h_n^{(l)}),   if l > 0,   (2)\n\nwhere E \u2208 R^{|V|\u00d7dmodel} and P \u2208 R^{Nmax\u00d7dmodel} are the token and position embeddings, respectively. We\nshow the illustration of the proposed LevT model for one refinement (delete, insert) in Figure 1.\n\nPolicy Classifiers The decoder outputs (h0, h1, ..., hn) are passed to three policy classifiers:\n\n1. Deletion Classifier: LevT scans over the input tokens (except for the boundaries) and predicts\n\u201cdeleted\u201d (1) or \u201ckept\u201d (0) for each token position,\n\n\u03c0^{del}_\u03b8(d|i, y) = softmax(h_i \u00b7 A^T), i = 1, ..., n \u2212 1,   (3)\n\nwhere A \u2208 R^{2\u00d7dmodel}, and we always keep the boundary tokens.\n\n2. Placeholder Classifier: LevT predicts the number of tokens to be inserted at every consecutive\nposition pair, by casting the representation to a categorical distribution:\n\n\u03c0^{plh}_\u03b8(p|i, y) = softmax(concat(h_i, h_{i+1}) \u00b7 B^T), i = 0, ..., n \u2212 1,   (4)\n\nwhere B \u2208 R^{(Kmax+1)\u00d7(2 dmodel)}. Based on the number (0 \u223c Kmax) of tokens it predicts, we insert\nthat number of placeholders at the current position. In our implementation, the placeholder\nis represented by a special token <PLH> (shown as [PLH] in Figure 1), which is reserved in the vocabulary.\n\n3. Token Classifier: following the placeholder prediction, LevT needs to fill in tokens replacing all\nthe placeholders. 
This is achieved by training a token predictor as follows:\n\n\u03c0^{tok}_\u03b8(t|i, y) = softmax(h_i \u00b7 C^T), \u2200 y_i = <PLH>,   (5)\n\nwhere C \u2208 R^{|V|\u00d7dmodel}, with parameters shared with the embedding matrix.\n\nWeight Sharing Our default implementation always assumes the three operations share the\nsame Transformer backbone, so that each benefits from features learned for the other operations. However, it is also\npossible to disable weight sharing and train separate decoders for each operation, which increases\nthe capacity of the model while not affecting the overall inference time.\n\nEarly Exit Although it is parameter-efficient to share the same Transformer architecture across\nthe above three heads, there is room for improvement as one decoding iteration requires three full\npasses of the network. To trade off performance against computational cost, we propose\nto perform early exit (attaching the classifier to an intermediate block instead of the last one) for \u03c0del\nand \u03c0plh to reduce computation while keeping \u03c0tok always based on the last block, considering that\ntoken prediction is usually more challenging than the other two tasks.\n\n3.2 Dual-policy Learning\n\nImitation Learning We use imitation learning to train the Levenshtein Transformer. Essentially\nwe let the agent imitate the behaviors that we draw from some expert policy \u03c0\u2217. The expert policy\nis derived from direct usage of ground-truth targets or a less noisy version filtered by sequence\ndistillation (Kim and Rush, 2016). 
The objective is to maximize the following expectation:\n\nE_{y_del \u223c d_{\u03c0\u0303_del}, d\u2217 \u223c \u03c0\u2217} \u2211_{d\u2217_i \u2208 d\u2217} log \u03c0^{del}_\u03b8(d\u2217_i | i, y_del)   [Deletion Objective]\n+ E_{y_ins \u223c d_{\u03c0\u0303_ins}, p\u2217, t\u2217 \u223c \u03c0\u2217} [ \u2211_{p\u2217_i \u2208 p\u2217} log \u03c0^{plh}_\u03b8(p\u2217_i | i, y_ins) + \u2211_{t\u2217_i \u2208 t\u2217} log \u03c0^{tok}_\u03b8(t\u2217_i | i, y'_ins) ],   [Insertion Objective]\n\nwhere y'_ins is the output after inserting placeholders p\u2217 into y_ins. \u03c0\u0303_del and \u03c0\u0303_ins are the roll-in policies, and\nwe repeatedly draw states (sequences) from their induced state distributions d_{\u03c0\u0303_del} and d_{\u03c0\u0303_ins}. These states\nare first passed to the expert policy, which returns its suggested actions, and then we\nmaximize the conditional log-likelihood over them. By definition, the roll-in policy determines the\nstate distribution fed to \u03c0\u03b8 during training. In this work, we have two strategies to construct the roll-in\npolicy: adding noise to the ground-truth or using the output from the adversary policy. Figure 2\nshows a diagram of this learning paradigm. We formally write down the roll-in policies as follows.\n\n1. Learning to Delete: we design \u03c0\u0303_del as a stochastic mixture between the initial input y0 and the\noutput of applying insertion from the model, with some mixture factor \u03b1 \u2208 [0, 1]:\n\nd_{\u03c0\u0303_del} = { y0 if u < \u03b1 else E(E(y', p\u2217), t\u0303), p\u2217 \u223c \u03c0\u2217, t\u0303 \u223c \u03c0\u03b8 },   (6)\n\nwhere u \u223c Uniform[0, 1] and y' is any sequence ready to insert tokens. t\u0303 is obtained by sampling\ninstead of doing argmax from Eq. (5).\n\n2. Learning to Insert: similar to the deletion step, we apply a mixture of the deletion output and\na random word dropping sequence of the ground-truth, inspired by recent advances in training\nmasked language models (Devlin et al., 2018). We use random dropping as a form of noise injection\nto encourage more exploration. Let \u03b2 \u2208 [0, 1] and u \u223c Uniform[0, 1],\n\nd_{\u03c0\u0303_ins} = { E(y0, d\u2217), d\u2217 \u223c \u03c0\u2217 if u < \u03b2 else E(y\u2217, d\u0303), d\u0303 \u223c \u03c0^{RND} }.   (7)\n\nFigure 2: The data-flow of learning.\n\nExpert Policy It is crucial to construct an expert policy in imitation learning which cannot be too\nhard or too weak to learn from. Specifically, we considered two types of experts:\n\n1. Oracle: One way is to build an oracle which has access to the ground-truth sequence. It returns the\noptimal actions a\u2217 (either oracle insertion p\u2217, t\u2217 or oracle deletion d\u2217) by:\n\na\u2217 = argmin_a D(y\u2217, E(y, a)).   (8)\n\nHere, we use the Levenshtein distance (Levenshtein, 1965)2 as D considering it is possible to\nobtain the action suggestions efficiently by dynamic programming.\n2. Distillation: We also explore using another teacher model to provide the expert policy, which is\nknown as sequence-level knowledge distillation (Kim and Rush, 2016). This technique has been\nwidely used in previous approaches for non-autoregressive generation (Gu et al., 2018). More\nprecisely, we first train an autoregressive teacher model using the same datasets and then replace\nthe ground-truth sequence y\u2217 by the beam-search result of this teacher-model, yAR. 
We use the\nsame mechanism to find the suggested option as using the ground-truth oracle.\n\n3.3 Inference\n\nGreedy Decoding At inference time, we apply the trained model over the initial sequence y0 for\nseveral iterations. We greedily pick the actions with the highest probabilities in Eq. (3), (4) and (5).\nMoreover, we find that using search (instead of greedy decoding) or noisy parallel decoding (Cho,\n2016) does not yield much gain in LevT. This observation is quite opposite to what has been widely\ndiscovered in autoregressive decoding. We hypothesize there may be two reasons leading to this\nissue: (i) The local optimal point brought by greedy decoding in autoregressive models is often far\nfrom the optimal point globally. Search techniques resolve this issue with tabularization. In our case,\nhowever, because LevT inserts or deletes tokens dynamically, it could easily revoke the tokens that\nare found sub-optimal and re-insert better ones; (ii) the log-probability of LevT is not a good metric\nto select the best output. However, we do expect to see further improvements if we include an external\nre-ranker, e.g. an autoregressive teacher model. We leave this discussion to future work.\n\nTermination Condition The decoding stops when one of the following two conditions is fulfilled:\n\n1. Looping: Generation is terminated if two consecutive refinement iterations return the same output,\nwhich can happen when (i) there are no words to delete or insert; or (ii) the agent gets stuck in an infinite loop,\ni.e. the insertion and deletion counter each other and keep looping.\n\n2. Timeout: We further set a maximum number of iterations (timeout) to guarantee a constant-time\ncomplexity in the worst case (Lee et al., 2018; Ghazvininejad et al., 2019).\n\nPenalty for Empty Placeholders Similar to Stern et al. (2019), we add a penalty on inserting \u201cempty\u201d\nplaceholders in decoding. 
Overly inserting \u201cempty\u201d placeholders may result in shorter output. A\npenalty term \u03b3 \u2208 [0, 3] is subtracted from the logits of 0 in Eq. (4).\n\n2We only consider the variant which only computes insertion and deletion. No substitution is considered.\n\n[Figure 2 diagram labels: Learn to Insert, Learn to Delete, Apply Deletion, Apply Insertion]\n\nTable 1: Generation quality (BLEU \u2191, ROUGE-1/2/L \u2191) and latency (ms \u2193) as well as the average\nnumber of decoder iterations (I_DEC) on the standard test sets for LevT and the autoregressive baseline\n(with both greedy and beam-search outputs). 
We show the results of LevT trained from both oracle\nand the autoregressive teacher model.\n\nDataset | Metric | Transformer (greedy) | Transformer (beam4) | Levenshtein Transformer (oracle) | Levenshtein Transformer (distillation)\nQuality \u2191\nRo-En | BLEU | 31.67 | 32.30 | 33.02 | 33.26\nEn-De | BLEU | 26.89 | 27.17 | 25.20 | 27.27\nEn-Ja | BLEU | 42.86 | 43.68 | 42.36 | 43.17\nGigaword | ROUGE-1 | 37.31 | 37.87 | 36.14 | 37.40\nGigaword | ROUGE-2 | 18.10 | 18.92 | 17.14 | 18.33\nGigaword | ROUGE-L | 34.65 | 35.13 | 34.34 | 34.51\nSpeed \u2193\nRo-En | Latency (ms) / I_DEC | 326 / 27.1 | 349 / 27.1 | 97 / 2.19 | 90 / 2.03\nEn-De | Latency (ms) / I_DEC | 343 / 28.1 | 369 / 28.1 | 126 / 2.88 | 92 / 2.05\nEn-Ja | Latency (ms) / I_DEC | 261 / 22.6 | 306 / 22.6 | 112 / 2.61 | 106 / 1.97\nGigaword | Latency (ms) / I_DEC | 116 / 10.1 | 149 / 10.1 | 98 / 2.32 | 84 / 1.73\n\nFigure 3: An example of WAT\u201917 En-Ja translation with two decoder iterations by LevT. We present\nthe inserted tokens in purple and deleted tokens with red strikethrough.\n\n4 Experiments\n\nWe validate the efficiency, effectiveness, and flexibility of Levenshtein Transformer extensively across\nthree different tasks, machine translation (MT), text summarization (TS) and automatic post-editing\n(APE) for machine translation, from both generation (\u00a74.1) and refinement (\u00a74.2) perspectives.\n\n4.1 Sequence Generation\n\nFor the sequence generation perspective, we evaluate the LevT model on MT and TS. As a special case,\nsequence generation assumes an empty y0 = <s></s> (boundary symbols only) as input and no initial deletion is applied.\n\nData & Evaluation We use three diversified language pairs for MT experiments: WMT\u201916\nRomanian-English (Ro-En)3, WMT\u201914 English-German (En-De)4 and WAT2017 Small-NMT\nEnglish-Japanese (En-Ja, Nakazawa et al., 2017)5. The TS experiments use preprocessed data\nfrom the Annotated English Gigaword (Gigaword, Rush et al., 2015)6. We learn byte-pair encod-\ning (BPE, Sennrich et al., 2016) vocabulary on tokenized data. 
Detailed dataset statistics can be\nfound in the Appendix. For evaluation metrics, we use BLEU (Papineni et al., 2002) for MT and\nROUGE-1,2,L (Lin, 2004) for TS. Before computing the BLEU scores for Japanese output, we\nalways segment Japanese words using KyTea 7.\n\nModels & Training We adopt the model architecture of Transformer base (Vaswani et al., 2017)\nfor the proposed LevT model and the autoregressive baseline. All the Transformer-based models are\ntrained on 8 Nvidia Volta GPUs with a maximum of 300K steps and a total batch size of around 65,536\ntokens per step (we leave more details to the Appendix).\n\n3 http://www.statmt.org/wmt16/translation-task.html\n4 http://www.statmt.org/wmt14/translation-task.html\n5 http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2017/snmt/index.html\n6 https://github.com/harvardnlp/sent-summary\n7 http://www.phontron.com/kytea/\n\n[Figure 3 content: English source \u201cThe latter coil generated 2.2 T in liquid helium .\u201d; the Japanese tokens are not recoverable from the extraction. Step labels: nothing to delete, insert (iteration 1); delete, insert (iteration 2); nothing to delete, nothing to insert, [Terminate].]\n\nTable 2: Ablation study for Levenshtein Transformer on En-De (a) and Ro-En (b) translation tasks.\n\n(a) Test BLEU for variant weight sharing. Baseline scores from Lee et al.\n(IT, 2018), Ghazvininejad et al. (MaskT, 2019) are included for reference.\n\nsharing | none | plh, ins | ins, del | all | IT | MaskT\noracle | \u2212 | 25.50 | \u2212 | 25.20 | \u2212 | \u2212\ndistill | 25.11 | 27.73 | 24.90 | 27.27 | 21.61 | 26.56\n\n(b) Test BLEU and deletion loss with variant roll-in policies.\n\nroll-in | BLEU | NLL(del)\nOurs | 33.02 | \u2248 0.202\nDAE | 31.78 | \u2248 0.037\n\nFigure 4: Plots showing the decoding efficiency of the proposed Levenshtein Transformer.\n(a) Average number of refinement iterations vs. length measured on a monolingual corpus. Most of the time, LevT decodes with a much smaller number (generally 1\u223c4) of iterations.\n(b) BLEU vs. speed-up for LevT across variant early-exits and the autoregressive baselines on the test set of Ro-En.\n\nOverall results We present our main results on the generation quality and decoding speed in\nTable 1. We measure the speed by the averaged generation latency of generating one sequence at a\ntime on a single Nvidia V100 GPU. To remove the implementation bias, we also present the number of\ndecoder iterations as a reference. It can be concluded that for both MT and summarization tasks, our\nproposed LevT achieves comparable and sometimes better generation quality compared to the strong\nautoregressive baseline, while LevT is much more efficient at decoding. A translation example is\nshown in Figure 3 and we leave more in the Appendix. Table 1 also shows that LevT trained with distillation\nneeds fewer decoder iterations than LevT trained with the oracle on every dataset (e.g. 2.05 vs. 2.88 on En-De).\nWe conjecture that this is because the output\nof the teacher model possesses fewer modes and is much less noisy than the real data. Consequently,\nLevT needs fewer iterations to converge to this expert policy.\n\nAblation on Efficiency As shown in Figure 4a, we plot the average number of iterations over\nthe length of input over a monolingual corpus. LevT learns to properly adjust the decoding time\naccordingly. We also explore the variants of \u201cearly exit\u201d where we denote LevT(m-n) as a model with\nm and n blocks for deletion (Eq. (3)) and placeholder prediction (Eq. (4)) respectively. Figure 4b\nshows that although it compromises the quality a bit, our model with early exit achieves up to \u00d75\nspeed-up (execution time) compared with a strong autoregressive Transformer using beam-search.\n\nAblation on Weight Sharing We also evaluate LevT with different weight sharing as noted in\n\u00a73.1. 
The results of models trained with oracle or distillation are listed in Table 2a. We observe that\nweight sharing is beneficial, especially between the two insertion operations (placeholder and token\nclassifiers). In addition, not sharing the deletion operation with insertion yields another +0.5 BLEU\nimprovement over the default setting, which may indicate that insertion and deletion capture\ncomplementary information, requiring the larger capacity obtained by learning them separately.\n\nImportance of mixture roll-in policy We perform an ablation study on the learning algorithm.\nSpecifically, we train a model with no mixing of the \u03c0_\u03b8 in Equation (6). We name this experiment\nDAE due to its resemblance to a denoising autoencoder. We closely follow a standard pipeline\nestablished by Lee et al. (2018). Table 2b shows this comparison. As we can see, the deletion loss\nfrom DAE is much smaller while the generation BLEU score is inferior. We conjecture that this is\ncaused by the mismatch between the states from the model and the roll-in policy in training the DAE.\n\nvs. Existing Refinement-based Models Table 2a also includes results from two relevant recent\nworks which also incorporate iterative refinement in non-autoregressive sequence generation. For fair\ncomparison, we use the result with length beam 1 from Ghazvininejad et al. (2019). Although both\napproaches use similar \u201cdenoising\u201d objectives to train the refinement process, our model explicitly\nlearns \u201cinsertion\u201d and \u201cdeletion\u201d in a dual-policy learning fashion, and outperforms both models.\n\n
[Figure 4 plot labels. (a) x-axis: sentence length; y-axis: number of iterations; curves: LevT Translation, Logarithm Time, Linear Time, Constant Time (4), Constant Time (10). (b) x-axis: speed-up; y-axis: BLEU scores; points: LevT(1-1), LevT(2-2), LevT(3-1), LevT(6-6), AT (beam4), AT (greedy).]\n\nTable 3: Performance (BLEU \u2191 / case-sensitive TER \u2193) comparison on APE. \u201cDo-Nothing\u201d represents\nthe results of the original MT system output; the autoregressive model uses beam size 4. For the\nproposed LevT, we use \u201cScratch\u201d to denote training from scratch on the APE triple data, and\n\u201cZero-shot\u201d to denote applying an MT pre-trained LevT model directly for post-editing tasks. The\nsame model can be further fine-tuned. All scores with underlines are from the model trained with an\nautoregressive teacher model (distillation) as the expert policy.\n\nDataset | MT system | Do-Nothing | Transformer | LevT Scratch | LevT Zero-shot | LevT Fine-tune\nSynthetic Ro-En | PBMT | 27.5 / 52.6 | 28.9 / 52.8 | 29.1 / 50.4 | 30.1 / 51.7 | \u2212\nSynthetic Ro-En | NMT | 26.2 / 56.5 | 26.9 / 55.6 | 28.3 / 53.6 | 28.0 / 55.8 | \u2212\nSynthetic En-De | PBMT | 15.4 / 69.4 | 22.8 / 61.0 | 25.8 / 56.6 | 16.5 / 69.6 | \u2212\nSynthetic En-Ja | NMT | 37.7 / 48.0 | 41.0 / 44.9 | 42.2 / 44.3 | 39.4 / 47.5 | \u2212\nReal En-De | PBMT | 62.5 / 24.5 | 67.2 / 22.1 | 66.9 / 21.9 | 59.6 / 28.7 | 70.1 / 19.2\n\nFigure 5: MT & PE performance vs. timeout iterations w/o oracle instructions.\n(a) Test set BLEU scores for WMT Ro-En\n(b) Test set TER scores for Real APE En-De\n\n4.2 Sequence Refinement\n\nWe evaluate LevT\u2019s capability of refining sequence outputs on the APE task. In this setting, inputs\nare pairs of the source sequence and the output of a black-box MT system.\n
The ground-truth outputs\nare from real human edits with expansion using synthetic data.\n\nDataset We follow a normal protocol in the synthetic APE experiments (Grangier and Auli, 2017):\nwe first train the input MT system on half of the dataset. Then we train a refinement model on\nthe other half based on the output produced by the MT model trained in the previous phase. For the\nreal APE tasks, we use the data from WMT17 Automatic Post-Editing Shared Task8 on En-De. It\ncontains both real PE triples and a large-scale synthetic corpus.\n\nModels & Evaluation The baseline model is a standard Transformer encoding the concatenation\nof the source and the MT system\u2019s output. For the MT system here, we want some imperfect systems\nthat need to be refined. We consider a statistical phrase-based MT system (PBMT, Koehn et al., 2003)\nand an RNN-based NMT system (Bahdanau et al., 2015). Apart from BLEU scores, we additionally\napply translation error rate (TER, Snover et al., 2006) as it is widely used in the APE literature.\n\n8 http://www.statmt.org/wmt17/ape-task.html\n\n[Figure 5 plot labels. (a) x-axis: Maximum Iterations; y-axis: BLEU scores; curves: Non-Autoregressive, LevT, LevT + oracle (D), LevT + oracle (D, P), Transformer (beam4). (b) x-axis: Maximum Iterations; y-axis: Translation Error Rate (TER); curves: MT Output, LevT, LevT + oracle (D), LevT + oracle (D, P), Transformer (beam4).]\n\nOverall results We show the major comparison in Table 3. When training from scratch, LevT\nconsistently improves the performance of the input MT system (either PBMT or NMT). It also\nachieves better performance than the autoregressive Transformer in most of the cases.\n\nPre-training on MT Thanks to the generality of the LevT model, we show it is feasible to directly\napply the LevT model trained by generation onto refinement tasks (in this case, MT and APE, respectively).\nWe name this a \u201czero-shot post-editing\u201d setting.\n
According to Table 3, the pre-trained MT models are\nalways capable of improving the initial MT input in the synthetic tasks.\nThe real APE task, however, differs quite a bit from the synthetic tasks because human translators\nnormally only fix a few spotted errors. This results in very high BLEU scores even for the\n\u201cDo-nothing\u201d column. However, the pre-trained MT model achieves the best results by fine-tuning on\nthe PE data, indicating that LevT is able to leverage the knowledge for generation and refinement.\n\nCollaborate with Oracle Thanks to the separation of insertion and deletion operations, LevT has\nbetter interpretability and controllability. For example, we test LevT\u2019s ability to adapt to oracle (e.g.\nhuman translators) instructions. As shown in Figure 5, both MT and PE tasks improve substantially\nif the oracle deletion is given at every step. This goes even further if the oracle provides both the correct\ndeletion and the number of placeholders to insert. It also sheds some light upon computer-assisted\ntext editing for human translators.\n\n5 Related Work\n\nNon-Autoregressive and Non-Monotonic Decoding Breaking the autoregressive constraints and\nmonotonic (left-to-right) decoding order in classic neural sequence generation systems has recently\nattracted much interest. Stern et al. (2018); Wang et al. (2018) designed partially parallel decoding\nschemes to output multiple tokens at each step. Gu et al. (2018) proposed a non-autoregressive\nframework using discrete latent variables, which was later adopted in Lee et al. (2018) as an iterative\nrefinement process. Ghazvininejad et al. (2019) introduced the masked language modeling objective\nfrom BERT (Devlin et al., 2018) to non-autoregressively predict and refine translations. Welleck et al.\n(2019); Stern et al. (2019); Gu et al.\n
(2019) generate translations non-monotonically by adding words\nto the left or right of previous ones or by inserting words in arbitrary order to form a sequence.\n\nEditing-Based Models Several prior works have explored incorporating \u201cediting\u201d operations for\nsequence generation tasks. For instance, Novak et al. (2016) predict and apply token substitutions\niteratively on phrase-based MT system outputs using a convolutional neural network. QuickEdit (Grangier\nand Auli, 2017) and deliberation network (Xia et al., 2017) both consist of two autoregressive\ndecoders where the second decoder refines the translation generated by the first decoder. Guu et al.\n(2018) propose a neural editor which learns language modeling by first retrieving a prototype and\nthen editing over that. Freitag et al. (2019) correct patterned errors in MT system outputs using\ntransformer models trained on monolingual data. Additionally, the use of Levenshtein distance with\ndynamic programming as the oracle policy was also proposed in Sabour et al. (2018); Dong et al.\n(2019). Different from these works, the proposed model learns a non-autoregressive model which\nsimultaneously inserts and deletes multiple tokens iteratively.\n\n6 Conclusion\n\nWe propose Levenshtein Transformer, a neural sequence generation model based on insertion and\ndeletion. The resulting model achieves competitive performance and improved decoding efficiency, and unifies sequence\ngeneration and refinement in one model. The insertion and deletion operations are arguably more\nsimilar to how humans write or edit text. For future work, this model could potentially be extended to\nhuman-in-the-loop generation.\n\nAcknowledgement\n\nWe would like to thank Kyunghyun Cho, Marc\u2019Aurelio Ranzato, Douwe Kiela, Qi Liu and our\ncolleagues at Facebook AI Research for valuable feedback, discussions and technical assistance.\n\nReferences\nDzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015.\n
Neural machine translation by jointly\nlearning to align and translate. In 3rd International Conference on Learning Representations, ICLR\n2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.\n\nKyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language\nmodel. arXiv preprint arXiv:1605.03835.\n\nJacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of\ndeep bidirectional transformers for language understanding. CoRR, abs/1810.04805.\n\nYue Dong, Zichao Li, Mehdi Rezagholizadeh, and Jackie Chi Kit Cheung. 2019. Editnts: An neural\nprogrammer-interpreter model for sentence simplification through explicit editing. arXiv preprint\narXiv:1906.08104.\n\nMarkus Freitag, Isaac Caswell, and Scott Roy. 2019. Text repair model for neural machine translation.\narXiv preprint arXiv:1904.04790.\n\nMarjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Constant-time machine\ntranslation with conditional masked language models. CoRR, abs/1904.09324.\n\nIan Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,\nAaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural\ninformation processing systems, pages 2672\u20132680.\n\nDavid Grangier and Michael Auli. 2017. Quickedit: Editing text & translations by crossing words\nout. arXiv preprint arXiv:1711.04805.\n\nJiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-autoregressive\nneural machine translation. In 6th International Conference on Learning Representations,\nICLR 2018, Vancouver, Canada, April 30-May 3, 2018, Conference Track Proceedings.\n\nJiatao Gu, Qi Liu, and Kyunghyun Cho. 2019. Insertion-based decoding with automatically inferred\ngeneration order. arXiv preprint arXiv:1902.01370.\n\nKelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. 2018.\n
Generating sentences by\nediting prototypes. Transactions of the Association of Computational Linguistics, 6:437\u2013450.\n\nLukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam\nShazeer. 2018. Fast decoding in sequence models using discrete latent variables. In International\nConference on Machine Learning, pages 2395\u20132404.\n\nYoon Kim and Alexander Rush. 2016. Sequence-level knowledge distillation. In EMNLP.\n\nPhilipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation.\nIn Proceedings of the 2003 Conference of the North American Chapter of the Association for\nComputational Linguistics on Human Language Technology-Volume 1, pages 48\u201354. Association\nfor Computational Linguistics.\n\nJason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural\nsequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical\nMethods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018,\npages 1173\u20131182.\n\nVladimir Iosifovich Levenshtein. 1965. Binary codes capable of correcting deletions, insertions, and\nreversals. In Doklady Akademii Nauk, volume 163, pages 845\u2013848. Russian Academy of Sciences.\n\nChin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization\nBranches Out: Proceedings of the ACL-04 Workshop, pages 74\u201381, Barcelona, Spain.\nAssociation for Computational Linguistics.\n\nToshiaki Nakazawa, Shohei Higashiyama, Chenchen Ding, Hideya Mino, Isao Goto, Hideto Kazawa,\nYusuke Oda, Graham Neubig, and Sadao Kurohashi. 2017. Overview of the 4th workshop on\nAsian translation. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages\n1\u201354, Taipei, Taiwan. Asian Federation of Natural Language Processing.\n\nRoman Novak, Michael Auli, and David Grangier. 2016.\n
Iterative refinement for machine translation.\narXiv preprint arXiv:1610.06602.\n\nKishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic\nevaluation of machine translation. In Proceedings of the 40th annual meeting on association for\ncomputational linguistics, pages 311\u2013318. Association for Computational Linguistics.\n\nAlexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive\nsentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in\nNatural Language Processing, pages 379\u2013389, Lisbon, Portugal. Association for Computational\nLinguistics.\n\nSara Sabour, William Chan, and Mohammad Norouzi. 2018. Optimal completion distillation for\nsequence learning. arXiv preprint arXiv:1810.01398.\n\nRico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words\nwith subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational\nLinguistics (Volume 1: Long Papers), pages 1715\u20131725, Berlin, Germany. Association for\nComputational Linguistics.\n\nMatthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A\nstudy of translation edit rate with targeted human annotation. In Proceedings of Association for\nMachine Translation in the Americas, pages 223\u2013231.\n\nMitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer:\nFlexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249.\n\nMitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep\nautoregressive models. In Advances in Neural Information Processing Systems, pages\n10107\u201310116.\n\nAshish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,\nLukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.\n
In Proceedings of the Annual\nConference on Neural Information Processing Systems (NIPS).\n\nChunqi Wang, Ji Zhang, and Haiqing Chen. 2018. Semi-autoregressive neural machine translation.\nIn Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,\npages 479\u2013488, Brussels, Belgium. Association for Computational Linguistics.\n\nSean Welleck, Kiant\u00e9 Brantley, Hal Daum\u00e9 III, and Kyunghyun Cho. 2019. Non-monotonic sequential\ntext generation. arXiv preprint arXiv:1902.02192.\n\nYingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation\nnetworks: Sequence generation beyond one-pass decoding. In Advances in Neural Information\nProcessing Systems, pages 1784\u20131794.\n", "award": [], "sourceid": 5992, "authors": [{"given_name": "Jiatao", "family_name": "Gu", "institution": "Facebook AI Research"}, {"given_name": "Changhan", "family_name": "Wang", "institution": "Facebook AI Research"}, {"given_name": "Junbo", "family_name": "Zhao", "institution": "New York University"}]}