{"title": "A Retrieve-and-Edit Framework for Predicting Structured Outputs", "book": "Advances in Neural Information Processing Systems", "page_first": 10052, "page_last": 10062, "abstract": "For the task of generating complex outputs such as source code, editing existing outputs can be easier than generating complex outputs from scratch. With this motivation, we propose an approach that first retrieves a training example based on the input (e.g., natural language description) and then edits it to the desired output (e.g., code). Our contribution is a computationally efficient method for learning a retrieval model that embeds the input in a task-dependent way without relying on a hand-crafted metric or incurring the expense of jointly training the retriever with the editor. Our retrieve-and-edit framework can be applied on top of any base model. We show that on a new autocomplete task for GitHub Python code and the Hearthstone cards benchmark, retrieve-and-edit significantly boosts the performance of a vanilla sequence-to-sequence model on both tasks.", "full_text": "A Retrieve-and-Edit Framework for Predicting Structured Outputs

Tatsunori B. Hashimoto, Department of Computer Science, Stanford University, thashim@stanford.edu
Kelvin Guu, Department of Statistics, Stanford University, kguu@stanford.edu
Yonatan Oren, Department of Computer Science, Stanford University, yonatano@stanford.edu
Percy Liang, Department of Computer Science, Stanford University, pliang@cs.stanford.edu

Abstract

For the task of generating complex outputs such as source code, editing existing outputs can be easier than generating complex outputs from scratch. With this motivation, we propose an approach that first retrieves a training example based on the input (e.g., natural language description) and then edits it to the desired output (e.g., code).
Our contribution is a computationally efficient method for learning a retrieval model that embeds the input in a task-dependent way without relying on a hand-crafted metric or incurring the expense of jointly training the retriever with the editor. Our retrieve-and-edit framework can be applied on top of any base model. We show that on a new autocomplete task for GitHub Python code and the Hearthstone cards benchmark, retrieve-and-edit significantly boosts the performance of a vanilla sequence-to-sequence model on both tasks.

1 Introduction

In prediction tasks with complex outputs, generating well-formed outputs is challenging, as is well-known in natural language generation [20, 28]. However, the desired output might be a variation of another, previously observed example [14, 13, 30, 18, 24]. Other tasks ranging from music generation to program synthesis exhibit the same phenomenon: many songs borrow chord structure from other songs, and software engineers routinely adapt code from Stack Overflow.

Motivated by these observations, we adopt the following retrieve-and-edit framework (Figure 1):

1. Retrieve: Given an input x, e.g., a natural language description 'Sum the first two elements in tmp', we use a retriever to choose a similar training example (x′, y′), such as 'Sum the first 5 items in Customers'.

2. Edit: We then treat y′ from the retrieved example as a "prototype" and use an editor to edit it into the desired output y appropriate for the input x.

While many existing methods combine retrieval and editing [13, 30, 18, 24], these approaches rely on a fixed hand-crafted or generic retrieval mechanism. One drawback to this approach is that designing a task-specific retriever is time-consuming, and a generic retriever may not perform well on tasks where x is structured or complex [40].
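To make the two-stage process concrete, here is a minimal, self-contained sketch of retrieve-and-edit. The bag-of-words embedding and the pass-through editor are our own toy stand-ins, not the learned retriever and trained editor described in this paper:

```python
import math

def embed(text):
    """Toy bag-of-words embedding (stand-in for a learned encoder)."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(x, train):
    """Step 1: pick the training pair (x', y') whose input is most similar to x."""
    return max(train, key=lambda pair: cosine(embed(x), embed(pair[0])))

def edit(x, prototype):
    """Step 2 (placeholder): a real editor is a trained model conditioned
    on both x and the retrieved prototype; here we just return y'."""
    x_prime, y_prime = prototype
    return y_prime

train = [
    ("Sum the first 5 items in Customers", "np.sum(Customers[:5])"),
    ("Sort the list in reverse order", "sorted(xs, reverse=True)"),
]
x = "Sum the first two elements in tmp"
prototype = retrieve(x, train)  # picks the 'Sum the first 5 items' example
```

In the paper, the embedding is instead learned so that nearby inputs are easy to edit into each other (Section 3.1), and the editor is a sequence-to-sequence model; `edit` above merely shows the data flow.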
Ideally, the retrieval metric would be learned from the data in a task-dependent way: we wish to consider x and x′ similar only if their corresponding outputs y and y′ differ by a small, easy-to-perform edit. However, the straightforward way of training a retriever jointly with the editor would require summing over all possible x′ for each example, which would be prohibitively slow.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we propose a way to train a retrieval model that is optimized for the downstream edit task. We first train a noisy encoder-decoder model, carefully selecting the noise and embedding space to ensure that inputs that receive similar embeddings can be easily edited by an oracle editor. We then train the editor by retrieving according to this learned metric. The main advantage of this approach is that it is computationally efficient and requires no domain knowledge other than an encoder-decoder model with low reconstruction error.

We evaluate our retrieve-and-edit approach on a new Python code autocomplete dataset of 76k functions, where the task is to predict the next token given partially written code and a natural language description. We show that applying the retrieve-and-edit framework to a standard sequence-to-sequence model boosts its performance by 14 points in BLEU score [25]. Comparing retrieval methods, learned retrieval improves over a fixed, bag-of-words baseline by 6 BLEU. We also evaluate on the Hearthstone cards benchmark [22], where systems must predict a code snippet based on card properties and a natural language description. We show that augmenting a standard sequence-to-sequence model with the retrieve-and-edit approach improves the model by 7 BLEU and outperforms the best non-abstract-syntax-tree (AST) based model by 4 points.

2 Problem statement

Task.
Our goal is to learn a model p_model(y | x) that predicts an output y (e.g., a 5-15 line code snippet) given an input x (e.g., a natural language description) drawn from a distribution p_data. See Figure 1 for an illustrative example.

Retrieve-and-edit. The retrieve-and-edit framework corresponds to the following generative process: given an input x, we first retrieve an example (x′, y′) from the training set D by sampling using a retriever of the form p_ret((x′, y′) | x). We then generate an output y using an editor of the form p_edit(y | x, (x′, y′)). The overall likelihood of generating y given x is

    p_model(y | x) = Σ_{(x′,y′)∈D} p_edit(y | x, (x′, y′)) p_ret((x′, y′) | x),    (1)

and the objective that we seek to maximize is

    L(p_edit, p_ret) := E_{(x,y)∼p_data}[log p_model(y | x)].    (2)

For simplicity, we focus on deterministic retrievers, where p_ret((x′, y′) | x) is a point mass on a particular example (x′, y′). This matches the typical approach for retrieve-and-edit methods, and we leave extensions to stochastic retrieval [14] and multiple retrievals [13] to future work.

Learning task-dependent similarity. As mentioned earlier, we would like the retriever to incorporate task-dependent similarity: two inputs x and x′ should be considered similar only if the editor has a high likelihood of editing y′ into y. The optimal retriever for a fixed editor would be one that maximizes the standard maximum marginal likelihood objective in equation (1).

An initial idea to learn the retriever might be to optimize for maximum marginal likelihood using standard approaches such as gradient descent or expectation maximization (EM). However, both of

Figure 1.
The retrieve-and-edit approach consists of the retriever, which identifies a relevant example from the training set, and the editor, which predicts the output conditioned on the retrieved example.

[Figure 1 panels: input x 'Sum the first two elements in tmp'; retrieved input x′ 'Sum the first 5 items in Customers' with prototype y′ np.sum(Customers[:5]); the editor produces the generated output np.sum(tmp[:2]).]

these approaches involve summing over all training examples D on each training iteration, which is computationally intractable.

Instead, we break up the optimization problem into two parts. We first train the retriever in isolation, replacing the edit model p_edit with an oracle editor p∗_edit and optimizing a lower bound for the marginal likelihood under this editor. Then, given this retriever, we train the editor using the standard maximum likelihood objective. This decomposition makes it possible to avoid the computational difficulties of learning a task-dependent retrieval metric, but importantly, we will still be able to learn a retriever that is task-dependent.

3 Learning to retrieve and edit

We first describe the procedure for training our retriever (Section 3.1), which consists of embedding the inputs x into a vector space (Section 3.1.1) and retrieving according to this embedding. We then describe the editor and its training procedure, which follows immediately from maximizing the marginal likelihood (Section 3.2).

3.1 Retriever

Sections 3.1.1-3.1.3 will justify our training procedure as maximization of a lower bound on the likelihood; one can skip to Section 3.1.4 for the actual training procedure if desired.

We would like to train the retriever based on L (Equation 2), but we do not yet know the behavior of the editor.
We can avoid this problem by optimizing the retriever p_ret assuming the editor is the true conditional distribution over the targets y given the retrieved example (x′, y′) under the joint distribution p_ret((x′, y′) | x) p_data(x, y). We call this the oracle editor for p_ret,

    p∗_edit(y | (x′, y′)) = Σ_x p_ret((x′, y′) | x) p_data(x, y) / Σ_{x,y} p_ret((x′, y′) | x) p_data(x, y).

The oracle editor gives rise to the following lower bound on sup_{p_edit} L(p_ret, p_edit):

    L∗(p_ret) := E_{(x,y)∼p_data}[E_{(x′,y′)|x∼p_ret}[log p∗_edit(y | (x′, y′))]],    (3)

which follows from Jensen's inequality and from using the particular editor p∗_edit rather than the best possible p_edit.[1] Unlike the real editor p_edit, p∗_edit does not condition on the input x, to ensure that the bound represents the quality of the retrieved example alone.

Next, we wish to find a further lower bound that takes the form of a distance minimization problem:

    L∗(p_ret) ≥ C − E_{x∼p_data}[E_{x′|x∼p_ret}[d(x′, x)²]],    (4)

where C is a constant independent of p_ret. The p_ret that maximizes this lower bound is the deterministic retriever which finds the nearest neighbor to x under the metric d.

In order to obtain such a lower bound, we will learn an encoder p_θ(v | x) and decoder p_φ(y | v) and use the distance metric in the latent space of v as our distance d. When p_θ(v | x) takes a particular form, we can show that this results in the desired lower bound (4).

3.1.1 The latent space as a task-dependent metric

Consider any encoder-decoder model with a probabilistic encoder p_θ(v | x) and decoder p_φ(y | v). We can show that there is a variational lower bound that takes a form similar to (4) and decouples p_ret from the rest of the objective.

Proposition 1.
For any densities p_θ(v | x) and p_φ(y | v) and random variables (x, y, x′, y′) ∼ p_ret((x′, y′) | x) p_data(x, y),

    L∗(p_ret) ≥ E_{(x,y)∼p_data}[E_{v∼p_θ(v|x)}[log p_φ(y | v)]] − E_x[E_{x′|x∼p_ret}[KL(p_θ(v | x) ‖ p_θ(v | x′))]],    (5)

where we denote the first term by L_reconstruct(θ, φ) and the second by L_discrepancy(θ, p_ret).

[1] This expression is the conditional entropy H(y | x′, y′). An alternative interpretation of L∗ is that maximization with respect to p_ret is equivalent to maximizing the mutual information between y and (x′, y′).

Proof. The inequality follows from standard arguments on variational approximations. Since p∗_edit(y | (x′, y′)) is the conditional distribution implied by the joint distribution over (x′, y′, x, y), we have

    E_{y|x′,y′∼p∗_edit}[log p∗_edit(y | (x′, y′))] ≥ E_{y|x′,y′∼p∗_edit}[log ∫ p_φ(y | v) p_θ(v | x′) dv],

where ∫ p_φ(y | v) p_θ(v | x′) dv is just another distribution. Taking the expectation of both sides with respect to (x, x′, y′) and applying the law of total expectation yields

    L∗(p_ret) ≥ E_{(x,y)∼p_data}[E_{(x′,y′)|x∼p_ret}[log ∫ p_φ(y | v) p_θ(v | x′) dv]].    (6)

Next, we apply the standard evidence lower bound (ELBO) on the latent variable v with variational distribution p_θ(v | x).
This continues the lower bound:

    L∗(p_ret) ≥ E_{(x,y)∼p_data}[E_{(x′,y′)|x∼p_ret}[E_{v|x∼p_θ}[log p_φ(y | v)] − KL(p_θ(v | x) ‖ p_θ(v | x′))]]
             ≥ E_{(x,y)∼p_data}[E_{v|x∼p_θ}[log p_φ(y | v)]] − E_{x∼p_data}[E_{x′∼p_ret}[KL(p_θ(v | x) ‖ p_θ(v | x′))]],

where the last inequality is just collapsing expectations.

Proposition 1 takes the form of the desired lower bound (4), since it decouples the reconstruction term E_{(x,y)∼p_data}[E_{v|x∼p_θ}[log p_φ(y | v)]] from a discrepancy term KL(p_θ(v | x) ‖ p_θ(v | x′)). However, there are two differences between the earlier lower bound (4) and our derived result: the KL divergence may not represent a distance metric, and there is dependence on unknown parameters (θ, φ). We resolve these problems next.

3.1.2 The KL divergence as a distance metric

We will now show that for a particular choice of p_θ, the KL divergence KL(p_θ(v | x) ‖ p_θ(v | x′)) takes the form of a squared distance metric. In particular, choose p_θ(v | x) to be a von Mises-Fisher distribution over unit vectors centered on the output of an encoder µ_θ(x):

    p_θ(v | x) = vMF(v; µ_θ(x), κ) = C_κ exp(κ µ_θ(x)ᵀ v),    (7)

where both v and µ_θ(x) are unit vectors, and C_κ is a normalization constant depending only on the dimension and κ. The von Mises-Fisher distribution p_θ turns the KL divergence term into a squared Euclidean distance on the unit sphere (see Appendix A). This further simplifies the discrepancy term in (5) to

    L_discrepancy(θ, p_ret) = C_κ E_{x∼p_data}[E_{x′∼p_ret}[‖µ_θ(x) − µ_θ(x′)‖²₂]].    (8)

The KL divergence on other distributions such as the Gaussian can also be expressed as a distance metric, but we choose the von Mises-Fisher since its KL divergence is upper bounded by a constant, a property that we will use next.

The retriever p_ret that minimizes (8) deterministically retrieves the x′ that is closest to x according to the embedding µ_θ. For efficiency, we implement this retriever using a cosine-LSH hash via the annoy Python library, which we found to be both accurate and scalable.

3.1.3 Setting the encoder-decoder parameters (θ, φ)

Any choice of (θ, φ) turns Proposition 1 into a lower bound of the form (4), but the bound can potentially be very loose if these parameters are chosen poorly. Joint optimization over (θ, φ, p_ret) is computationally expensive, as it requires a sum over the potential retrieved examples. Instead, we will optimize (θ, φ) with respect to a conservative lower bound that is independent of p_ret. For the von Mises-Fisher distribution, KL(p_θ(v | x) ‖ p_θ(v | x′)) ≤ 2C_κ, and thus

    E_{(x,y)∼p_data}[E_{v|x∼p_θ}[log p_φ(y | v)]] − E_{x∼p_data}[E_{x′∼p_ret}[KL(p_θ(v | x) ‖ p_θ(v | x′))]]
      ≥ E_{(x,y)∼p_data}[E_{v|x∼p_θ}[log p_φ(y | v)]] − 2C_κ.

Therefore, we can optimize (θ, φ) with respect to this worst-case bound. This lower-bound objective is analogous to the recently proposed hyperspherical variational autoencoder and is straightforward to train using reparametrization gradients [9, 14, 38]. Our training procedure consists of applying minibatch stochastic gradient descent on (θ, φ), where gradients involving v are computed with the reparametrization trick.

3.1.4 Overall procedure

The overall retrieval training procedure consists of two steps:

1.
Train an encoder-decoder to map each input x into an embedding v that can reconstruct the output y:

    (θ̂, φ̂) := argmax_{θ,φ} E_{(x,y)∼p_data}[E_{v|x∼p_θ}[log p_φ(y | v)]].    (9)

2. Set the retriever to be the deterministic nearest-neighbor input in the training set under the encoder:

    p̂_ret(x′, y′ | x) := 1[(x′, y′) = argmin_{(x′,y′)∈D} ‖µ_θ̂(x) − µ_θ̂(x′)‖²₂].    (10)

3.2 Editor

The procedure in Section 3.1.4 returns a retriever p̂_ret that maximizes a lower bound on L∗, which is defined in terms of the oracle editor p∗_edit. Since we do not have access to the oracle editor p∗_edit, we train the editor p_edit to directly maximize L(p_edit, p̂_ret). Specifically, we solve the optimization problem:

    argmax_{p_edit} E_{(x,y)∼p_data}[E_{(x′,y′)∼p̂_ret}[log p_edit(y | x, (x′, y′))]].    (11)

In our experiments, we let p_edit be a standard sequence-to-sequence model with attention and copying [12, 36] (see Appendix B for details), but any model architecture can be used for the editor.

4 Experiments

We evaluate our retrieve-and-edit framework on two tasks. First, we consider a code autocomplete task over Python functions taken from GitHub and show that retrieve-and-edit substantially outperforms approaches based only on sequence-to-sequence models or retrieval. Then, we consider the Hearthstone cards benchmark and show that retrieve-and-edit can boost the accuracy of existing sequence-to-sequence models.

For both experiments, the dataset is processed by standard space-and-punctuation tokenization, and we run the retrieve-and-edit model with randomly initialized word vectors and κ = 500, which we obtained by evaluating BLEU scores on the development set of both datasets.
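At retrieval time, the procedure above reduces to a nearest-neighbor lookup under the trained encoder (Equations 9-11). A minimal sketch of that lookup and of how editor training pairs could be assembled: it assumes the embeddings µ_θ(x) are already computed, uses brute-force search in place of the paper's cosine-LSH index, and excludes self-retrieval on the training set (an implementation choice not spelled out in this excerpt):

```python
def nearest_neighbor(mu_x, train_embeddings):
    """Deterministic retriever (Eq. 10): arg min ||mu(x) - mu(x')||^2.
    On unit vectors this is equivalent to maximizing cosine similarity."""
    best_i, best_d = None, float("inf")
    for i, mu_xp in enumerate(train_embeddings):
        d = sum((a - b) ** 2 for a, b in zip(mu_x, mu_xp))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def editor_training_pairs(inputs, outputs, embeddings):
    """Training data for the editor (Eq. 11): pair each (x, y) with its
    retrieved prototype (x', y'), skipping the example itself."""
    pairs = []
    for i, (x, y) in enumerate(zip(inputs, outputs)):
        others = [e for j, e in enumerate(embeddings) if j != i]
        j = nearest_neighbor(embeddings[i], others)
        j = j if j < i else j + 1  # map back to an index in the full set
        pairs.append((x, inputs[j], outputs[j], y))
    return pairs
```

Each tuple (x, x′, y′, y) then becomes one training example for maximizing log p_edit(y | x, (x′, y′)).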
Both the retriever and editor were trained for 1000 iterations on Hearthstone and 3000 on GitHub via Adam minibatch gradient descent, with batch size 16 and a learning rate of 0.001.

4.1 Autocomplete on Python GitHub code

Given a natural language description of a Python function and a partially written code fragment, the task is to return a candidate list of k = 1, 5, 10 next tokens (Figure 2). A model predicts correctly if the ground-truth token is in the candidate list. The performance of a model is defined in terms of the average or maximum number of successive tokens correctly predicted.

Dataset. Our Python autocomplete dataset is a representative sample of Python code from GitHub, obtained from Google BigQuery by retrieving Python code containing at least one block comment with reStructuredText (reST) formatting (see Appendix C for details). We use this data to form a code prediction task where each example consists of four inputs: the block comment, function name, arguments, and a partially written function body. The output is the next token in the function body. To avoid the possibility that repository forks and duplicated library files result in a large number of duplicate functions, we explicitly deduplicated all files based on both the file contents and repository path name. We also removed any duplicate function/docstring pairs and split the train and test sets at the repository level. We tokenized using space and punctuation and kept only functions with at most 150 tokens, as the longer functions are nearly impossible to predict from the docstring. This resulted in a training set of 76k Python functions.

                                     Longest completed length   Avg completion length   BLEU
                                     k=1    k=5    k=10         k=1    k=5    k=10
  Retrieve-and-edit (Retrieve+Edit)  17.6   20.9   21.9         5.8    7.5    8.1       34.7
  Seq2Seq                            10.6   12.5   13.2         2.5    3.4    3.8       19.2
  Retriever only (TaskRetriever)     13.5   --     --           4.7    --     --        29.9

Table 1.
Retrieve-and-edit substantially improves the performance over baseline sequence-to-sequence models (Seq2Seq) and trained retrieval without editing (TaskRetriever) on the Python autocomplete dataset. k indicates the number of candidates over beam search considered for predicting a token, and completion length is the number of successive tokens that are correctly predicted.

                     Longest completed length   Avg completion length   BLEU
  TaskRetriever      13.5                       4.7                     29.9
  InputRetriever     12.3                       4.1                     29.8
  LexicalRetriever    9.8                       3.4                     23.1

Table 2. Retrievers based on the noisy encoder-decoder (TaskRetriever) outperform a retriever based on bag-of-words vectors (LexicalRetriever). Learning an encoder-decoder on the inputs alone (InputRetriever) results in a slight loss in accuracy.

Results. Comparing the retrieve-and-edit model (Retrieve+Edit) to a sequence-to-sequence baseline (Seq2Seq) whose architecture and training procedure match those of the editor, we find that retrieval adds substantial performance gains on all metrics with no domain knowledge or hand-crafted features (Table 1).

We also evaluate various retrievers: TaskRetriever, our task-dependent retriever presented in Section 3.1; LexicalRetriever, which embeds the input tokens using bag-of-words vectors and retrieves based on cosine similarity; and InputRetriever, which uses the same encoder-decoder architecture as TaskRetriever but modifies the decoder to predict x rather than y. Table 2 shows that TaskRetriever significantly outperforms LexicalRetriever on all metrics, and is comparable to InputRetriever on BLEU and slightly better on the autocomplete metrics.
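The completion-length metrics used in this section can be stated precisely: at each position the model proposes k candidate next tokens, a position counts as correct if the ground-truth token is among them, and we measure runs of successive correct positions. A sketch of this computation (our reading of the metric as described above, not the authors' evaluation code):

```python
def completion_lengths(candidate_lists, ground_truth):
    """candidate_lists[i] holds the k tokens proposed at position i; a
    position is correct if ground_truth[i] is among them. Returns the
    lengths of maximal runs of successive correct positions."""
    runs, current = [], 0
    for cands, tok in zip(candidate_lists, ground_truth):
        if tok in cands:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

# Example with 5 positions and k=2 candidates each:
truth = ["if", "not", "x", ":", "return"]
cands = [["if", "for"], ["not", "x"], ["y", "z"],
         [":", ")"], ["return", "pass"]]
runs = completion_lengths(cands, truth)
# runs == [2, 2]: longest completed length 2, average 2.0
```

The "longest completed length" is then max(runs) and the "average completion length" is their mean, aggregated over the test set.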
We did not directly compare to abstract syntax tree (AST) based methods here, since they do not have a direct way to condition on partially generated code, which is needed for autocomplete.

Examples of predicted outputs in Figure 2 demonstrate that the docstring does not fully specify the structure of the output code. Despite this, the retrieval-based methods are sometimes able to retrieve relevant functions. In the example, the retriever learns to return a function that has a similar conditional check. Retrieve+Edit does not have enough information to predict the true function and therefore predicts a generic conditional (if not b_data). In contrast, the Seq2Seq model defaults to predicting a generic getter function rather than a conditional.

Figure 2. Example from the Python autocomplete dataset along with the retrieved example used during prediction (top center) and baselines (right panels). The edited output (bottom center) mostly follows the retrieved example but replaces the conditional with a generic one.

[Figure 2 panels (code omitted): input x (docstring and argument for is_encrypted), retrieved prototype (x′, y′), ground truth y, edited output, and baseline outputs (fixed retrieval and Seq2seq only).]

Figure 3.
Example from the Hearthstone validation set (left panels) and the retrieved example used during prediction (top right). The output (bottom right) differs from the gold standard only in omitting an optional variable definition (minion_type).

                                          BLEU   Accuracy
  AST-based models
    Abstract Syntax Network (ASN) [26]    79.2   22.7
    Yin et al. [41]                       75.8   16.2
  Non-AST models
    Retrieve+Edit (this work)             70.0    9.1
    Latent Predictor Network [22]         65.6    4.5
    Retriever [22]                        62.5    0.0
    Sequence-to-sequence [22]             60.4    1.5
    Statistical MT [22]                   43.2    0.0

Table 3. Retrieve-and-edit substantially improves upon standard sequence-to-sequence approaches for Hearthstone, and closes the gap to AST-based models.

4.2 Hearthstone cards benchmark

The Hearthstone cards benchmark consists of 533 cards in a computer card game, where each card is associated with a code snippet. The task is to output a Python class given a card description. Figure 3 shows a typical example along with the retrieved example and edited output. The small size of this dataset makes it challenging for sequence-to-sequence models to avoid overfitting to the training set. Indeed, it has been observed that naive sequence-to-sequence approaches perform quite poorly [22].

For quantitative evaluation, we compute BLEU and exact match probabilities using the tokenization and evaluation scheme of [41]. Retrieve+Edit provides a 7-point improvement in BLEU over the sequence-to-sequence and retrieval baselines (Table 3) and 4 points over the best non-AST-based method, despite the fact that our editor is a vanilla sequence-to-sequence model.

Methods based on ASTs still achieve the highest BLEU and exact match scores, but we are able to significantly narrow the gap between specialized code generation techniques and vanilla sequence-to-sequence models if the latter is boosted with the retrieve-and-edit framework.
Note that retrieve-and-edit could also be applied to AST-based models, which would be an interesting direction for future work.

Analysis of example outputs shows that for the most part, the retriever finds relevant cards. As an example, Figure 3 shows a retrieved card (DarkIronDwarf) that functions similarly to the desired output (Spellbreaker). Both cards share the same card type and attributes, and both have a battlecry (a piece of code that executes whenever the card is played) that consists of modifying the attributes of another card. Our predicted output corrects nearly all mistakes in the retrieved output, identifying that the modification should be changed from ChangeAttack to Silence. The output differs from the gold standard on only one line: omitting the line minion_type=MINION_TYPE.NONE. Incidentally, it turns out that this is not an actual semantic error, since MINION_TYPE.NONE is the default setting for this field, and the retrieved DarkIronDwarf card also omits this field.

[Figure 3 panels (code omitted): input card description for Spellbreaker; retrieved prototype y′ (the DarkIronDwarf class); ground truth and edited output (the Spellbreaker class). Legend: blue text is missing from the generation but appears in the ground truth; red text appears in the generation but not in the ground truth.]

5 Related work

Retrieval models for text generation. The use of retrieval in text generation dates back to early example-based machine translation systems that retrieved and adapted phrases from a translation database [33]. Recent work on dialogue generation [40, 30, 37] proposed a joint system in which an RNN is trained to transform a retrieved candidate. Closely related work in machine translation [13] augments a neural machine translation model with sentence pairs from the training set retrieved by an off-the-shelf search engine. Retrieval-augmented models have also been used in image captioning [18, 24]. These models generate captions of an image via a sentence compression scheme from an initial caption retrieved based on image context. Our work differs from all the above conceptually in designing retrieval systems explicitly for the task of editing, rather than using fixed retrievers (e.g., based on lexical overlap). Our work also demonstrates that retrieve-and-edit can boost the performance of vanilla sequence-to-sequence models without the use of domain-specific retrievers.

A related edit-based model [14] has also proposed editing examples as a way to augment text generation. However, the task there was unconditional generation, and examples were chosen by random sampling. In contrast, our work focuses on conditional sequence generation with a deterministic retriever, which cannot be solved using the same random sampling and editing approach.

Embedding models.
Embedding sentences using noisy autoencoders has been proposed earlier as the sentence VAE [5], which demonstrated that a Gaussian VAE captures semantic structure in a latent vector space. Related work on using the von Mises-Fisher distribution for VAEs shows that sentences can also be represented using latent vectors on the unit sphere [9, 14, 38]. Our encoder-decoder is based on the same type of VAE, showing that the latent space of a noisy encoder-decoder is appropriate for retrieval.

Semantic hashing by autoencoders [16] is a related idea in which an autoencoder's latent representation is used to construct a hash function that identifies similar images or texts [29, 6]. A related idea is cross-modal embeddings, which jointly embed and align items in different domains (such as images and captions) using autoencoders [39, 2, 31, 10]. Both of these approaches seek to learn general similarity metrics between examples for the purpose of identifying documents or images that are semantically similar. Our work differs in that we learn task-specific embeddings, which treat items as similar only if they are useful for the downstream edit task, and we derive bounds that connect similarity in a latent metric to editability.

Learned retrieval. Some question answering systems learn to retrieve based on supervision of the correct item to retrieve [35, 27, 19], but these approaches do not apply to our setting, since we do not know which items are easy to edit into our target sequence y and must instead estimate this from the embedding. There have also been recent proposals for scalable large-scale learned memory models [34] that can learn a retrieval mechanism based on a known reward. While these approaches make training p_ret tractable for a known p_edit, they do not resolve the problem that p_edit is not fixed or known.

Code generation.
Code generation is well studied [21, 17, 4, 23, 1], but these approaches have not explored edit-based generation. Recent code generation models have also constrained the output structure based on ASTs [26, 41] or used specialized copy mechanisms for code [22]. Our goal differs from these works in that we use retrieve-and-edit as a general-purpose method to boost model performance. We considered simple sequence-to-sequence models as an example, but the framework is agnostic to the editor and could also be used with specialized code generation models. Recent work appearing after submission of this work supports this hypothesis by showing that augmenting AST-based models with AST subtrees retrieved via edit distance can boost the performance of AST-based models [15].

Nonparametric models and mixture models. Our model is related to nonparametric regression techniques [32], where in our case, proximity learned by the encoder corresponds to a neighborhood, and the editor is a learned kernel. Adaptive kernels for nonparametric regression are well studied [11] but have mainly focused on learning local smoothness parameters rather than the functional form of the kernel. More generally, the idea of conditioning on retrieved examples is an instance of a mixture model, and these types of ensembling approaches have been shown to boost the performance of simple base models on tasks such as language modeling [7]. One can view retrieve-and-edit as another type of mixture model.

6 Discussion

In this work, we considered the task of generating complex outputs such as source code using standard sequence-to-sequence models augmented by a learned retriever. We show that learning a retriever using a noisy encoder-decoder can naturally combine the desire to retrieve examples that maximize downstream editability with the computational efficiency of cosine LSH.
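Cosine LSH of the kind referred to here can be sketched with random hyperplanes (SimHash): unit-normalized embeddings are hashed by the sign pattern they produce against a set of random directions, so vectors with high cosine similarity tend to share a bucket. The following is a minimal, hypothetical illustration; the dimensions, function names, and empty-bucket fallback are our assumptions, not the implementation used in this paper.

```python
# Illustrative sketch of cosine LSH retrieval (SimHash), not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(v, hyperplanes):
    """Binary signature: which side of each random hyperplane v falls on."""
    return tuple((hyperplanes @ v > 0).astype(int))

def build_index(embeddings, n_bits=16, dim=64):
    """Bucket each training embedding by its LSH signature."""
    hyperplanes = rng.standard_normal((n_bits, dim))
    index = {}
    for i, v in enumerate(embeddings):
        index.setdefault(lsh_signature(v, hyperplanes), []).append(i)
    return hyperplanes, index

def retrieve(query, embeddings, hyperplanes, index):
    """Best cosine match among the query's bucket; falls back to an
    exhaustive scan if the bucket is empty (an assumption for this sketch)."""
    candidates = index.get(lsh_signature(query, hyperplanes),
                           range(len(embeddings)))
    return max(candidates, key=lambda i: embeddings[i] @ query)

# Toy usage: 100 random unit vectors indexed, then queried.
dim = 64
X = rng.standard_normal((100, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)
hyperplanes, index = build_index(X, dim=dim)
```

Because the embeddings are unit-normalized, the dot product inside `retrieve` equals cosine similarity, and two vectors at angle θ agree on each random hyperplane bit with probability 1 − θ/π, which is what makes the bucketing approximate cosine neighborhoods.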
Using this approach, we demonstrated that our model can narrow the gap between specialized code generation models and vanilla sequence-to-sequence models on the Hearthstone dataset, and show substantial improvements on a Python code autocomplete task over sequence-to-sequence baselines.

Reproducibility. Data and code used to generate the results of this paper are available on the CodaLab Worksheets platform at https://worksheets.codalab.org/worksheets/0x1ad3f387005c492ea913cf0f20c9bb89/.

Acknowledgements. This work was funded by the DARPA CwC program under ARO prime contract no. W911NF-15-1-0462.

References

[1] M. Allamanis, D. Tarlow, A. Gordon, and Y. Wei. Bimodal modelling of source code and natural language. In International Conference on Machine Learning (ICML), pages 2123–2132, 2015.

[2] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning (ICML), pages 1247–1255, 2013.

[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

[4] M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow. Deepcoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.

[5] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. In Computational Natural Language Learning (CoNLL), pages 10–21, 2016.

[6] S. Chaidaroon and Y. Fang. Variational deep semantic hashing for text documents. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 75–84, 2017.

[7] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

[8] K. Cho, B. van Merrienboer, C. Gulcehre, D.
Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.

[9] T. R. Davidson, L. Falorsi, N. D. Cao, T. Kipf, and J. M. Tomczak. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

[10] F. Feng, X. Wang, and R. Li. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 7–16, 2014.

[11] A. Goldenshluger and A. Nemirovski. On spatially adaptive estimation of nonparametric regression. Mathematical Methods of Statistics, 6:135–170, 1997.

[12] J. Gu, Z. Lu, H. Li, and V. O. Li. Incorporating copying mechanism in sequence-to-sequence learning. In Association for Computational Linguistics (ACL), 2016.

[13] J. Gu, Y. Wang, K. Cho, and V. O. Li. Search engine guided non-parametric neural machine translation. arXiv preprint arXiv:1705.07267, 2017.

[14] K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics (TACL), 2018.

[15] S. A. Hayati, R. Olivier, P. Avvaru, P. Yin, A. Tomasic, and G. Neubig. Retrieval-based neural code generation. In Empirical Methods in Natural Language Processing (EMNLP), 2018.

[16] A. Krizhevsky and G. E. Hinton. Using very deep autoencoders for content-based image retrieval. In 19th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 489–494, 2011.

[17] N. Kushman and R. Barzilay. Using semantic unification to generate regular expressions from natural language. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), pages 826–836, 2013.

[18] P. Kuznetsova, V. Ordonez, A. C.
Berg, T. L. Berg, and Y. Choi. Generalizing image captions for image-text parallel corpus. In Association for Computational Linguistics (ACL), pages 790–796, 2013.

[19] T. Lei, H. Joshi, R. Barzilay, T. Jaakkola, K. Tymoshenko, A. Moschitti, and L. Marquez. Semi-supervised question retrieval with gated convolutions. In North American Association for Computational Linguistics (NAACL), pages 1279–1289, 2016.

[20] J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.

[21] P. Liang, M. I. Jordan, and D. Klein. Learning programs: A hierarchical Bayesian approach. In International Conference on Machine Learning (ICML), pages 639–646, 2010.

[22] W. Ling, E. Grefenstette, K. M. Hermann, T. Kočiský, A. Senior, F. Wang, and P. Blunsom. Latent predictor networks for code generation. In Association for Computational Linguistics (ACL), pages 599–609, 2016.

[23] C. Maddison and D. Tarlow. Structured generative models of natural source code. In International Conference on Machine Learning (ICML), pages 649–657, 2014.

[24] R. Mason and E. Charniak. Domain-specific image captioning. In Computational Natural Language Learning (CoNLL), pages 2–10, 2014.

[25] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: A method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL), 2002.

[26] M. Rabinovich, M. Stern, and D. Klein. Abstract syntax networks for code generation and semantic parsing. In Association for Computational Linguistics (ACL), 2017.

[27] A. Severyn and A. Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 373–382, 2015.

[28] L. Shao, S. Gouws, D. Britz, A. Goldie, B. Strope, and R. Kurzweil.
Generating high-quality and informative conversation responses with sequence-to-sequence models. In Empirical Methods in Natural Language Processing (EMNLP), pages 2210–2219, 2017.

[29] D. Shen, Q. Su, P. Chapfuwa, W. Wang, G. Wang, R. Henao, and L. Carin. Nash: Toward end-to-end neural architecture for generative semantic hashing. In Association for Computational Linguistics (ACL), pages 2041–2050, 2018.

[30] Y. Song, R. Yan, X. Li, D. Zhao, and M. Zhang. Two are better than one: An ensemble of retrieval- and generation-based dialog systems. arXiv preprint arXiv:1610.07149, 2016.

[31] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems (NIPS), pages 2222–2230, 2012.

[32] C. J. Stone. Consistent nonparametric regression. Annals of Statistics, 5, 1977.

[33] E. Sumita and H. Iida. Experiments and prospects of example-based machine translation. In Association for Computational Linguistics (ACL), 1991.

[34] W. Sun, A. Beygelzimer, H. Daume, J. Langford, and P. Mineiro. Contextual memory trees. arXiv preprint arXiv:1807.06473, 2018.

[35] M. Tan, C. dos Santos, B. Xiang, and B. Zhou. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108, 2015.

[36] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[37] Y. Wu, F. Wei, S. Huang, Z. Li, and M. Zhou. Response generation by context-aware prototype editing. arXiv preprint arXiv:1806.07042, 2018.

[38] J. Xu and G. Durrett. Spherical latent spaces for stable variational autoencoders. In Empirical Methods in Natural Language Processing (EMNLP), 2018.

[39] F. Yan and K. Mikolajczyk.
Deep correlation for matching images and text. In Computer Vision and Pattern Recognition (CVPR), pages 3441–3450, 2015.

[40] R. Yan, Y. Song, and H. Wu. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 55–64, 2016.

[41] P. Yin and G. Neubig. A syntactic neural model for general-purpose code generation. In Association for Computational Linguistics (ACL), pages 440–450, 2017.