{"title": "Code Generation as a Dual Task of Code Summarization", "book": "Advances in Neural Information Processing Systems", "page_first": 6563, "page_last": 6573, "abstract": "Code summarization (CS) and code generation (CG) are two crucial tasks in the field of automatic software development. Various neural network-based approaches are proposed to solve these two tasks separately. However, there exists a specific intuitive correlation between CS and CG, which has not been exploited in previous work. In this paper, we apply the relations between two tasks to improve the performance of both tasks. In other words, exploiting the duality between the two tasks, we propose a dual training framework to train the two tasks simultaneously. In this framework, we consider the dualities on probability and attention weights, and design corresponding regularization terms to constrain the duality. We evaluate our approach on two datasets collected from GitHub, and experimental results show that our dual framework can improve the performance of CS and CG tasks over baselines.", "full_text": "Code Generation as a Dual Task of\n\nCode Summarization\n\nBolin Wei\u2020, Ge Li\u2020\u2217, Xin Xia\u2021, Zhiyi Fu\u2020, Zhi Jin\u2020\u2217\n\n\u2020Key Laboratory of High Con\ufb01dence Software Technologies (Peking University)\n\nMinistry of Education, China; Software Institute, Peking University, China\n\n\u2021Faculty of Information Technology, Monash University, Australia\n\u2020bolin.wbl@gmail.com\n\u2020{lige,ypfzy,zhijin}@pku.edu.cn\n\n\u2021xin.xia@monash.edu\n\nAbstract\n\nCode summarization (CS) and code generation (CG) are two crucial tasks in the\n\ufb01eld of automatic software development. Various neural network-based approaches\nare proposed to solve these two tasks separately. However, there exists a speci\ufb01c\nintuitive correlation between CS and CG, which has not been exploited in previous\nwork. 
In this paper, we apply the relations between two tasks to improve the\nperformance of both tasks. In other words, exploiting the duality between the two\ntasks, we propose a dual training framework to train the two tasks simultaneously.\nIn this framework, we consider the dualities on probability and attention weights,\nand design corresponding regularization terms to constrain the duality. We evaluate\nour approach on two datasets collected from GitHub, and experimental results\nshow that our dual framework can improve the performance of CS and CG tasks\nover baselines.\n\n1\n\nIntroduction\n\nCode summarization (CS) is a task that generates comment for a piece of the source code, whereas\ncode generation (CG) aims to generate code based on natural language intent, e.g., description of\nrequirements. Code comments, a form of natural language description, provide a clear understanding\nfor users, and are very useful for software maintenance [de Souza et al., 2005]. On the other\nhand, CG is an indispensable process in which programmers write code to implement speci\ufb01c\nintents [Balzer, 1985]. Proper comments and correct code can massively improve programmers\u2019\nproductivity and enhance software quality. However, generating the correct code or comments is\ncostly, time-consuming, and error-prone. Therefore, carrying out CS and CG automatically becomes\ngreatly important for software development.\nRecently, many researchers, inspired by an encoder-decoder framework [Cho et al., 2014], applied\nneural networks to solve these two tasks independently. For CS, the encoder uses a neural network to\nrepresent source code as a real-valued vector, and the decoder uses another neural network to generate\ncomments word by word. The main difference among previous studies is the way to encode source\ncode. Speci\ufb01cally, Iyer et al. [2016] and Hu et al. 
[2018a,b] modeled source code as a sequence of\ntokens, composed by original code tokens or abstract syntax tree (AST) nodes obtained by traversing\nASTs in a certain order. On the other hand, Wan et al. [2018] treated a code snippet as an AST. It is\nworth noting that these previous models have both introduced the attention mechanism [Bahdanau\net al., 2015] to learn the alignment between the code and the comment.\n\n\u2217Corresponding authors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFor CG with a general-purpose programming language, the encoder-decoder framework was \ufb01rst\napplied by Ling et al. [2016], where the encoder took a natural language description as input, and\nthe decoder generated the source code. In addition, some researchers used auxiliary information to\nimprove the performance of their models, such as grammar rules [Yin and Neubig, 2017, Rabinovich\net al., 2017, Sun et al., 2018] and structural descriptions [Ling et al., 2016]. These methods all applied\nthe attention mechanism as in CS task. Overall, previous studies on CS and CG were independent of\neach other. None of the previous studies before have considered the relations between the two tasks\nor exploited the relations to improve each other.\nIntuitively, CS and CG are related to each other, i.e., the input of CS is the output of CG, and vice\nversa. We refer to this relation as duality, which provides some utilizable constraints to train the\ntwo tasks. Speci\ufb01cally, from the perspective of probability, given a piece of source code and a\ncorresponding comment, there exists a pair of inverse conditional probabilities between them, bound\nby their common joint probability. From the perspective of the model structure, both tasks take the\nencoder-decoder framework, with source code and comment as both input and output. 
We conjecture\nthat the attention weights from the two models should be as similar as possible because they both\nre\ufb02ect the similarity between the token at one end and the token at the other end. Besides, the CS\nand the CG model require similar abilities in understanding natural language and source code. Thus,\nwe argue that the joint training of the two models can improve the performance of both models,\nespecially when we add some constraints to this duality.\nIn this paper, we design a dual learning framework to train a CS and a CG model simultaneously to\nexploit the duality of them. Besides applying a probabilistic correlation as a regularization term [Xia\net al., 2017] in the loss function, we design a novel constraint about attention mechanism to strengthen\nthe duality. We conduct our experiments on Java and Python projects collected from GitHub used by\nprevious work [Hu et al., 2018a,b, Wan et al., 2018]. Experimental results show that jointly training\ntwo models can help the CS model and CG model outperform the state-of-the-art models.\nThe contributions of our work are shown as follows:\n\n\u2022 To our best knowledge, it is the \ufb01rst time we propose a joint model for automated CS and CG.\nWe unprecedentedly attempt to treat CS and CG as dual tasks and apply a dual framework\nto boost each other.\n\n\u2022 We adopt a probabilistic correlation between CS and CG as a regularization term in loss\nfunction and design a novel constraint to guarantee the similarity of attention weights from\ntwo models in the training process.\n\n2 Related Work\n\nCode summarization (CS), as an essential part of the software development cycle, has attracted a lot\nof recent attention. With the development of deep learning, neural networks are applied in this task\nsuccessfully. Besides Allamanis et al. [2016] using a CNN to generate short and name-like summaries,\nmost of the related work followed the encoder-decoder framework. Iyer et al. 
[2016] used an RNN\nwith attention mechanism as a decoder. Hu et al. [2018a] introduced a machine translation model to\ngenerate summaries for Java methods given the serialized AST. Then they adopted a transfer learning\nmethod to utilize API information to generate code summarization [Hu et al., 2018b]. Wan et al.\n[2018] designed a tree RNN, which leverages code structure information, and applied a reinforcement\nlearning framework to build the model. Note that since the tree-based module must take a tree (e.g.,\nAST) as input and different training samples have different tree structures, it is hard to carry out batch\nprocessing in the implementation.\nAs a fundamental part of software development, code generation (CG) is a popular topic among the\n\ufb01eld of software engineering as well. Recently, more and more researchers apply neural networks\nfor general-purpose CG. Ling et al. [2016] introduced a sequence-to-sequence model to generate\ncode from natural language and structured speci\ufb01cations. Yin and Neubig [2017] and Rabinovich\net al. [2017] both combined the grammar rules with the decoder and improved code generation\nperformance. Sun et al. [2018] argued that traditional RNN might not handle the long dependency\nproblem, and they proposed a grammar-based structural CNN. Different from the aforementioned\nmethods, we argue that a simple CG model could boost the CS model in our dual framework, and the\nCG model can also bene\ufb01t from the dual training process.\n\n2\n\n\fFigure 1: The overall dual training framework. A CS model and a CG model are trained jointly in the\nframework.\n\nDual learning, proposed by He et al. [2016], is a reinforcement training process that jointly trains\na primal task and its dual task. Xia et al. 
[2017] considered it as a way of supervised learning\nand designed a probabilistic regularization term to constrain the duality, which has been applied\nsuccessfully in machine translation, sentiment classification, and image recognition. Tang et al.\n[2017] treated question answering and question generation as a dual task and proved the effectiveness\nof dual supervised learning. Li et al. [2018] applied a dual framework to improve the performance of\nvisual question answering with the help of visual question generation. Xiao et al. [2018] introduced\na dual framework to jointly train a question answering model and a question generation model in\nmachine reading comprehension. To the best of our knowledge, we are the first to propose a dual\nlearning framework in CS and CG, and leverage the duality between them. Furthermore, we design a\nnew dual constraint about attention weights to strengthen the correlation between the two tasks.\n\n3 Proposed Approach\n\nOur training framework, illustrated in Figure 1, consists of three parts: a CS model, a CG model,\nand dual constraints. The CS model aims to translate the source code to a comment, while the CG\nmodel maps natural language description to a source code snippet. Dual constraints are used by\nadding regularization terms in the loss function to constrain the duality between two models. We first\nformulate the tasks of CS and CG and then describe the details of our framework.\nWe denote a set of source code snippets as $\\mathcal{X} = \\{x^{(i)}\\}$, where $x^{(i)} = \\{x^{(i)}_1, ..., x^{(i)}_n\\}$ denotes the\ntoken sequence of a code snippet. Each code snippet $x^{(i)}$ has a corresponding natural language\ncomment $y^{(i)}$, where $y^{(i)} = \\{y^{(i)}_1, ..., y^{(i)}_m\\} \\in \\mathcal{Y}$. Thus, the CS model learns a mapping $f_{xy}$ from\n$\\mathcal{X}$ to $\\mathcal{Y}$ and the CG model learns a reverse mapping $f_{yx}$ from $\\mathcal{Y}$ to $\\mathcal{X}$. 
Different from previous work,\nwe regard the two tasks as a dual learning problem and train them jointly.\n\n3.1 Code Summarization Model\n\nThe CS model takes a code snippet $x^{(i)}$ as input to generate a comment. Considering the efficiency,\nwe choose the sequence-to-sequence (Seq2Seq) neural network with an attention mechanism as our\nmodel. The model contains two parts: an encoder and a decoder.\nThe encoder first maps the token of a source code snippet into a word embedding. Words which do\nnot appear in the vocabulary are defined as unknown. Then to leverage the contextual information, we\nuse a bidirectional LSTM to process the sequence of the word embeddings. We concatenate hidden\nstates of the time step i from two directions as the representation of the i-th token $h_i$ in source code.\nThe decoder is another LSTM with an attention mechanism between encoder and decoder, which\ngenerates a word $y_t$ based on the representation of whole source code snippet and the previous words\nin the comment. This process is formulated as\n\n$$P(y|x) = \\prod_{t=1}^{m} P(y_t | y_{<t}, x) \\qquad (1)$$\n\nThe last hidden state from the encoder is used to initialize the hidden state of the decoder which\ncomputes the hidden state as $h'_t = \\mathrm{LSTM}(y'_{t-1}, h'_{t-1})$, where $y'_{t-1}$ is defined as the concatenation of\nthe embedding of $y_{t-1}$ and the attention vector $a_{t-1}$. 
$h'_t$ is used to compute the attention weights as\n\n$$\\tilde{\\alpha}_{ti} = {h'_t}^{\\top} W h_i \\qquad (2)$$\n\n$$\\alpha_{ti} = \\frac{\\exp\\{\\tilde{\\alpha}_{ti}\\}}{\\sum_j \\exp\\{\\tilde{\\alpha}_{tj}\\}} \\qquad (3)$$\n\nwhere W is a group of trainable weights. The attention score $\\alpha_{ti}$ measures the similarity between\nthe token of comment $y_t$ and the token of code snippet $x_i$. We compute the context vector $c_t$ as\n$c_t = \\sum_j \\alpha_{tj} h_j$. Then $c_t$ and $h'_t$ are fed into a feed-forward neural network to get the attention vector\n$a_t$ which is fed into a softmax layer to get the prediction of the token $y_t$ in the comment. We use\nnegative log-likelihood as the training objective. The loss function of a training sample is\n\n$$l_{xy} = -\\frac{1}{m} \\sum_{t=1}^{m} \\log P(y_t | y_{<t}, x) \\qquad (4)$$\n\n3.2 Code Generation Model\n\nUnlike work depending on the grammar rules, our CG model, which can be regarded as the inverse\nCS task, predicts the code snippet only based on natural language description. We use the same\nstructure as the CS model, i.e., a Seq2Seq neural network. The encoder is a bidirectional LSTM,\nand the decoder is another LSTM with an attention mechanism. 
Different from the CS model, the\ndecoder in the CG model learns a conditional probability as\n\n$$P(x|y) = \\prod_{t=1}^{n} P(x_t | x_{<t}, y) \\qquad (5)$$\n\nAnd the training objective is formulated as\n\n$$l_{yx} = -\\frac{1}{n} \\sum_{t=1}^{n} \\log P(x_t | x_{<t}, y) \\qquad (6)$$\n\nSince the source code contains lots of identifiers, the vocabulary size of the source code is usually\nmuch larger than that of comments, which results in more parameters in the output layer of the CG\nmodel than in the output layer of the CS model.\n\n3.3 Dual Training Framework\n\nThe dual training framework includes three components: a CS model, a CG model and two dual\nregularization terms to constrain the duality of the two models, which are enlightened by the\nprobabilistic correlation and the symmetry of attention weights between two models.\nGiven a pair of $\\langle x, y \\rangle$, the CS model and the CG model have probabilistic correlation, because they\nare both connected to the joint probability $P(x, y)$ which can be computed in two equivalent ways.\n\n$$P(x, y) = P(x)P(y|x) = P(y)P(x|y) \\qquad (7)$$\n\nSince the CS model, parameterized by $\\theta_{xy}$, is built to learn the conditional probability $P(y|x; \\theta_{xy})$,\nand the CG model, parameterized by $\\theta_{yx}$, is built to learn the conditional probability $P(x|y; \\theta_{yx})$,\nwe can jointly train these two models by minimizing their loss functions subject to the constraint\nof Eqn. 7. We build the constraint of Eqn. 
7 to a penalty term by using the method of Lagrange\nmultipliers, and thus our regularization term is\n\n$$l_{dual} = [\\log \\hat{P}(x) + \\log P(y|x; \\theta_{xy}) - \\log \\hat{P}(y) - \\log P(x|y; \\theta_{yx})]^2 \\qquad (8)$$\n\nwhere $\\hat{P}(x)$ and $\\hat{P}(y)$ are marginal distributions, which can be modeled by their language models,\nrespectively. By minimizing this dual loss function, the probabilistic connection between the two\nmodels could be strengthened, which is helpful to the training process.\n\nAlgorithm 1 Algorithm Description\n\nInput: Language models $\\hat{P}(x)$ and $\\hat{P}(y)$ for any $x \\in \\mathcal{X}$ and $y \\in \\mathcal{Y}$; hyper parameters $\\lambda_{dual1}$, $\\lambda_{dual2}$, $\\lambda_{att1}$\nand $\\lambda_{att2}$; optimizers $opt_1$ and $opt_2$\nrepeat\n  Get a minibatch of k pairs $\\langle (x_i, y_i) \\rangle_{i=1}^{k}$;\n  Calculate the gradients for $\\theta_{xy}$ and $\\theta_{yx}$.\n    $G_{xy} = \\nabla_{\\theta_{xy}} (1/k) \\sum_{i=1}^{k} [l_{xy}(f_{xy}(x_i; \\theta_{xy}), y_i) + \\lambda_{dual1} l_{dual}(x_i, y_i; \\theta_{xy}, \\theta_{yx}) + \\lambda_{att1} l_{att}(x_i, y_i; \\theta_{xy}, \\theta_{yx})]$;\n    $G_{yx} = \\nabla_{\\theta_{yx}} (1/k) \\sum_{i=1}^{k} [l_{yx}(f_{yx}(y_i; \\theta_{yx}), x_i) + \\lambda_{dual2} l_{dual}(x_i, y_i; \\theta_{xy}, \\theta_{yx}) + \\lambda_{att2} l_{att}(x_i, y_i; \\theta_{xy}, \\theta_{yx})]$;\n  Update $\\theta_{xy}$ and $\\theta_{yx}$\n    $\\theta_{xy} \\leftarrow opt_1(\\theta_{xy}, G_{xy})$, $\\theta_{yx} \\leftarrow opt_2(\\theta_{yx}, G_{yx})$\nuntil models converged\n\nNaturally, structures of the CS model and CG model have symmetries, meaning that the output of\none model is the input of the other model, and vice versa. Thus, we can introduce this property to the\ndual training process. In this paper, we focus on the symmetry of attention weights and argue that\nthe alignment between tokens in the source code snippet and tokens in the comment has symmetry,\nwhich could be measured by attention weights. Specifically, given the comment \u201cfind the position of\na character inside a string\u201d and the corresponding source code \u201cstring . 
find ( character )\u201d, no matter\nhow the generating direction goes, the word \u201cfind\u201d in the comment should always be aligned to the\nsame token \u201cfind\u201d in the source code. Hence, we design a regularization term to leverage this duality.\nThe matrices of attention weights before normalization in the CS and the CG model, easily obtained\nby Eqn. 2, are denoted as $A_{xy} \\in \\mathbb{R}^{n \\times m}$ and $A_{yx} \\in \\mathbb{R}^{m \\times n}$. The element $\\alpha_{ij}$ in $A_{xy}$ and the element\n$\\alpha_{ji}$ in $A_{yx}$ both measure the similarity between the i-th token in the source code with the j-th token\nin the comment. For the i-th token in the source code, we obtain its attention weights $b_i$, a probability\ndistribution, from the CS model by $b_i = \\mathrm{softmax}(A^i_{xy})$, where $A^i_{xy}$ is the i-th row vector of $A_{xy}$.\nThen the attention weights from CG model is $b'_i = \\mathrm{softmax}(A^i_{yx})$, where $A^i_{yx}$ is the i-th column\nvector of $A_{yx}$. Finally, we apply the Jensen\u2013Shannon divergence [Fuglede and Topsoe, 2004], a\nsymmetric measurement of similarity between two probability distributions, to constrain the distance\nbetween these two attention weights. Thus, the penalty term $l_1$ for tokens in the source code is\n\n$$l_1 = \\frac{1}{2n} \\sum_{i=1}^{n} [D_{KL}(b_i \\| \\frac{b_i + b'_i}{2}) + D_{KL}(b'_i \\| \\frac{b_i + b'_i}{2})] \\qquad (9)$$\n\nwhere $D_{KL}$ is the Kullback\u2013Leibler divergence, defined as $D_{KL}(p \\| q) = \\sum_x p(x) \\log \\frac{p(x)}{q(x)}$, which\nmeasures how one probability distribution p diverges from the other probability distribution q.\nLikewise, we can obtain the penalty term $l_2$ for tokens in comments in the same manner. Consequently,\nthe final regularization term about attention weights $l_{att}$ is the sum of $l_1$ and $l_2$. Moreover, we consider\nthat the attention weights from one model could be regarded as the soft label of the other model. The\noverall algorithm is described in Algorithm 1, and the complexity of our model is the same as the\nSeq2Seq neural network. 
Our implementation is based on PyTorch (https://pytorch.org/).\n\n4 Experiments\n\n4.1 Datasets\n\nWe conduct our CS and CG experiments on two datasets, including a Java dataset [Hu et al., 2018b]\nand a Python dataset [Wan et al., 2018]. The statistics of the two datasets are shown in Table 1.\n\nTable 1: Statistics of datasets\n\nDataset | Java | Python\nTrain | 69,708 | 55,538\nValidation | 8,714 | 18,505\nTest | 8,714 | 18,502\nAvg. tokens in comment | 17.7 | 9.49\nAvg. tokens in code | 98.8 | 35.6\n\nThe Java dataset, containing Java methods extracted from 2015 to 2016 Java projects, is collected\nfrom GitHub3. We process the dataset following Hu et al. [2018b]. The first sentence of Javadoc is\nextracted as the natural language description, which describes the functionality of the Java method.\nEach data sample is organized as a pair of \u27e8method, comment\u27e9.\nFor our language model applied to Java dataset, we can use various large scale monolingual corpus to\npretrain. In this paper, we use the Java projects from 2009 to 2014 on GitHub [Hu et al., 2018a] as\nour dataset. We separate the original datasets into a code-only dataset and a comment-only dataset.\nThe language model of source code takes the Java methods in the code-only dataset as input, whereas\nthe language model of comment takes the comments in the comment-only dataset as input. Each\ndataset is split into training, test and validation sets by 8:1:1.\nThe original Python dataset is collected by Barone and Sennrich [2017], consisting of about 110K\nparallel samples and about 160K code-only samples. The parallel corpus is used to evaluate CS task\nand CG task. Since Wan et al. [2018] used the dataset in their experiments, we follow their approach\nto process this dataset.\nFor the language model of source code in Python, we use the code-only samples in the original\ndataset to pretrain. 
We divide the samples into training, test, and validation sets by 8:1:1. However,\nfor the language model of comments in Python, we are not able to \ufb01nd enough monolingual corpus to\npretrain. Considering the patterns of comments in Python dataset and in Java dataset are very similar,\nwe use the comment-only corpus from Java dataset as an alternative. Experimental results turn out\nthat the language model pretrained in this way is bene\ufb01cial to our dual training.\n\n4.2 Hyperparameters\n\nWe set the token embeddings and LSTM states both to 512 dimensions for the CS model and set the\nLSTM states to 256 dimensions for the CG model to \ufb01t GPU memory. Afterward, to initialize the CS\nand the CG model in our dual training framework, we use warm-start CS and CG models, whose\nparameters are optimized by Adam [Kingma and Ba, 2015] with the initial learning rate of 0.002.\nWarm-starting means that we pretrained CS and CG models separately, then applied dual constraints\nto the two models for joint training, which can speed up the convergence process of joint training.\nThe dropout rates of all models are set to 0.2 and mini-batch sizes of all models to 32. For dual\nlearning process, we observe that the SGD is appropriate with initial learning rate 0.2. We halve the\nlearning rate if the performance of the validation set decreases once and freeze the token embeddings\nif the performance decreases again. According to the performance of the validation set, the best dual\nmodel is selected after joint training 30 epochs in the experiments. The \u03bbdual1 and \u03bbdual2 are set to\n0.001 and 0.01 respectively, and the \u03bbatt1 and \u03bbatt2 are set to 0.01 and 0.1. We use beam search in\nthe inference process, whose size is set to 10. 
Furthermore, the vocabulary sizes of the code in Java\nand Python dataset are set to 30000 and 50000, and maximum lengths of code and comments in Java\nare set to 150 and 50 respectively.\nOur language models for source code and comments both employ 3 LSTM layers. Token embeddings\nand LSTM states are both 300-dimensional. Batch size is set to 40 and dropout rate to 0.3. We apply\ngradient clipping to prevent gradients from becoming too large. The vocabulary consists of all words\nthat have a minimum frequency of 3. Adam is chosen as our optimizer, and the initial learning rate is\nset to 0.002. During the dual training process, the parameters of language models are fixed. Language\nmodels are only used to calculate marginal probabilities $\\hat{P}(x)$ and $\\hat{P}(y)$ in Eqn. 8.\n\n3https://github.com/\n\nTable 2: The overall performance of our CS models compared with baselines\n\nMethods | Java BLEU | Java METEOR | Java ROUGE-L | Python BLEU | Python METEOR | Python ROUGE-L\nCODE-NN | 27.60 | 12.61 | 41.10 | 17.36 | 9.288 | 37.81\nDeepCom | 39.75 | 23.06 | 52.67 | 20.78 | 9.979 | 37.35\nTree2Seq | 37.88 | 22.55 | 51.50 | 20.07 | 8.957 | 35.64\nRL+Hybrid2Seq | 38.22 | 22.75 | 51.91 | 19.28 | 9.752 | 39.34\nAPI+CODE | 41.31 | 23.73 | 52.25 | 15.36 | 8.571 | 33.65\nBasic Model | 41.01 | 23.26 | 51.64 | 20.47 | 10.38 | 38.77\nDual Model | 42.39 | 25.77 | 53.61 | 21.80 | 11.14 | 39.45\n\nTable 3: BLEU scores and percentage of valid code (PoV) on CG task\n\nMethods | Java BLEU | Java PoV | Python BLEU | Python PoV\nSEQ2TREE | 13.80 | 22.6% | 4.472 | 22.7%\nBasic Model | 10.86 | 19.6% | 10.43 | 41.8%\nDual Model | 17.17 | 27.4% | 12.09 | 51.9%\n\n5 Experimental Results\n\nMetrics. We evaluate the performance of CS task based on three metrics, BLEU [Papineni et al.,\n2002], METEOR [Banerjee and Lavie, 2005] and ROUGE-L [Lin, 2004]. These metrics all measure\nthe quality of generated comments and can represent the human\u2019s judgment. 
BLEU is de\ufb01ned as the\ngeometric mean of n-gram matching precision scores multiplied by a brevity penalty to prevent very\nshort generated sentences. We choose sentence level BLEU as our metric as in Hu et al. [2018a,b].\nMETEOR combines unigram matching precision and recall scores using harmonic mean and employs\nsynonym matching. ROUGE-L computes the length of longest common subsequence between\ngenerated sentence and reference and focuses on recall scores. For CG task, we choose BLEU as\nour performance metric because accuracies on both datasets are too low. Ling et al. [2016], Yin\nand Neubig [2017] and Sun et al. [2018] discussed the effectiveness of the BLEU in the CG task.\nThey treated the BLEU as an appropriate proxy for measuring semantics and left exploring more\nsophisticated metrics as future work. Furthermore, to evaluate how much of the generated code is\nvalid, we calculate the percentage of code that can be parsed into an AST.\nBaselines. We compare our CS model\u2019s performance with the following \ufb01ve baselines.4 It has been\nproved that the attention mechanism is very helpful to comment generation [Hu et al., 2018a], so all\nbaselines introduced this module. CODE-NN [Iyer et al., 2016] uses token embeddings to encode\nsource code and uses an LSTM to decode. To exploit the structural information, DeepCom [Hu et al.,\n2018a] takes a sequence of tokens as input, which is obtained through traversing the AST with a\nstructure-based traversal method, while Tree2Seq [Eriguchi et al., 2016] directly uses a tree-based\nLSTM as an encoder. RL+Hybrid2Seq [Wan et al., 2018] is a model trained with reinforcement\nlearning, whose encoder is the combination of an LSTM and an AST-based LSTM. We further\ncompare with API+CODE [Hu et al., 2018b] without transferred API knowledge, which introduces\nAPI information when generating comments. 
The proportion of source code in Python having no APIs\nis about 20%; we set the prediction of test samples having no APIs to null. Except for Tree2Seq and\nRL+Hybrid2Seq (They used Word2Vec to pretrain their token embeddings), the token embeddings for\nother models are randomly initialized. For DeepCom, we use a bi-LSTM as the encoder to ensure the\nnumber of parameters is comparable to that of our model. For API+CODE, we set the embeddings\nand GRU states to 512 dimensions.\nFor CG model, we compare our dual model with the individually trained basic model, and also\ncompare to SEQ2TREE [Dong and Lapata, 2016], which modeled the source code as a tree. Since\nour CG model only takes natural language description as input, to make a fair comparison, we do not\ncompare with other models that take grammar rules [Yin and Neubig, 2017, Rabinovich et al., 2017,\nSun et al., 2018] or structured specification [Ling et al., 2016] as additional input.\n\n4For papers that provide the source code, we directly reproduce their methods on two datasets. Otherwise,\nwe rebuild their models with reference to the papers.\n\nTable 4: Ablation study of different settings on CS task. Model (M) 1 is the basic model of\nindependent training.\n\nM | Probabilistic Duality | Attention Duality | Java BLEU | Java METEOR | Java ROUGE-L | Python BLEU | Python METEOR | Python ROUGE-L\n1 | - | - | 41.01 | 23.26 | 51.64 | 20.47 | 10.38 | 38.77\n2 | \u2713 | - | 41.73 | 25.54 | 53.60 | 21.66 | 10.81 | 38.83\n3 | - | \u2713 | 41.96 | 25.80 | 53.57 | 21.57 | 10.91 | 39.07\n4 | \u2713 | \u2713 | 42.39 | 25.77 | 53.61 | 21.80 | 11.14 | 39.45\n\nOverall Results. The overall performance of the CS model is shown in Table 2. Results show that\nour dual model obviously outperforms all the baselines on three metrics at the same time. 
To test\nwhether the improvements of our dual model over baselines are statistically signi\ufb01cant, we applied the\nWilcoxon Rank Sum test (WRST) [Wilcoxon, 2006], and all the p-values are less than 0.01, indicating\nsigni\ufb01cant increases. We also used Cliff\u2019s Delta [Cliff, 1996] to measure the effect size, and the\nvalues are non-negligible. From results of baseline models, we can see that our independently trained\nbasic model is simple and effective, yet is still inferior to the dual model, showing the effectiveness of\njoint training process. Compared to the sequence-based models, DeepCom and our basic model, the\ntree-based models (Tree2Seq and RL+Hybrid2Seq) do not achieve strong improvements of BLEU\nand METEOR scores on the two datasets. We suppose the reason for this phenomenon is that after\nthe source code is converted to AST, the number of nodes becomes very large, resulting in increased\nnoise. According to statistics, the average token numbers of Java and Python datasets have been\nmore than doubled after parsing. In this case, due to the method\u2019s handling of the custom identi\ufb01er\ncontained in the code, DeepCom has achieved good results. The results of API+CODE indicate that\nAPI knowledge in Java dataset is bene\ufb01cial to the comment generation. Though we do not focus on\nintegrating structural information and API knowledge in the current work, we will leverage them in\nfuture studies considering their potential for boosting performance.\nSince CS and CG models are trained at the same time and the parameters of the two models are\nseparate after the joint training, i.e., the two models solve their respective tasks separately after the\njoint training, the number of parameters of each dual model is the same as that of the basic model.\nThe number of parameters for all models on Java CS task is as follows: CODE-NN 34M, DeepCom\n53M, Tree2Seq 52M, RL+Hybrid2Seq 80M, API+CODE 80M, Basic model 53M and Dual model\n53M. 
The relationship between the number of all models\u2019 parameters is consistent in Java and Python\nexperiments on CS task.\nThe results of CG model are shown in Table 3. Although the BLEU score is very low, dual training\ncan still improve the performance of individually trained basic model, proving the effectiveness of\ndual training. It is observable that SEQ2TREE\u2019s performance is better than our basic model\u2019s on Java\ndataset, which demonstrates the ability of SEQ2TREE to leverage code hierarchies. However, its\nperformance is far worse than our basic model\u2019s on Python dataset. The reason is that SEQ2TREE\nbuilds up tree structure according to brackets contained in the code, while hierarchical levels in\nPython code are segmented by line breaks and indents. Besides, dual training can also increase the\npercentage of valid code on both Java and Python datasets.\nComponent Analysis. We veri\ufb01ed the role of dual regularization terms in joint training on CS task.\nThe experimental results are shown in Table 4. Model 1 is the basic model of independent training.\nWe can see that the introduction of CG model to train models jointly and constrain the duality between\nthe two models can improve the performance of the CS model, indicating the utilizability of the\nrelation between the two tasks. Speci\ufb01cally, we \ufb01nd that applying a regularization term on attention\nis slightly more effective than applying a regularization term on probability to improve model\u2019s\nperformance. This is because, intuitively, the attention regularization term has a more powerful and\nmore explicit constraint than the probability regularization term. Naturally, combining the two will\nfurther enhance our CS model\u2019s performance. To test whether the improvements of two constraints\nover one constraint on BLEU are signi\ufb01cant, we applied the WRST, and all the p-values are less than\n0.05, indicating signi\ufb01cant increases. 
Cliff\u2019s Delta values also show non-negligible improvements.\nTo better understand the role of the regularization terms, we also conduct experiments on the\nPython dataset to observe how the regularization terms change. In particular, after applying the two\nconstraints, the value of the regularization term on probability is reduced from 129.1 to 125.9, and the\nvalue of the regularization term on attention is reduced from 10.51 to 4.757, indicating that the\nrelation between the two models is enhanced after joint training.\n\nFigure 2: Qualitative analysis of our dual model. (a) CS task: example of the generated comment given\nJava methods. (b) CG task: example of the generated Java methods given natural language description. (c)\nBasic model: attention weights of basic model on CS task. (d) Dual model: attention weights of dual model on CS task.\n\nQualitative Analysis and Visualization. Figure 2a and Figure 2b show examples of the CS task and\nthe CG task. We can see that the code and comment generated by the dual model are semantically\nvery close to the human-written ones. Figure 2c and Figure 2d show the attention weights of a sample\nin the CS basic model and dual model. We can see that for words in the comment that are not aligned\nwith the code, such as \u201ca\u201d, the attention weights produced by the basic model focus on a few specific words.\nIn contrast, the attention weights produced by the dual model are smoother, and therefore we can\nget a better code representation for words that are not aligned. We think this is because the similarities\nbetween these tokens in the comment and the tokens in the code differ between the CS and CG\nmodels. Since we add the attention constraint, the attention distributions of the two models become close,\nmaking the attention weights of the dual model smoother than those of the basic model in the CS task.\nDiscussion on Grammar Constraints.
To compare the dual model with a model that takes\ngrammar rules as input for the CG task, we evaluated SNM [Yin and Neubig, 2017] on the Python dataset.\nSNM explicitly introduces the constraints of grammar rules when generating ASTs. The BLEU score\nfor SNM is 8.095, lower than that of our basic model, indicating that the CG task on this dataset is\nvery challenging. In particular, all predictions of SNM are valid, whereas the percentage of valid code\ngenerated by the dual model is low (Table 3). Hence, the current dual model would benefit from\nconstraining the generated code to satisfy the grammar rules, which would increase the percentage of valid\ncode. Note that dual learning is a paradigm for jointly training CS and CG models. Integrating grammar\nrules into one model does not affect the dual relationship between the two models, and we leave it as\nour future work.\n\n6 Conclusion\n\nIn this paper, we aim to build a framework which uses CG as a dual task of CS. To this end,\nwe propose a dual learning framework to jointly train CG and CS models. To enhance the\nrelationship between the two tasks in the joint training process, besides applying the constraint on\nprobability, we propose a constraint that exploits the nature of the attention mechanism.\nTo confirm the effect of our model, we conduct experiments on both Java and Python\ndatasets. The experimental results show that after the dual training process, the CS model and the CG\nmodel can surpass the existing state-of-the-art methods on both datasets. In the future, we plan to\nconsider more information to further improve the performance of the joint model, e.g., we would like\nto take the grammar rules as input to improve the performance of CG.\n\n7 Acknowledgments\n\nWe thank all reviewers for their constructive comments and Fang Liu for discussion on the manuscript. This\nresearch is supported by the National Key R&D Program under Grant No.
2018YFB1003904, and\nthe National Natural Science Foundation of China under Grant Nos. 61832009, 61620106007 and\n61751210.\n\nFigure 2a (CS task):\nSource Code: public static void closeQuiet(@Nullable Closeable closeable){if (closeable != null){try{ closeable.close();}catch(IOException ignored){}}}\nHuman-Written: closes resource without reporting any error.\nDual Model: quietly closes given closeable without reporting.\n\nFigure 2b (CG task):\nComment: prints an integer to standard output and flushes standard output.\nHuman-Written: public static void print(Object x){out.print(x); out.flush();}\nDual Model: public static void print(int x){out.print(x); out.flush();}\n\nFigure 2c and Figure 2d (attention weights), axis tokens: def fixed_ip_* context address return IMPL fixed_ip_* context address; get a fixed ip by address or raise if it does not exist.\n\nReferences\n\nM. Allamanis, H. Peng, and C. A. Sutton. A convolutional attention network for extreme summarization\nof source code. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages\n2091\u20132100. JMLR.org, 2016.\n\nD. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and\ntranslate. In ICLR, 2015.\n\nR. Balzer. A 15 year perspective on automatic programming. IEEE Transactions on Software\nEngineering, (11):1257\u20131268, 1985.\n\nS. Banerjee and A. Lavie. METEOR: an automatic metric for MT evaluation with improved correlation\nwith human judgments. In IEEvaluation@ACL, pages 65\u201372. Association for Computational\nLinguistics, 2005.\n\nA. V. M. Barone and R. Sennrich. A parallel corpus of python functions and documentation strings\nfor automated code documentation and code generation. In IJCNLP(2), pages 314\u2013319. Asian\nFederation of Natural Language Processing, 2017.\n\nK. Cho, B. van Merrienboer, \u00c7. G\u00fcl\u00e7ehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y.
Bengio.\nLearning phrase representations using RNN encoder-decoder for statistical machine translation. In\nEMNLP, pages 1724\u20131734. ACL, 2014.\n\nN. Cliff. Ordinal methods for behavioral data analysis. 1996.\n\nS. C. B. de Souza, N. Anquetil, and K. M. de Oliveira. A study of the documentation essential to\nsoftware maintenance. In SIGDOC, pages 68\u201375. ACM, 2005.\n\nL. Dong and M. Lapata. Language to logical form with neural attention. In ACL (1). The Association\nfor Computer Linguistics, 2016.\n\nA. Eriguchi, K. Hashimoto, and Y. Tsuruoka. Tree-to-sequence attentional neural machine translation.\n2016.\n\nB. Fuglede and F. Topsoe. Jensen-shannon divergence and hilbert space embedding. In Information\nTheory, 2004. ISIT 2004. Proceedings. International Symposium on, page 31. IEEE, 2004.\n\nD. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma. Dual learning for machine translation. In\nNIPS, pages 820\u2013828, 2016.\n\nX. Hu, G. Li, X. Xia, D. Lo, and Z. Jin. Deep code comment generation. In ICPC, pages 200\u2013210.\nACM, 2018a.\n\nX. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin. Summarizing source code with transferred API\nknowledge. In IJCAI, pages 2269\u20132275. ijcai.org, 2018b.\n\nS. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. Summarizing source code using a neural attention\nmodel. In ACL (1). The Association for Computer Linguistics, 2016.\n\nD. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\nY. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou. Visual question generation\nas dual task of visual question answering. In CVPR, pages 6116\u20136124. IEEE Computer Society,\n2018.\n\nC.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches\nOut, 2004.\n\nW. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kocisk\u00fd, F. Wang, and A. W. Senior. Latent\npredictor networks for code generation. In ACL (1).
The Association for Computer Linguistics,\n2016.\n\nK. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine\ntranslation. In ACL, pages 311\u2013318. ACL, 2002.\n\nM. Rabinovich, M. Stern, and D. Klein. Abstract syntax networks for code generation and semantic\nparsing. In ACL (1), pages 1139\u20131149. Association for Computational Linguistics, 2017.\n\nZ. Sun, Q. Zhu, L. Mou, Y. Xiong, G. Li, and L. Zhang. A grammar-based structural cnn decoder for\ncode generation. arXiv preprint arXiv:1811.06837, 2018.\n\nD. Tang, N. Duan, T. Qin, Z. Yan, and M. Zhou. Question answering and question generation as dual\ntasks. arXiv preprint arXiv:1706.02027, 2017.\n\nY. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu. Improving automatic source code\nsummarization via deep reinforcement learning. In ASE, pages 397\u2013407. ACM, 2018.\n\nF. Wilcoxon. Some rapid approximate statistical procedures. Annals of the New York Academy of\nSciences, 52:808\u2013814, 12 2006. doi: 10.1111/j.1749-6632.1950.tb53974.x.\n\nY. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T. Liu. Dual supervised learning. In ICML, volume 70\nof Proceedings of Machine Learning Research, pages 3789\u20133798. PMLR, 2017.\n\nH. Xiao, F. Wang, Y. Feng, and J. Zheng. Dual ask-answer network for machine reading comprehension.\narXiv preprint arXiv:1809.01997, 2018.\n\nP. Yin and G. Neubig. A syntactic neural model for general-purpose code generation. In ACL (1),\npages 440\u2013450.
Association for Computational Linguistics, 2017.\n", "award": [], "sourceid": 3531, "authors": [{"given_name": "Bolin", "family_name": "Wei", "institution": "Peking University"}, {"given_name": "Ge", "family_name": "Li", "institution": "Peking University"}, {"given_name": "Xin", "family_name": "Xia", "institution": "Monash University"}, {"given_name": "Zhiyi", "family_name": "Fu", "institution": "Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education"}, {"given_name": "Zhi", "family_name": "Jin", "institution": "Key Lab of High Confidence Software Technologies (Peking University), Ministry o"}]}