Reviews: Code Generation as a Dual Task of Code Summarization

Reviewer 1

The idea of leveraging duality has been applied in other contexts, but never in the context of CS/CG. The related work seems to imply (though I wasn't 100% sure this was the claim) that even in those other domains, the dual constraint on attention weights is new. Is that correct? The idea in the paper is original and well explained. The improvements over the baselines seem small but real. One aspect of the evaluation that I did not like is that it dismisses approaches that take additional inputs, such as grammars. If adding a grammar leads to a substantial performance boost, then this work is comparing against a sub-standard baseline. Given that grammars already exist for all the languages tested, requiring a grammar does not seem like a significant extra burden on the user.

Post rebuttal: If I understand your rebuttal correctly, you are making two arguments regarding the comparison with techniques that use grammars: a) your tool is already competitive with tools that use grammars (at least based on the one experiment reported in the rebuttal), and b) your technique has a lot of room for improvement by using grammars, since a large fraction of what you produce is not valid code. I think these are important points to make in the paper, and adding the data from the rebuttal to the paper (or at least the supplement) would strengthen it considerably.

Reviewer 2

This paper proposes to simultaneously train two models, one for code generation and one for code summarization. The main observation is that these two tasks are dual, so the models can be constrained during joint training to take advantage of the duality. The constraints relate to the probabilistic correlation between the two models and the symmetry of their attention. The models are based on traditional Seq2Seq models. The probabilistic correlation of the two models is used to define a regularization term, while the symmetry of the two tasks is reflected in regularization terms based on the attention weight matrices. The paper compares this approach with state-of-the-art approaches on two different datasets (Java and Python) and shows improvements in the 1-2% range for scores such as BLEU, METEOR, and ROUGE. While the duality observation is interesting, the improvement in performance is limited. The ablation studies in the paper show that each of the two duality constraints brings a small performance improvement. The paper is well written and straightforward to follow (some suggestions for improvement below).

UPDATE: Dual-task training for CS/CG is neat, and in light of the clarifications in the authors' feedback, I'm increasing my score 6->7. Could you please include the following in the paper (from the authors' feedback): the discussion on grammars as additional input, the results on valid code, the clarification on how the language model is used, how the warm state for the two models is achieved, and the diagram with the architecture. I think all these clarifications will increase the quality of the paper.
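The probabilistic correlation mentioned in this review can be illustrated with a small sketch. It assumes the regularizer penalizes the squared gap between the two factorizations of the joint probability, log P(x) + log P(y|x) and log P(y) + log P(x|y), in the style of dual supervised learning; the reviews do not give the paper's exact formula, so the function name `duality_gap` and the toy probabilities are hypothetical.

```python
import math

def duality_gap(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    """Squared violation of log P(x) + log P(y|x) = log P(y) + log P(x|y).

    x stands for a code snippet and y for its comment. The squared-gap form
    is an assumption for illustration, not the paper's confirmed loss.
    """
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap ** 2

# Toy numbers: marginals P(x), P(y) and a joint P(x, y).
p_x, p_y, p_xy = 0.2, 0.3, 0.06

# Both conditionals derived from the same joint: the gap vanishes.
consistent = duality_gap(math.log(p_x), math.log(p_xy / p_x),
                         math.log(p_y), math.log(p_xy / p_y))

# The summarization model overestimates P(y|x): the gap is penalized.
inconsistent = duality_gap(math.log(p_x), math.log(0.5),
                           math.log(p_y), math.log(p_xy / p_y))
```

In joint training, a term of this kind would be added to both models' losses, with the marginals supplied by pre-trained language models, which is where the language model discussed in the authors' feedback would enter.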

Reviewer 3

This paper presents an interesting approach that uses the duality relationship between Code Summarization (CS) and Code Generation (CG) to improve the performance of a neural model on both tasks simultaneously. The main idea is to exploit the fact that the conditional probability of a comment given some source code and the conditional probability of source code given a comment are related through their common joint probability. Moreover, since both CS and CG use an attention-based seq2seq architecture, the paper also proposes an additional constraint that the two attention vectors have similar distributions, i.e. the attention weight of comment word i to source token j for the CS task is similar to the attention weight of the same pair for the CG task. The method is evaluated on two datasets of Java and Python program/comment pairs, and the dual training outperforms several baseline methods, including the same architecture trained without dual constraints (basic model). Overall, I liked the idea of exploiting the dual relationship between the code summarization and code generation tasks. The proposed dual regularization terms, relating to the factorization of conditional probability distributions and the similarity of attention matrices, are quite elegant. The experimental results also improve significantly on the baseline approaches, and the ablation results show that both duality constraints are useful. One thing that wasn’t clear to me was which parts of the dual relationship modeling are novel and which parts are taken from previous work. For example, Xia et al. [2017] proposed a supervised learning approach for imposing duality constraints and presented a similar probabilistic duality constraint (similar to Equation 8). The learning algorithm also seems similar, except for the addition of the second regularization constraint.
Is the dual regularization constraint on the similarity of attention vectors (Equation 9) the only novel contribution of the paper?

In the experiments, is Basic Model the same as the seq2seq with attention model? In the DeepCom [Hu et al. 2018a] paper, the DeepCom model outperforms both Seq2Seq and Attention-based Seq2Seq models on summarization of Java methods. Can the authors present some insights on why the basic model might be outperforming the DeepCom model?

It was also not clear whether all the models compared in Table 2 have a comparable number of trainable parameters. In the dual task there are essentially twice as many parameters, so it would be good to state how the baseline models are compensated so that they have an equal number of parameters.

In section 4.1, the text states that the original Python dataset consists of 110M parallel samples and 160M code-only samples, and that the parallel corpus is used for evaluating the CS and CG tasks. But in Table 1, there seem to be far fewer samples (18.5k) for the Python dataset. I was also wondering why the comments from the parallel corpus are not used to pre-train the language model, rather than using the language model for the Java dataset.

In section 4.2, the text states that the best model is selected after 30 epochs in the experiments. Is this the case for the basic model or the dual model? Also, is this the case for the Java or the Python dataset?

Minor:
page 1: Varies of → Various
page 1: Specifically, z → Specifically
page 1: studies before has → studies before have
page 2: attracts lots of researchers to work on → (possibly something like) has attracted a lot of recent attention
page 4: larger than it of → larger than that of
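The attention constraint described in this review (the weight of comment word i on source token j in CS should resemble the weight of the same pair in CG) can be sketched as follows. The choice of Jensen-Shannon divergence and the renormalization of each matrix into a single distribution are assumptions made for illustration; the reviews do not specify how Equation 9 measures similarity, and all function names here are hypothetical.

```python
import math

def to_distribution(matrix):
    """Flatten a matrix of non-negative weights and rescale it to sum to one."""
    flat = [w for row in matrix for w in row]
    total = sum(flat)
    return [w / total for w in flat]

def kl_divergence(p, q):
    """KL(p || q); entries with p_k == 0 contribute nothing."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def js_divergence(p, q):
    """Symmetric, bounded divergence between two distributions."""
    mid = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

def attention_symmetry_penalty(att_cs, att_cg):
    """Compare CS attention with transposed CG attention.

    att_cs[i][j]: weight of comment word i on code token j (CS model).
    att_cg[j][i]: weight of code token j on comment word i (CG model).
    """
    att_cg_t = [list(col) for col in zip(*att_cg)]  # re-index as [i][j]
    return js_divergence(to_distribution(att_cs), to_distribution(att_cg_t))

# Two comment words, two code tokens.
att_cs = [[0.9, 0.1], [0.1, 0.9]]
agreeing_cg = [[0.9, 0.1], [0.1, 0.9]]     # same word/token alignment
conflicting_cg = [[0.1, 0.9], [0.9, 0.1]]  # opposite alignment

aligned = attention_symmetry_penalty(att_cs, agreeing_cg)
conflicting = attention_symmetry_penalty(att_cs, conflicting_cg)
```

The penalty is zero when both models align the same word/token pairs and grows as the alignments diverge, which is the behaviour a second regularization term of this kind would reward.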

Paper ID:	3531