NeurIPS 2020

Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks

Review 1

Summary and Contributions: This paper proposes a new NN architecture for the task of learning to execute programs. The proposed architecture (IPA-GNN) combines recurrent NNs with message passing in GNNs, modeling attention over instructions at each time step t to learn how a program with control flow is executed. It handles control flow by means of a soft attention mechanism and execution by means of a soft instruction pointer. The results show better accuracy, as well as better generalization to larger programs, compared to existing techniques in the literature.
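To make the soft-interpreter picture concrete, here is a minimal sketch (with a hypothetical 4-statement program, assumed successor lists, and assumed branch probabilities; this illustrates the idea of a soft instruction pointer, not the paper's exact formulation) of how probability mass could be routed through a control flow graph:

```python
import numpy as np

# Hypothetical 4-statement program; statement 1 is a conditional branch.
n = 4
# Soft instruction pointer: a probability distribution over statements,
# with all mass initially on statement 0.
p = np.zeros(n)
p[0] = 1.0

# Assumed per-statement branch probabilities (in the model these would be
# produced by a learned component, e.g. an RNN over statement embeddings).
branch_true = np.array([1.0, 0.7, 1.0, 1.0])  # only stmt 1 actually branches
succ_true = [1, 2, 3, 3]    # next statement on the "true" / fall-through edge
succ_false = [1, 3, 3, 3]   # next statement on the "false" edge

def step(p):
    """One soft execution step: route each statement's mass along its edges."""
    q = np.zeros_like(p)
    for i in range(n):
        q[succ_true[i]] += p[i] * branch_true[i]
        q[succ_false[i]] += p[i] * (1.0 - branch_true[i])
    return q

p = step(p)  # mass moves from statement 0 to statement 1
p = step(p)  # statement 1 splits its mass 0.7 / 0.3 between statements 2 and 3
```

Note that the total probability mass is conserved at every step, which is what lets the soft instruction pointer behave like a distribution over possible execution states.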

Strengths:
* The authors develop their model by analogy to a basic interpreter; it is a natural description of what a soft interpreter would look like.
* They train on programs with lower complexity (fewer lines of code) and test on programs with higher complexity, with the explicit aim of demonstrating generalization.

Weaknesses:
** Training methodology is not clearly described.
* What loss function did you use? Is it a regression loss on the numerical value, or a cross-entropy loss over the 1000 possible values of v_0 mod 1000?
* What accuracy do you report in Table 2? Is it the % of the time the result exactly matched the target, or a regression metric (something like MAPE)?
** More comparison to other GNNs is needed.
* One explicitly relevant GNN architecture is the Graph Attention Network (GAT). It would be good to mention how it relates to IPA-GNN; GATs also attend with different weights to incoming messages from neighbors. Note that GAT is a convolution-based GNN and does not use a recurrent unit, so the attention mechanism would have to be adapted to the GGNN setting.
** GraphSAGE allows different aggregation functions. In IPA-GNN you use summation as your aggregator. Did you try other aggregators?
** Claim that GGNN = NoExecute and NoControl model.
* GGNN computes Non-linear(Linear(hidden_states)), whereas IPA-GNN computes Linear(Non-linear(hidden_states)). Essentially, GGNN applies the non-linearity after the messages are aggregated, but in your case the non-linearity is applied first and the messages are then passed and aggregated. It is not clear to me that these two computational patterns are equivalent. It would be good if you could make a formal argument on this.
** Program generator
* How realistic are the programs generated by your probabilistic grammar?
********** Post-rebuttal comments **********
I would still like a more formal argument about the equality between GGNN and the NoExecute/NoControl model, since it is still not clear how Linear(Non-linear(hidden_states)) == Non-linear(Linear(hidden_states)). However, I am positive about the rest of the paper and am keeping my original score.
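The reviewer's point about the two computation orders can be made concrete with a tiny sketch (hypothetical 3-dimensional hidden states, ReLU as the non-linearity, and a single shared weight matrix; an illustration of the ordering question, not the paper's exact equations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two incoming neighbor messages (hidden states) and a shared weight matrix.
h1, h2 = rng.normal(size=3), rng.normal(size=3)
W = rng.normal(size=(3, 3))
relu = lambda x: np.maximum(x, 0.0)

# GGNN-style ordering: aggregate and linearly transform first,
# then apply the non-linearity.
ggnn_style = relu(W @ (h1 + h2))

# IPA-GNN-style ordering (as read by the reviewer): apply the non-linearity
# per hidden state first, then transform and aggregate the messages.
ipa_style = W @ relu(h1) + W @ relu(h2)

# For generic inputs these two results differ (note ggnn_style is always
# elementwise non-negative, while ipa_style need not be), so any claimed
# equivalence between the two patterns needs a formal argument.
```

This does not prove the paper's claim is wrong; it only shows the two orderings are not equivalent in general, which is why the reviewer asks for a formal argument covering the specific NoExecute + NoControl setting.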

Correctness:
* I am not sure about the claim that the NoExecute + NoControl IPA-GNN is equivalent to GGNN.
* How the accuracies in Table 2 and Figure 2 are calculated is not clearly described.

Clarity: * The paper is well-written. The symbols and math are easy to understand.

Relation to Prior Work: * I am happy with the related work mentioned in the context of learning to execute. I would like to see more comparisons with other GNN architectures proposed in the literature.

Reproducibility: Yes

Additional Feedback:
* Please address my concerns in the weaknesses section.
* How susceptible is your system to syntactic changes? I am curious whether a pure renaming of the variables would result in the same output.

Review 2

Summary and Contributions: The paper intends to achieve the best of both worlds -- GNN and RNN -- exploiting program structure while reasoning over long program sequences for static analysis. To this end, the paper proposes a GNN architecture called the Instruction Pointer Attention Graph Neural Network (IPA-GNN). The key idea is an instruction-pointer attention mechanism, with an RNN acting as the interpreter over statement representations and latent branch decisions. Experiments on learning to execute full programs with control flow, as well as partial programs, show that the proposed model outperforms both RNN-only and GNN-only models.

Strengths: The study identifies that an RNN trace model with latent branch decisions is a special case of a GNN. The proposed model outperforms both RNN and GNN models on the task of learning to execute programs.

Weaknesses: It would be good to have more evaluation of each individual component of IPA-GNN.

Correctness: Yes.

Clarity: Certain parts of the paper are hard to read; e.g., the descriptions of the instruction pointer RNN and the IPA-GNN make the relationship between the two hard to follow.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: The paper tackles a real problem in program analysis, especially branches, and addresses the tradeoff between GNNs and RNNs. It appears promising that the proposed IPA-GNN is demonstrated to be more effective than both GNNs and RNNs by identifying an effective way to combine the benefits of the two. The proposed model is evaluated on a large dataset with varied branch structures. But it would be good to also evaluate the effect of each model component in addition to the overall performance. The description of the model could also be improved by an overview diagram or graph. Comments after rebuttal: Most of the reviewer's concerns were addressed by the rebuttal.

Review 3

Summary and Contributions: The authors propose a novel GNN architecture, the Instruction Pointer Attention Graph Neural Network (IPA-GNN) to handle the task of learning to execute programs using control flow graphs. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both the full and partial program execution tasks.

Strengths: The authors introduced two variants of the "learning to execute" task: full and partial program execution.

Weaknesses: The authors developed the Instruction Pointer Attention Graph Neural Network (IPA-GNN). Although the work is named as an attention model, the authors barely discuss the attention mechanism directly. The authors choose to compare against GGNN, a general model not specifically designed for these tasks, and GGNN shows very poor performance in Fig. 2. What is the reason for selecting this baseline? The authors did not evaluate many of the other methods mentioned in related work.

Correctness: yes

Clarity: ok

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: The model architecture improvement in this paper is not strong enough. ============== POST REBUTTAL The new baseline model is a worthy addition to the experimental section. The authors mention a significant improvement compared to the line-by-line RNN, with this performance achieved after only three epochs of training (line 286, training section). Limiting training to 3 epochs is strange, especially when the training set has 5M examples. Overall, based on the improvement, I change my score to 5.