NeurIPS 2020

Learning Graph Structure With A Finite-State Automaton Layer


Review 1

Summary and Contributions: * A differentiable method based on finite automata is presented to find new (useful) edges in a graph. * Experiments on grid-world navigation and a program analysis task show that the method succeeds at learning relevant relationships that help downstream models.

Strengths: * While composed of well-understood components, the proposed method is new in the context of learning from graphs. * The method is sound and seems to provide (minor) gains on the well-established variable misuse task, competing with strong baselines. * Methods to automatically discover important relationships in input data are of high relevance to the NeurIPS community.

Weaknesses: The gains in the experiments are minor, and the most substantial part of the experiments is limited to the program analysis setting, in which the target relationships are well understood. Further experiments on other domains (e.g., on sequences encoding proteins (cf. https://blog.einstein.ai/provis/) or graphs of theorems) would be useful to better understand the generality of the proposed approach. This would have the potential to turn this from an interesting, specialized paper into a seminal paper.

Correctness: Claims made in the paper seem correct, and are supported by proofs in the appendix.

Clarity: The paper is relatively well-written, though dense. Many details can only be found in the somewhat sprawling appendix. Minor nits: * L60-61: "one can construct a regular language L such that an L-path from n_1 to n_2 corresponds to the location n_2 of possible previous assignments to the variable at n_1" sounds weird (singular/plural disagreement), and should probably read "one can construct a regular language L such that for a variable at location n_1 and a previous assignment to that variable at location n_2, there exists an L-path from n_1 to n_2"
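To make the intended L-path reading concrete, here is a toy sketch of what it means for a path's edge-label string to lie in a regular language L. The graph, edge labels, and regular expression below are placeholders I made up for illustration, not the paper's actual AST alphabet or construction:

```python
import re

# Toy labeled graph: node -> list of (edge_label, neighbor). The labels
# ("prev_stmt", "assign_target") are invented for this example only.
graph = {
    "use_x":  [("prev_stmt", "stmt2")],
    "stmt2":  [("prev_stmt", "stmt1")],
    "stmt1":  [("assign_target", "def_x")],
    "def_x":  [],
}

# Regular language L: walk backwards over zero or more statements, then step
# to an assignment target. An L-path from a variable use at n1 to a node n2
# marks n2 as a possible previous assignment of that variable.
L = re.compile(r"(prev_stmt )*assign_target ")

def l_reachable(start, max_len=5):
    """Return all nodes n2 such that some path start -> n2 has a label string in L."""
    hits = set()
    frontier = [(start, "")]
    for _ in range(max_len):
        next_frontier = []
        for node, labels in frontier:
            for lab, nxt in graph[node]:
                word = labels + lab + " "
                if L.fullmatch(word):
                    hits.add(nxt)
                next_frontier.append((nxt, word))
        frontier = next_frontier
    return hits

print(l_reachable("use_x"))  # {'def_x'}
```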

Relation to Prior Work: The relationship to prior work is clearly articulated, though I am missing a discussion of how the proposed method relates to the implicit learning of relationships in transformers. Specifically, it is well known that attention weights can be viewed as a weighted adjacency matrix. While I could conjecture some differences from the method proposed here (e.g., transformers only do one-step reasoning [but what about deep transformer models / the Universal Transformer?]), the paper would be improved by discussing this as well.
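To be explicit about the view I have in mind, here is a minimal numpy sketch of standard single-head attention (nothing from the paper): the softmaxed score matrix is a dense, learned, weighted adjacency matrix, and one attention layer performs one round of message passing over it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# n tokens (or graph nodes) with d-dimensional states.
rng = np.random.default_rng(0)
n, d = 6, 8
H = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# The attention weights form a dense weighted adjacency matrix A, where
# A[i, j] is the strength of the inferred edge from token j to token i.
A = softmax((H @ Wq) @ (H @ Wk).T / np.sqrt(d))

# One attention step = one round of message passing over that adjacency,
# i.e. one-step relational reasoning; multi-step reasoning needs stacked layers.
H_next = A @ (H @ Wv)
```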

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper introduces a differentiable approach to automatically compute semantically meaningful long-distance edges over static graphs. It can be used during end-to-end learning of graph-based (GNN- or Transformer-driven) tasks. The work observes that many such hand-crafted edges can be represented as regular languages over graph paths, and can thus be recognized by an agent following a POMDP policy. It then proposes an efficient procedure for training such an agent as a differentiable module in a larger graph network, using its halting distribution as a new weighted edge type. The work is evaluated against several baselines on ML4Code tasks (either learning supervised semantic edges or learning an end-to-end task), and noticeably outperforms them while using less training data.
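As a reading aid for this summary, here is a deliberately simplified caricature of the halting-distribution idea. It drops the automaton's internal states and observations entirely and only shows how the absorbing probabilities of a learned substochastic transition matrix can act as a differentiable weighted edge type; this is my own sketch under those simplifications, not the paper's full construction:

```python
import numpy as np

# Suppose the layer has produced, at every node, a softmax over the actions
# "move to node j" and "halt here". The probability of eventually halting at
# each node is then an absorbing-Markov-chain quantity, and it is a
# differentiable function of the underlying logits.
rng = np.random.default_rng(0)
n = 5
logits_move = rng.normal(size=(n, n))   # learned movement scores
logits_halt = rng.normal(size=n)        # learned halting scores

scores = np.concatenate([logits_move, logits_halt[:, None]], axis=1)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
T = probs[:, :n]          # substochastic: probability of moving i -> j
h = probs[:, n]           # probability of halting at node i

# Expected visits before halting, then halting mass:
#   N = (I - T)^{-1},  halt_dist[s, j] = N[s, j] * h[j]
N = np.linalg.inv(np.eye(n) - T)
halt_dist = N * h[None, :]

# Each row halt_dist[s] sums to 1 and can be read as a new weighted edge type
# from start node s into the rest of the graph.
print(halt_dist.sum(axis=1))
```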

Strengths: + An insightful novel formulation for learning semantic graph edges end-to-end, without hand-engineering them (e.g. as data flow). + Theoretical analysis that proves the validity of the approach for any regular language defining the target semantic edge – in particular, for data flow edges of program representations. + Comprehensive and excellent evaluation on three distinct families of tasks against strong baselines. + Clear benefit of the inductive bias with less training data. + The paper is well written and accessible.

Weaknesses: - Some missing or unacknowledged related work.

Correctness: The theoretical findings and empirical methodology seem sound. I did not check the Appendix proofs in detail.

Clarity: I'm well familiar with the ML4Code literature, less so with the RL option literature. The paper is excellently presented for such an audience, introducing parts of its approach step by step, with examples, and appropriate references. It's a pleasure to read. All the relevant technical details are described in the Appendix in elaborate detail. For an audience less familiar with ML4Code, I suggest introducing a new figure in the Introduction that presents the code snippet from Fig 1b with static (handcrafted) data-flow edges, contrasted side-by-side with the edges learned by GFSA (similar figure to existing Fig 1b). This will also help define the concept of data flow (L60-67).

Relation to Prior Work: The prior work is adequately discussed, with one significant exception and a couple of minor connections. Significant: What the authors call "RelAtt" was introduced by Wang et al. [1] concurrently with the GREAT model. They call it RAT (Relation-Aware Transformer) and apply it to the task of encoding database schema graphs rather than program AST graphs. Minor: - The authors might want to discuss a possible connection of GFSA to Neural State Machines [2], although the connection is weak. - GFSA learns a POMDP as an approximation of the latent FSA representing the desired edge relation. One could also imagine constructing such an FSA explicitly [3] and either (a) using it as a form of supervision for GFSA, or (b) relaxing the FSA into a differentiable form to facilitate end-to-end learning akin to GFSA. This requires knowing the desired edge relation, though, and thus might not be possible for an end-to-end task like VarMisuse. [1] Wang, B., Shin, R., Liu, X., Polozov, O. and Richardson, M., 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). [2] Hudson, D. and Manning, C.D., 2019. Learning by abstraction: The neural state machine. In Advances in Neural Information Processing Systems (pp. 5903-5916). [3] Weiss, G., Goldberg, Y. and Yahav, E., 2018, July. Extracting automata from recurrent neural networks using queries and counterexamples. In International Conference on Machine Learning (pp. 5247-5256).

Reproducibility: Yes

Additional Feedback: This is great work and should definitely be accepted. A few thoughts and comments on the follow-up: * The fact that observations depend on the initial node (L114) introduces a limited form of "variable capturing" in the regular language that the POMDP tries to approximate. It is still regular, but this could be broadened to depend on the agent's whole history and thus represent more complex languages. The POMDP would no longer be theoretically guaranteed to represent such a language, but it might still learn a useful one. * The VarMisuse experiments show that GFSA-enriched networks seem to use the new weighted edges as extra representations, inducing some form of long-distance structure in the input graph (akin to Transformer attention structure). I wonder what would happen if one used GFSA for self-supervised program learning like CodeBERT [4] rather than supervised program analysis tasks. Training on a large corpus of _natural_ (rather than generated) code with a self-supervised objective might induce new semantic relations and could shed new light on which long-distance dependencies are actually important for an informative representation of the code. * Handcrafted data flow edges are sharp and discrete. GFSA-learned probability matrices are not necessarily so. Have you considered any regularization on the sharpness of the probabilities (see the sketch below)? [4] Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D. and Zhou, M., 2020. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.
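On the last point, one possible (entirely hypothetical) regularizer would penalize the mean row entropy of the GFSA output, so that near-one-hot edge distributions are preferred over diffuse ones; this is only meant to illustrate the question, not a suggestion the paper makes:

```python
import numpy as np

def row_entropy(P, eps=1e-9):
    """Mean Shannon entropy of the rows of a stochastic edge-probability matrix."""
    P = np.clip(P, eps, 1.0)
    return -(P * np.log(P)).sum(axis=-1).mean()

def regularized_loss(task_loss, edge_probs, lam=0.01):
    # Sharpness penalty: diffuse rows of edge_probs increase the loss,
    # pushing the learned edges toward discrete, data-flow-like structure.
    return task_loss + lam * row_entropy(edge_probs)
```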


Review 3

Summary and Contributions: This paper addresses the problem of augmenting graph data with additional edges corresponding to higher-level abstract relations. For example, an abstract syntax tree (AST) might be augmented with edges that identify the last statement to read a variable. The authors propose an end-to-end differentiable method to learn policies for finding these additional edges by casting the problem as a POMDP for a finite state machine (whose corresponding regular expressions characterize the paths of interest) and deriving a differentiable policy for navigating it from a source state to an absorbing state. This policy can then be used to more quickly solve a downstream task that depends on the derived edges.

Strengths: -The paper is clearly written and definitely of relevance to the NeurIPS community. -The authors characterize the problem nicely in terms of existing RL theory (POMDPs) and establish results that map learning in this environment to paths in a finite state automaton. They also connect their analysis to finding successor states in an RL problem (see the note below). -The method is evaluated on several tasks, including a GridWorld problem and program analysis, and the results seem moderately better than the baselines (especially with less training data).
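To unpack the successor-state connection mentioned above: the standard successor representation of a policy-induced transition matrix P with discount gamma is the resolvent below; reading the paper's halting-distribution computation as a variant of this resolvent is my own gloss, not the paper's exact derivation.

```latex
\[
  M \;=\; \sum_{t=0}^{\infty} \gamma^{t} P^{t} \;=\; (I - \gamma P)^{-1}
\]
```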

Weaknesses: -The results are somewhat mixed and the domains are relatively simple. -There is another body of work, in the automatic knowledge-base completion literature, that I think is related but not mentioned. In particular, I am curious how this method compares to something like Go For a Walk (https://arxiv.org/abs/1711.05851), other than the fact that GFW uses RL explicitly. Could you not compare to existing RL approaches?

Correctness: As far as I can tell everything looks correct.

Clarity: The paper is well written and reasonably clear. Because it focuses on program analysis, there may be a bit of a learning curve for readers not familiar with that literature.

Relation to Prior Work: The prior work section seems complete, except for coverage of RL-based path discovery work mentioned above.

Reproducibility: Yes

Additional Feedback: Thank you for running the GFW comparison. I have raised my score.


Review 4

Summary and Contributions: In this paper, the authors study the problem of how to learn to add edges to input graphs for better performance on downstream tasks. The main contributions are summarized as follows. C1. The authors formulate the problem of learning to add edges to input graphs, and highlight potential applications in software engineering scenarios. C2. GFSA is proposed to enable edge-addition learning from downstream tasks. C3. Empirical results on program analysis and bug detection tasks suggest the effectiveness of the proposed method.

Strengths: S1. The authors investigate an interesting aspect of graph learning: how to enable edge-addition learning with context understanding. S2. The authors also suggest a unique angle to address the problem of context understanding. In particular, context understanding is formulated as a process of state transitions triggered by actions in a finite-state automaton, and cast as a learning problem. S3. The effectiveness of the proposed method is supported by multiple concrete tasks in software engineering and program analysis, indicating its real-world impact.

Weaknesses: W1. The unique value of the proposed GFSA is unclear. As discussed in the paper, the problem setting behind GFSA is friendly to reinforcement learning techniques. There has been some work that aims to learn better graph structure for downstream tasks by reinforcement learning [1]. The key question is why GFSA is preferred over existing reinforcement learning techniques. Without this clarification, the unique value of GFSA remains vague. W2. The action "backtrack" could be problematic. While this action enables the model to handle cases where no edge addition is needed or discovered, it could also make the model run into endless loops. How does the final reward guide the model to minimize the chance of triggering "backtrack"? W3. The proposed method may suffer from low learning efficiency. As the reward comes from downstream tasks, supervision on early actions could be fairly weak. How is this issue addressed in this work? W4. It is difficult to see how GFSA could support multiple edge additions. Reference: [1] Wang, L. et al. Learning Robust Representations with Graph Denoising Policy Network.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Q1. Does GFSA only support one edge addition? Q2. In practice, for two nodes connected by an added edge, what is their distance? If they are far apart, many actions are needed to reach that edge-addition decision, and the issue discussed in W3 could be severe. If they are local, why not cast context understanding as a link prediction problem and solve it with message-passing models (e.g., GNN variants)? ======After rebuttal====== The authors shared a reasonable response to my questions. I have raised my score.