Review for NeurIPS paper: Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning

NeurIPS 2020

Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning

Review 1

Summary and Contributions: [edit: see "additional feedback" section for response to author rebuttal] This paper considers the problem of learning representations of specific subgraphs of a larger graph, in such a way that subgraphs of non-isomorphic graphs should be assigned different representations. Specifically, they note that most GNNs can only distinguish things that the 1-WL test can distinguish, and propose a new method of computing graph features (called "distance encoding") to get around this restriction. The distance encoding the authors propose appears to work as follows. Suppose we want to compute a representation of a subgraph S of a graph G. Instead of running a GNN on G and then pooling across the subgraph, the subgraph S is used to modify the features of all of the nodes in G, in particular based on the distances between those nodes in G and the nodes in S (this is DEGNN). Optionally, the subgraph is also used to modify the edge structure by, conceptually, adding additional edges and/or edge features based on pairwise distances in G (this is DEAGNN). Then, the GNN is run as normal on this graph. The authors include theoretical results for regular graphs and distance-regular graphs, showing that simple GNNs cannot distinguish either of these, and that their DEGNN models can distinguish regular graphs with high probability, but cannot distinguish distance-regular graphs. They also include experiments on node feature prediction, link prediction, and triangle (clique) prediction, and show strong results. Although their approach seems to work in practice and has some theoretical justifications, I think the paper focuses too much on the specific feature encoding they use, while obscuring the fact that they are considering a different, simpler problem, by assuming access to a particular target subgraph in the input that the 1-WL test and other GNN methods do not have. This simpler problem is interesting, but not well described, and I think the paper could be greatly improved by revising it to focus on this general idea and not on the specific encoding.

Strengths: The proposed distance encoding is similar to prior work, but seems to generalize that prior work, and seems like an interesting starting point for more sophisticated feature encodings. The paper has a good balance of theoretical and experimental results. Theoretically, they demonstrate sets of graphs that their method can distinguish but simpler GNNs fail to distinguish, as well as sets of graphs for which their method cannot distinguish. Empirically, they do a thorough evaluation of their model on node, edge, and triangle prediction, and show strong results versus baselines.

Weaknesses: Although admittedly I am not an expert in graph isomorphism tests and similar literature, it seems to me that the additional "power" of the DEGNN method is not due to the particular encoding proposed here, but instead comes from focusing on an easier problem. The authors motivate their approach by discussing the Weisfeiler-Lehhman tests, which can be used to decide that particular pairs of graphs are not isomorphic. The idea of WL tests is to take each graph, compute a representation of that graph, and then compare if the representations are equal; if they are not, the graphs must not be isomorphic. Graph neural networks can be viewed as doing something similar, in that they compute representations of graphs based on local features, and it can be shown that for the graph isomorphism task, simple GNNs have equivalent power to the 1-WL test. However, the task the authors consider here is simpler: given two graphs, and also a particular subset S of nodes within each graph, determine if there is an isomorphism between the two graphs that also restricts to an isomorphism between those node subsets (by computing a representation of each graph that depends also on S, and seeing if those are equal). As a special case, if the subset is of size one, this is asking whether there is an isomorphism between the two graphs such that the chosen nodes map to each other under the isomorphism. This task is easier, since it has significantly reduced the set of matchings that we need to consider. The authors then make use of this additional information by encoding the query subgraph S as a collection of distances of other nodes to nodes in that query set. But if you are trying to compute a representation of a specific subgraph S, there are much simpler things you could do! For instance, just adding a separate node feature to each node that says whether it is in S, and then running a GNN on that. I think that this would have equivalent representational power without adding any of the complexity of the distance encoding, since a GNN could learn to compute distance encodings on its own when given the starting set of nodes S. (This kind of conditioning is not novel, and in fact when the set S is known in advance many GNN applications do something like this, see for instance section 3.1 of "Gated Graph Sequence Neural Networks".) Corollary 3.4 in the paper does show that you can use a single-subgraph structural representation to compute a full-graph structural representation, by doing injective pooling across the structural representations for all nodes. But it is important to note that this is much more computationally expensive than doing a 1-WL test or a normal GNN pass, since to compute the representation for each individual node, you must run a separate GNN with different node features. And, again, the representative power of this does not seem to depend on the distance encoding, but simply on the fact that the GNN is allowed to condition on a particular node set. I do think that characterizing the representative power of pooled small-node-set structural representations is an interesting thing to do. But in my opinion this paper does not make a clear distinction between (a) using an input query subset to affect node representations, and (b) using the specific feature encoding they propose based on random walks. I think it would be much more compelling to focus the theoretical aspect of the paper on the difference in power between GNNs that compute graph representations (without having access to a subgraph S) and GNNs that compute small-node-set representations (which are allowed to condition on a particular subgraph S as an extra feature), and then present the distance encoding as a particular special case of this general idea. Additionally, the approach is motivated based on graph isomorphism tests, but none of the experiments have to do with graph isomorphisms! It seems important to actually evaluate the proposed method on a graph isomorphism problem and demonstrate that the model can actually learn that kind of task in practice. Notably missing is any discussion of the actual readout operation and loss function used for the experiments with their models. The appendix includes a few details about readout layers for baselines, but nothing for the DEGNN models, and even this is not particularly easy to understand.

Correctness: I found a few potential issues with the definitions and claims made in the paper: In definition 3.1, there seems to be a mismatch between the name "distance encoding" and the extremely broad definition. For instance, under this definition, a constant function is a distance encoding, as is a function that counts the degree of the current node, or that counts the number of unique paths between the node and S, or that computes eigenvalues of subgraphs of the adjacency matrix, or many similar things. Is that intentional? Theorem 3.3 is written in terms of "some small constant ε > 0". This is an extremely imprecise statement. How small is small? There is no upper bound. Could it grow like n? Faster? Slower? I have not read the proof in detail, but I skimmed it and could not find any definition of ε whatsoever. Does Theorem 3.7 depend on the particular use of distance encoding? Again, I only skimmed the proof, but it seems to rely on the specific case of shortest path distance, even though the theorem statement is more general.

Clarity: On a high level, I think that the paper does not put enough emphasis on how their task differs from the standard formulation, and in particular the difference between substructure representations of SPECIFIC substructures, vs whole graph representations. Many prior GNN approaches can be seen as producing either a single graph representation, or simultaneously producing representations for individual nodes or node pairs. This paper seems to be different because it instead is focused on producing a representation for a SINGLE set of nodes, without having to compute representations for every node at once. This is, in my opinion, the core difference between their approach and the GNN approaches similar to the 1-WL test, but this is not clearly stated; it took me until section 3.1 to realize that a single, specific subset S of interest is provided at input time.

Relation to Prior Work: I am not very familiar with the prior work in this space, so I cannot speak strongly to this. However, the authors do seem to cite a wide range of previous contributions.

Reproducibility: No

Additional Feedback: Line 45: "review them in details" should be "review them in detail" Line 243: "to do the following change" is unclear. Change to what? Is it figure 1b? Also, it would be better to actually write out the full model definition instead of just writing down a substitution. Line 296: "has recently attracted much significance to" I don't understand what this means. Broader impact section: My understanding is that this section is supposed to focus on ethical impacts. However, much of this section is in fact a summary of the rest of the paper, as a replacement for a conclusion, which seems contrary to the intended purpose. Also, on line 385: choosing datasets using "rules that are irrelevant to ethics" is NOT a sufficient way to ensure ethical research! Ignoring ethical issues does not make them go away, instead it is important to incorporate ethics into the decision making process. (That said, I don't think that caused any particular problems in this case.) ===== Edit after author response ===== I had not considered that the node set could be the entire graph. It seems that in that case the probability is 1-o(1), which could increase to 1 arbitrarily slowly, but I suppose this is still more powerful than the WL test in this setting, which would remain small. I'm still struggling to wrap my head around how this paper relates to prior work and known results. My current understanding after reading the rebuttal, which may be incomplete: - The 1WL test cannot distinguish certain graphs, but does not use a specific node set. - "WLGNN" models also cannot distinguish those graphs, and also do not use a specific node set. - Methods like GraphSAGE and SEAL do use a specific node set, so they are already more powerful than the 1WL test. But the power of these has not been theoretically established. - Distance encodings are a way of making features that DO use a specific node set. - The "everything in the set gets 1" simple idea I mentioned in my initial review is technically a distance encoding. It also can be more powerful than the 1WL test, because it uses the node set. But if the node set is the whole graph, it isn't any more powerful than the 1WL test. - On the other hand, the distance encoding based on SPD is still more powerful than the 1WL test, even when there is NOT a specific node set, according to the claims in the rebuttal. (This is somewhat surprising to me!) - Some prior work has used distance encoding before, but they have not proved any bounds on how powerful they are. I think that the abstract and introduction of the paper could be improved to clarify the above relationships and position this work appropriately with respect to the existing literature. In particular it seems important to emphasize that: - giving access to the node set is already more powerful than the WL test, and so prior approaches that have access to the node set aren't necessarily limited to the WL test (even though they may not be as powerful as distance encodings) - the primary contribution of this work is the formalization and mathematical characterization of the power of distance encodings, not the novelty of the encoding itself (as the authors note in the "Novelty" section of the rebuttal) It seems to me that the n=k case (and probably the small-k case as well) strongly depends (through Lemma C.1) on the aggregation function being injective. How do you ensure that this is the case? If this is a heuristic choice, what do you choose (since the example of sum pooling seems clearly not injective)? If this is a learned aggregation process, is it still provable that DEGNN can distinguish nodes using this learned aggregation instead of an ideal one? It would be good to address this in the next revision. I still do not understand what value epsilon should have. Does the statement hold for every epsilon > 0 that is independent of n? If so, it would be better to say "for any epsilon". Otherwise it seems important to clarify what value epsilon has in the proof, and how small it is. I also still feel that Definition 3.1 is too broad for the name "distance encoding", as it allows many things that are not distances to be called distance encodings. If this generality is desired, I think it would be better to call it something like "target-subset encoding", with distance encodings as a (provably powerful) special case of that. If not, perhaps the definition should be made more specific so that things that are clearly not distances aren't included. (See my initial review for examples.) I am increasing my review to 5 (Marginally below acceptance threshold) in light of the clarifications in the rebuttal.

Review 2

Summary and Contributions: The paper introduces a data augmentation framework for the improvement of the expressiveness of graph neural networks based on the structural information that a graph entity has. Although there are other works that suggest efficient ways of improving the performance of different graph learning algorithms, this work provides an extensive study on how the encoded structural information can be applied on different scenarios either as extra node features or as controller in the aggregation step of a message passing scheme. Moreover, the authors suggest a categorization of the message passing neural networks between the WLGNN, which capitalizes on the computational power of the WL isomorphism test and the DEGNN, as they call their method, which encapsulates the graph-related data augmentation inside already known models.

Strengths: The idea seems to be novel even though there are other works that suggest the utilization of structural information in the node attribute space [1]. The authors provide a unified framework of incorporating different levels of structural dependencies through the parameterization of the \zeta distance encoding function. Moreover, as it is shown in the experimental section, the method seems to outperform some strong baselines in all the examined tasks. Finally, as it is stated in the "Broader Impact" section, the framework can be easily modified and applied to any kind of graph-related scenario. [1] "Are Powerful Graph Neural Nets Necessary? A Dissection on Graph Classification". Ting Chen, Song Bian, Yizhou Sun. arXiv 2019.

Weaknesses: - The authors do not refer to the complexity of their method. In section 3, they claim that "this simple design can be efficiently implemented", however they do not provide any information on the computational cost of Equation (2). In particular, the utilization of powers of random-walk matrices seems to exhibit high complexity, while the special cases of SPD and generalized PageRank scores are, also, expensive. - In Table 1, the authors compare their method against a few baselines. However, particularly for the case of the role prediction, a few stronger competitors have been suggested before, such as the RolX [2] algorithm, which also, incorporates structural features, extracted in an automatic manner and GraphWave [3], which exploits diffusion wavelets to provide encoded information about the graph structure and seems to be one of the strongest structural representation learning baseline. The authors should compare the proposed approach against these competitors as well, in order to have a more complete overview of their contribution. - As it is described in Appendix G1 Section, the employed benchmark datasets are very small ( # nodes < 1500 ), so a 10% test split would provide ~= 150 test samples, not allowing for a safe conclusion of the actual accuracy improvement of Table 1. Moreover, regarding the structural representation tasks, it is natural to assume larger size graphs, where structurally similar nodes could lie in distant positions of the same graph. Thus, it would be nice if the authors utilize larger graphs in the experimentats. [2] "RolX: Structural Role Extraction & Mining in Large Graphs". Keith Henderson et al. In KDD 2012. [3] "GraphWave: Learning Structural Node Embeddings". Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. In KDD 2018.

Correctness: The method and the empirical methodology seem to be correct.

Clarity: The paper lacks of clarity in many parts. Specifically, in section 3, where the authors introduce their main contribution, the paragraphs are not well-structured and the statements are vague both contextually and syntactically.

Relation to Prior Work: It is clear how the proposed approach differs from previous methods.

Reproducibility: Yes

Additional Feedback: ========Edit After Author's Response=========== I would like to thank the authors for addressing a number of my concerns, however, the response did not change my opinion. In my view, this is a decent paper, but I still think that the paper lacks clarity about a number of technical details and I would like to see more experimental results (even on synthetic datasets to verify the theoretical results).

Review 3

Summary and Contributions: This paper proposes Distance Encoding, a simple yet effective way to enhance the representation power of GNNs by augmenting node features or controlling message passing via pairwise distance. Using node distance to augment GNNs breaks the symmetry among nodes, while without losing the inductive ability of GNNs as compared to those approaches using node identifiers. The paper theoretically shows that almost all r-regular graphs can be distinguished by Distance Encoding where traditional 1-WL and GNNs fail. It also gives a limitation result saying Distance Encoding cannot handle distance regular graphs. The empirical results on node/link/triangle prediction demonstrate better results than baselines.

Strengths: The theoretical analysis is sound and convincing. The method proposed is simple yet effective. The paper is very relevant to the NeurIPS community.

Weaknesses: The theoretical analysis only shows Distance Encoding's better representation power than 1-WL on regular graphs. Although regular graphs are one major type of counter-examples for 1-WL, it is beneficial to discuss or analyze Distance Encoding's power for general graphs. The empirical evaluation on node classification does not use standard benchmark datasets. It is reasonable that Distance Encoding improves over GNNs on feature-less graphs. Is it also useful for graphs with rich features?

Correctness: The claims and method look correct to me, though I didn't check the full proofs. The empirical methodology is in a fair setting.

Clarity: The paper is generally well written, though it can be improved by introducing the core idea earlier. The main idea of Distance Encoding doesn't become clear until page 4. Too many preliminaries precede the main topic. Theorems 2.6 and 2.7 also look a bit straightforward (they both directly generalize theorems in existing works such as GIN), while taking a lot of space.

Relation to Prior Work: Yes. The paper discusses its relations to 1-WL, WLGNN, SEAL, PGNN, etc. very well.

Reproducibility: Yes

Additional Feedback: Can Distance Encoding also enhance 2-WL and higher-order GNNs? ------------------------------------------ After rebuttal: I would like the authors to know that the reviewers and meta-reviewer discussed this paper extensively on its strengths, weaknesses, and areas of improvements. We finally reached a concensus that the paper indeed makes excellent technical contributions while the writing should be improved a lot. My suggestions on writing are: 1) Discuss the differences between DE and node subset conditioning (raised by R1). Explicitly point out DE incorporates the annotation trick and is more powerful than node conditioning (the author response did a good job). Point out that node conditioning is also more powerful than WL (i.e., DE > node conditioning > 1WL). 2) Clarify the contribution is on characterizing DE mathematically, not proposing DE. This can be done by putting the discussions of prior work using DE (such as SEAL and PGNN) from Section 4 to Introduction.