Review for NeurIPS paper: Evaluating Attribution for Graph Neural Networks

NeurIPS 2020

Review 1

Summary and Contributions: Systematic evaluation of attribution methods for graph neural networks, incl. code/data for a benchmarking suite.

Strengths: + Attribution for GNNs is still understudied. The performance comparison of several GNN attribution methods, together with an accompanying open-source benchmarking suite that includes realistic datasets and code, is a good contribution to the field.

Weaknesses: - Attribution for GNNs involves finding both the features and the subgraph of nodes that together inform the prediction generated for a given example. The authors focus only on the features that are relevant to a prediction. - The authors have not benchmarked major published work such as GNNExplainer (ref [47]). Also, the authors chose not to consider graph attention networks, which tend to perform better than standard GCNs. Furthermore, attention has been used as an attribution mechanism on its own, and it may show better characteristics in the context of attribution methods in general.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Comments: - The authors find that GraphNets, due to their complexity, do not support accurate attributions. Consider removing them from subsequent analyses to simplify the presentation. Minor comments: - "Attribution accuracy is summarized for all task, model, and attribution method combinations in Table 3": that should be Figure 3. - "We adapt the following (attribution) methods to graphs." Some, if not all, have already been adapted to GNNs (e.g. in [33] for Grad-CAM and [28] for IG). - Ref 35 is from KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, pages 1135–1144, https://doi.org/10.1145/2939672.2939778

Review 2

Summary and Contributions: As DNNs are increasingly deployed in real-world applications that affect individual lives, providing human-intelligible explanations for their predictions has attracted tremendous attention recently. However, evaluating explanations of DNNs is a challenging problem, especially for attribution in graph neural networks. This work proposes a framework to quantitatively evaluate attribution methods for GNNs in terms of attribution accuracy, stability, faithfulness, and consistency. The benchmark suite, including data and code, will be made publicly available. This work also provides recommendations for the community, aiming to inform the design of new attribution methods in the future.

Strengths: 1. The conclusions make sense to me. The authors recommend the use of CAM paired with a GCN. Since CAM requires the last layer to be a global pooling layer, it cannot be used in many scenarios. In those settings, the authors suggest using Integrated Gradients. 2. The authors suggest evaluating attribution along four dimensions, which seem sound to me. Together they could give a comprehensive evaluation of an attribution method. 3. The faithfulness evaluation is novel. The authors suggest introducing spurious correlations into the training dataset. A faithful attribution method should be able to locate the spurious correlations and reveal why the model is “right for the wrong reasons”. This faithfulness evaluation is convincing to me. I hope this kind of faithfulness evaluation can also be adopted in other domains beyond GNNs.
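To illustrate why CAM depends on a global pooling layer, here is a minimal sketch in plain Python. It assumes a model whose final stage is global average pooling over node embeddings followed by a single linear layer; the function names and the toy numbers are hypothetical and are not taken from the paper or its benchmark code.

```python
def cam_node_attributions(node_embeddings, class_weights):
    """CAM score per node: dot product of the class weights with that node's embedding."""
    return [sum(w * h for w, h in zip(class_weights, emb)) for emb in node_embeddings]


def gap_linear_logit(node_embeddings, class_weights, bias=0.0):
    """Logit of a model that mean-pools node embeddings, then applies one linear layer."""
    n = len(node_embeddings)
    pooled = [sum(column) / n for column in zip(*node_embeddings)]
    return sum(w * p for w, p in zip(class_weights, pooled)) + bias


# Toy graph with three nodes and 2-dimensional final-layer embeddings (made-up values).
embeddings = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
weights = [2.0, -1.0]

scores = cam_node_attributions(embeddings, weights)
logit = gap_linear_logit(embeddings, weights, bias=0.5)

# Because pooling and the linear layer commute, the mean node score plus the bias
# reproduces the logit; without the pooling layer this decomposition is unavailable.
print(scores, sum(scores) / len(scores) + 0.5, logit)
```

Under these assumptions the per-node scores decompose the logit exactly, which is the property that is lost when the architecture does not end in a global pooling layer.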

Weaknesses: 1. One minor concern is that the ground truth labels are only provided for the nodes. As an approximation, the authors propose to redistribute the edge attributions equally onto their endpoint nodes’ attributions. It raises concerns about whether this kind of approximation could accurately reflect the contribution of each node. 2. Another minor comment is for line 198 and line 206. The results are given in Figure 3 rather than Table 3.
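The redistribution step questioned in point 1 can be sketched as follows. This is only an illustration: it assumes that "equally" means each edge's attribution is split in half between its two endpoints, and the function name and toy values are hypothetical rather than taken from the authors' code.

```python
def redistribute_edge_attributions(node_attr, edges, edge_attr):
    """Add half of each edge's attribution to each of its two endpoint nodes.

    node_attr: list of per-node attribution scores
    edges:     list of (u, v) node-index pairs
    edge_attr: list of per-edge attribution scores, aligned with `edges`
    """
    combined = list(node_attr)  # copy so the input is left untouched
    for (u, v), score in zip(edges, edge_attr):
        combined[u] += score / 2.0
        combined[v] += score / 2.0
    return combined


# Path graph 0 - 1 - 2 with made-up attribution values.
node_attr = [0.5, 0.0, 0.25]
edges = [(0, 1), (1, 2)]
edge_attr = [1.0, 0.5]

combined = redistribute_edge_attributions(node_attr, edges, edge_attr)
print(combined)
```

In this toy case node 1 has no attribution of its own, yet it ends up with a sizeable score purely because of its incident edges. The total attribution mass is preserved, but whether the per-node values still reflect each node's own contribution is exactly the concern raised above.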

Correctness: The claims, methods, and empirical evaluation are mostly correct, to the best of my knowledge.

Clarity: The paper is well written, and I enjoyed reading it.

Relation to Prior Work: Yes, it clearly describes the contributions.

Reproducibility: Yes

Additional Feedback: Post-rebuttal comments: After reading the other reviewers' comments as well as the authors' response, I feel that the major concerns are not sufficiently addressed in the authors' response. I have updated my rating from 7 to 6.

Review 3

Summary and Contributions: This paper studies commonly used attribution methods for GNNs. The paper makes recommendations based on a relatively large-scale quantitative evaluation.

Strengths: + The paper is well-organized. The idea is straightforward and the narrative is easy to follow. + The experimental design is comprehensive in general. Extensive experiments and carefully designed protocols provide solid support for comparing different attribution methods. + The open-source benchmarking suite would attract broad interest in the GNN and explainability research communities.

Weaknesses: - Although a large number of methods, architectures, and configurations are tested, the datasets used in the study are limited to molecular data. There is no guarantee that the results and conclusions generalize to other domains. - Although a wide range of attribution methods are compared, most of them were not proposed to address the studied problem. Besides, most of the compared methods are out of date. For instance, the best method according to the results is CAM, which was proposed four years ago. This makes it hard to accurately assess the contributions of the proposed approach. - The experiment section mostly reports plain numbers rather than offering discussion or conclusions. More analysis of the results, new discoveries, or confirmation of existing conclusions would all improve the experiment section.

Correctness: The proposed study is interesting and significant, and the empirical methodology is technically sound, although it suffers from the weaknesses listed above.

Clarity: Yes in general, except for some typos, e.g. Line 174 "... about about ..."

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: After reading the other reviewers' comments and the authors' response, I maintain my previous rating of marginal acceptance.

Review 4

Summary and Contributions: This paper attempts to quantify and compare the performance of commonly used attribution algorithms in Graph Neural Networks (GNNs) on the basis of Accuracy, Consistency, Faithfulness and Stability.

Strengths: * To the best of my knowledge, benchmarking attribution algorithms for GNNs is a novel work, although I think the claims made in the paper may be too specific to the datasets experimented on. * The attribution algorithms are compared based on experiments done for both classification and regression problems. * The experiments designed to quantify the evaluation metrics look sound.

Weaknesses: * To judge how an attribution method performs compared to random guessing, a t-test or other statistical test would have added more value, as used in the experiments of IROF [1]. * It’s good that experiments were performed on models trained with a variety of hyper-parameters (as in the Supplementary). However, when comparing different attribution algorithms, we also need to consider the hyper-parameters of the attribution algorithms themselves, which were not in the experimental details. [2] shows that attribution algorithms are quite sensitive to hyper-parameters. Drawing conclusions about attribution algorithms based on a fixed set of their hyper-parameters may not be ideal. * Are the evaluation criteria used in the paper sufficient? Was there a reason why the other evaluation criteria mentioned in [3] were not tried? * Please see more weaknesses in the Relation to Prior Work and Additional Feedback sections below. [1]- IROF: A Low Resource Evaluation Metric For Explanation Methods; AI for Affordable Healthcare workshop, ICLR 2020. [2]- SAM: The Sensitivity of Attribution Methods to Hyperparameters; CVPR 2020. [3]- Robnik-Sikonja, Marko, and Marko Bohanec. "Perturbation-based explanations of prediction models." Human and Machine Learning. Springer, Cham. 159-175. (2018)
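As a sketch of the kind of comparison suggested in the first point, the snippet below computes a paired t statistic and, as one example of an "other statistical test", an exact one-sided sign-flip permutation p-value, using only the Python standard library. The function names are hypothetical and the scores are made-up placeholders, not results from the paper; the pairing of scores per task is also an assumption.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def paired_differences(method_scores, random_scores):
    """Per-task difference between the attribution method and the random baseline."""
    return [m - r for m, r in zip(method_scores, random_scores)]


def paired_t_statistic(method_scores, random_scores):
    """t statistic of the paired differences (mean divided by its standard error)."""
    diffs = paired_differences(method_scores, random_scores)
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def sign_flip_p_value(method_scores, random_scores):
    """Exact one-sided paired permutation test: under the null hypothesis the sign
    of each difference is arbitrary, so all 2^n sign patterns are enumerated."""
    diffs = paired_differences(method_scores, random_scores)
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / total


# Made-up attribution accuracy scores on six hypothetical tasks.
method = [0.81, 0.77, 0.90, 0.68, 0.85, 0.79]
random_baseline = [0.52, 0.49, 0.55, 0.47, 0.51, 0.50]

t_stat = paired_t_statistic(method, random_baseline)
p_value = sign_flip_p_value(method, random_baseline)
print(t_stat, p_value)
```

The exact enumeration is only practical for a small number of paired scores; with many tasks or seeds, a sampled permutation test or a library t-test would be the more usual choice.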

Correctness: The general claims made about attribution algorithms may be specific to the dataset on which the experiments were done in the paper. Also, the empirical studies should have also taken into account the hyperparameters associated with different attribution algorithms.

Clarity: The paper is clear enough to follow. A small point: Figure 3 is referred to as Table 3 in the first paragraph of page 6, which was a little confusing.

Relation to Prior Work: The performance-attribution relative correlation (PARC) score mentioned in Section 4.2 on Attribution Faithfulness is very similar to the Faithfulness metric proposed in [1] below, but [1] is neither cited in this paper nor compared against. The paper should explain why the evaluation metrics used for attribution on vision datasets with feedforward networks (e.g., ROAR [2], Causal Metric [3], IROF [4]) were not adapted for evaluating attribution in GNNs. These metrics can be used even when ground truth is unknown. [1]- Towards Robust Interpretability with Self-Explaining Neural Networks; NeurIPS 2018. [2]- A Benchmark for Interpretability Methods in Deep Neural Networks; NeurIPS 2019. [3]- RISE: Randomized Input Sampling for Explanation of Black-box Models; BMVC 2018. [4]- IROF: A Low Resource Evaluation Metric For Explanation Methods; AI for Affordable Healthcare workshop, ICLR 2020.

Reproducibility: Yes

Additional Feedback: * When presenting inferences about attribution methods as one of the contributions, it would be good to mention the datasets on which the experiments were conducted, since the same inferences may not hold for all datasets. * Attribution algorithms have been adapted to GNNs in earlier papers as well, so shouldn’t Section 3.2 be under ‘Related Work’? * Do the experiments that quantify Algorithmic Stability in Section 4.3 assume that the original GNN predicts the same output class even with the modified input? If yes, then this assumption should be mentioned in the paper. This is an important assumption, especially considering issues of adversarial perturbations. * Is there a specific reason why Kendall's tau was low for all attribution methods and models shown in Figure 3? * The observation that attribution accuracy declines at later epochs is interesting; stating a possible reason for it would have given more insight.

Post-rebuttal: I thank the authors for all their efforts in the rebuttal. While the responses answered a few of the concerns raised by the reviewers, I remain concerned about the overall originality of the contributions and their impact. While I agree that the problems of graph datasets may be reflected in the current choices (molecular datasets), the problems can be affected by issues such as scale in different domains (e.g., social networks vs. molecular graphs). How such conclusions will apply to all domains is not clear, especially when no new method is proposed, only an analysis. Alternatively, it would have been nice to see the analysis lead to the development of a new attribution method that is specific to GNNs. Also, why are the chosen evaluation metrics necessary and/or sufficient? Without a proper analysis along any one of these dimensions, I believe the work may not be impactful. I retain my original rating for this reason.
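On the Kendall's tau question above, one factor that may be worth checking is how ties are handled. The sketch below is an illustration only: it assumes binary ground-truth node labels and uses made-up attribution scores, and it is not known here which tau variant the paper uses. It computes tau-a and tau-b in plain Python for a ranking that orders every relevant node above every irrelevant one.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(x, y):
    """Return (tau_a, tau_b) for two equal-length score lists.

    tau_a divides by all pairs; tau_b corrects the denominator for pairs tied
    in x or in y. Assumes neither list is constant (otherwise tau_b is undefined).
    """
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    tau_a = (concordant - discordant) / n_pairs
    tau_b = (concordant - discordant) / sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return tau_a, tau_b


# Six nodes, two of them relevant; the attributions rank both relevant nodes on top.
ground_truth = [1, 1, 0, 0, 0, 0]
attributions = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05]

tau_a, tau_b = kendall_tau(attributions, ground_truth)
print(tau_a, tau_b)
```

Even this ideal ranking gives tau-a = 8/15 and tau-b = 8/sqrt(120), both below 1, because every pair of nodes sharing the same label is tied in the ground truth. Whether this accounts for the low values in Figure 3 is not something the sketch can establish; it only shows that tau against a binary target cannot reach 1.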