Summary and Contributions: This paper proposes a random propagation strategy to perform graph-based data augmentation. Specifically, it randomly drops out the features of some input nodes and then performs graph propagation to obtain augmented features. The augmented features can then be used to train any neural network, together with a consistency regularization loss that encourages consistent predictions on unlabeled nodes.
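For concreteness, a minimal sketch of the described augmentation in plain NumPy; the function names, the dropout-style rescaling by 1/(1 - drop_rate), and the averaged-power propagation rule are my assumptions for illustration, not necessarily the paper's exact implementation:

```python
import numpy as np

def drop_node(X, drop_rate=0.5, rng=np.random):
    # Zero out entire node feature rows at random; rescale the survivors so
    # the expected feature matrix is unchanged (dropout-style scaling).
    mask = rng.binomial(1, 1.0 - drop_rate, size=(X.shape[0], 1))
    return X * mask / (1.0 - drop_rate)

def random_propagation(X, A_hat, order=3, drop_rate=0.5, rng=np.random):
    # A_hat: normalized adjacency with self-loops (dense here for clarity).
    # Propagate the perturbed features and average over propagation orders
    # 0..order to obtain one stochastic augmented feature matrix.
    cur = drop_node(X, drop_rate, rng)
    out = cur.copy()
    for _ in range(order):
        cur = A_hat @ cur
        out = out + cur
    return out / (order + 1)
```

Each call produces a different stochastic augmentation; per the summary above, any downstream classifier can then be trained on several such augmentations.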
Strengths: (1) The framework is simple, yet it beats a wide range of GNN baselines on semi-supervised node classification benchmarks. (2) The proposed method is more robust to adversarial attacks. (3) Code is provided.
Weaknesses: The proposed methods are not that novel. Graph-based feature propagation is used in SGC (https://arxiv.org/abs/1902.07153) and GFN (https://arxiv.org/pdf/1905.04579.pdf), and consistency regularization is a standard technique in semi-supervised learning (https://arxiv.org/abs/1610.02242). More specifically:
(1) The consistency regularization appears to be a general framework that can be combined with other data augmentation methods, such as DropEdge and sampling algorithms. It would be better if the authors could also try these combinations instead of only adopting their proposed DropNode augmentation.
(2) The authors claim that the method is suitable for semi-supervised node classification where training nodes are scarce. It would therefore be helpful to provide a curve showing the performance of the proposed framework against other baselines under different training data percentages.
(3) For larger datasets, it would be better to evaluate on the Stanford OGB benchmark (https://ogb.stanford.edu/docs/leader_nodeprop/), which provides more standard data splits.
(4) For sampling-based baselines, it would be better to add some recent works such as GraphSAINT (https://arxiv.org/abs/1907.04931) and LADIES (https://arxiv.org/abs/1911.07323), and to combine these methods with more advanced base GNNs.
=========================================
Update: Overall, the rebuttal resolves many of my concerns, so I have decided to raise my score. I still have some comments on the added experiments:
(1) The added experiment combining CR with DropEdge shows consistent performance improvement over DropEdge alone, which further supports that CR is a general framework applicable to many other sampling methods; this is very interesting and worth further research.
(2) I am actually more curious about the results with more data than the current split (say, 30% or more). That said, the results added by the authors are very impressive, showing that GRAND is very powerful when labels are extremely scarce.
(3) The authors provide a theoretical analysis showing that adding the CR loss approximately controls a weighted average of the variance of node features under different perturbations. What if we directly used this quantity as a regularizer to train the model?
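Regarding point (3), a rough sketch of the kind of variance term being referred to, assuming S perturbations and writing $\tilde{z}_i^{(s)}$ for the representation or prediction of node $i$ under the $s$-th perturbation (the notation and per-node weights $w_i$ are mine, not the paper's):

```latex
\mathcal{R} \;=\; \sum_{i} w_i \,\widehat{\mathrm{Var}}_i,
\qquad
\widehat{\mathrm{Var}}_i \;=\; \frac{1}{S}\sum_{s=1}^{S}
\big\| \tilde{z}_i^{(s)} - \bar{z}_i \big\|_2^2,
\qquad
\bar{z}_i \;=\; \frac{1}{S}\sum_{s=1}^{S} \tilde{z}_i^{(s)}.
```

Optimizing such a quantity directly, as suggested, would amount to explicitly minimizing the per-node empirical variance across perturbations.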
Correctness: Yes, they are correct.
Clarity: This paper is well written and motivated.
Relation to Prior Work: Yes.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: This paper proposes a novel framework, GRAND, for semi-supervised learning on graphs. To improve generalization and mitigate the over-smoothing problem, the authors propose a random propagation strategy (DropNode) on graph data, i.e., randomly selecting nodes and dropping their features entirely. In addition, a consistency regularization loss is proposed for unlabeled data to enforce consistency among the DropNode augmentations. Experimental results demonstrate the effectiveness of the proposed framework.
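As a concrete illustration of such a consistency loss over S DropNode augmentations, here is a minimal PyTorch-style sketch; the sharpening temperature and squared-error distance are common choices in consistency regularization and are assumptions here, not a statement of the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_list, temperature=0.5):
    # logits_list: S tensors of shape (N, C), one per DropNode augmentation.
    probs = [F.softmax(logits, dim=1) for logits in logits_list]
    avg = torch.stack(probs, dim=0).mean(dim=0)                 # average prediction per node
    sharp = avg.pow(1.0 / temperature)
    sharp = (sharp / sharp.sum(dim=1, keepdim=True)).detach()   # sharpened target distribution
    # Penalize the distance of each augmented prediction from the shared target.
    return sum(((p - sharp) ** 2).sum(dim=1).mean() for p in probs) / len(probs)
```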
Strengths: The proposed framework is novel. The authors theoretically discuss the effectiveness of DropNode and its advantage over dropout. The empirical evaluation is sufficient, and the ablation study demonstrates that the proposed framework achieves better generalization and is more robust against attacks. I have carefully read the authors' rebuttal and the other reviewers' comments, and I stand by my original rating.
Weaknesses: I think the overall framework is interesting and suitable for semi-supervised graph learning. However, the technical novelty of DropNode needs further discussion. With the proposed DropNode, a subset of nodes is randomly selected and dropped, generating multiple perturbed graphs for training. This idea is similar to the sub-graph sampling strategy proposed in FastGCN [7]. Another similar idea appears in GraphSAGE [16], where neighbors are randomly sampled from the full neighborhood set of a node.
Correctness: As far as I’m concerned, the claims and methods in this paper are correct.
Clarity: This paper is well-written and easy to follow.
Relation to Prior Work: DropEdge [29] is a related work, and the authors discuss the disadvantages of DropEdge from the perspective of computational complexity. However, the essential difference between DropEdge and DropNode is not discussed in the paper. Both edge-oriented and node-oriented methods can randomly sample sub-graphs and improve generalization ability. It is necessary to explain why DropNode brings significant performance gains for semi-supervised learning while DropEdge does not.
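To make the requested contrast concrete, a minimal DropEdge-style sketch (the edge-list representation and names are my assumptions for illustration):

```python
import numpy as np

def drop_edge(edge_index, drop_rate=0.5, rng=np.random):
    # edge_index: (2, E) array of edges. DropEdge removes edges at random,
    # perturbing the graph structure itself, so any normalized adjacency
    # must be recomputed for each perturbed graph.
    keep = rng.random(edge_index.shape[1]) >= drop_rate
    return edge_index[:, keep]
```

By contrast, DropNode keeps the graph fixed and perturbs only the node features, which is where the paper's computational-complexity argument comes from; the question raised above, why the node-oriented perturbation yields larger gains, goes beyond this cost difference.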
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: This paper proposes a method for semi-supervised learning on graphs that addresses the issues of over-smoothing and non-robustness. It does so by applying graph data augmentation in combination with consistency regularization across nodes. The authors draw connections between their contributions and regularization, and further show that the proposed method, while simple, outperforms 14 state-of-the-art GNN baselines.
Strengths:
* Clearly motivated method, accompanied by explanations of the underlying insights
* Theoretical backing, where the proposed method is rephrased as a regularization term
* Thorough experiments and ablation studies
Weaknesses: No significant weaknesses from my perspective.
Correctness: Yes and yes.
Clarity: The paper is well written.
Relation to Prior Work: Yes.
Reproducibility: Yes
Additional Feedback: I would have benefited from seeing the training splits in the main paper instead of the supplement, but regardless, the information was easy to find. I'm curious how the performance of GRAND responds as the number of training nodes is increased. I'd be interested to see what the limits are here - at how many labeled training nodes is performance comparable to existing baselines? Is there a lower limit of labeled training nodes at which this method is useful?