NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5861
Title: Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks

Reviewer 1

As described, the paper makes some strong contributions to the understanding of GCNs. However, some parts of the paper are not clearly written or organized. For example, the discussion and numerical results around Figure 1 are difficult to understand. Additionally, some figure and table captions and some mathematical derivations are not clearly expressed and should be revised. The proposed architectures are evaluated and compared on 3 common datasets (all for classification of scientific publications into one of several classes), and the results on these problems are fairly convincing. However, it would be more convincing to expand the experiments and their interpretation to different families of problems (with diverse data structures), and to compare the time and memory complexity of the approaches. Apart from the number of layers, it is not clear from the paper and its experiments what factors are critical for the performance of the proposed architectures. What key factors can affect the accuracy and complexity of the proposed architectures?

Reviewer 2

# originality
The theoretical analyses of the scalability of GCN are highly original and important in practice. The idea of using Krylov subspace methods to learn the spectral filter has become fairly common recently, e.g., LanczosNet. The proposed architectures, namely the snowball GCN and the truncated Krylov block network, have a certain novelty.

# quality
- I would like to see an ablation test to investigate whether it is Tanh or the snowball/truncated Krylov architecture that is responsible for the gains. Given the good end-to-end performance, this is very important for clarifying the source of the difficulty in training GCNs.
- The experiments mostly follow the procedures in existing work, which should make the comparisons fair. More varied applications or datasets would make them more interesting. I believe Tables 1 and 2 do not show precision but rather classification accuracy (I also checked the submitted code, and it reports classification accuracy). At least LanczosNet and Multi-stage training report the same numbers in those tables as classification accuracy in their original papers. Ditto for Tables 3 and 4 in the appendix. By chance, I found that the number for LNet + Cora + 3% must be 76.3 if you refer directly to the number in the original paper.
- It is interesting to look at two different architectures, but I would like to know why you need to propose two, especially given the equivalence analysis in Section 5.3. Can I say: use snowball if you do not have many computational resources, otherwise use truncated Krylov?

# clarity
- The introduction to GCN and the Krylov subspace methods is very good and helpful to outsiders. I just suggest a few points that could be improved.
  * In Section 4.1, the block vector subspace first appears in l.92 but is only defined later in l.104. Could you give the definition first?
  * About 5 lines after l.115 (between l.115 and l.116), the spectral radius is undefined. Could you define it?
- The explanation of the motivation for the proposed architectures might be a bit misleading. I might have misunderstood, but let me confirm so as not to get something wrong. You claim in l.39 that the proposed architectures are motivated by the theoretical analysis (Theorems 1 and 2); this is only partially the case, since you choose the Tanh activation instead of ReLU, which makes GCN more scalable and retains richer information even in deeper networks. However, the main ideas in your snowball GCN and truncated block Krylov network seem motivated by alleviating the difficulty in the Krylov subspace methods (Section 4.4). Could you clarify this part? Or you could emphasize this difficulty in the introduction.
- Please explain the percentage under the dataset names in Tables 1 and 2 (the train-validation split ratio, right?). Ditto for Tables 3 and 4 in the appendix.

# significance
This work gives an important insight: the commonly used ReLU activation might not be suitable for GCNs. This is also confirmed through simple empirical studies. Follow-up work can investigate this line as one of the open problems in the GCN field.

=========== After feedback ===========
Thank you for providing answers. I will leave my score the same. I still feel it would be nice to have a message saying which architecture we should use, based on the trade-offs if there are any.
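For concreteness about the ablation and the trade-off question above, here is a minimal PyTorch-style sketch of the two layer types as I understand them from the paper. The class and argument names are my own, not the authors' code, and the `activation` argument is where a Tanh-vs-ReLU ablation would be swapped in.

```python
import torch
import torch.nn as nn


class SnowballLayer(nn.Module):
    """One snowball-style layer: its input is the concatenation of the
    outputs of all previous layers (illustrative sketch, not the authors' code)."""

    def __init__(self, in_features, out_features, activation=torch.tanh):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Tanh by default; pass torch.relu here to run the ReLU ablation.
        self.activation = activation

    def forward(self, laplacian, stacked_features):
        # stacked_features = [H_0, H_1, ..., H_l] concatenated along the feature axis
        return self.activation(laplacian @ self.linear(stacked_features))


class TruncatedKrylovLayer(nn.Module):
    """One truncated block Krylov layer: concatenate [H, L H, ..., L^{m-1} H]
    before the linear map (illustrative sketch, not the authors' code)."""

    def __init__(self, in_features, out_features, m, activation=torch.tanh):
        super().__init__()
        self.m = m
        self.linear = nn.Linear(in_features * m, out_features)
        self.activation = activation

    def forward(self, laplacian, features):
        blocks, h = [features], features
        for _ in range(self.m - 1):
            h = laplacian @ h  # next block of the Krylov subspace
            blocks.append(h)
        return self.activation(self.linear(torch.cat(blocks, dim=1)))
```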

Reviewer 3

The paper is clear, but some parts could be improved. For example, the authors refer to scalability issues for GCN in the sense of stacking multiple layers, whereas the term usually refers to scalability with respect to the size of the input. The authors focus on a very specific instantiation of graph convolutional networks, namely GCN, and on spectral methods. How does the method compare to approaches based on the more general message passing paradigm, which can implement both local and global computation? Laplacian smoothing is not necessarily an issue there. Why is the graph defined using both edges and the adjacency matrix? Isn't it enough to have either one? Please explain. At the end of Section 2 it is said that the Chebyshev polynomial approach constitutes a spectrum-free method. The method does not require computing the eigendecomposition, but the resulting method still behaves as a spectrum-based one. The experiments are very thorough and show that the proposed method achieves good performance on all the proposed tasks. However, the section is quite short and should be expanded for better clarity. For example, it is not specified which column corresponds to the usual data regime (the non-decimated setting) for each experiment. The future work section in its current form is not very useful. I'd consider rewriting Sections 4 and 5 to make space for better motivation and explanation of the results.
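To make the "spectrum-free" point concrete, here is a minimal NumPy sketch of Chebyshev filtering as I read it (the function name and the rescaling assumption are mine): only matrix-vector products are needed, so no eigendecomposition is ever computed, yet the result is still a fixed polynomial of the Laplacian and therefore still acts on its spectrum.

```python
import numpy as np


def chebyshev_filter(laplacian, x, coeffs):
    """Apply the polynomial filter sum_k c_k T_k(L_hat) x via the Chebyshev
    recurrence, with no eigendecomposition (illustrative sketch; L_hat is
    assumed to be rescaled so that its spectrum lies in [-1, 1])."""
    t_prev, t_curr = x, laplacian @ x          # T_0(L) x = x,  T_1(L) x = L x
    out = coeffs[0] * t_prev
    if len(coeffs) > 1:
        out = out + coeffs[1] * t_curr
    for c in coeffs[2:]:
        # T_{k+1}(L) x = 2 L T_k(L) x - T_{k-1}(L) x
        t_prev, t_curr = t_curr, 2 * (laplacian @ t_curr) - t_prev
        out = out + c * t_curr
    return out
```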