__ Summary and Contributions__: This paper proposes Graph Cross Networks (GXN) for modeling graph data. A GXN uses a vertex infomax pooling method to create multi-scale graphs, and several feature-crossing layers then enhance multi-scale information fusion. The authors conduct extensive experiments on both graph classification and vertex classification. The results show that GXN outperforms many competitive baselines.

__ Strengths__: 1. Important problem to the NeurIPS community.
2. Extensive experiments.
3. Good results.

__ Weaknesses__: 1. Limited novelty.
2. Lack of comparison with existing methods.

__ Correctness__: The claims are correct.

__ Clarity__: Overall, the paper is well written and easy to follow.

__ Relation to Prior Work__: The paper does not have a related-work section; the relation to existing graph pooling methods is discussed only in a single paragraph of the introduction, which is not very clear to me.

__ Reproducibility__: Yes

__ Additional Feedback__: This paper proposes GXN for modeling graph data. Overall, the proposed approach is quite intuitive, and extensive experiments on graph classification and vertex classification demonstrate its effectiveness.
I have the following concerns regarding the paper:
1. The relation to existing graph pooling techniques is not clear.
One key component of GXN is vertex infomax pooling, which is used to create multi-scale graphs and is thus related to existing graph pooling techniques. However, this paper does not contain a related-work section, and the authors discuss the relation to existing pooling methods only in the introduction. Given the many recent graph pooling papers, I wonder what the advantage or contribution of vertex infomax pooling is over existing pooling methods. It would be helpful if the authors could further explain this during the rebuttal.
2. Technically, the novelty of the paper is limited.
In GXN, the two key ideas are vertex infomax pooling and feature-crossing layers. For vertex infomax pooling, GXN proposes to use mutual information estimation and maximization techniques for graph pooling. However, these techniques have already been extensively used in graph machine learning, e.g., in Deep Graph Infomax. Given these works, although the idea of using infomax for graph pooling seems interesting, the technical contribution is quite limited. The feature-crossing layer looks more like a straightforward extension of Graph U-Net in which graphs at multiple scales can interact with each other, so the contribution of this part is also quite marginal. Overall, despite some interesting ideas in the paper, I feel the contribution is not sufficient.
3. I also have some questions regarding the experiment.
(1) For vertex classification, since the three datasets Cora, Citeseer, and Pubmed are quite small, people usually run each model with many different seeds (e.g., 50 or 100) to report the mean and standard deviation. In the experiments, how many runs did the authors use to compute the statistics of GXN?
(2) For vertex classification, only the results on the standard data splits are reported. To better convince readers, it would be helpful to evaluate GXN on more random splits.
------------------------------
Thanks to the authors for the detailed explanation, and for all the additional results on vertex classification!
I still have a few concerns on the VIPool method, and hope the authors could address them in the revised paper.
(1) In VIPool, P_v, P_n, and P_{v,n} are all discrete distributions (although the feature vectors can be continuous, there are |V| nodes, so a sample from P_v can take at most |V| values, making it a discrete distribution). For such discrete distributions, the mutual information can be computed directly from the definition in O(|V|^2) time, without neural estimators. Although a neural estimator may give smoother estimates, it also has a much larger computational cost, so I do not see why these neural estimators are necessary for this task.
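To illustrate the direct computation described above, here is a minimal sketch that evaluates I(V;N) exactly from the definition for a hypothetical toy joint distribution (the names and probabilities are placeholders, not values from the paper); the double sum over vertex/neighborhood values is what costs O(|V|^2):

```python
import math

# Hypothetical toy joint distribution P_{v,n}; in the paper's setting each
# marginal takes at most |V| distinct values, so iterating over all
# (v, n) pairs below costs O(|V|^2).
joint = {
    ("v1", "n1"): 0.4, ("v1", "n2"): 0.1,
    ("v2", "n1"): 0.1, ("v2", "n2"): 0.4,
}

def mutual_information(joint):
    """Exact I(V;N) = sum_{v,n} p(v,n) * log(p(v,n) / (p(v) * p(n)))."""
    p_v, p_n = {}, {}
    for (v, n), p in joint.items():     # accumulate the two marginals
        p_v[v] = p_v.get(v, 0.0) + p
        p_n[n] = p_n.get(n, 0.0) + p
    return sum(p * math.log(p / (p_v[v] * p_n[n]))
               for (v, n), p in joint.items() if p > 0)

mi = mutual_information(joint)  # exact value, no neural estimator needed
```

A neural estimator would approximate the same quantity from sampled positive/negative pairs; the exact sum above is feasible precisely because the distributions are discrete.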
(2) The authors claimed that the cost of VIPool is O(|V|). From my understanding, VIPool greedily selects |\Omega| nodes, where |\Omega| is proportional to |V|. To select each node, VIPool needs to draw negative samples from the node set, so the cost is O(|V|). Therefore, the total cost is at least O(|V|^2) rather than O(|V|).
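A back-of-the-envelope accounting of this argument, with a hypothetical pool ratio and vertex count (both placeholders, not values from the paper):

```python
def greedy_pool_cost(num_vertices, pool_ratio=0.5):
    """Count the scoring work when greedily selecting |Omega| = pool_ratio * |V|
    vertices, drawing negatives from the whole vertex set for each pick."""
    num_selected = int(pool_ratio * num_vertices)  # |Omega|, proportional to |V|
    ops = 0
    for _ in range(num_selected):  # one greedy pick per iteration
        ops += num_vertices        # O(|V|) negative-sampling work per pick
    return ops
```

Doubling the number of vertices quadruples the count, which is the O(|V|^2) behavior pointed out above.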
(3) To speed up pooling, the authors mention that negative sampling was used, but neither the number of negative samples nor their distribution is clearly stated; both are important for replicating the results of the paper.
Despite the above concerns, I feel like the authors have addressed most of my concerns, and the idea of using mutual information for pooling is quite new. Therefore, I raised my score to weak accept.

__ Summary and Contributions__: A new graph pooling operator based on the mutual information between vertices and their neighborhood is proposed and used in a new GNN architecture supporting information flow between different levels of abstraction.

__ Strengths__: 1) Well-engineered approach with some original contributions.
2) Clearly written, relation to other work is made explicit.
3) Solid experimental evaluation showing the merits of the approach compared to state-of-the-art approaches.

__ Weaknesses__: 1) Pooling has been studied extensively, and this work is clearly incremental, building on known techniques.
2) I would like to see a more detailed discussion of the choice of the parameter K.

__ Correctness__: Yes, to the best of my knowledge.

__ Clarity__: The paper is well-written and supported by figures. The paper is quite condensed and on a good technical level, but the authors managed to keep it easily readable and comprehensible.

__ Relation to Prior Work__: The relation to other work is clearly stated and the differences are discussed in detail. The relevant literature is cited as far as I can tell.

__ Reproducibility__: Yes

__ Additional Feedback__: In the experimental evaluation it is stated that "To improve the efficiency of solving problem (1), we modify C(Ω) by preserving only the first term." -- I do not understand this; probably the wrong equation is referenced here. This approach should be justified.
==== Update after the rebuttal ====
Thank you for providing additional experimental results. I have left my score unchanged.

__ Summary and Contributions__: In this paper, the authors propose a vertex pooling method based on neural information selection and then design a new architecture to compress the graph. The experimental results show moderate improvement over the existing approaches.

__ Strengths__: S1: The idea of selecting vertices by mutual information is novel and interesting.
S2: The experimental results show reasonable improvement.

__ Weaknesses__: W1: Insight into how the proposed pooling method differs from other pooling methods is lacking.
W2: The performance improvement of the proposed pooling over existing methods is not significant. In particular, on a few datasets, StructPool seems to be better than GXN with other pooling methods. What if you combine StructPool with GXN?

__ Correctness__: Yes.

__ Clarity__: Reasonable

__ Relation to Prior Work__: I would like the authors to provide a more detailed review of state-of-the-art pooling methods compared with the new approach.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper proposes a graph cross network to achieve comprehensive feature learning from multiple scales of a graph. Its two key components are a novel vertex infomax pooling (VIPool), which creates multiscale graphs in a trainable manner, and a novel feature-crossing layer, which enables feature interchange across scales. VIPool selects the most informative subset of vertices based on a neural estimation of the mutual information between vertex features and neighborhood features. The proposed method is applied to both graph and node classification tasks, where it obtains state-of-the-art results.

__ Strengths__: The paper proposes two novel techniques: (1) vertex infomax pooling, which is essentially hierarchical graph pooling that considers the neighborhoods of nodes, and (2) the graph cross network for multiscale feature learning, which I find novel compared to the existing literature. The paper also experimentally compares the proposed method with existing works, which makes it quite strong.

__ Weaknesses__: Although the experiments demonstrated in the paper are quite convincing, some further experiments would be interesting for the community:
(1) The effect of the number of neighborhood hops R on vertex classification accuracy.
(2) A comparison of training times with existing similar techniques.
These experiments can either be added to the supplementary or in a future extended version.

__ Correctness__: I think the claims and the method are correct. The experimental results are convincing.

__ Clarity__: The paper is quite well written and organized. It was easy to follow.

__ Relation to Prior Work__: I find one existing paper missing from the references:
[1] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, Jure Leskovec, Hierarchical Graph Representation Learning with Differentiable Pooling, NeurIPS, 2018.
Others are properly mentioned as far as I understand.

__ Reproducibility__: Yes

__ Additional Feedback__: (1) The overall loss function of the method is written as L = L_{task} + \alpha L_{pool}. I am curious about the behavior of \alpha and the stability of the training procedure. It would be interesting to see a plot of the mutual information against the number of training iterations.
(2) Is it possible to use vertex infomax pooling with other kinds of multiscale graph neural network architectures, such as Graph U-Net? I think an experiment demonstrating that would be interesting to the community.
(3) The graph cross network has feature-crossing layers, which I suppose are computationally very expensive and hence difficult to train. It would be interesting to hear more about the training procedure of the model.
(4) I am also curious about the effect of the number of layers and hierarchical levels used to train the model.
Apart from these comments, I am quite happy with the paper.
After rebuttal:
==========
I have read through the rebuttal submitted by the authors and I am quite satisfied with that, hence I am happy to accept the paper.