Paper ID: | 6235 |
---|---|

Title: | Rethinking Kernel Methods for Node Representation Learning on Graphs |

After rebuttal: thank you for the additional experiments. They strengthen the empirical contribution of the paper, so I've increased my score to a 7. ________________ Originality: The paper is a novel combination of known techniques: by reinterpreting the the iterative node aggregation procedure of Kipf et al's GCN as feature smoothing technique, they develop a novel feature mapping function for learning positive semi-definite (psd) graph kernels. The key difference from the Kipf et al approach is they separate the node aggregation and non-linear representation learning components: node features are the output of a multi-layer perceptron and then aggregated once (rather than at every layer) by a multi-hop aggregation function. They argue theoretically that this approach is universal in the sense that it can approximate any invertible psd kernel. Quality: I thought the empirical results of the paper were interesting because they suggest that decoupling the aggregation and representation learning components of GCN-style models leads to better performance (at least on these datasets). I would have liked to see more ablation experiments to explore this further: e.g. how does performance change with fewer / more hops? What role is \omega_h playing empirically? e.g. Would a fixed constant, say c^h, c \in [0, 1] lead to similar performance or is it necessary to have a trainable parameter? Do you need an MLP or would a single projection layer suffice? If not, how many layers do you need? I didn't find the theory as interesting - it essentially says you can learn psd kernels (Theorem 1) & that they are universal (Theorem 2). That may be true, but I'm not sure it buys us anything beyond the fact that you're learning a similarity function? Clarity: in general the paper was well-written and clear. That said, table 1 was missing what units it was in - accuracy? AUC? Also are those standard error or standard deviations? Significance: I thought this is an interesting contribution because it gives a new perspective on the relative importance of the components of GCN-style models. That said (as mentioned above), a proper ablation study that explored the implications of this perspective would make this a far stronger contribution.

The paper presents a kernel (and MLP)-based approach for node classification in a graph. The system is rather complex and takes more than one reading to understand it. The paper lists four contributions, two more than the ones I reported in the "Contributions" section. I do not fully agree with the remaining ones: "we theoretically show that our formulation is powerful enough to express any valid positive semidefinite kernels.". As the authors state, their results assume the feature mapping \phi to be invertible, which I do not think it is true for any \phi. The authors should smooth the claim or prove any kernel is invertible. I understand they use a MLP to approximate \phi^-1, but that's an approximation, it should be proved how precise that is for every kernel for the claim to be true in its current form. "our model captures the structural information of a node by aggregation in a single step, other than a recursive manner, thus is more efficient". However, an analysis of the efficiency of the method, together with a comparison with related methods, is missing in the paper. If I understand correctly the statement in rows 114-115, it should be rephrased to "it is impossible to..." (see [1] for a justification). The inline equation in row 119 reminds me of the Mapping Kernels framework, it would be fair to assess whether I am correct and, in case, cite the paper [2]. Theorem 1 is straightforward, there is no need for a Theorem. It is not easy to understand how significant is the improvement, because a description of the datasets is missing in the paper and no statistical analysis is provided. However, I see the main contribution of the paper in the novel theoretically-guided approach. ---------------- [1] Ramon, J., & Gärtner, T. (2003). Expressivity versus Efficiency of Graph Kernels. Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences {(MGTS-2003)}, 65–74. [2] Shin, K., & Kuboyama, T. (2008). A generalization of Haussler’s convolution kernel: mapping kernel. In Proceedings of the 25th international conference on Machine learning (pp. 944–951). Helsinki, Finland: ACM. https://doi.org/10.1145/1390156.1390275

This paper proposed a learnable kernel-based framework for node classification. To efficiently learn the kernel, the paper proposed a mechanism for node feature aggregation and a data-driven similarity metric employed during the training phase. The problem and solution are both interesting, however the experiments are conducted on three datasets. What's the time and space complexity of this method?