NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 3470
Title: Wasserstein Weisfeiler-Lehman Graph Kernels

Reviewer 1

Originality: The paper proposes a novel method to compute distances between graphs that can take into account the node content as well as the global distribution of information, using the Wasserstein distance and optimal transport tools, building on existing subtree kernels. The fact that it can be applied to continuous node attributes is a plus. On that matter, I wonder if links could be made with the kernels proposed in [1] (which can also deal with complex information on edges)?

Quality: The proposed graph kernel is based on known techniques, so the theoretical part is not large; it mainly concerns the definiteness of the resulting kernel. This point is well discussed and proven for the discrete case. For the case of continuous attributes, the authors have insights but no formal proof. However, since they chose to use KSVM as a safeguard in the experimental part, this is not a problem. They also note that the results of KSVM and SVM are very similar, probably signifying that either the resulting kernels are indeed positive definite or, if not, the negative part is negligible.

Clarity: The paper is clearly written and polished. All required tools are described/defined. It is a clear improvement over a previous version I had the opportunity to review. Together with the provided code, this work can easily be reproduced and re-used.

Significance: Having efficient and accurate kernels for graphs is definitely an important matter. I am confident that this work will be useful both for future research and for practitioners.

Remark: There are some arXiv references for papers that are very likely published; please check and update.

[1] Kashima et al. Marginalized kernels between labeled graphs, ICML 2003.
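The reviewer's observation that KSVM and SVM give very similar results can be probed empirically: if the Gram matrix's most negative eigenvalue is negligible relative to its scale, the kernel is positive definite for practical purposes. A minimal sketch of such a check; the helper name and the toy RBF matrix are illustrative, not the paper's kernel:

```python
import numpy as np

def min_eigenvalue(K):
    """Smallest eigenvalue of a symmetric kernel matrix.

    A value >= 0 (up to numerical noise) means the matrix is
    positive semi-definite on this particular data set."""
    K = 0.5 * (K + K.T)  # symmetrize against numerical asymmetry
    return np.linalg.eigvalsh(K)[0]

# Toy example: a Gaussian (RBF) kernel matrix, PSD by construction,
# so its smallest eigenvalue should be non-negative up to round-off.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2)
```

Running the same check on an empirical WWL Gram matrix would show how large the "negative part" actually is.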

Reviewer 2

The main motivation of this work is the fact that conventional graph kernels lose information in their embedding and/or aggregation steps. While we agree with the authors on this point, it is not clear what information is lost by the proposed WWL graph kernel. Since the proposed method is based on the WL subtree kernel, it inherits the same weaknesses. Moreover, it may have additional issues, such as the non-uniqueness of the embedding and the iterative operations related to hashing… The part "To ensure the theoretical correctness of our results…" is confusing and misleading. On a first reading, the reader may understand that the theoretical results are not correct. The authors need to rewrite this part to emphasize that indefinite kernel learning and RKKS are only required when working with the continuous WWL; this is not the case for the categorical WWL. Moreover, the results given in Table 1 are equivalent to using conventional positive definite kernels and an RKHS. It is difficult to interpret the assessment when comparing KSVM (in the case of WWL) against conventional SVM (all other methods). We think it is important to assess the price to pay when using the proposed indefinite kernel with an RKKS. This is relevant because many graph kernels have been proposed with proven positive definiteness and can thus be used directly in conventional machine learning algorithms. In the experiments, WL-OA and the proposed WWL seem to have comparable performance when dealing with categorical node labels. We think it would be important to test whether a version of WL-OA that takes continuous node labels into account would be as powerful as the proposed WWL. All experiments would have benefited from a comparison with more graph kernels from the literature; this would be a more meaningful baseline than simple vertex or edge histograms.
In other words, it is not clear how this more complex approach, which may produce indefinite kernels, compares to simpler approaches, i.e., the large class of existing graph kernels. This is the first time we have read "Reproducible kernel Hilbert space" and "Reproducible kernel Krein space"; it should be "Reproducing", not "Reproducible"!

--------------------------------------

We thank the authors for their positive feedback concerning some of the raised issues. We have updated the overall score accordingly.
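The "price to pay" for indefiniteness raised above can be made concrete by projecting the Gram matrix onto the PSD cone and measuring the size of the correction. A hedged sketch with an illustrative toy distance-substitution kernel; none of this is the authors' code:

```python
import numpy as np

def clip_spectrum(K):
    """Project a symmetric (possibly indefinite) kernel matrix onto
    the PSD cone by zeroing its negative eigenvalues."""
    w, V = np.linalg.eigh(0.5 * (K + K.T))
    return (V * np.clip(w, 0.0, None)) @ V.T

# Toy indefinite Gram matrix: naive distance substitution K = -D.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
K = -D
K_psd = clip_spectrum(K)
# Frobenius norm of the correction: one measure of the "price" paid
# for feeding an indefinite kernel to a PSD-only learner.
price = np.linalg.norm(K - K_psd)
```

Reporting such a correction norm for the continuous WWL matrices would quantify how far they are from the RKHS setting.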

Reviewer 3

Comment before rebuttal/discussion: The idea is interesting and potentially useful, though it is not obvious to this reviewer whether the optimal transport metric is the right one in this case. Given two isomorphic graphs, the W-L embedding is not exactly the same vector, but the corresponding permutation will send one into the other. The Wasserstein distance is actually very sensitive to permutations of the labels, so it seems to me that something like the Gromov-Wasserstein distance should be better suited to comparing graphs. The point the authors make in lines 71-80 is that the W-L kernels (which use measurements based on walks and other permutation-invariant features) are too weak, and that is why the Wasserstein distance is a better candidate. I may be missing something, but I would like to see a theoretical justification for the Wasserstein distance in this context. Is it true that if two graphs are isomorphic then the WL embeddings are essentially the same, and therefore the Wasserstein distance is a relevant metric? Or is the algorithm taking this into consideration in a way I am missing?

Comment after rebuttal/discussion: Now I see where my confusion was. I think the notation is a little unusual for W-L. If you look at the outcome of the iterations of W-L as distributions in R^{H+1} (supported on N points, where N is the number of vertices in the graph), then it works: the distributions do not care about the permutation. (Following the W-L algorithm it is intuitive why it works; the sequence of vectors for each node is equivariant with respect to permutations of the original labels.) Before, I understood that the authors were computing Wasserstein distances between the different iterations of the W-L algorithm (in that case they would be distributions in R^N supported on H+1 points), and in that case the invariance does not hold. The authors may want to clarify this point and say that they are computing a distance between distributions in R^{H+1}.
In particular, it may be worth mentioning that the distance is 0 whenever two graphs are isomorphic, maybe in line 158, right after the statement "While not constituting a test of graph isomorphism...".

Originality: This reviewer believes the idea of the paper is interesting. I am not familiar enough with this literature to comment on the originality of the paper.

Quality: The paper is very well written and very clear. The only issue I had is the one mentioned in the point above. The rest of my comments were addressed by the authors' response. Thank you!

Feedback on the code: Since I was confused about the embedding in the isomorphic-graphs setting, I actually used the code to understand what was going on. The code computes the W-L embedding and then the optimal transport on the embedding. It reads a file in a specific format containing many graphs. I think a better interface, which should be easy to add, is a simple Python function that takes two adjacency matrices and computes the embedding and the Wasserstein distance. It is not necessary, but it would make the code much easier to use. (Actually the implementation is very easy, but I had to look at it in order to understand what the embedding and the concatenation of different W-L iterations were doing.)

Clarity: The paper is clear except for the point mentioned above.
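The two-adjacency-matrix interface the reviewer asks for might look roughly like the sketch below. Everything here is an assumption for illustration, not the authors' implementation: the degree-based initialization, the neighbourhood-averaging scheme, the function names, and the restriction to equal-size graphs (which lets the uniform optimal transport problem be solved exactly as an assignment).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wl_embedding(A, H=3):
    """Continuous WL-style embedding: start from node degrees and
    repeatedly average each node's value with its neighbourhood mean.
    Returns an (N, H+1) matrix: one point in R^{H+1} per node."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    x = deg.copy()
    cols = [x]
    for _ in range(H):
        x = 0.5 * (x + (A @ x) / np.maximum(deg, 1.0))
        cols.append(x)
    return np.stack(cols, axis=1)

def wl_wasserstein(A1, A2, H=3):
    """Wasserstein distance between the uniform point clouds of the two
    embeddings (exact for equal-size graphs via optimal assignment)."""
    X, Y = wl_embedding(A1, H), wl_embedding(A2, H)
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].mean()

# Two isomorphic graphs (a path on four nodes and a relabelling of it):
# the embedding rows are permuted identically, so the distance is 0,
# matching the invariance discussed above.
A1 = np.array([[0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 0]])
perm = [2, 0, 3, 1]
A2 = A1[np.ix_(perm, perm)]
```

With this sketch, `wl_wasserstein(A1, A2)` returns 0 for the isomorphic pair above, illustrating why the distance in R^{H+1} does not see the permutation.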