NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center

### Reviewer 1

#### Update after rebuttal

I thank the authors for their detailed rebuttal; my questions were answered sufficiently. I am convinced this would make a good contribution and suggest accepting the paper.

#### Summary of the review

This is a well-written paper on a very relevant topic, namely the use of topological data analysis (TDA) for graph classification. The paper uses a 'learned' weight function for persistence images, i.e. a descriptor of topological features that can be vectorised. The weight function is seen to be crucial for obtaining good performance values. While I have some more comments/suggestions, I suggest acceptance.

#### Clarity & terminology

The paper is written clearly and can be followed very well. There are a few word choices & explanations that could be improved, though:

- Technically, persistence diagrams form a multiset in the _extended_ plane because infinite features might still exist (l. 23).
- The caption of Figure 1 could be extended to make the paper more accessible. Space permitting, a brief description of the method, along with pointers to the respective sections/parts of the method, could be helpful.
- The description of persistent homology in Section 2 is not accessible to non-experts. For example, the sentence 'Using the topological object called homology classes to describe these features' is not entirely correct. I know that the 'typical' description of persistent homology in terms of groups, generators, and so on, is too technical, but I would still suggest rephrasing a few things here. Two suggestions for inspiration (I am not affiliated with any of these authors; I just think that their explanations of persistent homology and TDA are accessible):
  1. Carrière et al.: 'PersLay: A Simple and Versatile Neural Network Layer for Persistence Diagrams'
  2. Hofer et al.: 'Deep Learning with Topological Signatures'
- While correct, I find that the citation of zigzag persistence and extended persistence is not useful here because, as far as I understand, they are not used in the experimental section.
- In Figure 2, the persistence image is not rendered correctly; this might be solvable by 'filling' the cells instead of showing only their outlines (it would also make the figure easier to read).
- If space permits, some of the equations could be typeset on their own, because it would make it easier to refer to them.
- In l. 118 (Definition 2.1): what is the 'persistent plane'? This should probably refer to the (extended) persistence plane.
- In l. 123, there is a sentence fragment: 'The persistence image for a diagram $A$ is $PI_A$ consists of $N$ numbers'.
- 'Persistence image' is sometimes written as 'persistence-image'. One spelling should be used consistently.
- The paper uses $[1, k]$ and $[1, m]$ to denote discrete intervals. I would suggest using set notation instead; at first glance, the current notation can easily be misunderstood as a continuous interval.
- The discussion of the matrix formulation is somewhat redundant: both in l. 196 and in l. 223, the paper mentions that it does not require loops. Mentioning this once should be sufficient.
- l. 270: This should read 'Generation of persistence diagrams'.

#### Method & experiments

The method and the experiments are well described. The weight function optimisation (and its re-framing as a matrix optimisation algorithm) is described in a straightforward and easy-to-understand manner. The authors are to be commended! A few comments/questions:

- Lemma 3.2 could be formulated more generally, if I am not mistaken. The proposed method should yield a valid kernel for any valid kernel function on the points, right?
- How is the training performed exactly? I assume that the hyperparameter optimisation only uses the respective training data set from the current fold? If so, these details should be provided.
- What type of filtration is used for the graph classification experiment? A sublevel/superlevel set filtration, or extended persistence based on the scalar descriptor? I would expect extended persistence to be used here, because it would explain the additional citations.
- The paper discusses additional weight function choices (polynomials of degree $d$), but only Gaussian functions are used for the experiments, unless I am mistaken. Are the other functions good enough? If so, this would be an interesting thing to comment on (and maybe even run an additional experiment to demonstrate it empirically).
- Is subsampling used for the experiments or not? The discussion of the complexity leads me to think that subsamples might be employed, but I was unable to find a proper discussion of this.
- To be fully self-contained, a definition of the Ricci curvature for graphs could be given (maybe in the supplementary materials).
- Heat maps for the neuron trees are shown (in the supp. mat.); what about the graph classification experiments? The performance in the tables leads me to believe that these might be interesting to show.

#### Language & style

The paper is written very well. I have only a few highly personal 'pet peeves' or suggestions:

- If possible, order citations by their number. The citation string in l. 36 looks a little confusing at first.
- Instead of 'We note that [26]', I personally prefer 'Hofer et al. [26]' because it makes the reference easier to find.

Some typos/mistakes:

- l. 28: 'Due to these' --> 'Due to these reasons?'
- l. 170: 'where $\sigma$ is width' --> 'where $\sigma$ is the width'
- l. 204: 'Optimal distance problem' --> 'The optimal distance problem'
- Table 2: 'appraches' --> 'approaches'
- l. 315: 'Variance speaking' --> 'In terms of variance, ...'
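To make the persistence-image vectorisation discussed above concrete, here is a minimal sketch with a *fixed* weight function. This is not the paper's method (the paper's contribution is learning the weight from labels); the grid size, bounds, and the choice $w(p) = p$ are illustrative assumptions only.

```python
import numpy as np

def persistence_image(diagram, grid_size=8, sigma=0.5, bounds=(0.0, 2.0)):
    """Vectorise a persistence diagram into N = grid_size**2 numbers.

    Each point (b, d) is mapped to birth/persistence coordinates (b, d - b),
    smoothed by a Gaussian of width sigma, and weighted by a fixed weight
    w(p) = p. (The paper under review *learns* this weight instead.)
    """
    lo, hi = bounds
    centers = np.linspace(lo, hi, grid_size)
    xx, yy = np.meshgrid(centers, centers)  # pixel centres of the image grid
    img = np.zeros((grid_size, grid_size))
    for b, d in diagram:
        p = d - b                           # persistence of the feature
        img += p * np.exp(-((xx - b) ** 2 + (yy - p) ** 2) / (2 * sigma ** 2))
    return img.ravel()                      # the N numbers forming PI_A

diagram = [(0.1, 0.9), (0.2, 1.5), (0.4, 0.5)]
vec = persistence_image(diagram)
print(vec.shape)  # (64,)
```

Points with low persistence contribute little under this weight, which is exactly why the choice of weight function matters for downstream classification.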

### Reviewer 2

The paper proposes a novel kernel for graphs, which is constructed from topological features, namely from persistence diagrams. The major contribution is to learn a weight function for the topological features from labelled graph data, rather than using pre-set fixed weights. In practice, such a distance metric is learned on the persistence images, i.e. a vectorial representation of the persistence diagram. Using the graph labels, the kernel is then specifically optimized to better distinguish between classes. A matrix formulation to speed up the calculations is proposed, and experiments on graph classification tasks are presented. Both the novel idea and the experimental results support the validity of the proposed method. However, I found the technical structure and notation of the paper to be unclear in the model description; the experiments should be made fair by running the competitor methods with the same setup as the proposed approach. Detailed comments are provided below.

1) Section 2: the notation and background definitions should be presented more clearly.
2) Concepts such as shape or the creation/destruction of topological features could be introduced in a background section.
3) It would be useful to give practical examples of the topological features to consider for the specific graph application. In particular, the paper should better clarify how the proposed approach is extended to the graph setting.
4) It is unclear for which kind of graphs the method is designed. For instance, how are the features generated from node/edge labels/attributes?
5) The matrix representation idea is strongly emphasized to pursue the argument of efficient computation, mentioning that the optimization is faster in the popular programming language. Could the author(s) provide a more technical explanation?
6) It is mentioned that subsampling techniques are used for large datasets. It would be interesting to discuss how this affects the performance and when it is applied.
7) The manuscript claims statistical significance, but this cannot be guaranteed if the methods are not evaluated on the same splits. Taking the results of the other methods from their papers makes the comparison unfair.

#### Update after rebuttal

Thank you for your response and clarifications. I found the experimental setup to be improved, given the new results that have been provided.
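Regarding the efficiency argument behind the matrix formulation (point 5 above), the usual technical explanation is that replacing interpreter-level loops with a single matrix product moves the work into compiled linear-algebra routines. A generic sketch under assumed shapes (the point coordinates, pixel grid, and weights here are random stand-ins, not the paper's data) shows the two formulations agree:

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.random((50, 2))   # birth/persistence pairs of one diagram
pixels = rng.random((64, 2))   # centres of an 8x8 persistence-image grid
weights = rng.random(50)       # per-point weights (learned in the paper)
sigma = 0.5

# Loop formulation: accumulate each point's Gaussian bump pixel by pixel.
img_loop = np.zeros(64)
for w, p in zip(weights, points):
    img_loop += w * np.exp(-((pixels - p) ** 2).sum(axis=1) / (2 * sigma**2))

# Matrix formulation: one (50, 64) squared-distance matrix via broadcasting,
# then a single matrix-vector product -- no Python-level loop over points.
d2 = ((points[:, None, :] - pixels[None, :, :]) ** 2).sum(axis=-1)
img_mat = weights @ np.exp(-d2 / (2 * sigma**2))

assert np.allclose(img_loop, img_mat)
```

The speed-up is therefore a property of vectorised execution in interpreted languages, not of the mathematics itself, which is presumably what the reviewer's request for a "more technical explanation" is after.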