{"title": "Hierarchical Graph Representation Learning with Differentiable Pooling", "book": "Advances in Neural Information Processing Systems", "page_first": 4800, "page_last": 4810, "abstract": "Recently, graph neural networks (GNNs) have revolutionized the field of graph representation learning through effectively learned node embeddings, and achieved state-of-the-art results in tasks such as node classification and link prediction. However, current GNN methods are inherently flat and do not learn hierarchical representations of graphs---a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph. Here we propose DiffPool, a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer. Our experimental results show that combining existing GNN methods with DiffPool yields an average improvement of 5-10% accuracy on graph classification benchmarks, compared to all existing pooling approaches, achieving a new state-of-the-art on four out of five benchmark datasets.", "full_text": "Hierarchical Graph Representation Learning with\n\nDifferentiable Pooling\n\nRex Ying\n\nrexying@stanford.edu\n\nStanford University\n\nJiaxuan You\n\njiaxuan@stanford.edu\n\nStanford University\n\nChristopher Morris\n\nchristopher.morris@udo.edu\n\nTU Dortmund University\n\nXiang Ren\n\nxiangren@usc.edu\n\nUniversity of Southern California\n\nWilliam L. 
Hamilton\n\nwleif@stanford.edu\n\nStanford University\n\nJure Leskovec\n\njure@cs.stanford.edu\n\nStanford University\n\nAbstract\n\nRecently, graph neural networks (GNNs) have revolutionized the \ufb01eld of graph\nrepresentation learning through effectively learned node embeddings, and achieved\nstate-of-the-art results in tasks such as node classi\ufb01cation and link prediction.\nHowever, current GNN methods are inherently \ufb02at and do not learn hierarchical\nrepresentations of graphs\u2014a limitation that is especially problematic for the task\nof graph classi\ufb01cation, where the goal is to predict the label associated with an\nentire graph. Here we propose DIFFPOOL, a differentiable graph pooling module\nthat can generate hierarchical representations of graphs and can be combined with\nvarious graph neural network architectures in an end-to-end fashion. DIFFPOOL\nlearns a differentiable soft cluster assignment for nodes at each layer of a deep\nGNN, mapping nodes to a set of clusters, which then form the coarsened input\nfor the next GNN layer. Our experimental results show that combining existing\nGNN methods with DIFFPOOL yields an average improvement of 5\u201310% accuracy\non graph classi\ufb01cation benchmarks, compared to all existing pooling approaches,\nachieving a new state-of-the-art on four out of \ufb01ve benchmark data sets.\n\n1\n\nIntroduction\n\nIn recent years there has been a surge of interest in developing graph neural networks (GNNs)\u2014\ngeneral deep learning architectures that can operate over graph structured data, such as social network\ndata [16, 21, 36] or graph-based representations of molecules [7, 11, 15]. The general approach with\nGNNs is to view the underlying graph as a computation graph and learn neural network primitives\nthat generate individual node embeddings by passing, transforming, and aggregating node feature\ninformation across the graph [15, 16]. 
The generated node embeddings can then be used as input to\nany differentiable prediction layer, e.g., for node classi\ufb01cation [16] or link prediction [32], and the\nwhole model can be trained in an end-to-end fashion.\nHowever, a major limitation of current GNN architectures is that they are inherently \ufb02at as they\nonly propagate information across the edges of the graph and are unable to infer and aggregate the\ninformation in a hierarchical way. For example, in order to successfully encode the graph structure\nof organic molecules, one would ideally want to encode the local molecular structure (e.g., individual\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: High-level illustration of our proposed method DIFFPOOL. At each hierarchical layer, we\nrun a GNN model to obtain embeddings of nodes. We then use these learned embeddings to cluster\nnodes together and run another GNN layer on this coarsened graph. This whole process is repeated\nfor L layers and we use the \ufb01nal output representation to classify the graph.\n\natoms and their direct bonds) as well as the coarse-grained structure of the molecular graph (e.g.,\ngroups of atoms and bonds representing functional units in a molecule). This lack of hierarchical\nstructure is especially problematic for the task of graph classi\ufb01cation, where the goal is to predict\nthe label associated with an entire graph. When applying GNNs to graph classi\ufb01cation, the standard\napproach is to generate embeddings for all the nodes in the graph and then to globally pool all these\nnode embeddings together, e.g., using a simple summation or neural network that operates over sets\n[7, 11, 15, 25]. 
This global pooling approach ignores any hierarchical structure that might be present in the graph, and it prevents researchers from building effective GNN models for predictive tasks over entire graphs.

Here we propose DIFFPOOL, a differentiable graph pooling module that can be adapted to various graph neural network architectures in a hierarchical and end-to-end fashion (Figure 1). DIFFPOOL allows for developing deeper GNN models that can learn to operate on hierarchical representations of a graph. We develop a graph analogue of the spatial pooling operation in CNNs [23], which allows deep CNN architectures to iteratively operate on coarser and coarser representations of an image. The challenge in the GNN setting—compared to standard CNNs—is that graphs contain no natural notion of spatial locality, i.e., one cannot simply pool together all nodes in an “m × m patch” on a graph, because the complex topological structure of graphs precludes any straightforward, deterministic definition of a “patch”. Moreover, unlike image data, graph data sets often contain graphs with varying numbers of nodes and edges, which makes defining a general graph pooling operator even more challenging.

In order to solve the above challenges, we require a model that learns how to cluster together nodes to build a hierarchical multi-layer scaffold on top of the underlying graph. Our approach DIFFPOOL learns a differentiable soft assignment at each layer of a deep GNN, mapping nodes to a set of clusters based on their learned embeddings. In this framework, we generate deep GNNs by “stacking” GNN layers in a hierarchical fashion (Figure 1): the input nodes at the layer l GNN module correspond to the clusters learned at the layer l − 1 GNN module. Thus, each layer of DIFFPOOL coarsens the input graph more and more, and DIFFPOOL is able to generate a hierarchical representation of any input graph after training. 
We show that DIFFPOOL can be combined with various GNN\napproaches, resulting in an average 7% gain in accuracy and a new state of the art on four out of\n\ufb01ve benchmark graph classi\ufb01cation tasks. Finally, we show that DIFFPOOL can learn interpretable\nhierarchical clusters that correspond to well-de\ufb01ned communities in the input graphs.\n\n2 Related Work\n\nOur work builds upon a rich line of recent research on graph neural networks and graph classi\ufb01cation.\nGeneral graph neural networks. A wide variety of graph neural network (GNN) models have\nbeen proposed in recent years, including methods inspired by convolutional neural networks [5,\n8, 11, 16, 21, 24, 29, 36], recurrent neural networks [25], recursive neural networks [1, 30] and\nloopy belief propagation [7]. Most of these approaches \ufb01t within the framework of \u201cneural message\npassing\u201d proposed by Gilmer et al. [15]. In the message passing framework, a GNN is viewed as a\n\n2\n\nOriginalnetworkPooled networkat level 1Pooled networkat level 2Graph classificationPooled networkat level 3\fmessage passing algorithm where node representations are iteratively computed from the features\nof their neighbor nodes using a differentiable aggregation function. Hamilton et al. [17] provides a\nconceptual review of recent advancements in this area, and Bronstein et al. [4] outlines connections\nto spectral graph convolutions.\nGraph classi\ufb01cation with graph neural networks. GNNs have been applied to a wide variety of\ntasks, including node classi\ufb01cation [16, 21], link prediction [31], graph classi\ufb01cation [7, 11, 40], and\nchemoinformatics [28, 27, 14, 19, 32]. In the context of graph classi\ufb01cation\u2014the task that we study\nhere\u2014a major challenge in applying GNNs is going from node embeddings, which are the output of\nGNNs, to a representation of the entire graph. 
Common approaches to this problem include simply summing up or averaging all the node embeddings in a final layer [11], introducing a “virtual node” that is connected to all the nodes in the graph [25], or aggregating the node embeddings using a deep learning architecture that operates over sets [15]. However, all of these methods have the limitation that they do not learn hierarchical representations (i.e., all the node embeddings are globally pooled together in a single layer), and thus are unable to capture the natural structures of many real-world graphs. Some recent approaches have also proposed applying CNN architectures to the concatenation of all the node embeddings [29, 40], but this requires specifying (or learning) a canonical ordering over nodes, which is in general very difficult and equivalent to solving graph isomorphism.

Lastly, there are some recent works that learn hierarchical graph representations by combining GNNs with deterministic graph clustering algorithms [8, 35, 13], following a two-stage approach. However, unlike these previous approaches, we seek to learn the hierarchical structure in an end-to-end fashion, rather than relying on a deterministic graph clustering subroutine.

3 Proposed Method

The key idea of DIFFPOOL is that it enables the construction of deep, multi-layer GNN models by providing a differentiable module to hierarchically pool graph nodes. In this section, we outline the DIFFPOOL module and show how it is applied in a deep GNN architecture.

3.1 Preliminaries

We represent a graph G as (A, F), where A ∈ {0, 1}^{n×n} is the adjacency matrix, and F ∈ R^{n×d} is the node feature matrix, assuming each node has d features.¹ Given a set of labeled graphs D = {(G_1, y_1), (G_2, y_2), ...}, where y_i ∈ Y is the label corresponding to graph G_i ∈ G, the goal of graph classification is to learn a mapping f : G → Y that maps graphs to the set of labels. 
The challenge—compared to the standard supervised machine learning setup—is that we need a way to extract useful feature vectors from these input graphs. That is, in order to apply standard machine learning methods for classification, e.g., neural networks, we need a procedure to convert each graph to a finite-dimensional vector in R^D.

Graph neural networks. In this work, we build upon graph neural networks in order to learn useful representations for graph classification in an end-to-end fashion. In particular, we consider GNNs that employ the following general “message-passing” architecture:

H^{(k)} = M(A, H^{(k−1)}; θ^{(k)}),     (1)

where H^{(k)} ∈ R^{n×d} are the node embeddings (i.e., “messages”) computed after k steps of the GNN and M is the message propagation function, which depends on the adjacency matrix, trainable parameters θ^{(k)}, and the node embeddings H^{(k−1)} generated from the previous message-passing step.² The input node embeddings H^{(0)} at the initial message-passing iteration (k = 1) are initialized using the node features on the graph, H^{(0)} = F.

There are many possible implementations of the propagation function M [15, 16]. For example, one popular variant of GNNs—the Graph Convolutional Networks (GCNs) of Kipf et al. [21]—implements M using a combination of linear transformations and ReLU non-linearities:

H^{(k)} = M(A, H^{(k−1)}; W^{(k)}) = ReLU(D̃^{−1/2} Ã D̃^{−1/2} H^{(k−1)} W^{(k)}),     (2)

where Ã = A + I, D̃_{ii} = Σ_j Ã_{ij}, and W^{(k)} ∈ R^{d×d} is a trainable weight matrix.

¹We do not consider edge features, although one can easily extend the algorithm to support edge features using techniques introduced in [35].

²For notational convenience, we assume that the embedding dimension is d for all H^{(k)}; however, in general this restriction is not necessary. 
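As a concrete illustration, the GCN propagation rule of Equation (2) can be sketched in a few lines of NumPy. This is a toy stand-in rather than the authors' implementation; the path graph, the features, and the randomly initialized weights are invented purely for the example:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # diagonal of D~^{-1/2}
    A_norm = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)           # ReLU non-linearity

# Toy input: a 4-node path graph with 3-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.default_rng(0).standard_normal((4, 3))
W = np.random.default_rng(1).standard_normal((3, 3))
H_next = gcn_layer(A, H, W)
print(H_next.shape)  # (4, 3): same number of nodes, new embeddings
```

Note that the layer keeps the node dimension fixed; it is the pooling module introduced below, not the GNN itself, that reduces the number of nodes.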
The differentiable pooling model we propose can be applied to any GNN model implementing Equation (1), and is agnostic with regards to the specifics of how M is implemented.

A full GNN module will run K iterations of Equation (1) to generate the final output node embeddings Z = H^{(K)} ∈ R^{n×d}, where K is usually in the range 2-6. For simplicity, in the following sections we will abstract away the internal structure of the GNNs and use Z = GNN(A, X) to denote an arbitrary GNN module implementing K iterations of message passing according to some adjacency matrix A and initial input node features X.

Stacking GNNs and pooling layers. GNNs implementing Equation (1) are inherently flat, as they only propagate information across edges of a graph. The goal of this work is to define a general, end-to-end differentiable strategy that allows one to stack multiple GNN modules in a hierarchical fashion. Formally, given Z = GNN(A, X), the output of a GNN module, and a graph adjacency matrix A ∈ R^{n×n}, we seek to define a strategy to output a new coarsened graph containing m < n nodes, with weighted adjacency matrix A′ ∈ R^{m×m} and node embeddings Z′ ∈ R^{m×d}. This new coarsened graph can then be used as input to another GNN layer, and this whole process can be repeated L times, generating a model with L GNN layers that operate on a series of coarser and coarser versions of the input graph (Figure 1). Thus, our goal is to learn how to cluster or pool together nodes using the output of a GNN, so that we can use this coarsened graph as input to another GNN layer. What makes designing such a pooling layer for GNNs especially challenging—compared to the usual graph coarsening task—is that our goal is not to simply cluster the nodes in one graph, but to provide a general recipe to hierarchically pool nodes across a broad set of input graphs. That is, we need our model to learn a pooling strategy that will generalize across graphs with different numbers of nodes and edges, and that can adapt to the various graph structures during inference.

3.2 Differentiable Pooling via Learned Assignments

Our proposed approach, DIFFPOOL, addresses the above challenges by learning a cluster assignment matrix over the nodes using the output of a GNN model. The key intuition is that we stack L GNN modules and learn to assign nodes to clusters at layer l in an end-to-end fashion, using embeddings generated from a GNN at layer l − 1. Thus, we are using GNNs both to extract node embeddings that are useful for graph classification and to extract node embeddings that are useful for hierarchical pooling. Using this construction, the GNNs in DIFFPOOL learn to encode a general pooling strategy that is useful for a large set of training graphs. We first describe how the DIFFPOOL module pools nodes at each layer given an assignment matrix; following this, we discuss how we generate the assignment matrix using a GNN architecture.

Pooling with an assignment matrix. We denote the learned cluster assignment matrix at layer l as S^{(l)} ∈ R^{n_l × n_{l+1}}. Each row of S^{(l)} corresponds to one of the n_l nodes (or clusters) at layer l, and each column of S^{(l)} corresponds to one of the n_{l+1} clusters at the next layer l + 1. Intuitively, S^{(l)} provides a soft assignment of each node at layer l to a cluster in the next coarsened layer l + 1.

Suppose that S^{(l)} has already been computed, i.e., that we have computed the assignment matrix at the l-th layer of our model. We denote the input adjacency matrix at this layer as A^{(l)} and the input node embedding matrix at this layer as Z^{(l)}. 
Given these inputs, the DIFFPOOL layer (A^{(l+1)}, X^{(l+1)}) = DIFFPOOL(A^{(l)}, Z^{(l)}) coarsens the input graph, generating a new coarsened adjacency matrix A^{(l+1)} and a new matrix of embeddings X^{(l+1)} for each of the nodes/clusters in this coarsened graph. In particular, we apply the following two equations:

X^{(l+1)} = S^{(l)T} Z^{(l)} ∈ R^{n_{l+1} × d},     (3)
A^{(l+1)} = S^{(l)T} A^{(l)} S^{(l)} ∈ R^{n_{l+1} × n_{l+1}}.     (4)

Equation (3) takes the node embeddings Z^{(l)} and aggregates these embeddings according to the cluster assignments S^{(l)}, generating embeddings for each of the n_{l+1} clusters. Similarly, Equation (4) takes the adjacency matrix A^{(l)} and generates a coarsened adjacency matrix denoting the connectivity strength between each pair of clusters.

Through Equations (3) and (4), the DIFFPOOL layer coarsens the graph: the next layer adjacency matrix A^{(l+1)} represents a coarsened graph with n_{l+1} nodes or cluster nodes, where each individual cluster node in the new coarsened graph corresponds to a cluster of nodes in the graph at layer l. Note that A^{(l+1)} is a real matrix and represents a fully connected edge-weighted graph; each entry A^{(l+1)}_{ij} can be viewed as the connectivity strength between cluster i and cluster j. Similarly, the i-th row of X^{(l+1)} corresponds to the embedding of cluster i. Together, the coarsened adjacency matrix A^{(l+1)} and cluster embeddings X^{(l+1)} can be used as input to another GNN layer, a process which we describe in detail below.

Learning the assignment matrix. In the following we describe the architecture of DIFFPOOL, i.e., how DIFFPOOL generates the assignment matrix S^{(l)} and embedding matrices Z^{(l)} that are used in Equations (3) and (4). We generate these two matrices using two separate GNNs that are both applied to the input cluster node features X^{(l)} and coarsened adjacency matrix A^{(l)}. 
The embedding GNN at layer l is a standard GNN module applied to these inputs:

Z^{(l)} = GNN_{l,embed}(A^{(l)}, X^{(l)}),     (5)

i.e., we take the adjacency matrix between the cluster nodes at layer l (from Equation 4) and the pooled features for the clusters (from Equation 3) and pass these matrices through a standard GNN to get new embeddings Z^{(l)} for the cluster nodes. In contrast, the pooling GNN at layer l uses the input cluster features X^{(l)} and cluster adjacency matrix A^{(l)} to generate an assignment matrix:

S^{(l)} = softmax(GNN_{l,pool}(A^{(l)}, X^{(l)})),     (6)

where the softmax function is applied in a row-wise fashion. The output dimension of GNN_{l,pool} corresponds to a pre-defined maximum number of clusters in layer l, and is a hyperparameter of the model.

Note that these two GNNs consume the same input data but have distinct parameterizations and play separate roles: the embedding GNN generates new embeddings for the input nodes at this layer, while the pooling GNN generates a probabilistic assignment of the input nodes to n_{l+1} clusters.

In the base case, the inputs to Equations (5) and (6) at layer l = 0 are simply the input adjacency matrix A and the node features F for the original graph. At the penultimate layer L − 1 of a deep GNN model using DIFFPOOL, we set the assignment matrix S^{(L−1)} to be a vector of 1's, i.e., all nodes at the final layer L are assigned to a single cluster, generating a final embedding vector corresponding to the entire graph. This final output embedding can then be used as feature input to a differentiable classifier (e.g., a softmax layer), and the entire system can be trained end-to-end using stochastic gradient descent.

Permutation invariance. Note that in order to be useful for graph classification, the pooling layer should be invariant under node permutations. 
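To make Equations (3)-(6) and the invariance property concrete, here is a minimal NumPy sketch of one DIFFPOOL layer. The `toy_gnn` function is a hypothetical one-step stand-in for a trained GNN module (a real model stacks several trainable layers), and the graph and weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def row_softmax(M):
    """Row-wise softmax, as in Equation (6)."""
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def toy_gnn(A, X, W):
    """Stand-in for GNN(A, X): one mean-aggregation message-passing step."""
    A_hat = A + np.eye(A.shape[0])
    return (A_hat / A_hat.sum(axis=1, keepdims=True)) @ X @ W

def diffpool_layer(A, X, W_embed, W_pool):
    Z = toy_gnn(A, X, W_embed)               # Equation (5): embedding GNN
    S = row_softmax(toy_gnn(A, X, W_pool))   # Equation (6): pooling GNN + softmax
    X_next = S.T @ Z                         # Equation (3): pooled embeddings
    A_next = S.T @ A @ S                     # Equation (4): coarsened adjacency
    return A_next, X_next

n, d, n_next = 6, 4, 3
A = np.zeros((n, n)); A[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]] = 1; A = A + A.T  # path graph
X = rng.standard_normal((n, d))
W_e, W_p = rng.standard_normal((d, d)), rng.standard_normal((d, n_next))
A2, X2 = diffpool_layer(A, X, W_e, W_p)
print(A2.shape, X2.shape)  # (3, 3) (3, 4): six nodes pooled into three soft clusters

# Numerical check of the invariance discussed above: the toy GNN is permutation
# equivariant, so relabeling the nodes leaves the pooled outputs unchanged.
P = np.eye(n)[rng.permutation(n)]
A2p, X2p = diffpool_layer(P @ A @ P.T, P @ X, W_e, W_p)
assert np.allclose(A2, A2p) and np.allclose(X2, X2p)
```

The final assertion is exactly the mechanism behind Proposition 1 below: since P^T P = I, the permutation cancels inside S^T Z and S^T A S.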
For DIFFPOOL we get the following positive result, which shows that any deep GNN model based on DIFFPOOL is permutation invariant, as long as the component GNNs are permutation invariant.

Proposition 1. Let P ∈ {0, 1}^{n×n} be any permutation matrix; then DIFFPOOL(A, Z) = DIFFPOOL(P A P^T, P Z), as long as GNN(A, X) = GNN(P A P^T, P X) (i.e., as long as the GNN method used is permutation invariant).

Proof. Equations (5) and (6) are permutation invariant by the assumption that the GNN module is permutation invariant. And since any permutation matrix is orthogonal, applying P^T P = I to Equations (3) and (4) finishes the proof.

3.3 Auxiliary Link Prediction Objective and Entropy Regularization

In practice, it can be difficult to train the pooling GNN (Equation 6) using only gradient signal from the graph classification task. Intuitively, we have a non-convex optimization problem and it can be difficult to push the pooling GNN away from spurious local minima early in training. To alleviate this issue, we train the pooling GNN with an auxiliary link prediction objective, which encodes the intuition that nearby nodes should be pooled together. In particular, at each layer l, we minimize L_LP = ||A^{(l)} − S^{(l)} S^{(l)T}||_F, where ||·||_F denotes the Frobenius norm. Note that the adjacency matrix A^{(l)} at deeper layers is a function of lower-level assignment matrices, and changes during training.

Another important characteristic of the pooling GNN (Equation 6) is that the output cluster assignment for each node should generally be close to a one-hot vector, so that the membership for each cluster or subgraph is clearly defined. We therefore regularize the entropy of the cluster assignment by minimizing L_E = (1/n) Σ_{i=1}^{n} H(S_i), where H denotes the entropy function and S_i is the i-th row of S. During training, L_LP and L_E from each layer are added to the classification loss. 
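The two auxiliary terms can be sketched directly from their definitions. The following NumPy example is illustrative only (the two-triangle graph and the assignment matrices are invented); it checks that L_LP prefers assignments that pool connected nodes together and that L_E prefers near-one-hot rows:

```python
import numpy as np

def link_pred_loss(A, S):
    """Auxiliary objective L_LP = ||A - S S^T||_F: encourages nearby
    (connected) nodes to receive similar cluster assignments."""
    return np.linalg.norm(A - S @ S.T, ord="fro")

def entropy_reg(S, eps=1e-12):
    """L_E = (1/n) * sum_i H(S_i): mean entropy of the per-node assignment
    rows, which is low when each row is close to one-hot."""
    return float(np.mean(-np.sum(S * np.log(S + eps), axis=1)))

def two_triangles():
    """Two disconnected triangles: nodes {0, 1, 2} and {3, 4, 5}."""
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
        A[i, j] = A[j, i] = 1.0
    return A

A = two_triangles()
S_good = np.repeat(np.eye(2), 3, axis=0)  # each triangle gets its own cluster
S_bad = np.tile(np.eye(2), (3, 1))        # clusters mix the two triangles
S_soft = np.full((6, 2), 0.5)             # maximally uncertain assignments

print(link_pred_loss(A, S_good) < link_pred_loss(A, S_bad))  # True
print(entropy_reg(S_soft) > entropy_reg(S_good))             # True
```

In a full training loop these two scalars would simply be weighted and added to the cross-entropy classification loss, as described above.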
In practice we observe that training with the side objective takes longer to converge, but nevertheless achieves better performance and more interpretable cluster assignments.

4 Experiments

We evaluate the benefits of DIFFPOOL against a number of state-of-the-art graph classification approaches, with the goal of answering the following questions:

Q1 How does DIFFPOOL compare to other pooling methods proposed for GNNs (e.g., using sort pooling [40] or the SET2SET method [15])?

Q2 How does DIFFPOOL combined with GNNs compare to the state of the art for the graph classification task, including both GNNs and kernel-based methods?

Q3 Does DIFFPOOL compute meaningful and interpretable clusters on the input graphs?

Data sets. To probe the ability of DIFFPOOL to learn complex hierarchical structures from graphs in different domains, we evaluate on a variety of relatively large graph data sets chosen from benchmarks commonly used in graph classification [20]. We use protein data sets including ENZYMES, PROTEINS [3, 12], and D&D [10], the social network data set REDDIT-MULTI-12K [39], and the scientific collaboration data set COLLAB [39]. See Appendix A for statistics and properties. For all these data sets, we perform 10-fold cross-validation to evaluate model performance, and report the accuracy averaged over 10 folds.

Model configurations. In our experiments, the GNN model used for DIFFPOOL is built on top of the GRAPHSAGE architecture, as we found this architecture to have superior performance compared to the standard GCN approach as introduced in [21]. We use the “mean” variant of GRAPHSAGE [16] and apply a DIFFPOOL layer after every two GRAPHSAGE layers in our architecture. A total of 2 DIFFPOOL layers are used for the data sets. For small data sets such as ENZYMES, PROTEINS and COLLAB, 1 DIFFPOOL layer can achieve similar performance. 
After each DIFFPOOL layer, 3 layers of graph convolutions are performed, before the next DIFFPOOL layer or the readout layer. The embedding matrix and the assignment matrix are computed by two separate GRAPHSAGE models, respectively. In the 2 DIFFPOOL layer architecture, the number of clusters is set as 25% of the number of nodes before applying DIFFPOOL, while in the 1 DIFFPOOL layer architecture, the number of clusters is set as 10%. Batch normalization [18] is applied after every layer of GRAPHSAGE. We also found that adding an l2 normalization to the node embeddings at each layer made the training more stable. In Section 4.2, we also test an analogous variant of DIFFPOOL on the STRUCTURE2VEC [7] architecture, in order to demonstrate how DIFFPOOL can be applied on top of other GNN models. All models are trained for 3,000 epochs, with early stopping applied when the validation loss stops decreasing. We also evaluate two simplified versions of DIFFPOOL:

• DIFFPOOL-DET is a DIFFPOOL model where assignment matrices are generated using a deterministic graph clustering algorithm [9].
• DIFFPOOL-NOLP is a variant of DIFFPOOL where the link prediction side objective is turned off.

4.1 Baseline Methods

In the performance comparison on graph classification, we consider baselines based upon GNNs (combined with different pooling methods) as well as state-of-the-art kernel-based approaches.

GNN-based methods.
• GRAPHSAGE with global mean-pooling [16]. Other GNN variants such as those proposed in [21] are omitted, as empirically GraphSAGE obtained higher performance in the task.
• STRUCTURE2VEC (S2V) [7] is a state-of-the-art graph representation learning algorithm that combines a latent variable model with GNNs. It uses global mean pooling.
• Edge-conditioned filters in CNN for graphs (ECC) [35] incorporates edge information into the GCN model and performs pooling using a graph coarsening algorithm.
• PATCHYSAN [29] defines a receptive field (neighborhood) for each node and, using a canonical node ordering, applies convolutions on linear sequences of node embeddings.
• SET2SET replaces the global mean-pooling in the traditional GNN architectures by the aggregation used in SET2SET [38]. Set2Set aggregation has been shown to perform better than mean pooling in previous work [15]. We use GRAPHSAGE as the base GNN model.
• SORTPOOL [40] applies a GNN architecture and then performs a single layer of soft pooling followed by 1D convolution on sorted node embeddings.

Table 1: Classification accuracies in percent. The far-right column gives the relative increase in accuracy compared to the baseline GRAPHSAGE approach.

           Method           ENZYMES    D&D    REDDIT-MULTI-12K   COLLAB   PROTEINS   Gain
  Kernel   GRAPHLET          41.03    74.85        21.73          64.66     72.91
           SHORTEST-PATH     42.32    78.86        36.93          59.10     76.43
           1-WL              53.43    74.02        39.03          78.61     73.76
           WL-OA             60.13    79.04        44.38          80.74     75.26
  GNN      PATCHYSAN           –      76.27        41.32          72.60     75.00     4.17
           GRAPHSAGE         54.25    75.42        42.24          68.25     70.48      –
           ECC               53.50    74.10        41.73          67.79     72.65     0.11
           SET2SET           60.15    78.12        43.49          71.75     74.29     3.32
           SORTPOOL          57.12    79.37        41.82          73.76     75.54     3.39
           DIFFPOOL-DET      58.33    75.47        46.18          82.13     75.62     5.42
           DIFFPOOL-NOLP     61.95    79.98        46.65          75.58     76.22     5.95
           DIFFPOOL          62.53    80.64        47.08          75.48     76.25     6.27

For all the GNN baselines, we use 10-fold cross-validation numbers reported by the original authors when possible. For the GRAPHSAGE and SET2SET baselines, we use the base implementation and hyperparameter sweeps as in our DIFFPOOL approach. 
When baseline approaches did not have the necessary published numbers, we contacted the original authors and used their code (if available) to run the model, performing a hyperparameter search based on the original authors' guidelines.

Kernel-based algorithms. We use the GRAPHLET [34], SHORTEST-PATH [2], WEISFEILER-LEHMAN (WL) [33], and WEISFEILER-LEHMAN OPTIMAL ASSIGNMENT (WL-OA) [22] kernels as baselines. For each kernel, we computed the normalized gram matrix. We computed the classification accuracies using the C-SVM implementation of LIBSVM [6], using 10-fold cross-validation. The C parameter was selected from {10^{−3}, 10^{−2}, ..., 10^{2}, 10^{3}} by 10-fold cross-validation on the training folds. Moreover, for WL and WL-OA we additionally selected the number of iterations from {0, ..., 5}.

4.2 Results for Graph Classification

Table 1 compares the performance of DIFFPOOL to these state-of-the-art graph classification baselines. These results provide positive answers to our motivating questions Q1 and Q2: we observe that our DIFFPOOL approach obtains the highest average performance among all pooling approaches for GNNs, improves upon the base GRAPHSAGE architecture by an average of 6.27%, and achieves state-of-the-art results on 4 out of 5 benchmarks. Interestingly, our simplified model variant, DIFFPOOL-DET, achieves state-of-the-art performance on the COLLAB benchmark. This is because many collaboration graphs in COLLAB show only single-layer community structures, which can be captured well with a pre-computed graph clustering algorithm [9]. One observation is that, despite the significant performance improvement, DIFFPOOL can be unstable to train, and there is significant variation in accuracy across different runs, even with the same hyperparameter setting. 
It is observed that adding the link prediction objective makes training more stable and reduces the standard deviation of accuracy across different runs.

Differentiable pooling on STRUCTURE2VEC. DIFFPOOL can be applied to other GNN architectures besides GRAPHSAGE to capture hierarchical structure in the graph data. To further support answering Q1, we also applied DIFFPOOL on STRUCTURE2VEC (S2V). We ran experiments using S2V with a three-layer architecture, as reported in [7]. In the first variant, one DIFFPOOL layer is applied after the first layer of S2V, and two more S2V layers are stacked on top of the output of DIFFPOOL. The second variant applies one DIFFPOOL layer after the first and second layers of S2V, respectively. In both variants, the S2V model is used to compute the embedding matrix, while the GRAPHSAGE model is used to compute the assignment matrix.

Table 2: Accuracy results of applying DIFFPOOL to S2V.

  Data Set    S2V     S2V with 1 DIFFPOOL    S2V with 2 DIFFPOOL
  ENZYMES    61.10           62.86                  63.33
  D&D        78.92           80.75                  82.07

The results in terms of classification accuracy are summarized in Table 2. We observe that DIFFPOOL significantly improves the performance of S2V on both the ENZYMES and D&D data sets. Similar performance trends are also observed on other data sets. The results demonstrate that DIFFPOOL is a general strategy for pooling over hierarchical structure that can benefit different GNN architectures.

Running time. Although applying DIFFPOOL requires additional computation of an assignment matrix, we observed that DIFFPOOL did not incur substantial additional running time in practice. This is because each DIFFPOOL layer reduces the size of graphs by extracting a coarser representation of the graph, which speeds up the graph convolution operation in the next layer. 
Concretely, we found that GRAPHSAGE with DIFFPOOL was 12× faster than the GRAPHSAGE model with SET2SET pooling, while still achieving significantly higher accuracy on all benchmarks.

4.3 Analysis of Cluster Assignment in DIFFPOOL

Hierarchical cluster structure. To address Q3, we investigated the extent to which DIFFPOOL learns meaningful node clusters by visualizing the cluster assignments at different layers. Figure 2 shows such a visualization of node assignments in the first and second layers on a graph from the COLLAB data set, where node color indicates cluster membership. A node's cluster membership is determined by taking the argmax of its cluster assignment probabilities. We observe that even when learning cluster assignments based solely on the graph classification objective, DIFFPOOL can still capture hierarchical community structure. We also observe significant improvement in membership assignment quality with the auxiliary link prediction objective.

Dense vs. sparse subgraph structure. In addition, we observe that DIFFPOOL learns to collapse nodes into soft clusters in a non-uniform way, with a tendency to collapse densely connected subgraphs into clusters. Since GNNs can efficiently perform message passing on dense, clique-like subgraphs (due to their small diameters) [26], pooling together the nodes of such a dense subgraph is unlikely to lose structural information. This intuitively explains why collapsing dense subgraphs is a useful pooling strategy for DIFFPOOL. In contrast, sparse subgraphs may contain many interesting structures, including path-, cycle-, and tree-like structures, and given the high diameter induced by sparsity, GNN message passing may fail to capture these structures.
Thus, by separately pooling distinct parts of a sparse subgraph, DIFFPOOL can learn to capture the meaningful structures present in sparse graph regions (e.g., as in Figure 2).

Assignment for nodes with similar representations. Since the assignment network computes the soft cluster assignment based on the features of input nodes and their neighbors, nodes with both similar input features and similar neighborhood structure will have similar cluster assignments. In fact, one can construct synthetic cases where two nodes, although far apart, have exactly the same neighborhood structure and the same features for themselves and all their neighbors. In this case the pooling network is forced to assign them to the same cluster, which differs from the concept of pooling in other architectures such as image ConvNets. In some cases we do observe that disconnected nodes are pooled together. In practice we rely on an identifiability assumption similar to Theorem 1 in GraphSAGE [16], where nodes are identifiable via their features; this holds in many real data sets.³ The auxiliary link prediction objective is also observed to help discourage nodes that are far apart from being pooled together. Furthermore, it is possible to use more sophisticated GNN aggregation functions, such as high-order moments [37], to distinguish nodes that are similar in structure and feature space; the overall framework remains unchanged.

³However, some molecular graph data sets in chemistry contain many nodes that are structurally similar, and the assignment network is observed to pool together nodes that are far apart.

Figure 2: Visualization of hierarchical cluster assignment in DIFFPOOL, using example graphs from COLLAB. The left figure (a) shows hierarchical clustering over two layers, where nodes in the second layer correspond to clusters in the first layer. (Colors are used to connect the nodes/clusters across the layers, and dotted lines indicate clusters.) The right two plots (b and c) show two more example first-layer clusters in different graphs. Note that although we globally set the number of clusters to 25% of the nodes, the assignment GNN automatically learns the appropriate number of meaningful clusters to assign for these different graphs.

Sensitivity of the Pre-defined Maximum Number of Clusters. We found that the assignment varies according to the depth of the network and C, the maximum number of clusters. With larger C, the pooling GNN can model more complex hierarchical structure. The trade-off is that a very large C results in more noise and less efficiency. Although the value of C is a pre-defined parameter, the pooling network learns to use an appropriate number of clusters through end-to-end training; in particular, some clusters might not be used by the assignment matrix, and the column corresponding to an unused cluster has low values for all nodes. This is observed in Figure 2(c), where nodes are assigned predominantly to 3 clusters.

5 Conclusion

We introduced a differentiable pooling method for GNNs that is able to extract the complex hierarchical structure of real-world graphs. By using the proposed pooling layer in conjunction with existing GNN models, we achieved new state-of-the-art results on several graph classification benchmarks. Interesting future directions include learning hard cluster assignments to further reduce the computational cost in higher layers while ensuring differentiability, and applying the hierarchical pooling method to other downstream tasks that require modeling the entire graph structure.

Acknowledgement

This research has been supported in part by DARPA SIMPLEX, the Stanford Data Science Initiative, Huawei, JD, and the Chan Zuckerberg Biohub.
Christopher Morris is funded by the German Science Foundation (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Data Analysis", project A6 "Resource-efficient Graph Mining". The authors also thank Marinka Zitnik for help in visualizing the high-level illustration of the proposed methods.

References

[1] M. Bianchini, M. Gori, and F. Scarselli. Processing directed acyclic graphs with recursive neural networks. IEEE Transactions on Neural Networks, 12(6):1464–1470, 2001.

[2] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In IEEE International Conference on Data Mining, pages 74–81, 2005.

[3] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Supplement 1):i47–i56, 2005.

[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representations, 2014.

[6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.

[8] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[9] I. S.
Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944–1957, 2007.

[10] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.

[11] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[12] A. Feragen, N. Kasenburg, J. Petersen, M. D. Bruijne, and K. M. Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, pages 216–224, 2013. Erratum available at http://image.diku.dk/aasa/papers/graphkernels_nips_erratum.pdf.

[13] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[14] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pages 6533–6542, 2017.

[15] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272, 2017.

[16] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.

[17] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 40(3):52–74, 2017.

[18] S. Ioffe and C. Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[19] W. Jin, C. W. Coley, R. Barzilay, and T. S. Jaakkola. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems, pages 2604–2613, 2017.

[20] K. Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann. Benchmark data sets for graph kernels, 2016.

[21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.

[22] N. M. Kriege, P.-L. Giscard, and R. Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.

[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[24] T. Lei, W. Jin, R. Barzilay, and T. S. Jaakkola. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning, pages 2024–2033, 2017.

[25] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations, 2016.

[26] R. Liao, M. Brockschmidt, D. Tarlow, A. L. Gaunt, R. Urtasun, and R. Zemel. Graph partition neural networks for semi-supervised classification. In International Conference on Learning Representations (Workshop Track), 2018.

[27] A. Lusci, G. Pollastri, and P. Baldi. Deep architectures and deep learning in chemoinformatics: The prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling, 53(7):1563–1575, 2013.

[28] C. Merkwirth and T. Lengauer.
Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159–1168, 2005.

[29] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.

[30] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[31] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In Extended Semantic Web Conference, 2018.

[32] K. Schütt, P. J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K. R. Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pages 992–1002, 2017.

[33] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.

[34] N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. In International Conference on Artificial Intelligence and Statistics, pages 488–495, 2009.

[35] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 29–38, 2017.

[36] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

[37] S. Verma and Z.-L. Zhang. Graph capsule convolutional neural networks. arXiv preprint arXiv:1805.08090, 2018.

[38] O. Vinyals, S. Bengio, and M.
Kudlur. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations, 2015.

[39] P. Yanardag and S. V. N. Vishwanathan. A structural smoothing framework for robust graph comparison. In Advances in Neural Information Processing Systems, pages 2134–2142, 2015.

[40] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, 2018.