{"title": "Provably Powerful Graph Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2156, "page_last": 2167, "abstract": "Recently, the Weisfeiler-Lehman (WL) graph isomorphism test was used to measure the expressive power of graph neural networks (GNN). It was shown that the popular message passing GNN cannot distinguish between graphs that are indistinguishable by the $1$-WL test \\citep{morris2019,xu2019}. Unfortunately, many simple instances of graphs are indistinguishable by the $1$-WL test. \n\nIn search for more expressive graph learning models we build upon the recent $k$-order invariant and equivariant graph neural networks \\citep{maron2019} and present two results: \n\nFirst, we show that such $k$-order networks can distinguish between non-isomorphic graphs as good as the $k$-WL tests, which are provably stronger than the $1$-WL test for $k>2$. This makes these models strictly stronger than message passing models. Unfortunately, the higher expressiveness of these models comes with a computational cost of processing high order tensors. \n\nSecond, setting our goal at building a provably stronger, \\emph{simple} and \\emph{scalable} model we show that a reduced $2$-order network containing just scaled identity operator, augmented with a single quadratic operation (matrix multiplication) has a provable $3$-WL expressive power. Differently put, we suggest a simple model that interleaves applications of standard Multilayer-Perceptron (MLP) applied to the feature dimension and matrix multiplication. \n\nWe validate this model by presenting state of the art results on popular graph classification and regression tasks. 
To the best of our knowledge, this is the first practical invariant/equivariant model with guaranteed $3$-WL expressiveness, strictly stronger than message passing models.", "full_text": "Provably Powerful Graph Networks\n\nHaggai Maron\u2217 Heli Ben-Hamu\u2217 Hadar Serviansky\u2217 Yaron Lipman\n\nWeizmann Institute of Science\n\nRehovot, Israel\n\nAbstract\n\nRecently, the Weisfeiler-Lehman (WL) graph isomorphism test was used to mea-\nsure the expressive power of graph neural networks (GNN). It was shown that the\npopular message passing GNN cannot distinguish between graphs that are indistin-\nguishable by the 1-WL test (Morris et al., 2018; Xu et al., 2019). Unfortunately,\nmany simple instances of graphs are indistinguishable by the 1-WL test.\nIn search for more expressive graph learning models we build upon the recent\nk-order invariant and equivariant graph neural networks (Maron et al., 2019a,b)\nand present two results:\nFirst, we show that such k-order networks can distinguish between non-isomorphic\ngraphs as good as the k-WL tests, which are provably stronger than the 1-WL\ntest for k > 2. This makes these models strictly stronger than message passing\nmodels. Unfortunately, the higher expressiveness of these models comes with a\ncomputational cost of processing high order tensors.\nSecond, setting our goal at building a provably stronger, simple and scalable\nmodel we show that a reduced 2-order network containing just scaled identity\noperator, augmented with a single quadratic operation (matrix multiplication) has a\nprovable 3-WL expressive power. Differently put, we suggest a simple model that\ninterleaves applications of standard Multilayer-Perceptron (MLP) applied to the\nfeature dimension and matrix multiplication. We validate this model by presenting\nstate of the art results on popular graph classi\ufb01cation and regression tasks. 
To the\nbest of our knowledge, this is the \ufb01rst practical invariant/equivariant model with\nguaranteed 3-WL expressiveness, strictly stronger than message passing models.\n\n1\n\nIntroduction\n\nGraphs are an important data modality which is frequently used in many \ufb01elds of science and\nengineering. Among other things, graphs are used to model social networks, chemical compounds,\nbiological structures and high-level image content information. One of the major tasks in graph\ndata analysis is learning from graph data. As classical approaches often use hand-crafted graph\nfeatures that are not necessarily suitable to all datasets and/or tasks (e.g., Kriege et al. (2019)), a\nsigni\ufb01cant research effort in recent years is to develop deep models that are able to learn new graph\nrepresentations from raw features (e.g., Gori et al. (2005); Duvenaud et al. (2015); Niepert et al.\n(2016); Kipf and Welling (2016); Veli\u02c7ckovi\u00b4c et al. (2017); Monti et al. (2017); Hamilton et al. (2017a);\nMorris et al. (2018); Xu et al. (2019)).\nCurrently, the most popular methods for deep learning on graphs are message passing neural networks\nin which the node features are propagated through the graph according to its connectivity structure\n(Gilmer et al., 2017). In a successful attempt to quantify the expressive power of message passing\nmodels, Morris et al. (2018); Xu et al. (2019) suggest to compare the model\u2019s ability to distinguish\nbetween two given graphs to that of the hierarchy of the Weisfeiler-Lehman (WL) graph isomorphism\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\ftests (Grohe, 2017; Babai, 2016). Remarkably, they show that the class of message passing models\nhas limited expressiveness and is not better than the \ufb01rst WL test (1-WL, a.k.a. color re\ufb01nement). 
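The 1-WL limitation is easy to reproduce in a few lines. Below is a minimal sketch of color refinement (illustrative code, not from the paper), run on a standard pair of non-isomorphic graphs that 1-WL cannot separate: the 6-cycle versus two disjoint triangles.

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """Color refinement (1-WL): refine each vertex color by its own color
    and the multiset of its neighbors' colors; nested tuples serve as colors."""
    colors = [0] * len(adj)
    for _ in range(rounds):
        colors = [(colors[i], tuple(sorted(colors[j] for j in nbrs)))
                  for i, nbrs in enumerate(adj)]
    return Counter(colors)  # histogram of final colors

hexagon   = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]   # one 6-cycle
triangles = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]] # two 3-cycles

# Non-isomorphic, yet both are 2-regular, so 1-WL never splits the colors.
print(wl_colors(hexagon) == wl_colors(triangles))  # True
```

By contrast, 1-WL does separate the 6-cycle from the 6-path, since vertex degrees already differ; pairs like the one above are exactly the failure cases the paper targets.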
For example, Figure 1 depicts two graphs (in blue and in green) that 1-WL cannot distinguish, and that are hence indistinguishable by any message passing algorithm.\nThe goal of this work is to explore and develop GNN models that possess higher expressiveness while maintaining scalability, as much as possible. We present two main contributions. First, establishing a baseline for expressive GNNs, we prove that the recent k-order invariant GNNs (Maron et al., 2019a,b) offer a natural hierarchy of models that are as expressive as the k-WL tests, for $k \geq 2$. Second, as k-order GNNs are not practical for $k > 2$, we develop a simple, novel GNN model that incorporates standard MLPs applied to the feature dimension and a matrix multiplication layer. This model, working only with k = 2 tensors (the same order as the graph input data), possesses the expressiveness of 3-WL. Since, in the WL hierarchy, 1-WL and 2-WL are equivalent while 3-WL is strictly stronger, this model is provably more powerful than message passing models. For example, it can distinguish the two graphs in Figure 1. As far as we know, this model is the first to offer both expressiveness (3-WL) and scalability (k = 2).\n\nFigure 1: Two graphs not distinguished by 1-WL.\n\nThe main challenge in achieving high-order WL expressiveness with GNN models stems from the difficulty of representing the multisets of neighborhoods required for the WL algorithms. We advocate a novel representation of multisets based on Power-sum Multi-symmetric Polynomials (PMP), which are a generalization of the well-known elementary symmetric polynomials. This representation provides a convenient theoretical tool for analyzing models' ability to implement the WL tests.\nA work related to ours, which also builds graph learning methods that surpass the 1-WL expressiveness offered by message passing, is Morris et al. (2018). 
They develop powerful deep models\ngeneralizing message passing to higher orders that are as expressive as higher order WL tests. Al-\nthough making progress, their full model is still computationally prohibitive for 3-WL expressiveness\nand requires a relaxed local version compromising some of the theoretical guarantees.\nExperimenting with our model on several real-world datasets that include classi\ufb01cation and regression\ntasks on social networks, molecules, and chemical compounds, we found it to be on par or better than\nstate of the art.\n\n2 Previous work\n\nDeep learning on graph data. The pioneering works that applied neural networks to graphs are\nGori et al. (2005); Scarselli et al. (2009) that learn node representations using recurrent neural\nnetworks, which were also used in Li et al. (2015). Following the success of convolutional neural\nnetworks (Krizhevsky et al., 2012), many works have tried to generalize the notion of convolution\nto graphs and build networks that are based on this operation. Bruna et al. (2013) de\ufb01ned graph\nconvolutions as operators that are diagonal in the graph laplacian eigenbasis. This paper resulted\nin multiple follow up works with more ef\ufb01cient and spatially localized convolutions (Henaff et al.,\n2015; Defferrard et al., 2016; Kipf and Welling, 2016; Levie et al., 2017). Other works de\ufb01ne graph\nconvolutions as local stationary functions that are applied to each node and its neighbours (e.g.,\nDuvenaud et al. (2015); Atwood and Towsley (2016); Niepert et al. (2016); Hamilton et al. (2017b);\nVeli\u02c7ckovi\u00b4c et al. (2017); Monti et al. (2018)). Many of these works were shown to be instances of\nthe family of message passing neural networks (Gilmer et al., 2017): methods that apply parametric\nfunctions to a node and its neighborhood and then apply some pooling operation in order to generate\na new feature for each node. 
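As a concrete reference point, a single layer of this message passing family can be sketched as follows (numpy; sum-pooling and ReLU are one common choice, and all names here are illustrative rather than taken from any particular paper):

```python
import numpy as np

def message_passing_layer(A, H, W_self, W_neigh):
    """One message passing step: each node combines its own features with a
    sum-pooled aggregation of its neighbors' features, via shared weights."""
    M = A @ H                                       # (n, d): summed neighbor messages
    return np.maximum(H @ W_self + M @ W_neigh, 0)  # shared update + ReLU

# Permutation equivariance: relabeling the nodes permutes the output rows.
rng = np.random.default_rng(0)
n, d = 5, 4
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = A + A.T
H = rng.standard_normal((n, d))
Ws, Wn = rng.standard_normal((d, d)), rng.standard_normal((d, d))
P = np.eye(n)[rng.permutation(n)]                  # permutation matrix
out1 = message_passing_layer(P @ A @ P.T, P @ H, Ws, Wn)
out2 = P @ message_passing_layer(A, H, Ws, Wn)
print(np.allclose(out1, out2))  # True
```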
In a recent line of work, it was suggested to de\ufb01ne graph neural networks\nusing permutation equivariant operators on tensors describing k-order relations between the nodes.\nKondor et al. (2018) identi\ufb01ed several such linear and quadratic equivariant operators and showed\nthat the resulting network can achieve excellent results on popular graph learning benchmarks. Maron\net al. (2019a) provided a full characterization of linear equivariant operators between tensors of\narbitrary order. In both cases, the resulting networks were shown to be at least as powerful as message\npassing neural networks. In another line of work, Murphy et al. (2019) suggest expressive invariant\ngraph models de\ufb01ned using averaging over all permutations of an arbitrary base neural network.\n\n2\n\n\fWeisfeiler Lehman graph isomorphism test. The Weisfeiler Lehman tests is a hierarchy of\nincreasingly powerful graph isomorphism tests (Grohe, 2017). The WL tests have found many\napplications in machine learning: in addition to Xu et al. (2019); Morris et al. (2018), this idea was\nused in Shervashidze et al. (2011) to construct a graph kernel method, which was further generalized\nto higher order WL tests in Morris et al. (2017). Lei et al. (2017) showed that their suggested GNN\nhas a theoretical connection to the WL test. WL tests were also used in Zhang and Chen (2017)\nfor link prediction tasks. In a concurrent work, Morris and Mutzel (2019) suggest constructing\ngraph features based on an equivalent sparse version of high-order WL achieving great speedup and\nexpressiveness guarantees for sparsely connected graphs.\n\n3 Preliminaries\nWe denote a set by {a, b, . . . , c}, an ordered set (tuple) by (a, b, . . . , c) and a multiset (i.e., a set with\npossibly repeating elements) by {{a, b, . . . , c}}. We denote [n] = {1, 2, . . . , n}, and (ai | i \u2208 [n]) =\n(a1, a2, . . . , an). Let Sn denote the permutation group on n elements. 
We use multi-index $\mathbf{i} \in [n]^k$ to denote a k-tuple of indices, $\mathbf{i} = (i_1, i_2, \ldots, i_k)$. $g \in S_n$ acts on multi-indices $\mathbf{i} \in [n]^k$ entrywise by $g(\mathbf{i}) = (g(i_1), g(i_2), \ldots, g(i_k))$. $S_n$ acts on k-tensors $X \in \mathbb{R}^{n^k \times a}$ by $(g \cdot X)_{\mathbf{i},j} = X_{g^{-1}(\mathbf{i}),j}$, where $\mathbf{i} \in [n]^k$, $j \in [a]$.\n\n3.1 k-order graph networks\n\nMaron et al. (2019a) have suggested a family of permutation-invariant deep neural network models for graphs. Their main idea is to construct networks by concatenating maximally expressive linear equivariant layers. More formally, a k-order invariant graph network is a composition $F = m \circ h \circ L_d \circ \sigma \circ \cdots \circ \sigma \circ L_1$, where $L_i : \mathbb{R}^{n^{k_i} \times a_i} \to \mathbb{R}^{n^{k_{i+1}} \times a_{i+1}}$, $\max_{i \in [d+1]} k_i = k$, are equivariant linear layers, namely satisfy\n\n$L_i(g \cdot X) = g \cdot L_i(X), \quad \forall g \in S_n, \; \forall X \in \mathbb{R}^{n^{k_i} \times a_i},$\n\n$\sigma$ is an entrywise non-linear activation, $\sigma(X)_{\mathbf{i},j} = \sigma(X_{\mathbf{i},j})$, $h : \mathbb{R}^{n^{k_{d+1}} \times a_{d+1}} \to \mathbb{R}^{a_{d+2}}$ is an invariant linear layer, namely satisfies\n\n$h(g \cdot X) = h(X), \quad \forall g \in S_n, \; \forall X \in \mathbb{R}^{n^{k_{d+1}} \times a_{d+1}},$\n\nand m is a Multilayer Perceptron (MLP). The invariance of F is achieved by construction (by propagating g through the layers using the definitions of equivariance and invariance):\n\n$F(g \cdot X) = m(\cdots(L_1(g \cdot X))\cdots) = m(\cdots(g \cdot L_1(X))\cdots) = \cdots = m(h(g \cdot L_d(\cdots))) = F(X).$\n\nWhen k = 2, Maron et al. (2019a) proved that this construction gives rise to a model that can approximate any message passing neural network (Gilmer et al., 2017) to an arbitrary precision; Maron et al. 
(2019b) proved these models are universal for a very high tensor order of $k = \mathrm{poly}(n)$, which is of little practical value (an alternative proof was recently suggested in Keriven and Peyré (2019)).\n\n3.2 The Weisfeiler-Lehman graph isomorphism test\n\nLet $G = (V, E, d)$ be a colored graph, where $|V| = n$ and $d : V \to \Sigma$ defines the color attached to each vertex in V; $\Sigma$ is a set of colors. The Weisfeiler-Lehman (WL) test is a family of algorithms used to test graph isomorphism. Two graphs $G, G'$ are called isomorphic if there exists an edge- and color-preserving bijection $\phi : V \to V'$.\nThere are two families of WL algorithms: k-WL and k-FWL (Folklore WL), both parameterized by $k = 1, 2, \ldots, n$. k-WL and k-FWL both construct a coloring of k-tuples of vertices, that is, $c : V^k \to \Sigma$. Testing isomorphism of two graphs $G, G'$ is then performed by comparing the histograms of colors produced by the k-WL (or k-FWL) algorithms.\nWe represent a coloring of k-tuples using a tensor $C \in \Sigma^{n^k}$, where $C_{\mathbf{i}} \in \Sigma$, $\mathbf{i} \in [n]^k$, denotes the color of the k-tuple $v_{\mathbf{i}} = (v_{i_1}, \ldots, v_{i_k}) \in V^k$. In both algorithms, the initial coloring $C^0$ is defined using the isomorphism type of each k-tuple. That is, two k-tuples $\mathbf{i}, \mathbf{i}'$ have the same isomorphism type (i.e., get the same color, $C_{\mathbf{i}} = C_{\mathbf{i}'}$) if for all $q, r \in [k]$: (i) $v_{i_q} = v_{i_r} \iff v_{i'_q} = v_{i'_r}$; (ii) $d(v_{i_q}) = d(v_{i'_q})$; and (iii) $(v_{i_r}, v_{i_q}) \in E \iff (v_{i'_r}, v_{i'_q}) \in E$. Clearly, if $G, G'$ are two isomorphic graphs then there exists $g \in S_n$ so that $g \cdot C'^0 = C^0$.\nIn the next steps, the algorithms refine the colorings $C^l$, $l = 1, 2, \ldots$, until the coloring does not change further, that is, until the subsets of k-tuples with the same color do not get split further into different color groups. 
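For k = 2, the isomorphism type of a pair can be read off directly from the vertex colors, the equality pattern, and the (two-directional) edge pattern. A small sketch of this initial coloring (illustrative code, not from the paper):

```python
from collections import Counter

def initial_coloring_k2(A, d):
    """Initial 2-WL coloring: the color of a 2-tuple (i, j) is its isomorphism
    type, determined by vertex colors, equality of i and j, and adjacency."""
    n = len(A)
    return {(i, j): (d[i], d[j], i == j, A[i][j], A[j][i])
            for i in range(n) for j in range(n)}

# A colored triangle and any relabeling of it yield the same histogram of types.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
d = ['red', 'red', 'blue']
perm = [2, 0, 1]  # relabeling
A2 = [[A[perm[i]][perm[j]] for j in range(3)] for i in range(3)]
d2 = [d[perm[i]] for i in range(3)]
print(Counter(initial_coloring_k2(A, d).values()) ==
      Counter(initial_coloring_k2(A2, d2).values()))  # True
```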
It is guaranteed that no more than $l = \mathrm{poly}(n)$ iterations are required (Douglas, 2011).\nThe construction of $C^l$ from $C^{l-1}$ differs in the WL and FWL versions. The difference is in how the colors are aggregated from neighboring k-tuples. We define two notions of neighborhoods of a k-tuple $\mathbf{i} \in [n]^k$:\n\n$N_j(\mathbf{i}) = \{ (i_1, \ldots, i_{j-1}, i', i_{j+1}, \ldots, i_k) \mid i' \in [n] \}$  (1)\n$N^F_j(\mathbf{i}) = \big( (j, i_2, \ldots, i_k), (i_1, j, i_3, \ldots, i_k), \ldots, (i_1, \ldots, i_{k-1}, j) \big)$  (2)\n\n$N_j(\mathbf{i})$, $j \in [k]$, is the j-th neighborhood of the tuple $\mathbf{i}$ used by the WL algorithm, while $N^F_j(\mathbf{i})$, $j \in [n]$, is the j-th neighborhood used by the FWL algorithm. Note that $N_j(\mathbf{i})$ is a set of n k-tuples, while $N^F_j(\mathbf{i})$ is an ordered set of k k-tuples. The inset to the right illustrates these notions of neighborhoods for the case k = 2: the top figure shows $N_1(3, 2)$ in purple and $N_2(3, 2)$ in orange; the bottom figure shows $N^F_j(3, 2)$ for all $j = 1, \ldots, n$, with different colors for different j.\nThe coloring update rules are:\n\nWL: $C^l_{\mathbf{i}} = \mathrm{enc}\Big( C^{l-1}_{\mathbf{i}},\ \big( \{\{ C^{l-1}_{\mathbf{j}} \mid \mathbf{j} \in N_j(\mathbf{i}) \}\} \;\big|\; j \in [k] \big) \Big)$  (3)\nFWL: $C^l_{\mathbf{i}} = \mathrm{enc}\Big( C^{l-1}_{\mathbf{i}},\ \{\{ \big( C^{l-1}_{\mathbf{j}} \mid \mathbf{j} \in N^F_j(\mathbf{i}) \big) \;\big|\; j \in [n] \}\} \Big)$  (4)\n\nwhere enc is a bijective map from the collection of all possible tuples on the r.h.s. of Equations (3)-(4) to $\Sigma$.\nWhen k = 1 both rules, (3)-(4), degenerate to $C^l_i = \mathrm{enc}\big( C^{l-1}_i, \{\{ C^{l-1}_j \mid j \in [n] \}\} \big)$, which will not refine any initial color. Traditionally, the first algorithm in the WL hierarchy is called WL, 1-WL, or the color refinement algorithm. In color refinement, one starts with the coloring prescribed by d. Then, in each iteration, the color at each vertex is refined by a new color representing its current color and the multiset of its neighbors' colors.\nSeveral known results on the WL and FWL algorithms (Cai et al., 1992; Grohe, 2017; Morris et al., 2018; Grohe and Otto, 2015) are:\n\n1. 1-WL and 2-WL have equivalent discrimination power.\n2. k-FWL is equivalent to (k+1)-WL for $k \geq 2$.\n3. For each $k \geq 2$ there is a pair of non-isomorphic graphs distinguishable by (k+1)-WL but not by k-WL.\n\n4 Colors and multisets in networks\n\nBefore we get to the two main contributions of this paper we address three challenges that arise when analyzing networks' ability to implement WL-like algorithms: (i) representing the colors $\Sigma$ in the network; (ii) implementing a multiset representation; and (iii) implementing the encoding function.\nColor representation. We represent colors as vectors. That is, we use tensors $C \in \mathbb{R}^{n^k \times a}$ to encode a color per k-tuple; the color of the tuple $\mathbf{i} \in [n]^k$ is a vector $C_{\mathbf{i}} \in \mathbb{R}^a$. This effectively replaces the color tensors $\Sigma^{n^k}$ in the WL algorithm with $\mathbb{R}^{n^k \times a}$.\nMultiset representation. A key technical part of our method is the way we encode multisets in networks. Since colors are represented as vectors in $\mathbb{R}^a$, an n-tuple of colors is represented by a matrix $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times a}$, where $x_j \in \mathbb{R}^a$, $j \in [n]$, are the rows of X. Thinking about X as a multiset forces us to be indifferent to the order of rows: the color representing $g \cdot X$ should be the same as the color representing X, for all $g \in S_n$. One possible approach is to apply some sort of ordering (e.g., lexicographic) to the rows of X. 
Unfortunately, this seems challenging to implement with equivariant layers.\nInstead, we suggest encoding a multiset X using a set of $S_n$-invariant functions called the Power-sum Multi-symmetric Polynomials (PMP) (Briand, 2004; Rydh, 2007). The PMP are the multivariate analog of the more widely known Power-sum Symmetric Polynomials, $p_j(y) = \sum_{i=1}^n y_i^j$, $j \in [n]$, where $y \in \mathbb{R}^n$. They are defined next. Let $\alpha = (\alpha_1, \ldots, \alpha_a) \in [n]^a$ be a multi-index, and for $y \in \mathbb{R}^a$ set $y^\alpha = y_1^{\alpha_1} \cdot y_2^{\alpha_2} \cdots y_a^{\alpha_a}$. Furthermore, $|\alpha| = \sum_{j=1}^a \alpha_j$. The PMP of degree $\alpha \in [n]^a$ is\n\n$p_\alpha(X) = \sum_{i=1}^n x_i^\alpha, \quad X \in \mathbb{R}^{n \times a}.$\n\nA key property of the PMP is that the finite subset $p_\alpha$, for $|\alpha| \leq n$, generates the ring of Multi-symmetric Polynomials (MP), the set of polynomials q such that $q(g \cdot X) = q(X)$ for all $g \in S_n$, $X \in \mathbb{R}^{n \times a}$ (see, e.g., (Rydh, 2007), Corollary 8.4). The PMP generate the ring of MP in the sense that for an arbitrary MP q there exists a polynomial r so that $q(X) = r(u(X))$, where\n\n$u(X) := \big( p_\alpha(X) \;\big|\; |\alpha| \leq n \big).$  (5)\n\nAs the following proposition shows, a useful consequence of this property is that the vector u(X) is a unique representation of the multiset $X \in \mathbb{R}^{n \times a}$.\nProposition 1. For arbitrary $X, X' \in \mathbb{R}^{n \times a}$: $\exists g \in S_n$ so that $X' = g \cdot X$ if and only if $u(X) = u(X')$.\nWe note that Proposition 1 is a generalization of Lemma 6 in Zaheer et al. (2017) to the case of multisets of vectors. This generalization was possible since the PMP provide a continuous way to encode vector multisets (as opposed to the scalar multisets in previous works). The full proof is provided in the supplementary material.\n\nEncoding function. 
One of the benefits of the vector representation of colors is that the encoding function can be implemented as a simple concatenation: given two color tensors $C \in \mathbb{R}^{n^k \times a}$, $C' \in \mathbb{R}^{n^k \times b}$, the tensor that represents for each k-tuple $\mathbf{i}$ the color pair $(C_{\mathbf{i}}, C'_{\mathbf{i}})$ is simply $(C, C') \in \mathbb{R}^{n^k \times (a+b)}$.\n\n5 k-order graph networks are as powerful as k-WL\n\nOur goal in this section is to show that, for every $2 \leq k \leq n$, k-order graph networks (Maron et al., 2019a) are at least as powerful as the k-WL graph isomorphism test in terms of distinguishing non-isomorphic graphs. This result is shown by constructing a k-order network model and a learnable weight assignment that implements the k-WL test.\nTo motivate this construction we note that the WL update step, Equation 3, is equivariant (see proof in the supplementary material): plugging in $g \cdot C^{l-1}$, the WL update step yields $g \cdot C^l$. Therefore, it is plausible to try to implement the WL update step using linear equivariant layers and non-linear pointwise activations.\nTheorem 1. Given two graphs $G = (V, E, d)$, $G' = (V', E', d')$ that can be distinguished by the k-WL graph isomorphism test, there exists a k-order network F so that $F(G) \neq F(G')$. In the other direction, for every two isomorphic graphs $G \cong G'$ and every k-order network F, $F(G) = F(G')$.\nThe full proof is provided in the supplementary material. Here we outline the basic idea of the proof. First, an input graph $G = (V, E, d)$ is represented using a tensor of the form $B \in \mathbb{R}^{n^2 \times (e+1)}$, as follows. The last channel of B, namely $B_{:,:,e+1}$ (':' stands for all possible values in [n]), encodes the adjacency matrix of G according to E. 
The first e channels $B_{:,:,1:e}$ are zero outside the diagonal, and $B_{i,i,1:e} = d(v_i) \in \mathbb{R}^e$ is the color of vertex $v_i \in V$.\nNow, the second statement in Theorem 1 is clear, since two isomorphic graphs $G, G'$ have tensor representations satisfying $B' = g \cdot B$ and therefore, as explained in Section 3.1, $F(B) = F(B')$.\nMore challenging is showing the other direction, namely that for non-isomorphic graphs $G, G'$ that can be distinguished by the k-WL test, there exists a k-order network distinguishing G and G'. The key idea is to show that a k-order network can encode the multisets $\{\{ B_{\mathbf{j}} \mid \mathbf{j} \in N_j(\mathbf{i}) \}\}$ for a given tensor $B \in \mathbb{R}^{n^k \times a}$. These multisets are the only non-trivial component in the WL update rule, Equation 3. Note that the rows of the matrix $X = B_{i_1,\ldots,i_{j-1},:,i_{j+1},\ldots,i_k,:} \in \mathbb{R}^{n \times a}$ are the colors (i.e., vectors) that define the multiset $\{\{ B_{\mathbf{j}} \mid \mathbf{j} \in N_j(\mathbf{i}) \}\}$. Following our multiset representation (Section 4), we would like the network to compute u(X) and place the result at the $\mathbf{i}$-th entry of an output tensor C. This can be done in two steps: First, apply the polynomial function $\tau : \mathbb{R}^a \to \mathbb{R}^b$, $b = \binom{n+a-1}{a-1}$, entrywise to B, where $\tau$ is defined by $\tau(x) = (x^\alpha \mid |\alpha| \leq n)$ (note that b is the number of multi-indices $\alpha$ such that $|\alpha| \leq n$). Denote the output of this step Y. Second, apply a linear equivariant operator summing over the j-th coordinate of Y to get C, that is,\n\n$C_{\mathbf{i},:} := L_j(Y)_{\mathbf{i},:} = \sum_{i'=1}^n Y_{i_1,\ldots,i_{j-1},i',i_{j+1},\ldots,i_k,:} = \sum_{\mathbf{j} \in N_j(\mathbf{i})} \tau(B_{\mathbf{j},:}) = u(X), \quad \mathbf{i} \in [n]^k,$\n\nwhere $X = B_{i_1,\ldots,i_{j-1},:,i_{j+1},\ldots,i_k,:}$ as desired. 
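The multiset encoding u(X) used in this step can be checked numerically on small instances. A sketch follows (numpy; for simplicity it enumerates all nonnegative exponent vectors with 1 ≤ |α| ≤ n, a slight superset of the paper's index set):

```python
import itertools
import numpy as np

def pmp_encoding(X):
    """u(X) = (p_alpha(X))_{1 <= |alpha| <= n}, where
    p_alpha(X) = sum_i prod_k X[i, k] ** alpha[k]."""
    n, a = X.shape
    u = [(X ** np.array(alpha)).prod(axis=1).sum()
         for alpha in itertools.product(range(n + 1), repeat=a)
         if 1 <= sum(alpha) <= n]
    return np.array(u)

X = np.array([[1.0, 2.0], [3.0, 4.0], [0.5, 1.5]])
print(np.allclose(pmp_encoding(X), pmp_encoding(X[[2, 0, 1]])))  # True: row order is ignored
Y = X.copy(); Y[0, 0] += 1.0
print(np.allclose(pmp_encoding(X), pmp_encoding(Y)))             # False: different multiset
```

By Proposition 1, equality of these encodings characterizes equality as multisets (up to floating-point tolerance on small inputs such as these).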
Lastly, we use the universal approximation theorem (Cybenko, 1989; Hornik, 1991) to replace the polynomial function $\tau$ with an approximating MLP $m : \mathbb{R}^a \to \mathbb{R}^b$ to get a k-order network (details are in the supplementary material). Applying m feature-wise, that is, $m(B)_{\mathbf{i},:} = m(B_{\mathbf{i},:})$, is in particular a k-order network in the sense of Section 3.1.\n\n6 A simple network with 3-WL discrimination power\n\nIn this section we describe a simple GNN model that has 3-WL discrimination power. The model has the form\n\n$F = m \circ h \circ B_d \circ B_{d-1} \circ \cdots \circ B_1,$  (6)\n\nwhere, as in k-order networks (see Section 3.1), h is an invariant layer and m is an MLP. $B_1, \ldots, B_d$ are blocks with the following structure (see Figure 2 for an illustration). Let $X \in \mathbb{R}^{n \times n \times a}$ denote the input tensor to the block. First, we apply three MLPs $m_1, m_2 : \mathbb{R}^a \to \mathbb{R}^b$, $m_3 : \mathbb{R}^a \to \mathbb{R}^{b'}$ to the input tensor, $m_l(X)$, $l \in [3]$. This means applying the MLP to each feature of the input tensor independently, i.e., $m_l(X)_{i_1,i_2,:} := m_l(X_{i_1,i_2,:})$, $l \in [3]$. Second, matrix multiplication is performed between matching features, i.e., $W_{:,:,j} := m_1(X)_{:,:,j} \cdot m_2(X)_{:,:,j}$, $j \in [b]$. The output of the block is the tensor $(m_3(X), W)$.\n\nFigure 2: Block structure.\n\nWe start by showing our basic requirement from a GNN, namely invariance:\nLemma 1. The model F described above is invariant, i.e., $F(g \cdot B) = F(B)$, for all $g \in S_n$ and B.\nProof. Note that matrix multiplication is equivariant: for two matrices $A, B \in \mathbb{R}^{n \times n}$ and $g \in S_n$ one has $(g \cdot A) \cdot (g \cdot B) = g \cdot (A \cdot B)$. This makes the basic building block $B_i$ equivariant, and consequently the model F invariant, i.e., $F(g \cdot B) = F(B)$.\n\nBefore we prove the 3-WL power of this model, let us provide some intuition as to why matrix multiplication improves expressiveness. 
Let us show that matrix multiplication allows this model to distinguish between the two graphs in Figure 1, which are 1-WL indistinguishable. The input tensor B representing a graph G holds the adjacency matrix at the last channel, $A := B_{:,:,e+1}$. We can build a network with two blocks computing $A^3$ and then take the trace of this matrix (using the invariant layer h). Recall that the d-th power of the adjacency matrix computes the number of d-paths between vertices; in particular, $\mathrm{tr}(A^3)$ computes the number of cycles of length 3. Counting shows the upper graph in Figure 1 has 0 such cycles while the bottom graph has 12. The main result of this section is:\nTheorem 2. Given two graphs $G = (V, E, d)$, $G' = (V', E', d')$ that can be distinguished by the 3-WL graph isomorphism test, there exists a network F (Equation 6) so that $F(G) \neq F(G')$. In the other direction, for every two isomorphic graphs $G \cong G'$ and every F (Equation 6), $F(G) = F(G')$.\nThe full proof is provided in the supplementary material. Here we outline the main idea of the proof. The second part of this theorem is already shown in Lemma 1. To prove the first part, namely that the model in Equation 6 has 3-WL expressiveness, we show it can implement the 2-FWL algorithm, which is known to be equivalent to 3-WL (see Section 3.2). As before, the challenge is in implementing the neighborhood multisets as used in the 2-FWL algorithm. That is, given an input tensor $B \in \mathbb{R}^{n^2 \times a}$, we would like to compute an output tensor $C \in \mathbb{R}^{n^2 \times b}$, where $C_{i_1,i_2,:} \in \mathbb{R}^b$ represents a color matching the multiset $\{\{ (B_{j,i_2,:}, B_{i_1,j,:}) \mid j \in [n] \}\}$. As before, we use the multiset representation introduced in Section 4. 
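The tr(A³) argument can be verified directly, assuming (as is standard for such pairs, and consistent with the counts 0 and 12 quoted above) that the two graphs of Figure 1 are the 6-cycle and two disjoint triangles:

```python
import numpy as np

def closed_3_walks(A):
    """tr(A^3): the number of closed walks of length 3 (6 per triangle)."""
    return int(np.trace(np.linalg.matrix_power(A, 3)))

C6 = np.zeros((6, 6), dtype=int)          # one 6-cycle
for i in range(6):
    C6[i, (i + 1) % 6] = C6[(i + 1) % 6, i] = 1

T2 = np.zeros((6, 6), dtype=int)          # two disjoint triangles
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    T2[a, b] = T2[b, a] = 1

print(closed_3_walks(C6), closed_3_walks(T2))  # 0 12
```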
Consider the matrix $X \in \mathbb{R}^{n \times 2a}$ defined by\n\n$X_{j,:} = (B_{j,i_2,:}, B_{i_1,j,:}), \quad j \in [n].$  (7)\n\nOur goal is to compute an output tensor $W \in \mathbb{R}^{n^2 \times b}$, where $W_{i_1,i_2,:} = u(X)$. Consider the multi-index set $\{ \alpha \mid \alpha \in [n]^{2a}, |\alpha| \leq n \}$ of cardinality $b = \binom{n+2a-1}{2a-1}$, and write it in the form $\{ (\beta_l, \gamma_l) \mid \beta_l, \gamma_l \in [n]^a, |\beta_l| + |\gamma_l| \leq n, l \in [b] \}$. Now define polynomial maps $\tau_1, \tau_2 : \mathbb{R}^a \to \mathbb{R}^b$ by $\tau_1(x) = (x^{\beta_l} \mid l \in [b])$ and $\tau_2(x) = (x^{\gamma_l} \mid l \in [b])$. We apply $\tau_1$ to the features of B, namely $Y_{i_1,i_2,l} := \tau_1(B)_{i_1,i_2,l} = (B_{i_1,i_2,:})^{\beta_l}$; similarly, $Z_{i_1,i_2,l} := \tau_2(B)_{i_1,i_2,l} = (B_{i_1,i_2,:})^{\gamma_l}$. Now,\n\n$W_{i_1,i_2,l} := (Z_{:,:,l} \cdot Y_{:,:,l})_{i_1,i_2} = \sum_{j=1}^n Z_{i_1,j,l} Y_{j,i_2,l} = \sum_{j=1}^n B_{j,i_2,:}^{\beta_l} B_{i_1,j,:}^{\gamma_l} = \sum_{j=1}^n (B_{j,i_2,:}, B_{i_1,j,:})^{(\beta_l,\gamma_l)},$\n\nhence $W_{i_1,i_2,:} = u(X)$, where X is defined in Equation 7. To get an implementation with the model in Equation 6 we need to replace $\tau_1, \tau_2$ with MLPs. We use the universal approximation theorem to that end (details are in the supplementary material).\nTo conclude, each update step of the 2-FWL algorithm is implemented in the form of a block $B_i$ applying $m_1, m_2$ to the input tensor B, followed by matrix multiplication of matching features, $W = m_1(B) \cdot m_2(B)$. Since Equation 4 requires pairing the multiset with the input color of each k-tuple, we take $m_3$ to be the identity and get $(B, W)$ as the block output.\n\nGeneralization to k-FWL. One possible extension is to add a generalized matrix multiplication to k-order networks to make them as expressive as k-FWL, and hence (k+1)-WL. Generalized matrix multiplication is defined as follows: given $A^1, \ldots, A^k \in \mathbb{R}^{n^k}$, then $(\odot_{i=1}^k A^i)_{\mathbf{i}} = \sum_{j=1}^n A^1_{j,i_2,\ldots,i_k} A^2_{i_1,j,i_3,\ldots,i_k} \cdots A^k_{i_1,\ldots,i_{k-1},j}$.\n\nRelation to (Morris et al., 2018). Our model offers two benefits over the 1-2-3-GNN suggested in Morris et al. (2018), a recently proposed GNN that also surpasses the expressiveness of message passing networks. First, it has lower space complexity (see details below). This allows us to work with a provably 3-WL expressive model, while Morris et al. (2018) resorted to a local 3-GNN version, hindering their 3-WL expressive power. Second, from a practical point of view, our model is arguably simpler to implement as it consists only of fully connected layers and matrix multiplication (without having to account for all subsets of size 3).\n\nComplexity analysis of a single block. Assuming a graph with n nodes, dense edge data and a constant feature depth, the layer proposed in Morris et al. (2018) has $O(n^3)$ space complexity (the number of subsets) and $O(n^4)$ time complexity ($O(n^3)$ subsets with $O(n)$ neighbors each). Our layer (block), however, has $O(n^2)$ space complexity, as only second-order tensors are stored (i.e., linear in the size of the graph data), and $O(n^3)$ time complexity due to the matrix multiplication. We note that the time complexity of Morris et al. (2018) can probably be improved to $O(n^3)$, while our time complexity can be improved to $O(n^{2.x})$ using more advanced matrix multiplication algorithms.\n\n7 Experiments\n\nImplementation details. We implemented the GNN model as described in Section 6 (see Equation 6) using the TensorFlow framework (Abadi et al., 2016). We used three identical blocks $B_1, B_2, B_3$, where in each block $B_i : \mathbb{R}^{n^2 \times a} \to \mathbb{R}^{n^2 \times b}$ we took $m_3(x) = x$ to be the identity (i.e., $m_3$ acts as a skip connection, similar to its role in the proof of Theorem 2); $m_1, m_2 : \mathbb{R}^a \to \mathbb{R}^b$ are chosen as d-layer MLPs with hidden layers of b features. 
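A single block of Equation 6 can be sketched in numpy as below; random linear maps stand in for the learned MLPs m1, m2, and m3 is the identity as in the experiments. This is an illustrative sketch, not the paper's TensorFlow code:

```python
import numpy as np

def block(X, M1, M2):
    """One block: featurewise maps, per-feature matrix multiplication,
    and concatenation with the (identity) skip branch."""
    Y1, Y2 = X @ M1, X @ M2                  # (n, n, b): m1, m2 applied featurewise
    W = np.einsum('ikj,kmj->imj', Y1, Y2)    # W[:, :, j] = Y1[:, :, j] @ Y2[:, :, j]
    return np.concatenate([X, W], axis=-1)   # (m3(X), W) with m3 = identity

rng = np.random.default_rng(1)
n, a, b = 4, 3, 5
X = rng.standard_normal((n, n, a))
M1, M2 = rng.standard_normal((a, b)), rng.standard_normal((a, b))
out = block(X, M1, M2)
print(out.shape)  # (4, 4, 8)
```

The per-feature matrix product is the only non-pointwise operation in the block; by the equivariance of matrix multiplication (Lemma 1), the whole block is permutation equivariant.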
After each block Bi we also added a single-layer MLP m4 : R^{b+a} -> R^b. Note that although this fourth MLP is not described in the model in Section 6, it clearly does not decrease (nor increase) the theoretical expressiveness of the model; we found it convenient in practice as it reduces the number of parameters of the model. For the first block, B1, the input feature depth is a = e + 1, while for the other blocks the input depth equals the output depth of the previous block, a = b. The MLPs are implemented with 1 x 1 convolutions.

Table 1: Graph classification results on the datasets from Yanardag and Vishwanathan (2015)

| dataset | MUTAG | PTC | PROTEINS | NCI1 | NCI109 | COLLAB | IMDB-B | IMDB-M |
| size | 188 | 344 | 1113 | 4110 | 4127 | 5000 | 1000 | 1500 |
| classes | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 3 |
| avg node # | 17.9 | 25.5 | 39.1 | 29.8 | 29.6 | 74.4 | 19.7 | 13 |
| GK (Shervashidze et al., 2009) | 81.39±1.7 | 55.65±0.5 | 71.39±0.3 | 62.49±0.3 | 62.35±0.3 | NA | NA | NA |
| RW (Vishwanathan et al., 2010) | 79.17±2.1 | 55.91±0.3 | 59.57±0.1 | > 3 days | NA | NA | NA | NA |
| PK (Neumann et al., 2016) | 76±2.7 | 59.5±2.4 | 73.68±0.7 | 82.54±0.5 | NA | NA | NA | NA |
| WL (Shervashidze et al., 2011) | 84.11±1.9 | 57.97±2.5 | 74.68±0.5 | 84.46±0.5 | 85.12±0.3 | NA | NA | NA |
| FGSD (Verma and Zhang, 2017) | 92.12 | 62.80 | 73.42 | 79.80 | 78.84 | 80.02 | 73.62 | 52.41 |
| AWE-DD (Ivanov and Burnaev, 2018) | NA | NA | NA | NA | NA | 73.93±1.9 | 74.45±5.8 | 51.54±3.6 |
| AWE-FB (Ivanov and Burnaev, 2018) | 87.87±9.7 | NA | NA | NA | NA | 70.99±1.4 | 73.13±3.2 | 51.58±4.6 |
| DGCNN (Zhang et al., 2018) | 85.83±1.7 | 58.59±2.5 | 75.54±0.9 | 74.44±0.5 | NA | 73.76±0.5 | 70.03±0.9 | 47.83±0.9 |
| PSCN (Niepert et al., 2016) (k=10) | 88.95±4.4 | 62.29±5.7 | 75±2.5 | 76.34±1.7 | NA | 72.6±2.2 | 71±2.3 | 45.23±2.8 |
| DCNN (Atwood and Towsley, 2016) | NA | NA | 61.29±1.6 | 56.61±1.0 | NA | 52.11±0.7 | 49.06±1.4 | 33.49±1.4 |
| ECC (Simonovsky and Komodakis, 2017) | 76.11 | NA | NA | 76.82 | 75.03 | NA | NA | NA |
| DGK (Yanardag and Vishwanathan, 2015) | 87.44±2.7 | 60.08±2.6 | 75.68±0.5 | 80.31±0.5 | 80.32±0.3 | 73.09±0.3 | 66.96±0.6 | 44.55±0.5 |
| DiffPool (Ying et al., 2018) | NA | NA | 78.1 | NA | NA | 75.5 | NA | NA |
| CCN (Kondor et al., 2018) | 91.64±7.2 | 70.62±7.0 | NA | 76.27±4.1 | 75.54±3.4 | NA | NA | NA |
| Invariant Graph Networks (Maron et al., 2019a) | 83.89±12.95 | 58.53±6.86 | 76.58±5.49 | 74.33±2.71 | 72.82±1.45 | 78.36±2.47 | 72.0±5.54 | 48.73±3.41 |
| GIN (Xu et al., 2019) | 89.4±5.6 | 64.6±7.0 | 76.2±2.8 | 82.7±1.7 | NA | 80.2±1.9 | 75.1±5.1 | 52.3±2.8 |
| 1-2-3 GNN (Morris et al., 2018) | 86.1± | 60.9± | 75.5± | 76.2± | NA | NA | 74.2± | 49.5± |
| Ours 1 | 90.55±8.7 | 66.17±6.54 | 77.2±4.73 | 83.19±1.11 | 81.84±1.85 | 80.16±1.11 | 72.6±4.9 | 50±3.15 |
| Ours 2 | 88.88±7.4 | 64.7±7.46 | 76.39±5.03 | 81.21±2.14 | 81.77±1.26 | 81.38±1.42 | 72.2±4.26 | 44.73±7.89 |
| Ours 3 | 89.44±8.05 | 62.94±6.96 | 76.66±5.59 | 80.97±1.91 | 82.23±1.42 | 80.68±1.71 | 73±5.77 | 50.46±3.59 |
| Rank | 3rd | 2nd | 2nd | 2nd | 2nd | 1st | 6th | 5th |

Parameter search was conducted on learning rate and learning rate decay, as detailed below. We experimented with two network suffixes adopted from previous papers: (i) the suffix used in Maron et al. (2019a), which consists of an invariant max pooling (diagonal and off-diagonal) followed by three Fully Connected (FC) layers with hidden units' sizes of (512, 256, #classes); (ii) the suffix used in Xu et al. (2019) adapted to our network: we apply the invariant max layer from Maron et al. (2019a) to the output of every block, followed by a single fully connected layer to #classes. These outputs are then summed together and used as the network output on which the loss function is defined.

Table 2: Regression, the QM9 dataset.

| Target | DTNN | MPNN | 123-gnn | Ours 1 | Ours 2 |
| μ | 0.244 | 0.358 | 0.476 | 0.231 | 0.0934 |
| α | 0.95 | 0.89 | 0.27 | 0.382 | 0.318 |
| ε_homo | 0.00388 | 0.00541 | 0.00337 | 0.00276 | 0.00174 |
| ε_lumo | 0.00512 | 0.00623 | 0.00351 | 0.00287 | 0.0021 |
| Δε | 0.0112 | 0.0066 | 0.0048 | 0.00406 | 0.0029 |
| ⟨R²⟩ | 17 | 28.5 | 22.9 | 16.07 | 3.78 |
| ZPVE | 0.00172 | 0.00216 | 0.00019 | 0.00064 | 0.000399 |
| U0 | - | - | 0.0427 | 0.234 | 0.022 |
| U | - | - | 0.111 | 0.234 | 0.0504 |
| H | - | - | 0.0419 | 0.229 | 0.0294 |
| G | - | - | 0.0469 | 0.238 | 0.024 |
| Cv | 0.27 | 0.42 | 0.0944 | 0.184 | 0.144 |

Datasets. We evaluated our network on two different tasks: graph classification and graph regression. For classification, we tested our method on eight real-world graph datasets from Yanardag and Vishwanathan (2015): three datasets consist of social network graphs, and the other five datasets come from bioinformatics and represent chemical compounds or protein structures. Each graph is represented by an adjacency matrix and possibly categorical node features (for the bioinformatics datasets). For the regression task, we conducted an experiment on a standard graph learning benchmark called the QM9 dataset (Ramakrishnan et al., 2014; Wu et al., 2018).
It is composed of 134K small organic molecules (sizes vary from 4 to 29 atoms). Each molecule is represented by an adjacency matrix, a distance matrix (between atoms), categorical data on the edges, and node features; the data was obtained from the pytorch-geometric library (Fey and Lenssen, 2019). The task is to predict 12 real-valued physical quantities for each molecule.

Graph classification results. We follow the standard 10-fold cross validation protocol and splits from Zhang et al. (2018) and report our results according to the protocol described in Xu et al. (2019), namely the best averaged accuracy across the 10 folds. Parameter search was conducted on a fixed random 90%-10% split: learning rate in {5e-5, 1e-4, 5e-4, 1e-3}; learning rate decay in [0.5, 1] every 20 epochs. We tested three architectures: (1) b = 400, d = 2, and suffix (ii); (2) b = 400, d = 2, and suffix (i); and (3) b = 256, d = 3, and suffix (ii) (see above for the definitions of b, d and the suffixes). Table 1 presents a summary of the results (the top part lists non-deep-learning methods). The last row presents our ranking compared to all previous methods; note that we scored in the top 3 methods on 6 out of 8 datasets.

Graph regression results. The data is randomly split into 80% train, 10% validation and 10% test. We conducted the same parameter search as in the previous experiment on the validation set. We used network (2) from the classification experiment, i.e., b = 400, d = 2, and suffix (i), with an absolute error loss adapted to the regression task. Test results are reported according to the best validation error. We tried two different settings: (1) training a single network to predict all the output quantities together and (2) training a different network for each quantity.
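The 80/10/10 random split above can be sketched as follows. This is an illustrative sketch only: the molecule count and seed are placeholders (the actual data comes from pytorch-geometric's QM9 loader), and integer division is used so the three index sets exactly partition the dataset.

```python
import numpy as np

n_mol = 133885                         # approximate QM9 size, for illustration
rng = np.random.default_rng(42)        # placeholder seed
perm = rng.permutation(n_mol)          # random order over molecule indices

n_train = n_mol * 8 // 10              # 80% train
n_val = n_mol // 10                    # 10% validation, remainder is test
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))
```

Model selection then picks the epoch/hyperparameters with the best error on `val_idx`, and the reported numbers are computed on `test_idx`.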
Table 2 compares the mean absolute error of our method with three other methods: DTNN and MPNN, whose results are reported in Wu et al. (2018), and 123-gnn (Morris et al., 2018); results of all previous work were taken from Morris et al. (2018). Note that our method achieves the lowest error on 5 out of the 12 quantities when using a single network, and the lowest error on 9 out of the 12 quantities when each quantity is predicted by an independent network.

Equivariant layer evaluation. The model in Section 6 does not incorporate all the equivariant linear layers characterized in Maron et al. (2019a). It is therefore of interest to compare this model to models richer in linear equivariant layers, as well as to a simple MLP baseline (i.e., without matrix multiplication). We performed such an experiment on the NCI1 dataset (Yanardag and Vishwanathan, 2015), comparing: (i) our suggested model, denoted Matrix Product (MP); (ii) matrix product + the full linear basis from Maron et al. (2019a) (MP+LIN); (iii) only the full linear basis (LIN); and (iv) an MLP applied to the feature dimension.

Due to the memory limitation in Maron et al. (2019a) we used the same feature depths of b1 = 32, b2 = 64, b3 = 256, and d = 2. The inset shows the performance of all methods on both the training and validation sets, where we performed a parameter search on the learning rate (as above) for a fixed decay rate of 0.75 every 20 epochs. Although all methods (excluding MLP) are able to achieve a zero training error, MP and MP+LIN enjoy better generalization than the linear basis of Maron et al. (2019a). Note that MP and MP+LIN are comparable; however, MP is considerably more efficient.

8 Conclusions

We explored two models for graph neural networks that possess superior graph distinction abilities compared to existing models. First, we proved that k-order invariant networks offer a hierarchy of neural networks that parallels the distinction power of the k-WL tests.
This model has limited practical interest due to the high dimensional tensors it uses. Second, we suggested a simple GNN model consisting only of MLPs augmented with matrix multiplication and proved that it achieves 3-WL expressiveness. This model operates on input tensors of size n^2 and is therefore useful for problems with dense edge data. The downside is that its space complexity is still quadratic in the number of nodes, worse than message passing type methods. An interesting future work is to search for more efficient GNN models with high expressiveness. Another interesting research venue is quantifying the generalization ability of these models.

Acknowledgments

This research was supported in part by the European Research Council (ERC Consolidator Grant, "LiftMatch" 771136) and the Israel Science Foundation (Grant No. 1830/17).

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265-283.

Atwood, J. and Towsley, D. (2016). Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993-2001.

Babai, L. (2016). Graph isomorphism in quasipolynomial time. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 684-697. ACM.

Briand, E. (2004). When is the algebra of multisymmetric polynomials generated by the elementary multisymmetric polynomials. Contributions to Algebra and Geometry, 45(2):353-368.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. (2013). Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.

Cai, J.-Y., Fürer, M., and Immerman, N. (1992).
An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389-410.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314.

Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844-3852.

Douglas, B. L. (2011). The Weisfeiler-Lehman method and graph isomorphism testing. arXiv preprint arXiv:1101.5211.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224-2232.

Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263-1272.

Gori, M., Monfardini, G., and Scarselli, F. (2005). A new model for learning in graph domains. Proceedings of the International Joint Conference on Neural Networks, 2:729-734.

Grohe, M. (2017). Descriptive Complexity, Canonisation, and Definable Graph Structure Theory, volume 47. Cambridge University Press.

Grohe, M. and Otto, M. (2015). Pebble games and linear equations. The Journal of Symbolic Logic, 80(3):797-844.

Hamilton, W., Ying, Z., and Leskovec, J. (2017a). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024-1034.

Hamilton, W. L., Ying, R., and Leskovec, J. (2017b). Representation learning on graphs: Methods and applications.
arXiv preprint arXiv:1709.05584.

Henaff, M., Bruna, J., and LeCun, Y. (2015). Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.

Ivanov, S. and Burnaev, E. (2018). Anonymous walk embeddings. arXiv preprint arXiv:1805.11921.

Keriven, N. and Peyré, G. (2019). Universal invariant and equivariant graph neural networks. CoRR, abs/1905.04943.

Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. (2018). Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144.

Kriege, N. M., Johansson, F. D., and Morris, C. (2019). A survey on graph kernels.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105.

Lei, T., Jin, W., Barzilay, R., and Jaakkola, T. (2017). Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning, pages 2024-2033.

Levie, R., Monti, F., Bresson, X., and Bronstein, M. M. (2017). CayleyNets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. (2015). Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. (2019a). Invariant and equivariant graph networks. In International Conference on Learning Representations.

Maron, H., Fetaya, E., Segol, N., and Lipman, Y. (2019b).
On the universality of invariant networks. In International Conference on Machine Learning.

Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., and Bronstein, M. M. (2017). Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, volume 1, page 3.

Monti, F., Shchur, O., Bojchevski, A., Litany, O., Günnemann, S., and Bronstein, M. M. (2018). Dual-primal graph convolutional networks. pages 1-11.

Morris, C., Kersting, K., and Mutzel, P. (2017). Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In 2017 IEEE International Conference on Data Mining (ICDM), pages 327-336. IEEE.

Morris, C. and Mutzel, P. (2019). Towards a practical k-dimensional Weisfeiler-Leman algorithm. arXiv preprint arXiv:1904.01543.

Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. (2018). Weisfeiler and Leman go neural: Higher-order graph neural networks. arXiv preprint arXiv:1810.02244.

Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. (2019). Relational pooling for graph representations. arXiv preprint arXiv:1903.02541.

Neumann, M., Garnett, R., Bauckhage, C., and Kersting, K. (2016). Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209-245.

Niepert, M., Ahmed, M., and Kutzkov, K. (2016). Learning convolutional neural networks for graphs. In International Conference on Machine Learning.

Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. (2014). Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022.

Rydh, D. (2007). A minimal set of generators for the ring of multisymmetric functions. In Annales de l'institut Fourier, volume 57, pages 1741-1769.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2009). The graph neural network model.
IEEE Transactions on Neural Networks, 20(1):61-80.

Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539-2561.

Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. (2009). Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488-495.

Simonovsky, M. and Komodakis, N. (2017). Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.

Verma, S. and Zhang, Z.-L. (2017). Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pages 88-98.

Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. (2010). Graph kernels. Journal of Machine Learning Research, 11(Apr):1201-1242.

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513-530.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2019). How powerful are graph neural networks? In International Conference on Learning Representations.

Yanardag, P. and Vishwanathan, S. (2015). Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15.

Ying, R., You, J., Morris, C., Ren, X., Hamilton, W. L., and Leskovec, J. (2018). Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.
R., and Smola, A. J. (2017). Deep sets. In Advances in Neural Information Processing Systems, pages 3391-3401.

Zhang, M. and Chen, Y. (2017). Weisfeiler-Lehman neural machine for link prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 575-583. ACM.

Zhang, M., Cui, Z., Neumann, M., and Chen, Y. (2018). An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence.