{"title": "Understanding the Representation Power of Graph Neural Networks in Learning Graph Topology", "book": "Advances in Neural Information Processing Systems", "page_first": 15413, "page_last": 15423, "abstract": "To deepen our understanding of graph neural networks, we investigate the representation power of Graph Convolutional Networks (GCN) through the looking glass of graph moments, a key property of graph topology encoding path of various lengths.\nWe find that GCNs are rather restrictive in learning graph moments. Without careful design, GCNs can fail miserably even with multiple layers and nonlinear activation functions.\nWe analyze theoretically the expressiveness of GCNs, arriving at a modular GCN design, using different propagation rules.\nOur modular design is capable of distinguishing graphs from different graph generation models for surprisingly small graphs, a notoriously difficult problem in network science. \nOur investigation suggests that, depth is much more influential than width and deeper GCNs are more capable of learning higher order graph moments. \nAdditionally, combining GCN modules with different propagation rules is critical to the representation power of GCNs.", "full_text": "Understanding the Representation Power of Graph\n\nNeural Networks in Learning Graph Topology\n\nNima Dehmamy\u2217\n\nAlbert-L\u00e1szl\u00f3 Barab\u00e1si\u2020\n\nCSSI, Kellogg School of Management\nNorthwestern University, Evanston, IL\n\nCenter for Complex Network Research,\n\nNortheastern University, Boston MA\n\nnimadt@bu.edu\n\nalb@neu.edu\n\nRose Yu\n\nKhoury College of Computer Sciences,\nNortheastern University, Boston, MA\n\nroseyu@northeastern.edu\n\nAbstract\n\nTo deepen our understanding of graph neural networks, we investigate the repre-\nsentation power of Graph Convolutional Networks (GCN) through the looking\nglass of graph moments, a key property of graph topology encoding path of vari-\nous lengths. 
We find that GCNs are rather restrictive in learning graph moments. Without careful design, GCNs can fail miserably even with multiple layers and nonlinear activation functions. We analyze theoretically the expressiveness of GCNs, concluding that a modular GCN design, using different propagation rules with residual connections, can significantly improve the performance of GCNs. We demonstrate that such modular designs are capable of distinguishing graphs from different graph generation models for surprisingly small graphs, a notoriously difficult problem in network science. Our investigation suggests that depth is much more influential than width, with deeper GCNs being more capable of learning higher order graph moments. Additionally, combining GCN modules with different propagation rules is critical to the representation power of GCNs.

1 Introduction

The surprising effectiveness of graph neural networks [17] has led to an explosion of interest in graph representation learning, with applications from particle physics [12], to molecular biology [37], to robotics [4]. We refer readers to several recent surveys [7, 38, 33, 14] and the references therein for a non-exhaustive list of the research. Graph convolutional networks (GCNs) are among the most popular graph neural network models. In contrast to existing deep learning architectures, GCNs are known to contain fewer parameters, can handle irregular grids with non-Euclidean geometry, and introduce relational inductive bias into data-driven systems. It is therefore commonly believed that graph neural networks can learn arbitrary representations of graph data.

Despite their practical success, most GCNs are deployed as black-box feature extractors for graph data. It is not yet clear to what extent these models can capture different graph features.
One prominent feature of graph data is node permutation invariance: many graph structures stay the same under relabelling, i.e. permutations, of the nodes. For instance, people in a friendship network may follow a similar pattern for making friends in similar cultures. To satisfy permutation invariance, GCNs assign global parameters to all the nodes, which significantly simplifies learning. But such efficiency comes at the cost of expressiveness: GCNs are not universal function approximators [34]. We use GCN in a broader sense than in [20], allowing different propagation rules (see below (4)).

*Work done while at the Center for Complex Network Research, Northeastern University, Boston, MA.
†Center for Cancer Systems Biology, Dana-Farber Cancer Institute, Boston, MA; Brigham and Women's Hospital, Harvard Medical School, Boston, MA; Center for Network Science, Central European University, Budapest, Hungary.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To obtain a deeper understanding of graph neural networks, a few recent works have investigated the behavior of GCNs, including their expressiveness and generalization. For example, [28] showed that message-passing GCNs can approximate measurable functions in probability. [34, 24, 25] defined expressiveness as the capability of learning multi-set functions and proved that GCNs are at most as powerful as the Weisfeiler-Lehman test for graph isomorphism, but assuming GCNs with an infinite number of hidden units and layers. [32] analyzed the generalization and stability of GCNs, suggesting that the generalization gap of GCNs depends on the eigenvalues of the graph filters. However, their analysis is limited to a single-layer GCN for semi-supervised learning tasks.
Up until now, the representation power of multi-layer GCNs for learning graph topology has remained elusive.

In this work, we analyze the representation power of GCNs in learning graph topology using graph moments, which capture key features of the underlying random process from which a graph is produced. We argue that enforcing node permutation invariance restricts the representation power of GCNs. We discover pathological cases for learning graph moments with GCNs. We derive the representation power in terms of the number of hidden units (width), the number of layers (depth), and the propagation rules. We show how a modular design for GCNs with different propagation rules significantly improves the representation power of GCN-based architectures. We apply our modular GCNs to distinguish different graph topologies from small graphs. Our experiments show that depth is much more influential than width in learning graph moments, and that combining different GCN modules can greatly improve the representation power of GCNs.³

In summary, our contributions in this work include:

• We reveal the limitations of graph convolutional networks in learning graph topology. For learning graph moments, certain GCN designs fail completely, even with multiple layers and non-linear activation functions.

• We provide theoretical guarantees for the representation power of GCNs for learning graph moments, which suggest a strict dependence on the depth, whereas the width plays a weaker role in many cases.

• We take a modular approach in designing GCNs that can learn a large class of node permutation invariant functions of the graph, including non-smooth functions. We find that having different graph propagation rules with residual connections can dramatically increase the representation power of GCNs.

• We apply our approach to build a "graph stethoscope": given a graph, classify its generating process or topology.
We provide experimental evidence to validate our theoretical analysis and the benefits of a modular approach.

Notation and Definitions. A graph is a set of N nodes connected via a set of edges. The adjacency matrix A of a graph encodes the graph topology, where each element A_ij represents an edge from node i to node j. We use AB and A · B (if more than two indices may be present) to denote the matrix product of matrices A and B. All multiplications and exponentiations are matrix products, unless explicitly stated. Lower indices A_ij denote the i, jth element of A, and A_i means the ith row. A^p denotes the pth matrix power of A. We use a^(m) to denote a parameter of the mth layer.

2 Learning Graph Moments

Given a collection of graphs produced by an unknown random graph generation process, learning from graphs requires us to accurately infer the characteristics of the underlying generation process. Similar to how the moments E[X^p] of a random variable X characterize its probability distribution, graph moments [5, 23] characterize the random process from which the graph is generated.

³All code and hyperparameters are available at https://github.com/nimadehmamy/Understanding-GCN

2.1 Graph moments

In general, a pth order graph moment M_p is the ensemble average of an order-p polynomial of A:

    M_p(A) = ∏_{q=1}^{p} (A · W_q + B_q)    (1)

with W_q and B_q being N × N matrices. Under the constraint of node permutation invariance, W_q must be either proportional to the identity matrix, or a uniform aggregation matrix. Formally, for

    M(A) = A · W + B,    node permutation invariance ⇒ W, B = cI  or  W, B = c 11^T    (2)

where 1 is a vector of ones. Graph moments encode topological information of a graph and are useful for graph coloring and Hamiltonicity. For instance, the graph power (A^p)_ij counts the number of paths from node i to j of length p. For a graph of size N, A has N eigenvalues.
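These moments are easy to compute directly, which is useful for generating training targets. Below is a minimal numpy sketch (ours, for illustration; not the paper's code) of M_p restricted to the common special case W_q = I, B_q = 0, so that M_p(A)_i = Σ_j (A^p)_ij:

```python
import numpy as np

def graph_moment(A, p):
    """Node-level graph moment of order p: M_p(A)_i = sum_j (A^p)_ij.

    For p = 1 this is the node degree; for general p it counts the
    paths of length p starting at each node.
    """
    Ap = np.linalg.matrix_power(A, p)  # p-th matrix power of A
    return Ap.sum(axis=1)              # row sums: paths of length p from node i

# Small worked example: a directed 2-node graph A = [[0, a], [b, 0]].
a, b = 2.0, 3.0
A = np.array([[0.0, a], [b, 0.0]])
print(graph_moment(A, 1))  # degrees: [a, b] -> [2. 3.]
print(graph_moment(A, 2))  # second moment: [a*b, b*a] -> [6. 6.]
```

Relabelling the nodes permutes the entries of the returned vector but leaves its distribution unchanged, which is the node permutation invariance discussed above.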
Applying an eigenvalue decomposition to graph moments, we have E[A^p] = E[(V^T Λ U)^p] = V^T E[Λ^p] U. Graph moments thus correspond to the distribution of the eigenvalues Λ, which are random variables that characterize the graph generation process. Graph moments are node permutation invariant, meaning that relabelling of the nodes will not change the distribution of degrees, the paths of a given length, or the number of triangles, to name a few. The problem of learning graph moments is to learn a function approximator F such that F : A → M_p(A), while preserving node permutation invariance.

Different graph generation processes can depend on different orders of graph moments. For example, in the Barabási-Albert (BA) model [1], the probability of adding a new edge is proportional to the degree, which is a first order graph moment. In diffusion processes, however, the stationary distribution depends on the normalized adjacency matrix Â as well as its symmetrized version Â_s, defined as follows:

    D_ij ≡ δ_ij Σ_k A_ik,    Â ≡ D^{-1} A,    Â_s ≡ D^{-1/2} A D^{-1/2}    (3)

which are not smooth functions of A and have no Taylor expansion in A, because of the inverse D^{-1}. Processes involving D^{-1} and A are common, and per (2), D and Tr[A] are the only node permutation invariant first order moments of A. Thus, in order to approximate more general node permutation invariant F(A), it is crucial for a graph neural network to be able to learn moments of A, Â and Â_s simultaneously. In general, non-smooth functions of A can depend on A^{-1}, which may be important for inverting a diffusion process. We will only focus on using A, Â and Â_s here, but all arguments also hold if we include A^{-1}, Â^{-1} and Â_s^{-1} as well.

2.2 Learning with Fully Connected Networks

Consider a toy example of learning the first order moment.
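The normalized operators of Eqn. (3) can be formed directly from A. A minimal numpy sketch (illustrative, assuming every node has nonzero degree so that D is invertible):

```python
import numpy as np

def propagation_operators(A):
    """Return the three operators used throughout: A, D^-1 A, D^-1/2 A D^-1/2.

    D is the diagonal degree matrix D_ij = delta_ij * sum_k A_ik.
    Assumes every node has nonzero degree, so D is invertible.
    """
    deg = A.sum(axis=1)
    D_inv = np.diag(1.0 / deg)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = D_inv @ A                      # row-normalized adjacency
    A_hat_s = D_inv_sqrt @ A @ D_inv_sqrt  # symmetrically normalized adjacency
    return A, A_hat, A_hat_s

# Rows of D^-1 A sum to 1: a random-walk transition matrix, which is not
# a polynomial (Taylor-expandable) function of A.
A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
_, A_hat, A_hat_s = propagation_operators(A)
print(A_hat.sum(axis=1))  # -> [1. 1. 1.]
```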
Given a collection of graphs with N = 20 nodes, the inputs are their adjacency matrices A, and the outputs are the node degrees D_i = Σ_{j=1}^N A_ij. For a fully connected (FC) neural network, this is a rather simple task given its universal approximation power [19]. However, since a FC network treats the adjacency matrices as vector inputs and ignores the underlying graph structure, it needs a large number of training samples and many parameters to learn properly.

Fig. 1 shows the mean squared error (MSE) of a single layer FC network in learning the first order moments. Each curve corresponds to a different number of training samples, ranging from 500 to 10,000. The horizontal axis shows the number of hidden units. We can see that even though the network can learn the moments properly, reaching an MSE of ≈ 10^{-4}, it requires the same order of magnitude of hidden units as the number of nodes in the graph, and at least 1,000 samples. Therefore, FC networks are quite inefficient for learning graph moments, which motivates us to look into more powerful alternatives: graph convolutional networks.

2.3 Learning with Graph Convolutional Networks

We consider the following class of graph convolutional networks. A single layer GCN propagates the node attributes h using a function f(A) of the adjacency matrix and has an output given by

    F(A, h) = σ(f(A) · h · W + b)    (4)

Figure 1: Learning graph moments (Erdős-Rényi graph) with a single fully-connected layer (adjacency N×N → hidden FC layer with n units → output layer with N outputs). Best validation MSE w.r.t. the number of hidden units n and the number of samples in the training data (curves of different colors).

Figure 2: Learning the degree of nodes in a graph with a single layer of GCN. When the GCN layer is designed as σ(A · h · W) with linear activation function σ(x) = x, the network easily learns the degree (a).
However, if the network uses the propagation rule σ(D^{-1}A · h · W), it fails to learn the degree, with a very high MSE loss (b). The training data were instances of Barabási-Albert graphs (preferential attachment) with N = 20 nodes and m = 2 initial edges.

where f is called the propagation rule, h_i is the attribute of node i, W is the weight matrix and b is the bias. As we are interested in the graph topology, we ignore the node attributes and set h_i = 1. Note that the weights W are only coupled to the node attributes h, but not to the propagation rule f(A). The definition in Eqn. (4) covers a broad class of GCNs. For example, the GCN in [20] uses f = D^{-1/2}AD^{-1/2}. The GraphSAGE [16] mean aggregator is equivalent to f = D^{-1}A. These architectures are also special cases of Message-Passing Neural Networks [13].

We apply a single layer GCN with different propagation rules to learn the node degrees of BA graphs. With linear activation σ(x) = x, the solution for learning node degrees is f(A) = A, W = 1 and b = 0. For higher order graph moments of the form M_p = Σ_j (A^p)_ij, a single layer GCN has to learn the function f(A) = A^p. As shown in Figure 2, a single layer GCN with f(A) = A can learn the degrees perfectly, even with as few as 50 training samples for a graph of N = 20 nodes (Fig. 2a). Note that the GCN only requires 1 hidden unit to learn, which is much more efficient than the FC networks. However, if we set the learning target as f(A) = D^{-1}A, the same GCN completely fails at learning the graph moments regardless of the sample size, as shown in Fig. 2b. This demonstrates a limitation of GCNs due to the permutation invariance constraint. Next we analyze this phenomenon and provide theoretical guarantees for the representation power of GCNs.

3 Theoretical Analysis

To learn graph topology, fully connected layers require a large number of hidden units.
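The single-layer GCN of Eqn. (4), and the exact degree solution f(A) = A, W = 1, b = 0 discussed above, can be sketched as follows (an illustrative numpy version, not the paper's implementation):

```python
import numpy as np

def gcn_layer(A, h, W, b=0.0, f=lambda A: A, sigma=lambda x: x):
    """One GCN layer F(A, h) = sigma(f(A) . h . W + b), as in Eqn. (4).

    f is the propagation rule; sigma defaults to the identity
    (linear activation).
    """
    return sigma(f(A) @ h @ W + b)

# With f(A) = A, h = 1 (all-ones column), W = [[1.]] and b = 0, the layer
# outputs exactly the node degrees, matching the solution in the text.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])  # triangle graph
h = np.ones((3, 1))
W = np.array([[1.0]])
print(gcn_layer(A, h, W).ravel())  # -> [2. 2. 2.]
```

Note that W acts only on the attribute channels of h, never on the propagation rule itself, which is the coupling structure that the permutation invariance argument below relies on.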
The following theorem characterizes the representation power of a fully connected neural network for learning graph moments in terms of the number of nodes N, the order of the moment p, and the number of hidden units n.

Theorem 1. A fully connected neural network with one hidden layer requires n > O(C_f^2) ∼ O(p^2 N^{2q}) neurons in the best case, with 1 ≤ q ≤ 2, to learn a graph moment of order p for graphs with N nodes. Additionally, it also needs S > O(nd) ∼ O(p^2 N^{2q+2}) samples to make the learning tractable.

Clearly, if a FC network fully parameterizes every element in an N × N adjacency matrix A, the dimension of the input would have to be d = N^2. If the FC network allows weight sharing among nodes, the input dimension would be d = N. The Fourier transform of a polynomial function of order p with O(1) coefficients will have an L1 norm of C_f ∼ O(p). Using Barron's result [2] with d = N^q, where 1 ≤ q ≤ 2, and setting C_f ∼ O(p), we obtain the approximation bound.

In contrast to fully connected neural networks, graph convolutional networks are more efficient in learning graph moments. A graph convolutional network layer without bias is of the form:

    F(A, h) = σ(f(A) · h · W)    (5)

Permutation invariance restricts the weight matrix W to be either proportional to the identity matrix, or a uniform aggregation matrix, see Eqn. (2). When W = cI, the resulting graph moment M_p(A) has exactly the form of the output of a p-layer GCN with linear activation function.

We first show, via an explicit example, that an n < p layer GCN obtained by stacking layers of the form in Eqn. (5) cannot learn pth order graph moments.

Lemma 1.
A graph convolutional network with n < p layers cannot, in general, learn a graph moment of order p for a set of random graphs.

We prove this by showing a counterexample. Consider a directed graph of two nodes with adjacency matrix A = [[0, a], [b, 0]]. Suppose we want to use a single layer GCN to learn the second order moment f(A)_i = Σ_j (A^2)_ij = Σ_k A_ik D_k. The node attributes h_il are decoupled from the propagation rule f(A)_i. Their values are set to ones, h_il = 1, or any values independent of A. The network tries to learn the weight matrix W_lμ and has an output h^(1) of the form

    h^(1)_iμ = σ(A · h · W)_iμ = σ(Σ_{j,l} A_ij h_jl W_lμ),    (6)

For brevity, define V_iμ ≡ Σ_l h_il W_lμ. Setting the output h^(1) to the desired function A · D, with components h^(1)_1μ = h^(1)_2μ = ab (hence μ can only be 1), and plugging in A, the two components of the output become

    h^(1)_1μ = σ(D_1 V_1μ) = σ(a V_1μ) = ab,    h^(1)_2μ = σ(D_2 V_2μ) = σ(b V_2μ) = ab,    (7)

which must be satisfied ∀ a, b. But it is impossible to satisfy σ(a V_1μ) = ab for all (a, b) ∈ R^2 with V_1μ and σ(·) independent of a, b.

Proposition 1. A graph convolutional network with n layers, and no bias terms, can in general learn f(A)_i = Σ_j (A^p)_ij only if n = p, or n > p if a bias is allowed.

If we use a two-layer GCN to learn a first order moment f(A)_i = Σ_j A_ij = D_i, then for the output of the second layer
h^(2)_iν, we have

    h^(2) = σ^(2)(A · σ^(1)(A · h · W^(1)) · W^(2)),    h^(2)_1ν = σ^(2)(a Σ_μ σ^(1)(b V^(1)_2μ) W^(2)_μν) = a    (8)

Again, since this must hold for any value of a, b and ν, we see that h^(2)_1ν is a function of b through the output of the first layer h^(1)_2μ. Thus h^(2)_1ν = a can only be satisfied if the first layer output is a constant. In other words, only if the first layer can be bypassed (e.g. if the bias is large and the weights are zero) can a two-layer GCN learn the first order moment.

This result also generalizes to multiple layers and higher order moments in a straightforward fashion. For a GCN with linear activation, a similar argument shows that when the node attributes h are not implicitly a function of A, in order to learn the function Σ_j (A^p)_ij we need exactly n = p GCN layers, without bias. With bias, a feed-forward GCN with n > p layers can learn single-term order-p moments such as Σ_j (A^p)_ij. However, since it needs to set some weights of n − p layers to zero, it can fail to learn mixed order moments such as Σ_j (A^q + A^p)_ij.

To allow GCNs with very few parameters to learn mixed order moments, we introduce residual connections [18] by concatenating the output of every layer [h^(1), ..., h^(m)] into the final output of the network. This way, by applying an aggregation layer or a FC layer which acts the same way on the output for every node, we can approximate any polynomial function of graph moments. Specifically, the final N × d_o output h^(final) of the aggregation layer has the form

    h^(final)_iμ = σ(Σ_{m=1}^n a^(m)_μ · h^(m)_i),    h^(m) = σ(A · h^(m−1) · W^(m) + b^(m)),    (9)

where · acts on the output channels of each layer.
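The mechanism behind Eqn. (9), in which each linear GCN layer raises the power of A by one and the residual connections expose every intermediate moment to the aggregation layer, can be illustrated with a toy numpy sketch (all weights fixed to the identity, no bias; illustrative only):

```python
import numpy as np

def residual_gcn_outputs(A, h, n_layers):
    """Linear GCN stack h^(m) = A . h^(m-1) with all W^(m) = I, collecting
    every intermediate output as in the residual design of Eqn. (9).

    Returns [h^(1), ..., h^(n)]; h^(m) equals A^m . h, i.e. the m-th moment
    when h = 1. A linear aggregation over this list can then form mixed
    order moments such as sum_j (A + A^2)_ij.
    """
    outs, hm = [], h
    for _ in range(n_layers):
        hm = A @ hm          # one linear GCN layer, W = I, no bias
        outs.append(hm)
    return outs

A = np.array([[0., 2.], [3., 0.]])       # the 2-node example from the text
h = np.ones((2, 1))
h1, h2 = residual_gcn_outputs(A, h, 2)
print(h1.ravel())         # first moment (degrees)             -> [2. 3.]
print(h2.ravel())         # second moment                      -> [6. 6.]
print((h1 + h2).ravel())  # mixed moment sum_j (A + A^2)_ij    -> [8. 9.]
```

Without the residual connections, only h^(n) would reach the output, which is exactly the failure mode of Proposition 1 for mixed order moments.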
The above results lead to the following theorem, which guarantees the representation power of multi-layer GCNs with respect to learning graph moments.

Figure 4: Test loss over the number of epochs for learning first (top), second (middle) and third (bottom) order graph moments M_p(A) = Σ_j (A^p)_ij, with varying numbers of layers and different activation functions. A multi-layer GCN with residual connections is capable of learning the graph moments when the number of layers is at least the target order of the graph moments. The graphs are from our synthetic graph dataset described in Sec. 6.

Theorem 2. With the number of layers n greater than or equal to the order p of a graph moment M_p(A), graph convolutional networks with residual connections can learn the graph moment M_p with O(p) neurons, independent of the size of the graph.

Theorem 2 suggests that the representation power of a GCN has a strong dependence on the number of layers (depth) rather than on the size of the graph (width). It also highlights the importance of residual connections. By introducing residual connections into multiple GCN layers, we can learn any polynomial function of graph moments with linear activation. Interestingly, the Graph Isomorphism Network (GIN) proposed in [34] uses the following propagation rule:

    F(A, h) = σ([(1 + ε)I + A] · h · W)    (10)

which is a special case of our GCN with one residual connection between two modules.

4 Modular GCN Design

In order to overcome the limitation of GCNs in learning graph moments, we take a modular approach to GCN design. We treat different GCN propagation rules as different "modules" and consider three important GCN modules: (1) f_1 = A [22], (2) f_2 = D^{-1}A [16], and (3) f_3 = D^{-1/2}AD^{-1/2} [20]. Figure 3a) shows the design of a single GCN layer where we combine three different GCN modules.
The outputs of the modules are concatenated and fed into a node-wise FC layer. Note that our design is different from the multi-head attention mechanism in the Graph Attention Network [31], which uses the same propagation rule for all the modules.

Figure 3: GCN layer (a), using three different propagation rules (A, D^{-1}A, D^{-1/2}AD^{-1/2}) followed by concatenation and a node-wise FC layer. Using residual connections (b) allows an n-layer modular GCN to learn any polynomial function of order n of its constituent operators.

However, simply stacking GCN layers on top of each other in a feed-forward fashion is quite restrictive, as shown by our theoretical analysis for multi-layer GCNs. Different propagation rules cannot be written as Taylor expansions of each other, while all of them are important in modeling the graph generation process. Hence, no matter how many layers or how non-linear the activation function gets, a multi-layer GCN stacked in a feed-forward way cannot learn network moments whose order is not precisely the number of layers. If we add residual connections from the output of every layer to the final aggregation layer, we are able to approximate any polynomial function of graph moments. Figure 3b) shows the design of a multi-layer GCN with residual connections. We stack the modular GCN layers on top of each other and concatenate the residual connections from every layer. The final layer aggregates the outputs from all previous layers, including residual connections.

We measure the representation power of our GCN design in learning different orders of graph moments M_p(A) = Σ_j (A^p)_ij with p = 1, 2, 3. Figure 4 shows the test loss over the number of epochs for learning first (top), second (middle) and third (bottom) order graph moments. We vary the number of layers from 1 to 4 and test with different activation functions including linear, ReLU, sigmoid and tanh. Consistent with the theoretical analysis, we observe that whenever the number of layers is at least the target order of the graph moments, a multi-layer GCN with residual connections is capable of learning the graph moments.
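A single modular layer of Fig. 3a can be sketched in a few lines. This is an illustrative numpy version (not the authors' released implementation), assuming all node degrees are nonzero:

```python
import numpy as np

def modular_gcn_layer(A, h, W_fc, sigma=lambda x: x):
    """One modular GCN layer (Fig. 3a): apply the three propagation rules
    f1 = A, f2 = D^-1 A, f3 = D^-1/2 A D^-1/2 in parallel, concatenate
    their outputs along the feature axis, and mix them with a node-wise
    FC layer (the same weights applied at every node).

    W_fc has shape (3 * d, d_out), where d is the feature width of h.
    """
    deg = A.sum(axis=1)
    f1 = A
    f2 = np.diag(1.0 / deg) @ A
    f3 = np.diag(1.0 / np.sqrt(deg)) @ A @ np.diag(1.0 / np.sqrt(deg))
    concat = np.concatenate([f @ h for f in (f1, f2, f3)], axis=1)  # N x 3d
    return sigma(concat @ W_fc)

A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
h = np.ones((3, 1))
W_fc = np.ones((3, 1))  # d = 1: the FC layer simply sums the three branches
out = modular_gcn_layer(A, h, W_fc)
print(out.shape)  # -> (3, 1)
```

Because the three operators are not polynomial functions of one another, keeping all three branches (rather than any single one) is what gives the layer access to moments of A, Â and Â_s simultaneously.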
Interestingly, Jumping Knowledge (JK) Networks [35] showed similar effects of adding residual connections for message-passing graph neural networks.

Our modular approach demonstrates the importance of architectural design when using specialized neural networks. Due to permutation invariance, feed-forward GCNs are quite limited in their representation power and can fail at learning graph topology. However, with careful design, including different propagation rules and residual connections, it is possible to improve the representation power of GCNs in order to capture higher order graph moments while preserving permutation invariance.

5 Related Work

Graph Representation Learning. There has been increasing interest in deep learning on graphs; see e.g. the many recent surveys of the field [7, 38, 33]. Graph neural networks [22, 20, 17] can learn complex representations of graph data. For example, Hopfield networks [28, 22] propagate the hidden states to a fixed point and use the steady-state representation as the embedding for a graph; graph convolutional networks [8, 20] generalize the convolution operation from convolutional neural networks to learn from geometric objects beyond regular grids. [21] proposes a deep architecture for long-term forecasting of spatiotemporal graphs. [37] learns representations for generating random graphs sequentially, using an adversarial loss at each step. Despite practical success, deep understanding and theoretical analysis of graph neural networks is still largely lacking.

Expressiveness of Neural Networks. Early results on the expressiveness of neural networks take a highly theoretical approach, from using functional analysis to show universal approximation results [19], to studying network VC dimension [3]. While these results provide theoretically general conclusions, they mostly focus on single-layer, shallow networks.
For deep fully connected networks, several recent papers have focused on understanding the benefits of depth for neural networks [11, 29, 28, 27] with specific choices of weights. For graph neural networks, [34, 24, 25] prove the equivalence of a graph neural network with the Weisfeiler-Lehman graph isomorphism test given an infinite number of hidden layers. [32] analyzes the generalization and stability of GCNs, which depend on the eigenvalues of the graph filters. However, their analysis is limited to a single-layer GCN in the semi-supervised learning setting. Most recently, [10] demonstrates the equivalence between infinitely wide multi-layer GNNs and Graph Neural Tangent Kernels, which enjoy polynomial sample complexity guarantees.

Distinguishing Graph Generation Models. Understanding random graph generation processes has been a long-lasting interest of network analysis. Characterizing the similarities and differences of generation models has applications in, for example, graph classification: categorizing collections of graphs based on either node attributes or graph topology. Traditional graph classification approaches rely heavily on feature engineering and hand-designed similarity measures [30, 15]. Several recent works propose to leverage deep architectures [6, 36, 9] and learn graph similarities at the representation level. In this work, instead of proposing yet another deep architecture for graph classification, we provide insights into the representation power of GCNs using well-known generation models. Our insights can provide guidance for choosing similarity measures in graph classification.

6 Graph Stethoscope: Distinguishing Graph Generation Models

An important application of learning graph moments is to distinguish different random graph generation models. For random graph generation processes like the BA model, the asymptotic behavior (N → ∞) is known, such as being scale-free.
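One generation process used below as a hard baseline is the configuration model [26], which rewires all edges while preserving every node's degree. A minimal stub-matching sketch (ours, for illustration; self-loops and multi-edges are possible, as in the standard multigraph formulation):

```python
import numpy as np

def configuration_model(degrees, rng):
    """Stub-matching configuration model: each node i gets degrees[i]
    half-edges ("stubs"); stubs are then paired uniformly at random.
    The pairing preserves the degree sequence exactly.
    """
    stubs = np.repeat(np.arange(len(degrees)), degrees)
    rng.shuffle(stubs)
    A = np.zeros((len(degrees), len(degrees)))
    for u, v in stubs.reshape(-1, 2):  # pair consecutive stubs
        A[u, v] += 1
        A[v, u] += 1
    return A

# "Fake" BA: keep a real graph's degree sequence but rewire all edges.
rng = np.random.default_rng(0)
degrees = np.array([3, 2, 2, 2, 1])  # must sum to an even number
A_fake = configuration_model(degrees, rng)
print(A_fake.sum(axis=1))  # degree sequence preserved -> [3. 2. 2. 2. 1.]
```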
However, when the number of nodes is small, it is generally difficult to distinguish collections of graphs with different graph topology if the generation process is random. Thus, building an efficient tool that can probe the structure of small graphs of N < 50, like a stethoscope, is highly challenging, especially when all the graphs have the same number of nodes and edges.

BA vs. ER. We consider two tasks for the graph stethoscope. In the first setting, we generate 5,000 graphs with the same number of nodes and a varying number of edges, half of which are from the Barabási-Albert (BA) model and the other half from the Erdős-Rényi (ER) model. In the BA model, a new node attaches to m existing nodes with a likelihood proportional to the degree of the existing nodes. The 2,500 BA graphs are evenly split among m = 1, N/8, N/4, 3N/8, N/2. To avoid bias from the order of appearance of nodes caused by preferential attachment, we shuffle the node labels. ER graphs are random undirected graphs with a probability p for generating every edge. We choose four values of p uniformly between 1/N and N/2. All graphs have a similar number of edges.

BA vs. Configuration Model. One might argue that distinguishing BA from ER for small graphs is easy, as BA graphs are known to have a power-law distribution for the node degrees [1], and ER graphs have a Poisson degree distribution. Hence, we create a much harder task where we compare BA graphs with "fake" BA graphs in which the nodes have the same degrees but all edges are rewired using the configuration model [26] (Config.). The resulting graphs share exactly the same degree distribution. We also find that the higher graph moments of Config. BA graphs are difficult to distinguish from those of real BA graphs, despite the Config.
model not fixing these moments. Distinguishing BA and Config BA is very difficult using standard methods such as the Kolmogorov-Smirnov (KS) test. The KS test measures the distributional differences of a statistical measure between two graphs and uses hypothesis testing to identify the graph generation model. Figure 5 shows the KS test values for pairs of real-real BA (blue) and pairs of real-fake BA (orange) w.r.t. different graph moments. The dashed black lines show the mean of the KS test values for real-real pairs. We observe that the distributions of differences in real-real pairs are almost the same as those of real-fake pairs, meaning the variability in graph moments among real BA graphs is almost the same as that between real and Config BA graphs.

Table 1: Test accuracy with different module combinations for BA-ER. f1 = A, f2 = D^{-1}A, and f3 = D^{-1/2} A D^{-1/2}.

Modules       Accuracy
f1            53.5 %
f3            76.9 %
f1, f3        89.4 %
f1, f2, f3    98.8 %

Figure 5: Distribution of Kolmogorov-Smirnov (KS) test values for differences between the first four graph moments Σ_i (A^p)_ij in the dataset. "real-real" shows the distribution of KS test values when comparing the graph moments of two real instances of the BA model. All graphs have N = 30 nodes, but a varying number of links. The "real-fake" case does the KS test for one real BA against one fake BA created using the configuration model.

Classification Using our GCN Module We evaluate the classification accuracy for these two settings using the modular GCN design, and analyze the trends of representation power w.r.t. network depth and width, as well as the number of nodes in the graph. Our architecture consists of layers of our GCN module (Fig. 3, linear activation). The output is passed to a fully connected layer with softmax activation, yielding an N × c matrix (N nodes in the graph, c label classes). The final classification is found by mean-pooling over the N node-level outputs, after which the pooled score is passed to the classification layer.

Figure 6: Classifying graphs of the Barabási-Albert model vs. the Erdős-Rényi model (top) and the Barabási-Albert model vs. the configuration model (bottom). Left: test accuracy with respect to network depth for different numbers of nodes (N) and numbers of units (U). Right: test accuracy with respect to graph size for different numbers of layers (L) and numbers of units (U).

The left column of Figure 6 shows the accuracy with an increasing number of layers for different numbers of nodes and hidden units. We find that depth is more influential than width: adding one layer can improve the test accuracy by at least 5%, whereas increasing the width has very little effect. The right column is an alternative view with increasing graph size. It is clear that smaller graphs are harder to classify, while N ≥ 50 nodes is enough for 100% accuracy in the BA-ER case. BA-Config is a much harder task, with a highest accuracy of 90%.

We also conduct an ablation study of our modular GCN design. Table 1 shows the change in test accuracy when we use different combinations of modules. Note that the number of parameters is kept the same for all designs.
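The three propagation rules of Table 1 can be combined in a single modular layer as in the following numpy sketch, with linear activations, per-node softmax scores, and a mean-pooling readout as described above. The weight shapes, constant initial node features, and function names are our own assumptions for illustration, not the paper's released code:

```python
import numpy as np

def propagation_ops(A):
    """The three propagation rules: f1 = A, f2 = D^-1 A, f3 = D^-1/2 A D^-1/2."""
    d = A.sum(axis=1)
    d_inv = np.where(d > 0, 1.0 / d, 0.0)      # guard isolated nodes
    d_is = np.sqrt(d_inv)
    return [A, d_inv[:, None] * A, d_is[:, None] * A * d_is[None, :]]

def modular_layer(A, H, Ws):
    """One modular GCN layer with linear activation: sum_k f_k(A) H W_k."""
    return sum(f @ H @ W for f, W in zip(propagation_ops(A), Ws))

def classify(A, layer_weights, W_out):
    """Stack modular layers, take per-node softmax, then mean-pool over nodes."""
    H = np.ones((A.shape[0], 1))               # constant initial node features
    for Ws in layer_weights:
        H = modular_layer(A, H, Ws)
    logits = H @ W_out                         # N x c node-level scores
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)   # per-node softmax
    return probs.mean(axis=0)                  # graph-level class probabilities

# Tiny usage example: a 2-layer, 5-unit network on a random 10-node graph.
rng = np.random.default_rng(0)
U = np.triu(rng.random((10, 10)) < 0.3, k=1)
A = (U + U.T).astype(float)
weights = [[rng.normal(size=(1, 5)) for _ in range(3)],
           [rng.normal(size=(5, 5)) for _ in range(3)]]
p = classify(A, weights, W_out=rng.normal(size=(5, 2)))
assert np.isclose(p.sum(), 1.0)                # a valid class distribution
```

Because each rule is applied with its own learned weight matrix and the results are summed, dropping a rule (as in the ablation rows of Table 1) simply removes one term from the sum.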
We can see that a single module is not enough to distinguish graph generation models: its accuracy is close to random guessing. Having all three modules with different propagation rules leads to almost perfect discrimination between BA and ER graphs. This demonstrates the benefit of combining GCN modules to improve the representation power of the network.

7 Conclusion

We conduct a thorough investigation into what can and cannot be learned by GCNs. We focus on graph moments, a key characteristic of graph topology. We find that GCNs are rather restrictive in learning graph moments: without careful design, multi-layer GCNs cannot learn graph moments even with nonlinear activations. Our theoretical analysis suggests a modular approach to designing graph neural networks that preserves permutation invariance. Modular GCNs are capable of distinguishing different graph generation models for surprisingly small graphs. Our investigation suggests that, for learning graph moments, depth is much more influential than width: deeper GCNs are more capable of learning higher-order graph moments. Our experiments also highlight the importance of combining
Our experiments also highlight the importance of combining\nGCN modules with residual connections in improving the representation power of GCNs.\n\nAcknowledgments\n\nThis work was supported in part by NSF #185034, ONR-OTA (N00014-18-9-0001).\n\n9\n\n12345# layers0.8250.8500.8750.9000.9250.9500.9751.000Test accuracyLinear GCN activation, BA vs ERN=10, u=1N=10, u=3N=10, u=5N=15, u=1N=15, u=3N=15, u=5N=20, u=1N=20, u=3N=20, u=5N=30, u=1N=30, u=3N=30, u=5N=50, u=1N=50, u=3N=50, u=5101520253035404550# of nodes in graph0.8000.8250.8500.8750.9000.9250.9500.9751.000Test AccuracyAccuracy of GCN modular architectures in classifying BA vs ERL=1, u=1L=1, u=3L=1, u=5L=2, u=1L=2, u=3L=2, u=5L=3, u=1L=3, u=3L=3, u=5L=4, u=1L=4, u=3L=4, u=512345# layers0.650.700.750.800.85Test accuracyLinear GCN activation, BA vs Configuration Model, prunedN=10, u=1N=10, u=3N=10, u=5N=20, u=1N=20, u=3N=20, u=5N=30, u=1N=30, u=3N=30, u=5N=50, u=1N=50, u=3N=50, u=5101520253035404550# of nodes in graph0.650.700.750.800.85Test AccuracyAccuracy of GCN modular architectures in classifying BA vs Config. ModelL=1, u=1L=1, u=3L=1, u=5L=2, u=1L=2, u=3L=2, u=5L=3, u=1L=3, u=3L=3, u=5L=4, u=1L=4, u=3L=4, u=5\fReferences\n\n[1] R\u00e9ka Albert and Albert-L\u00e1szl\u00f3 Barab\u00e1si. Statistical mechanics of complex networks. Reviews\n\nof modern physics, 74(1):47, 2002.\n\n[2] Andrew R Barron. Approximation and estimation bounds for arti\ufb01cial neural networks. Machine\n\nlearning, 14(1):115\u2013133, 1994.\n\n[3] Peter L Bartlett. The sample complexity of pattern classi\ufb01cation with neural networks: the size\nof the weights is more important than the size of the network. IEEE transactions on Information\nTheory, 44(2):525\u2013536, 1998.\n\n[4] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius\nZambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan\nFaulkner, et al. Relational inductive biases, deep learning, and graph networks. 
arXiv preprint\narXiv:1806.01261, 2018.\n\n[5] John Adrian Bondy and Uppaluri Siva Ramachandra Murty. Graph theory, volume 244 of.\n\nGraduate texts in Mathematics, 2008.\n\n[6] Stephen Bonner, John Brennan, Georgios Theodoropoulos, Ibad Kureshi, and Andrew Stephen\nMcGough. Deep topology classi\ufb01cation: A new approach for massive graph classi\ufb01cation. In\n2016 IEEE International Conference on Big Data (Big Data), pages 3290\u20133297. IEEE, 2016.\n\n[7] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.\nGeometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine,\n34(4):18\u201342, 2017.\n\n[8] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally\n\nconnected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.\n\n[9] James P Canning, Emma E Ingram, Sammantha Nowak-Wolff, Adriana M Ortiz, Nesreen K\nAhmed, Ryan A Rossi, Karl RB Schmitt, and Sucheta Soundarajan. Predicting graph categories\nfrom structural properties. arXiv preprint arXiv:1805.02682, 2018.\n\n[10] Simon S Du, Kangcheng Hou, Barnab\u00e1s P\u00f3czos, Ruslan Salakhutdinov, Ruosong Wang, and\nKeyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels.\narXiv preprint arXiv:1905.13192, 2019.\n\n[11] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In\n\nConference on learning theory, pages 907\u2013940, 2016.\n\n[12] Steven Farrell, Paolo Cala\ufb01ura, Mayur Mudigonda, Dustin Anderson, Jean-Roch Vlimant,\nStephan Zheng, Josh Bendavid, Maria Spiropulu, Giuseppe Cerati, Lindsey Gray, et al. Novel\ndeep learning methods for track reconstruction. arXiv preprint arXiv:1810.06111, 2018.\n\n[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural\nmessage passing for quantum chemistry. In Proceedings of the 34th International Conference\non Machine Learning-Volume 70, pages 1263\u20131272. JMLR. 
org, 2017.\n\n[14] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance:\n\nA survey. Knowledge-Based Systems, 151:78\u201394, 2018.\n\n[15] Ting Guo and Xingquan Zhu. Understanding the roles of sub-graph features for graph classi\ufb01ca-\ntion: an empirical study perspective. In Proceedings of the 22nd ACM international conference\non Information & Knowledge Management, pages 817\u2013822. ACM, 2013.\n\n[16] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large\n\ngraphs. In Advances in Neural Information Processing Systems, pages 1024\u20131034, 2017.\n\n[17] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods\n\nand applications. arXiv preprint arXiv:1709.05584, 2017.\n\n[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[19] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks,\n\n4(2):251\u2013257, 1991.\n\n[20] Thomas N Kipf and Max Welling. Semi-supervised classi\ufb01cation with graph convolutional\n\nnetworks. arXiv preprint arXiv:1609.02907, 2016.\n\n10\n\n\f[21] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural net-\nwork: Data-driven traf\ufb01c forecasting. In International Conference on Learning Representations\n(ICLR), 2018.\n\n[22] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural\n\nnetworks. arXiv preprint arXiv:1511.05493, 2015.\n\n[23] Yaw-Ling Lin and Steven S Skiena. Algorithms for square roots of graphs. SIAM Journal on\n\nDiscrete Mathematics, 8(1):99\u2013118, 1995.\n\n[24] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen,\nGaurav Rattan, and Martin Grohe. 
Weisfeiler and leman go neural: Higher-order graph neural\nnetworks. In Proceedings of the AAAI Conference on Arti\ufb01cial Intelligence, volume 33, pages\n4602\u20134609, 2019.\n\n[25] Ryan Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational\npooling for graph representations. In International Conference on Machine Learning, pages\n4663\u20134673, 2019.\n\n[26] Mark Newman. Networks: an introduction. Oxford university press, 2010.\n\n[27] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the\nexpressive power of deep neural networks. In Proceedings of the 34th International Conference\non Machine Learning-Volume 70, pages 2847\u20132854. JMLR. org, 2017.\n\n[28] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.\nThe graph neural network model. IEEE Transactions on Neural Networks, 20(1):61\u201380, 2009.\n\n[29] Matus Telgarsky. Bene\ufb01ts of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.\n\n[30] Johan Ugander, Lars Backstrom, and Jon Kleinberg. Subgraph frequencies: Mapping the\nIn Proceedings of the 22nd\n\nempirical and extremal geography of large graph collections.\ninternational conference on World Wide Web, pages 1307\u20131318. ACM, 2013.\n\n[31] Petar Veli\u02c7ckovi\u00b4c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua\n\nBengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.\n\n[32] Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolutional neural\n\nnetworks. arXiv preprint arXiv:1905.01004, 2019.\n\n[33] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A\n\ncomprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.\n\n[34] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural\n\nnetworks? 
arXiv preprint arXiv:1810.00826, 2018.\n\n[35] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and\nStefanie Jegelka. Representation learning on graphs with jumping knowledge networks. arXiv\npreprint arXiv:1806.03536, 2018.\n\n[36] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th\nACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages\n1365\u20131374. ACM, 2015.\n\n[37] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph convolutional\npolicy network for goal-directed molecular graph generation. arXiv preprint arXiv:1806.02473,\n2018.\n\n[38] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. arXiv preprint\n\narXiv:1812.04202, 2018.\n\n11\n\n\f", "award": [], "sourceid": 8876, "authors": [{"given_name": "Nima", "family_name": "Dehmamy", "institution": "Northeastern University"}, {"given_name": "Albert-Laszlo", "family_name": "Barabasi", "institution": "Northeastern University"}, {"given_name": "Rose", "family_name": "Yu", "institution": "Northeastern University"}]}