{"title": "Mean-field theory of graph neural networks in graph partitioning", "book": "Advances in Neural Information Processing Systems", "page_first": 4361, "page_last": 4371, "abstract": "A theoretical performance analysis of the graph neural network (GNN) is presented. For classification tasks, the neural network approach has the advantage in terms of flexibility that it can be employed in a data-driven manner, whereas Bayesian inference requires the assumption of a specific model. A fundamental question is then whether GNN has a high accuracy in addition to this flexibility. Moreover, whether the achieved performance is predominately a result of the backpropagation or the architecture itself is a matter of considerable interest. To gain a better insight into these questions, a mean-field theory of a minimal GNN architecture is developed for the graph partitioning problem. This demonstrates a good agreement with numerical experiments.", "full_text": "Mean-field theory of graph neural networks in graph partitioning\n\nTatsuro Kawamoto, Masashi Tsubaki\nArtificial Intelligence Research Center,\nNational Institute of Advanced Industrial Science and Technology,\n2-3-26 Aomi, Koto-ku, Tokyo, Japan\n{kawamoto.tatsuro, tsubaki.masashi}@aist.go.jp\n\nTomoyuki Obuchi\nDepartment of Mathematical and Computing Science, Tokyo Institute of Technology,\n2-12-1 Ookayama Meguro-ku Tokyo, Japan\nobuchi@c.titech.ac.jp\n\nAbstract\n\nA theoretical performance analysis of the graph neural network (GNN) is presented. For classification tasks, the neural network approach has the advantage in terms of flexibility that it can be employed in a data-driven manner, whereas Bayesian inference requires the assumption of a specific model. A fundamental question is then whether GNN has a high accuracy in addition to this flexibility. 
Moreover, whether the achieved performance is predominantly a result of backpropagation or of the architecture itself is a matter of considerable interest. To gain better insight into these questions, a mean-field theory of a minimal GNN architecture is developed for the graph partitioning problem. The theory shows good agreement with numerical experiments.\n\n1 Introduction\n\nDeep neural networks have received significant attention across many tasks in machine learning, and a plethora of models and algorithms have been proposed in recent years. The application of the neural network approach to problems on graphs is no exception and is being actively studied, with applications including social networks and chemical compounds [1, 2]. A neural network model on graphs is termed a graph neural network (GNN) [3]. While excellent performances of GNNs have been reported in the literature, many of these results rely on experimental studies, and seem to be based on the blind belief that the nonlinear nature of GNNs leads to such strong performances. However, when a deep neural network outperforms other methods, the factors that are really essential should be clarified: Is this thanks to the learning of model parameters, e.g., through backpropagation [4], or rather the architecture of the model itself? Is the choice of the architecture predominantly crucial, or would even a simple choice perform sufficiently well? Moreover, does the GNN generically outperform other methods?\nTo obtain a better understanding of these questions, not only is empirical knowledge based on benchmark tests required, but also theoretical insights. To this end, we develop a mean-field theory of GNN, focusing on a problem of graph partitioning. The problem concerns a GNN with random model parameters, i.e., an untrained GNN. 
If the architecture of the GNN itself is essential, then the untrained GNN should already perform well. On the other hand, if the fine-tuning of the model parameters via learning is crucial, then the result for the untrained GNN is still useful as a baseline for observing the extent to which the performance is improved.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Architecture of the GNN considered in this paper.\n\nTable 1: Comparison of various methods under the framework of Eq. (2).\n\nalgorithm | domain | M | \u03d5(x) | \u03c6(x) | W^t | b^t | {W^t, b^t} update\nuntrained GNN | V | A | tanh | I(x) | random | omitted | not trained\ntrained GNN [5] | V | I \u2212 L | ReLU | I(x) | trained | omitted | trained via backprop.\nSpectral method | V | L | I(x) | I(x) | QR | / | updated at each layer\nEM + BP | E\u20d7 | B | softmax | log(x) | learned | learned | learned via M-step\n\nFor a given graph G = (V, E), where V is the set of vertices and E is the set of (undirected) edges, the graph partitioning problem involves assigning one out of K group labels to each vertex. Throughout this paper, we restrict ourselves to the case of two groups (K = 2). The problem setting for graph partitioning is relatively simple compared with other GNN applications. Thus, it is suitable as a baseline for more complicated problems. There are two types of graph partitioning problem: One is to find the best partition for a given graph under a certain objective function. The other is to assume that a graph is generated by a statistical model, and infer the planted (i.e., preassigned) group labels of the generative model. 
Herein, we consider the latter problem.\nBefore moving on to the mean-field theory, we first clarify the algorithmic relationship between GNN and other methods of graph partitioning.\n\n2 Graph neural network and its relationship to other methods\n\nThe goal of this paper is to examine the graph partitioning performance using a minimal GNN architecture. To this end, we consider a GNN with the following feedforward dynamics. Each vertex is characterized by a D-dimensional feature vector whose elements are x_{i\u03bc} (i \u2208 V, \u03bc \u2208 {1, ..., D}), and the state matrix X = [x_{i\u03bc}] obeys\n\nx^{t+1}_{i\u03bc} = \u2211_{j\u03bd} A_{ij} \u03d5(x^t_{j\u03bd}) W^t_{\u03bd\u03bc} + b^t_{i\u03bc}. (1)\n\nThroughout this paper, the layers of the GNN are indexed by t \u2208 {0, ..., T}. Furthermore, \u03d5(x) is a nonlinear activation function, b^t = [b^t_{i\u03bc}] is a bias term (footnote 1), and W^t = [W^t_{\u03bd\u03bc}] is a linear transform that operates on the feature space. To infer the group assignments, a D \u00d7 K matrix W^out of the readout classifier is applied at the end of the last layer. Although there is no restriction on \u03d5 in our mean-field theory, we adopt \u03d5 = tanh as a specific choice in the experiment below. As there is no detailed attribute on each vertex, the initial state X^0 is set to be uniformly random, and the adjacency matrix A is the only input in the present case. For graphs with vertex attributes, deep neural networks that utilize such information [6, 7, 8] have also been proposed.\n\nFootnote 1: The bias term is only included in this section, and will be omitted in later sections.\n\nThe set of bias terms {b^t}, the linear transforms {W^t} in the intermediate layers, and W^out for the readout classifier are initially set to be random. These are updated through backpropagation using the training set. In the semi-supervised setting, (real-world) graphs in which only a portion of the vertices are correctly labeled are employed as a training set. On the other hand, in the case of unsupervised learning, graph instances of a statistical model can be employed as the training set.\nThe GNN architecture described above can be thought of as a special case of the following more general form:\n\nx^{t+1}_{i\u03bc} = \u2211_j M_{ij} \u03c6( \u2211_\u03bd \u03d5(x^t_{j\u03bd}) W^t_{\u03bd\u03bc} ) + b^t_{i\u03bc}, (2)\n\nwhere \u03c6(x) is another activation function. With the different choices for the matrix and activation functions shown in Table 1, various algorithms can be obtained. Equation (1) is recovered by setting M_{ij} = A_{ij} and \u03c6(x) = I(x) (where I(x) is the identity function), while [5] employed a Laplacian-like matrix M = I \u2212 L = D^{\u22121/2} A D^{\u22121/2}, where D^{\u22121/2} \u2261 diag(1/\u221ad_1, ..., 1/\u221ad_N) (d_i is the degree of vertex i) and L is called the normalized Laplacian [9].\nThe spectral method using the power iteration is obtained in the limit where \u03d5(x) and \u03c6(x) are linear and b^t is absent (footnote 2). For the simultaneous power iteration that extracts the leading K eigenvectors, the state matrix X^0 is set as an N \u00d7 K matrix whose column vectors are mutually orthogonal. While the normalized Laplacian L is commonly adopted for M, there are several other choices [9, 10, 11] (footnote 3). At each iteration, the orthogonality condition is maintained via QR decomposition [12], i.e., for Z_t := M X^t, X^{t+1} = Z_t R_t^{\u22121}, where R_t^{\u22121} acts as W^t. R_t is a D \u00d7 D upper triangular matrix that is obtained by the QR decomposition of Z_t. Therefore, rather than being trained, W^t is determined at each layer based on the current state.\nThe belief propagation (BP) algorithm (also called the message passing algorithm) in Bayesian inference also falls under the framework of Eq. (2). While the domain of the state consists of the vertices (i, j \u2208 V) for GNNs, this algorithm deals with the directed edges i \u2192 j \u2208 E\u20d7, where E\u20d7 is obtained by putting directions to every undirected edge. In this case, the state x^t_{\u03c3, i\u2192j} represents the logarithm of the marginal probability that vertex i belongs to the group \u03c3 with the missing information of vertex j at the t-th iteration. With the choice of matrix and activation functions shown in Table 1 (EM+BP), Eq. (2) becomes exactly the update equation of the BP algorithm (footnote 4) [13]. The matrix M = B = [B_{j\u2192k, i\u2192j}] is the so-called non-backtracking matrix [14], and the softmax function represents the normalization of the state x^t_{\u03c3, i\u2192j}.\n\nThe BP algorithm requires the model parameters W^t and b^t as inputs. For example, when the expectation-maximization (EM) algorithm is considered, the BP algorithm comprises half (the E-step) of the algorithm. The parameter learning of the model is conducted in the other half (the M-step), which can be performed analytically using the current result of the BP algorithm. Here, W^t and b^t are the estimates of the so-called density matrix (or affinity matrix) and the external field resulting from messages from non-edges [13], respectively, and are common for every t. Therefore, the differences between the EM algorithm and GNN are summarized as follows. 
While there is an analogy between the inference procedures, in the EM algorithm the parameter learning of the model is conducted analytically, at the expense of the restrictions of the assumed statistical model. On the other hand, in GNNs the learning is conducted numerically in a data-driven manner [15], for example by backpropagation. While we will shed light on the detailed correspondence in the case of graph partitioning here, the relationship between GNN and BP is also mentioned in [16, 17].\n\nFootnote 2: Alternatively, it can be regarded that diag(b^t_{i1}, ..., b^t_{iD}) is added to W^t when b^t has common rows.\nFootnote 3: A matrix const. \u00d7 I is sometimes added in order to shift the eigenvalues.\nFootnote 4: Precisely speaking, this is the BP algorithm in which the stochastic block model (SBM) is assumed as the generative model. The SBM is explained below.\n\n3 Mean-field theory of the detectability limit\n\n3.1 Stochastic block model and its detectability limit\n\nWe analyze the performance of an untrained GNN on the stochastic block model (SBM). This is a random graph model with a planted group structure, and is commonly employed as the generative model of an inference-based graph clustering algorithm [13, 18, 19, 20]. The SBM is defined as follows. We let |V| = N, and each of the vertices has a preassigned group label \u03c3 \u2208 {1, ..., K}, i.e., V = \u222a_\u03c3 V_\u03c3. We define V_\u03c3 as the set of vertices in a group \u03c3, \u03b3_\u03c3 \u2261 |V_\u03c3|/N, and \u03c3_i represents the planted group assignment of vertex i. For each pair of vertices i \u2208 V_\u03c3 and j \u2208 V_{\u03c3\u2032}, an edge is generated with probability \u03c1_{\u03c3\u03c3\u2032}, which is an element of the density matrix. Throughout this paper, we assume that \u03c1_{\u03c3\u03c3\u2032} = O(N^{\u22121}), so that the resulting graph has a constant average degree, or in other words the graph is sparse. We denote the average degree by c. 
Therefore, the adjacency matrix A = [A_{ij}] of the SBM is generated with probability\n\np(A) = \u220f_{i < j} \u03c1_{\u03c3_i \u03c3_j}^{A_{ij}} (1 \u2212 \u03c1_{\u03c3_i \u03c3_j})^{1 \u2212 A_{ij}}. (3)\n\nWhen \u03c1_{\u03c3\u03c3\u2032} is the same for all pairs of \u03c3 and \u03c3\u2032, the SBM is nothing but the Erd\u0151s-R\u00e9nyi random graph. Clearly, in this case no algorithm can infer the planted group assignments better than random chance. Interestingly, even when \u03c1_{\u03c3\u03c3\u2032} is not constant there exists a limit for the strength of the group structure, below which the planted group assignments cannot be inferred better than chance. This is called the detectability limit. Consider the SBM consisting of two equally-sized groups that is parametrized as \u03c1_{\u03c3\u03c3\u2032} = \u03c1_in for \u03c3 = \u03c3\u2032 and \u03c1_{\u03c3\u03c3\u2032} = \u03c1_out for \u03c3 \u2260 \u03c3\u2032, which is often referred to as the symmetric SBM. In this case, it is rigorously known [21, 22, 23] that detection is impossible by any algorithm for the SBM with a group structure weaker than\n\n\u03f5 = (\u221ac \u2212 1)/(\u221ac + 1), (4)\n\nwhere \u03f5 \u2261 \u03c1_out/\u03c1_in. However, it is important to note that this is the information-theoretic limit, and the achievable limit for a specific algorithm may not coincide with Eq. (4). For this reason, we investigate the algorithmic detectability limit of the GNN here.\n\n3.2 Dynamical mean-field theory\n\nIn an untrained GNN, each element of the matrix W^t is randomly determined according to the Gaussian distribution at each t, i.e., W^t_{\u03bd\u03bc} \u223c N(0, 1/D). We assume that the feature dimension D is sufficiently large, but D/N \u226a 1. 
Let us consider a state x^t_{\u03c3\u03bc} that represents the average state within a group, i.e., x^t_{\u03c3\u03bc} \u2261 (\u03b3_\u03c3 N)^{\u22121} \u2211_{i \u2208 V_\u03c3} x^t_{i\u03bc}. The probability distribution of the group-averaged state at the next layer, x^{t+1} = [x^{t+1}_{\u03c3\u03bc}], is expressed as\n\nP(x^{t+1}) = \u27e8 \u220f_{\u03c3\u03bc} \u03b4( x^{t+1}_{\u03c3\u03bc} \u2212 (1/(\u03b3_\u03c3 N)) \u2211_{i \u2208 V_\u03c3} \u2211_{j\u03bd} A_{ij} \u03d5(x^t_{j\u03bd}) W^t_{\u03bd\u03bc} ) \u27e9_{A, W^t, X^t}, (5)\n\nwhere \u27e8...\u27e9_{A, W^t, X^t} denotes the average over the graph A, the random linear transform W^t, and the state X^t of the previous layer. Using the Fourier representation, the normalization condition of Eq. (5) is expressed as\n\n1 = \u222b Dx\u0302^{t+1} Dx^{t+1} e^{\u2212L_0} \u27e8 e^{L_1} \u27e9_{A, W^t, X^t}, (6)\nL_0 = \u2211_{\u03c3\u03bc} \u03b3_\u03c3 x\u0302^{t+1}_{\u03c3\u03bc} x^{t+1}_{\u03c3\u03bc},\nL_1 = (1/N) \u2211_{\u03bc\u03bd} \u2211_{ij} A_{ij} W^t_{\u03bd\u03bc} x\u0302^{t+1}_{\u03c3_i \u03bc} \u03d5(x^t_{j\u03bd}),\n\nwhere x\u0302^{t+1}_{\u03c3\u03bc} is an auxiliary variable that is conjugate to x^{t+1}_{\u03c3\u03bc}, and Dx\u0302^{t+1} Dx^{t+1} \u2261 \u220f_{\u03c3\u03bc} (\u03b3_\u03c3 dx\u0302^{t+1}_{\u03c3\u03bc} dx^{t+1}_{\u03c3\u03bc} / 2\u03c0i).\nAfter taking the average of the symmetric SBM over A as well as the average over W^t in the stationary limit with respect to t, the following self-consistent equation is obtained with respect to the covariance matrix C = [C_{\u03c3\u03c3\u2032}] of x = x^t:\n\nC_{\u03c3\u03c3\u2032} = (1/(\u03b3_\u03c3 \u03b3_{\u03c3\u2032})) \u2211_{\u03c3\u0303\u03c3\u0303\u2032} B_{\u03c3\u03c3\u0303} B_{\u03c3\u2032\u03c3\u0303\u2032} \u222b dx [ e^{\u2212(1/2) x^\u22a4 C^{\u22121} x} / ((2\u03c0)^{N/2} \u221a(det C)) ] \u03d5(x_{\u03c3\u0303}) \u03d5(x_{\u03c3\u0303\u2032}), (7)\n\nwhere B_{\u03c3\u03c3\u2032} \u2261 N \u03b3_\u03c3 \u03c1_{\u03c3\u03c3\u2032} \u03b3_{\u03c3\u2032}. The detailed derivation can be found in the supplemental material. The reader may notice that the above expression resembles the recursive equations in [24, 25]. However, it should be noted that Eq. (7) is not obtained as an exact closed equation. The derivation relies mainly on the assumption that the macroscopic random variable x^t dominates the behavior of the state X^t. Numerical experiments indicate that this assumption is plausible. This type of analysis is called dynamical mean-field theory (or the Martin-Siggia-Rose formalism) [26, 27, 28, 29].\n\nFigure 2: Performance of the untrained GNN evaluated by the behavior of the covariance matrix C. (a) Detectability phase diagram of the SBM. The solid line represents the mean-field estimate by Eq. (7), the dashed line represents the phase boundary of the spectral method (see Section 5 for details), and the dark shaded region represents the region above Eq. (4). The region containing points is that where the covariance gap C_{11} \u2212 C_{12} is significantly larger than the gaps in the information-theoretically undetectable region. (b) The curves of the covariances C_{11} and C_{12} of the SBMs with c = 8: The mean-field estimates (gray lines that have larger values) and curves obtained by the regression of the experimental data (purple lines with smaller values).\n\nWhen the correlation within a group C_{\u03c3\u03c3} is equal to the correlation between groups C_{\u03c3\u03c3\u2032} (\u03c3 \u2260 \u03c3\u2032), the GNN is deemed to have reached the detectability limit. Beyond the detectability limit, Eq. (7) is no longer a two-component equation, but is reduced to an equation with respect to the variance of one indistinguishable group.\nThe accuracy of our mean-field estimate is examined in Fig. 
2, using the covariance matrix C that we obtained directly from the numerical experiment. In the detectability phase diagram in Fig. 2a, the region that has a covariance gap C_{11} \u2212 C_{12} > \u03bc_g + 2\u03c3_g for at least one graph instance among 30 samples is indicated by the dark purple points, and the region with C_{11} \u2212 C_{12} > \u03bc_g + \u03c3_g is indicated by the pale purple points, where \u03bc_g and \u03c3_g are the mean and standard deviation, respectively, of the covariance gap in the information-theoretically undetectable region. In Fig. 2b, the elements of the covariance matrix are compared for the SBM with c = 8. The consistency of our mean-field estimate is examined for a specific implementation in Section 5.\n\n4 Normalized mutual information error function\n\nThe previous section dealt with the feedforward process of an untrained GNN. By employing a classifier such as the k-means method [30], W^out is not required, and the inference of the SBM can be performed without any training procedure. To investigate whether the training significantly improves the performance, the algorithm that updates the matrices {W^t} and W^out must be specified.\nThe cross-entropy error function is commonly employed to train W^out for a classification task. However, this error function unfortunately cannot be directly applied to the present case. Note that the planted group assignment of SBM is invariant under a global permutation of the group labels. In other words, as long as the set of vertices in the same group share the same label, the label itself can be anything. This is called the identifiability problem [31]. The cross-entropy error function is not invariant under a global label permutation, and thus the classifier cannot be trained unless the degrees of freedom are constrained. 
A possible brute force approach is to explicitly evaluate all the permutations in the error function [32], although this obviously results in a considerable computational burden unless the number of groups is very small. Note also that semi-supervised clustering does not suffer from the identifiability issue, because the permutation symmetry is explicitly broken.\nHere, we instead propose the use of the normalized mutual information (NMI) as an error function for the readout classifier. The NMI is a comparison measure of two group assignments, which naturally eliminates the permutation degrees of freedom. Let \u03c3 = {\u03c3_i = \u03c3 | i \u2208 V_\u03c3} be the labels of the planted group assignments, and \u03c3\u0302 = {\u03c3\u0302_i = \u03c3\u0302 | i \u2208 V_{\u03c3\u0302}} be the labels of the estimated group assignments. 
First, the (unnormalized) mutual information is defined as\n\nI(\u03c3, \u03c3\u0302) = \u2211_{\u03c3=1}^{K} \u2211_{\u03c3\u0302=1}^{K} P_{\u03c3\u03c3\u0302} log( P_{\u03c3\u03c3\u0302} / (P_\u03c3 P_{\u03c3\u0302}) ), (8)\n\nwhere the joint probability P_{\u03c3\u03c3\u0302} is the fraction of vertices that belong to the group \u03c3 in the planted assignment and the group \u03c3\u0302 in the estimated assignment. Furthermore, P_\u03c3 and P_{\u03c3\u0302} are the marginals of P_{\u03c3\u03c3\u0302}, and we let H(\u03c3) and H(\u03c3\u0302) be the corresponding entropies. The NMI is defined by\n\nNMI(\u03c3, \u03c3\u0302) \u2261 2 I(\u03c3, \u03c3\u0302) / (H(\u03c3) + H(\u03c3\u0302)). (9)\n\nWe adopt this measure as the error function for the readout classifier. For the resulting state x^T_{i\u03bc}, the estimated assignment probability p_{i\u03c3} that vertex i belongs to the group \u03c3 is defined as p_{i\u03c3} \u2261 softmax(a_{i\u03c3}), where a_{i\u03c3} = \u2211_\u03bc x_{i\u03bc} W^out_{\u03bc\u03c3}. Each element of the NMI is then obtained as\n\nP_{\u03c3\u03c3\u0302} = (1/N) \u2211_{i=1}^{N} P(i \u2208 V_\u03c3, i \u2208 V_{\u03c3\u0302}) = (1/N) \u2211_{i \u2208 V_\u03c3} p_{i\u03c3\u0302},\nP_\u03c3 = \u2211_{\u03c3\u0302} P_{\u03c3\u03c3\u0302} = \u03b3_\u03c3,\nP_{\u03c3\u0302} = \u2211_\u03c3 P_{\u03c3\u03c3\u0302} = (1/N) \u2211_{i=1}^{N} p_{i\u03c3\u0302}. (10)\n\nIn summary,\n\nNMI([P_{\u03c3\u03c3\u0302}]) = 2 ( 1 \u2212 [ \u2211_{\u03c3\u03c3\u0302} P_{\u03c3\u03c3\u0302} log P_{\u03c3\u03c3\u0302} ] / [ \u2211_\u03c3 \u03b3_\u03c3 log \u03b3_\u03c3 + \u2211_{\u03c3\u03c3\u0302} P_{\u03c3\u03c3\u0302} log \u2211_{\u03c3\u2032} P_{\u03c3\u2032\u03c3\u0302} ] ). (11)\n\nThis measure is permutation invariant, because the NMI counts the label co-occurrence patterns for each vertex in \u03c3 and \u03c3\u0302.\n\n5 Experiments\n\nFirst, the consistency between our mean-field theory and a specific implementation of an untrained GNN is examined. The performance of the untrained GNN is evaluated by drawing phase diagrams. For the SBMs with various values for the average degree c and the strength of group structure \u03f5, the overlap, i.e., the fraction of vertices that coincide with their planted labels, is calculated. Afterward, it is investigated whether a significant improvement is achieved through the parameter learning of the model. Note that because even a completely random clustering can correctly infer half of the labels on average, the minimum of the overlap is 0.5 (footnote 5). As mentioned above, we adopt \u03d5 = tanh as the specific choice of activation function.\n\n5.1 Untrained GNN with the k-means classifier\n\nWe evaluate the performance of the untrained GNN in which the resulting state X is read out using the k-means (more precisely k-means++ [30]) classifier. In this case, no parameter learning takes place. We set the dimension of the feature space to D = 100 and the number of layers to T = 100, and each result represents the average over 30 samples.\nFigure 3a presents the corresponding phase diagram. 
The overlap is indicated by colors, and the solid line represents the detectability limit estimated by Eq. (7). The dashed line represents the mean-field estimate of the detectability limit of the spectral method (footnote 6) [33, 34], and the shaded area represents the region above which the inference is information-theoretically impossible. It is known that the detectability limit of the BP algorithm [13] coincides with this information-theoretic limit so long as the model parameters are correctly learned. For the present model, it is also known that the EM algorithm can indeed learn these parameters [35]. Note that it is natural that a Bayesian method will outperform others as long as a consistent model is used, whereas it may perform poorly if the assumed model is not consistent.\n\nFootnote 5: For this reason, the overlap is sometimes standardized such that the minimum equals zero.\nFootnote 6: Again, there are several choices for the matrix to be adopted in the spectral method. However, the Laplacians and modularity matrix, for example, have the same detectability limit when the graph is regular or the average degree is sufficiently large.\n\nFigure 3: Performance of the untrained GNN using the k-means classifier. (a) The same detectability phase diagram as in Fig. 2. The heatmap represents the overlap obtained using the untrained GNN. (b) The overlaps of the SBM with c = 8: The light shaded area represents the region above the estimate using Eq. (7), the dashed line represents the detectability limit of the spectral method, and the dark shaded region represents the information-theoretically undetectable region.\n\nIt can be observed that our mean-field estimate exhibits good agreement with the numerical experiment. For a closer view, the overlaps of multiple graph sizes with c = 8 are presented in Fig. 3b. For c = 8, the estimate is \u03f5* \u2248 0.33, and this appears to coincide with the point at which the overlap is almost 0.5. 
It should be noted that the performance can vary depending on the implementation details. For example, while the k-means method is applied to X^T in the present experiment, it can instead be applied to \u03d5(X^T). An experiment concerning such a case is presented in the supplemental material.\n\n5.2 GNN with backpropagation and a trained classifier\n\nNow, we consider a trained GNN, and compare its performance with the untrained one. A set of SBM instances is provided as the training set. This consists of 1,000 SBM instances with N = 5,000, where an average degree c \u2208 {3, 4, ..., 10} and strength of the group structure \u03f5 \u2208 {0.05, 0.1, ..., 0.5} are adopted. For the validation (development) set, 100 graph instances of the same SBMs are provided. Finally, the SBMs with various values of \u03f5 and the average degree c = 8 are provided as the test set.\nWe evaluated the performance of a GNN trained by backpropagation. We implemented the GNN using Chainer (version 3.2.0) [36]. As in the previous section, the dimension of the feature space is set to D = 100, and various numbers of layers are examined. For the error function of the readout classifier, we adopted the NMI error function described in Section 4. The model parameters are optimized using the default setting of the Adam optimizer [37] in Chainer. Although we examined various optimization procedures for fine-tuning, the improvement was hardly observable.\nWe also employ residual networks (ResNets) [38] and batch normalization (BN) [39]. These are also adopted in [32]. The ResNet imposes skip (or shortcut) connections on a deep network, i.e., x^{t+1}_{i\u03bc} = \u2211_{j\u03bd} A_{ij} \u03d5(x^t_{j\u03bd}) W^t_{\u03bd\u03bc} + x^{t\u2212s}_{i\u03bc}, where s is the number of layers skipped, and is set as s = 5. The BN layer, which standardizes the distribution of the state X^t, is placed at each intermediate layer t. Finally, we note that the parameters of deep GNNs (e.g., T > 25) cannot be learned correctly without using the ResNet and BN techniques.\n\nFigure 4: Overlaps of the GNN with trained model parameters. (a) The overlaps of the GNN with various numbers of layers T on the SBM with c = 8 and N = 5,000. (b) The graph size dependence of the overlap of the GNN with T = 100 on the SBM with c = 8. In both cases, the shaded regions and dashed line are plotted in the same manner as in Fig. 3b.\n\nThe results using the GNN trained as above are illustrated in Fig. 4. First, it can be observed from Fig. 4a that a deep structure is important for a better accuracy. For sufficiently deep networks, the overlaps obtained by the trained GNN are clearly better than those of the untrained counterpart (see Fig. 3b). On the other hand, the region of \u03f5 where the overlap suddenly deteriorates still coincides with our mean-field estimate for the untrained GNN. 
This implies that in the limit N \u2192 \u221e, the detectability limit is not significantly improved by training. To demonstrate the finite-size effect in the result of Fig. 4a, the overlaps of various graph sizes are plotted in Fig. 4b. The variation of overlaps becomes steeper around \u03f5* \u2248 0.33 as the graph size is increased, implying the presence of a detectability phase transition around the value of \u03f5 predicted by our mean-field estimate.\nThe untrained and trained GNNs exhibit a clear difference in overlap when the readout classifier is applied to X^T. However, it should be noted that the untrained GNN in which the readout classifier is applied to \u03d5(X^T) exhibits a performance close to that of the trained GNN. The reader should also bear in mind that the computational cost required for training is not negligible.\n\n6 Discussion\n\nIn a minimal GNN model, the adjacency matrix A is employed for the connections between intermediate layers. In fact, there have been many attempts [5, 32, 40, 41] to adopt a more complex architecture rather than A. Furthermore, other types of applications of deep neural networks to graph partitioning or related problems have been described [6, 7, 42]. The number of GNN varieties can be arbitrarily extended by modifying the architecture and learning algorithm. Again, it is important to clarify which elements are essential for the performance.\nThe present study offers a baseline answer to this question. Our mean-field theory and numerical experiment using the k-means readout classifier clarify that an untrained GNN with a simple architecture already performs well. It is worth noting that our mean-field theory yields an accurate estimate of the detectability limit in a compact form. The learning of the model parameters by backpropagation does contribute to an improved accuracy, although this appears to be quantitatively insignificant. 
Importantly, the detectability limit appears to remain (almost) the same.\nThe minimal GNN that we considered in this paper is not the state of the art for the inference of\nthe symmetric SBM. However, as described in Section 2, an advantage of the GNN is its flexibility,\nin that the model can be learned in a data-driven manner. For a more complicated example, such\nas the graphs of chemical compounds in which each vertex has attributes, the GNN is expected to\ngenerically outperform other approaches. In such a case, the performance may be significantly\nimproved thanks to backpropagation. This would constitute an interesting direction for future work.\nIn addition, the adequacy of the NMI error function that we introduced for the readout classifier\nshould be examined in detail.\n\nAcknowledgments\n\nThe authors are grateful to Ryo Karakida for helpful comments. This work was supported by the\nNew Energy and Industrial Technology Development Organization (NEDO) (T.K. and M.T.) and\nJSPS KAKENHI No. 18K11463 (T.O.).\n\nReferences\n[1] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In\nInternational Conference on Knowledge Discovery and Data Mining, 2016.\n\n[2] David K. 
Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel,\nAlan Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning\nmolecular fingerprints. In Advances in Neural Information Processing Systems, 2015.\n\n[3] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele\nMonfardini. The graph neural network model. Trans. Neur. Netw., 20(1):61\u201380, January 2009.\n\n[4] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by\nback-propagating errors. Nature, 323(6088):533, 1986.\n\n[5] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional\nnetworks. arXiv preprint arXiv:1609.02907, 2016.\n\n[6] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural\nnetworks for graphs. In International conference on machine learning, pages 2014\u20132023,\n2016.\n\n[7] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large\ngraphs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and\nR. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1024\u20131034.\nCurran Associates, Inc., 2017.\n\n[8] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures\nfrom sequence and graph kernels. In Doina Precup and Yee Whye Teh, editors, Proceedings of\nthe 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine\nLearning Research, pages 2024\u20132033, International Convention Centre, Sydney, Australia,\n06\u201311 Aug 2017. PMLR.\n\n[9] Ulrike Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395\u2013416,\nAugust 2007.\n\n[10] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices.\nPhys. Rev. E, 74:036104, 2006.\n\n[11] Karl Rohe, Sourav Chatterjee, and Bin Yu. 
Spectral clustering and the high-dimensional\nstochastic blockmodel. The Annals of Statistics, 39(4):1878\u20131915, 2011.\n\n[12] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.\n\n[13] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborov\u00e1. Asymptotic\nanalysis of the stochastic block model for modular networks and its algorithmic applications.\nPhys. Rev. E, 84:066106, 2011.\n\n[14] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka\nZdeborov\u00e1, and Pan Zhang. Spectral redemption in clustering sparse networks. Proc. Natl. Acad.\nSci. U.S.A., 110(52):20935\u201340, December 2013.\n\n[15] Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. GraphRNN: A\ndeep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.\n\n[16] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl.\nNeural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.\n\n[17] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and\nMax Welling. Modeling relational data with graph convolutional networks. arXiv preprint\narXiv:1703.06103, 2017.\n\n[18] Tiago P. Peixoto. Efficient Monte Carlo and greedy heuristic for the inference of stochastic\nblock models. Phys. Rev. E, 89:012804, Jan 2014.\n\n[19] Tiago P Peixoto. Bayesian stochastic blockmodeling. arXiv preprint arXiv:1705.10225, 2017.\n\n[20] M. E. J. Newman and Gesine Reinert. Estimating the number of communities in a network.\nPhys. Rev. Lett., 117:078301, Aug 2016.\n\n[21] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted\npartition model. Probability Theory and Related Fields, 162(3):431\u2013461, Aug 2015.\n\n[22] Laurent Massouli\u00e9. 
Community detection thresholds and the weak Ramanujan property. In\nProceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC \u201914, pages\n694\u2013703, New York, 2014. ACM.\n\n[23] Cristopher Moore. The computer science and physics of community detection: landscapes,\nphase transitions, and hardness. arXiv preprint arXiv:1702.00467, 2017.\n\n[24] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli.\nExponential expressivity in deep neural networks through transient chaos. In D. D. Lee,\nM. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural\nInformation Processing Systems 29, pages 3360\u20133368. Curran Associates, Inc., 2016.\n\n[25] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep\ninformation propagation. arXiv preprint arXiv:1611.01232, 2016.\n\n[26] H. Sompolinsky and Annette Zippelius. Relaxational dynamics of the Edwards-Anderson model\nand the mean-field theory of spin-glasses. Phys. Rev. B, 25:6860\u20136875, Jun 1982.\n\n[27] A. Crisanti and H. Sompolinsky. Dynamics of spin systems with randomly asymmetric bonds:\nIsing spins and Glauber dynamics. Phys. Rev. A, 37:4865\u20134874, Jun 1988.\n\n[28] A Crisanti, HJ Sommers, and H Sompolinsky. Chaos in neural networks: chaotic solutions.\npreprint, 1990.\n\n[29] Manfred Opper, Burak \u00c7akmak, and Ole Winther. A theory of solving TAP equations for Ising\nmodels with general invariant random matrices. Journal of Physics A: Mathematical and\nTheoretical, 49(11):114002, 2016.\n\n[30] David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In\nProceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA\n\u201907, pages 1027\u20131035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied\nMathematics.\n\n[31] Krzysztof Nowicki and Tom A. B. Snijders. 
Estimation and prediction for stochastic\nblockstructures. Journal of the American Statistical Association, 96(455):1077\u20131087, 2001.\n\n[32] Joan Bruna and Xiang Li. Community detection with graph neural networks. arXiv preprint\narXiv:1705.08415, 2017.\n\n[33] Tatsuro Kawamoto and Yoshiyuki Kabashima. Limitations in the spectral method for graph\npartitioning: Detectability threshold and localization of eigenvectors. Phys. Rev. E, 91:062803,\nJun 2015.\n\n[34] T. Kawamoto and Y. Kabashima. Detectability of the spectral method for sparse graph\npartitioning. Eur. Phys. Lett., 112(4):40007, 2015.\n\n[35] Tatsuro Kawamoto. Algorithmic detectability threshold of the stochastic block model. Phys.\nRev. E, 97:032301, Mar 2018.\n\n[36] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open\nsource framework for deep learning. In Proceedings of workshop on machine learning\nsystems (LearningSys) in the twenty-ninth annual conference on neural information processing\nsystems, 2015.\n\n[37] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv\npreprint arXiv:1412.6980, 2014.\n\n[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern\nrecognition, 2016.\n\n[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\nby reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[40] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally\nconnected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.\n\n[41] Micha\u00ebl Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks\non graphs with fast localized spectral filtering. In D. D. Lee, M. Sugiyama, U. V. Luxburg,\nI. Guyon, and R. 
Garnett, editors, Advances in Neural Information Processing Systems 29,\npages 3844\u20133852. Curran Associates, Inc., 2016.\n\n[42] Liang Yang, Xiaochun Cao, Dongxiao He, Chuan Wang, Xiao Wang, and Weixiong Zhang.\nModularity based community detection with deep learning. In Proceedings of the Twenty-Fifth\nInternational Joint Conference on Artificial Intelligence, IJCAI\u201916, pages 2252\u20132258. AAAI\nPress, 2016.\n", "award": [], "sourceid": 2132, "authors": [{"given_name": "Tatsuro", "family_name": "Kawamoto", "institution": "National Institute of Advanced Industrial Science and Technology"}, {"given_name": "Masashi", "family_name": "Tsubaki", "institution": "National Institute of Advanced Industrial Science and Technology (AIST)"}, {"given_name": "Tomoyuki", "family_name": "Obuchi", "institution": "Tokyo Institute of Technology"}]}