{"title": "Recovering Communities in the General Stochastic Block Model Without Knowing the Parameters", "book": "Advances in Neural Information Processing Systems", "page_first": 676, "page_last": 684, "abstract": "The stochastic block model (SBM) has recently gathered significant attention due to new threshold phenomena. However, most developments rely on the knowledge of the model parameters, or at least on the number of communities. This paper introduces efficient algorithms that do not require such knowledge and yet achieve the optimal information-theoretic tradeoffs identified in Abbe-Sandon FOCS15. In the constant degree regime, an algorithm is developed that requires only a lower-bound on the relative sizes of the communities and achieves the optimal accuracy scaling for large degrees. This lower-bound requirement is removed for the regime of arbitrarily slowly diverging degrees, and the model parameters are learned efficiently. For the logarithmic degree regime, this is further enhanced into a fully agnostic algorithm that achieves the CH-limit for exact recovery in quasi-linear time. These provide the first algorithms affording efficiency, universality and information-theoretic optimality for strong and weak consistency in the SBM.", "full_text": "Recovering Communities in the General Stochastic\n\nBlock Model Without Knowing the Parameters\n\nEmmanuel Abbe\n\nColin Sandon\n\nDepartment of Electrical Engineering and PACM\n\nDepartment of Mathematics\n\nPrinceton University\nPrinceton, NJ 08540\n\neabbe@princeton.edu\n\nPrinceton University\nPrinceton, NJ 08540\n\nsandon@princeton.edu\n\nAbstract\n\nThe stochastic block model (SBM) has recently gathered signi\ufb01cant attention due\nto new threshold phenomena. However, most developments rely on the knowledge\nof the model parameters, or at least on the number of communities. 
This paper\nintroduces efficient algorithms that do not require such knowledge and yet achieve\nthe optimal information-theoretic tradeoffs identified in Abbe-Sandon FOCS15.\nIn the constant degree regime, an algorithm is developed that requires only a\nlower-bound on the relative sizes of the communities and achieves the optimal\naccuracy scaling for large degrees. This lower-bound requirement is removed for\nthe regime of arbitrarily slowly diverging degrees, and the model parameters are\nlearned efficiently. For the logarithmic degree regime, this is further enhanced into\na fully agnostic algorithm that achieves the CH-limit for exact recovery in\nquasi-linear time. These provide the first algorithms affording efficiency, universality\nand information-theoretic optimality for strong and weak consistency in the SBM.\n\n1 Introduction\n\nThis paper studies the problem of recovering communities in the general stochastic block model\nwith linear size communities, for constant and logarithmic degree regimes. In contrast to [1], this\npaper does not require knowledge of the parameters. It shows how to learn these from the graph\ntopology. We next provide some motivation for the problem and further background on the model.\nDetecting communities (or clusters) in graphs is a fundamental problem in networks, computer\nscience and machine learning. This applies to a large variety of complex networks (e.g., social and\nbiological networks) as well as to data sets engineered as networks via similarity graphs, where one\noften attempts to get a first impression of the data by trying to identify groups with similar behavior.\nIn particular, finding communities allows one to find like-minded people in social networks, to\nimprove recommendation systems, to segment or classify images, to detect protein complexes, to\nfind genetically related sub-populations, or to discover new tumor subclasses. 
See [1] for references.\nWhile a large variety of community detection algorithms have been deployed in the past decades, the\nunderstanding of the fundamental limits of community detection has only appeared more recently,\nin particular for the SBM [1\u20137]. The SBM is a canonical model for community detection. We use\nhere the notation SBM(n, p, W) to refer to a random graph ensemble on the vertex-set V = [n],\nwhere each vertex v \u2208 V is assigned independently a hidden (or planted) label \u03c3_v in [k] under a\nprobability distribution p = (p_1, ..., p_k) on [k], and each unordered pair of nodes (u, v) \u2208 V \u00d7 V\nis connected independently with probability W_{\u03c3_u,\u03c3_v}, where W is a symmetric k \u00d7 k matrix with\nentries in [0, 1]. Note that G \u223c SBM(n, p, W) denotes a random graph drawn under this model,\nwithout the hidden (or planted) clusters (i.e., the labels \u03c3_v) revealed. The goal is to recover these\nlabels by observing only the graph.\nRecently the SBM has returned to the center of attention, both at the practical level, due to\nextensions allowing overlapping communities that have proved to fit real data sets in massive\nnetworks well [8], and at the theoretical level, due to new phase transition phenomena [2\u20136]. The latter\nworks focus exclusively on the SBM with two symmetric communities, i.e., each community is of\nthe same size and the connectivity in each community is identical. 
Denoting by p the intra- and q\nthe extra-cluster probabilities, most of the results are concerned with two figures of merit: (i) recovery\n(also called exact recovery or strong consistency), which investigates the regimes of p and\nq for which there exists an algorithm that recovers with high probability the two communities\ncompletely [7, 9\u201319], (ii) detection, which investigates the regimes for which there exists an algorithm\nthat recovers with high probability a positively correlated partition [2\u20134].\nThe sharp threshold for exact recovery was obtained in [5, 6], showing\u00b9 that for p = a log(n)/n,\nq = b log(n)/n, a, b > 0, exact recovery is solvable if and only if |\u221aa \u2212 \u221ab| \u2265 \u221a2, with efficient\nalgorithms achieving the threshold. In addition, [5] introduces an SDP, proved to achieve the\nthreshold in [20, 21], while [22] shows that a spectral algorithm also achieves the threshold. The sharp\nthreshold for detection was obtained in [3, 4], showing that detection is solvable (and so efficiently)\nif and only if (a \u2212 b)^2 > 2(a + b), when p = a/n, q = b/n, settling a conjecture from [2].\nBesides the detection and the recovery properties, one may ask about the partial recovery of the\ncommunities, studied in [1, 19, 23\u201325]. Of particular interest to this paper is the case of strong\nrecovery (also called weak consistency), where only a vanishing fraction of the nodes is allowed to\nbe misclassified. For two-symmetric communities, [6] shows that strong recovery is possible if and\nonly if n(p \u2212 q)^2/(p + q) diverges, extended in [1] for general SBMs.\nIn the next section, we discuss the results for the general SBM of interest in this paper and the\nproblem of learning the model parameters. 
We conclude this section by providing motivation for\nthe problem of achieving the threshold with an efficient and universal algorithm.\nThreshold phenomena have long been studied in fields such as information theory (e.g., Shannon\u2019s\ncapacity) and constraint satisfaction problems (e.g., the SAT threshold). In particular, the quest of\nachieving the threshold has generated major algorithmic developments in these fields (e.g., LDPC\ncodes, polar codes, survey propagation to name a few). Likewise, identifying thresholds in\ncommunity detection models is key to benchmark and guide the development of clustering algorithms.\nHowever, it is particularly crucial to develop benchmarks that do not depend sensitively on the\nknowledge of the model parameters. A natural question is hence whether one can solve the various\nrecovery problems in the SBM without having access to the parameters. This paper answers this\nquestion in the affirmative for the exact and strong recovery of the communities.\n\n1.1 Prior results on the general SBM with known parameters\n\nMost of the previous works are concerned with the SBM having symmetric communities (mainly\n2 or sometimes k), with the exception of [19] which provides the first general achievability results\nfor the SBM.\u00b2 Recently, [1] studied fundamental limits for the general model SBM(n, p, W), with\np independent of n. The results are summarized below. Recall first the recovery requirements:\nDefinition 1 (Recovery requirements). An algorithm recovers or detects communities in\nSBM(n, p, W) with an accuracy of \u03b1 \u2208 [0, 1], if it outputs a labelling of the nodes {\u03c3'(v), v \u2208 V},\nwhich agrees with the true labelling \u03c3 on a fraction \u03b1 of the nodes with probability 1 \u2212 o_n(1).\nThe agreement is maximized over relabellings of the communities. 
Strong recovery refers to\n\u03b1 = 1 \u2212 o_n(1) and exact recovery refers to \u03b1 = 1.\nThe problem is solvable information-theoretically if there exists an algorithm that solves it, and\nefficiently if the algorithm runs in polynomial time in n. Note that exact recovery in SBM(n, p, W)\nrequires the graph not to have vertices of degree 0 in multiple communities with high probability.\nTherefore, for exact recovery, we focus on W = ln(n)Q/n where Q is fixed.\nI. Partial and strong recovery in the general SBM. The first result of [1] concerns the regime\nwhere the connectivity matrix W scales as Q/n for a positive symmetric matrix Q (i.e., the node\naverage degree is constant). The following notion of SNR is first introduced\n\nSNR = |\u03bb_min|^2/\u03bb_max    (1)\n\nwhere \u03bb_min and \u03bb_max are respectively the smallest\u00b3 and largest eigenvalues of diag(p)Q. The\nalgorithm Sphere-comparison is proposed that solves partial recovery with exponential accuracy\nand quasi-linear complexity when the SNR diverges.\n\n\u00b9 [6] generalizes this to a, b = \u0398(1).\n\u00b2 [24] also study variations of the k-symmetric model.\n\nTheorem 1. [1] Given any k \u2208 Z, p \u2208 (0, 1)^k with |p| = 1, and symmetric matrix Q with no\ntwo rows equal, let \u03bb be the largest eigenvalue of PQ, and \u03bb' be the eigenvalue of PQ with the\nsmallest nonzero magnitude. If SNR := |\u03bb'|^2/\u03bb > 4, \u03bb^7 < (\u03bb')^8, and 4\u03bb^3 < (\u03bb')^4, then for some\n\u03b5 = \u03b5(\u03bb, \u03bb') and C = C(p, Q) > 0, Sphere-comparison detects communities in graphs\ndrawn from SBM(n, p, Q/n) with accuracy\n\n1 \u2212 4k e^{\u2212C\u03c1/(16k)} / (1 \u2212 exp(\u2212(C\u03c1/(16k)) ((\u03bb')^4/\u03bb^3 \u2212 1))),\n\nprovided that the above is larger than 1 \u2212 (min_i p_i)/(2 ln(4k)), and runs in O(n^{1+\u03b5}) time. 
Moreover, \u03b5 can be made\narbitrarily small with 8 ln(\u03bb\u221a2/|\u03bb'|)/ln(\u03bb), and C(p, \u03b1Q) is independent of \u03b1.\n\nNote that for k symmetric clusters, SNR reduces to (a \u2212 b)^2/(k(a + (k \u2212 1)b)), which is the quantity of\ninterest for detection [2, 26]. Moreover, the SNR must diverge to ensure strong recovery in the\nsymmetric case [1]. The following is an important consequence of the previous theorem, stating that\nSphere-comparison solves strong recovery when the entries of Q are amplified.\nCorollary 1. [1] For any k \u2208 Z, p \u2208 (0, 1)^k with |p| = 1, and symmetric matrix Q with no two rows\nequal, there exists \u03b5(c) = O(1/ln(c)) such that for all sufficiently large c, Sphere-comparison\ndetects communities in SBM(n, p, cQ/n) with accuracy 1 \u2212 e^{\u2212\u2126(c)} and complexity O_n(n^{1+\u03b5(c)}).\nThe above gives the optimal scaling both in accuracy and complexity.\nII. Exact recovery in the general SBM. The second result in [1] is for the regime where the\nconnectivity matrix scales as ln(n)Q/n, Q independent of n, where it is shown that exact recovery has\na sharp threshold characterized by the divergence function\n\nD_+(f, g) = max_{t\u2208[0,1]} \u2211_{x\u2208[k]} (t f(x) + (1 \u2212 t) g(x) \u2212 f(x)^t g(x)^{1\u2212t}),\n\nnamed the CH-divergence in [1]. Specifically, if all pairs of columns in diag(p)Q are at D_+-distance\nat least 1 from each other, then exact recovery is solvable in the general SBM. We refer to Section\n2.3 in [1] for discussion on the connection with Shannon\u2019s channel coding theorem (and CH vs.\nKL divergence). 
An algorithm (Degree-profiling) is also developed in [1] that solves exact\nrecovery down to the D_+ limit in quasi-linear time, showing that exact recovery has no informational\nto computational gap.\nTheorem 2. [1] (i) Exact recovery is solvable in SBM(n, p, ln(n)Q/n) if and only if\n\nmin_{i,j\u2208[k], i\u2260j} D_+((PQ)_i || (PQ)_j) \u2265 1.\n\n(ii) The Degree-profiling algorithm (see [1]) solves exact recovery whenever it is\ninformation-theoretically solvable and runs in o(n^{1+\u03b5}) time for all \u03b5 > 0.\n\nExact and strong recovery are thus solved for the general SBM with linear-size communities, when\nthe parameters are known. We next remove the latter assumption.\n\n1.2 Estimating the parameters\n\nFor the estimation of the parameters, some results are known for two-symmetric communities. In\nthe logarithmic degree regime, since the SDP is agnostic to the parameters (it is a relaxation of the\nmin-bisection), the parameters can be estimated by recovering the communities [5, 20, 21]. For the\nconstant-degree regime, [26] shows that the parameters can be estimated above the threshold by\ncounting cycles (which is efficiently approximated by counting non-backtracking walks). These are,\nhowever, for 2 communities. We also became aware of a parallel work [27], which considers private\ngraphon estimation (including SBMs). In particular, for the logarithmic degree regime, [27] obtains\na (non-efficient) procedure to estimate parameters of graphons in an appropriate version of the L_2\nnorm. For the general SBM, learning the model was to date mainly open.\n\n\u00b3 The smallest eigenvalue of diag(p)Q is the one with least magnitude.\n\n2 Results\n\nAgnostic algorithms are developed for the constant and diverging node degrees (with p, k independent\nof n). These afford optimal accuracy and complexity scaling for large node degrees and achieve\nthe CH-divergence limit for logarithmic node degrees. 
In particular, the SBM can be learned\nefficiently for any diverging degrees.\nNote that the assumptions on p and k being independent of n could be slightly relaxed, for example\nto slowly growing k, but we leave this for future work.\n\n2.1 Partial recovery\n\nOur main result for partial recovery holds in the constant degree regime and requires a lower bound\n\u03b4 on the least relative size of the communities. This requirement is removed when working with\ndiverging degrees, as stated in the corollary below.\n\nTheorem 3. Given \u03b4 > 0 and for any k \u2208 Z, p \u2208 (0, 1)^k with \u2211 p_i = 1 and 0 < \u03b4 \u2264 min p_i,\nand any symmetric matrix Q with no two rows equal such that every entry in Q^k is positive (in other\nwords, Q such that there is a nonzero probability of a path between vertices in any two communities\nin a graph drawn from SBM(n, p, Q/n)), there exists \u03b5(\u03b1) = O(1/ln(\u03b1)) such that for all\nsufficiently large \u03b1, Agnostic-sphere-comparison detects communities in graphs drawn from\nSBM(n, p, \u03b1Q/n) with accuracy at least 1 \u2212 e^{\u2212\u2126(\u03b1)} in O_n(n^{1+\u03b5(\u03b1)}) time.\nNote that a vertex in community i has degree 0 with probability exponentially small in \u03b1, and there is no\nway to differentiate between vertices of degree 0 from different communities. So, an error rate\nthat decreases exponentially with \u03b1 is optimal. In [28], we provide a more detailed version of this\ntheorem, which yields a quantitative statement on the accuracy of the algorithm in terms of the SNR\n(\u03bb')^2/\u03bb for general SBM(n, p, Q/n).\nCorollary 2. If \u03b1 = \u03c9(1) in Theorem 3, the knowledge requirement on \u03b4 can be removed.\n\n2.2 Exact recovery\n\nRecall that from [1], exact recovery is information-theoretically and computationally solvable in\nSBM(n, p, ln(n)Q/n) if and only if\n\nmin_{i<j} D_+((PQ)_i, (PQ)_j) \u2265 1.\n\n[...] 0. 
In particular, exact recovery is efficiently\nand universally solvable whenever it is information-theoretically solvable.\n\n3 Proof Techniques and Algorithms\n\n3.1 Partial recovery and the Agnostic-sphere-comparison algorithm\n\n3.1.1 Simplified version of the algorithm for the symmetric case\n\nTo ease the presentation of the algorithm, we focus first on the symmetric case, i.e., the SBM with\nk communities of relative size 1/k, probability of connecting a/n inside communities and b/n across\ncommunities. Let d = (a + (k \u2212 1)b)/k be the average degree.\nDefinition 2. For any vertex v, let N_r[G](v) be the set of all vertices with shortest path in G to v of\nlength r. We often drop the subscript G if the graph in question is the original SBM. We also refer\nto N\u0304_r(v) as the vector whose i-th entry is the number of vertices in N_r(v) that are in community i.\n\nFor an arbitrary vertex v and reasonably small r, there will be typically about d^r vertices in N_r(v),\nand about ((a \u2212 b)/k)^r more of them will be in v\u2019s community than in each other community. Of course,\nthis only holds when r < log n/log d because there are not enough vertices in the graph otherwise.\nThe obvious way to try to determine whether or not two vertices v and v' are in the same community\nis to guess that they are in the same community if |N_r(v) \u2229 N_r(v')| > d^{2r}/n and different\ncommunities otherwise. Unfortunately, whether or not a vertex is in N_r(v) is not independent of whether\nor not it is in N_r(v'), which compromises this plan. Instead, we propose to rely on the following\ngraph-splitting step: Randomly assign every edge in G to some set E with a fixed probability c and\nthen count the number of edges in E that connect N_r[G\\E] and N_{r'}[G\\E]. Formally:\nDefinition 3. 
For any v, v' \u2208 G, r, r' \u2208 Z, and subset of G\u2019s edges E, let N_{r,r'}[E](v \u00b7 v') be the\nnumber of pairs (v_1, v_2) such that v_1 \u2208 N_r[G\\E](v), v_2 \u2208 N_{r'}[G\\E](v'), and (v_1, v_2) \u2208 E.\nNote that E and G\\E are disjoint. However, in SBM(n, p, Q/n), G is sparse enough that even if\nthe two graphs were generated independently, a given pair of vertices would have an edge in both\ngraphs with probability O(1/n^2). So, E is approximately independent of G\\E.\nThus, given v, r, and denoting by \u03bb_1 = (a + (k \u2212 1)b)/k and \u03bb_2 = (a \u2212 b)/k the two eigenvalues\nof PQ in the symmetric case, the expected number of intra-community neighbors at depth r from v\nis approximately (1/k)(\u03bb_1^r + (k \u2212 1)\u03bb_2^r), whereas the expected number of extra-community neighbors\nat depth r from v is approximately (1/k)(\u03bb_1^r \u2212 \u03bb_2^r) for each of the other (k \u2212 1) communities. All of\nthese are scaled by 1 \u2212 c if we do the computations in G\\E. Using now the emulated independence\nbetween E and G\\E, and assuming v and v' to be in the same community, the expected number\nof edges in E connecting N_r[G\\E](v) to N_{r'}[G\\E](v') is approximately given by the inner product\nu^t(c \u00b7 PQ)u, where u = (1/k)(\u03bb_1^r + (k \u2212 1)\u03bb_2^r, \u03bb_1^r \u2212 \u03bb_2^r, ..., \u03bb_1^r \u2212 \u03bb_2^r) and (PQ) is the matrix with a\non the diagonal and b elsewhere. When v and v' are in different communities, the inner product is\nbetween u and a permutation of u. After simplifications, this gives\n\nN_{r,r'}[E](v \u00b7 v') \u2248 (c(1 \u2212 c)^{r+r'}/n) [d^{r+r'+1} + ((a \u2212 b)/k)^{r+r'+1} (k\u03b4_{\u03c3_v,\u03c3_{v'}} \u2212 1)]    (3)\n\nwhere \u03b4_{\u03c3_v,\u03c3_{v'}} is 1 if v and v' are in the same community and 0 otherwise. In order for N_{r,r'}[E](v \u00b7 v')\nto depend on the relative communities of v and v', it must be that c(1 \u2212 c)^{r+r'} |(a \u2212 b)/k|^{r+r'+1} k is large\nenough, i.e., more than n, so r + r' needs to be at least log n/log |(a \u2212 b)/k|. 
A difficulty is that for a\nspecific pair of vertices, the d^{r+r'+1} term will be multiplied by a random factor dependent on the\ndegrees of v, v', and the nearby vertices. So, in order to stop the variation in the d^{r+r'+1} term from\ndrowning out the ((a \u2212 b)/k)^{r+r'+1} (k\u03b4_{\u03c3_v,\u03c3_{v'}} \u2212 1) term, it is necessary to cancel out the dominant term.\nThis brings us to introduce the following sign-invariant statistics:\n\nI_{r,r'}[E](v \u00b7 v') := N_{r+2,r'}[E](v \u00b7 v') \u00b7 N_{r,r'}[E](v \u00b7 v') \u2212 N_{r+1,r'}^2[E](v \u00b7 v')\n\u2248 (c^2(1 \u2212 c)^{2r+2r'+2}/n^2) \u00b7 (d \u2212 (a \u2212 b)/k)^2 \u00b7 d^{r+r'+1} ((a \u2212 b)/k)^{r+r'+1} (k\u03b4_{\u03c3_v,\u03c3_{v'}} \u2212 1).\n\nIn particular, for r + r' odd, I_{r,r'}[E](v \u00b7 v') will tend to be positive if v and v' are in the same\ncommunity and negative otherwise, irrespective of the specific values of a, b, k. That suggests the\nfollowing algorithm for partial recovery, which requires knowledge of \u03b4 < 1/k in the constant degree\nregime, but not in the regime where a, b scale with n.\n\n1. Set r = r' = (3/4) log n/log d and put each of the graph\u2019s edges in E with probability 1/10.\n2. Set k_max = 1/\u03b4 and select k_max ln(4k_max) random vertices, v_1, ..., v_{k_max ln(4k_max)}.\n3. Compute I_{r,r'}[E](v_i \u00b7 v_j) for each i and j.\n4. 
If there is a possible assignment of these vertices to communities such that I_{r,r'}[E](v_i \u00b7 v_j) >\n0 if and only if v_i and v_j are in the same community, then randomly select one vertex from\neach apparent community, v[1], v[2], ..., v[k']. Otherwise, fail.\n5. For every v' in the graph, guess that v' is in the same community as the v[i] that maximizes\nthe value of I_{r,r'}[E](v[i] \u00b7 v').\n\nThis algorithm succeeds as long as |a \u2212 b|/k > (10/9)^{1/6}((a + (k \u2212 1)b)/k)^{5/6}, to ensure that\nthe above estimates on N_{r,r'}[E](v \u00b7 v') are reliable. Further, if a, b are scaled by \u03b1 = \u03c9(1), setting\n\u03b4 = 1/log log \u03b1 allows removal of the knowledge requirement on \u03b4. In addition, letting r and r'\ntake different values allows us to reduce the complexity of the algorithm.\nOne alternative to our approach could be to count the non-backtracking walks of a given length\nbetween v and v', like in [4, 29], instead of using N_{r,r'}[E](v \u00b7 v'). However, proving that the number\nof non-backtracking walks is close to its expected value is difficult. Proving that N_{r,r'}[E](v \u00b7 v')\nis within a desired range is substantially easier because for any v_1 and v_2, whether or not there is\nan edge between v_1 and v_2 directly affects N_r(v) for at most one value of r. Algorithms based on\nshortest paths have also been studied in [30].\n\n3.1.2 The general case\n\nIn the general case, define N_r(v), N\u0304_r(v) and N_{r,r'}[E](v \u00b7 v') as in the previous section. Now, for\nany v_1 \u2208 N_r[G\\E](v) and v_2 \u2208 N_{r'}[G\\E](v'), (v_1, v_2) \u2208 E with a probability of approximately\ncQ_{\u03c3_{v_1},\u03c3_{v_2}}/n. 
As a result,\n\nN_{r,r'}[E](v \u00b7 v') \u2248 N\u0304_r[G\\E](v) \u00b7 (cQ/n) N\u0304_{r'}[G\\E](v') \u2248 ((1 \u2212 c)PQ)^r e_{\u03c3_v} \u00b7 (cQ/n) ((1 \u2212 c)PQ)^{r'} e_{\u03c3_{v'}}\n= c(1 \u2212 c)^{r+r'} e_{\u03c3_v} \u00b7 Q(PQ)^{r+r'} e_{\u03c3_{v'}}/n.\n\nFigure 1: The purple edges represent the edges counted by N_{r,r'}[E](v \u00b7 v').\n\nLet \u03bb_1, ..., \u03bb_h be the distinct eigenvalues of PQ, ordered so that |\u03bb_1| \u2265 |\u03bb_2| \u2265 ... \u2265 |\u03bb_h| \u2265 0.\nAlso define h' so that h' = h if \u03bb_h \u2260 0 and h' = h \u2212 1 if \u03bb_h = 0. If W_i is the eigenspace of PQ\ncorresponding to the eigenvalue \u03bb_i, and P_{W_i} is the projection operator on to W_i, then\n\nN_{r,r'}[E](v \u00b7 v') \u2248 c(1 \u2212 c)^{r+r'} e_{\u03c3_v} \u00b7 Q(PQ)^{r+r'} e_{\u03c3_{v'}}/n    (4)\n= (c(1 \u2212 c)^{r+r'}/n) \u2211_i \u03bb_i^{r+r'+1} P_{W_i}(e_{\u03c3_v}) \u00b7 P^{\u22121}P_{W_i}(e_{\u03c3_{v'}})    (5)\n\nwhere the final equality holds because for all i \u2260 j,\n\n\u03bb_i P_{W_i}(e_{\u03c3_v}) \u00b7 P^{\u22121}P_{W_j}(e_{\u03c3_{v'}}) = P_{W_i}(e_{\u03c3_v}) \u00b7 QP_{W_j}(e_{\u03c3_{v'}}) = P_{W_i}(e_{\u03c3_v}) \u00b7 P^{\u22121}\u03bb_j P_{W_j}(e_{\u03c3_{v'}}),\n\nand since \u03bb_i \u2260 \u03bb_j, this implies that P_{W_i}(e_{\u03c3_v}) \u00b7 P^{\u22121}P_{W_j}(e_{\u03c3_{v'}}) = 0.\nDefinition 4. Let \u03b6_i(v \u00b7 v') = P_{W_i}(e_{\u03c3_v}) \u00b7 P^{\u22121}P_{W_i}(e_{\u03c3_{v'}}) for all i, v, and v'.\nEquation (5) is dominated by the \u03bb_1^{r+r'+1} term, so getting good estimates of the \u03bb_2^{r+r'+1} through\n\u03bb_{h'}^{r+r'+1} terms requires cancelling it out somehow. 
As a start, if \u03bb_1 > \u03bb_2 > \u03bb_3 then\n\nN_{r+2,r'}[E](v \u00b7 v') \u00b7 N_{r,r'}[E](v \u00b7 v') \u2212 N_{r+1,r'}^2[E](v \u00b7 v')\n\u2248 (c^2(1 \u2212 c)^{2r+2r'+2}/n^2) (\u03bb_1^2 + \u03bb_2^2 \u2212 2\u03bb_1\u03bb_2) \u03bb_1^{r+r'+1} \u03bb_2^{r+r'+1} \u03b6_1(v \u00b7 v')\u03b6_2(v \u00b7 v').\n\nNote that the left hand side of this expression is equal to\n\ndet [ N_{r,r'}[E](v \u00b7 v')    N_{r+1,r'}[E](v \u00b7 v') ]\n    [ N_{r+1,r'}[E](v \u00b7 v')  N_{r+2,r'}[E](v \u00b7 v') ].\n\nDefinition 5. Let M_{m,r,r'}[E](v \u00b7 v') be the m \u00d7 m matrix such that M_{m,r,r'}[E](v \u00b7 v')_{i,j} =\nN_{r+i+j,r'}[E](v \u00b7 v') for each i and j.\nAs shown in [28], there exists a constant \u03b3(\u03bb_1, ..., \u03bb_m) such that\n\ndet(M_{m,r,r'}[E](v \u00b7 v')) \u2248 (c^m(1 \u2212 c)^{m(r+r')}/n^m) \u03b3(\u03bb_1, ..., \u03bb_m) \u220f_{i=1}^m \u03bb_i^{r+r'+1} \u03b6_i(v \u00b7 v')    (6)\n\nwhere we assumed that |\u03bb_m| > |\u03bb_{m+1}| above to simplify the discussion (the case |\u03bb_m| = |\u03bb_{m+1}| is\nsimilar). This suggests the following plan for estimating the eigenvalues corresponding to a graph.\nFirst, pick several vertices at random. Then, use the fact that |N_r[G\\E](v)| \u2248 ((1 \u2212 c)\u03bb_1)^r for any\ngood vertex v to estimate \u03bb_1. Next, take ratios of (6) for m and m \u2212 1 (with r = r'), and look for\nthe smallest m making that ratio small enough (this will use the estimate on \u03bb_1), estimating h' by\nthis value minus one. Then estimate consecutively all of PQ\u2019s eigenvalues for each selected vertex\nusing ratios of (6). 
Finally, take the median of these estimates.\nIn general, whether |\u03bb_m| > |\u03bb_{m+1}| or |\u03bb_m| = |\u03bb_{m+1}|,\n\n[det(M_{m,r+1,r'}[E](v \u00b7 v')) \u2212 (1 \u2212 c)^m \u03bb_{m+1} \u220f_{i=1}^{m\u22121} \u03bb_i det(M_{m,r,r'}[E](v \u00b7 v'))]\n/ [det(M_{m\u22121,r+1,r'}[E](v \u00b7 v')) \u2212 (1 \u2212 c)^{m\u22121} \u03bb_m \u220f_{i=1}^{m\u22122} \u03bb_i det(M_{m\u22121,r,r'}[E](v \u00b7 v'))]\n\u2248 (c/n) \u00b7 [\u03bb_{m\u22121}(\u03bb_m \u2212 \u03bb_{m+1})]/[\u03bb_m(\u03bb_{m\u22121} \u2212 \u03bb_m)] \u00b7 [\u03b3(\u03bb_1, ..., \u03bb_m)/\u03b3(\u03bb_1, ..., \u03bb_{m\u22121})] \u00b7 ((1 \u2212 c)\u03bb_m)^{r+r'+2} \u03b6_m(v \u00b7 v').\n\nThis fact can be used to approximate \u03b6_i(v \u00b7 v') for arbitrary v, v', and i. Of course, this requires r\nand r' to be large enough that (c(1 \u2212 c)^{r+r'}/n) \u03bb_i^{r+r'+1} \u03b6_i(v \u00b7 v') is large relative to the error terms for all\ni \u2264 h'. This requires at least |(1 \u2212 c)\u03bb_i|^{r+r'+1} = \u03c9(n) for all i \u2264 h'. Moreover, for any v and v',\n0 \u2264 P_{W_i}(e_{\u03c3_v} \u2212 e_{\u03c3_{v'}}) \u00b7 P^{\u22121}P_{W_i}(e_{\u03c3_v} \u2212 e_{\u03c3_{v'}}) = \u03b6_i(v \u00b7 v) \u2212 2\u03b6_i(v \u00b7 v') + \u03b6_i(v' \u00b7 v') with equality for\nall i if and only if \u03c3_v = \u03c3_{v'}, so sufficiently good approximations of \u03b6_i(v \u00b7 v), \u03b6_i(v \u00b7 v') and \u03b6_i(v' \u00b7 v')\ncan be used to determine which pairs of vertices are in the same community.\nOne could generate a reasonable classification based solely on this method of comparing vertices\n(with an appropriate choice of the parameters, as later detailed). 
However, that would require\ncomputing N_{r,r'}[E](v \u00b7 v) for every vertex in the graph with fairly large r + r', which would be slow.\nInstead, we use the fact that for any vertices v, v', and v'' with \u03c3_v = \u03c3_{v'} \u2260 \u03c3_{v''},\n\n\u03b6_i(v' \u00b7 v') \u2212 2\u03b6_i(v \u00b7 v') + \u03b6_i(v \u00b7 v) = 0 \u2264 \u03b6_i(v'' \u00b7 v'') \u2212 2\u03b6_i(v \u00b7 v'') + \u03b6_i(v \u00b7 v)\n\nfor all i, and the inequality is strict for at least one i. So, subtracting \u03b6_i(v \u00b7 v) from both sides, we\nhave \u03b6_i(v' \u00b7 v') \u2212 2\u03b6_i(v \u00b7 v') \u2264 \u03b6_i(v'' \u00b7 v'') \u2212 2\u03b6_i(v \u00b7 v'') for all i, and the inequality is still strict for at\nleast one i. So, given a representative vertex in each community, we can determine which of them a\ngiven vertex, v, shares a community with, without needing to know the value of \u03b6_i(v \u00b7 v).\nThis runs fairly quickly if r is large and r' is small because the algorithm only requires focusing\non |N_{r'}(v')| vertices. This leads to the following plan for partial recovery. First, randomly select\na set of vertices that is large enough to contain at least one vertex from each community with high\nprobability. Next, compare all of the selected vertices in an attempt to determine which of them are\nin the same communities. Then, pick one in each community. Call these anchor nodes. After that,\nuse the algorithm referred to above to determine which community each of the remaining vertices\nis in. As long as there actually was at least one vertex from each community in the initial set and\nnone of the approximations were particularly bad, this should give a reasonable classification. 
The\nrisk that this randomly gives a bad classification due to a bad set of initial vertices can be mitigated\nby repeating the previous classification procedure several times as discussed in [28]. This completes\nthe Agnostic-sphere-comparison algorithm. We refer to [28] for the details.\n\n3.2 Exact recovery and the Agnostic-degree-profiling algorithm\n\nThe exact recovery part is similar to [1] and uses the fact that once a good enough clustering has been\nobtained from Agnostic-sphere-comparison, the classification can be finished by making\nlocal improvements based on the node\u2019s neighborhoods. Similar techniques have been used in [5,\n11, 19, 31, 32]. However, we establish here a sharp characterization of the local procedure error.\nThe key result is that, when testing between two multivariate Poisson distributions of means\nlog(n)\u03bb_1 and log(n)\u03bb_2 respectively, where \u03bb_1, \u03bb_2 \u2208 Z_+^k, the probability of error (of maximum\na posteriori decoding) is\n\nn^{\u2212D_+(\u03bb_1,\u03bb_2)+o(1)}.    (7)\n\nThis is proved in [1]. In the case of unknown parameters, the algorithmic approach is largely\nunchanged, adding a step where the best known classification is used to estimate p and Q prior to any\nlocal improvement step. The analysis of the algorithm requires however some careful handling.\nFirst, it is necessary to prove that given a labelling of the graph\u2019s vertices with an error rate of x,\none can compute approximations of p and Q that are within O(x + log(n)/\u221an) of their true values\nwith probability 1 \u2212 o(1). Secondly, one needs to modify the above hypothesis testing estimates to\ncontrol the error probability. 
In attempting to determine vertices\u2019 communities based on estimates of\np and Q that are off by at most \u03b4, say p' and Q', one must show that a classification of their neighbors\nthat has an error rate of \u03b4 classifies the vertices with an error rate only e^{O(\u03b4 log n)} times higher than\nit would be if the parameters really were p' and Q' and the vertices\u2019 neighbors were all classified\ncorrectly. Thirdly, one needs to show that since D_+((PQ)_i, (PQ)_j) is differentiable with respect\nto any element of PQ, the error rate if the parameters really were p' and Q' is at worst e^{O(\u03b4 log n)}\ntimes as high as the error rate with the actual parameters. Combining these yields the conclusion that any\nerrors in the estimates of the SBM\u2019s parameters do not disrupt vertex classification any worse than\nthe errors in the preliminary classifications already did.\nThe Agnostic-degree-profiling algorithm. The inputs are (g, \u03b3), where g is a graph,\nand \u03b3 \u2208 [0, 1] (see [28] for how to set \u03b3 specifically). 
The algorithm outputs each node\u2019s label.\n(1) Define the graph g' on the vertex set [n] by selecting each edge in g independently with\nprobability \u03b3, and define the graph g'' that contains the edges in g that are not in g'.\n(2) Run Agnostic-sphere-comparison on g' with \u03b4 = 1/log log(n) to obtain the\nclassification \u03c3' \u2208 [k]^n.\n(3) Determine the size of all alleged communities, and estimate the edge density among these.\n(4) For each node v \u2208 [n], determine the most likely community label of node v based on its degree\nprofile N\u0304_1(v) computed from the preliminary classification \u03c3', and call it \u03c3''_v.\n(5) Use \u03c3''_v to get new estimates of p and Q.\n(6) For each node v \u2208 [n], determine the most likely community label of node v based on its degree\nprofile N\u0304_1(v) computed from \u03c3''. Output this labelling.\nIn steps (4) and (6), the most likely label is the one that maximizes the probability that the degree\nprofile comes from a multivariate Poisson distribution of mean ln(n)(PQ)_i for i \u2208 [k]. Note that this\nalgorithm does not require a lower bound on min p_i because setting \u03b4 to a slowly decreasing function\nof n results in \u03b4 being within an acceptable range for all sufficiently large n.\n\n4 Data implementation and open problems\n\nWe tested a simplified version of our algorithm on real data (see [28]), for the blog network of\nAdamic and Glance \u201905. We obtained an error rate of about 60/1222 (best trial was 57, worst 67),\nachieving the state-of-the-art (as described in [32]). 
Our results extend quite directly to a slowly growing\nnumber of communities (e.g., up to logarithmic).\nIt would be interesting to extend the current\napproach to smaller sized communities, watching the complexity scaling, as well as to corrected degrees,\nlabeled edges, or overlapping communities (though our approach already applies to linear-sized overlaps).\n\nAcknowledgments\n\nThis research was partly supported by NSF grant CCF-1319299 and the Bell Labs Prize.\n\nReferences\n[1] E. Abbe and C. Sandon. Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms. arXiv:1503.00609. To appear in FOCS15, March 2015.\n[2] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov\u00e1. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84:066106, December 2011.\n[3] L. Massouli\u00e9. Community detection thresholds and the weak Ramanujan property. In STOC 2014: 46th Annual Symposium on the Theory of Computing, pages 1\u201310, New York, United States, June 2014.\n[4] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Available online at arXiv:1311.4115 [math.PR], January 2014.\n[5] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. To appear in IEEE Transactions on Information Theory. Available at arXiv:1405.3267, May 2014.\n[6] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for binary symmetric block models. arXiv:1407.1591. To appear in STOC15, July 2014.\n[7] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv:1402.1267, February 2014.\n[8] P. K. Gopalan and D. M. Blei. Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences, 2013.\n[9] P. W. Holland, K. Laskey, and S. 
Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109\u2013137, 1983.\n[10] T. N. Bui, S. Chaudhuri, F. T. Leighton, and M. Sipser. Graph bisection algorithms with good average case behavior. Combinatorica, 7(2):171\u2013191, 1987.\n[11] M. E. Dyer and A. M. Frieze. The solution of some random NP-hard problems in polynomial expected time. Journal of Algorithms, 10(4):451\u2013489, 1989.\n[12] M. Jerrum and G. B. Sorkin. The Metropolis algorithm for graph bisection. Discrete Applied Mathematics, 82(1\u20133):155\u2013175, 1998.\n[13] A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Lecture Notes in Computer Science, 1671:221\u2013232, 1999.\n[14] T. A. B. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75\u2013100, January 1997.\n[15] F. McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 529\u2013537, 2001.\n[16] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 2009.\n[17] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878\u20131915, August 2011.\n[18] D. S. Choi, P. J. Wolfe, and E. M. Airoldi. Stochastic blockmodels with a growing number of classes. Biometrika, pages 1\u201312, 2012.\n[19] V. Vu. A simple SVD algorithm for finding hidden partitions. Available online at arXiv:1404.3918, 2014.\n[20] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. arXiv:1412.6156, November 2014.\n[21] A. S. Bandeira. Random Laplacian matrices and convex relaxations. arXiv:1504.03987, 2015.\n[22] S. Yun and A. Proutiere. 
Accurate community detection in the stochastic block model via spectral algorithms. arXiv:1412.7335, December 2014.\n[23] E. Mossel, J. Neeman, and A. Sly. Belief propagation, robust reconstruction, and optimal recovery of block models. arXiv:1309.1380, 2013.\n[24] O. Gu\u00e9don and R. Vershynin. Community detection in sparse networks via Grothendieck\u2019s inequality. arXiv:1411.4686, November 2014.\n[25] P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. arXiv:1501.05021, January 2015.\n[26] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. Available online at arXiv:1202.1499 [math.PR], 2012.\n[27] C. Borgs, J. Chayes, and A. Smith. Private graphon estimation for sparse graphs. In preparation, 2015.\n[28] E. Abbe and C. Sandon. Recovering communities in the general stochastic block model without knowing the parameters. arXiv:1506.03729, June 2015.\n[29] C. Bordenave, M. Lelarge, and L. Massouli\u00e9. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. Available at arXiv:1501.06087, 2015.\n[30] S. Bhattacharyya and P. J. Bickel. Community detection in networks using graph distance. ArXiv e-prints, January 2014.\n[31] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. In SIAM Journal on Computing, pages 346\u2013355, 1994.\n[32] C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving optimal misclassification proportion in stochastic block model. arXiv:1505.03772, 2015.\n", "award": [], "sourceid": 478, "authors": [{"given_name": "Emmanuel", "family_name": "Abbe", "institution": "Princeton University"}, {"given_name": "Colin", "family_name": "Sandon", "institution": "Princeton University"}]}