Contextual Stochastic Block Models

Yash Deshpande*    Andrea Montanari†    Elchanan Mossel‡    Subhabrata Sen§

Abstract

We provide the first information-theoretically tight analysis for inference of latent community structure given a sparse graph along with high-dimensional node covariates, correlated with the same latent communities. Our work bridges recent theoretical breakthroughs in the detection of latent community structure without node covariates and a large body of empirical work using diverse heuristics for combining node covariates with graphs for inference. The tightness of our analysis implies, in particular, the information-theoretic necessity of combining the different sources of information. Our analysis holds for networks of large degrees as well as for a Gaussian version of the model.

1 Introduction

Data clustering is a widely used primitive in exploratory data analysis and summarization. These methods discover clusters or partitions that are assumed to reflect a latent partitioning of the data with semantic significance.
In a machine learning pipeline, results of such a clustering may then be used for downstream supervised tasks, such as feature engineering, privacy-preserving classification or fair allocation [CMS11, KGB+12, CDPF+17].

At the risk of over-simplification, there are two settings that are popular in the literature. In graph clustering, the dataset of n objects is represented as a symmetric similarity matrix A = (A_ij)_{1≤i,j≤n}. For instance, A can be binary, where A_ij = 1 (or 0) denotes that the two objects i, j are similar (or not). It is then natural to interpret A as the adjacency matrix of a graph. This can be carried over to non-binary settings by considering weighted graphs. On the other hand, in more traditional (binary) classification problems, the n objects are represented as p-dimensional feature or covariate vectors b_1, b_2, …, b_n. This feature representation can be the input for a clustering method such as k-means, or instead used to construct a similarity matrix A, which in turn is used for clustering or partitioning. These two representations are often taken to be mutually exclusive and, in fact, interchangeable. Indeed, just as feature representations can be used to construct similarity matrices, popular spectral methods [NJW02, VL07] implicitly construct a low-dimensional feature representation from the similarity matrices.

This paper is motivated by scenarios where the graph, or similarity, representation A ∈ R^{n×n} and the feature representation B = [b_1, b_2, …, b_n] ∈ R^{p×n} provide independent, or complementary, information on the latent clustering of the n objects. (Technically, we will assume that A and B are conditionally independent given the node labels.) We argue that, in fact, in almost all practical graph clustering problems, feature representations provide complementary information about the latent clustering.
This is indeed the case in many social and biological networks; see e.g. [NC16] and references therein.

As an example, consider the 'political blogs' dataset [AG05]. This is a directed network of political blogs during the 2004 US presidential election, with a link between two blogs if one referred to the other.

*Department of Mathematics, Massachusetts Institute of Technology
†Departments of Electrical Engineering and Statistics, Stanford University
‡Department of Mathematics, Massachusetts Institute of Technology
§Department of Mathematics, Massachusetts Institute of Technology

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

It is possible to use just the graph structure in order to identify political communities (as was done in [AG05]). Note however that much more data is available. For example, we may consider an alternative feature representation of the blogs, wherein each blog is converted to a 'bag-of-words' vector of its content. This gives a quite different, and complementary, representation of the blogs that plausibly reflects their political leaning. A number of approaches can be used for the simple task of predicting leaning from the graph information (or the feature information) individually. However, given access to both sources, it is challenging to combine them in a principled fashion.

In this context, we introduce a simple statistical model of complementary graph and high-dimensional covariate data that share latent cluster structure. This model is an intuitive combination of two well-studied models in machine learning and statistics: the stochastic block model and the spiked covariance model [Abb17, HLL83, JL04]. We focus on the task of uncovering this latent structure and make the following contributions:

Sharp thresholds: We establish a sharp information-theoretic threshold for detecting the latent structure in this model.
This threshold is based on non-rigorous, but powerful, techniques from statistical physics.

Rigorous validation: We consider a certain 'Gaussian' limit of the statistical model, which is of independent interest. In this limit, we rigorously establish the correct information-theoretic threshold using novel Gaussian comparison inequalities. We further show convergence to the Gaussian limit predictions as the density of the graph diverges.

Algorithm: We provide a simple, iterative algorithm for inference based on the belief propagation heuristic. For data generated from the model, we empirically demonstrate that the algorithm achieves the conjectured information-theoretic threshold.

The rest of the paper is organized as follows. The model and results are presented in Section 2. Further related work is discussed in Section 3. The prediction of the threshold from statistical physics techniques is presented in Section 4, along with the algorithm. While all proofs are presented in the appendix, we provide an overview of the proofs of our rigorous results in Section 5. Finally, we numerically validate the prediction in Section 6.

2 Model and main results

We will focus on the simple case where the n objects form two latent clusters of approximately equal size, labeled + and −. Let v ∈ {±1}ⁿ be the vector encoding this partitioning. Then, the observed data is a pair of matrices (A^G, B), where A^G is the adjacency matrix of the graph G and B ∈ R^{p×n} is the matrix of covariate information. Each column b_i, i ≤ n, of the matrix B contains the covariate information about vertex i. We use the following probabilistic model: conditional on v and a latent vector u ∼ N(0, I_p/p),

P(A^G_ij = 1) = c_in/n if v_i v_j = +1, and c_out/n otherwise,   (1)

b_i = √(µ/n) v_i u + Z_i/√p,   (2)

where Z_i ∈ R^p has independent standard normal entries.
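For concreteness, the observation model (1)-(2) is easy to simulate. The following sketch is our own illustrative NumPy code (function and variable names are our choices, not the authors'); it draws a sample (v, u, A^G, B) using the parametrization c_in = d + λ√d, c_out = d − λ√d of Eq. (3) below.

```python
import numpy as np

def sample_contextual_sbm(n, p, d, lam, mu, rng=None):
    """Draw (v, u, A, B) from the model of Eqs. (1)-(2).

    A is the symmetric 0/1 adjacency matrix of the graph G; B is the p x n
    covariate matrix with columns b_i = sqrt(mu/n) v_i u + Z_i / sqrt(p).
    """
    rng = np.random.default_rng(rng)
    v = rng.choice([-1.0, 1.0], size=n)          # latent labels
    u = rng.normal(size=p) / np.sqrt(p)          # u ~ N(0, I_p / p)
    cin, cout = d + lam * np.sqrt(d), d - lam * np.sqrt(d)
    # Edge probabilities: cin/n within a cluster, cout/n across clusters.
    probs = np.where(np.outer(v, v) > 0, cin / n, cout / n)
    upper = rng.random((n, n)) < probs
    A = np.triu(upper, k=1)
    A = (A + A.T).astype(float)                  # symmetric, zero diagonal
    # Covariates, Eq. (2): rank-one signal plus Gaussian noise.
    B = np.sqrt(mu / n) * np.outer(u, v) + rng.normal(size=(p, n)) / np.sqrt(p)
    return v, u, A, B
```

The average degree of the resulting graph concentrates around d, since (c_in + c_out)/2 = d.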
It is convenient to parametrize the edge probabilities by the average degree d and the normalized degree separation λ:

c_in = d + λ√d,   c_out = d − λ√d.   (3)

Here d, λ, µ are parameters of the model which, for the sake of simplicity, we assume to be fixed and known. In other words, two objects i, j in the same cluster or community are slightly more likely to be connected than two objects i, j′ in different clusters. Similarly, according to (2), objects in the same cluster have slightly positively correlated feature vectors b_i, b_j, while objects i, j′ in different clusters have negatively correlated covariates b_i, b_{j′}.

Note that this model is a combination of two observation models that have been extensively studied: the stochastic block model and the spiked covariance model. The stochastic block model has its roots in the sociology literature [HLL83] and has witnessed a resurgence of interest from the computer science and statistics community since the work of Decelle et al. [DKMZ11]. This work focused on the sparse setting where the graph has O(n) edges and conjectured, using the non-rigorous cavity method, the following phase transition phenomenon. This was later established rigorously in a series of papers [MNS15, MNS13, Mas14].

Theorem 1 ([MNS15, MNS13, Mas14]). Suppose d > 1 is fixed.
The graph G is distinguishable with high probability from an Erdős–Rényi random graph with average degree d if and only if λ > 1. Moreover, if λ > 1, there exists a polynomial-time computable estimate v̂ = v̂(A^G) ∈ {±1}ⁿ of the cluster assignment satisfying, almost surely,

lim inf_{n→∞} |⟨v̂, v⟩|/n ≥ ε(λ) > 0.   (4)

In other words, given the graph G, it is possible to non-trivially estimate the latent clustering v if, and only if, λ > 1.

The covariate model (2) was proposed by Johnstone and Lu [JL04] and has been extensively studied in statistics and random matrix theory. The weak recovery threshold was characterized by a number of authors, including Baik et al. [BBAP05], Paul [Pau07] and Onatski et al. [OMH+13].

Theorem 2 ([BBAP05, Pau07, OMH+13]). Let v̂₁ be the principal eigenvector of BᵀB, where v̂₁ is normalized so that ‖v̂₁‖₂² = n. Suppose that p, n → ∞ with p/n → 1/γ ∈ (0, ∞). Then lim inf_{n→∞} |⟨v̂₁, v⟩|/n > 0 if and only if µ > √γ. Moreover, if µ < √γ, no such estimator exists.

In other words, this theorem shows that it is possible to estimate v solely from the covariates, using in fact a spectral method, if and only if µ > √γ.

Our first result is the following prediction, which establishes the analogous threshold that smoothly interpolates between Theorems 1 and 2.

Claim 3 (Cavity prediction). Given A^G, B as in Eqs. (1), (2), assume that n, p → ∞ with p/n → 1/γ ∈ (0, ∞).
Then there exists an estimator v̂ = v̂(A^G, B) ∈ {±1}ⁿ so that lim inf |⟨v̂, v⟩|/n is bounded away from 0 if and only if

λ² + µ²/γ > 1.   (5)

We emphasize here that this claim is not rigorous; we obtain this prediction via the cavity method. The cavity method is a powerful technique from the statistical physics of mean field models [MM09]. Our instantiation of the cavity method is outlined in Section 4, along with Appendix B and D (see supplement). The cavity method is remarkably successful, and a number of its predictions have been made rigorous [MM09, Tal10]. Consequently, we view Claim 3 as a conjecture with strong positive evidence. Theorems 1 and 2 confirm the cavity prediction rigorously in the corner cases in which either λ or µ vanishes, using intricate tools from random matrix theory and sparse random graphs. Our main result confirms Claim 3 rigorously in the limit of large degrees.

Theorem 4. Suppose v is uniformly distributed in {±1}ⁿ and we observe A^G, B as in (1), (2). Consider the limit p, n → ∞ with p/n → 1/γ. Then, for some ε(λ, µ) > 0 independent of d,

lim inf_{n→∞} sup_{v̂(·)} |⟨v̂(A^G, B), v⟩|/n ≥ ε(λ, µ) − o_d(1)   if λ² + µ²/γ > 1,   (6)

lim sup_{n→∞} sup_{v̂(·)} |⟨v̂(A^G, B), v⟩|/n = o_d(1)   if λ² + µ²/γ < 1.   (7)

Here the limits hold in probability, the supremum is over estimators v̂ : (A^G, B) ↦ v̂(A^G, B) ∈ Rⁿ with ‖v̂(A^G, B)‖₂ = √n, and o_d(1) indicates a term independent of n which tends to zero as d → ∞.

In order to establish this result, we consider a modification of the original model in (1), (2), which is of independent interest. Suppose, conditional on v ∈ {±1}ⁿ and the latent vector u, we observe (A, B) as follows:
A_ij ∼ N(λ v_i v_j/n, 1/n) if i < j,   A_ij ∼ N(λ v_i v_j/n, 2/n) if i = j,   (8)

B_ai ∼ N(√µ v_i u_a/√n, 1/p).   (9)

This model differs from (1) in that the graph observation A^G is replaced by the observation A, which is equal to λvvᵀ/n corrupted by Gaussian noise. This model generalizes so-called 'rank-one deformations' of random matrices [Péc06, KY13, BGN11], as well as the Z₂ synchronization model [ABBS14, Cuc15].

Our main motivation for introducing the Gaussian observation model is that it captures the large-degree behavior of the original graph model. The next result formalizes this intuition; its proof is an immediate generalization of the Lindeberg interpolation method of [DAM16].

Theorem 5. Suppose v ∈ {±1}ⁿ is uniformly random, and u is independent. We denote by I(v; A^G, B) the mutual information of the latent random variables v and the observable data A^G, B. For all λ, µ we have that

lim_{d→∞} lim sup_{n→∞} (1/n) |I(v; A^G, B) − I(v; A, B)| = 0,   (10)

lim_{d→∞} lim sup_{n→∞} | (1/n) dI(v; A^G, B)/d(λ²) − (1/4) MMSE(v; A^G, B) | = 0,   (11)

where MMSE(v; A^G, B) = n⁻² E{‖vvᵀ − E{vvᵀ | A^G, B}‖²_F}.

For the Gaussian observation model (8), (9) we can establish a precise weak recovery threshold, which is the main technical novelty of this paper.

Theorem 6. Suppose v is uniformly distributed in {±1}ⁿ and we observe A, B as in (8), (9). Consider the limit p, n → ∞ with p/n → 1/γ.

1.
If λ² + µ²/γ < 1, then for any estimator v̂ : (A, B) ↦ v̂(A, B) with ‖v̂(A, B)‖₂ = √n, we have lim sup_{n→∞} |⟨v̂, v⟩|/n = 0.

2. If λ² + µ²/γ > 1, let v̂(A, B) be normalized so that ‖v̂(A, B)‖₂ = √n, and proportional to the maximum eigenvector of the matrix M(ξ*), where

M(ξ) = A + (2µ²/(λ²γ²ξ)) BᵀB + (ξ/2) I_n,   (12)

and ξ* = argmin_{ξ>0} λ_max(M(ξ)). Then lim inf_{n→∞} |⟨v̂, v⟩|/n > 0 in probability.

Theorem 4 is proved by using this threshold result, in conjunction with the universality result, Theorem 5.

3 Related work

The need to incorporate node information in graph clustering has long been recognized. To address the problem, diverse clustering methods have been introduced, e.g. those based on generative models [NC16, Hof03, ZVA10, YJCZ09, KL12, LM12, XKW+12, HL14, YML13], heuristic model-free approaches [BVR17, ZLZ+16, GVB12, ZCY09, NAJ03, GFRS13, DV12, CZY11, SMJZ12, SZLP16], Bayesian methods [CB10, BC11], etc. [BCMM15] surveys other clustering methods for graphs with node and edge attributes. Semi-supervised graph clustering [Pee12, EM12, ZMZ14], where labels are available for a few vertices, is also somewhat related to our line of enquiry. The literature in this domain is vast and extremely diffuse, and thus we do not attempt to provide an exhaustive survey of all related attempts in this direction.

In terms of rigorous results, [AJC14, LMX15] introduced and analyzed a model with informative edges, but they make the strong and unrealistic requirement that the label of an individual edge be uncorrelated with the labels of its endpoints, and they are only able to prove one side of their conjectured threshold.
The papers [BVR17, ZLZ+16], among others, rigorously analyze specific heuristics for clustering and provide some guarantees that ensure consistency. However, these results are not optimal. Moreover, it is possible that they only hold in the regime where using either the node covariates or the graph suffices for inference.

Several theoretical works [KMS16, MX16] analyze the performance of local algorithms in the semi-supervised setting, i.e., where the true labels are given for a small fraction of nodes. In particular, [KMS16] establishes that for the two-community sparse stochastic block model, correlated recovery is impossible given any vanishing proportion of revealed nodes. Note that this is in stark contrast to Theorem 4 (and the corresponding claim for the sparse graph model) above, which shows that high-dimensional covariate information actually shifts the information-theoretic threshold for detection and weak recovery. The analysis in [KMS16, MX16] is also local in nature, while our algorithms and their analysis go well beyond the diameter of the graph.

4 Belief propagation: algorithm and cavity prediction

Recall the model (1), (2), where we are given the data (A^G, B) and our task is to infer the latent community labels v. From a Bayesian perspective, a principled approach computes the posterior expectation with respect to the conditional distribution P(v, u | A^G, B) = P(v, u, A^G, B)/P(A^G, B). This is, however, not computationally tractable, because it requires marginalizing over v ∈ {+1, −1}ⁿ and u ∈ R^p. At this point, it becomes necessary to choose an approximate inference procedure, such as variational inference or mean field approximations [WJ+08]. In Bayesian inference problems
In Bayes inference problem\non locally-tree like graphs, belief propagation is optimal among local algorithms (see for instance\n[DM15] for an explanation of why this is the case).\na for i \u2208 [n],\nThe algorithm proceeds by computing, in an iterative fashion vertex messages \u03b7t\na \u2208 [p] and edge messages \u03b7t\ni\u2192j for all pairs (i, j) that are connected in the graph G. For a vertex i\nof G, we denote its neighborhood in G by \u2202i. Starting from an initialization (\u03b7t0, mt0 )t0=\u22121,0, we\nupdate the messages in the following linear fashion:\n\ni , mt\n\n\u03b3\n\n(cid:114) \u00b5\n(cid:114) \u00b5\n(cid:114) \u00b5\n\n\u03b3\n\n\u03b3\n\n\u03b7t+1\ni\u2192j =\n\n\u03b7t+1\ni =\n\nmt+1 =\n\n(BTmt)i \u2212 \u00b5\n\u03b3\n\n\u03b7t\u22121\ni +\n\n(BTmt)i \u2212 \u00b5\n\u03b3\n\n\u03b7t\u22121\ni +\n\n\u03bb\u221a\nd\n\n\u03bb\u221a\nd\n\nB\u03b7t \u2212 \u00b5mt\u22121.\n\n(cid:88)\n(cid:88)\n\nk\u2208\u2202i\\j\n\nk\u2208\u2202i\n\n\u221a\nk\u2192i \u2212 \u03bb\n\u03b7t\n\u221a\n\nn\n\nk\u2192i \u2212 \u03bb\n\u03b7t\n\nd\n\nd\n\n(cid:88)\n(cid:88)\n\nk\u2208[n]\n\n\u03b7t\nk,\n\n\u03b7t\nk,\n\nn\n\nk\u2208[n]\n\n(13)\n\n(14)\n\n(15)\n\nHere, and below, we will use \u03b7t = (\u03b7t\na)a\u2208[p] to denote the vectors of vertex\nmessages. After running the algorithm for some number of iterations tmax, we return, as an estimate,\nthe sign of the vertex messages \u03b7tmax\n\ni )i\u2208[n], mt = (mt\n\ni\n\n, i.e.(cid:98)vi(AG, B) = sgn(\u03b7tmax\n\ni\n\n).\n\n(16)\n\nThese update equations have a number of intuitive features. First, in the case that \u00b5 = 0, i.e. we have\nno covariate information, the edge messages become:\n\n(cid:88)\n\nk\u2208\u2202i\\j\n\n\u03b7t+1\ni\u2192j =\n\n\u03bb\u221a\nd\n\nk\u2192i \u2212 \u03bb\n\u03b7t\n\n(cid:88)\n\n\u221a\n\nd\n\nn\n\nk\u2208[n]\n\n\u03b7t\nk,\n\n(17)\n\nwhich corresponds closely to the spectral power method on the nonbacktracking walk matrix of G\n[KMM+13]. 
Conversely, when λ = 0, the update equations for m^t, η^t correspond closely to the usual power iteration used to compute the singular vectors of B.

We obtain this algorithm from belief propagation using two approximations. First, we linearize the belief propagation update equations around a certain 'zero information' fixed point. Second, we use an 'approximate message passing' version of the belief propagation updates, which results in the addition of the memory terms in Eqs. (13), (14), (15). The details of these approximations are quite standard and deferred to Appendix D. For a heuristic discussion, we refer the interested reader to the tutorials [Mon12, TKGM14] (for the Gaussian approximation) and the papers [DKMZ11, KMM+13] (for the linearization procedure).

As with belief propagation, the behavior of this iterative algorithm in the limit p, n → ∞ can be tracked using a distributional recursion called density evolution.

Definition 1 (Density evolution). Let (m̄, U) and (η̄, V) be independent random vectors such that U ∼ N(0, 1), V ∼ Uniform({±1}), and m̄, η̄ have finite variance.
Further assume that (η̄, V) ≐ (−η̄, −V) and (m̄, U) ≐ (−m̄, −U), where ≐ denotes equality in distribution.

We then define new random pairs (m̄′, U′) and (η̄′, V′), where U′ ∼ N(0, 1), V′ ∼ Uniform({±1}), (η̄′, V′) ≐ (−η̄′, −V′) and (m̄′, U′) ≐ (−m̄′, −U′), via the following distributional equations:

m̄′ | U′ ≐ µ E{V η̄} U′ + (µ E{η̄²})^{1/2} ζ₁,   (18)

η̄′ | V′=+1 ≐ (λ/√d) [ Σ_{k=1}^{k₊} η̄_k|₊ + Σ_{k=1}^{k₋} η̄_k|₋ ] − λ√d E{η̄} + (µ/γ) E{U m̄} + ((µ/γ) E{m̄²})^{1/2} ζ₂.   (19)

Here we use the notation X | Y ≐ Z to mean that the conditional distribution of X given Y is the same as the (unconditional) distribution of Z. Notice that the distribution of η̄′ | V′=−1 is determined by the last equation together with the symmetry property. Further, η̄_k|₊ and η̄_k|₋ denote independent random variables distributed, respectively, as η̄ | V=+1 and η̄ | V=−1. Finally, k₊ ∼ Poiss(d/2 + λ√d/2), k₋ ∼ Poiss(d/2 − λ√d/2), ζ₁ ∼ N(0, 1) and ζ₂ ∼ N(0, 1) are mutually independent, and independent of the previous random variables.

The density evolution map, denoted by DE, is defined as the mapping from the law of (η̄, V, m̄, U) to the law of (η̄′, V′, m̄′, U′).
With a slight abuse of notation, we will omit V, U, V′, U′, whose distribution is left unchanged, and write

(η̄′, m̄′) = DE(η̄, m̄).   (20)

The following claim is the core of the cavity prediction. It states that the density evolution recursion faithfully describes the distribution of the iterates η^t, m^t.

Claim 7. Let (η̄⁰, V), (m̄⁰, U) be random vectors satisfying the conditions of Definition 1. Define the density evolution sequence (η̄^t, m̄^t) = DE^t(η̄⁰, m̄⁰), i.e. the result of iteratively applying the mapping DE t times.

Consider the linear message passing algorithm of Eqs. (13) to (15), with the following initialization. We set (m⁰_r)_{r∈[p]} conditionally independent given u, with conditional distribution m⁰_r | u ≐ m̄⁰ |_{U = √p u_r}. Analogously, η⁰_i, η⁰_{i→j} are conditionally independent given v, with η⁰_i | v ≐ η̄⁰ |_{V = v_i} and η⁰_{i→j} | v ≐ η̄⁰ |_{V = v_i}. Finally, η^{−1}_i = η^{−1}_{i→j} = m^{−1}_r = 0 for all i, j, r. Then, as n, p → ∞ with p/n → 1/γ, the following holds for uniformly random indices i ∈ [n] and a ∈ [p]:

(m^t_a, u_a √p) ⇒ (m̄^t, U),   (21)

(η^t_i, v_i) ⇒ (η̄^t, V),   (22)

where ⇒ denotes convergence in distribution.

The following simple lemma shows the instability of the density evolution recursion.

Lemma 8. Consider the random variables (η̄′, m̄′) = DE(η̄, m̄) obtained under the density evolution mapping. Let m and m′ denote the vectors of the first two moments of (η̄, V, m̄, U) and (η̄′, V′, m̄′, U′), defined as follows:
m = (E{V η̄}, E{U m̄}, E{η̄²}, E{m̄²}),   (23)

and similarly for m′. Then, for ‖m‖₂ → 0, we have

m′ = [ λ²  µ/γ  0   0
       µ   0    0   0
       0   0    λ²  µ/γ
       0   0    µ   0  ] m + O(‖m‖²).   (24)

In particular, the linearized map m ↦ m′ at m = 0 has spectral radius larger than one if and only if λ² + µ²/γ > 1.

The interpretation of the cavity prediction and the instability lemma is as follows. If we choose an initialization (η̄⁰, V), (m̄⁰, U) with η̄⁰, m̄⁰ positively correlated with V and U, then this correlation increases exponentially over time if and only if λ² + µ²/γ > 1.⁵ In other words, a small initial correlation is amplified. While we do not have an initialization that is positively correlated with the true labels, a random initialization η⁰, m⁰ has a random correlation with v, u of order 1/√n. If λ² + µ²/γ > 1, this correlation is amplified over the iterations, yielding a nontrivial reconstruction of v. On the other hand, if λ² + µ²/γ < 1, this correlation is expected to remain small, indicating that the algorithm does not yield a useful estimate.

5 Proof overview

As mentioned above, a key step of our analysis is provided by Theorem 6, which establishes a weak recovery threshold for the Gaussian observation model of Eqs. (8), (9). The proof proceeds in two steps: first, we prove that, for λ² + µ²/γ < 1, it is impossible to distinguish between data A, B generated according to this model and data generated according to the null model µ = λ = 0. Denoting by P_{λ,µ} the law of the data A, B, this is proved via a standard second moment argument. Namely, we bound the chi-square distance uniformly in n, p,
Namely, we bound the chi square distance uniformly in n, p\n\n\u03c72(P\u03bb,\u00b5, P0,0) \u2261 E0,0\n\n\u2212 1 \u2264 C ,\n\n(25)\n\n(cid:40)(cid:18) dP\u03bb,\u00b5\n\n(cid:19)2(cid:41)\n\ndP0,0\n\nand then bound the total variation distance by the chi-squared distance (cid:107)P\u03bb,\u00b5 \u2212 P0,0(cid:107)T V \u2264 1 \u2212\n(\u03c72(P\u03bb,\u00b5, P0,0) + 1)\u22121. This in turn implies that no test can distinguish between the two hypotheses\nwith probability approaching one as n, p \u2192 \u221e. The chi-squared bound also allows to show that weak\nrecovery is impossible in the same regime.\nIn order to prove that weak recovery is possible for \u03bb2 + \u00b52/\u03b3 > 1, we consider the following\noptimization problem over x \u2208 Rn, y \u2208 Rp:\n\n(26)\n(27)\n\nmaximize (cid:104)x, Ax(cid:105) + b\u2217(cid:104)x, By(cid:105),\nsubject to (cid:107)x(cid:107)2 = (cid:107)y(cid:107)2 = 1 .\n\n\u221a\n\n(cid:98)v =\n\nwhere b\u2217 = 2\u00b5\n\n\u03bb\u03b3 . Denoting solution of this problem by ((cid:98)x, \u02c6y), we output the (soft) label estimates\nn(cid:98)x. This de\ufb01nition turns out to be equivalent to the spectral algorithm in the statement of\n\nTheorem 6, and is therefore ef\ufb01ciently computable.\nThis optimization problem undergoes a phase transition exactly at the weak recovery threshold\n\u03bb2 + \u00b52/\u03b3 = 1, as stated below.\nLemma 9. 
Denote by T = T_{n,p}(A, B) the value of the optimization problem (26).

(i) If λ² + µ²/γ < 1, then, almost surely,

lim_{n,p→∞} T_{n,p}(A, B) = 2 √(1 + b*²γ/4) + b*.   (28)

(ii) If λ, µ > 0 and λ² + µ²/γ > 1, then there exists δ = δ(λ, µ) > 0 such that, almost surely,

lim_{n,p→∞} T_{n,p}(A, B) = 2 √(1 + b*²γ/4) + b* + δ(λ, µ).   (29)

(iii) Further, define

T̃_{n,p}(δ̃; A, B) = sup_{ ‖x‖=‖y‖=1, |⟨x,v⟩| < δ̃√n } [ ⟨x, Ax⟩ + b* ⟨y, Bx⟩ ].

Then for each δ > 0 there exists δ̃ > 0 sufficiently small such that, almost surely,

lim_{n,p→∞} T̃_{n,p}(δ̃; A, B) < 2 √(1 + b*²γ/4) + b* + δ/2.   (30)

⁵Notice that both the message variance E{η̄²} and the covariance with the ground truth E{η̄V} increase; the relevant point is that the normalized correlation (covariance divided by standard deviation) increases.

Figure 1: (Left) Empirical probability of rejecting the null (lighter is higher) using the BP test. (Middle) Mean overlap |⟨v̂_BP, v⟩/n| and (right) mean covariate overlap |⟨û_BP, u⟩| attained by the BP estimate.

The first two points imply that T_{n,p}(A, B) provides a statistic that distinguishes between P_{0,0} and P_{λ,µ} with probability of error vanishing as n, p → ∞ whenever λ² + µ²/γ > 1. The third point (in conjunction with the second one) guarantees that the maximizer x̂ is positively correlated with v, and hence implies weak recovery. In fact, we prove a stronger result that provides an asymptotic expression for the value T_{n,p}(A, B) for all λ, µ.
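The estimator behind this statistic is simple to realize numerically via the spectral formulation of Theorem 6. The sketch below is our own (the grid for ξ and all parameter values are arbitrary illustrative choices): it minimizes λ_max(M(ξ)) over a grid and returns the rescaled top eigenvector of M(ξ*).

```python
import numpy as np

def spectral_estimate(A, Bmat, lam, mu, gamma, xis=None):
    """Top eigenvector of M(xi*) from Theorem 6 (illustrative sketch).

    M(xi) = A + (2 mu^2 / (lam^2 gamma^2 xi)) B^T B + (xi/2) I_n, with xi*
    chosen on a grid to minimize the top eigenvalue of M(xi).
    """
    n = A.shape[0]
    BtB = Bmat.T @ Bmat
    xis = np.linspace(0.2, 5.0, 25) if xis is None else xis
    best = None
    for xi in xis:
        M = A + (2 * mu**2 / (lam**2 * gamma**2 * xi)) * BtB + (xi / 2) * np.eye(n)
        w, V = np.linalg.eigh(M)
        if best is None or w[-1] < best[0]:   # track the minimizing xi
            best = (w[-1], V[:, -1])
    return np.sqrt(n) * best[1]               # normalized so ||v_hat||_2 = sqrt(n)
```

On data from the Gaussian model (8)-(9) with λ² + µ²/γ > 1, the returned vector has overlap with v bounded away from zero, as Theorem 6 predicts.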
We obtain the above phase transition result by specializing the resulting formula to the two regimes λ² + µ²/γ < 1 and λ² + µ²/γ > 1. We prove the asymptotic formula by Gaussian process comparison, using the Sudakov–Fernique inequality. Namely, we compare the Gaussian process appearing in the optimization problem of Eq. (26) with the following two processes:

X₁(x, y) = (λ/n) ⟨x, v₀⟩² + ⟨x, g̃ₓ⟩ + b* √(µ/n) ⟨x, v₀⟩⟨y, u₀⟩ + ⟨y, g̃ᵧ⟩,   (31)

X₂(x, y) = (λ/n) ⟨x, v₀⟩² + (1/2) ⟨x, W̃ₓ x⟩ + b* √(µ/n) ⟨x, v₀⟩⟨y, u₀⟩ + (1/2) ⟨y, W̃ᵧ y⟩,   (32)

where g̃ₓ, g̃ᵧ are isotropic Gaussian vectors with suitably chosen variances, and W̃ₓ, W̃ᵧ are GOE matrices, again with suitably chosen variances. We prove that max_{x,y} X₁(x, y) yields an upper bound on T_{n,p}(A, B), and max_{x,y} X₂(x, y) yields a lower bound on the same quantity. Note that maximizing the first process X₁(x, y) essentially reduces to solving a separable problem over the coordinates of x and y, and hence admits an explicit expression. On the other hand, maximizing the second process leads (after decoupling the term ⟨x, v₀⟩⟨y, u₀⟩) to two separate problems, one for the vector x and the other for y. Each of the two problems reduces to finding the maximum eigenvector of a rank-one deformation of a GOE matrix, a problem for which we can leverage a significant amount of information from random matrix theory. The resulting upper and lower bounds coincide asymptotically. As is often the case with Gaussian comparison arguments, the proof is remarkably compact, and somewhat surprising (it is unclear a priori that the two bounds should coincide asymptotically).
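The random matrix input used here is the classical phase transition for rank-one deformations of GOE matrices: for θ > 1, the top eigenvalue of W + θvvᵀ/n concentrates around θ + 1/θ rather than the bulk edge 2, and the top eigenvector has squared overlap about 1 − θ⁻² with v. A quick numerical check of this standard fact (our own sketch):

```python
import numpy as np

def top_eig_rank_one_goe(n, theta, rng=None):
    """lambda_max and eigenvector overlap for W + theta * v v^T / n, W ~ GOE(n)."""
    rng = np.random.default_rng(rng)
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2 * n)          # GOE: off-diagonal variance 1/n
    v = rng.choice([-1.0, 1.0], size=n)
    w, V = np.linalg.eigh(W + theta * np.outer(v, v) / n)
    # top eigenvalue, and overlap of top eigenvector with v (both in [0, 1])
    return w[-1], abs(V[:, -1] @ v) / np.sqrt(n)
```

For θ = 2 one expects λ_max ≈ 2.5 and overlap ≈ √(1 − 1/4) ≈ 0.87, already at moderate n.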
While upper bounds by processes of the type of $X_1(x,y)$ are quite common in random matrix theory, we think that the lower bound by $X_2(x,y)$ (which is crucial for proving our main theorem) is novel and might have interesting generalizations.

6 Experiments

We demonstrate the efficacy of the full belief propagation algorithm, restated below:
$$\eta^{t+1}_i = \sqrt{\frac{\mu}{\gamma}}\sum_{q\in[p]} B_{qi} m^t_q - \frac{\mu}{\gamma}\Big(\sum_{q\in[p]} \frac{B_{qi}^2}{\tau^t_q}\Big)\tanh(\eta^{t-1}_i) + \sum_{k\in\partial i} f(\eta^t_{k\to i};\rho) - \sum_{k\in[n]} f(\eta^t_k;\rho_n) \,, \qquad (33)$$
$$\eta^{t+1}_{i\to j} = \sqrt{\frac{\mu}{\gamma}}\sum_{q\in[p]} B_{qi} m^t_q - \frac{\mu}{\gamma}\Big(\sum_{q\in[p]} \frac{B_{qi}^2}{\tau^t_q}\Big)\tanh(\eta^{t-1}_i) + \sum_{k\in\partial i\setminus j} f(\eta^t_{k\to i};\rho) - \sum_{k\in[n]} f(\eta^t_k;\rho_n) \,, \qquad (34)$$
$$m^{t+1}_q = \sqrt{\mu/\gamma}\,\sum_{j\in[n]} B_{qj}\tanh(\eta^t_j) - \frac{\mu}{\gamma\,\tau^{t+1}_q}\Big(\sum_{j\in[n]} B_{qj}^2\,\mathrm{sech}^2(\eta^t_j)\Big)\, m^{t-1}_q \,, \qquad (35)$$
$$\tau^{t+1}_q = \Big(1 + \mu - \frac{\mu}{\gamma}\sum_{j\in[n]} B_{qj}^2\,\mathrm{sech}^2(\eta^t_j)\Big)^{-1} \,. \qquad (36)$$

Here the function $f(\,\cdot\,;\rho)$ and the parameters $\rho, \rho_n$ are defined as:
$$f(z;\rho) \equiv \frac{1}{2}\log\Big(\frac{\cosh(z+\rho)}{\cosh(z-\rho)}\Big) \,, \qquad (37)$$
$$\rho \equiv \tanh^{-1}\big(\lambda/\sqrt{d}\big) \,, \qquad (38)$$
$$\rho_n \equiv \tanh^{-1}\Big(\frac{\lambda\sqrt{d}}{n-d}\Big) \,. \qquad (39)$$

We refer the reader to Appendix D for a derivation of the algorithm.
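The nonlinearity of Eq. (37) admits a convenient closed form, $f(z;\rho) = \tanh^{-1}\!\big(\tanh(\rho)\tanh(z)\big)$, so that for small messages $f(z;\rho) \approx \tanh(\rho)\,z$, which is the linearization behind the BP algorithm of Section 4. The snippet below checks both facts numerically (the values $\lambda = 0.8$, $d = 5$ are arbitrary illustrative choices):

```python
import numpy as np

def f(z, rho):
    """Eq. (37): f(z; rho) = (1/2) log( cosh(z + rho) / cosh(z - rho) )."""
    return 0.5 * (np.log(np.cosh(z + rho)) - np.log(np.cosh(z - rho)))

# rho = atanh(lambda / sqrt(d)) as in Eq. (38); lambda = 0.8, d = 5 chosen arbitrarily
rho = np.arctanh(0.8 / np.sqrt(5.0))
z = np.linspace(-3.0, 3.0, 61)

# closed form: f(z; rho) = atanh(tanh(rho) * tanh(z))
closed = np.arctanh(np.tanh(rho) * np.tanh(z))
print(np.max(np.abs(f(z, rho) - closed)))   # agreement to machine precision

# small-z linearization f(z; rho) ~ tanh(rho) * z
print(f(0.01, rho), np.tanh(rho) * 0.01)
```

The closed form also makes clear that $f$ is bounded by $\rho$ in absolute value, so the graph messages remain controlled at every iteration.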
As demonstrated in Appendix D, the BP algorithm in Section 4 is obtained by linearizing the above in $\eta$.

In our experiments, we perform 100 Monte Carlo runs of the following process:

1. Sample $A_G$, $B$ from $P_{\lambda,\mu}$ with $n = 800$, $p = 1000$, $d = 5$.
2. Run the BP algorithm for $T = 50$ iterations with random initialization $\eta^0_i, \eta^{-1}_i, m^0_a, m^{-1}_a \sim_{iid} N(0, 0.01)$, yielding vertex and covariate iterates $\eta^T \in \mathbb{R}^n$, $m^T \in \mathbb{R}^p$.
3. Reject the null hypothesis if $\|\eta^T\|_2 > \|\eta^0\|_2$, else accept the null.
4. Return estimates $\hat v^{\rm BP}_i = \mathrm{sgn}(\eta^T_i)$, $\hat u^{\rm BP}_a = m^T_a/\|m^T\|_2$.

Figure 1 (left) shows empirical probabilities of rejecting the null for $(\lambda, \mu) \in [0,1] \times [0, \sqrt{\gamma}]$. The next two plots display the mean overlap $|\langle \hat v^{\rm BP}, v\rangle/n|$ and $\langle \hat u^{\rm BP}, u\rangle/\|u\|$ achieved by the BP estimates (lighter is higher overlap). Below the theoretical curve (red) given by $\lambda^2 + \mu^2/\gamma = 1$, the null hypothesis is accepted and the estimates show negligible correlation with the truth. These results are in excellent agreement with our theory. Importantly, while our rigorous result holds only in the limit of diverging $d$, the simulations show agreement already for $d = 5$. This lends further credence to the cavity prediction of Claim 3.

Acknowledgements

A.M. was partially supported by grants NSF DMS-1613091, NSF CCF-1714305 and NSF IIS-1741162.
E.M. was partially supported by grants NSF DMS-1737944 and ONR N00014-17-1-2598. Y.D. would like to acknowledge Nilesh Tripuraneni for discussions about this paper.

References

[Abb17] Emmanuel Abbe, Community detection and stochastic block models: recent developments, arXiv preprint arXiv:1703.10146 (2017).

[ABBS14] Emmanuel Abbe, Afonso S. Bandeira, Annina Bracher, and Amit Singer, Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery, IEEE Transactions on Network Science and Engineering 1 (2014), no. 1, 10–22.

[AG05] Lada A. Adamic and Natalie Glance, The political blogosphere and the 2004 US election: divided they blog, Proceedings of the 3rd International Workshop on Link Discovery, ACM, 2005, pp. 36–43.

[AJC14] Christopher Aicher, Abigail Z. Jacobs, and Aaron Clauset, Learning latent block structure in weighted networks, Journal of Complex Networks 3 (2014), no. 2, 221–248.

[BBAP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, Annals of Probability (2005), 1643–1697.

[BC11] Ramnath Balasubramanyan and William W. Cohen, Block-LDA: Jointly modeling entity-annotated text and entity-entity links, Proceedings of the 2011 SIAM International Conference on Data Mining, SIAM, 2011, pp. 450–461.

[BCMM15] Cécile Bothorel, Juan David Cruz, Matteo Magnani, and Barbora Micenkova, Clustering attributed graphs: models, measures and methods, Network Science 3 (2015), no. 3, 408–444.

[BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi, The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices, Advances in Mathematics 227 (2011), no. 1, 494–521.

[BVR17] Norbert Binkiewicz, Joshua T. Vogelstein, and Karl Rohe, Covariate-assisted spectral clustering, Biometrika 104 (2017), no. 2, 361–377.

[CB10] Jonathan Chang and David M. Blei, Hierarchical relational models for document networks, The Annals of Applied Statistics (2010), 124–150.

[CDMF+09] Mireille Capitaine, Catherine Donati-Martin, Delphine Féral, et al., The largest eigenvalues of finite rank deformation of large Wigner matrices: convergence and nonuniversality of the fluctuations, The Annals of Probability 37 (2009), no. 1, 1–47.

[CDPF+17] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq, Algorithmic decision making and the cost of fairness, Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 797–806.

[CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate, Differentially private empirical risk minimization, Journal of Machine Learning Research 12 (2011), no. Mar, 1069–1109.

[Cuc15] Mihai Cucuringu, Synchronization over Z_2 and community detection in signed multiplex networks with constraints, Journal of Complex Networks 3 (2015), no. 3, 469–506.

[CZY11] Hong Cheng, Yang Zhou, and Jeffrey Xu Yu, Clustering large attributed graphs: A balance between structural and attribute similarities, ACM Transactions on Knowledge Discovery from Data (TKDD) 5 (2011), no. 2, 12.

[DAM16] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari, Asymptotic mutual information for the balanced binary stochastic block model, Information and Inference: A Journal of the IMA 6 (2016), no. 2, 125–170.

[DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová, Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications, Physical Review E 84 (2011), no. 6, 066106.

[DM15] Yash Deshpande and Andrea Montanari, Finding hidden cliques of size √(N/e) in nearly linear time, Foundations of Computational Mathematics 15 (2015), no. 4, 1069–1128.

[DV12] T. A. Dang and Emmanuel Viennet, Community detection based on structural and attribute similarities, International Conference on Digital Society (ICDS), 2012, pp. 7–12.

[EM12] Eric Eaton and Rachael Mansbach, A spin-glass model for semi-supervised community detection, AAAI, 2012, pp. 900–906.

[GFRS13] Stephan Günnemann, Ines Färber, Sebastian Raubach, and Thomas Seidl, Spectral subspace clustering for graphs with feature vectors, 2013 IEEE 13th International Conference on Data Mining (ICDM), IEEE, 2013, pp. 231–240.

[GSV05] D. Guo, S. Shamai, and S. Verdú, Mutual information and minimum mean-square error in Gaussian channels, IEEE Transactions on Information Theory 51 (2005), 1261–1282.

[GVB12] Jaume Gibert, Ernest Valveny, and Horst Bunke, Graph embedding in vector spaces by node attribute statistics, Pattern Recognition 45 (2012), no. 9, 3072–3083.

[HL14] Tuan-Anh Hoang and Ee-Peng Lim, On joint modeling of topical communities and personal interest in microblogs, International Conference on Social Informatics, Springer, 2014, pp. 1–16.

[HLL83] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt, Stochastic blockmodels: First steps, Social Networks 5 (1983), no. 2, 109–137.

[Hof03] Peter D. Hoff, Random effects models for network data, 2003.

[JL04] Iain M. Johnstone and Arthur Yu Lu, Sparse principal components analysis, unpublished manuscript (2004).

[KGB+12] Virendra Kumar, Yuhua Gu, Satrajit Basu, Anders Berglund, Steven A. Eschrich, Matthew B. Schabath, Kenneth Forster, Hugo J. W. L. Aerts, Andre Dekker, David Fenstermacher, et al., Radiomics: the process and the challenges, Magnetic Resonance Imaging 30 (2012), no. 9, 1234–1248.

[KL12] Myunghwan Kim and Jure Leskovec, Latent multi-group membership graph model, arXiv preprint arXiv:1205.4546 (2012).

[KMM+13] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang, Spectral redemption in clustering sparse networks, Proceedings of the National Academy of Sciences 110 (2013), no. 52, 20935–20940.

[KMS16] Varun Kanade, Elchanan Mossel, and Tselil Schramm, Global and local information in clustering labeled block models, IEEE Transactions on Information Theory 62 (2016), no. 10, 5906–5917.

[KY13] Antti Knowles and Jun Yin, The isotropic semicircle law and deformation of Wigner matrices, Communications on Pure and Applied Mathematics 66 (2013), no. 11, 1663–1749.

[LM12] Jure Leskovec and Julian J. McAuley, Learning to discover social circles in ego networks, Advances in Neural Information Processing Systems, 2012, pp. 539–547.

[LMX15] Marc Lelarge, Laurent Massoulié, and Jiaming Xu, Reconstruction in the labelled stochastic block model, IEEE Transactions on Network Science and Engineering 2 (2015), no. 4, 152–163.

[Mas14] Laurent Massoulié, Community detection thresholds and the weak Ramanujan property, Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, ACM, 2014, pp. 694–703.

[MM09] M. Mézard and A. Montanari, Information, Physics and Computation, Oxford, 2009.

[MNS13] Elchanan Mossel, Joe Neeman, and Allan Sly, A proof of the block model threshold conjecture, Combinatorica (2013), 1–44.

[MNS15] ___, Reconstruction and estimation in the planted partition model, Probability Theory and Related Fields 162 (2015), no. 3-4, 431–461.

[Mon12] A. Montanari, Graphical models concepts in compressed sensing, Compressed Sensing: Theory and Applications (Y. C. Eldar and G. Kutyniok, eds.), Cambridge University Press, 2012.

[MRZ15] Andrea Montanari, Daniel Reichman, and Ofer Zeitouni, On the limitation of spectral methods: From the Gaussian hidden clique problem to rank-one perturbations of Gaussian tensors, Advances in Neural Information Processing Systems, 2015, pp. 217–225.

[MX16] Elchanan Mossel and Jiaming Xu, Local algorithms for block models with side information, Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ACM, 2016, pp. 71–80.

[NAJ03] Jennifer Neville, Micah Adler, and David Jensen, Clustering relational data using attribute and link information, Proceedings of the Text Mining and Link Analysis Workshop, 18th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 2003, pp. 9–15.

[NC16] Mark E. J. Newman and Aaron Clauset, Structure and inference in annotated networks, Nature Communications 7 (2016), 11863.

[NJW02] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss, On spectral clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems, 2002, pp. 849–856.

[OMH+13] Alexei Onatski, Marcelo J. Moreira, Marc Hallin, et al., Asymptotic power of sphericity tests for high-dimensional data, The Annals of Statistics 41 (2013), no. 3, 1204–1231.

[Pau07] Debashis Paul, Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statistica Sinica 17 (2007), no. 4, 1617.

[Péc06] Sandrine Péché, The largest eigenvalue of small rank perturbations of Hermitian random matrices, Probability Theory and Related Fields 134 (2006), no. 1, 127–173.

[Pee12] Leto Peel, Supervised blockmodelling, arXiv preprint arXiv:1209.5561 (2012).

[SMJZ12] Arlei Silva, Wagner Meira Jr., and Mohammed J. Zaki, Mining attribute-structure correlated patterns in large attributed graphs, Proceedings of the VLDB Endowment 5 (2012), no. 5, 466–477.

[SZLP16] Laura M. Smith, Linhong Zhu, Kristina Lerman, and Allon G. Percus, Partitioning networks with node attributes by compressing information flow, ACM Transactions on Knowledge Discovery from Data (TKDD) 11 (2016), no. 2, 15.

[Tal10] M. Talagrand, Mean Field Models for Spin Glasses: Volume I, Springer-Verlag, Berlin, 2010.

[TKGM14] Eric W. Tramel, Santhosh Kumar, Andrei Giurgiu, and Andrea Montanari, Statistical estimation: From denoising to sparse regression and hidden cliques, arXiv preprint arXiv:1409.5557 (2014).

[VL07] Ulrike von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (2007), no. 4, 395–416.

[WJ+08] Martin J. Wainwright, Michael I. Jordan, et al., Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning 1 (2008), no. 1–2, 1–305.

[XKW+12] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng, A model-based approach to attributed graph clustering, Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ACM, 2012, pp. 505–516.

[YJCZ09] Tianbao Yang, Rong Jin, Yun Chi, and Shenghuo Zhu, Combining link and content for community detection: a discriminative approach, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 927–936.

[YML13] Jaewon Yang, Julian McAuley, and Jure Leskovec, Community detection in networks with node attributes, 2013 IEEE 13th International Conference on Data Mining (ICDM), IEEE, 2013, pp. 1151–1156.

[ZCY09] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu, Graph clustering based on structural/attribute similarities, Proceedings of the VLDB Endowment 2 (2009), no. 1, 718–729.

[ZLZ+16] Yuan Zhang, Elizaveta Levina, Ji Zhu, et al., Community detection in networks with node features, Electronic Journal of Statistics 10 (2016), no. 2, 3153–3178.

[ZMZ14] Pan Zhang, Cristopher Moore, and Lenka Zdeborová, Phase transitions in semisupervised clustering of sparse networks, Physical Review E 90 (2014), no. 5, 052802.

[ZVA10] Hugo Zanghi, Stevenn Volant, and Christophe Ambroise, Clustering based on random graph model embedding vertex features, Pattern Recognition Letters 31 (2010), no. 9, 830–836.