{"title": "Streaming, Memory Limited Algorithms for Community Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 3167, "page_last": 3175, "abstract": "In this paper, we consider sparse networks consisting of a finite number of non-overlapping communities, i.e. disjoint clusters, so that there is higher density within clusters than across clusters. Both the intra- and inter-cluster edge densities vanish when the size of the graph grows large, making the cluster reconstruction problem nosier and hence difficult to solve. We are interested in scenarios where the network size is very large, so that the adjacency matrix of the graph is hard to manipulate and store. The data stream model in which columns of the adjacency matrix are revealed sequentially constitutes a natural framework in this setting. For this model, we develop two novel clustering algorithms that extract the clusters asymptotically accurately. The first algorithm is {\\it offline}, as it needs to store and keep the assignments of nodes to clusters, and requires a memory that scales linearly with the network size. The second algorithm is {\\it online}, as it may classify a node when the corresponding column is revealed and then discard this information. This algorithm requires a memory growing sub-linearly with the network size. To construct these efficient streaming memory-limited clustering algorithms, we first address the problem of clustering with partial information, where only a small proportion of the columns of the adjacency matrix is observed and develop, for this setting, a new spectral algorithm which is of independent interest.", "full_text": "Streaming, Memory Limited Algorithms for\n\nCommunity Detection\n\nSe-Young. 
Yun\nMSR-Inria\n23 Avenue d'Italie, Paris 75013\nseyoung.yun@inria.fr\n\nMarc Lelarge *\nInria & ENS\n23 Avenue d'Italie, Paris 75013\nmarc.lelarge@ens.fr\n\nAlexandre Proutiere †\nKTH, EE School / ACL\nOsquldasv. 10, Stockholm 100-44, Sweden\nalepro@kth.se\n\nAbstract\n\nIn this paper, we consider sparse networks consisting of a finite number of non-overlapping communities, i.e. disjoint clusters, so that there is higher density within clusters than across clusters. Both the intra- and inter-cluster edge densities vanish when the size of the graph grows large, making the cluster reconstruction problem noisier and hence difficult to solve. We are interested in scenarios where the network size is very large, so that the adjacency matrix of the graph is hard to manipulate and store. The data stream model in which columns of the adjacency matrix are revealed sequentially constitutes a natural framework in this setting. For this model, we develop two novel clustering algorithms that extract the clusters asymptotically accurately. The first algorithm is offline, as it needs to store and keep the assignments of nodes to clusters, and requires a memory that scales linearly with the network size. The second algorithm is online, as it may classify a node when the corresponding column is revealed and then discard this information. This algorithm requires a memory growing sub-linearly with the network size. 
To construct these efficient streaming memory-limited clustering algorithms, we first address the problem of clustering with partial information, where only a small proportion of the columns of the adjacency matrix is observed and develop, for this setting, a new spectral algorithm which is of independent interest.\n\n1 Introduction\n\nExtracting clusters or communities in networks has numerous applications and constitutes a fundamental task in many disciplines, including social science, biology, and physics. Most methods for clustering networks assume that pairwise \u201cinteractions\u201d between nodes can be observed, and that from these observations, one can construct a graph which is then partitioned into clusters. The resulting graph partitioning problem can typically be solved using spectral methods [1, 3, 5, 6, 12], compressed sensing and matrix completion ideas [2, 4], or other techniques [10].\nA popular model and benchmark to assess the performance of clustering algorithms is the Stochastic Block Model (SBM) [9], also referred to as the planted partition model.\n\n* Work performed as part of the MSR-INRIA joint research centre. M.L. acknowledges the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-11-JS02-005-01 (GAP project).\n† A. Proutiere's research is supported by the ERC FSA grant, and the SSF ICT-Psi project.\n\nIn the SBM, it is assumed that the graph to partition has been generated randomly, by placing an edge between two nodes with probability p if the nodes belong to the same cluster, and with probability q otherwise, with q < p. The parameters p and q typically depend on the network size n, and they are often assumed to tend to 0 as n grows large, making the graph sparse. This model has attracted a lot of attention recently. We know for example that there is a phase transition threshold for the value of (p\u2212q)^2/(p+q). 
If\nwe are below the threshold, no algorithm can perform better than the algorithm randomly assigning\nnodes to clusters [7, 14], and if we are above the threshold, it becomes indeed possible to beat the\nnaive random assignment algorithm [11]. A necessary and suf\ufb01cient condition on p and q for the\nexistence of clustering algorithms that are asymptotically accurate (meaning that the proportion of\nmisclassi\ufb01ed nodes tends to 0 as n grows large) has also been identi\ufb01ed [15]. We \ufb01nally know that\nspectral algorithms can reconstruct the clusters asymptotically accurately as soon as this is at all\npossible, i.e., they are in a sense optimal.\nWe focus here on scenarios where the network size can be extremely large (online social and bio-\nlogical networks can, already today, easily exceed several hundreds of millions of nodes), so that\nthe adjacency matrix A of the corresponding graph can become dif\ufb01cult to manipulate and store.\nWe revisit network clustering problems under memory constraints. Memory limited algorithms are\nrelevant in the streaming data model, where observations (i.e. parts of the adjacency matrix) are\ncollected sequentially. We assume here that the columns of the adjacency matrix A are revealed\none by one to the algorithm. An arriving column may be stored, but the algorithm cannot request it\nlater on if it was not stored. The objective of this paper is to determine how the memory constraints\nand the data streaming model affect the fundamental performance limits of clustering algorithms,\nand how the latter should be modi\ufb01ed to accommodate these restrictions. Again to address these\nquestions, we use the stochastic block model as a performance benchmark. 
Surprisingly, we establish that when there exists an algorithm with unlimited memory that asymptotically reconstructs the clusters accurately, then we can devise an asymptotically accurate algorithm that requires a memory scaling linearly in the network size n, except if the graph is extremely sparse. This claim is proved for the SBM with parameters p = a f(n)/n and q = b f(n)/n, with constants a > b, under the assumption that log n ≪ f(n). For this model, unconstrained algorithms can accurately recover the clusters as soon as f(n) = ω(1) [15], so that the gap between memory-limited and unconstrained algorithms is rather narrow. We further prove that the proposed algorithm reconstructs the clusters accurately before collecting all the columns of the matrix A, i.e., it uses less than one pass on the data. We also propose an online streaming algorithm with sublinear memory requirement. This algorithm outputs the partition of the graph in an online fashion after a group of columns arrives. Specifically, if f(n) = n^α with 0 < α < 1, our algorithm requires as little as n^β memory with β > max(1 \u2212 α, 2/3). To the best of our knowledge, our algorithm is the first sublinear streaming algorithm for community detection. Although streaming algorithms for clustering data streams have been analyzed [8], the focus in this theoretical computer science literature is on worst case graphs and on approximation performance, which is quite different from ours.\nTo construct efficient streaming memory-limited clustering algorithms, we first address the problem of clustering with partial information. More precisely, we assume that a proportion γ (that may depend on n) of the columns of A is available, and we wish to classify the nodes corresponding to these columns, i.e., the observed nodes. 
We show that a necessary and sufficient condition for the existence of asymptotically accurate algorithms is √γ f(n) = ω(1). We also show that to classify the observed nodes efficiently, a clustering algorithm must exploit the information provided by the edges between observed and unobserved nodes. We propose such an algorithm, which in turn, constitutes a critical building block in the design of memory-limited clustering schemes.\nTo our knowledge, this paper is the first to address the problem of community detection in the streaming model, and with memory constraints. Note that PCA has been recently investigated in the streaming model and with limited memory [13]. Our model is different, and to obtain efficient clustering algorithms, we need to exploit its structure.\n\n2 Models and Problem Formulation\n\nWe consider a network consisting of a set V of n nodes. V admits a hidden partition into K non-overlapping subsets V1, . . . , VK, i.e., V = V1 ∪ · · · ∪ VK. The size of community or cluster Vk is αk n for some αk > 0. Without loss of generality, let α1 ≤ α2 ≤ · · · ≤ αK. We assume that when the network size n grows large, the number of communities K and their relative sizes are kept fixed. To recover the hidden partition, we have access to an n × n symmetric random binary matrix A whose entries are independent and satisfy: for all v, w ∈ V, P[Avw = 1] = p if v and w are in the same cluster, and P[Avw = 1] = q otherwise, with q < p. This corresponds to the celebrated Stochastic Block Model (SBM). If Avw = 1, we say that nodes v and w are connected, or that there is an edge between v and w. p and q typically depend on the network size n. To simplify the presentation, we assume that there exists a function f(n) and two constants a > b such that p = a f(n)/n and q = b f(n)/n. 
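As an illustration of this generative model, the following minimal Python sketch samples an SBM adjacency matrix with the paper's scaling p = a f(n)/n and q = b f(n)/n. The function name, parameter values, and helper layout are our own illustrative choices, not code from the paper:

```python
import numpy as np

def sample_sbm(n, sizes_frac, a, b, f_n, rng=None):
    """Sample a symmetric SBM adjacency matrix with zero diagonal.

    sizes_frac: relative cluster sizes (alpha_1, ..., alpha_K) summing to 1.
    Intra-cluster edge probability p = a*f(n)/n, inter-cluster q = b*f(n)/n.
    """
    rng = np.random.default_rng(rng)
    p, q = a * f_n / n, b * f_n / n
    # assign node v to cluster k according to the prescribed proportions
    bounds = np.floor(np.cumsum(sizes_frac) * n).astype(int)
    labels = np.searchsorted(bounds, np.arange(n), side='right')
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)
    # draw independent upper-triangular entries, then symmetrize
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(np.int8)
    return A, labels

A, labels = sample_sbm(200, [0.5, 0.5], a=5.0, b=1.0, f_n=np.log(200))
```

Columns of such a matrix, revealed one by one, are exactly the observations assumed in the streaming model of Problem 2.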
This assumption on the specific scaling of p and q is not crucial, and most of the results derived in this paper hold for more general p and q (as can be seen in the proofs). For an algorithm π, we denote by επ(n) the proportion of nodes that are misclassified by this algorithm. We say that π is asymptotically accurate if lim_{n→∞} E[επ(n)] = 0. Note that in our setting, if f(n) = O(1), there is a non-vanishing fraction of isolated nodes for which no algorithm will perform better than a random guess. In particular, no algorithm can be asymptotically accurate. Hence, we assume that f(n) = ω(1), which constitutes a necessary condition for the graph to be asymptotically connected, i.e., for the largest connected component to have size n \u2212 o(n).\nIn this paper, we address the problem of reconstructing the clusters from specific observed entries of A, and under some constraints related to the memory available to process the data and to the way observations are revealed and stored. More precisely, we consider the two following problems.\nProblem 1. Clustering with partial information. We first investigate the problem of detecting communities under the assumption that the matrix A is partially observable. More precisely, we assume that a proportion γ (that typically depends on the network size n) of the columns of A is known. The γn observed columns are selected uniformly at random among all columns of A. Given these observations, we wish to determine the set of parameters γ and f(n) such that there exists an asymptotically accurate clustering algorithm.\nProblem 2. Clustering in the streaming model and under memory constraints. We are interested here in scenarios where the matrix A cannot be stored entirely, and restrict our attention to algorithms that require a memory of less than M bits. 
Ideally, we would like to devise an asymptotically accurate clustering algorithm that requires a memory M scaling linearly or sub-linearly with the network size n. In the streaming model, we assume that at each time t = 1, . . . , n, we observe a column Av of A uniformly distributed over the set of columns that have not been observed before t. The column Av may be stored at time t, but we cannot request it later on if it has not been explicitly stored. The problem is to design a clustering algorithm π such that in the streaming model, π is asymptotically accurate, and requires less than M bits of memory. We distinguish offline clustering algorithms that must store the mapping between all nodes and their clusters (here M has to scale linearly with n), and online algorithms that may classify the nodes when the corresponding columns are observed, and then discard this information (here M could scale sub-linearly with n).\n\n3 Clustering with Partial Information\n\nIn this section, we solve Problem 1. In what follows, we assume that γn = ω(1), which simply means that the number of observed columns of A grows large when n tends to ∞. However we are typically interested in scenarios where the proportion of observed columns γ tends to 0 as the network size grows large. Let (Av, v ∈ V(g)) denote the observed columns of A. V(g) is referred to as the set of green nodes and we denote by n(g) = γn the number of green nodes. V(r) = V \\ V(g) is referred to as the set of red nodes. Note that we have no information about the connections among the red nodes. For any k = 1, . . . , K, let V(g)_k = V(g) ∩ Vk, and V(r)_k = V(r) ∩ Vk. We say that a clustering algorithm π classifies the green nodes asymptotically accurately if the proportion of misclassified green nodes, denoted by επ(n(g)), tends to 0 as the network size n grows large.\n\n3.1 Necessary Conditions for Accurate Detection\n\nWe first derive necessary conditions for the existence of asymptotically accurate clustering algorithms. As is usual in this setting, the hardest model to estimate (from a statistical point of view) corresponds to the case of two clusters of equal sizes (see Remark 3 below). Hence, we state our information theoretic lower bounds, Theorems 1 and 2, for the special case where K = 2 and α1 = α2. Theorem 1 states that if the proportion of observed columns γ is such that √γ f(n) tends to 0 as n grows large, then no clustering algorithm can perform better than the naive algorithm that assigns nodes to clusters randomly.\n\nTheorem 1 Assume that √γ f(n) = o(1). Then under any clustering algorithm π, the expected proportion of misclassified green nodes tends to 1/2 as n grows large, i.e., lim_{n→∞} E[επ(n(g))] = 1/2.\n\nTheorem 2 (i) shows that this condition is tight in the sense that as soon as there exists a clustering algorithm that classifies the green nodes asymptotically accurately, then we need to have √γ f(n) = ω(1). Although we do not observe the connections among red nodes, we might ask to classify these nodes through their connection patterns with green nodes. 
Theorem 2 (ii) shows that this is possible only if γ f(n) tends to infinity as n grows large.\n\nTheorem 2 (i) If there exists a clustering algorithm that classifies the green nodes asymptotically accurately, then we have: √γ f(n) = ω(1).\n(ii) If there exists an asymptotically accurate clustering algorithm (i.e., classifying all nodes asymptotically accurately), then we have: γ f(n) = ω(1).\n\nRemark 3 Theorems 1 and 2 might appear restrictive as they only deal with the case of two clusters of equal sizes. This is not the case, as we will provide in the next section an algorithm achieving the bounds of Theorem 2 (i) and (ii) for the general case (with a finite number K of clusters of possibly different sizes). In other words, Theorems 1 and 2 translate directly into minimax lower bounds thanks to the results we obtain in Section 3.2.\n\nNote that as soon as γ f(n) = ω(1) (i.e. the mean degree in the observed graph tends to infinity), then standard spectral methods applied to the square matrix A(g) = (Avw, v, w ∈ V(g)) will allow us to classify the green nodes asymptotically accurately, i.e., taking into account only the graph induced by the green vertices is sufficient. However if γ f(n) = o(1), then no algorithm based on the induced graph only will be able to classify the green nodes. Theorem 2 shows that in the range of parameters 1/f(n)^2 ≪ γ ≪ 1/f(n), it is impossible to cluster the red nodes asymptotically accurately, but the question of clustering the green nodes is left open.\n\n3.2 Algorithms\n\nIn this section, we deal with the general case and assume that the number K of clusters (of possibly different sizes) is known. There are two questions of interest: clustering green and red nodes. 
It seems intuitive that red nodes can be classified only if we are able to first classify green nodes. Indeed, as we will see below, once the green nodes have been classified, an easy greedy rule is optimal for the red nodes.\nClassifying green nodes. Our algorithm to classify green nodes relies on spectral methods. Note that as suggested above, in the regime 1/f(n)^2 ≪ γ ≪ 1/f(n), any efficient algorithm needs to exploit the observed connections between green and red nodes. We construct such an algorithm below. We should stress that our algorithm does not require knowing or estimating γ or f(n).\nWhen, from the observations, a red node w ∈ V(r) is connected to at most a single green node, i.e., Σ_{v∈V(g)} Avw ≤ 1, this red node is useless in the classification of green nodes. On the contrary, when a red node is connected to two green nodes, say v1 and v2 (Av1w = 1 = Av2w), we may infer that the green nodes v1 and v2 are likely to be in the same cluster. In this case, we say that there is an indirect edge between v1 and v2.\nTo classify the green nodes, we will use the matrix A(g) = (Avw)_{v,w∈V(g)}, as well as the graph of indirect edges. However this graph is statistically different from the graphs arising in the classical stochastic block model. Indeed, when a red node is connected to three or more green nodes, the indirect edges between these green nodes are not statistically independent. To circumvent this difficulty, we only consider indirect edges created through red nodes connected to exactly two green nodes. Let V(i) = {v ∈ V(r) : Σ_{w∈V(g)} Awv = 2}. We denote by A' the (n(g) × n(g)) matrix reporting the number of such indirect edges between pairs of green nodes: for all v, w ∈ V(g), A'vw = Σ_{z∈V(i)} Avz Awz.\n\nAlgorithm 1 Spectral method with indirect edges\nInput: A ∈ {0,1}^{|V|×|V(g)|}, V, V(g), K\nV(r) ← V \\ V(g)\nA(g) ← (Avw)_{v,w∈V(g)} and V(i) ← {v ∈ V(r) : Σ_{w∈V(g)} Awv = 2}\nA' ← (A'vw)_{v,w∈V(g)} with A'vw = Σ_{z∈V(i)} Avz Awz\np̂(g) ← Σ_{v,w∈V(g)} A(g)vw / |V(g)|^2 and p̂' ← Σ_{v,w∈V(g)} A'vw / |V(g)|^2\n(Q(g), σ(g)_K, Γ(g)) ← Approx(A(g), p̂(g), V(g), K) and (Q', σ'_K, Γ') ← Approx(A', p̂', V(g), K)\nif (σ(g)_K / √(|V(g)| p̂(g))) · 1{|V(g)| p̂(g) ≥ 50} ≥ (σ'_K / √(|V(g)| p̂')) · 1{|V(g)| p̂' ≥ 50} then\n  (Sk)_{1≤k≤K} ← Detection(Q(g), Γ(g), K)\n  Randomly place nodes in V(g) \\ Γ(g) to partitions (Sk)_{k=1,...,K}\nelse\n  (Sk)_{1≤k≤K} ← Detection(Q', Γ', K)\n  Randomly place nodes in V(g) \\ Γ' to partitions (Sk)_{k=1,...,K}\nend if\nOutput: (Sk)_{1≤k≤K}\n\nOur algorithm to classify the green nodes consists in the following steps:\nStep 1. Construct the indirect edge matrix A' using red nodes connected to two green nodes only.\nStep 2. Perform a spectral analysis of matrices A(g) and A' as follows: first trim A(g) and A' (to remove nodes with too many connections), then extract their K largest eigenvalues and the corresponding eigenvectors.\nStep 3. Select the matrix A(g) or A' with the largest normalized K-th largest eigenvalue.\nStep 4. Construct the K clusters V(g)_1, . . . , V(g)_K based on the eigenvectors of the matrix selected in the previous step.\nThe detailed pseudo-code of the algorithm is presented in Algorithm 1. 
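Step 1 above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code; the variable names are ours:

```python
import numpy as np

def indirect_edge_matrix(A, green):
    """Build the indirect-edge matrix A' of Algorithm 1, Step 1.

    A: n x n symmetric 0/1 adjacency matrix (numpy array).
    green: indices of observed (green) nodes; all other nodes are red.
    Only red nodes with *exactly two* green neighbours are kept, so that
    the resulting indirect edges are statistically independent.
    """
    n = A.shape[0]
    green = np.asarray(green)
    red = np.setdiff1d(np.arange(n), green)
    B = A[np.ix_(green, red)]           # green-to-red incidences
    keep = B.sum(axis=0) == 2           # red nodes in V^(i)
    Bi = B[:, keep]
    Aprime = Bi @ Bi.T                  # A'_{vw} = shared V^(i) neighbours of v, w
    np.fill_diagonal(Aprime, 0)
    return Aprime
```

Each entry A'_{vw} counts the red nodes of V^(i) adjacent to both green nodes v and w, matching the definition above.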
Steps 2 and 4 of the algorithm are standard techniques used in clustering for the SBM, see e.g. [5]. The algorithms involved in these Steps are presented in the supplementary material (see Algorithms 4, 5, 6). Note that to extract the K largest eigenvalues and the corresponding eigenvectors of a matrix, we use the power method, which is memory-efficient (this becomes important when addressing Problem 2). Further observe that in Step 3, the algorithm exploits the information provided by the red nodes: it selects, between the direct edge matrix A(g) and the indirect edge matrix A', the matrix whose spectral properties provide more accurate information about the K clusters. This crucial step is enough for the algorithm to classify the green nodes asymptotically accurately whenever this is at all possible, as stated in the following theorem:\n\nTheorem 4 When √γ f(n) = ω(1), Algorithm 1 classifies the green nodes asymptotically accurately.\n\nIn view of Theorem 2 (i), our algorithm is optimal. It might be surprising to choose one of the matrices A(g) or A' and throw away the information contained in the other one. But the following simple calculation gives the main idea. To simplify, consider the case γ f(n) = o(1), so that we know that the matrix A(g) alone is not sufficient to find the clusters. In this case, it is easy to see that the matrix A' alone allows to classify as soon as √γ f(n) = ω(1). Indeed, the probability of getting an indirect edge between two green nodes is of the order (a^2 + b^2) f(n)^2/(2n) if the two nodes are in the same cluster and a b f(n)^2/n if they are in different clusters. Moreover the graph of indirect edges has the same statistics as an SBM with these probabilities of connection. Hence standard results show that spectral methods will work as soon as γ f(n)^2 tends to infinity, i.e. 
the mean degree in the observed graph of indirect edges tends to infinity. In the case where γ f(n) is too large (indeed ≫ ln(f(n))), the graph of indirect edges becomes too sparse for A' to be useful. But in this regime, A(g) allows to classify the green nodes. This argument gives some intuition about the full proof of Theorem 4, which can be found in the Appendix.\n\nAlgorithm 2 Greedy selections\nInput: A ∈ {0,1}^{|V|×|V(g)|}, V, V(g), (S(g)_k)_{1≤k≤K}\nV(r) ← V \\ V(g) and Sk ← S(g)_k for all k\nfor v ∈ V(r) do\n  Find k* = arg max_k {Σ_{w∈S(g)_k} Avw / |S(g)_k|} (ties broken uniformly at random)\n  Sk* ← Sk* ∪ {v}\nend for\nOutput: (Sk)_{1≤k≤K}\n\nAn attractive feature of our Algorithm 1 is that it does not require any parameter of the model as input except the number of clusters K. In particular, our algorithm automatically selects the best matrix between A' and A(g) based on their spectral properties.\nClassifying red nodes. From Theorem 2 (ii), in order to classify red nodes, we need to assume that γ f(n) = ω(1). Under this assumption, the green nodes are well classified under Algorithm 1. To classify the red nodes accurately, we show that it is enough to greedily assign these nodes to the clusters of green nodes identified using Algorithm 1. More precisely, a red node v is assigned to the cluster that maximizes the number of observed edges between v and the green nodes of this cluster. The pseudo-code of this procedure is presented in Algorithm 2.\n\nTheorem 5 When γ f(n) = ω(1), combining Algorithms 1 and 2 yields an asymptotically accurate clustering algorithm.\n\nAgain in view of Theorem 2 (ii), our algorithm is optimal. 
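The greedy rule of Algorithm 2 is short enough to sketch directly. This is a minimal illustration with our own names, assuming the green clusters are given as index lists; unlike the paper's version, ties go to the first maximizer rather than being broken uniformly at random:

```python
import numpy as np

def greedy_assign_red(A, green_clusters, red):
    """Assign each red node to the green cluster maximizing the
    normalized number of observed edges (the rule of Algorithm 2).

    A: n x n symmetric 0/1 adjacency matrix.
    green_clusters: list of K lists of green-node indices.
    red: iterable of red-node indices.
    """
    assignment = {}
    for v in red:
        # average connectivity from v to each green cluster
        scores = [A[v, list(S)].mean() for S in green_clusters]
        assignment[v] = int(np.argmax(scores))  # first maximizer on ties
    return assignment
```

Normalizing by cluster size (the `.mean()`) matches the |S(g)_k| denominator in Algorithm 2, so larger clusters get no spurious advantage.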
To summarize our results about Problem 1, i.e., clustering with partial information, we have shown that:\n(a) If γ ≪ 1/f(n)^2, no clustering algorithm can perform better than the naive algorithm that assigns nodes to clusters randomly (in the case of two clusters of equal sizes).\n(b) If 1/f(n)^2 ≪ γ ≪ 1/f(n), Algorithm 1 classifies the green nodes asymptotically accurately, but no algorithm can classify the red nodes asymptotically accurately.\n(c) If 1/f(n) ≪ γ, the combination of Algorithm 1 and Algorithm 2 classifies all nodes asymptotically accurately.\n\n4 Clustering in the Streaming Model under Memory Constraints\n\nIn this section, we address Problem 2, where the clustering problem has additional constraints. Namely, the memory available to the algorithm is limited (memory constraints) and each column Av of A is observed only once, hence if it is not stored, this information is lost (streaming model). In view of previous results, when the entire matrix A is available (i.e. γ = 1) and when there is no memory constraint, we know that a necessary and sufficient condition for the existence of asymptotically accurate clustering algorithms is that f(n) = ω(1). Here we first devise a clustering algorithm adapted to the streaming model and using a memory scaling linearly with n that is asymptotically accurate as soon as log(n) ≪ f(n). Algorithms 1 and 2 are the building blocks of this algorithm, and its performance analysis leverages the results of the previous section. We also show that our algorithm does not need to sequentially observe all columns of A in order to accurately reconstruct the clusters. In other words, the algorithm uses strictly less than one pass on the data and is asymptotically accurate.\nClearly if the algorithm is asked (as above) to output the full partition of the network, it will require a memory scaling linearly with n, the size of the output. 
However, in the streaming model, we can remove this requirement and the algorithm can output the full partition sequentially, similarly to an online algorithm (however our algorithm is not required to take an irrevocable action after the arrival of each column, but will classify nodes after a group of columns arrives). In this case, the memory requirement can be sublinear. We present an algorithm with a memory requirement which depends on the density of the graph. In the particular case where f(n) = n^α with 0 < α < 1, our algorithm requires as little as n^β bits of memory with β > max(1 \u2212 α, 2/3) to accurately cluster the nodes. Note that when the graph is very sparse (α ≈ 0), then community detection is a hard statistical task and the algorithm needs to gather a lot of columns, so that the memory requirement is quite high (β ≈ 1). As α increases, the graph becomes denser and the statistical task easier. As a result, our algorithm needs to look at smaller blocks of columns and the memory requirement decreases. However, for α ≥ 1/3, although the statistical task is much easier, our algorithm hits its memory constraint and in order to store blocks with sufficiently many columns, it needs to subsample each column. As a result, the memory requirement of our algorithm does not decrease for α ≥ 1/3.\nThe main idea of our algorithms is to successively treat blocks of B consecutive arriving columns. Each column of a block is stored in the memory. After the last column of a block arrives, we apply Algorithm 1 to classify the corresponding nodes accurately, and we then merge the obtained clusters with the previously identified clusters. In the online version, the algorithm can output the partition of the block, and in the offline version, it stores this result. We finally remove the stored columns, and proceed with the next block. For the offline algorithm, after a total of T observed columns, we apply Algorithm 2 to classify the remaining nodes, so that T can be less than n. The pseudo-code of the offline algorithm is presented in Algorithm 3.\n\nAlgorithm 3 Streaming offline\nInput: {A1, . . . , AT}, p, V, K\nInitial: N ← n × K matrix filled with zeros and B ← n h(n) / (min{np, n^{1/3}} log n)\nSubsampling: At ← randomly erase entries of At with probability max{0, 1 \u2212 n^{1/3}/(np)}\nfor τ = 1 to ⌊T/B⌋ do\n  A(B) ← n × B matrix whose i-th column is A_{i+(τ\u22121)B}\n  (S(τ)_k)_{1≤k≤K} ← Algorithm 1(A(B), V, {(τ \u2212 1)B + 1, . . . , τB}, K)\n  if τ = 1 then\n    V̂k ← S(1)_k for all k and N_{v,k} ← Σ_{w∈S(1)_k} Awv for all v ∈ V and all k\n  else\n    V̂_{s(k)} ← V̂_{s(k)} ∪ S(τ)_k for all k, where s(k) = arg max_{1≤i≤K} Σ_{v∈V̂i} Σ_{w∈S(τ)_k} Avw / (|V̂i| |S(τ)_k|)\n    N_{v,s(k)} ← N_{v,s(k)} + Σ_{w∈S(τ)_k} Awv for all v ∈ V and all k\n  end if\nend for\nGreedy improvement: V̄k ← {v : k = arg max_{1≤i≤K} N_{v,i}/|V̂i|} for all k\nOutput: (V̄k)_{1≤k≤K}\n\n
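The block-and-merge skeleton of this procedure can be sketched as follows. This is a schematic reconstruction, not the authors' implementation: `cluster_block` is a hypothetical stand-in for Algorithm 1, and the subsampling, the N counters, and the final greedy improvement are omitted for brevity:

```python
import numpy as np

def stream_blocks(columns, n, K, B, cluster_block):
    """Schematic offline streaming loop: buffer B columns, cluster them
    with `cluster_block`, merge each block's clusters into the running
    clusters by normalized connectivity, then discard the buffer.

    columns: iterable yielding (node_index, column_vector) pairs.
    cluster_block: callable mapping (n x B matrix, node ids) -> node-sets.
    """
    clusters = None                  # running clusters \hat V_1, ..., \hat V_K
    buf_cols, buf_ids = [], []
    for v, col in columns:
        buf_cols.append(col)
        buf_ids.append(v)
        if len(buf_ids) < B:
            continue
        block = np.stack(buf_cols, axis=1)         # n x B matrix
        S = cluster_block(block, list(buf_ids))    # clusters of this block
        if clusters is None:
            clusters = [set(s) for s in S]
        else:
            for s in S:
                idx = [buf_ids.index(w) for w in s]
                conn = block[:, idx].sum(axis=1)   # edges from every node to s
                score = [conn[list(C)].sum() / (len(C) * len(s))
                         for C in clusters]
                clusters[int(np.argmax(score))] |= set(s)
        buf_cols, buf_ids = [], []                 # free the block's memory
    return clusters
```

Only one block of columns is ever held in memory, which is where the Θ(n h(n)) term of the memory bound comes from; a trailing partial block would be handled by Algorithm 2 in the paper's offline version.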
Next we discuss how to tune B and T so that the classification is asymptotically accurate, and we compute the memory required to implement the algorithm.

Block size. We denote by B the size of a block. Let h(n) be such that the block size is B = h(n)n / (f̄(n) log(n)), where f̄(n) = min{f(n), n^{1/3}} represents the order of the number of positive entries of each column after the subsampling process. According to Theorem 4 (with γ = B/n), to accurately classify the nodes arrived in a block, we just need that B f̄(n)^2 / n = ω(1), which is equivalent to h(n) = ω(log(n) / min{f(n), n^{1/3}}). Now the merging procedure that combines the clusters found when analyzing the current block with the previously identified clusters uses the number of connections between the nodes corresponding to the columns of the current block and the previous clusters. The number of these connections must grow large as n tends to ∞ to ensure the accuracy of the merging procedure. Since the number of these connections scales as B^2 f̄(n) / n, we need that h(n)^2 = ω(min{f(n), n^{1/3}} log(n)^2 / n). Note that this condition is satisfied as long as h(n) = ω(log(n) / min{f(n), n^{1/3}}).

Total number of columns for the offline algorithm. To accurately classify the nodes whose columns are not observed, we will show that we need the total number of observed columns T to satisfy T = ω(n / min{f(n), n^{1/3}}) (which is in agreement with Theorem 5).

Required memory for the offline algorithm. To store the columns of a block, we need Θ(nh(n)) bits. To store the previously identified clusters, we need at most log_2(K) n bits, and we can store the number of connections between the nodes corresponding to the columns of the current block and the previous clusters using a memory scaling linearly with n. Finally, to execute Algorithm 1, the power method used to perform the SVD (see Algorithm 5) requires the same amount of bits as that used to store a block of size B. In summary, the required memory is M = Θ(nh(n) + n).

Algorithm 4 Streaming online
Input: {A_1, . . . , A_n}, p, V, K
Initial: B ← h(n)n / (min{np, n^{1/3}} log n) and τ⋆ ← ⌊n/B⌋
Subsampling: A_t ← randomly erase entries of A_t with probability max{0, 1 − n^{1/3}/(np)}
for τ = 1 to τ⋆ do
    A^{(B)} ← n × B matrix whose i-th column is A_{i+(τ−1)B}
    (S_k)_{1≤k≤K} ← Algorithm 1(A^{(B)}, V, {(τ−1)B + 1, . . . , τB}, K)
    if τ = 1 then
        V̂_k ← S_k for all k
        Output at B: (S_k)_{1≤k≤K}
    else
        s(k) ← arg max_{1≤i≤K} (Σ_{v ∈ V̂_i} Σ_{w ∈ S_k} A_{vw}) / (|V̂_i| |S_k|) for all k
        Output at τB: (S_{s(k)})_{1≤k≤K}
    end if
end for

Theorem 6 Assume that h(n) = ω(log(n) / min{f(n), n^{1/3}}) and T = ω(n / min{f(n), n^{1/3}}). Then with M = Θ(nh(n) + n) bits, Algorithm 3, with block size B = h(n)n / (min{f(n), n^{1/3}} log(n)) and acquiring the T first columns of A, outputs clusters V̂_1, . . . , V̂_K such that, with high probability, there exists a permutation σ of {1, . . . , K} such that:

(1/n) |∪_{1≤k≤K} V̂_k \ V_{σ(k)}| = O(exp(−c T min{f(n), n^{1/3}} / n)),

with a constant c > 0.

Under the conditions of the above theorem, Algorithm 3 is asymptotically accurate. Now if f(n) = ω(log(n)), we can choose h(n) = 1. Then Algorithm 3 classifies nodes accurately and uses a memory scaling linearly with n. Note that increasing the number of observed columns T just reduces the proportion of misclassified nodes.
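The tuning rules above are easy to check numerically. The sketch below (a numeric illustration, not from the paper) instantiates one admissible choice h(n) = log log(n) · log(n) / f̄(n), a concrete function satisfying h(n) = ω(log(n)/f̄(n)), computes the resulting block size B = h(n)n/(f̄(n) log(n)) and memory M = Θ(nh(n) + n), and exposes the margin by which the merging condition h(n)^2 = ω(f̄(n) log(n)^2/n) is met.

```python
import math

def offline_parameters(n, f_n):
    """Numeric sketch of the tuning rules for the offline algorithm. The
    choice h(n) = loglog(n) * log(n) / fbar(n) is an assumption made here
    for illustration, not the paper's prescription."""
    fbar = min(f_n, n ** (1.0 / 3.0))   # positive entries per column after subsampling
    h = math.log(math.log(n)) * math.log(n) / fbar
    B = h * n / (fbar * math.log(n))    # block size B = h(n) n / (fbar(n) log n)
    memory_bits = n * h + n             # M = Theta(n h(n) + n)
    # ratio of h(n)^2 to the merging requirement fbar(n) log(n)^2 / n;
    # it must diverge with n for the merging step to be accurate
    merge_margin = h ** 2 / (fbar * math.log(n) ** 2 / n)
    return B, memory_bits, merge_margin
```

Running this for growing n with, say, f(n) = log(n)^2 shows the merging margin diverging, confirming numerically that the block-size condition already implies the merging condition.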
For example, if f(n) = log(n)^2, with high probability, the proportion of misclassified nodes decays faster than 1/n if we acquire only T = n/log(n) columns, whereas it decays faster than exp(−log(n)^2) if all columns are observed.

Our online algorithm is a slight variation of the offline algorithm. Indeed, it deals with the first block in exactly the same manner and keeps in memory the partition of this first block. It then handles each successive block as it did the first block, and merges the partition of that block with the partition of the first block, as the offline algorithm does for the second block. Once this is done, the online algorithm throws all the information away except the partition of the first block.

Theorem 7 Assume that h(n) = ω(log(n) / min{f(n), n^{1/3}}). Then Algorithm 4 with block size B = h(n)n / (min{f(n), n^{1/3}} log n) is asymptotically accurate (i.e., after one pass, the fraction of misclassified nodes vanishes) and requires Θ(nh(n)) bits of memory.

5 Conclusion

We introduced the problem of community detection with partial information, where only an induced subgraph corresponding to a fraction of the nodes is observed. In this setting, we gave a necessary condition for accurate reconstruction and developed a new spectral algorithm which extracts the clusters whenever this is at all possible. Building on this result, we considered the streaming, memory-limited problem of community detection and developed algorithms able to asymptotically reconstruct the clusters, with a memory requirement that is linear in the size of the network for the offline version of the algorithm and sublinear for its online version. To the best of our knowledge, these algorithms are the first community detection algorithms in the data stream model.
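The online variant can be sketched in a few lines: only the current block and the partition of the first block are ever held in memory, so the footprint is Θ(nh(n)) rather than Θ(nh(n) + n). The `classify` argument stands in for Algorithm 1; the toy version below groups identical columns, which suffices for a noiseless illustration (on a toy graph with self-loops, so same-cluster columns coincide). All names here are hypothetical.

```python
import numpy as np

def toy_classify(Ab, K):
    """Toy stand-in for Algorithm 1: group identical columns (noiseless case).
    Hypothetical simplification for illustration only."""
    parts, seen = [], []
    for j in range(Ab.shape[1]):
        for k, pattern in enumerate(seen):
            if np.array_equal(Ab[:, j], pattern):
                parts[k].append(j)
                break
        else:
            seen.append(Ab[:, j]); parts.append([j])
    return parts[:K]

def streaming_online(columns, B, K, classify=toy_classify):
    """Sketch of the online scheme (Algorithm 4): cluster each block, match
    its clusters to those of the first block by average connectivity, output
    the result, then discard everything except the first block's partition."""
    V_ref, block, outputs = None, [], []
    for t, col in enumerate(columns):
        block.append(col)
        if len(block) < B:
            continue
        Ab = np.column_stack(block)        # n x B: the bulk of the memory
        start = t + 1 - B
        S = classify(Ab, K)                # local (within-block) clusters
        if V_ref is None:                  # first block: becomes the reference
            V_ref = [set(start + j for j in s) for s in S]
            outputs.append([sorted(start + j for j in s) for s in S])
        else:
            mapped = [[] for _ in range(K)]
            for s in S:
                # best-connected reference cluster, as in the merging step
                i_star = max(range(K), key=lambda i: sum(
                    Ab[v, j] for v in V_ref[i] for j in s)
                    / max(1, len(V_ref[i]) * len(s)))
                mapped[i_star].extend(start + j for j in s)
            outputs.append([sorted(m) for m in mapped])
        block = []                         # discard the block entirely
    return outputs
```

The design choice mirrors Theorem 7: because later blocks are matched only against the first block's partition and then dropped, one pass suffices and nothing of size n beyond the reference partition is retained.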
The memory requirement of these algorithms is non-increasing in the density of the graph, and determining the optimal memory requirement is an interesting open problem.