{"title": "Clustering from Labels and Time-Varying Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1188, "page_last": 1196, "abstract": "We present a general framework for graph clustering where a label is observed to each pair of nodes. This allows a very rich encoding of various types of pairwise interactions between nodes. We propose a new tractable approach to this problem based on maximum likelihood estimator and convex optimization. We analyze our algorithm under a general generative model, and provide both necessary and sufficient conditions for successful recovery of the underlying clusters. Our theoretical results cover and subsume a wide range of existing graph clustering results including planted partition, weighted clustering and partially observed graphs. Furthermore, the result is applicable to novel settings including time-varying graphs such that new insights can be gained on solving these problems. Our theoretical findings are further supported by empirical results on both synthetic and real data.", "full_text": "Clustering from Labels and Time-Varying Graphs\n\nShiau Hong Lim\n\nNational University of Singapore\n\nmpelsh@nus.edu.sg\n\nYudong Chen\n\nEECS, University of California, Berkeley\nyudong.chen@eecs.berkeley.edu\n\nHuan Xu\n\nNational University of Singapore\n\nmpexuh@nus.edu.sg\n\nAbstract\n\nWe present a general framework for graph clustering where a label is observed to\neach pair of nodes. This allows a very rich encoding of various types of pairwise\ninteractions between nodes. We propose a new tractable approach to this problem\nbased on maximum likelihood estimator and convex optimization. We analyze our\nalgorithm under a general generative model, and provide both necessary and suf\ufb01-\ncient conditions for successful recovery of the underlying clusters. 
Our theoretical results cover and subsume a wide range of existing graph clustering results, including planted partition, weighted clustering and partially observed graphs. Furthermore, the result is applicable to novel settings, including time-varying graphs, so that new insights can be gained into solving these problems. Our theoretical findings are further supported by empirical results on both synthetic and real data.

1 Introduction

In the standard formulation of graph clustering, we are given an unweighted graph and seek a partitioning of the nodes into disjoint groups such that members of the same group are more densely connected than those in different groups. Here, the presence of an edge represents some sort of affinity or similarity between the nodes, and the absence of an edge represents the lack thereof.
In many applications, from chemical interactions to social networks, the interactions between nodes are much richer than a simple "edge" or "non-edge". Such extra information may be used to improve the clustering quality. We may represent each type of interaction by a label. One simple setting of this type is weighted graphs, where instead of a 0-1 graph, we have edge weights representing the strength of the pairwise interaction. In this case the observed label between each pair is a real number. In a more general setting, the label need not be a number. For example, on social networks like Facebook, the label between two persons may be "they are friends", "they went to different schools", "they liked 21 common pages", or the concatenation of these. In such cases different labels carry different information about the underlying community structure. Standard approaches convert these pairwise interactions into a simple edge/non-edge, and then apply standard clustering algorithms, which might lose much of the information. 
Even in the case of a standard weighted/unweighted graph, it is not immediately clear how the graph should be used. For example, should the absence of an edge be interpreted as a neutral observation carrying no information, or as a negative observation which indicates dissimilarity between the two nodes?
We emphasize that the forms of labels can be very general. In particular, a label can take the form of a time series, i.e., the record of a time-varying interaction such as "A and B messaged each other on June 1st, 4th, 15th and 21st", or "they used to be friends, but they stopped seeing each other in 2012". Thus, the labeled graph model is an immediate tool for analyzing time-varying graphs.

In this paper, we present a new and principled approach for graph clustering that is directly based on pairwise labels. We assume that between each pair of nodes i and j, a label Lij is observed which is an element of a label set L. The set L may be discrete or continuous, and need not have any structure. The standard graph model corresponds to a binary label set L = {edge, non-edge}, and a weighted graph corresponds to L = R. Given the observed labels L = (Lij) ∈ L^{n×n}, the goal is to partition the n nodes into disjoint clusters. Our approach is based on finding a partition that optimizes a weighted objective appropriately constructed from the observed labels. This leads to a combinatorial optimization problem, and our algorithm uses its convex relaxation.
To systematically evaluate clustering performance, we consider a generalization of the stochastic block model [1] and the planted partition model [2]. Our model assumes that the observed labels are generated based on an underlying set of ground truth clusters, where pairs from the same cluster generate labels using a distribution µ over L and pairs from different clusters use a different distribution ν. 
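As a concrete illustration, the generative model just described can be simulated in a few lines. The finite label set, cluster assignment, and the particular distributions below are hypothetical choices for the sketch, not part of the model's definition:

```python
import random

def sample_labels(clusters, labels, mu, nu, seed=0):
    """Simulate the generative model: for each pair (i, j), draw a label
    from mu if i and j share a cluster, and from nu otherwise.

    clusters: list assigning each node to a cluster id.
    labels:   the (finite, for this sketch) label set L.
    mu, nu:   in-cluster / cross-cluster distributions over `labels`.
    """
    rng = random.Random(seed)
    n = len(clusters)
    L = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = mu if clusters[i] == clusters[j] else nu
            L[i][j] = rng.choices(labels, weights=dist)[0]
    return L

# Hypothetical 3-element label set standing in for richer interaction types.
labels = ["friends", "same-school", "no-tie"]
mu = [0.6, 0.3, 0.1]   # in-cluster label distribution
nu = [0.1, 0.2, 0.7]   # cross-cluster label distribution
L = sample_labels([0, 0, 1, 1], labels, mu, nu)
```

In the continuous case (e.g. L = R), the same sketch applies with `rng.choices` replaced by a draw from the corresponding density.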
The standard stochastic block model corresponds to the case where \u00b5 and \u03bd are two-\npoint distributions with \u00b5(edge) = p and \u03bd(edge) = q. We provide theoretical guarantees for our\nalgorithm under this generalized model.\nOur results cover a wide range of existing clustering settings\u2014with equal or stronger theoretical\nguarantees\u2014including the standard stochastic block model, partially observed graphs and weighted\ngraphs. Perhaps surprisingly, our framework allows us to handle new classes of problems that are not\na priori obvious to be a special case of our model, including the clustering of time-varying graphs.\n\n1.1 Related work\n\nThe planted partition model/stochastic block model [1, 2] are standard models for studying graph\nclustering. Variants of the models cover partially observed graphs [3, 4] and weighted graphs [5, 6].\nAll these models are special cases of ours. Various algorithms have been proposed and analyzed\nunder these models, such as spectral clustering [7, 8, 1], convex optimization approaches [9, 10, 11]\nand tensor decomposition methods [12]. Ours is based on convex optimization; we build upon and\nextend the approach in [13], which is designed for clustering unweighted graphs whose edges have\ndifferent levels of uncertainty, a special case of our problem (cf. Section 4.2 for details).\nMost related to our setting is the labelled stochastic block model proposed in [14] and [15]. A\nmain difference in their model is that they assume each observation is a two-step process: \ufb01rst\nan edge/non-edge is observed; if it is an edge then a label is associated with it.\nIn our model\nall observations are in the form of labels; in particular, an edge or no-edge is also a label. This\ncovers their setting as a special case. Our model is therefore more general and natural\u2014as a result\nour theory covers a broad class of subproblems including time-varying graphs. 
Moreover, their analysis is mainly restricted to the two-cluster setting with edge probabilities on the order of Θ(1/n), while we allow for an arbitrary number of clusters and a wide range of edge/label distributions. In addition, we consider the setting where the distributions of the labels are not precisely known. Algorithmically, they use belief propagation [14] and spectral methods [15].
Clustering time-varying graphs has been studied in various contexts; see [16, 17, 18, 19, 20] and the references therein. Most existing algorithms use heuristics and lack theoretical analysis. Our approach is based on a generative model and has provable performance guarantees.

2 Problem setup and algorithms

We assume n nodes are partitioned into r disjoint clusters of size at least K, which are unknown and considered as the ground truth. For each pair (i, j) of nodes, a label Lij ∈ L is observed, where L is the set of all possible labels.¹ These labels are generated independently across pairs according to the distributions µ and ν. In particular, the probability of observing the label Lij is µ(Lij) if i and j are in the same cluster, and ν(Lij) otherwise. The goal is to recover the ground truth clusters given the labels. Let L = (Lij) ∈ L^{n×n} be the matrix of observed labels. We represent the true clusters by an n × n cluster matrix Y∗, where Y∗ij = 1 if nodes i and j belong to the same cluster and Y∗ij = 0 otherwise (we use the convention Y∗ii = 1 for all i). The problem is therefore to find Y∗ given L.

¹Note that L does not have to be finite. Although some of the results are presented for finite L, they can be easily adapted to the other cases, for instance, by replacing summation with integration.

We take an optimization approach to this problem. 
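For concreteness, the cluster matrix Y∗ can be built directly from a ground-truth partition (a small illustrative sketch; the 3-node partition below is hypothetical):

```python
def cluster_matrix(assignment):
    """Build the n-by-n cluster matrix Y*: Y*_ij = 1 iff i and j belong to
    the same cluster (so Y*_ii = 1 by convention), and Y*_ij = 0 otherwise."""
    n = len(assignment)
    return [[1 if assignment[i] == assignment[j] else 0 for j in range(n)]
            for i in range(n)]

Y = cluster_matrix([0, 0, 1])
# Y is block diagonal (after ordering nodes by cluster):
# [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

This block-diagonal 0-1 structure is what the convex relaxation below exploits.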
To motivate our algorithm, first consider the case of clustering a weighted graph, where all labels are real numbers. Positive weights indicate in-cluster interaction while negative weights indicate cross-cluster interaction. A natural approach is to cluster the nodes in a way that maximizes the total weight inside the clusters (this is equivalent to correlation clustering [21]). Mathematically, this is to find a clustering, represented by a cluster matrix Y, such that Σ_{i,j} Lij Yij is maximized. For the case of general labels, we pick a weight function w : L → R, which assigns a number Wij = w(Lij) to each label, and then solve the following max-weight problem:

max_Y ⟨W, Y⟩   s.t. Y is a cluster matrix;   (1)

here ⟨W, Y⟩ := Σ_{ij} Wij Yij is the standard trace inner product. Note that this effectively converts the problem of clustering from labels into a weighted clustering problem.
The program (1) is non-convex due to the constraint. Our algorithm is based on a convex relaxation of (1), using the now well-known fact that a cluster matrix is a block-diagonal 0-1 matrix and thus has nuclear norm² equal to n [22, 3, 23]. This leads to the following convex optimization problem:

max_Y ⟨W, Y⟩   s.t. ‖Y‖∗ ≤ n; 0 ≤ Yij ≤ 1, ∀(i, j).   (2)

We say that this program succeeds if it has a unique optimal solution equal to the true cluster matrix Y∗. We note that a related approach is considered in [13], which is discussed in section 4.
One has the freedom of choosing the weight function w. Intuitively, w should assign w(Lij) > 0 to a label Lij with µ(Lij) > ν(Lij), so the program (2) is encouraged to place i and j in the same cluster, the more likely possibility; similarly we should have w(Lij) < 0 if µ(Lij) < ν(Lij). 
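A small sketch of the objective ⟨W, Y⟩ that programs (1) and (2) maximize, on a hypothetical 3-node instance with a hand-picked ±1 weight function; solving the relaxation (2) itself requires a nuclear-norm-constrained solver, which is omitted here:

```python
def objective(L, w, Y):
    """Evaluate <W, Y> = sum_ij w(L_ij) * Y_ij for a candidate cluster matrix Y."""
    n = len(L)
    return sum(w(L[i][j]) * Y[i][j] for i in range(n) for j in range(n))

# Hypothetical weight function: +1 for an in-cluster-looking label, -1 otherwise.
w = lambda label: 1.0 if label == "edge" else -1.0

L = [["edge", "edge", "none"],
     ["edge", "edge", "none"],
     ["none", "none", "edge"]]
Y_true  = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]   # the planted clustering
Y_wrong = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # everything in one cluster
# objective(L, w, Y_true) = 5.0 > objective(L, w, Y_wrong) = 1.0
```

On this noiseless instance the planted clustering attains the larger objective value, which is the behavior the weight function is meant to encourage.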
A good weight function should further reflect the information in µ and ν. Our theoretical results in section 3 characterize the performance of the program (2) for any given weight function; building on this, we further derive the optimal choice for the weight function.

3 Theoretical results

In this section, we provide theoretical analysis for the performance of the convex program (2) under the probabilistic model described in section 2. The proofs are given in the supplementary materials. Our main result is a general theorem that gives sufficient conditions for (2) to recover the true cluster matrix Y∗. The conditions are stated in terms of the label distributions µ and ν, the minimum size of the true clusters K, and any given weight function w. Define Eµw := Σ_{l∈L} w(l)µ(l) and Varµw := Σ_{l∈L} [w(l) − Eµw]² µ(l); Eνw and Varνw are defined similarly.

Theorem 1 (Main). Suppose b is any number that satisfies |w(l)| ≤ b, ∀l ∈ L almost surely. There exists a universal constant c > 0 such that if

−Eνw ≥ c [ b log n + √(Varνw · K log n) ] / K,   (3)
Eµw ≥ c [ b log n + √(n log n · max(Varµw, Varνw)) ] / K,   (4)

then Y∗ is the unique solution to (2) with probability at least 1 − n⁻¹⁰.³

²The nuclear norm of a matrix is defined as the sum of its singular values. A cluster matrix is positive semidefinite so its nuclear norm is equal to its trace.
³In all our results, the choice n⁻¹⁰ is arbitrary. In particular, the constant c scales linearly with the exponent.

The theorem holds for any given weight function w. In the next two subsections, we show how to choose w optimally, and then address the case where w deviates from the optimal choice.

3.1 Optimal weights

A good candidate for the weight function w can be derived from the maximum likelihood estimator (MLE) of Y∗. Given the observed labels L, the log-likelihood of the true cluster matrix taking the value Y is

log Pr(L|Y∗ = Y) = Σ_{i,j} log[µ(Lij)^{Yij} ν(Lij)^{1−Yij}] = ⟨W, Y⟩ + c,

where c is independent of Y and W is given by the weight function w(l) = wMLE(l) := log µ(l)/ν(l). The MLE thus corresponds to using the log-likelihood ratio wMLE(·) as the weight function. The following theorem is a consequence of Theorem 1 and characterizes the performance of using the MLE weights. In the sequel, we use D(·‖·) to denote the KL divergence between two distributions.

Theorem 2 (MLE). Suppose wMLE is used, and b and ζ are any numbers which satisfy |log µ(l)/ν(l)| ≤ b, ∀l ∈ L, and D(ν‖µ) ≤ ζ D(µ‖ν). There exists a universal constant c > 0 such that Y∗ is the unique solution to (2) with probability at least 1 − n⁻¹⁰ if

D(ν‖µ) ≥ c(b + 2) · log n / K,   (5)
D(µ‖ν) ≥ c(ζ + 1)(b + 2) · n log n / K².   (6)

Moreover, we always have D(ν‖µ) ≤ (2b + 3) D(µ‖ν), so we can take ζ = 2b + 3.

Note that the theorem has the intuitive interpretation that the in/cross-cluster label distributions µ and ν should be sufficiently different, measured by their KL divergence. Using a classical result in information theory [24], we may replace the KL divergences with a quantity that is often easier to work with, as summarized below. The LHS of (7) is sometimes called the triangle discrimination [24].

Corollary 1 (MLE 2). Suppose wMLE is used, and b, ζ are defined as in Theorem 2. 
There exists a universal constant c such that Y∗ is the unique solution to (2) with probability at least 1 − n⁻¹⁰ if

Σ_{l∈L} (µ(l) − ν(l))² / (µ(l) + ν(l)) ≥ c(ζ + 1)(b + 2) · n log n / K².   (7)

We may take ζ = 2b + 3.

The MLE weight wMLE turns out to be near-optimal, at least in the two-cluster case, in the sense that no other weight function (in fact, no other algorithm) has significantly better performance. This is shown by establishing a necessary condition for any algorithm to recover Y∗. Here, an algorithm is a measurable function Ŷ that maps the data L to a clustering (represented by a cluster matrix).

Theorem 3 (Converse). The following holds for some universal constants c, c′ > 0. Suppose K = n/2, and b defined in Theorem 2 satisfies b ≤ c′. If

Σ_{l∈L} (µ(l) − ν(l))² / (µ(l) + ν(l)) ≤ c log n / n,   (8)

then inf_Ŷ sup_{Y∗} P(Ŷ ≠ Y∗) ≥ 1/2, where the supremum is over all possible cluster matrices.

Under the assumption of Theorem 3, the conditions (7) and (8) match up to a constant factor.
Remark. The MLE weight |wMLE(l)| becomes large if µ(l) = o(ν(l)) or ν(l) = o(µ(l)), i.e., when the in-cluster probability is negligible compared to the cross-cluster one (or the other way around). It can be shown that in this case the MLE weight is actually order-wise better than a bounded weight function. We give this result in the supplementary material due to space constraints.

3.2 Monotonicity

We sometimes do not know the exact true distributions µ and ν to compute wMLE. Instead, we might compute the weight using the log-likelihood ratios of some "incorrect" distributions µ̄ and ν̄. 
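Whether one uses the true distributions or plug-in estimates, the log-likelihood-ratio weight and the KL divergences appearing in Theorem 2 are direct to compute for a finite label set; the two-label distributions below are hypothetical. A useful sanity check included in the comments: with w = wMLE, one has Eµw = D(µ‖ν) and −Eνw = D(ν‖µ), which connects Theorem 1 to Theorem 2.

```python
from math import log

def w_mle(mu, nu):
    """MLE weight w(l) = log(mu(l)/nu(l)) for each label l in a finite set."""
    return {l: log(mu[l] / nu[l]) for l in mu}

def kl(p, q):
    """KL divergence D(p || q) over a finite label set."""
    return sum(p[l] * log(p[l] / q[l]) for l in p)

# Hypothetical two-label distributions (an unweighted-graph-like setting).
mu = {"edge": 0.5, "none": 0.5}   # in-cluster
nu = {"edge": 0.1, "none": 0.9}   # cross-cluster
w = w_mle(mu, nu)
# w["edge"] = log 5 > 0: an "edge" is evidence for the same cluster;
# w["none"] = log(5/9) < 0: a "none" is (weaker) evidence against it.
# b in Theorem 2 can be taken as max_l |w(l)| = log 5, and
# sum_l mu(l) w(l) == D(mu||nu),  -sum_l nu(l) w(l) == D(nu||mu).
```

Replacing (mu, nu) by estimated pairs gives the plug-in weights discussed in sections 3.2 and 3.3.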
Our algorithm has a nice monotonicity property: as long as the divergence of the true µ and ν is larger than that of µ̄ and ν̄ (hence an "easier" problem), the problem should still have the same, if not better, probability of success, even though the wrong weights are used.
We say that (µ, ν) is more divergent than (µ̄, ν̄) if, for each l ∈ L, we have that either

µ(l)/ν(l) ≥ µ(l)/ν̄(l) ≥ µ̄(l)/ν̄(l) ≥ 1   or   ν(l)/µ(l) ≥ ν(l)/µ̄(l) ≥ ν̄(l)/µ̄(l) ≥ 1.

Theorem 4 (Monotonicity). Suppose we use the weight function w(l) = log µ̄(l)/ν̄(l), ∀l, while the actual label distributions are µ and ν. If the conditions in Theorem 2 or Corollary 1 hold with µ, ν replaced by µ̄, ν̄, and (µ, ν) is more divergent than (µ̄, ν̄), then with probability at least 1 − n⁻¹⁰, Y∗ is the unique solution to (2).

This result suggests that one way to choose the weight function is by using the log-likelihood ratio based on a "conservative" estimate (i.e., a less divergent one) of the true label distribution pair.

3.3 Using inaccurate weights

In the previous subsection we considered using a conservative log-likelihood ratio as the weight. We now consider a more general weight function w which need not be conservative, but is only required to be not too far from the true log-likelihood ratio wMLE. Let

ε(l) := w(l) − wMLE(l) = w(l) − log µ(l)/ν(l)

be the error for each label l ∈ L. Accordingly, let ∆µ := Σ_{l∈L} µ(l)ε(l) and ∆ν := Σ_{l∈L} ν(l)ε(l) be the average errors with respect to µ and ν. 
Note that ∆µ and ∆ν can be either positive or negative. The following characterizes the performance of using such a w.

Theorem 5 (Inaccurate Weights). Let b and ζ be defined as in Theorem 2. If the weight w satisfies

|w(l)| ≤ λ |log µ(l)/ν(l)|, ∀l ∈ L,   |∆µ| ≤ γ D(µ‖ν),   |∆ν| ≤ γ D(ν‖µ)

for some γ < 1 and λ > 0, then Y∗ is the unique solution to (2) with probability at least 1 − n⁻¹⁰ if

D(ν‖µ) ≥ c λ²/(1 − γ)² · (b + 2) · log n / K   and   D(µ‖ν) ≥ c λ²/(1 − γ)² · (ζ + 1)(b + 2) · n log n / K².

Therefore, as long as the errors ∆µ and ∆ν in w are not too large, the condition for recovery will be order-wise similar to that in Theorem 2 for using the MLE weight. The numbers λ and γ measure the amount of inaccuracy in w w.r.t. wMLE. The last two conditions in Theorem 5 thus quantify the relation between the inaccuracy in w and the price we need to pay for using such a weight.

4 Consequences and applications

We apply the general results in the last section to different special cases. In sections 4.1 and 4.2, we consider two simple settings and show that two immediate corollaries of our main theorems recover, and in fact improve upon, existing results. In sections 4.3 and 4.4, we turn to the more complicated setting of clustering time-varying graphs and derive several novel results.

4.1 Clustering a Gaussian matrix with partial observations

Analogous to the planted partition model for unweighted graphs, the bi-clustering [5] or submatrix-localization [6, 23] problem concerns a weighted graph whose adjacency matrix has Gaussian entries. 
We consider a generalization of this problem where some of the entries are unobserved.\nSpeci\ufb01cally, we observe a matrix L \u2208 (R \u222a {?})n\u00d7n, which has r submatrices of size K \u00d7 K with\ndisjoint row and column support, such that Lij =? (meaning unobserved) with probability 1\u2212 s and\notherwise Lij \u223c N (uij, 1). Here the means of the Gaussians satisfy: uij = \u00afu if (i, j) is inside the\nsubmatrices and uij = u if outside, where \u00afu > u \u2265 0. Clustering is equivalent to locating these\nsubmatrices with elevated mean, given the large Gaussian matrix L with partial observations.4\nThis is a special case of our labeled framework with L = R \u222a {?}. Computing the log-likelihood\nratios for two Gaussians, we obtain wMLE(Lij) = 0 if Lij =?, and wMLE(Lij) \u221d Lij \u2212 (\u00afu + u)/2\notherwise. This problem is interesting only when \u00afu \u2212 u (cid:46) \u221a\nlog n (otherwise simple element-wise\nthresholding [5, 6] \ufb01nds the submatrices), which we assume to hold. Clearly D (\u00b5(cid:107)\u03bd) = D (\u03bd(cid:107)\u00b5) =\n4 s(\u00afu \u2212 u)2. The following can be proved using our main theorems (proof in the appendix).\n\n1\n\n4Here for simplicity we consider the clustering setting instead of bi-clustering. The latter setting corresponds\n\nto rectangular L and submatrices. Extending our results to this setting is relatively straightforward.\n\n5\n\n\fCorollary 2 (Gaussian Graphs). Under the above setting, Y \u2217 is the unique solution to (2) with\nweights w = wMLE with probability at least 1 \u2212 2n\u221210 provided\n\ns (\u00afu \u2212 u)2 \u2265 c\n\nn log3 n\n\nK 2\n\n.\n\nIn the fully observed case, this recovers the results in [23, 5, 6] up to log factors. 
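A sketch of the resulting weight function for this Gaussian setting; the "?" token for an unobserved entry is an illustrative encoding, and any positive rescaling of wMLE leaves the optimizer of (2) unchanged, so the proportionality constant is dropped:

```python
def gaussian_weight(label, u_bar, u):
    """wMLE for the partially observed Gaussian model: zero on unobserved
    entries, and proportional to L_ij - (u_bar + u)/2 otherwise."""
    if label == "?":      # unobserved entry: carries no information
        return 0.0
    return label - (u_bar + u) / 2.0

# With u_bar = 1 and u = 0, observations above the midpoint 0.5
# vote "same cluster" and observations below it vote against.
assert gaussian_weight("?", 1.0, 0.0) == 0.0
assert gaussian_weight(0.9, 1.0, 0.0) > 0 > gaussian_weight(0.2, 1.0, 0.0)
```

Setting unobserved entries to weight zero is what distinguishes this from the imputation scheme (treating "?" as a low observation) compared against in section 5.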
Our results are\nmore general as we allow for partial observations, which is not considered in previous work.\n\n4.2 Planted Partition with non-uniform edge densities\n\nThe work in [13] considers a variant of the planted partition model with non-uniform edge densities,\nwhere each pair (i, j) has an edge with probability 1\u2212 uij > 1/2 if they are in the same cluster, and\nwith probability uij < 1/2 otherwise. The number uij can be considered as a measure of the level of\nuncertainty in the observation between i and j, and is known or can be estimated in applications like\ncloud-clustering. They show that using the knowledge of {uij} improves clustering performance,\nand such a setting covers clustering of partially observed graphs that is considered in [11, 3, 4].\nHere we consider a more general setting that does not require the in/cross-cluster edge density to be\nsymmetric around 1\n2. Suppose each pair (i, j) is associated with two numbers pij and qij, such that\nif i and j are in the same cluster (different clusters, resp.), then there is an edge with probability pij\n(qij, resp.); we know pij and qij but not which of them is the probability that generates the edge.\nThe values of pij and qij are generated i.i.d. randomly as (pij, qij) \u223c D by some distribution D on\n[0, 1] \u00d7 [0, 1]. The goal is to \ufb01nd the clusters given the graph adjacency matrix A, (pij) and (qij).\nThis model is a special case of our labeled framework. 
The labels have the form Lij =\n(Aij, pij, qij) \u2208 L = {0, 1} \u00d7 [0, 1] \u00d7 [0, 1], generated by the distributions\n\n(cid:26)pD(p, q),\n\n\u00b5(l) =\n\n(1 \u2212 p)D(p, q),\n\nl = (1, p, q)\nl = (0, p, q)\n\n(cid:26)qD(p, q),\n\n\u03bd(l) =\n\n(1 \u2212 q)D(p, q),\n+(1\u2212Aij) log 1\u2212pij\n1\u2212qij\n\nl = (1, p, q)\nl = (0, p, q).\n\nThe MLE weight has the form wMLE(Lij) = Aij log pij\nqij\nconvenient to use a conservative weight in which we replace pij and qij with \u00afpij = 3\n\u00afqij = 1\nCorollary 3 (Non-uniform Density). Program (2) recovers Y \u2217 with probability at least 1 \u2212 n\u221210 if\n\n4 qij. Applying Theorem 4 and Corollary 1, we immediately obtain the following.\n\n. It turns out it is more\n4 qij and\n\n4 pij + 3\n\n4 pij + 1\n\n(cid:20) (pij \u2212 qij)2\n\npij(1 \u2212 qij)\n\n(cid:21)\n\nED\n\n\u2265 c\n\nn log n\n\nK 2\n\n,\u2200(i.j).\n\nHere ED is the expectation w.r.t. the distribution D, and LHS above is in fact independent of (i, j).\n\nCorollary 3 improves upon existing results for several settings.\n\u2022 Clustering partially observed graphs. Suppose D is such that pij = p and qij = q with proba-\nbility s, and pij = qij otherwise, where p > q. This extends the standard planted partition model:\neach pair is unobserved with probability 1 \u2212 s. For this setting we require\n\ns(p \u2212 q)2\np(1 \u2212 q)\n\n(cid:38) n log n\n\nK 2\n\n.\n\nWhen s = 1. this matches the best existing bounds for standard planted partition [9, 12] up to a\nlog factor. For the partial observation setting with s \u2264 1, the work in [4] gives a similar bound\nunder the additional assumption p > 0.5 > q, which is not required by our result. 
For general\np and q, the best existing bound is given in [3, 9], which replaces unobserved entries with 0 and\nrequires the condition s(p\u2212q)2\np(1\u2212sq)\n\nwith symmetric densities pij \u2261 1\u2212 qij, for which we recover their result ED(cid:2)(1\u22122qij)2(cid:3)(cid:38) nlogn\n\n\u2022 Planted partition with non-uniformity. The model and algorithm in [13] is a special case of ours\nK2 .\n\n. Our result is tighter when p and q are close to 1.\n\n(cid:38) n log n\nK2\n\nCorollary 3 is more general as it removes the symmetry assumption.\n\n6\n\n\f4.3 Clustering time-varying multiple-snapshot graphs\n\nStandard graph clustering concerns with clustering on a single, static graph. We now consider a\nsetting where the graph can be time-varying. Speci\ufb01cally, we assume that for each time interval\nt = 1, 2, . . . , T , we observed a snapshot of the graph L(t) \u2208 Ln\u00d7n. We assume each snapshot is\ngenerated by the distributions \u00b5 and \u03bd, independent of other snapshots.\nWe can map this problem into our original labeled framework, by considering the whole time se-\nquence of \u00afLij := (L(1)\nij ) observed at the pair (i, j) as a single label. In this case the label\nset is thus the set of all possible sequences, i.e., \u00afL = (L)T , and the label distributions are (with a\nij ), with \u03bd(\u00b7) given similarly. The MLE weight\nslight abuse of notation) \u00b5( \u00afLij) = \u00b5(L(1)\n(normalized by T ) is thus the average log-likelihood ratio:\n\nij ) . . . \u00b5(L(T )\n\nij , . . . , L(T )\n\nwMLE( \u00afLij) =\n\n1\nT\n\nlog\n\n\u00b5(L(1)\n\u03bd(L(1)\n\nij ) . . . \u00b5(L(T )\nij )\nij ) . . . 
\u03bd(L(T )\nij )\n\n=\n\n1\nT\n\nlog\n\n\u00b5(L(t)\nij )\n\u03bd(L(t)\nij )\n\n.\n\nT(cid:88)\n\nt=1\n\nSince wMLE( \u00afLij) is the average of T independent random variables, its variance scales with 1\nT .\nApplying Theorem 1, with almost identical proof as in Theorem 2 we obtain the following:\n\u03bd(l)| \u2264 b,\u2200l \u2208 L and D(\u03bd||\u00b5) \u2264 \u03b6D(\u00b5||\u03bd).\nCorollary 4 (Independent Snapshots). Suppose | log \u00b5(l)\nThe program (2) with MLE weights given recovers Y \u2217 with probability at least 1 \u2212 n\u221210 provided\n(cid:110) log n\n\nD(\u03bd||\u00b5) \u2265 c(b + 2)\nK\nD(\u00b5||\u03bd) \u2265 c(b + 2) max\n\n, (\u03b6 + 1)\n\n(cid:111)\n\nlog n\n\n(9)\n\n(10)\n\n.\n\n,\n\nK\n\nn log n\nT K 2\n\nSetting T = 1 recovers Theorem 2. When the second term in (10) dominates, the corollary says that\nthe problem becomes easier if we observe more snapshots, with the tradeoff quanti\ufb01ed precisely.\n\n4.4 Markov sequence of snapshots\n\nWe now consider the more general and useful setting where the snapshots form a Markov chain. For\nsimplicity we assume that the Markov chain is time-invariant and has a unique stationary distribution\nwhich is also the initial distribution. Therefore, the observations L(t)\nij at each (i, j) are generated by\n\ufb01rst drawing a label from the stationary distribution \u00af\u00b5 (or \u00af\u03bd) at t = 1, then applying a one-step\ntransition to obtain the label at each subsequent t. In particular, given the previously observed label\nl, let the intra-cluster and inter-cluster conditional distributions be \u00b5(\u00b7|l) and \u03bd(\u00b7|l). 
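A minimal sketch of the snapshot process just described: each pair's label sequence starts from the stationary distribution and then evolves by a one-step transition kernel. The two-label "sticky" chain below is hypothetical, chosen only to illustrate the setup, not estimated from any data.

```python
import random

def markov_sequence(stationary, transition, T, seed=0):
    """Draw a length-T label sequence: the first label from the stationary
    distribution, each subsequent label from the previous label's
    transition row."""
    rng = random.Random(seed)
    labels = list(stationary)
    seq = rng.choices(labels, weights=[stationary[l] for l in labels])
    for _ in range(T - 1):
        row = transition[seq[-1]]
        seq.append(rng.choices(labels, weights=[row[l] for l in labels])[0])
    return seq

# Hypothetical in-cluster chain over {edge, none} that tends to persist,
# so information sits in the *changes* between snapshots.
stationary = {"edge": 0.5, "none": 0.5}
transition = {"edge": {"edge": 0.9, "none": 0.1},
              "none": {"edge": 0.1, "none": 0.9}}
seq = markov_sequence(stationary, transition, T=14)
```

If the cross-cluster chain has the same stationary distribution but a different transition kernel, the marginals of individual snapshots match, and only the temporal dependence distinguishes the two cases, as the discussion after Corollary 5 emphasizes.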
We assume that\nthe Markov chains with respect to both \u00b5 and \u03bd are geometrically ergodic such that for any \u03c4 \u2265 1,\nand label-pair L(1), L(\u03c4 +1),\n\n\u03ba\n\nand\n\n| Pr\u03bd(L(\u03c4 +1)|L(1)) \u2212 \u00af\u03bd(L(\u03c4 +1))| \u2264 \u03ba\u03b3\u03c4\n\n| Pr\u00b5(L(\u03c4 +1)|L(1)) \u2212 \u00af\u00b5(L(\u03c4 +1))| \u2264 \u03ba\u03b3\u03c4\n\nfor some constants \u03ba \u2265 1 and \u03b3 < 1 that only depend on \u00b5 and \u03bd. Let Dl(\u00b5||\u03bd) be the KL-divergence\nl\u2208L \u00af\u00b5(l)Dl(\u00b5||\u03bd) and\nsimilarly for E\u00af\u03bdDl(\u03bd||\u00b5). As in the previous subsection, we use the average log-likelihood ratio as\n(1\u2212\u03b3) minl{\u00af\u00b5(l),\u00af\u03bd(l)}. Applying Theorem 1 gives the following corollary.\nthe weight. De\ufb01ne \u03bb =\nSee sections H\u2013I in the supplementary material for the proof and additional discussion.\nCorollary 5 (Markov Snapshots). Under the above setting, suppose for each label-pair (l, l(cid:48)),\n\nbetween \u00b5(\u00b7|l) and \u03bd(\u00b7|l); Dl(\u03bd||\u00b5) is similarly de\ufb01ned. Let E\u00af\u00b5Dl(\u00b5||\u03bd) =(cid:80)\n(cid:12)(cid:12)(cid:12)log \u00af\u00b5(l)\n\n(cid:12)(cid:12)(cid:12) \u2264 b, D(\u00af\u03bd||\u00af\u00b5) \u2264 \u03b6D(\u00af\u00b5||\u00af\u03bd) and E\u00af\u03bdDl(\u03bd||\u00b5) \u2264 \u03b6E\u00af\u00b5Dl(\u00b5||\u03bd). 
The\n(cid:17)E\u00af\u03bdDl(\u03bd||\u00b5) \u2265 c(b + 2)\n(cid:17)E\u00af\u00b5Dl(\u00b5||\u03bd) \u2265 c(b + 2) max\n\nprogram (2) with MLE weights recovers Y \u2217 with probability at least 1 \u2212 n\u221210 provided\n\n(cid:12)(cid:12)(cid:12)log \u00b5(l(cid:48)|l)\n(cid:16)\n(cid:16)\n\nD(\u00af\u03bd||\u00af\u00b5) +\nD(\u00af\u00b5||\u00af\u03bd) +\n\n(cid:12)(cid:12)(cid:12) \u2264 b,\n\n(cid:110) log n\n\n, (\u03b6 + 1)\u03bb\n\n\u03bd(l(cid:48)|l)\n\nlog n\n\n(cid:111)\n\n.\n\n(11)\n\n(12)\n\n\u00af\u03bd(l)\n\nK\n\nK\n\nn log n\nT K 2\n\n1\nT\n1\nT\n\n1 \u2212 1\nT\n1 \u2212 1\nT\n\n7\n\n\fAs an illuminating example, consider the case where \u00af\u00b5 \u2248 \u00af\u03bd, i.e., the marginal distributions for\nindividual snapshots are identical or very close. It means that the information is contained in the\nchange of labels, but not in the individual labels, as made evident in the LHSs of (11) and (12).\nIn this case, it is necessary to use the temporal information in order to perform clustering. Such\ninformation would be lost if we disregard the ordering of the snapshots, for example, by aggregating\nor averaging the snapshots then apply a single-snapshot clustering algorithm. This highlights an\nessential difference between clustering time-varying graphs and static graphs.\n\n5 Empirical results\n\nTo solve the convex program (2), we follow [13, 9] and adapt the ADMM algorithm by [25]. We\nperform 100 trials for each experiment, and report the success rate, i.e., the fraction of trials where\nthe ground-truth clustering is fully recovered. Error bars show 95% con\ufb01dence interval. Additional\nempirical results are provided in the supplementary material.\nWe \ufb01rst test the planted partition model with partial observations under the challenging sparse (p\nand q close to 0) and dense settings (p and q close to 1); cf. section 4.2. Figures 1 and 2 show the\nresults for n = 1000 with 4 equal-size clusters. 
In both cases, each pair is observed with probability 0.5. For comparison, we include results for the MLE weights as well as the linear weights (based on a linear approximation of the log-likelihood ratio), uniform weights, and an imputation scheme where all unobserved entries are assumed to be "no-edge".

Figure 1: Sparse graphs

Figure 2: Dense graphs

Figure 3: Reality Mining dataset

Corollary 3 predicts more success as the ratio s(p − q)²/(p(1 − q)) gets larger. All else being the same, distributions with small ζ (sparse) are "easier" to solve. Both predictions are consistent with the empirical results in Figures 1 and 2. The results also show that the MLE weights outperform the other weights.

For real data, we use the Reality Mining dataset [26], which contains individuals from two main groups, the MIT Media Lab and the Sloan Business School, which we use as the ground-truth clusters. The dataset records when two individuals interact, i.e., come into proximity of each other or make a phone call, over a 9-month period. We choose a window of 14 weeks (the Fall semester) in which most individuals have non-empty interaction data. This leaves 85 individuals, 25 of whom are from Sloan. We represent the data as a time-varying graph with 14 snapshots (one per week) and two labels: an "edge" if a pair of individuals interact within the week, and "no-edge" otherwise. We compare three models: Markov sequence, independent snapshots, and the aggregate (union) graph. In each trial, the in-cluster and cross-cluster distributions are estimated from a fraction of randomly selected pairwise interaction data. The vertical axis in Figure 3 shows the fraction of pairs whose cluster relationships are correctly identified. From the results, we infer that the interactions between individuals are likely not independent across time, and are better captured by the Markov model.

Acknowledgments

S.H. Lim and H.
Xu were supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112. Y. Chen was supported by NSF grant CIF-31712-23800 and ONR MURI grant N00014-11-1-0688.

References

[1] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic block model. Annals of Statistics, 39:1878–1915, 2011.

[2] A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116–140, 2001.

[3] S. Oymak and B. Hassibi. Finding dense clusters via low rank + sparse decomposition. arXiv:1104.5186v1, 2011.

[4] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. Journal of Machine Learning Research, 15:2213–2238, June 2014.

[5] Sivaraman Balakrishnan, Mladen Kolar, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Statistical and computational tradeoffs in biclustering. In NIPS Workshop on Computational Trade-offs in Statistical Learning, 2011.

[6] Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, and Aarti Singh. Minimax localization of structural information in large noisy matrices. In NIPS, pages 909–917, 2011.

[7] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.

[8] K. Chaudhuri, F. Chung, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. In COLT, 2012.

[9] Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. In NIPS, 2012.

[10] B. Ames and S. Vavasis.
Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69–89, 2011.

[11] C. Mathieu and W. Schudy. Correlation clustering with noisy input. In SODA, page 712, 2010.

[12] Anima Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade. A tensor spectral approach to learning mixed membership community models. arXiv preprint arXiv:1302.2684, 2013.

[13] Y. Chen, S. H. Lim, and H. Xu. Weighted graph clustering with non-uniform uncertainties. In ICML, 2014.

[14] Simon Heimlicher, Marc Lelarge, and Laurent Massoulié. Community detection in the labelled stochastic block model. In NIPS Workshop on Algorithmic and Statistical Approaches for Large Social Networks, 2012.

[15] Marc Lelarge, Laurent Massoulié, and Jiaming Xu. Reconstruction in the labeled stochastic block model. In IEEE Information Theory Workshop, Seville, Spain, September 2013.

[16] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[17] Jimeng Sun, Christos Faloutsos, Spiros Papadimitriou, and Philip S. Yu. GraphScope: parameter-free mining of large time-evolving graphs. In ACM KDD, 2007.

[18] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 554–560. ACM, 2006.

[19] Vikas Kawadia and Sameet Sreenivasan. Sequential detection of temporal communities by estrangement confinement. Scientific Reports, 2, 2012.

[20] N. P. Nguyen, T. N. Dinh, Y. Xuan, and M. T. Thai. Adaptive algorithms for detecting community structure in dynamic social networks. In INFOCOM, pages 2282–2290. IEEE, 2011.

[21] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1), 2004.

[22] A. Jalali, Y. Chen, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. In ICML, 2011.

[23] Brendan P. W. Ames.
Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, pages 1–37, 2013.

[24] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, July 2000.

[25] Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical Report UILU-ENG-09-2215, UIUC, 2009.

[26] Nathan Eagle and Alex (Sandy) Pentland. Reality mining: Sensing complex social systems. Personal and Ubiquitous Computing, 10(4):255–268, March 2006.

[27] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.