{"title": "Spectral Clustering with Perturbed Data", "book": "Advances in Neural Information Processing Systems", "page_first": 705, "page_last": 712, "abstract": "Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data sets are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance.", "full_text": "Spectral Clustering with Perturbed Data\n\nLing Huang\nIntel Research\n\nDonghui Yan\nUC Berkeley\n\nMichael I. Jordan\n\nUC Berkeley\n\nNina Taft\n\nIntel Research\n\nling.huang@intel.com\n\ndhyan@stat.berkeley.edu\n\njordan@cs.berkeley.edu\n\nnina.taft@intel.com\n\nAbstract\n\nSpectral clustering is useful for a wide-ranging set of applications in areas such as\nbiological data analysis, image processing and data mining. However, the com-\nputational and/or communication resources required by the method in processing\nlarge-scale data are often prohibitively high, and practitioners are often required to\nperturb the original data in various ways (quantization, downsampling, etc) before\ninvoking a spectral algorithm. 
In this paper, we use stochastic perturbation theory\nto study the effects of data perturbation on the performance of spectral clustering.\nWe show that the error under perturbation of spectral clustering is closely related\nto the perturbation of the eigenvectors of the Laplacian matrix. From this result\nwe derive approximate upper bounds on the clustering error. We show that this\nbound is tight empirically across a wide range of problems, suggesting that it can\nbe used in practical settings to determine the amount of data reduction allowed in\norder to meet a speci\ufb01cation of permitted loss in clustering performance.\n\n1 Introduction\n\nA critical problem in machine learning is that of scaling: Algorithms should be effective compu-\ntationally and statistically as various dimensions of a problem are scaled. One general tool for\napproaching large-scale problems is that of clustering or partitioning, in essence an appeal to the\nprinciple of divide-and-conquer. However, while the output of a clustering algorithm may yield a\nset of smaller-scale problems that may be easier to tackle, clustering algorithms can themselves be\ncomplex, and large-scale clustering often requires the kinds of preprocessing steps that are invoked\nfor other machine learning algorithms [1], including proto-clustering steps such as quantization,\ndownsampling and compression. Such preprocessing steps also arise in the distributed sensing and\ndistributed computing setting, where communication and storage limitations may preclude transmit-\nting the original data to centralized processors.\n\nA number of recent works have begun to tackle the issue of determining the tradeoffs that arise\nunder various \u201cperturbations\u201d of data, including quantization and downsampling [2, 3, 4]. 
Most of these analyses have been undertaken in the context of well-studied domains such as classification, regression and density estimation, for which there are existing statistical analyses of the effect of noise on performance. Although extrinsic noise differs conceptually from perturbations to data imposed by a data analyst to cope with resource limitations, the mathematical issues arising in the two cases are similar and the analyses of noise have provided a basis for the study of the tradeoffs arising from perturbations.\n\nIn this paper we focus on spectral clustering, a class of clustering methods that are based on eigendecompositions of affinity, dissimilarity or kernel matrices [5, 6, 7, 8]. These algorithms often outperform traditional clustering algorithms such as the K-means algorithm or hierarchical clustering. To date, however, their impact on real-world, large-scale problems has been limited; in particular, a distributed or “in-network” version of spectral clustering has not yet appeared. Moreover, there has been little work on the statistical analysis of spectral clustering, and thus there is little theory to guide the design of distributed algorithms. There is an existing literature on numerical techniques for scaling spectral clustering (including downsampling [9, 10] and the relaxation of precision requirements for the eigenvector computation [7]), but this literature does not provide end-to-end, practical bounds on error rates as a function of data perturbations.\n\nProcedure SpectralClustering(x1, . . . , xn)\nInput: n data samples {xi}_{i=1}^n, xi ∈ R^d\nOutput: Bipartition S and S̄ of the input data\n1. Compute the similarity matrix K: K_{ij} = exp(-||xi - xj||^2 / (2σ_k^2)), ∀ xi, xj\n2. Compute the diagonal degree matrix D: D_i = Σ_{j=1}^n K_{ij}\n3. Compute the normalized Laplacian matrix: L = I - D^{-1}K\n4. Find the second eigenvector v2 of L\n5. Obtain the two partitions using v2:\n6. S = {[i] : v_{2i} > 0}, S̄ = {[i] : v_{2i} ≤ 0}\n\nFigure 1: A spectral bipartitioning algorithm.\n\nFigure 2: Perturbation analysis: from clustering error to data perturbation error. The chain of the analysis runs: data error σ (Assumption A) → similarity matrix error dK (Lemma 3 or 4) → Laplacian matrix error dL (Lemma 2 and Eqs. (7)-(13)) → eigenvector error ||ṽ2 - v2||^2 (Eqs. (5), (6)) → mis-clustering rate η (Proposition 1).\n\nIn this paper we present the first end-to-end analysis of the effect of data perturbations on spectral clustering. Our focus is quantization, but our analysis is general and can be used to treat other kinds of data perturbation. Indeed, given that our approach is based on treating perturbations as random variables, we believe that our methods will also prove useful in developing statistical analyses of spectral clustering (although that is not our focus in this paper).\n\nThe paper is organized as follows. In Section 2, we provide a brief introduction to spectral clustering. Section 3 contains the main results of the paper; specifically we introduce the mis-clustering rate η, and present upper bounds on η due to data perturbations. In Section 4, we present an empirical evaluation of our analyses.
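The procedure in Fig. 1 is concrete enough to sketch directly. Below is a minimal NumPy sketch (not the authors' code); diagonalizing the symmetric matrix I - D^{-1/2} K D^{-1/2}, which shares eigenvalues with L = I - D^{-1}K, is an implementation convenience for numerical stability and is not part of the paper.

```python
import numpy as np

def spectral_bipartition(X, sigma_k=1.0):
    """Sketch of the spectral bipartitioning procedure of Fig. 1."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
    K = np.exp(-sq / (2.0 * sigma_k ** 2))                # step 1: similarity
    D = K.sum(axis=1)                                     # step 2: degrees
    # Step 3: L = I - D^{-1}K is similar to the symmetric matrix
    # I - D^{-1/2} K D^{-1/2}; diagonalize the symmetric one instead.
    Ls = np.eye(n) - K / np.sqrt(np.outer(D, D))
    w, V = np.linalg.eigh(Ls)                             # ascending eigenvalues
    v2 = V[:, 1] / np.sqrt(D)          # step 4: second eigenvector of L
    v2 /= np.linalg.norm(v2)
    return v2 > 0                      # steps 5-6: threshold at zero

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
S = spectral_bipartition(X)
```

On such well-separated data the thresholded second eigenvector recovers the two blobs; which blob gets the label True is arbitrary, since eigenvectors are determined only up to sign.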
Finally, in Section 5 we present our conclusions.\n\n2 Spectral clustering and data perturbation\n\n2.1 Background on spectral clustering algorithms\n\nGiven a set of data points {xi}_{i=1}^n, xi ∈ R^{1×d}, and some notion of similarity between all pairs of data points xi and xj, spectral clustering attempts to divide the data points into groups such that points in the same group are similar and points in different groups are dissimilar. The point of departure of a spectral clustering algorithm is a weighted similarity graph G(V, E), where the vertices correspond to data points and the weights correspond to the pairwise similarities. Based on this weighted graph, spectral clustering algorithms form the graph Laplacian and compute an eigendecomposition of this Laplacian [5, 6, 7]. While some algorithms use multiple eigenvectors and find a k-way clustering directly, the most widely studied algorithms form a bipartitioning of the data by thresholding the second eigenvector of the Laplacian (the eigenvector with the second smallest eigenvalue). Larger numbers of clusters are found by applying the bipartitioning algorithm recursively. We present a specific example of a spectral bipartitioning algorithm in Fig. 1.\n\n2.2 Input data perturbation\n\nLet the data matrix X ∈ R^{n×d} be formed by stacking n data samples in rows. To this data matrix we assume that a perturbation W is applied, such that we obtain a perturbed version X̃ of the original data X. We assume that a spectral clustering algorithm is applied to X̃ and we wish to compare the results of this clustering with respect to the spectral clustering of X. This analysis captures a number of data perturbation methods, including data filtering, quantization, lossy compression and synopsis-based data approximation [11].
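For instance, uniform b-bit quantization of data normalized to [0, 1] fits this template directly: the perturbed version X̃ is the quantized data, and W = X̃ - X is the round-off error. A minimal sketch (the function name is ours, not the paper's):

```python
import numpy as np

def quantize(X, bits):
    """Uniform b-bit quantizer on [0, 1]: one way to produce the
    perturbed data studied in Section 2.2 (sketch)."""
    levels = 2 ** bits - 1          # number of quantization steps
    return np.round(X * levels) / levels

rng = np.random.default_rng(1)
X = rng.random((1000, 4))           # data values normalized to [0, 1]
Xt = quantize(X, bits=4)
W = Xt - X                          # the perturbation matrix W
q = 1.0 / (2 ** 4 - 1)              # quantization step size
# Round-off error is bounded by q/2 in magnitude and behaves roughly
# like zero-mean noise with variance q^2 / 12.
```

Fewer bits mean a coarser grid, a larger step q, and hence a larger perturbation variance, which is exactly the knob the analysis below turns.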
The multi-scale clustering algorithms that use “representative” samples to approximate the original data can be treated using our analysis as well [12].\n\n3 Mis-clustering rate and effects of data perturbation\n\nLet K and L be the similarity and Laplacian matrix on the original data X, and let K̃ and L̃ be those on the perturbed data. We define the mis-clustering rate η as the proportion of samples that have different cluster memberships when computed on the two different versions of the data, X and X̃. We wish to bound η in terms of the “magnitude” of the error matrix W = X̃ - X, which we now define. We make the following general stochastic assumption on the error matrix W:\n\nA. All elements of the error matrix W are i.i.d. random variables with zero mean, bounded variance σ² and bounded fourth central moment μ4; and are independent of X.\n\nRemark. (i) Note that we do not make i.i.d. assumptions on the elements of the similarity matrix; rather, our assumption refers to the input data only. (ii) This assumption is distribution-free, and captures a wide variety of practical data collection and quantization schemes. (iii) Certain data perturbation schemes may not satisfy the independence assumption. We have not yet conducted an analysis of the robustness of our bounds to lack of independence, but in our empirical work we have found that the bounds are robust to relatively small amounts of correlation.\n\nWe aim to produce practically useful bounds on η in terms of σ and the data matrix X. The bounds should be reasonably tight so that in practice they could be used to determine the degree of perturbation σ given a desired level of clustering performance, or to provide a clustering error guarantee on the original data even though we have access only to its approximate version.\n\nFig. 2 outlines the steps in our theoretical analysis.
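Assumption A asks only for the variance σ² and fourth central moment μ4 of the entries of W. For the round-off error of a uniform quantizer with step q these have closed forms, σ² = q²/12 and μ4 = q⁴/80, which the following sketch checks by Monte Carlo (our own illustration, not from the paper):

```python
import numpy as np

def uniform_error_moments(q):
    """Variance and fourth central moment of e ~ Uniform[-q/2, q/2],
    the two quantities required by Assumption A (closed forms)."""
    return q ** 2 / 12.0, q ** 4 / 80.0

rng = np.random.default_rng(2)
q = 0.1
e = rng.uniform(-q / 2, q / 2, size=200_000)   # simulated round-off error
sigma2, mu4 = uniform_error_moments(q)
```

The closed forms follow from integrating x² and x⁴ against the uniform density 1/q over [-q/2, q/2].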
Briefly, when we perturb the input data (e.g., by filtering, quantization or compression), we introduce a perturbation W to the data which is quantified by σ². This induces an error dK := K̃ - K in the similarity matrix, and in turn an error dL := L̃ - L in the Laplacian matrix. This further yields an error in the second eigenvector of the Laplacian matrix, which results in mis-clustering error. Overall, we establish an analytical relationship between the mis-clustering rate η and the data perturbation error σ², where η is usually monotonically increasing with σ². Our goal is to allow practitioners to specify a mis-clustering rate η*, and by inverting this relationship, to determine the right magnitude of the perturbation σ* allowed. That is, our work can provide a practical method to determine the tradeoff between data perturbation and the loss of clustering accuracy due to the use of X̃ instead of X. When the data perturbation can be related to computational or communications savings, then our analysis yields a practical characterization of the overall resource/accuracy tradeoff.\n\nPractical Applications. Consider in particular a clustering task in a distributed networking system that allows an application to specify a desired clustering error C* on the distributed data (which is not available to the coordinator). Through a communication protocol similar to that in [4], the coordinator (e.g., a network operation center) gets access to the perturbed data X̃ for spectral clustering. The coordinator can compute a clustering error bound C using our method. By setting C ≤ C*, it determines the tolerable data perturbation error σ* and instructs distributed devices to use appropriate numbers of bits to quantize their data.
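The inversion step just described only requires that the bound be monotone in σ, so a simple bisection is enough. In the sketch below, eta_bound stands for any callable implementing the bound of Section 3; the toy bound in the usage line is purely illustrative.

```python
def max_sigma_for_target(eta_bound, eta_star, lo=0.0, hi=1.0, iters=50):
    """Largest sigma in [lo, hi] with eta_bound(sigma) <= eta_star,
    assuming eta_bound is monotonically increasing (sketch)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if eta_bound(mid) <= eta_star:
            lo = mid                 # target still met: sigma may grow
        else:
            hi = mid
    return lo

# Toy monotone bound eta(sigma) = sigma^2; target eta* = 0.04 gives sigma* = 0.2.
sigma_star = max_sigma_for_target(lambda s: s * s, 0.04)
```

Given σ*, the coordinator can then pick the smallest number of quantization bits whose step size keeps the perturbation variance below (σ*)².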
Thus we can provide guarantees on the achieved error, C ≤ C*, with respect to the original distributed data even with access only to the perturbed data.\n\n3.1 Upper bounding the mis-clustering rate\n\nLittle is currently known about the connection between clustering error and perturbations to the Laplacian matrix in the spectral clustering setting. [5] presented an upper bound for the clustering error; however, this bound is usually quite loose and is not viable for practical applications. In this section we propose a new approach based on a water-filling argument that yields a tighter, practical bound. Let v2 and ṽ2 be the unit-length second eigenvectors of L and L̃, respectively. We derive a relationship between the mis-clustering rate η and δ² := ||ṽ2 - v2||².\n\nThe intuition behind our derivation is suggested in Fig. 3. Let a and b denote the sets of components in v2 corresponding to clusters of size k1 and k2, respectively, and similarly for a′ and b′ in the case of ṽ2. If v2 is changed to ṽ2 due to the perturbation, an incorrect clustering happens whenever a component of v2 in set a jumps to set b′, denoted as a → b′, or a component in set b jumps to set a′, denoted as b → a′. The key observation is that each flipping of cluster membership in either a → b′ or b → a′ contributes a fairly large amount to the value of δ², compared to the short-range drifts in a → a′ or b → b′. Given a fixed value of δ², the maximum possible number of flippings (i.e., mis-clusterings) is therefore constrained, and this translates into an upper bound for η.\n\nFigure 3: The second eigenvector v2 and its perturbed counterpart ṽ2 (denoted by dashed lines).\n\nFigure 4: An example of the tightness of the upper bound for η in Eq. (1) (Wisconsin breast cancer data): the upper bound of Kannan et al. [5], our upper bound, and the actual mis-clustering rate, plotted against the σ of the noise.\n\nWe make the following assumptions on the data X and its perturbation:\n\nB1. The components of v2 form two clusters (with respect to the spectral bipartitioning algorithm in Fig. 1). The size of each cluster is comparable to n.\n\nB2. The perturbation is small, with the total number of mis-clusterings m < min(k1, k2), and the components of ṽ2 form two clusters. The size of each cluster is comparable to n.\n\nB3. The perturbations of individual components of v2 in each of the sets a → a′, a → b′, b → a′ and b → b′ have identical (not necessarily independent) distributions with bounded second moments, respectively, and they are uncorrelated with the components in v2.\n\nOur perturbation bound can now be stated as follows:\n\nProposition 1. Under assumptions B1, B2 and B3, the mis-clustering rate η of the spectral bipartitioning algorithm under the perturbation satisfies η ≤ δ² = ||ṽ2 - v2||². If we further assume that all components of ṽ2 - v2 are independent, then\n\nη ≤ (1 + o_p(1)) E||ṽ2 - v2||².    (1)\n\nThe proof of the proposition is provided in the Appendix.\n\nRemarks. (i) Assumption B3 was motivated by our empirical work. Although it is difficult to establish general necessary and sufficient conditions for B3 to hold, in the Appendix we present some special cases that allow B3 to be verified a priori. It is also worth noting that B3 appears to hold (approximately) across a range of experiments presented in Section 4.
(ii) If we assume piecewise constancy for v2, then we can relax the uncorrelated assumption in B3. (iii) Our bound has a different flavor than that obtained in [5]. Although the bound in Theorem 4.3 in [5] works for k-way clustering, it assumes a block-diagonal Laplacian matrix and requires the gap between the kth and (k + 1)th eigenvalues to be greater than 1/2, which is unrealistic in many data sets. In the setting of 2-way spectral clustering and a small perturbation, our bound is much tighter than that derived in [5]; see Fig. 4 in particular.\n\n3.2 Perturbation on the second eigenvector of the Laplacian matrix\n\nWe now turn to the relationship between the perturbation of the eigenvectors of a matrix and the perturbation of the matrix itself. One approach is to simply draw on the classical domain of matrix perturbation theory; in particular, applying Theorem V.2.8 from [13], we have the following bound on the (small) perturbation of the second eigenvector:\n\n||ṽ2 - v2|| ≤ 4||dL||_F / (ν - √2 ||dL||_F),    (2)\n\nwhere ν is the gap between the second and the third eigenvalue. However, in our experimental evaluation we found that ν can be quite small in some data sets, and in these cases the right-hand side of (2) can be quite large even for a small perturbation. Thus the bound given by (2) is often not useful in practical applications.\n\nFigure 5: Experimental examples of the fidelity of the approximation in Eq. (5) on (a) Wisconsin breast cancer data, (b) Waveform data and (c) Pen-digits data. We add i.i.d. zero mean Gaussian noise to the input data with different σ, and we see that the right-hand side (RHS) of (5) approximately upper bounds the left-hand side (LHS).\n\nTo derive a more practically useful bound, we begin with a well-known first-order Taylor expansion to compute the perturbation on the second eigenvector of a Laplacian matrix as follows:\n\nṽ2 - v2 = Σ_{j=1, j≠2}^n [(v_j^T dL v2) / (λ2 - λj)] v_j + O(||dL||²) ≈ Σ_{p=1}^n (Σ_{q=1}^n v_{q2} dL_{pq}) · (Σ_{j=1, j≠2}^n (v_{pj} / (λ2 - λj)) v_j) = Σ_{p=1}^n β_p u_p,    (3)\n\nwhere β_p = Σ_{q=1}^n v_{q2} dL_{pq} is a random variable determined by the effect of the perturbation on the Laplacian matrix L, and the vector u_p = Σ_{j=1, j≠2}^n (v_{pj} / (λ2 - λj)) v_j is a constant determined by the eigendecomposition of the Laplacian matrix L. Then we have\n\nE||ṽ2 - v2||² ≈ E||Σ_{p=1}^n β_p u_p||² = Σ_{p=1}^n E||β_p u_p||² + 2 Σ_{i=1}^n Σ_{j=i+1}^n E(β_i β_j u_i^T u_j).    (4)\n\nIn our experimental work we have found that for i ≠ j, β_i u_i is either very weakly correlated with β_j u_j (i.e., the total sum of all cross terms is typically one or two orders of magnitude less than that of the squared terms), or negatively correlated with β_j u_j (i.e., the total sum of all cross terms is less than zero).
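The expansion in Eq. (3) is the standard first-order eigenvector perturbation formula, and it is easy to probe numerically. The sketch below does so for a synthetic symmetric matrix with well-separated eigenvalues (an illustration of the formula itself, not of the clustering pipeline):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
# Symmetric test matrix with well-separated eigenvalues 0, 1, ..., n-1.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
L0 = Q @ np.diag(np.arange(n, dtype=float)) @ Q.T
dM = rng.normal(size=(n, n))
dL = 1e-3 * (dM + dM.T) / 2.0        # small symmetric perturbation

w, V = np.linalg.eigh(L0)
v2, lam2 = V[:, 1], w[1]             # second eigenpair
# First-order term of Eq. (3):
# sum over j != 2 of (v_j^T dL v_2) / (lam_2 - lam_j) * v_j
pred = sum((V[:, j] @ dL @ v2) / (lam2 - w[j]) * V[:, j]
           for j in range(n) if j != 1)

wt, Vt = np.linalg.eigh(L0 + dL)
vt = Vt[:, 1]
if vt @ v2 < 0:                      # eigenvector signs are arbitrary
    vt = -vt
actual = vt - v2                     # exact change in the eigenvector
```

With the eigengap of order one and ||dL|| of order 1e-3, the neglected terms are second order, so the prediction should match the exact change to within a few percent.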
This empirical evidence suggests the following approximate bound:\n\nE||ṽ2 - v2||² ≲ Σ_{p=1}^n E(β_p²) · ||u_p||².    (5)\n\nExamples of the fidelity of this approximation for particular data sets are shown in Fig. 5.\n\nFinally, E(β_p²) is related to dL_{pq}, and can be upper bounded by\n\nE(β_p²) = E(Σ_{q=1}^n v_{q2} dL_{pq})² ≤ Σ_{i=1}^n Σ_{j=1}^n [v_{i2} v_{j2} · E(dL_{pi}) E(dL_{pj}) + |v_{i2} v_{j2}| σ_{pi} σ_{pj}],    (6)\n\nwhere σ_{pi} is the standard deviation of dL_{pi}.\n\nRemark. Through Eqs. (5) and (6), we can bound the squared norm of the perturbation on the second eigenvector in expectation, which in turn bounds the mis-clustering rate. To compute the bound, we need to estimate the first two moments of dL, which we discuss next.\n\n3.3 Perturbation on the Laplacian matrix\n\nLet D be the diagonal matrix with D_i = Σ_j K_{ij}. We define the normalized Laplacian matrix as L = I - D^{-1}K. Letting Δ = D̃ - D and dK = K̃ - K, we have the following approximation for dL = L̃ - L:\n\nLemma 2. If the perturbation dK is small compared to K, then\n\ndL = (1 + o(1)) ΔD^{-2}K - D^{-1}dK.    (7)\n\nThen, element-wise, the first two moments of dL can be estimated as\n\nE(dL) ≈ E(Δ)D^{-2}K - D^{-1}E(dK),    (8)\n\nE(dL²) ≈ E(ΔD^{-2}K ∘ ΔD^{-2}K - 2D^{-1}dK ∘ ΔD^{-2}K + D^{-1}dK ∘ D^{-1}dK) = E(Δ²)D^{-4}K² + D^{-2}E(dK²) - 2E(ΔdK)D^{-3} ∘ K,    (9)\n\nwhere ∘ denotes the element-wise product. The quantities needed to estimate E(dL) and E(dL²) can be obtained from moments and correlations among the elements of the similarity matrix K̃_{ij}.
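Lemma 2 can likewise be sanity-checked numerically: for a small data perturbation, ΔD^{-2}K - D^{-1}dK should track dL closely. A sketch on synthetic data (sizes, seed and bandwidth are illustrative):

```python
import numpy as np

def parts(X, sigma_k=0.5):
    """Similarity K, degrees D and L = I - D^{-1}K (sketch)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * sigma_k ** 2))
    D = K.sum(axis=1)
    return K, D, np.eye(len(X)) - K / D[:, None]

rng = np.random.default_rng(5)
X = rng.random((30, 3))
K, D, L = parts(X)
Kt, Dt, Lt = parts(X + rng.normal(0.0, 1e-4, X.shape))  # perturbed data
dK, delta, dL = Kt - K, Dt - D, Lt - L
# Lemma 2: dL ~ Delta D^{-2} K - D^{-1} dK, with Delta = diag(delta),
# written entry-wise below.
approx = delta[:, None] * K / D[:, None] ** 2 - dK / D[:, None]
```

The discrepancy between dL and the approximation is second order in the perturbation, so it shrinks quadratically as the noise level is reduced.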
In particular, we have\n\nE(dK_{ij}) = E(K̃_{ij}) - K_{ij},  E(dK_{ij})² = E(K̃_{ij}²) - 2K_{ij} E(K̃_{ij}) + K_{ij}²,    (10)\n\nE(Δ_i) = E(D̃_i) - D_i,  E(D̃_i) = Σ_{j=1}^n E(K̃_{ij}),  E(Δ_i²) = E(D̃_i²) - 2D_i · E(D̃_i) + D_i²,    (11)\n\nE(D̃_i²) = E(Σ_{j=1}^n K̃_{ij})² = Σ_{j=1}^n E(K̃_{ij}²) + 2 Σ_{j=1}^n Σ_{q=j+1}^n (E(K̃_{ij}) E(K̃_{iq}) + ρ^k_{ijq} σ^k_{ij} σ^k_{iq}),    (12)\n\nE(ΔdK)_{ij} = E[(D̃_i - D_i)(K̃_{ij} - K_{ij})] = E(D̃_i K̃_{ij}) - D_i E(K̃_{ij}) - K_{ij} E(Δ_i) = E[K̃_{ij}² + K̃_{ij} Σ_{q=1, q≠j}^n K̃_{iq}] - D_i E(K̃_{ij}) - K_{ij} E(Δ_i) = E(K̃_{ij}²) + Σ_{q=1, q≠j}^n (E(K̃_{ij}) E(K̃_{iq}) + ρ^k_{ijq} σ^k_{ij} σ^k_{iq}) - D_i E(K̃_{ij}) - K_{ij} E(Δ_i),    (13)\n\nwhere σ^k_{ij} is the standard deviation of K̃_{ij} and -1 ≤ ρ^k_{ijq} ≤ 1 is the correlation coefficient between K̃_{ij} and K̃_{iq}. Estimating all the ρ^k_{ijq}'s would require an intensive effort. For simplicity, we could set ρ^k_{ijq} to 1 in Eq. (12) and to -1 in Eq. (13), and obtain an upper bound for E(dL²). This bound could optionally be tightened by using a simulation method to estimate the values of ρ^k_{ijq}. However, in our experimental work we have found that our results are insensitive to the values of ρ^k_{ijq}, and setting ρ^k_{ijq} = 0.5 usually achieves good results.\n\nRemark. Eqs. (8)-(13) allow us to estimate (i.e., to upper bound) the first two moments of dL using those of dK, which are computed using Eq.
(15) or (16) in Section 3.4.\n\n3.4 Perturbation on the similarity matrix\n\nThe similarity matrix K̃ on the perturbed data X̃ is\n\nK̃_{ij} = exp(-||xi - xj + εi - εj||² / (2σ_k²)),    (14)\n\nwhere σ_k is the kernel bandwidth. Then, given data X, the first two moments of dK_{ij} = K̃_{ij} - K_{ij}, the error in the similarity matrix, can be determined by one of the following lemmas.\n\nLemma 3. Given X, if all components of εi and εj are i.i.d. Gaussian N(0, σ²), then\n\nE(K̃_{ij}) = M_{ij}(-σ²/σ_k²),  E(K̃_{ij}²) = M_{ij}(-2σ²/σ_k²),    (15)\n\nwhere M_{ij}(t) = exp(λ_{ij} t / (1 - 2t)) / (1 - 2t)^{d/2}, and λ_{ij} = ||xi - xj||² / (2σ²).\n\nLemma 4. Under Assumption A, given X and for large values of the dimension d, the first two moments of K̃_{ij} can be computed approximately as follows:\n\nE(K̃_{ij}) = M_{ij}(-1/(2σ_k²)),  E(K̃_{ij}²) = M_{ij}(-1/σ_k²),    (16)\n\nwhere M_{ij}(t) = exp[(λ_{ij} + 2dσ²) t + (dμ4 + dσ⁴ + 4σ²λ_{ij}) t²], and λ_{ij} = ||xi - xj||².\n\nFigure 6: Synthetic data sets illustrated in two dimensions: (a) Gaussian data, (b) Sin-sep data, (c) Concentric data.\n\nRemark. (i) Given data perturbation error σ, kernel bandwidth σ_k and data X, the first two moments of dK_{ij} can be estimated directly using (15) or (16). (ii) Through Eqs.
(1)-(16), we have established a relationship between the mis-clustering rate η and the data perturbation magnitude σ. By inverting this relationship (e.g., using binary search), we can determine a σ* for a given η*.\n\n4 Evaluation\n\nIn this section we present an empirical evaluation of our analysis on 3 synthetic data sets (see Fig. 6) and 6 real data sets from the UCI repository [14]. The data domains are diverse, including image, medicine, agriculture, etc., and the different data sets impose different difficulty levels on the underlying spectral clustering algorithm, demonstrating the wide applicability of our analysis.\n\nIn the experiments, we use data quantization as the perturbation scheme to evaluate the upper bound provided by our analysis on the clustering error. Fig. 7 plots the mis-clustering rate and the upper bound for data sets subject to varying degrees of quantization. As expected, the mis-clustering rate increases as one decreases the number of quantization bits. We find that the error bounds are remarkably tight, which validates the assumptions we make in the analysis. It is also interesting to note that even when using as few as 3-4 bits, the clustering degrades very little, both in actual error and as assessed by our bound. The effectiveness of our bound should allow the practitioner to determine the right amount of quantization given a permitted loss in clustering performance.\n\n5 Conclusion\n\nIn this paper, we proposed a theoretical analysis of the clustering error for spectral clustering in the face of stochastic perturbations. Our experimental evaluation has provided support for the assumptions made in the analysis, showing that the bound is tight under conditions of practical interest.
We believe that our work, which provides an analytical relationship between the mis-clustering rate and the variance of the perturbation, constitutes a critical step towards enabling a large class of applications that seek to perform clustering of objects, machines, data, etc. in a distributed environment. Many networks are bandwidth constrained, and our methods can guide the process of data thinning so as to limit the amount of data transmitted through the network for the purpose of clustering.\n\nReferences\n\n[1] L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in Advances in Neural Information Processing Systems 20, 2007.\n\n[2] A. Silberstein, G. Puggioni, A. Gelfand, K. Munagala, and J. Yang, “Suppression and failures in sensor networks: A Bayesian approach,” in Proceedings of VLDB, 2007.\n\n[3] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Nonparametric decentralized detection using kernel methods,” IEEE Transactions on Signal Processing, vol. 53, no. 11, pp.
4053–4066, 2005.\n\n[Figure 7 panels: (a) Sin-sep data; (b) Concentric circle data; (c) Gaussian data; (d) Image segmentation data; (e) Pen-digits data; (f) Wine data; (g) Iris data; (h) Wisconsin breast cancer data; (i) Waveform data; each panel plots the upper bound and the test value of the mis-clustering rate against the number of quantization bits.]\n\nFigure 7: Upper bounds of clustering error on approximate data obtained from quantization as a function of the number of bits. (a-c) Simulated data sets (1000 sample size, 2, 2, 10 features, respectively); (d) Statlog image segmentation data (2310 sample size, 19 features); (e) Handwritten digits data (10992 sample size, 16 features); (f) Wine data (178 sample size, 13 features); (g) Iris data (150 sample size, 4 features); (h) Wisconsin breast cancer data (569 sample size, 30 features); (i) Waveform data (5000 sample size, 21 features). The x-axis shows the number of quantization bits and (above the axis) the corresponding data perturbation error σ. Error bars are derived from 25 replications. In the experiments, all data values are normalized in range [0, 1]. For data sets with more than two clusters, we choose two of them for the experiments.\n\n[4] L. Huang, X. Nguyen, M. Garofalakis, A. D. Joseph, M. I. Jordan, and N. Taft, “In-network PCA and anomaly detection,” in Advances in Neural Information Processing Systems (NIPS), 2006.\n\n[5] R. Kannan, S. Vempala, and A.
Vetta, “On clusterings: Good, bad and spectral,” Journal of the ACM, vol. 51, no. 3, pp. 497–515, 2004.\n\n[6] A. Y. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in Neural Information Processing Systems (NIPS), 2002.\n\n[7] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.\n\n[8] U. von Luxburg, M. Belkin, and O. Bousquet, “Consistency of spectral clustering,” Annals of Statistics, vol. 36, no. 2, pp. 555–586, 2008.\n\n[9] P. Drineas and M. W. Mahoney, “On the Nyström method for approximating a Gram matrix for improved kernel-based learning,” in Proceedings of COLT, 2005, pp. 323–337.\n\n[10] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nyström method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004.\n\n[11] G. Cormode and M. Garofalakis, “Sketching streams through the net: Distributed approximate query tracking,” in Proceedings of VLDB, 2005, pp. 13–24.\n\n[12] D. Kushnir, M. Galun, and A. Brandt, “Fast multiscale clustering and manifold identification,” Pattern Recognition, vol. 39, no. 10, pp. 1876–1891, 2006.\n\n[13] G. W. Stewart and J.-G. Sun, Matrix Perturbation Theory. Academic Press, 1990.\n\n[14] A. Asuncion and D.
Newman, “UCI Machine Learning Repository, Department of Information and Computer Science,” 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.", "award": [], "sourceid": 746, "authors": [{"given_name": "Ling", "family_name": "Huang", "institution": null}, {"given_name": "Donghui", "family_name": "Yan", "institution": null}, {"given_name": "Nina", "family_name": "Taft", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}