{"title": "On Robustness of Kernel Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 3098, "page_last": 3106, "abstract": "Clustering is an important unsupervised learning problem in machine learning and statistics. Among many existing algorithms, kernel k-means has drawn much research attention due to its ability to find non-linear cluster boundaries and its inherent simplicity. There are two main approaches for kernel k-means: SVD of the kernel matrix and convex relaxations. Despite the attention kernel clustering has received both from theoretical and applied quarters, not much is known about robustness of the methods. In this paper we first introduce a semidefinite programming relaxation for the kernel clustering problem, then prove that under a suitable model specification, both K-SVD and SDP approaches are consistent in the limit, albeit SDP is strongly consistent, i.e. achieves exact recovery, whereas K-SVD is weakly consistent, i.e. the fraction of misclassified nodes vanishes. Also the error bounds suggest that SDP is more resilient towards outliers, which we also demonstrate with experiments.", "full_text": "On Robustness of Kernel Clustering\n\nBowei Yan\nDepartment of Statistics and Data Sciences\nUniversity of Texas at Austin\n\nPurnamrita Sarkar\nDepartment of Statistics and Data Sciences\nUniversity of Texas at Austin\n\nAbstract\n\nClustering is an important unsupervised learning problem in machine learning\nand statistics. Among many existing algorithms, kernel k-means has drawn much\nresearch attention due to its ability to find non-linear cluster boundaries and its\ninherent simplicity. There are two main approaches for kernel k-means: SVD\nof the kernel matrix and convex relaxations. Despite the attention kernel clustering\nhas received both from theoretical and applied quarters, not much is known\nabout robustness of the methods. 
In this paper we first introduce a semidefinite\nprogramming relaxation for the kernel clustering problem, then prove that under a\nsuitable model specification, both K-SVD and SDP approaches are consistent in\nthe limit, albeit SDP is strongly consistent, i.e. achieves exact recovery, whereas\nK-SVD is weakly consistent, i.e. the fraction of misclassified nodes vanishes. Also\nthe error bounds suggest that SDP is more resilient towards outliers, which we also\ndemonstrate with experiments.\n\n1 Introduction\n\nClustering is an important problem which arises in a variety of real world applications. One of the\nearliest and most widely applied clustering algorithms is k-means, which was named by James MacQueen [14],\nbut was proposed by Hugo Steinhaus [21] even earlier. Despite being half a century old, k-means has\nbeen widely used and analyzed under various settings.\n\nOne major drawback of k-means is its inability to separate clusters that are non-linearly separated.\nThis can be alleviated by mapping the data to a high dimensional feature space and clustering in\nthe feature space [19, 9, 12]; such approaches are generally called kernel-based methods. For instance,\nthe widely-used spectral clustering [20, 16] is an algorithm that calculates the top eigenvectors of a kernel\nmatrix of affinities, followed by k-means on the top r eigenvectors. The consistency of spectral\nclustering is analyzed by [22]. [9] shows that spectral clustering is essentially equivalent to a weighted\nversion of kernel k-means.\n\nThe performance guarantee for clustering is often studied under distributional assumptions; usually\na mixture model with well-separated centers suffices to show consistency. [5] uses a Gaussian\nmixture model, and proposes a variant of the EM algorithm that provably recovers the center of each\nGaussian when the minimum distance between clusters is greater than some multiple of the square\nroot of dimension. 
[2] works with a projection based algorithm and shows the separation needs to\nbe greater than the operator norm and the Frobenius norm of the difference between the data matrix and its\ncorresponding center matrix, up to a constant.\n\nAnother popular technique is based on semidefinite relaxations. For example, [18] proposes an SDP\nrelaxation for k-means-type clustering. In a very recent work, [15] shows the effectiveness of SDP\nrelaxation with k-means clustering for subgaussian mixtures, provided the minimum distance between\ncenters is greater than the variance of the sub-gaussian times the square of the number of clusters r.\nOn a related note, SDP relaxations have been shown to be consistent for community detection in\nnetworks [1, 3]. In particular, [3] considers \u201cinlier\u201d (these are generated from the underlying clustering\nmodel, to be specific, a blockmodel) and \u201coutlier\u201d nodes. The authors show that SDP is weakly\nconsistent in terms of clustering the inlier nodes as long as the number of outliers m is a vanishing\nfraction of the number of nodes.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn contrast, among the numerous works on clustering, not much focus has been on the robustness of\ndifferent kernel k-means algorithms in the presence of arbitrary outliers. [24] illustrates the robustness of\nGaussian kernel based clustering, where no explicit upper bound is given. [8] detects the influential\npoints in kernel PCA by looking at an influence function. In the data mining community, many find that\nclustering can be used to detect outliers, often with heuristic but effective procedures [17, 10]. On the\nother hand, kernel based methods have been shown to be robust for many machine learning tasks.\nFor supervised learning, [23] shows the robustness of SVM by introducing an outlier indicator and\nrelaxing the problem to an SDP. [6, 7, 4] develop robustness results for kernel regression. 
For unsupervised\nlearning, [13] proposes a robust kernel density estimator.\n\nIn this paper we ask the question: how robust are SVD type algorithms and SDP relaxations when\noutliers are present? In the process we also present results which compare these two methods. To be\nspecific, we show that without outliers, SVD is weakly consistent, i.e. the fraction of misclassified\nnodes vanishes with high probability, whereas SDP is strongly consistent, i.e. the number of\nmisclassified nodes vanishes with high probability. We also prove that both methods are robust to\narbitrary outliers as long as the number of outliers is growing at a slower rate than the number of\nnodes. Surprisingly, our results also indicate that SDP relaxations are more resilient to outliers than\nK-SVD methods. The paper is organized as follows. In Section 2 we set up the problem and the data\ngenerating model. We present the main results in Section 3. Proof sketches and more technical details\nare given in Section 4. Numerical experiments in Section 5 illustrate and support our theoretical\nanalysis. Additional analyses are included in the extended version of this paper\u00b9.\n\n2 Problem Setup\n\nWe denote by Y = [Y1,\u00b7\u00b7\u00b7 , Yn]T the n \u00d7 p data matrix. Among the n observations, m outliers are\ndistributed arbitrarily, and n \u2212 m inliers form r equal-sized clusters, denoted by C1,\u00b7\u00b7\u00b7 , Cr. Let\nus denote the index set of inliers by I and the index set of outliers by O, I \u222a O = [n]. Also let\nR = {(i, j) : i \u2208 O or j \u2208 O}.\nThe problem is to recover the true and unknown data partition given by a membership matrix Z \u2208\n{0, 1}n\u00d7r, where Zik = 1 if i belongs to the k-th cluster and 0 otherwise. For convenience we assume\nthe outliers are also arbitrarily equally assigned to r clusters, so that each extended cluster, denoted\nby \u02dcCi, i \u2208 [r], has exactly n/r points. 
A ground truth clustering matrix X0 \u2208 Rn\u00d7n can be obtained\nas $X_0 = ZZ^T$. It can be seen that\n\n$$X_0(i, j) = \\begin{cases} 1 & \\text{if } i, j \\in I \\text{ and belong to the same cluster;} \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nFor the inliers, we assume the following mixture distribution model.\n\n$$\\text{Conditioned on } Z_{ia} = 1, \\quad Y_i = \\mu_a + \\frac{W_i}{\\sqrt{p}}, \\quad E[W_i] = 0, \\quad \\mathrm{Cov}[W_i] = \\sigma_a^2 I_p,$$\n\nwhere the $W_i$ are independent sub-gaussian random vectors.\n\nWe treat Y as a low dimensional signal hidden in high dimensional noise. More concretely, \u00b5a is\nsparse and $\\|\\mu_a\\|_0$ does not depend on n or p; as n \u2192 \u221e, p \u2192 \u221e. Wi\u2019s for i \u2208 [n] are independent.\nFor simplicity, we assume the noise is isotropic and the covariance only depends on the cluster. The\nsub-gaussian assumption is non-parametric and includes most of the commonly used distributions\nsuch as Gaussian and bounded distributions. We include some background material on sub-gaussian\nrandom variables in Appendix A. This general setting for inliers is common and also motivated by\nmany practical problems where the data lies on a low dimensional manifold, but is obscured by\nhigh-dimensional noise [11].\nWe use the kernel matrix based on Euclidean distances between covariates. Our analysis can be\nextended to inner product kernels as well. From now onwards, we will assume that the function\ngenerating the kernel is bounded and Lipschitz.\n\n\u00b9 https://arxiv.org/abs/1606.01869\n\nAssumption 1. For n observations Y1,\u00b7\u00b7\u00b7 , Yn, the kernel matrix (sometimes also called Gram\nmatrix) K is induced by $K(i, j) = f(\\|Y_i - Y_j\\|_2^2)$, where f satisfies $|f(x)| \\le 1$ for all x, and there exists $C_0 > 0$\nsuch that $|f(x) - f(y)| \\le C_0 |x - y|$ for all x, y.\nA widely used example that satisfies the above condition is the Gaussian kernel. 
For simplicity, and\nwithout loss of generality, we assume $K(x, y) = f(\\|x - y\\|_2^2) = \\exp(-\\eta \\|x - y\\|_2^2)$.\nFor the asymptotic analysis, we use the following standard notation for approximate rates of\nconvergence. T (n) is O(f (n)) if for some constants c and n0, T (n) \u2264 cf (n) for all n \u2265 n0; T (n)\nis \u2126(f (n)) if for some constants c and n0, T (n) \u2265 cf (n) for all n \u2265 n0; T (n) is \u0398(f (n)) if T (n) is\nO(f (n)) and \u2126(f (n)); T (n) is o(f (n)) if T (n) is O(f (n)) but not \u2126(f (n)). T (n) is oP (f (n)) (or\nOP (f (n))) if it is o(f (n)) (or O(f (n))) with high probability.\nSeveral matrix norms are considered in this manuscript. For M \u2208 Rn\u00d7n, the $\\ell_1$ and $\\ell_\\infty$ norms\nare defined the same as the vector $\\ell_1$ and $\\ell_\\infty$ norms: $\\|M\\|_1 := \\sum_{ij} |M_{ij}|$ and $\\|M\\|_\\infty := \\max_{i,j} |M_{ij}|$.\nFor two matrices M, Q \u2208 Rm\u00d7n, their inner product is $\\langle M, Q \\rangle = \\mathrm{trace}(M^T Q)$. The\noperator norm $\\|M\\|$ is simply the largest singular value of M, which equals the largest absolute eigenvalue for\na symmetric matrix. Throughout the manuscript, we use 1n to represent the all-ones n \u00d7 1 vector and\nEn, En,k to represent the all-ones matrices of size n \u00d7 n and n \u00d7 k. The subscript will be dropped\nwhen it is clear from context.\n\n2.1 Two kernel clustering algorithms\n\nKernel clustering algorithms can be broadly divided into two categories; one is based on semidefinite\nrelaxation of the k-means objective function and the other is eigen-decomposition based, like kernel\nPCA, spectral clustering, etc. In this section we describe these two settings.\n\nSDP relaxation for kernel clustering. It is well known [9] that kernel k-means can be achieved\nby maximizing $\\mathrm{trace}(Z^T K Z)$ where Z is the n \u00d7 r matrix of cluster memberships. 
However, due to\nthe non-convexity of the constraints, the problem is NP-hard. Thus many convex relaxations have been\nproposed in the literature. In this paper, we propose the following semidefinite programming relaxation.\nThe same relaxation has been used in stochastic block models [1].\n\n$$\\max_X \\ \\mathrm{trace}(KX) \\quad \\text{s.t.} \\quad X \\succeq 0, \\ X \\ge 0, \\ X\\mathbf{1} = \\frac{n}{r}\\mathbf{1}, \\ \\mathrm{diag}(X) = \\mathbf{1} \\qquad \\text{(SDP-1)}$$\n\nThe clustering procedure is listed in Algorithm 1.\n\nAlgorithm 1 SDP relaxation for kernel clustering\nRequire: Observations Y1,\u00b7\u00b7\u00b7 , Yn, kernel function f.\n1: Compute kernel matrix K where $K(i, j) = f(\\|Y_i - Y_j\\|_2^2)$;\n2: Solve SDP-1 and let \u02c6X be the optimal solution;\n3: Do k-means on the r leading eigenvectors U of \u02c6X.\n\nKernel singular value decomposition. Kernel singular value decomposition (K-SVD) is a spectral\nbased clustering approach. One first performs SVD on the kernel matrix, then runs k-means on the first r\neigenvectors. Different variants include K-PCA, which uses singular vectors of the centered kernel\nmatrix, and spectral clustering, which uses singular vectors of the normalized kernel matrix. The detailed\nalgorithm is shown in Algorithm 2.\n\n3 Main results\n\nIn this section we summarize our main results. In this paper we analyze SDP relaxation of kernel\nk-means and K-SVD type methods. Our main contribution is two-fold. First, we show that SDP\nrelaxation produces strongly consistent results, i.e. the number of misclustered nodes goes to zero\nwith high probability when there are no outliers, which means exact recovery without rounding. 
On the other\nhand, K-SVD is weakly consistent, i.e. the fraction of misclassified nodes goes to zero when there are\nno outliers. In the presence of outliers, we see an interesting dichotomy in the behaviors of these two\nmethods. Both can be proven to be weakly consistent in terms of misclassification error. However,\nSDP is more resilient to the effect of outliers than K-SVD, if the number of clusters grows or if the\nseparation between the cluster means decays.\n\nAlgorithm 2 K-SVD (K-PCA, spectral clustering)\nRequire: Observations Y1,\u00b7\u00b7\u00b7 , Yn, kernel function f.\n1: Compute kernel matrix K where $K(i, j) = f(\\|Y_i - Y_j\\|_2^2)$;\n2: if K-PCA then\n3:   $K = K - K\\mathbf{1}\\mathbf{1}^T/n - \\mathbf{1}\\mathbf{1}^T K/n + \\mathbf{1}\\mathbf{1}^T K \\mathbf{1}\\mathbf{1}^T/n^2$;\n4: else if spectral clustering then\n5:   $K = D^{-1/2} K D^{-1/2}$ where $D = \\mathrm{diag}(K\\mathbf{1}_n)$;\n6: end if\n7: Do k-means on the r leading singular vectors V of K.\n\nOur analysis is organized as follows. First we present a result on the concentration of kernel matrices\naround their population counterpart. The population kernel matrix for inliers is blockwise constant\nwith r blocks (except the diagonal, which is one). Next we prove that as n increases, the optimum \u02c6X of\nSDP-1 converges strongly to X0 when there are no outliers, and weakly if the number of outliers\ngrows slowly with n. Then we show the mis-clustering error of the clustering returned by Algorithm 1\ngoes to zero with probability tending to one as n \u2192 \u221e when there are no outliers. Finally, when the\nnumber of outliers is growing slowly with n, the fraction of mis-clustered nodes from Algorithms 1\nand 2 converges to zero.\nWe will start with the concentration of kernel matrices to their population counterpart. 
We show that\nunder our data model (1) the empirical kernel matrix with the Gaussian kernel restricted on inliers\nconcentrates around a \"population\" matrix \u02dcK, and the $\\ell_\\infty$ norm of $K_f^{I \\times I} - \\tilde{K}_f^{I \\times I}$ goes to zero at\nthe rate of $O(\\sqrt{\\log p / p})$.\n\nTheorem 1. Let $d_{k\\ell} = \\|\\mu_k - \\mu_\\ell\\|$. For $i \\in \\tilde{C}_k$, $j \\in \\tilde{C}_\\ell$, define\n\n$$\\tilde{K}_f(i, j) = \\begin{cases} f(d_{k\\ell}^2 + \\sigma_k^2 + \\sigma_\\ell^2) & \\text{if } i \\neq j, \\\\ f(0) & \\text{if } i = j. \\end{cases} \\qquad (1)$$\n\nThen there exists a constant \u03c1 > 0, such that $P\\left(\\|K_f^{I \\times I} - \\tilde{K}_f^{I \\times I}\\|_\\infty \\ge c\\sqrt{\\log p / p}\\right) \\le n^2 p^{-\\rho c^2}$.\n\nRemark 1. Setting $c = \\sqrt{\\frac{3 \\log n}{\\rho \\log p}}$, there exists a constant \u03c1 > 0, such that\n\n$$P\\left(\\|K^{I \\times I} - \\tilde{K}^{I \\times I}\\|_\\infty \\ge \\sqrt{\\frac{3 \\log n}{\\rho p}}\\right) \\le \\frac{1}{n}.$$\n\nThe error probability goes to zero for a suitably chosen constant as long as p is growing faster than\nlog n.\nWhile our analysis is inspired by [11], there are two main differences. First, we have a mixture model\nwhere the population kernel is blockwise constant. Second, we obtain $\\sqrt{\\log p / p}$ rates of convergence\nby carefully bounding the tail probabilities. In order to attain this we further assume that the noise is\nsub-gaussian and isotropic. From now on we will drop the subscript f and refer to the kernel matrix\nas K.\nBy definition, \u02dcK is blockwise constant with r unique rows (except the diagonal elements which are\nones). An important property of \u02dcK is that \u03bbr \u2212 \u03bbr+1 (where \u03bbi is the ith largest eigenvalue of \u02dcK)\nwill be \u2126(n\u03bbmin(B)/r), where B is the r \u00d7 r Gaussian kernel matrix generated by the centers.\n\nLemma 1. 
If the scale parameter in the Gaussian kernel is non-zero, and no two clusters share the\nsame center, let B be the r \u00d7 r matrix where $B_{k\\ell} = f(\\|\\mu_k - \\mu_\\ell\\|)$, then\n\n$$\\lambda_r(\\tilde{K}) - \\lambda_{r+1}(\\tilde{K}) \\ge \\frac{n}{r} \\lambda_{\\min}(B) \\cdot \\min_k \\left(f(\\sigma_k^2)\\right)^2 - 2 \\max_k \\left(1 - f(2\\sigma_k^2)\\right) = \\Omega(n \\lambda_{\\min}(B)/r)$$\n\nNow we present our result on the consistency of SDP-1. To this end, we will upper bound $\\|\\hat{X} - X_0\\|_1$,\nwhere \u02c6X is the optimum returned by SDP-1 and X0 is the true clustering matrix. We first present a\nlemma, which is crucial to the proof of the theorem. Before presenting this, we define\n\n$$\\gamma_{k\\ell} := f(2\\sigma_k^2) - f(d_{k\\ell}^2 + \\sigma_k^2 + \\sigma_\\ell^2); \\qquad \\gamma_{\\min} := \\min_{\\ell \\neq k} \\gamma_{k\\ell} \\qquad (2)$$\n\nThe first quantity $\\gamma_{k\\ell}$ measures separation between the two clusters k and $\\ell$. The second quantity\nmeasures the smallest separation possible. We will assume that \u03b3min is positive. This is very similar\nto asymptotic network analysis, where strong assortativity is often assumed. Our results\nshow that the consistency of clustering deteriorates as \u03b3min decreases.\n\nLemma 2. Let \u02c6X be the solution to (SDP-1), then\n\n$$\\|X_0 - \\hat{X}\\|_1 \\le \\frac{2 \\langle K - \\tilde{K}, \\hat{X} - X_0 \\rangle}{\\gamma_{\\min}} \\qquad (3)$$\n\nCombining the above with the concentration of K from Theorem 1 we have the following result:\n\nTheorem 2. When $d_{k\\ell}^2 > |\\sigma_k^2 - \\sigma_\\ell^2|$, $\\forall k \\neq \\ell$, and $\\gamma_{\\min} = \\Omega\\left(\\sqrt{\\log p / p}\\right)$, then for some absolute\nconstant c > 0, $\\|X_0 - \\hat{X}\\|_1 \\le \\max\\left\\{ o_P(1), o_P\\left(\\frac{mn}{r \\gamma_{\\min}}\\right) \\right\\}$.\n\nRemark 2. 
When there are no outliers in the data, i.e., m = 0, \u02c6X = X0 with high probability and\nSDP-1 is strongly consistent without rounding. When m > 0, the right hand side of the inequality is\ndominated by mn/r. Note that $\\|X_0\\|_1 = n^2/r$; therefore after suitable normalization, the error rate\ngoes to zero with rate O(m/(n\u03b3min)) when n \u2192 \u221e.\nNow we will present the mis-clustering error rate of Algorithms 1 and 2. Although \u02c6X is strongly\nconsistent in the absence of outliers, in practice one often wants to get the labeling in addition to the\nclustering matrix. Therefore it is usually necessary to carry out the last eigen-decomposition step in\nAlgorithm 1. Since X0 is the clustering matrix, its principal eigenvectors are blockwise constant. In\norder to show small mis-clustering error one needs to show that the eigenvectors of \u02c6X are converging\n(modulo a rotation) to those of X0. This is achieved by a careful application of the Davis-Kahan theorem,\na detailed discussion of which we defer to the analysis in Section 4.\nThe Davis-Kahan theorem lets one bound the deviation of the r principal eigenvectors \u02c6U of a\nHermitian matrix \u02c6M from the r principal eigenvectors U of M as $\\|\\hat{U} - UO\\|_F \\le 2^{3/2} \\|M - \\hat{M}\\|_F / (\\lambda_r - \\lambda_{r+1})$ [25], where \u03bbr is the rth largest eigenvalue of M and O is the optimal rotation\nmatrix. For a complete statement of the theorem see Appendix F.\nApplying the result to X0 and \u02dcK provides us with two different upper bounds on the distance between\nleading eigenvectors. We will see in Theorem 3 that the eigengaps derived for the two algorithms differ,\nwhich results in different upper bounds for the number of misclustered nodes. 
Since the Davis-Kahan\nbounds are tight up to a constant [25], despite being upper bounds, this indicates that Algorithm 1 is\nless sensitive to the separation between cluster means than Algorithm 2.\nOnce the eigenvector deviation is established, we present explicit bounds on mis-clustering error for\nboth methods in the following theorem. K-means assigns each row of \u02c6U (input eigenvectors of K or\n\u02c6X) to one of r clusters. Define c1,\u00b7\u00b7\u00b7 , cn \u2208 Rr such that ci is the centroid corresponding to the ith\nrow of \u02c6U. Similarly, for the population eigenvectors U (top r eigenvectors of \u02dcK or X0), we define the\npopulation centroids as (Z\u03bd)i, for some \u03bd \u2208 Rr\u00d7r. Recall that we construct Z such that the outliers\nare equally and arbitrarily divided amongst the r clusters. We show that when the empirical centroids\nare close to the population centroids up to a rotation, then the node will be correctly clustered. We\ngive a general definition of a superset of the misclustered nodes applicable both to K-SVD and SDP:\n\n$$M = \\{ i : \\|c_i - Z_i \\nu O\\| \\ge 1/\\sqrt{2n/r} \\} \\qquad (4)$$\n\nTheorem 3. Let Msdp and Mksvd be defined as in Eq. 4, where the ci\u2019s are generated from Algorithms 1\nand 2 respectively. Let \u03bbr be the rth largest eigenvalue of \u02dcK. We have:\n\n$$|M_{sdp}| \\le \\max\\left\\{ o_P(1), O_P\\left(\\frac{m}{\\gamma_{\\min}}\\right) \\right\\}$$\n\n$$|M_{ksvd}| \\le O_P \\max\\left\\{ \\frac{mn^2}{r(\\lambda_r - \\lambda_{r+1})^2}, \\frac{n^3 \\log p}{rp(\\lambda_r - \\lambda_{r+1})^2} \\right\\}$$\n\nRemark 3. Getting a bound for \u03bbr in terms of \u03b3min for general blockwise constant matrices is\ndifficult. But as shown in Lemma 1, the eigengap is \u2126(n\u03bbmin(B)/r). 
Plugging this back in we have,\n\n$$|M_{ksvd}| \\le \\max\\left\\{ O_P\\left(\\frac{mr}{\\lambda_{\\min}(B)^2}\\right), O_P\\left(\\frac{nr \\log p / p}{\\lambda_{\\min}(B)^2}\\right) \\right\\}.$$\n\nIn some simple cases one can get explicit bounds for \u03bbr, and we have the following.\n\nCorollary 1. Consider the special case when all clusters share the same variance $\\sigma^2$ and $d_{k\\ell}$ are\nidentical for all pairs of clusters. The number of misclustered nodes of K-SVD is upper bounded by:\n\n$$|M_{ksvd}| \\le \\max\\left( O_P\\left(\\frac{mr}{\\gamma_{\\min}^2}\\right), O_P\\left(\\frac{nr \\log p / p}{\\gamma_{\\min}^2}\\right) \\right) \\qquad (5)$$\n\nCorollary 1 is proved in Appendix H.\n\nRemark 4. The situation may happen if the cluster center for a is of the form cea where ea is a\nbinary vector with $e_a(i) = 1_{a=i}$. In this case, the algorithm is weakly consistent (the fraction of\nmisclassified nodes vanishes) when $\\gamma_{\\min} = \\Omega\\left(\\max\\left\\{\\sqrt{\\frac{r \\log p}{p}}, \\sqrt{mr/n}\\right\\}\\right)$. Compared to |Msdp|,\n|Mksvd| has an additional factor of $r/\\gamma_{\\min}$. With the same m, n, the algorithm has a worse upper bound on errors\nand is more sensitive to \u03b3min, which depends both on the data distribution and the scale parameter of\nthe kernel. The proposed SDP can be seen as a denoising procedure which enlarges the separation.\nIt succeeds as long as the denoising is faithful, which requires much weaker assumptions.\n\n4 Proof of the main results\n\nIn this section, we show the proof sketch of the main theorems. The full proofs are deferred to\nsupplementary materials.\n\n4.1 Proof of Theorem 1\n\nIn Theorem 1, we show that if the data distribution is sub-gaussian, the $\\ell_\\infty$ norm of K \u2212 \u02dcK restricted\non the inlier nodes concentrates with rate $O\\left(\\sqrt{\\log p / p}\\right)$.\n\nProof sketch. 
With the Lipschitz condition, it suffices to show that $\\|Y_i - Y_j\\|_2^2$ concentrates to $d_{k\\ell}^2 + \\sigma_k^2 + \\sigma_\\ell^2$.\nTo do this, we decompose $\\|Y_i - Y_j\\|_2^2 = \\|\\mu_k - \\mu_\\ell\\|_2^2 + 2 \\frac{(W_i - W_j)^T}{\\sqrt{p}} (\\mu_k - \\mu_\\ell) + \\frac{\\|W_i - W_j\\|_2^2}{p}$. Now\nit suffices to show the third term concentrates to $\\sigma_k^2 + \\sigma_\\ell^2$ and the second term concentrates around\n0. Note that since Wi \u2212 Wj is sub-gaussian, its square is sub-exponential. With a sub-gaussian tail\nbound and a Bernstein type inequality for sub-exponential random variables, we prove the result.\n\nWith the elementwise bound, the Frobenius norm of the matrix difference is just one more factor of n.\n\nCorollary 2. With probability at least $1 - n^2 p^{-\\rho c^2}$, $\\|K^{I \\times I} - \\tilde{K}^{I \\times I}\\|_F \\le cn\\sqrt{\\log p / p}$.\n\n4.2 Proof of Theorem 2\n\nLemma 2 is proved in Appendix D, where we make use of the optimality condition and the constraints\nin SDP-1. Equipped with Lemma 2 we\u2019re ready to prove Theorem 2.\n\nProof sketch. In the outlier-free ideal scenario, from Lemma 2 along with the duality of the $\\ell_1$ and $\\ell_\\infty$ norms\nwe get $\\|\\hat{X} - X_0\\|_1 \\le \\frac{2 \\|K - \\tilde{K}\\|_\\infty \\|\\hat{X} - X_0\\|_1}{\\gamma_{\\min}}$. Then by Theorem 1, we get the strong consistency result.\nWhen outliers are present, we have to derive a slightly different upper bound. The main idea is to\ndivide the matrices into two parts, one corresponding to the rows and columns of inliers, and the other\ncorresponding to those of the outliers. Now by the concentration result (Theorem 1) on K along\nwith the fact that both the kernel function and X0, \u02c6X are bounded by 1, and that the rows of \u02c6X sum to\nn/r because of the constraint in SDP-1, we obtain the proof. 
The full proof is deferred to Appendix\nE.\n\n4.3 Proof of Theorem 3\n\nAlthough Theorem 2 provides insights on how close the recovered matrix \u02c6X is to the ground truth,\nit remains unclear how the final clustering result behaves. In this section, we bound the number of\nmisclassified points by bounding the distance between the eigenvectors of \u02c6X and X0. We start by presenting a\nlemma that provides a bound for the k-means step.\nK-means is a non-convex procedure and is usually hard to analyze directly. However, when the\ncentroids are well-separated, it is possible to come up with sufficient conditions for a node to be\ncorrectly clustered. When the set of misclustered nodes is defined as in Eq. 4, the cardinality of M is\ndirectly upper bounded by the distance between eigenvectors. To be explicit, we have the following\nlemma. Here \u02c6U denotes the top r eigenvectors of K for K-SVD and \u02c6X for SDP. U denotes the top r\neigenvectors of \u02dcK for K-SVD and X0 for SDP. O denotes the corresponding rotation that aligns the\nempirical eigenvectors to their population counterpart.\n\nLemma 3. If M is defined as in Eq. (4), then $|M| \\le \\frac{8n}{r} \\|\\hat{U} - UO\\|_F^2$.\n\nLemma 3 is proved in Appendix G.\n\nAnalysis of |Msdp|: In order to get the deviation in eigenvectors, note that the rth eigenvalue of X0 is\nn/r, and the (r + 1)th is 0. Let U \u2208 Rn\u00d7r be the top r eigenvectors of X0 and \u02c6U be those of \u02c6X. By\napplying the Davis-Kahan theorem, we have\n\n$$\\exists O, \\ \\|\\hat{U} - UO\\|_F \\le \\frac{2^{3/2} \\|\\hat{X} - X_0\\|_F}{n/r} \\le \\frac{\\sqrt{8 \\|\\hat{X} - X_0\\|_1}}{n/r} = O_P\\left(\\sqrt{\\frac{mr}{n \\gamma_{\\min}}}\\right) \\qquad (6)$$\n\nApplying Lemma 3,\n\n$$|M_{sdp}| \\le \\frac{8n}{r} \\left(\\frac{2^{3/2} \\|\\hat{X} - X_0\\|_F}{n/r}\\right)^2 \\le \\frac{cn}{r} \\left(\\sqrt{\\frac{mr}{n \\gamma_{\\min}}}\\right)^2 \\le O_P\\left(\\frac{m}{\\gamma_{\\min}}\\right)$$\n\nAnalysis of |Mksvd|: In the outlier-present kernel scenario, by Corollary 2,\n\n$$\\|K - \\tilde{K}\\|_F \\le \\|K^{I \\times I} - \\tilde{K}^{I \\times I}\\|_F + \\|K^R - \\tilde{K}^R\\|_F = O_P(n\\sqrt{\\log p / p}) + O_P(\\sqrt{mn})$$\n\nAgain by the Davis-Kahan theorem, and the eigengap between \u03bbr and \u03bbr+1 of \u02dcK from Lemma 1, let U\nbe the matrix whose columns are the top r eigenvectors of \u02dcK. Let \u02c6U be its empirical counterpart.\n\n$$\\exists O, \\ \\|\\hat{U} - UO\\|_F \\le \\frac{2^{3/2} \\|K - \\tilde{K}\\|_F}{\\lambda_r - \\lambda_{r+1}} \\le O_P\\left(\\frac{\\max\\{\\sqrt{mn}, n\\sqrt{\\log p / p}\\}}{\\lambda_r - \\lambda_{r+1}}\\right) \\qquad (7)$$\n\nNow we apply Lemma 3 and get the upper bound for the number of misclustered nodes for K-SVD.\n\n$$|M_{ksvd}| \\le \\frac{8n}{r} \\left(\\frac{2^{3/2} C \\max\\{\\sqrt{mn}, n\\sqrt{\\log p / p}\\}}{\\lambda_r(\\tilde{K}) - \\lambda_{r+1}(\\tilde{K})}\\right)^2 \\le \\frac{Cn}{r} \\max\\left\\{ \\left(\\frac{\\sqrt{mn}}{\\lambda_r - \\lambda_{r+1}}\\right)^2, \\frac{n^2 \\log p}{p(\\lambda_r - \\lambda_{r+1})^2} \\right\\} \\le O_P \\max\\left\\{ \\frac{mn^2}{r(\\lambda_r - \\lambda_{r+1})^2}, \\frac{n^3 \\log p}{rp(\\lambda_r - \\lambda_{r+1})^2} \\right\\}$$\n\n5 Experiments\n\nIn this section, we collect some numerical results. For implementation of the proposed SDP, we use\nthe Alternating Direction Method of Multipliers that is used in [1]. 
In each synthetic experiment, we\ngenerate n \u2212 m inliers from r equal-sized clusters. The centers of the clusters are sparse and hidden\nin p-dimensional noise. For each generated data set, we add in m observations of outliers. To capture the\narbitrary nature of the outliers, we generate half the outliers by a random Gaussian with large variance\n(100 times the signal variance), and the other half by a uniform distribution that scatters across all\nclusters. We compare Algorithm 1 with 1) k-means by Lloyd\u2019s algorithm; 2) kernel SVD and 3)\nkernel PCA by [19].\n\nFigure 1: Performance vs parameters: (a) Inlier accuracy vs number of clusters (n = p = 1500, m =\n10, $d^2$ = 0.125, \u03c3 = 1); (b) Inlier accuracy vs number of outliers (n = 1000, r = 5, $d^2$ = 0.02, \u03c3 =\n1, p = 500); (c) Inlier accuracy vs separation (n = 1000, r = 5, m = 50, \u03c3 = 1, p = 1000).\n\nThe evaluation metric is the accuracy on inliers, i.e., the number of correctly clustered nodes divided by\nthe total number of inliers. To avoid the identification problem, we compare all permutations of the\npredicted labels to the ground truth labels and record the best accuracy. Each parameter setting is run for 10\nreplicates and the mean accuracy and standard deviation (shown as error bars) are reported. For all\nk-means used in the experiments we do 10 restarts and choose the one with the smallest k-means loss.\nFor each experiment, we change only one parameter and fix all the others. Figure 1 shows how\nthe performance of different clustering algorithms changes when (a) the number of clusters, (b) the number\nof outliers, (c) the minimum distance between clusters, increase. The values of all parameters used are\nspecified in the caption of the figure. Panel (a) shows the inlier accuracy for various methods as\nwe increase the number of clusters. It can be seen that with r growing, the performance of all methods\ndeteriorates except for the SDP. 
We also examine the $\\ell_1$ norm of X0 \u2212 \u02c6X, which remains stable as the\nnumber of clusters increases. Panel (b) describes the trend with respect to the number of outliers. The\naccuracy of SDP on inliers is almost unaffected by the number of outliers while other methods suffer\nwith large m. Panel (c) compares the performance as the minimum distance between cluster centers\nchanges. Both SDP and K-SVD are consistent as the distance increases. Compared with K-SVD,\nSDP achieves consistency faster and varies less across random runs, which matches the analysis\ngiven in Section 3.\n\n6 Conclusion\n\nIn this paper, we investigate the consistency and robustness of two kernel-based clustering algorithms.\nWe propose a semidefinite programming relaxation which is shown to be strongly consistent without\noutliers and weakly consistent in the presence of arbitrary outliers. We also show that K-SVD is\nweakly consistent in that the misclustering rate goes to zero as the number of observations grows and the\noutliers are a small fraction of the inliers. By comparing the two methods, we conclude that although both\nare robust to outliers, the proposed SDP is less sensitive to the minimum separation between clusters.\nThe experimental results also support the theoretical analysis.\n\nReferences\n\n[1] A. A. Amini and E. Levina. On semidefinite relaxations for the block model. arXiv preprint\narXiv:1406.5647, 2014.\n\n[2] P. Awasthi and O. Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization,\nand Combinatorial Optimization. Algorithms and Techniques, pages 37\u201349. Springer, 2012.\n\n[3] T. T. Cai, X. Li, et al. Robust and computationally feasible community detection in the presence of arbitrary\noutlier nodes. The Annals of Statistics, 43(3):1027\u20131059, 2015.\n\n[4] A. Christmann and I. Steinwart. Consistency and robustness of kernel-based regression in convex risk\nminimization. 
Bernoulli, pages 799\u2013819, 2007.\n\n[5] S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. The Journal of Machine Learning Research, 8:203\u2013226, 2007.\n\n[6] K. De Brabanter, K. Pelckmans, J. De Brabanter, M. Debruyne, J. A. Suykens, M. Hubert, and B. De Moor. Robustness of kernel based regression: a comparison of iterative weighting schemes. In Artificial Neural Networks\u2013ICANN 2009, pages 100\u2013110. Springer, 2009.\n\n[7] M. Debruyne, M. Hubert, and J. A. Suykens. Model selection in kernel based regression using the influence function. Journal of Machine Learning Research, 9(10), 2008.\n\n[8] M. Debruyne, M. Hubert, and J. Van Horebeek. Detecting influential observations in kernel PCA. Computational Statistics & Data Analysis, 54(12):3007\u20133019, 2010.\n\n[9] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on KDD, pages 551\u2013556. ACM, 2004.\n\n[10] L. Duan, L. Xu, Y. Liu, and J. Lee. Cluster-based outlier detection. Annals of Operations Research, 168(1):151\u2013168, 2009.\n\n[11] N. El Karoui et al. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191\u20133216, 2010.\n\n[12] D.-W. Kim, K. Y. Lee, D. Lee, and K. H. Lee. Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognition, 38(4):607\u2013611, 2005.\n\n[13] J. Kim and C. D. Scott. Robust kernel density estimation. The Journal of Machine Learning Research, 13(1):2529\u20132565, 2012.\n\n[14] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281\u2013297. Oakland, CA, USA, 1967.\n\n[15] D. G. Mixon, S. Villar, and R. Ward. 
Clustering subgaussian mixtures by semidefinite programming. arXiv preprint arXiv:1602.06612, 2016.\n\n[16] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849\u2013856, 2002.\n\n[17] R. Pamula, J. K. Deka, and S. Nandi. An outlier detection method based on clustering. In 2011 Second International Conference on EAIT, pages 253\u2013256. IEEE, 2011.\n\n[18] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186\u2013205, 2007.\n\n[19] B. Sch\u00f6lkopf, A. Smola, and K.-R. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299\u20131319, 1998.\n\n[20] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888\u2013905, 2000.\n\n[21] H. Steinhaus. Sur la division des corps mat\u00e9riels en parties. Bull. Acad. Polon. Sci, 1:801\u2013804, 1956.\n\n[22] U. Von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555\u2013586, 2008.\n\n[23] L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation. In AAAI, volume 6, pages 536\u2013542, 2006.\n\n[24] M.-S. Yang and K.-L. Wu. A similarity-based robust clustering method. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(4):434\u2013448, 2004.\n\n[25] Y. Yu, T. Wang, and R. Samworth. A useful variant of the Davis\u2013Kahan theorem for statisticians. Biometrika, 102(2):315\u2013323, 2015.\n", "award": [], "sourceid": 1536, "authors": [{"given_name": "Bowei", "family_name": "Yan", "institution": "University of Texas at Austin"}, {"given_name": "Purnamrita", "family_name": "Sarkar", "institution": "U.C. Berkeley"}]}