{"title": "Sparse Embedded $k$-Means Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 3319, "page_last": 3327, "abstract": "The $k$-means clustering algorithm is a ubiquitous tool in data mining and machine learning that shows promising performance. However, its high computational cost has hindered its applications in broad domains. Researchers have successfully addressed these obstacles with dimensionality reduction methods. Recently, [1] develop a state-of-the-art random projection (RP) method for faster $k$-means clustering. Their method delivers many improvements over other dimensionality reduction methods. For example, compared to the advanced singular value decomposition based feature extraction approach,  [1] reduce the running time by a factor of $\\min \\{n,d\\}\\epsilon^2 log(d)/k$ for data matrix $X \\in \\mathbb{R}^{n\\times d} $ with $n$ data points and $d$ features, while losing only a factor of one in approximation accuracy. Unfortunately, they still require $\\mathcal{O}(\\frac{ndk}{\\epsilon^2log(d)})$ for matrix multiplication and this cost will be prohibitive for large values of $n$ and $d$. To break this bottleneck, we carefully build a sparse embedded $k$-means clustering algorithm which requires $\\mathcal{O}(nnz(X))$ ($nnz(X)$ denotes the number of non-zeros in $X$) for fast matrix multiplication. Moreover, our proposed algorithm improves on [1]'s results for approximation accuracy by a factor of one. Our empirical studies corroborate our theoretical findings, and demonstrate that our approach is able to significantly accelerate $k$-means clustering, while achieving satisfactory clustering performance.", "full_text": "Sparse Embedded k-Means Clustering\n\nWeiwei Liu\u2020,\u266e,\u2217, Xiaobo Shen\u2021,\u2217,\n\nIvor W. 
Tsang\u266e\n\n\u2020 School of Computer Science and Engineering, The University of New South Wales\n\u2021 School of Computer Science and Engineering, Nanyang Technological University\n\u266e Centre for Artificial Intelligence, University of Technology Sydney\n\n{liuweiwei863,njust.shenxiaobo}@gmail.com\nivor.tsang@uts.edu.au\n\nAbstract\n\nThe k-means clustering algorithm is a ubiquitous tool in data mining and machine learning that shows promising performance. However, its high computational cost has hindered its applications in broad domains. Researchers have successfully addressed these obstacles with dimensionality reduction methods. Recently, [1] develop a state-of-the-art random projection (RP) method for faster k-means clustering. Their method delivers many improvements over other dimensionality reduction methods. For example, compared to the advanced singular value decomposition based feature extraction approach, [1] reduce the running time by a factor of $\\min\\{n,d\\}\\epsilon^2\\log(d)/k$ for data matrix $X \\in \\mathbb{R}^{n\\times d}$ with n data points and d features, while losing only a factor of one in approximation accuracy. Unfortunately, they still require $O(\\frac{ndk}{\\epsilon^2\\log(d)})$ for matrix multiplication and this cost will be prohibitive for large values of n and d. To break this bottleneck, we carefully build a sparse embedded k-means clustering algorithm which requires $O(\\mathrm{nnz}(X))$ ($\\mathrm{nnz}(X)$ denotes the number of non-zeros in X) for fast matrix multiplication. Moreover, our proposed algorithm improves on [1]'s results for approximation accuracy by a factor of one. 
Our empirical studies corroborate our theoretical findings, and demonstrate that our approach is able to significantly accelerate k-means clustering, while achieving satisfactory clustering performance.\n\n1 Introduction\n\nDue to its simplicity and flexibility, the k-means clustering algorithm [2, 3, 4] has been identified as one of the top 10 data mining algorithms. It has shown promising results in various real-world applications, such as bioinformatics, image processing, text mining and image analysis. Recently, the dimensionality and scale of data have continued to grow in many applications, such as biology, finance, computer vision and the web [5, 6, 7, 8, 9], which poses a serious challenge in designing efficient and accurate algorithmic solutions for k-means clustering.\n\nTo address these obstacles, one prevalent technique is dimensionality reduction, which embeds the original features into a low dimensional space before performing k-means clustering. Dimensionality reduction encompasses two kinds of approaches: 1) feature selection (FS), which embeds the data into a low dimensional space by selecting the actual dimensions of the data; and 2) feature extraction (FE), which finds an embedding by constructing new artificial features that are, for example, linear combinations of the original features. Laplacian scores [10] and Fisher scores [11] are two basic feature selection methods. However, they lack a provable guarantee. [12] first propose a provable singular value decomposition (SVD) feature selection method. It uses the SVD to find $O(k\\log(k/\\epsilon)/\\epsilon^2)$ actual features such that with constant probability the clustering structure\n\n\u2217 The first two authors make equal contributions.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nTable 1: Dimension reduction methods for k-means clustering. 
The third column corresponds to the number of selected or extracted features. The fourth column corresponds to the time complexity of each dimension reduction method. The last column corresponds to the approximation accuracy. N/A denotes not available. $\\mathrm{nnz}(X)$ denotes the number of non-zeros in X. $\\epsilon$ and $\\delta$ represent the gap to optimality and the confidence level, respectively. Sparse embedding is abbreviated to SE.\n\nMETHOD | DESCRIPTION | DIMENSIONS | TIME | ACCURACY\n[13] | SVD-FE | $k$ | $O(nd\\min\\{n,d\\})$ | $2$\nFOLKLORE | RP-FE | $O(\\frac{\\log(n)}{\\epsilon^2})$ | $O(\\frac{nd\\log(n)}{\\epsilon^2\\log(d)})$ | $1+\\epsilon$\n[12] | SVD-FS | $O(\\frac{k\\log(k/\\epsilon)}{\\epsilon^2})$ | $O(nd\\min\\{n,d\\})$ | $2+\\epsilon$\n[14] | SVD-FE | $O(\\frac{k}{\\epsilon^2})$ | $O(nd\\min\\{n,d\\})$ | $1+\\epsilon$\n[1] | RP-FE | $O(\\frac{k}{\\epsilon^2})$ | $O(\\frac{ndk}{\\epsilon^2\\log(d)})$ | $2+\\epsilon$\n[15] | RP-FE | $O(\\frac{\\log(n)}{n})$ | $O(d\\log(d)n + d\\log(n))$ | N/A\nTHIS PAPER | SE-FE | $O(\\max(\\frac{k+\\log(1/\\delta)}{\\epsilon^2}, \\frac{6}{\\epsilon^2\\delta}))$ | $O(\\mathrm{nnz}(X))$ | $1+\\epsilon$\n\nis preserved within a factor of $2+\\epsilon$. [13] propose a popular feature extraction approach, where k artificial features are constructed using the SVD such that the clustering structure is preserved within a factor of two. Recently, Corollary 4.5 in [14] improves [13]'s results by claiming that $O(k/\\epsilon^2)$ dimensions are sufficient for preserving $1+\\epsilon$ accuracy.\n\nBecause SVD is computationally expensive, we focus on another important feature extraction method that randomly projects the data into a low dimensional space. [1] develop a state-of-the-art random projection (RP) method, which is based on random sign matrices. Compared to SVD-based feature extraction approaches [14], [1] reduce the running time by a factor of $\\min\\{n,d\\}\\epsilon^2\\log(d)/k$ (footnote 2), while losing only a factor of one in approximation accuracy. 
They also improve the results of the folklore RP method by a factor of $\\log(n)/k$ in terms of the number of embedded dimensions and the running time, while losing a factor of one in approximation accuracy. Compared to SVD-based feature selection methods, [1] reduce the embedded dimension by a factor of $\\log(k/\\epsilon)$ and the running time by a factor of $\\min\\{n,d\\}\\epsilon^2\\log(d)/k$, respectively. Unfortunately, they still require $O(\\frac{ndk}{\\epsilon^2\\log(d)})$ for matrix multiplication and this cost will be prohibitive for large values of n and d.\n\nThis paper carefully constructs a sparse matrix for the RP method that only requires $O(\\mathrm{nnz}(X))$ for fast matrix multiplication. Our algorithm is significantly faster than other dimensionality reduction methods, especially when $\\mathrm{nnz}(X) \\ll nd$. Theoretically, we show a provable guarantee for our algorithm. Given $\\tilde{d} = O(\\max(\\frac{k+\\log(1/\\delta)}{\\epsilon^2}, \\frac{6}{\\epsilon^2\\delta}))$, with probability at least $1 - O(\\delta)$, our algorithm preserves the clustering structure within a factor of $1+\\epsilon$, improving on the results of [12] and [1] by a factor of one for approximation accuracy. These results are summarized in Table 1.\n\nExperiments on four real-world data sets show that our algorithm outperforms other dimension reduction methods. The results verify our theoretical analysis. We organize this paper as follows. Section 2 introduces the concept of $\\epsilon$-approximation k-means clustering and our proposed sparse embedded k-means clustering algorithm. Section 3 analyzes the provable guarantee for our algorithm and experimental results are presented in Section 4. The last section provides our conclusions.\n\n2 Sparse Embedded k-Means Clustering\n\n2.1 $\\epsilon$-Approximation k-Means Clustering\n\nLet $X \\in \\mathbb{R}^{n\\times d}$ be a data matrix with n data points and d features. We denote the transpose of a vector or matrix by the superscript $'$ and logarithms to base 2 by log. Let r = rank(X). 
By using singular value decomposition (SVD), we have $X = U\\Sigma V'$, where $\\Sigma \\in \\mathbb{R}^{r\\times r}$ is a positive diagonal matrix containing the singular values of X in decreasing order ($\\sigma_1 \\geq \\sigma_2 \\geq \\ldots \\geq \\sigma_r$), and $U \\in \\mathbb{R}^{n\\times r}$ and $V \\in \\mathbb{R}^{d\\times r}$ contain orthogonal left and right singular vectors of X. Let $U_k$ and $V_k$ represent U and V with all but their first k columns zeroed out, respectively, and let $\\Sigma_k$ be $\\Sigma$ with all but its largest k singular values zeroed out. Assuming $k \\leq r$, [16] have already shown that $X_k = U_k\\Sigma_k V_k'$ is the optimal rank-k approximation to X for any unitarily invariant norm, including the Frobenius and spectral norms. The pseudoinverse of X is given by $X^+ = V\\Sigma^{-1}U'$. Let $X_{r|k} = X - X_k$. $I_n$ denotes the $n \\times n$ identity matrix. Let $\\|X\\|_F$ be the Frobenius norm of matrix X. For concision, $\\|A\\|_2$ represents the spectral norm of A if A is a matrix and the Euclidean norm of A if A is a vector. Let $\\mathrm{nnz}(X)$ denote the number of non-zeros in X.\n\n2 Refer to Section 2.1 for the notations.\n\nThe task of k-means clustering is to partition n data points in d dimensions into k clusters. Let $\\mu_i$ be the centroid of the vectors in cluster i and $c(x_i)$ be the cluster that $x_i$ is assigned to. Assume $D \\in \\mathbb{R}^{n\\times k}$ is the indicator matrix which has exactly one non-zero element per row, which denotes cluster membership. The i-th data point belongs to the j-th cluster if and only if $D_{ij} = 1/\\sqrt{z_j}$, where $z_j$ denotes the number of data points in cluster j. Note that $D'D = I_k$ and the i-th row of $DD'X$ is the centroid of $x_i$'s assigned cluster. Thus, we have $\\sum_{i=1}^{n} \\|x_i - \\mu_{c(x_i)}\\|_2^2 = \\|X - DD'X\\|_F^2$. We formally define the k-means clustering task as follows, which is also studied in [12] and [1].\n\nDefinition 1 (k-Means Clustering). Given $X \\in \\mathbb{R}^{n\\times d}$ and a positive integer k denoting the number of clusters. Let $\\mathcal{D}$ be the set of all $n \\times k$ indicator matrices D. 
The task of k-means clustering is to solve\n\n$$\\min_{D \\in \\mathcal{D}} \\|X - DD'X\\|_F^2 \\qquad (1)$$\n\nTo accelerate the optimization of problem 1, we aim to find an $\\epsilon$-approximate solution for problem 1 by optimizing D (either exactly or approximately) over an embedded matrix $\\hat{X} \\in \\mathbb{R}^{n\\times\\tilde{d}}$ with $\\tilde{d} < d$. To measure the quality of approximation, we first define the $\\epsilon$-approximation embedded matrix:\n\nDefinition 2 ($\\epsilon$-Approximation Embedded Matrix). Given $0 \\leq \\epsilon < 1$ and a non-negative constant $\\tau$. $\\hat{X} \\in \\mathbb{R}^{n\\times\\tilde{d}}$ with $\\tilde{d} < d$ is an $\\epsilon$-approximation embedded matrix for X, if for all $D \\in \\mathcal{D}$\n\n$$(1-\\epsilon)\\|X - DD'X\\|_F^2 \\leq \\|\\hat{X} - DD'\\hat{X}\\|_F^2 + \\tau \\leq (1+\\epsilon)\\|X - DD'X\\|_F^2 \\qquad (2)$$\n\nWe show that an $\\epsilon$-approximation embedded matrix is sufficient for approximately optimizing problem 1:\n\nLemma 1 ($\\epsilon$-Approximation k-Means Clustering). Given $X \\in \\mathbb{R}^{n\\times d}$ and $\\mathcal{D}$ the set of all $n \\times k$ indicator matrices D, let $D^* = \\arg\\min_{D \\in \\mathcal{D}} \\|X - DD'X\\|_F^2$. Given $\\hat{X} \\in \\mathbb{R}^{n\\times\\tilde{d}}$ with $\\tilde{d} < d$, let $\\hat{D}^* = \\arg\\min_{D \\in \\mathcal{D}} \\|\\hat{X} - DD'\\hat{X}\\|_F^2$. If $\\hat{X}$ is an $\\epsilon'$-approximation embedded matrix for X, given $\\epsilon = 2\\epsilon'/(1-\\epsilon')$, then for any $\\gamma \\geq 1$, if $\\|\\hat{X} - \\hat{D}\\hat{D}'\\hat{X}\\|_F^2 \\leq \\gamma\\|\\hat{X} - \\hat{D}^*\\hat{D}^{*\\prime}\\hat{X}\\|_F^2$, we have\n\n$$\\|X - \\hat{D}\\hat{D}'X\\|_F^2 \\leq (1+\\epsilon)\\gamma\\|X - D^*D^{*\\prime}X\\|_F^2$$\n\nProof. 
By definition, we have $\\|\\hat{X} - \\hat{D}^*\\hat{D}^{*\\prime}\\hat{X}\\|_F^2 \\leq \\|\\hat{X} - D^*D^{*\\prime}\\hat{X}\\|_F^2$ and thus\n\n$$\\|\\hat{X} - \\hat{D}\\hat{D}'\\hat{X}\\|_F^2 \\leq \\gamma\\|\\hat{X} - D^*D^{*\\prime}\\hat{X}\\|_F^2 \\qquad (3)$$\n\nSince $\\hat{X}$ is an $\\epsilon'$-approximation embedded matrix for X, we have\n\n$$\\|\\hat{X} - D^*D^{*\\prime}\\hat{X}\\|_F^2 \\leq (1+\\epsilon')\\|X - D^*D^{*\\prime}X\\|_F^2 - \\tau, \\qquad \\|\\hat{X} - \\hat{D}\\hat{D}'\\hat{X}\\|_F^2 \\geq (1-\\epsilon')\\|X - \\hat{D}\\hat{D}'X\\|_F^2 - \\tau \\qquad (4)$$\n\nCombining Eq.(3) and Eq.(4), we obtain:\n\n$$(1-\\epsilon')\\|X - \\hat{D}\\hat{D}'X\\|_F^2 - \\tau \\leq \\|\\hat{X} - \\hat{D}\\hat{D}'\\hat{X}\\|_F^2 \\leq \\gamma\\|\\hat{X} - D^*D^{*\\prime}\\hat{X}\\|_F^2 \\leq (1+\\epsilon')\\gamma\\|X - D^*D^{*\\prime}X\\|_F^2 - \\tau\\gamma \\qquad (5)$$\n\nSince $\\gamma \\geq 1$ and $\\tau \\geq 0$, we have $\\tau - \\tau\\gamma \\leq 0$, so Eq.(5) implies that\n\n$$\\|X - \\hat{D}\\hat{D}'X\\|_F^2 \\leq \\frac{1+\\epsilon'}{1-\\epsilon'}\\gamma\\|X - D^*D^{*\\prime}X\\|_F^2 \\leq (1+\\epsilon)\\gamma\\|X - D^*D^{*\\prime}X\\|_F^2 \\qquad (6)$$\n\nRemark. Lemma 1 implies that if $\\hat{D}$ is an optimal solution for $\\hat{X}$, then it also preserves $\\epsilon$-approximation for X. Parameter $\\gamma$ allows $\\hat{D}$ to be approximately globally optimal for $\\hat{X}$.\n\nAlgorithm 1 Sparse Embedded k-Means Clustering\nInput: $X \\in \\mathbb{R}^{n\\times d}$. 
Number of clusters k.\nOutput: $\\epsilon$-approximate solution for problem 1.\n1: Set $\\tilde{d} = O(\\max(\\frac{k+\\log(1/\\delta)}{\\epsilon^2}, \\frac{6}{\\epsilon^2\\delta}))$.\n2: Build a random map h so that for any $i \\in [d]$, $h(i) = j$ for $j \\in [\\tilde{d}]$ with probability $1/\\tilde{d}$.\n3: Construct matrix $\\Phi \\in \\{0,1\\}^{d\\times\\tilde{d}}$ with $\\Phi_{i,h(i)} = 1$, and all remaining entries 0.\n4: Construct a random diagonal matrix $Q \\in \\mathbb{R}^{d\\times d}$ whose diagonal entries are i.i.d. Rademacher variables.\n5: Compute the product $\\hat{X} = XQ\\Phi$ and run exact or approximate k-means algorithms on $\\hat{X}$.\n\n2.2 Sparse Embedding\n\n[1] construct a random embedded matrix for fast k-means clustering by post-multiplying X with a $d \\times \\tilde{d}$ random matrix having entries $1/\\sqrt{\\tilde{d}}$ or $-1/\\sqrt{\\tilde{d}}$ with equal probability. However, this method requires $O(\\frac{ndk}{\\epsilon^2\\log(d)})$ for matrix multiplication and this cost will be prohibitive for large values of n and d. To break this bottleneck, Algorithm 1 presents our sparse embedded k-means clustering, which requires $O(\\mathrm{nnz}(X))$ for fast matrix multiplication. Our algorithm is significantly faster than other dimensionality reduction methods, especially when $\\mathrm{nnz}(X) \\ll nd$. For an index i taking values in the set $\\{1, \\cdots, n\\}$, we write $i \\in [n]$.\n\nNext, we state our main theorem to show that $XQ\\Phi$ is an $\\epsilon$-approximation embedded matrix for X:\n\nTheorem 1. Let $\\Phi$ and Q be constructed as in Algorithm 1 and $R = (Q\\Phi)' \\in \\mathbb{R}^{\\tilde{d}\\times d}$. Given $\\tilde{d} = O(\\max(\\frac{k+\\log(1/\\delta)}{\\epsilon^2}, \\frac{6}{\\epsilon^2\\delta}))$. Then for any $X \\in \\mathbb{R}^{n\\times d}$, with a probability of at least $1 - O(\\delta)$, $XR'$ is an $\\epsilon$-approximation embedded matrix for X.\n\n3 Proofs\n\nLet $Z = I_n - DD'$ and tr be the trace notation. 
Eq.(2) can be formulated as: $(1-\\epsilon)\\mathrm{tr}(ZXX'Z) \\leq \\mathrm{tr}(Z\\hat{X}\\hat{X}'Z) + \\tau \\leq (1+\\epsilon)\\mathrm{tr}(ZXX'Z)$. Then, we try to approximate $XX'$ with $\\hat{X}\\hat{X}'$. To prove our main theorem, we write $\\hat{X} = XR'$ and our goal is to show that $\\mathrm{tr}(ZXX'Z)$ can be approximated by $\\mathrm{tr}(ZXR'RX'Z)$. Lemma 2 provides conditions on the error matrix $E = \\hat{X}\\hat{X}' - XX'$ that are sufficient to guarantee that $\\hat{X}$ is an $\\epsilon$-approximation embedded matrix for X. For any two symmetric matrices $A, B \\in \\mathbb{R}^{n\\times n}$, $A \\preceq B$ indicates that $B - A$ is positive semidefinite. Let $\\lambda_i(A)$ denote the i-th largest eigenvalue of A in absolute value. $\\langle\\cdot,\\cdot\\rangle$ represents the inner product, and $0_{n\\times d}$ denotes an $n \\times d$ zero matrix with all its entries being zero.\n\nLemma 2. Let $C = XX'$ and $\\hat{C} = \\hat{X}\\hat{X}'$. If we write $\\hat{C} = C + E_1 + E_2 + E_3 + E_4$, where:\n\n(i) $E_1$ is symmetric and $-\\epsilon_1 C \\preceq E_1 \\preceq \\epsilon_1 C$.\n\n(ii) $E_2$ is symmetric, $\\sum_{i=1}^{k} |\\lambda_i(E_2)| \\leq \\epsilon_2\\|X_{r|k}\\|_F^2$, and $\\mathrm{tr}(E_2) \\leq \\tilde{\\epsilon}_2\\|X_{r|k}\\|_F^2$.\n\n(iii) The columns of $E_3$ fall in the column span of C and $\\mathrm{tr}(E_3'C^+E_3) \\leq \\epsilon_3^2\\|X_{r|k}\\|_F^2$.\n\n(iv) The rows of $E_4$ fall in the row span of C and $\\mathrm{tr}(E_4C^+E_4') \\leq \\epsilon_4^2\\|X_{r|k}\\|_F^2$.\n\nand $\\epsilon_1 + \\epsilon_2 + \\tilde{\\epsilon}_2 + \\epsilon_3 + \\epsilon_4 = \\epsilon$, then $\\hat{X}$ is an $\\epsilon$-approximation embedded matrix for X. Specifically, we have $(1-\\epsilon)\\mathrm{tr}(ZCZ) \\leq \\mathrm{tr}(Z\\hat{C}Z) - \\min\\{0, \\mathrm{tr}(E_2)\\} \\leq (1+\\epsilon)\\mathrm{tr}(ZCZ)$.\n\nThe proof can be found in [17]. Next, we show that $XR'$ is an $\\epsilon$-approximation embedded matrix for X. We first present the following theorem:\n\nTheorem 2. 
Assume $r > 2k$ and let $V_{2k} \\in \\mathbb{R}^{d\\times r}$ represent V with all but its first 2k columns zeroed out. We define $M_1 = V_{2k}'$, $M_2 = \\frac{\\sqrt{k}}{\\|X_{r|k}\\|_F}(X - XV_{2k}V_{2k}')$ and $M \\in \\mathbb{R}^{(n+r)\\times d}$ as containing $M_1$ as its first r rows and $M_2$ as its lower n rows. We construct $R = (Q\\Phi)' \\in \\mathbb{R}^{\\tilde{d}\\times d}$, which is shown in Algorithm 1. Given $\\tilde{d} = O(\\max(\\frac{k+\\log(1/\\delta)}{\\epsilon^2}, \\frac{6}{\\epsilon^2\\delta}))$, then for any $X \\in \\mathbb{R}^{n\\times d}$, with a probability of at least $1 - O(\\delta)$, we have\n\n(i) $\\|(RM')'(RM') - MM'\\|_2 < \\epsilon$.\n\n(ii) $|\\|RM_2'\\|_F^2 - \\|M_2'\\|_F^2| \\leq \\epsilon k$.\n\nProof. To prove the first result, one can easily check that $M_1M_2' = 0_{r\\times n}$, thus $MM'$ is a block diagonal matrix with an upper left block equal to $M_1M_1'$ and lower right block equal to $M_2M_2'$. The spectral norm of $M_1M_1'$ is 1. $\\|M_2M_2'\\|_2 = \\|M_2\\|_2^2 = \\frac{k\\|X - XV_{2k}V_{2k}'\\|_2^2}{\\|X_{r|k}\\|_F^2} = \\frac{k\\|X_{r|2k}\\|_2^2}{\\|X_{r|k}\\|_F^2}$. As $\\|X_{r|k}\\|_F^2 \\geq k\\|X_{r|2k}\\|_2^2$, we derive $\\|M_2M_2'\\|_2 \\leq 1$. Since $MM'$ is a block diagonal matrix, we have $\\|M\\|_2^2 = \\|MM'\\|_2 = \\max\\{\\|M_1M_1'\\|_2, \\|M_2M_2'\\|_2\\} = 1$. $\\mathrm{tr}(M_1M_1') = 2k$. $\\mathrm{tr}(M_2M_2') = \\frac{k\\|X_{r|2k}\\|_F^2}{\\|X_{r|k}\\|_F^2}$. As $\\|X_{r|k}\\|_F^2 \\geq \\|X_{r|2k}\\|_F^2$, we derive $\\mathrm{tr}(M_2M_2') \\leq k$. Then we have $\\|M\\|_F^2 = \\mathrm{tr}(MM') = \\mathrm{tr}(M_1M_1') + \\mathrm{tr}(M_2M_2') \\leq 3k$. 
Applying Theorem 6 from [18], we can obtain that given $\\tilde{d} = O(\\frac{k+\\log(1/\\delta)}{\\epsilon^2})$, with a probability of at least $1 - \\delta$, $\\|(RM')'(RM') - MM'\\|_2 < \\epsilon$. The proof of the second result can be found in the Supplementary Materials.\n\nBased on Theorem 2, we show that $\\hat{X} = XR'$ satisfies the conditions of Lemma 2.\n\nLemma 3. Assume $r > 2k$ and we construct M and R as in Theorem 2. Given $\\tilde{d} = O(\\max(\\frac{k+\\log(1/\\delta)}{\\epsilon^2}, \\frac{6}{\\epsilon^2\\delta}))$, then for any $X \\in \\mathbb{R}^{n\\times d}$, with a probability of at least $1 - O(\\delta)$, $\\hat{X} = XR'$ satisfies the conditions of Lemma 2.\n\nProof. We construct $H_1 \\in \\mathbb{R}^{n\\times(n+r)}$ as $H_1 = [XV_{2k}, 0_{n\\times n}]$, thus $H_1M = XV_{2k}V_{2k}'$. And we set $H_2 \\in \\mathbb{R}^{n\\times(n+r)}$ as $H_2 = [0_{n\\times r}, \\frac{\\|X_{r|k}\\|_F}{\\sqrt{k}}I_n]$, so we have $H_2M = \\frac{\\|X_{r|k}\\|_F}{\\sqrt{k}}M_2 = X - XV_{2k}V_{2k}' = X_{r|2k}$ and $X = H_1M + H_2M$ and we obtain the following:\n\n$$E = \\hat{X}\\hat{X}' - XX' = XR'RX' - XX' = \u2460 + \u2461 + \u2462 + \u2463 \\qquad (7)$$\n\nwhere \u2460 $= H_1MR'RM'H_1' - H_1MM'H_1'$, \u2461 $= H_2MR'RM'H_2' - H_2MM'H_2'$, \u2462 $= H_1MR'RM'H_2' - H_1MM'H_2'$ and \u2463 $= H_2MR'RM'H_1' - H_2MM'H_1'$. We bound \u2460, \u2461, \u2462 and \u2463 separately, showing that each corresponds to one of the error terms described in Lemma 2.\n\nBounding \u2460.\n\n$$E_1 = H_1MR'RM'H_1' - H_1MM'H_1' = XV_{2k}V_{2k}'R'RV_{2k}V_{2k}'X' - XV_{2k}V_{2k}'V_{2k}V_{2k}'X' \\qquad (8)$$\n\n$E_1$ is symmetric. 
By Theorem 2, we know that with a probability of at least $1 - \\delta$, $\\|(RM')'(RM') - MM'\\|_2 < \\epsilon$ holds. Then we get $-\\epsilon I_{n+r} \\preceq (RM')'(RM') - MM' \\preceq \\epsilon I_{n+r}$. And we derive the following:\n\n$$-\\epsilon H_1H_1' \\preceq E_1 \\preceq \\epsilon H_1H_1' \\qquad (9)$$\n\nFor any vector v, $v'XV_{2k}V_{2k}'V_{2k}V_{2k}'X'v = \\|V_{2k}V_{2k}'X'v\\|_2^2 \\leq \\|V_{2k}V_{2k}'\\|_2^2\\|X'v\\|_2^2 = \\|X'v\\|_2^2 = v'XX'v$, so $H_1MM'H_1' = XV_{2k}V_{2k}'V_{2k}V_{2k}'X' \\preceq XX'$. Since $H_1MM'H_1' = XV_{2k}V_{2k}'V_{2k}V_{2k}'X' = XV_{2k}V_{2k}'X' = H_1H_1'$, we have\n\n$$H_1H_1' = H_1MM'H_1' \\preceq XX' = C \\qquad (10)$$\n\nCombining Eqs.(9) and (10), we arrive at a probability of at least $1 - \\delta$,\n\n$$-\\epsilon C \\preceq E_1 \\preceq \\epsilon C \\qquad (11)$$\n\nsatisfying the first condition of Lemma 2.\n\nBounding \u2461.\n\n$$E_2 = H_2MR'RM'H_2' - H_2MM'H_2' = (X - XV_{2k}V_{2k}')R'R(X - XV_{2k}V_{2k}')' - (X - XV_{2k}V_{2k}')(X - XV_{2k}V_{2k}')' \\qquad (12)$$\n\n$E_2$ is symmetric. Note that $H_2$ just selects $M_2$ from M and scales it by $\\|X_{r|k}\\|_F/\\sqrt{k}$. 
Using Theorem 2, we know that with a probability of at least $1 - \\delta$,\n\n$$\\mathrm{tr}(E_2) = \\frac{\\|X_{r|k}\\|_F^2}{k}\\mathrm{tr}(M_2R'RM_2' - M_2M_2') \\leq \\epsilon\\|X_{r|k}\\|_F^2 \\qquad (13)$$\n\nApplying Theorem 6.2 from [19] and rescaling $\\epsilon$, we can obtain that with a probability of at least $1 - \\delta$,\n\n$$\\|E_2\\|_F = \\|X_{r|2k}R'RX_{r|2k}' - X_{r|2k}X_{r|2k}'\\|_F \\leq \\frac{\\epsilon}{\\sqrt{k}}\\|X_{r|2k}\\|_F^2 \\qquad (14)$$\n\nCombining Eq.(14), the Cauchy-Schwarz inequality and $\\|X_{r|2k}\\|_F^2 \\leq \\|X_{r|k}\\|_F^2$, we get that with a probability of at least $1 - \\delta$,\n\n$$\\sum_{i=1}^{k} |\\lambda_i(E_2)| \\leq \\sqrt{k}\\|E_2\\|_F \\leq \\epsilon\\|X_{r|k}\\|_F^2 \\qquad (15)$$\n\nEqs.(13) and (15) satisfy the second condition of Lemma 2.\n\nBounding \u2462.\n\n$$E_3 = H_1MR'RM'H_2' - H_1MM'H_2' = XV_{2k}V_{2k}'R'R(X - XV_{2k}V_{2k}')' - XV_{2k}V_{2k}'(X - XV_{2k}V_{2k}')' \\qquad (16)$$\n\nThe columns of $E_3$ are in the column span of $H_1M = XV_{2k}V_{2k}'$, and so in the column span of C. $\\|V_{2k}\\|_F^2 = \\mathrm{tr}(V_{2k}'V_{2k}) = 2k$. As $V_{2k}'V = V_{2k}'V_{2k}$, $V_{2k}'X_{r|2k}' = V_{2k}'(V\\Sigma U' - V_{2k}\\Sigma_{2k}U_{2k}') = \\Sigma_{2k}U_{2k}' - \\Sigma_{2k}U_{2k}' = 0_{r\\times n}$. 
Applying Theorem 6.2 from [19] again and rescaling $\\epsilon$, we can obtain that with a probability of at least $1 - \\delta$,\n\n$$\\mathrm{tr}(E_3'C^+E_3) = \\|\\Sigma^{-1}U'(H_1MR'RM'H_2' - H_1MM'H_2')\\|_F^2 = \\|V_{2k}'R'RX_{r|2k}' - 0_{r\\times n}\\|_F^2 \\leq \\epsilon^2\\|X_{r|k}\\|_F^2 \\qquad (17)$$\n\nThus, Eq.(17) satisfies the third condition of Lemma 2.\n\nBounding \u2463.\n\n$$E_4 = H_2MR'RM'H_1' - H_2MM'H_1' = (X - XV_{2k}V_{2k}')R'RV_{2k}V_{2k}'X' - (X - XV_{2k}V_{2k}')V_{2k}V_{2k}'X' \\qquad (18)$$\n\n$E_4 = E_3'$ and thus we immediately have that with a probability of at least $1 - \\delta$,\n\n$$\\mathrm{tr}(E_4C^+E_4') \\leq \\epsilon^2\\|X_{r|k}\\|_F^2 \\qquad (19)$$\n\nLastly, Eqs.(11), (13), (15), (17) and (19) ensure that, for any $X \\in \\mathbb{R}^{n\\times d}$, $\\hat{X} = XR'$ satisfies the conditions of Lemma 2 and is an $\\epsilon$-approximation embedded matrix for X with a probability of at least $1 - O(\\delta)$.\n\n4 Experiment\n\n4.1 Data Sets and Baselines\n\nWe denote our proposed sparse embedded k-means clustering algorithm as SE for short. This section evaluates the performance of the proposed method on four real-world data sets: COIL20, SECTOR, RCV1 and ILSVRC2012. The COIL20 [20] and ILSVRC2012 [21] data sets are collected from the websites in footnotes 3 and 4, and the other data sets are collected from the LIBSVM website (footnote 5). 
The statistics of these data sets are presented in the Supplementary Materials.\n\n3 http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php\n4 http://www.image-net.org/challenges/LSVRC/2012/\n5 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n\nFigure 1: Clustering accuracy of various methods on COIL20, SECTOR, RCV1 and ILSVRC2012 data sets. Panels: (a) COIL20, (b) SECTOR, (c) RCV1, (d) ILSVRC2012.\n\nFigure 2: Dimension reduction time of various methods on COIL20, SECTOR, RCV1 and ILSVRC2012 data sets. Panels: (a) COIL20, (b) SECTOR, (c) RCV1, (d) ILSVRC2012.\n\nFigure 3: Clustering time of various methods on COIL20, SECTOR, RCV1 and ILSVRC2012 data sets. Panels: (a) COIL20, (b) SECTOR, (c) RCV1, (d) ILSVRC2012.\n\nWe compare SE with several other dimensionality reduction techniques:\n\n\u2022 SVD: The singular value decomposition or principal components analysis dimensionality reduction approach.\n\u2022 LLE: The local linear embedding (LLE) algorithm is proposed by [22]. We use the code from the website in footnote 6 with default parameters.\n\u2022 LS: [10] develop the Laplacian score (LS) feature selection method. We use the code from the website in footnote 7 with default parameters.\n\u2022 PD: [15] propose an advanced compression scheme for accelerating k-means clustering. We use the code from the website in footnote 8 with default parameters.\n\u2022 RP: The state-of-the-art random projection method is proposed by [1].\n\nAfter dimensionality reduction, we run all methods on a standard k-means clustering package, which is from the website in footnote 9 with default parameters. We also compare all these methods against the standard k-means algorithm on the full dimensional data sets. To measure the quality of all methods, we report clustering accuracy based on the labelled information of the input data. 
Finally, we report the running times (in seconds) of both the dimensionality reduction procedure and the k-means clustering for all baselines.\n\n6 http://www.cs.nyu.edu/~roweis/lle/\n7 www.cad.zju.edu.cn/home/dengcai/Data/data.html\n8 https://github.com/stephenbeckr/SparsifiedKMeans\n9 www.cad.zju.edu.cn/home/dengcai/Data/data.html\n\n[Plots for Figures 1, 2 and 3 omitted. In every panel the x-axis is the number of dimensions (0 to 1000); the y-axes are clustering accuracy (in %), preprocessing time (in seconds) and clustering time (in seconds), respectively. Legends: k-means (Figures 1 and 3 only), SVD, LLE and LS (COIL20 and SECTOR only), RP, PD and SE.]\n\n4.2 Results\n\nThe experimental results of various methods on all data sets are shown in Figures 1, 2 and 3. The Y axes of Figures 2 and 3 represent dimension reduction and clustering time in log scale. We could not obtain the results of SVD, LLE and LS within three days on the RCV1 and ILSVRC2012 data sets. 
Thus, these results are not reported.\n\nFrom Figures 1, 2 and 3, we can see that:\n\n\u2022 As the number of embedded dimensions increases, the clustering accuracy and running times of all dimensionality reduction methods increase, which is consistent with the empirical results in [1].\n\n\u2022 Our proposed dimensionality reduction method has superior performance compared to the RP method and other baselines in terms of accuracy, which verifies our theoretical results. LLE and LS generally underperform on the COIL20 and SECTOR data sets.\n\n\u2022 SVD and LLE are the two slowest methods compared with the other baselines in terms of dimensionality reduction time. The dimension reduction time of the RP method increases significantly as the number of dimensions increases, while our method obtains a stable and the lowest dimensionality reduction time. Our method is orders of magnitude faster than the RP method and other baselines. The results also support our complexity analysis.\n\n\u2022 All dimensionality reduction methods are significantly faster than the standard k-means algorithm with full dimensions. Finally, we conclude that our proposed method is able to significantly accelerate k-means clustering, while achieving satisfactory clustering performance.\n\n5 Conclusion\n\nThe k-means clustering algorithm is a ubiquitous tool in data mining and machine learning with numerous applications. The increasing dimensionality and scale of data pose a considerable challenge in designing efficient and accurate k-means clustering algorithms. Researchers have successfully addressed these obstacles with dimensionality reduction methods. These methods embed the original features into a low dimensional space, and then perform k-means clustering on the embedded dimensions. SVD is one of the most popular dimensionality reduction methods. However, it is computationally expensive. 
Recently, [1] develop a state-of-the-art RP method for faster k-means clustering. Their method delivers many improvements over other dimensionality reduction methods. For example, compared to an advanced SVD-based feature extraction approach [14], [1] reduce the running time by a factor of $\\min\\{n,d\\}\\epsilon^2\\log(d)/k$, while only losing a factor of one in approximation accuracy. They also improve the result of the folklore RP method by a factor of $\\log(n)/k$ in terms of the number of embedded dimensions and the running time, while losing a factor of one in approximation accuracy. Unfortunately, their method still requires $O(\\frac{ndk}{\\epsilon^2\\log(d)})$ for matrix multiplication and this cost will be prohibitive for large values of n and d. To break this bottleneck, we carefully construct a sparse matrix for the RP method that only requires $O(\\mathrm{nnz}(X))$ for fast matrix multiplication. Our algorithm is significantly faster than other dimensionality reduction methods, especially when $\\mathrm{nnz}(X) \\ll nd$. Furthermore, we improve the results of [12] and [1] by a factor of one for approximation accuracy. Our empirical studies demonstrate that our proposed algorithm outperforms other dimension reduction methods, which corroborates our theoretical findings.\n\nAcknowledgments\n\nWe would like to thank the area chairs and reviewers for their valuable comments and constructive suggestions on our paper. This project is supported by the ARC Future Fellowship FT130100746, ARC grant LP150100671, DP170101628, DP150102728, DP150103071, NSFC 61232006 and NSFC 61672235.\n\nReferences\n\n[1] Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Trans. Information Theory, 61(2):1045\u20131062, 2015.\n\n[2] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28(1):100\u2013108, 1979.\n\n[3] Xiao-Bo Shen, Weiwei Liu, Ivor W. 
Tsang, Fumin Shen, and Quan-Sen Sun. Compressed k-means for large-scale clustering. In AAAI, pages 2527\u20132533, 2017.\n\n[4] Xinwang Liu, Miaomiao Li, Lei Wang, Yong Dou, Jianping Yin, and En Zhu. Multiple kernel k-means with incomplete kernels. In AAAI, pages 2259\u20132265, 2017.\n\n[5] Tom M. Mitchell, Rebecca A. Hutchinson, Radu Stefan Niculescu, Francisco Pereira, Xuerui Wang, Marcel Adam Just, and Sharlene D. Newman. Learning to decode cognitive states from brain images. Machine Learning, 57(1-2):145\u2013175, 2004.\n\n[6] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature selection: Beyond the linear model. Journal of Machine Learning Research, 10:2013\u20132038, 2009.\n\n[7] Jorge S\u00e1nchez, Florent Perronnin, Thomas Mensink, and Jakob J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222\u2013245, 2013.\n\n[8] Yiteng Zhai, Yew-Soon Ong, and Ivor W. Tsang. The emerging \u201cbig dimensionality\u201d. IEEE Computational Intelligence Magazine, 9(3):14\u201326, 2014.\n\n[9] Weiwei Liu and Ivor W. Tsang. Making decision trees feasible in ultrahigh feature and label dimensions. Journal of Machine Learning Research, 18(81):1\u201336, 2017.\n\n[10] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In NIPS, pages 507\u2013514, 2005.\n\n[11] Donald H. Foley and John W. Sammon Jr. An optimal set of discriminant vectors. IEEE Trans. Computers, 24(3):281\u2013289, 1975.\n\n[12] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. Unsupervised feature selection for the k-means clustering problem. In NIPS, pages 153\u2013161, 2009.\n\n[13] Petros Drineas, Alan M. Frieze, Ravi Kannan, Santosh Vempala, and V. Vinay. Clustering in large graphs and matrices. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 291\u2013299, 1999.\n\n[14] Dan Feldman, Melanie Schmidt, and Christian Sohler. 
Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434\u20131453, 2013.\n\n[15] Farhad Pourkamali Anaraki and Stephen Becker. Preconditioned data sparsification for big data with applications to PCA and k-means. IEEE Trans. Information Theory, 63(5):2954\u20132974, 2017.\n\n[16] Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11:50\u201359, 1960.\n\n[17] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 163\u2013172, 2015.\n\n[18] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrix product in terms of stable rank. In 43rd International Colloquium on Automata, Languages, and Programming, pages 11:1\u201311:14, 2016.\n\n[19] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1):4:1\u20134:23, 2014.\n\n[20] Rong Wang, Feiping Nie, Xiaojun Yang, Feifei Gao, and Minli Yao. Robust 2DPCA with non-greedy l1-norm maximization for image analysis. IEEE Trans. Cybernetics, 45(5):1108\u20131112, 2015.\n\n[21] Weiwei Liu, Ivor W. Tsang, and Klaus-Robert M\u00fcller. An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18(94):1\u201338, 2017.\n\n[22] Sam T. Roweis and Lawrence K. Saul. 
Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326, 2000.\n", "award": [], "sourceid": 1880, "authors": [{"given_name": "Weiwei", "family_name": "Liu", "institution": "UTS"}, {"given_name": "Xiaobo", "family_name": "Shen", "institution": "NJUST"}, {"given_name": "Ivor", "family_name": "Tsang", "institution": "University of Technology, Sydney"}]}