{"title": "Greedy Subspace Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 2753, "page_last": 2761, "abstract": "We consider the problem of subspace clustering: given points that lie on or near the union of many low-dimensional linear subspaces, recover the subspaces. To this end, one first identifies sets of points close to the same subspace and uses the sets to estimate the subspaces. As the geometric structure of the clusters (linear subspaces) forbids proper performance of general distance-based approaches such as K-means, many model-specific methods have been proposed. In this paper, we provide new simple and efficient algorithms for this problem. Our statistical analysis shows that the algorithms are guaranteed exact (perfect) clustering performance under certain conditions on the number of points and the affinity between subspaces. These conditions are weaker than those considered in the standard statistical literature. Experimental results on synthetic data generated from the standard unions of subspaces model demonstrate our theory. We also show that our algorithm performs competitively against state-of-the-art algorithms on real-world applications such as motion segmentation and face clustering, with much simpler implementation and lower computational cost.", "full_text": "Greedy Subspace Clustering

Dohyung Park
Dept. of Electrical and Computer Engineering
The University of Texas at Austin
dhpark@utexas.edu

Constantine Caramanis
Dept. of Electrical and Computer Engineering
The University of Texas at Austin
constantine@utexas.edu

Sujay Sanghavi
Dept. of Electrical and Computer Engineering
The University of Texas at Austin
sanghavi@mail.utexas.edu

Abstract

We consider the problem of subspace clustering: given points that lie on or near the union of many low-dimensional linear subspaces, recover the subspaces.
To this end, one first identifies sets of points close to the same subspace and uses the sets to estimate the subspaces. As the geometric structure of the clusters (linear subspaces) forbids proper performance of general distance-based approaches such as K-means, many model-specific methods have been proposed. In this paper, we provide new simple and efficient algorithms for this problem. Our statistical analysis shows that the algorithms are guaranteed exact (perfect) clustering performance under certain conditions on the number of points and the affinity between subspaces. These conditions are weaker than those considered in the standard statistical literature. Experimental results on synthetic data generated from the standard unions of subspaces model demonstrate our theory. We also show that our algorithm performs competitively against state-of-the-art algorithms on real-world applications such as motion segmentation and face clustering, with much simpler implementation and lower computational cost.

1 Introduction

Subspace clustering is a classic problem where one is given points in a high-dimensional ambient space and would like to approximate them by a union of lower-dimensional linear subspaces. In particular, each subspace contains a subset of the points. This problem is hard because one needs to jointly find the subspaces, and the points corresponding to each; the data we are given are unlabeled. The unions of subspaces model naturally arises in settings where data from multiple latent phenomena are mixed together and need to be separated.
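To make this generative picture concrete, the following minimal sketch (not code from the paper; the dimensions and the random-subspace construction are illustrative assumptions, and the formal models appear later in Section 3.1) samples unit-norm points from a union of random low-dimensional subspaces:

```python
import numpy as np

def sample_union_of_subspaces(p, d, L, n, seed=0):
    """Draw n unit-norm points from each of L random d-dimensional subspaces of R^p."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for l in range(L):
        # Orthonormal basis of a random d-dimensional subspace of R^p.
        D, _ = np.linalg.qr(rng.standard_normal((p, d)))
        X = rng.standard_normal((d, n))
        X /= np.linalg.norm(X, axis=0)   # uniform directions within the subspace
        points.append(D @ X)             # these points lie exactly on the subspace
        labels.extend([l] * n)
    return np.hstack(points), np.array(labels)

# e.g., 4 three-dimensional subspaces in R^10, 50 points each
Y, labels = sample_union_of_subspaces(p=10, d=3, L=4, n=50)
```

A subspace clustering algorithm receives only the column matrix `Y`; the labels are hidden and must be recovered.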
Applications of subspace clustering include motion segmentation [23], face clustering [8], gene expression analysis [10], and system identification [22]. In these applications, data points with the same label (e.g., face images of a person under varying illumination conditions, feature points of a moving rigid object in a video sequence) lie on a low-dimensional subspace, and the mixed dataset can be modeled by unions of subspaces. For a detailed description of the applications, we refer the readers to the reviews [10, 20] and references therein.

There is now a sizable literature on empirical methods for this particular problem and some statistical analysis as well. Many recently proposed methods, which perform remarkably well and have theoretical guarantees on their performance, can be characterized as involving two steps: (a) finding a "neighborhood" for each data point, and (b) finding the subspaces and/or clustering the points given these neighborhoods.
Here, neighbors of a point are other points that the algorithm estimates to lie on the same subspace as the point (and not necessarily just closest in Euclidean distance).

Algorithm      | What is guaranteed    | Subspace condition | Fully random model                  | Semi-random model
SSC [4, 16]    | Correct neighborhoods | None               | d/p = O(log(n/d)/log(nL))           | max aff = O(sqrt(log(n/d))/log(nL))
LRR [14]       | Exact clustering      | No intersection    | -                                   | -
SSC-OMP [3]    | Correct neighborhoods | No intersection    | -                                   | -
TSC [6, 7]     | Exact clustering      | None               | d/p = O(1/log(nL))                  | max aff = O(1/log(nL))
LRSSC [24]     | Correct neighborhoods | None               | d/p = O(1/log(nL))                  | -
NSN+GSR        | Exact clustering      | None               | d/p = O(log n/log(ndL))             | max aff = O(sqrt(log n/((log dL)·log(ndL))))
NSN+Spectral   | Exact clustering      | None               | d/p = O(log n/log(ndL))             | -

Table 1: Subspace clustering algorithms with theoretical guarantees. LRR and SSC-OMP have only deterministic guarantees, not statistical ones. In the two standard statistical models, there are n data points on each of L d-dimensional subspaces in R^p. For the definition of max aff, we refer the readers to Section 3.1.

Our contributions: In this paper we devise new algorithms for each of the two steps above; (a) we develop a new method, Nearest Subspace Neighbor (NSN), to determine a neighborhood set for each point, and (b) a new method, Greedy Subspace Recovery (GSR), to recover subspaces from given neighborhoods. Each of these two methods can be used in conjunction with other methods for the corresponding other step; however, in this paper we focus on two algorithms that use NSN followed by GSR and Spectral clustering, respectively.
Our main result is establishing statistical guarantees for exact clustering with general subspace conditions, in the standard models considered in recent analytical literature on subspace clustering. Our condition for exact recovery is weaker than the conditions of other existing algorithms that only guarantee correct neighborhoods¹, which do not always lead to correct clustering. We provide numerical results which demonstrate our theory. We also show that for the real-world applications our algorithm performs competitively against state-of-the-art algorithms, but at much lower computational cost. Moreover, our algorithms are much simpler to implement.

1.1 Related work

The problem was first formulated in the data mining community [10]. Most of the related work in this field assumes that an underlying subspace is parallel to some canonical axes. Subspace clustering for unions of arbitrary subspaces is considered mostly in the machine learning and the computer vision communities [20]. Most of the results from those communities are based on empirical justification. They provided algorithms derived from theoretical intuition and showed that they perform empirically well on practical datasets. To name a few, GPCA [21], Spectral Curvature Clustering (SCC) [2], and many iterative methods [1, 19, 26] show good empirical performance for subspace clustering. However, they lack theoretical analysis that guarantees exact clustering.

As described above, several algorithms with a common structure have recently been proposed with both theoretical guarantees and remarkable empirical performance. Elhamifar and Vidal [4] proposed an algorithm called Sparse Subspace Clustering (SSC), which uses ℓ1-minimization for neighborhood construction. They proved that if the subspaces have no intersection², SSC always finds a correct neighborhood matrix.
Later, Soltanolkotabi and Candès [16] provided a statistical guarantee for the algorithm for subspaces with intersection. Dyer et al. [3] proposed another algorithm called SSC-OMP, which uses Orthogonal Matching Pursuit (OMP) instead of ℓ1-minimization in SSC. Another algorithm called Low-Rank Representation (LRR), which uses nuclear norm minimization, was proposed by Liu et al. [14]. Wang et al. [24] proposed a hybrid algorithm, Low-Rank and Sparse Subspace Clustering (LRSSC), which involves both the ℓ1-norm and the nuclear norm. Heckel and Bölcskei [6] presented Thresholding based Subspace Clustering (TSC), which constructs neighborhoods based on the inner products between data points. All of these algorithms use spectral clustering for the clustering step.

The analysis in those papers focuses on neither exact recovery of the subspaces nor exact clustering in general subspace conditions. SSC, SSC-OMP, and LRSSC only guarantee correct neighborhoods, which do not always lead to exact clustering. LRR guarantees exact clustering only when the subspaces have no intersections. In this paper, we provide novel algorithms that guarantee exact clustering in general subspace conditions. While we were preparing this manuscript, it was proved that TSC guarantees exact clustering under certain conditions [7], but the conditions are stricter than ours. (See Table 1.)

¹By correct neighborhood, we mean that for each point every neighbor point lies on the same subspace.
²By no intersection between subspaces, we mean that they share only the null point.

1.2 Notation

There is a set of N data points in R^p, denoted by Y = {y_1, . . . , y_N}. The data points lie on or near a union of L subspaces D = ∪_{i=1}^L D_i. Each subspace D_i is of dimension d_i, which is smaller than p. For each point y_j, w_j denotes the index of the nearest subspace. Let N_i denote the number of points whose nearest subspace is D_i, i.e., N_i = Σ_{j=1}^N I{w_j = i}.
Throughout this paper, sets and subspaces are denoted by calligraphic letters. Matrices and key parameters are denoted by letters in upper case, and vectors and scalars are denoted by letters in lower case. We frequently denote the set of n indices by [n] = {1, 2, . . . , n}. As usual, span{·} denotes the subspace spanned by a set of vectors; for example, span{v_1, . . . , v_n} = {v : v = Σ_{i=1}^n α_i v_i, α_1, . . . , α_n ∈ R}. Proj_U y is defined as the projection of y onto the subspace U, that is, Proj_U y = arg min_{u∈U} ||y − u||_2. I{·} denotes the indicator function, which is one if the statement is true and zero otherwise. Finally, ⊕ denotes the direct sum.

2 Algorithms

We propose two algorithms for subspace clustering as follows.

• NSN+GSR: Run Nearest Subspace Neighbor (NSN) to construct a neighborhood matrix W ∈ {0, 1}^{N×N}, and then run Greedy Subspace Recovery (GSR) for W.
• NSN+Spectral: Run Nearest Subspace Neighbor (NSN) to construct a neighborhood matrix W ∈ {0, 1}^{N×N}, and then run spectral clustering for Z = W + W^T.

2.1 Nearest Subspace Neighbor (NSN)

NSN approaches the problem of finding neighbor points most likely to be on the same subspace in a greedy fashion. At first, given a point y without any other knowledge, the single point most likely to be a neighbor of y is the point nearest to the line span{y}. In the following steps, if we have found a few correct neighbor points (lying on the same true subspace) and have no other knowledge about the true subspace and the rest of the points, then the most potentially correct point is the one closest to the subspace spanned by the correct neighbors we have. This motivates us to propose NSN, described in the following.

Algorithm 1 Nearest Subspace Neighbor (NSN)
Input: A set of N samples Y = {y_1, . . . , y_N}, the number of required neighbors K, maximum subspace dimension k_max
Output: A neighborhood matrix W ∈ {0, 1}^{N×N}

  y_i ← y_i/||y_i||_2, ∀i ∈ [N]                          ▷ Normalize magnitudes
  for i = 1, . . . , N do                                 ▷ Run NSN for each data point
    I_i ← {i}
    for k = 1, . . . , K do                               ▷ Iteratively add the closest point to the current subspace
      if k ≤ k_max then
        U ← span{y_j : j ∈ I_i}
      end if
      j* ← arg max_{j ∈ [N]\I_i} ||Proj_U y_j||_2
      I_i ← I_i ∪ {j*}
    end for
    W_ij ← I{j ∈ I_i or y_j ∈ U}, ∀j ∈ [N]               ▷ Construct the neighborhood matrix
  end for

NSN collects K neighbors sequentially for each point. At each step k, a k-dimensional subspace U spanned by the point and its k−1 neighbors is constructed, and the point closest to the subspace is newly collected. After k ≥ k_max, the subspace U constructed at the k_max-th step is used for collecting neighbors. At last, if there are more points lying on U, they are also counted as neighbors. The subspace U can be stored in the form of a matrix U ∈ R^{p×dim(U)} whose columns form an orthonormal basis of U. Then ||Proj_U y_j||_2 can be computed easily because it is equal to ||U^T y_j||_2. While a naive implementation requires O(K²pN²) computational cost, this can be reduced to O(KpN²); the faster implementation is described in Section A.1. We note that this computational cost is much lower than that of the convex optimization based methods (e.g., SSC [4] and LRR [14]), which solve a convex program with N² variables and pN constraints.

NSN for subspace clustering shares the same philosophy with Orthogonal Matching Pursuit (OMP) for sparse recovery in the sense that it incrementally picks the point (dictionary element) that is most likely to be correct, assuming that the algorithm has found the correct ones so far.
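The greedy loop of Algorithm 1 is short to write down. Below is a minimal Python sketch (an illustrative O(K²pN²) version, not the faster implementation of Section A.1); it maintains an orthonormal basis `U` so that ||Proj_U y_j||_2 = ||U^T y_j||_2 as noted above, and for brevity it omits the final "y_j ∈ U" check when forming W:

```python
import numpy as np

def nsn(Y, K, k_max):
    """Nearest Subspace Neighbor sketch.

    Y : p x N array with unit-norm columns.
    Returns an N x N binary neighborhood matrix W.
    """
    p, N = Y.shape
    W = np.zeros((N, N), dtype=int)
    for i in range(N):
        selected = [i]
        U = Y[:, [i]]                       # orthonormal basis of the current subspace
        for _ in range(K):
            # ||Proj_U y_j||_2 = ||U^T y_j||_2 since U has orthonormal columns
            score = np.linalg.norm(U.T @ Y, axis=0)
            score[selected] = -np.inf       # never re-select a point
            j_star = int(np.argmax(score))
            selected.append(j_star)
            if U.shape[1] < k_max:          # grow the subspace only up to k_max dims
                r = Y[:, j_star] - U @ (U.T @ Y[:, j_star])
                nr = np.linalg.norm(r)
                if nr > 1e-12:              # Gram-Schmidt step
                    U = np.hstack([U, (r / nr)[:, None]])
        W[i, selected] = 1
    return W
```

On well-separated subspaces, the row W[i, :] should select only points from the same subspace as y_i.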
In subspace clustering, that point is the one closest to the subspace spanned by the currently selected points, while in sparse recovery it is the one closest to the residual of linear regression on the selected points. In the sparse recovery literature, the performance of OMP is shown to be comparable to that of Basis Pursuit (ℓ1-minimization) both theoretically and empirically [18, 11]. One of the contributions of this work is to show that this high-level intuition is indeed borne out, provably, as we show that NSN also performs well in collecting neighbors lying on the same subspace.

2.2 Greedy Subspace Recovery (GSR)

Suppose that NSN has found correct neighbors for a data point. How can we check if they are indeed correct, that is, lying on the same true subspace? One natural way is to count the number of points close to the subspace spanned by the neighbors. If they span one of the true subspaces, then many other points will be lying on the span. If they do not span any true subspace, few points will be close to it. This fact motivates us to use a greedy algorithm to recover the subspaces. Using the neighborhood constructed by NSN (or some other algorithm), we recover the L subspaces. If for each subspace there is a neighborhood set containing only the points on that subspace, the algorithm recovers the union of the true subspaces exactly.

Algorithm 2 Greedy Subspace Recovery (GSR)
Input: N points Y = {y_1, . . . , y_N}, a neighborhood matrix W ∈ {0, 1}^{N×N}, error bound ε
Output: Estimated subspaces D̂ = ∪_{l=1}^L D̂_l, estimated labels ŵ_1, . . . , ŵ_N

  y_i ← y_i/||y_i||_2, ∀i ∈ [N]                                    ▷ Normalize magnitudes
  W_i ← Top-d{y_j : W_ij = 1}, ∀i ∈ [N]                            ▷ Estimate a subspace using the neighbors of each point
  I ← [N]
  while I ≠ ∅ do                                                   ▷ Iteratively pick the best subspace estimates
    i* ← arg max_{i∈I} Σ_{j=1}^N I{||Proj_{W_i} y_j||_2 ≥ 1 − ε}
    D̂_l ← W_{i*}
    I ← I \ {j : ||Proj_{W_{i*}} y_j||_2 ≥ 1 − ε}
  end while
  ŵ_i ← arg max_{l∈[L]} ||Proj_{D̂_l} y_i||_2, ∀i ∈ [N]            ▷ Label the points using the subspace estimates

Recall that W is the neighborhood matrix, so that W_ij = 1 if point j is a neighbor of point i. Top-d{y_j : W_ij = 1} denotes the d-dimensional principal subspace of the set of vectors {y_j : W_ij = 1}. This can be obtained by taking the first d left singular vectors of the matrix whose columns are the vectors in the set. If there are only d vectors in the set, Gram-Schmidt orthogonalization will give us the subspace. As in NSN, it is efficient to store a subspace W_i in the form of its orthogonal basis, because we can easily compute the norm of a projection onto the subspace.

Testing a candidate subspace by counting the number of near points has already been considered in the subspace clustering literature. In [25], the authors proposed to run RANdom SAmple Consensus (RANSAC) iteratively. RANSAC randomly selects a few points and checks if there are many other points near the subspace spanned by the collected points. Instead of randomly choosing sample points, GSR receives some candidate subspaces (in the form of sets of points) from NSN (or possibly some other algorithm) and selects subspaces in a greedy way, as specified in the algorithm above.

3 Theoretical results

We analyze our algorithms in two standard noiseless models. The main theorems present sufficient conditions under which the algorithms cluster the points exactly with high probability. For simplicity of analysis, we assume that every subspace is of the same dimension and the number of data points on each subspace is the same, i.e., d := d_1 = · · · = d_L and n := N_1 = · · · = N_L. We assume that d is known to the algorithm.
Nonetheless, our analysis can extend to the general case.

3.1 Statistical models

We consider two models which have been used in the subspace clustering literature:

• Fully random model: The subspaces are drawn iid uniformly at random, and the points are also iid randomly generated.
• Semi-random model: The subspaces are arbitrarily determined, but the points are iid randomly generated.

Let D_i ∈ R^{p×d}, i ∈ [L], be a matrix whose columns form an orthonormal basis of D_i. An important measure that we use in the analysis is the affinity between two subspaces, defined as

  aff(i, j) := ||D_i^T D_j||_F / √d = sqrt( Σ_{k=1}^d cos² θ_k^{i,j} / d ) ∈ [0, 1],

where θ_k^{i,j} is the kth principal angle between D_i and D_j. Two subspaces D_i and D_j are identical if and only if aff(i, j) = 1. If aff(i, j) = 0, every vector on D_i is orthogonal to any vector on D_j. We also define the maximum affinity as

  max aff := max_{i,j∈[L], i≠j} aff(i, j) ∈ [0, 1].

There are N = nL points, and there are n points exactly lying on each subspace. We assume that each data point y_i is drawn iid uniformly at random from S^{p−1} ∩ D_{w_i}, where S^{p−1} is the unit sphere in R^p. Equivalently,

  y_i = D_{w_i} x_i,  x_i ∼ Unif(S^{d−1}),  ∀i ∈ [N].

As the points are generated randomly on their corresponding subspaces, there are no points lying on an intersection of two subspaces, almost surely. This implies that with probability one the points are clustered correctly provided that the true subspaces are recovered exactly.

3.2 Main theorems

The first theorem gives a statistical guarantee for the fully random model.

Theorem 1 Suppose L d-dimensional subspaces and n points on each subspace are generated in the fully random model with n polynomial in d. There are constants C_1, C_2 > 0 such that if

  n/d > C_1 (log(ne/δ_1 d))²,   d/p < C_2 log n / log(ndL δ_1⁻¹),   (1)

then with probability at least 1 − 3Lδ_1, NSN+GSR³ clusters the points exactly. Also, there are other constants C′_1, C′_2 > 0 such that if (1) with C_1 and C_2 replaced by C′_1 and C′_2 holds, then NSN+Spectral⁴ clusters the points exactly with probability at least 1 − 3Lδ_1. Here e is the exponential constant.

³NSN with K = k_max = d, followed by GSR with arbitrarily small ε.
⁴NSN with K = k_max = d.

Our sufficient conditions for exact clustering explain when subspace clustering becomes easy or difficult, and they are consistent with our intuition. For NSN to find correct neighbors, the points on the same subspace should be many enough so that they look like lying on a subspace. This condition is spelled out in the first inequality of (1). We note that the condition holds even when n/d is a constant, i.e., n is linear in d. The second inequality implies that the dimension of the subspaces should not be too high for the subspaces to be distinguishable. If d is high, the random subspaces are more likely to be close to each other, and hence they become more difficult to distinguish. However, as n increases, the points become dense on the subspaces, and hence it becomes easier to identify different subspaces.

Let us compare our result with the conditions required for success in the fully random model in the existing literature. In [16], it is required for SSC to have correct neighborhoods that n should be superlinear in d when d/p is fixed. In [6, 24], the conditions on d/p become worse as we have more points. On the other hand, our algorithms are guaranteed exact clustering of the points, and the sufficient condition is order-wise at least as good as the conditions for correct neighborhoods required by the existing algorithms (See Table 1).
Moreover, exact clustering is guaranteed even when n is linear in d and d/p is fixed.

For the semi-random model, we have the following general theorem.

Theorem 2 Suppose L d-dimensional subspaces are arbitrarily chosen, and n points on each subspace are generated in the semi-random model with n polynomial in d. There are constants C_1, C_2 > 0 such that if

  n/d > C_1 (log(ne/δ_1 d))²,   max aff ≤ sqrt( C_2 log n / (log(dL δ_1⁻¹) · log(ndL δ_1⁻¹)) ),   (2)