{"title": "Provable Subspace Clustering: When LRR meets SSC", "book": "Advances in Neural Information Processing Systems", "page_first": 64, "page_last": 72, "abstract": "Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered state-of-the-art methods for {\\em subspace clustering}. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of \"Self-Expressiveness\". The main difference is that SSC minimizes the vector $\\ell_1$ norm of the representation matrix to induce sparsity while LRR minimizes the nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develop theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strengths and weaknesses of SSC and LRR and demonstrate how LRSSC can take advantage of both methods in preserving the \"Self-Expressiveness Property\" and \"Graph Connectivity\" at the same time.", "full_text": "Provable Subspace Clustering: When LRR meets SSC\n\nYu-Xiang Wang\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213 USA\nyuxiangw@cs.cmu.edu\n\nHuan Xu\nDept. of Mech. Engineering\nNational Univ. of Singapore\nSingapore, 117576\nmpexuh@nus.edu.sg\n\nChenlei Leng\nDepartment of Statistics\nUniversity of Warwick\nCoventry, CV4 7AL, UK\nC.Leng@warwick.ac.uk\n\nAbstract\n\nSparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of \u201cSelf-Expressiveness\u201d. 
The main difference is that SSC minimizes the vector $\\ell_1$ norm of the representation matrix to induce sparsity while LRR minimizes the nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develop theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strengths and weaknesses of SSC and LRR and demonstrate how LRSSC can take advantage of both methods in preserving the \u201cSelf-Expressiveness Property\u201d and \u201cGraph Connectivity\u201d at the same time.\n\n1 Introduction\n\nWe live in the big data era \u2013 a world where an overwhelming amount of data is generated and collected every day, such that it is becoming increasingly impractical to process data in its raw form, even though computers are getting exponentially faster over time. Hence, compact representations of data such as low-rank approximation (e.g., PCA [13], Matrix Completion [4]) and sparse representation [6] become crucial in understanding the data with minimal storage. The underlying assumption is that high-dimensional data often lie in a low-dimensional subspace [4]. Yet, when such data points are generated from different sources, they form a union of subspaces. Subspace Clustering deals with exactly this structure by clustering data points according to their underlying subspaces. Applications include motion segmentation and face clustering in computer vision [16, 8], hybrid system identification in control [26, 2], and community clustering in social networks [12], to name a few.\n\nNumerous algorithms have been proposed to tackle the problem. 
Recent examples include GPCA [25], Spectral Curvature Clustering [5], Sparse Subspace Clustering (SSC) [7, 8], Low Rank Representation (LRR) [17, 16] and its noisy variant LRSC [9] (for a more exhaustive survey of subspace clustering algorithms, we refer readers to the excellent survey paper [24] and the references therein). Among these algorithms, LRR and SSC, based on minimizing the nuclear norm and $\\ell_1$ norm of the representation matrix respectively, remain the top performers on the Hopkins155 motion segmentation benchmark dataset [23]. Moreover, they are among the few subspace clustering algorithms supported by theoretical guarantees: both algorithms are known to succeed when the subspaces are independent [27, 16]. Later, [8] showed that the subspaces being disjoint is sufficient for SSC to succeed (see footnote 1), and [22] further relaxed this condition to include some cases of overlapping subspaces. Robustness of the two algorithms has been studied too. Liu et al. [18] showed that a variant of LRR works even in the presence of some arbitrarily large outliers, while Wang and Xu [29] provided both deterministic and randomized guarantees for SSC when data are noisy or corrupted.\n\nFootnote 1: Disjoint subspaces only intersect at the origin. This is a less restrictive assumption than independence; e.g., 3 coplanar lines passing through the origin are disjoint but not independent.\n\nDespite LRR and SSC\u2019s success, there are questions unanswered. LRR has never been shown to succeed other than under the very restrictive \u201cindependent subspace\u201d assumption. SSC\u2019s solution is sometimes so sparse that the affinity graph of data from a single subspace may not be a connected body [19]. Moreover, as our experiment with Hopkins155 data shows, the instances where SSC fails are often different from those where LRR fails. 
Hence, a natural question is whether combining the two algorithms leads to a better method, in particular since the underlying representation matrix we want to recover is both low-rank and sparse simultaneously.\n\nIn this paper, we propose Low-Rank Sparse Subspace Clustering (LRSSC), which minimizes a weighted sum of the nuclear norm and the vector 1-norm of the representation matrix. We show theoretical guarantees for LRSSC that strengthen the results in [22]. The statement and proof also shed light on why LRR requires the independence assumption. Furthermore, the results imply that there is a fundamental trade-off between the inter-class separation and the intra-class connectivity. Indeed, our experiment shows that LRSSC works well in cases where the data distribution is skewed (graph connectivity becomes an issue for SSC) and subspaces are not independent (LRR gives poor separation). These insights would be useful when developing subspace clustering algorithms and applications. We remark that in the general regression setup, simultaneous nuclear norm and 1-norm regularization has been studied before [21]. However, our focus is on the subspace clustering problem, and hence the results and analysis are completely different.\n\n2 Problem Setup\n\nNotations: We denote the data matrix by $X \\in \\mathbb{R}^{n \\times N}$, where each column of $X$ (normalized to a unit vector) belongs to a union of $L$ subspaces\n\n$$\\mathcal{S}_1 \\cup \\mathcal{S}_2 \\cup \\dots \\cup \\mathcal{S}_L.$$\n\nEach subspace $\\ell$ contains $N_\\ell$ data samples with $N_1 + N_2 + \\dots + N_L = N$. We observe the noisy data matrix $X$. Let $X^{(\\ell)} \\in \\mathbb{R}^{n \\times N_\\ell}$ denote the selection (as a set and a matrix) of columns in $X$ that belong to $\\mathcal{S}_\\ell \\subset \\mathbb{R}^n$, which is a $d_\\ell$-dimensional subspace. Without loss of generality, let $X = [X^{(1)}, X^{(2)}, \\dots, X^{(L)}]$ be ordered. 
In addition, we use $\\|\\cdot\\|$ to represent the Euclidean norm (for vectors) or the spectral norm (for matrices) throughout the paper.\n\nMethod: We solve the following convex optimization problem\n\n$$\\text{LRSSC}: \\quad \\min_{C} \\ \\|C\\|_* + \\lambda\\|C\\|_1 \\quad \\text{s.t.} \\quad X = XC, \\quad \\mathrm{diag}(C) = 0. \\tag{1}$$\n\nSpectral clustering techniques (e.g., [20]) are then applied to the affinity matrix $W = |C| + |C|^T$ to obtain the final clustering, where $C$ is the solution to (1) and $|\\cdot|$ is the elementwise absolute value.\n\nCriterion of success: In the subspace clustering task, as opposed to compressive sensing or matrix completion, there is no \u201cground-truth\u201d $C$ to compare the solution against. Instead, the algorithm succeeds if each sample is expressed as a linear combination of samples belonging to the same subspace, i.e., the output matrix $C$ is block diagonal (up to appropriate permutation) with each subspace cluster represented by a disjoint block. Formally, we have the following definition.\n\nDefinition 1 (Self-Expressiveness Property (SEP)). Given subspaces $\\{\\mathcal{S}_\\ell\\}_{\\ell=1}^{L}$ and data points $X$ from these subspaces, we say a matrix $C$ obeys the Self-Expressiveness Property if the nonzero entries of each $c_i$ (the $i$th column of $C$) correspond only to those columns of $X$ sampled from the same subspace as $x_i$.\n\nNote that a solution obeying SEP alone does not imply the clustering is correct, since each block may not be fully connected. This is the so-called \u201cgraph connectivity\u201d problem studied in [19]. On the other hand, failure to achieve SEP does not necessarily imply clustering error either, as the spectral clustering step may give a (sometimes perfect) solution even when there are connections between blocks. Nevertheless, SEP is the condition that verifies the design intuition of SSC and LRR. 
Notice that if $C$ obeys SEP and each block is connected, we immediately get the correct clustering.\n\n3 Theoretical Guarantees\n\n3.1 The Deterministic Setup\n\nBefore we state our theoretical results for the deterministic setup, we need to define a few quantities.\n\nDefinition 2 (Normalized dual matrix set). Let $\\{\\Lambda_1(X)\\}$ be the set of optimal solutions to\n\n$$\\max_{\\Lambda_1, \\Lambda_2, \\Lambda_3} \\ \\langle X, \\Lambda_1 \\rangle \\quad \\text{s.t.} \\quad \\|\\Lambda_2\\|_\\infty \\le \\lambda, \\quad \\|X^T \\Lambda_1 - \\Lambda_2 - \\Lambda_3\\| \\le 1, \\quad \\mathrm{diag}^\\perp(\\Lambda_3) = 0,$$\n\nwhere $\\|\\cdot\\|_\\infty$ is the vector $\\ell_\\infty$ norm and $\\mathrm{diag}^\\perp$ selects all the off-diagonal entries. Let $\\Lambda^* = [\\nu_1^*, \\dots, \\nu_N^*] \\in \\{\\Lambda_1(X)\\}$ obey $\\nu_i^* \\in \\mathrm{span}(X)$ for every $i = 1, \\dots, N$ (see footnote 2). For every $\\Lambda = [\\nu_1, \\dots, \\nu_N] \\in \\{\\Lambda_1(X)\\}$, we define the normalized dual matrix $V$ for $X$ as\n\n$$V(X) \\triangleq \\left[ \\frac{\\nu_1}{\\|\\nu_1^*\\|}, \\dots, \\frac{\\nu_N}{\\|\\nu_N^*\\|} \\right],$$\n\nand the normalized dual matrix set $\\{V(X)\\}$ as the collection of $V(X)$ for all $\\Lambda \\in \\{\\Lambda_1(X)\\}$.\n\nDefinition 3 (Minimax subspace incoherence property). Compactly denote $V^{(\\ell)} = V(X^{(\\ell)})$. We say the vector set $X^{(\\ell)}$ is $\\mu$-incoherent to other points if\n\n$$\\mu \\ge \\mu(X^{(\\ell)}) := \\min_{V^{(\\ell)} \\in \\{V^{(\\ell)}\\}} \\ \\max_{x \\in X \\setminus X^{(\\ell)}} \\|V^{(\\ell)T} x\\|_\\infty.$$\n\nThe incoherence $\\mu$ in the above definition measures how separable the sample points in $\\mathcal{S}_\\ell$ are against sample points in other subspaces (small $\\mu$ represents more separable data). Our definition differs from Soltanolkotabi and Candes\u2019s definition of subspace incoherence [22] in that it is defined as a minimax over all possible dual directions. It is easy to see that $\\mu$-incoherence in [22, Definition 2.4] implies $\\mu$-minimax-incoherence, as their dual direction is contained in $\\{V(X)\\}$. In fact, in several interesting cases, $\\mu$ can be significantly smaller under the new definition. We illustrate the point with the two examples below and leave detailed discussions to the supplementary materials.\n\nExample 1 (Independent Subspace). Suppose the subspaces are independent, i.e., $\\dim(\\mathcal{S}_1 \\oplus \\dots \\oplus \\mathcal{S}_L) = \\sum_{\\ell=1,\\dots,L} \\dim(\\mathcal{S}_\\ell)$; then all $X^{(\\ell)}$ are 0-incoherent under our Definition 3. This is because for each $X^{(\\ell)}$ one can always find a dual matrix $V^{(\\ell)} \\in \\{V^{(\\ell)}\\}$ whose column space is orthogonal to the span of all other subspaces. In contrast, the incoherence parameter according to Definition 2.4 in [22] will be a positive value, potentially large if the angles between subspaces are small.\n\nExample 2 (Random except 1 subspace). Suppose we have $L$ disjoint 1-dimensional subspaces in $\\mathbb{R}^n$ ($L > n$). The subspaces $\\mathcal{S}_1, \\dots, \\mathcal{S}_{L-1}$ are randomly drawn. $\\mathcal{S}_L$ is chosen such that its angle to one of the $L - 1$ subspaces, say $\\mathcal{S}_1$, is $\\pi/6$. Then the incoherence parameter $\\mu(X^{(L)})$ defined in [22] is at least $\\cos(\\pi/6)$. However, under our new definition, it is not difficult to show that $\\mu(X^{(L)}) \\le 2\\sqrt{\\frac{6 \\log(L)}{n}}$ with high probability (see footnote 3).\n\nThe result also depends on the smallest singular value of a rank-$d$ matrix (denoted by $\\sigma_d$) and the inradius of a convex body as defined below.\n\nDefinition 4 (inradius). The inradius of a convex body $\\mathcal{P}$, denoted by $r(\\mathcal{P})$, is defined as the radius of the largest Euclidean ball inscribed in $\\mathcal{P}$.\n\nThe smallest singular value and inradius measure how well-represented each subspace is by its data samples. 
A small inradius or singular value implies either insufficient data or a skewed data distribution; in other words, the subspace is \u201cpoorly represented\u201d. Now we may state our main result.\n\nTheorem 1 (LRSSC). The self-expressiveness property holds for the solution of (1) on the data $X$ if there exists a weighting parameter $\\lambda$ such that for all $\\ell = 1, \\dots, L$, one of the following two conditions holds:\n\n$$\\mu(X^{(\\ell)})\\,(1 + \\lambda\\sqrt{N_\\ell}) < \\lambda \\min_k \\sigma_{d_\\ell}(X^{(\\ell)}_{-k}), \\tag{2}$$\n\nor\n\n$$\\mu(X^{(\\ell)})\\,(1 + \\lambda) < \\lambda \\min_k r(\\mathrm{conv}(\\pm X^{(\\ell)}_{-k})), \\tag{3}$$\n\nwhere $X_{-k}$ denotes $X$ with its $k$th column removed and $\\sigma_{d_\\ell}(X^{(\\ell)}_{-k})$ represents the $d_\\ell$th (smallest non-zero) singular value of the matrix $X^{(\\ell)}_{-k}$.\n\nFootnote 2: If this is not unique, pick the one with the least Frobenius norm.\n\nFootnote 3: The full proof is given in the supplementary materials. It is also easy to generalize this example to $d$-dimensional subspaces and to \u201crandom except $K$ subspaces\u201d.\n\nWe briefly explain the intuition of the proof. The theorem is proven by duality. First we write out the dual problem of (1),\n\n$$\\text{Dual LRSSC}: \\quad \\max_{\\Lambda_1, \\Lambda_2, \\Lambda_3} \\ \\langle X, \\Lambda_1 \\rangle \\quad \\text{s.t.} \\quad \\|\\Lambda_2\\|_\\infty \\le \\lambda, \\quad \\|X^T \\Lambda_1 - \\Lambda_2 - \\Lambda_3\\| \\le 1, \\quad \\mathrm{diag}^\\perp(\\Lambda_3) = 0.$$\n\nThis leads to a set of optimality conditions, and leaves us to show the existence of a dual certificate satisfying these conditions. We then construct two levels of fictitious optimizations (which is the main novelty of the proof) and construct a dual certificate from the dual solution of the fictitious optimization problems. Under condition (2) or (3), we establish that this dual certificate meets all optimality conditions, hence certifying that SEP holds. 
Due to space constraints, we defer the detailed proof to the supplementary materials and focus on the discussions of the results in the main text.\n\nRemark 1 (SSC). Theorem 1 can be considered a generalization of Theorem 2.5 of [22]. Indeed, when $\\lambda \\to \\infty$, (3) reduces to the following\n\n$$\\mu(X^{(\\ell)}) < \\min_k r(\\mathrm{conv}(\\pm X^{(\\ell)}_{-k})).$$\n\nThe readers may observe that this is exactly the same as Theorem 2.5 of [22], with the only difference being the definition of $\\mu$. Since our definition of $\\mu(X^{(\\ell)})$ is tighter (i.e., smaller) than that in [22], our guarantee for SSC is indeed stronger. Theorem 1 also implies that the good properties of SSC (such as tolerating overlapping subspaces and large dimensions) shown in [22] are also valid for LRSSC for a range of $\\lambda$ greater than a threshold.\n\nTo further illustrate the key difference from [22], we describe the following scenario.\n\nExample 3 (Correlated/Poorly Represented Subspaces). Suppose the subspaces are poorly represented, i.e., the inradius $r$ is small. If, furthermore, the subspaces are highly correlated, i.e., canonical angles between subspaces are small, then the subspace incoherence $\\mu'$ defined in [22] can be quite large (close to 1). Thus, the success condition $\\mu' < r$ presented in [22] is violated. This is an important scenario because real data such as those in Hopkins155 and Extended YaleB often suffer from both problems, as illustrated in [8, Figure 9 & 10]. Using our new definition of incoherence $\\mu$, as long as the subspaces are \u201csufficiently independent\u201d (see footnote 4), regardless of their correlation, $\\mu$ will assume very small values (e.g., Example 2), making SEP possible even if $r$ is small, namely when subspaces are poorly represented.\n\nRemark 2 (LRR). The guarantee is the strongest when $\\lambda \\to \\infty$ and becomes vacuous when $\\lambda \\to 0$ unless subspaces are independent (see Example 1). 
This seems to imply that the \u201cindependent subspace\u201d assumption used in [16, 18] to establish sufficient conditions for LRR (and variants) to work is unavoidable (see footnote 5). On the other hand, for each problem instance, there is a $\\lambda^*$ such that whenever $\\lambda > \\lambda^*$, the result satisfies SEP, so we should expect a phase transition phenomenon when tuning $\\lambda$.\n\nRemark 3 (A tractable condition). Condition (2) is based on singular values, hence is computationally tractable. In contrast, the verification of (3) or the deterministic condition in [22] is NP-Complete, as it involves computing the inradii of V-Polytopes [10]. When $\\lambda \\to \\infty$, Theorem 1 reduces to the first computationally tractable guarantee for SSC that works for disjoint and potentially overlapping subspaces.\n\n3.2 Randomized Results\n\nWe now present results for the random design case, i.e., data are generated under some random models.\n\nDefinition 5 (Random data). \u201cRandom sampling\u201d assumes that for each $\\ell$, data points in $X^{(\\ell)}$ are iid uniformly distributed on the unit sphere of $\\mathcal{S}_\\ell$. \u201cRandom subspace\u201d assumes each $\\mathcal{S}_\\ell$ is generated independently by spanning $d_\\ell$ iid uniformly distributed vectors on the unit sphere of $\\mathbb{R}^n$.\n\nFootnote 4: Due to space constraints, the concept is formalized in the supplementary materials.\n\nFootnote 5: Our simulation in Section 6 also supports this conjecture.\n\nLemma 1 (Singular value bound). Assume random sampling. If $d_\\ell < N_\\ell < n$, then there exists an absolute constant $C_1$ such that with probability of at least $1 - N_\\ell^{-10}$,\n\n$$\\sigma_{d_\\ell}(X) \\ge \\frac{1}{2}\\left(\\sqrt{\\frac{N_\\ell}{d_\\ell}} - 3 - C_1\\sqrt{\\frac{\\log N_\\ell}{d_\\ell}}\\right),$$\n\nor simply\n\n$$\\sigma_{d_\\ell}(X) \\ge \\frac{1}{4}\\sqrt{\\frac{N_\\ell}{d_\\ell}},$$\n\nif we assume $N_\\ell \\ge C_2 d_\\ell$ for some constant $C_2$.\n\nLemma 2 (Inradius bound [1, 22]). Assume random sampling of $N_\\ell = \\kappa_\\ell d_\\ell$ data points in each $\\mathcal{S}_\\ell$; then with probability larger than $1 - \\sum_{\\ell=1}^{L} N_\\ell e^{-\\sqrt{d_\\ell N_\\ell}}$,\n\n$$r(\\mathrm{conv}(\\pm X^{(\\ell)}_{-k})) \\ge c(\\kappa_\\ell)\\sqrt{\\frac{\\log(\\kappa_\\ell)}{2 d_\\ell}} \\quad \\text{for all pairs } (\\ell, k).$$\n\nHere, $c(\\kappa_\\ell)$ is a constant depending on $\\kappa_\\ell$. When $\\kappa_\\ell$ is sufficiently large, we can take $c(\\kappa_\\ell) = 1/\\sqrt{8}$.\n\nCombining Lemma 1 and Lemma 2, we get the following remark showing that conditions (2) and (3) are complementary.\n\nRemark 4. Under the random sampling assumption, when $\\lambda$ is smaller than a threshold, the singular value condition (2) is better than the inradius condition (3). Specifically, $\\sigma_{d_\\ell}(X) > \\frac{1}{4}\\sqrt{\\frac{N_\\ell}{d_\\ell}}$ with high probability, so for some constant $C > 1$, the singular value condition is strictly better if\n\n$$\\lambda < \\frac{C\\left(\\sqrt{N_\\ell} - \\sqrt{\\log(N_\\ell/d_\\ell)}\\right)}{\\sqrt{N_\\ell}\\left(1 + \\sqrt{\\log(N_\\ell/d_\\ell)}\\right)}, \\quad \\text{or when } N_\\ell \\text{ is large,} \\quad \\lambda < \\frac{C}{1 + \\sqrt{\\log(N_\\ell/d_\\ell)}}.$$\n\nBy further assuming random subspace, we provide an upper bound of the incoherence $\\mu$.\n\nLemma 3 (Subspace incoherence bound). Assume random subspace and random sampling. It holds with probability greater than $1 - 2/N$ that for all $\\ell$,\n\n$$\\mu(X^{(\\ell)}) \\le \\sqrt{\\frac{6 \\log N}{n}}.$$\n\nCombining Lemma 1 and Lemma 3, we have the following theorem.\n\nTheorem 2 (LRSSC for random data). 
Suppose $L$ rank-$d$ subspaces are uniformly and independently generated from $\\mathbb{R}^n$, and $N/L$ data points are uniformly and independently sampled from the unit sphere embedded in each subspace; furthermore, $N > CdL$ for some absolute constant $C$. Then SEP holds with probability larger than $1 - 2/N - 1/(Cd)^{10}$ if\n\n$$d < \\frac{n}{96 \\log N}, \\quad \\text{for all} \\quad \\lambda > \\frac{1}{\\sqrt{\\frac{N}{L}}\\left(\\sqrt{\\frac{n}{96 d \\log N}} - 1\\right)}. \\tag{4}$$\n\nThe above condition is obtained from the singular value condition. Using the inradius guarantee, combined with Lemma 2 and 3, we have a different success condition requiring $d < \\frac{n \\log(\\kappa)}{96 \\log N}$ for all $\\lambda > \\frac{1}{\\sqrt{\\frac{n \\log \\kappa}{96 d \\log N}} - 1}$. Ignoring constant terms, the condition on $d$ is slightly better than (4) by a log factor but the range of valid $\\lambda$ is significantly reduced.\n\n4 Graph Connectivity Problem\n\nThe graph connectivity problem concerns whether, when SEP is satisfied, each disjoint block of the solution $C$ to LRSSC represents a connected graph. This is equivalent to the connectivity of the solution of the following fictitious optimization problem, where each sample is constrained to be represented by the samples of the same subspace,\n\n$$\\min_{C^{(\\ell)}} \\ \\|C^{(\\ell)}\\|_* + \\lambda\\|C^{(\\ell)}\\|_1 \\quad \\text{s.t.} \\quad X^{(\\ell)} = X^{(\\ell)} C^{(\\ell)}, \\quad \\mathrm{diag}(C^{(\\ell)}) = 0. \\tag{5}$$\n\nThe graph connectivity for SSC is studied by [19] under deterministic conditions (to make the problem well-posed). They show by a negative example that even if the well-posed condition is satisfied, the solution of SSC may not satisfy graph connectivity if the dimension of the subspace is greater than 3. 
On the other hand, the graph connectivity problem is not an issue for LRR: as the following proposition suggests, the intra-class connections of LRR\u2019s solution are inherently dense (fully connected).\n\nProposition 1. When the subspaces are independent, $X$ is not full-rank and the data points are randomly sampled from a unit sphere in each subspace, then the solution to LRR, i.e.,\n\n$$\\min_{C} \\ \\|C\\|_* \\quad \\text{s.t.} \\quad X = XC,$$\n\nis class-wise dense, namely each diagonal block of the matrix $C$ is all non-zero.\n\nThe proof makes use of the following lemma which states the closed-form solution of LRR.\n\nLemma 4 ([16]). Take the skinny SVD of the data matrix $X = U \\Sigma V^T$. The closed-form solution to LRR is the shape interaction matrix $C = V V^T$.\n\nProposition 1 then follows from the fact that each entry of $V V^T$ has a continuous distribution, hence the probability that any is exactly zero is negligible (a complete argument is given in the supplementary materials).\n\nReaders may notice that when $\\lambda \\to 0$, (5) is not exactly LRR, but LRR with an additional constraint that diagonal entries are zero. We suspect this constrained version also has a dense solution. This is demonstrated numerically in Section 6.\n\n5 Practical issues\n\n5.1 Data noise/sparse corruptions/outliers\n\nThe natural extension of LRSSC to handle noise is\n\n$$\\min_{C} \\ \\frac{1}{2}\\|X - XC\\|_F^2 + \\beta_1\\|C\\|_* + \\beta_2\\|C\\|_1 \\quad \\text{s.t.} \\quad \\mathrm{diag}(C) = 0. \\tag{6}$$\n\nWe believe it is possible (but maybe tedious) to extend our guarantee to this noisy version following the strategy of [29], which analyzed the noisy version of SSC. 
This is left for future research.\n\nAccording to the noisy analysis of SSC, a rule of thumb for choosing the scale of $\\beta_1$ and $\\beta_2$ is\n\n$$\\beta_1 = \\frac{\\sigma\\left(\\frac{1}{1+\\lambda}\\right)}{\\sqrt{2 \\log N}}, \\qquad \\beta_2 = \\frac{\\sigma\\left(\\frac{\\lambda}{1+\\lambda}\\right)}{\\sqrt{2 \\log N}},$$\n\nwhere $\\lambda$ is the tradeoff parameter used in the noiseless case (1), $\\sigma$ is the estimated noise level and $N$ is the total number of entries.\n\nIn case of sparse corruption, one may use an $\\ell_1$ norm penalty instead of the Frobenius norm. For outliers, SSC is proven to be robust to them under mild assumptions [22], and we suspect a similar argument should hold for LRSSC too.\n\n5.2 Fast Numerical Algorithm\n\nAs subspace clustering problems are usually large-scale, off-the-shelf SDP solvers are often too slow to use. Instead, we derive an alternating direction method of multipliers (ADMM) [3], known to be scalable, to solve the problem numerically. The algorithm involves separating out the two objectives and diagonal constraints with dummy variables $C_2$ and $J$ like\n\n$$\\min_{C_1, C_2, J} \\ \\|C_1\\|_* + \\lambda\\|C_2\\|_1 \\quad \\text{s.t.} \\quad X = XJ, \\quad J = C_2 - \\mathrm{diag}(C_2), \\quad J = C_1, \\tag{7}$$\n\nand updating $J$, $C_1$, $C_2$ and the three dual variables alternately. Thanks to the change of variables, all updates can be done in closed form. To further speed up the convergence, we adopt the adaptive penalty mechanism of Lin et al. [15], which in some way ameliorates the problem of tuning numerical parameters in ADMM. Detailed derivations, update rules, convergence guarantee and the corresponding ADMM algorithm for the noisy version of LRSSC are made available in the supplementary materials.\n\n6 Numerical Experiments\n\nTo verify our theoretical results and illustrate the advantages of LRSSC, we design several numerical experiments. Due to space constraints, we discuss only two of them in the paper and leave the rest to the supplementary materials. In all our numerical experiments, we use the ADMM implementation of LRSSC with a fixed set of numerical parameters. The results are given against an exponential grid of $\\lambda$ values, so comparisons to only 1-norm (SSC) and only nuclear norm (LRR) are clear from the two ends of the plots.\n\n6.1 Separation-Sparsity Tradeoff\n\nWe first illustrate the tradeoff of the solution between obeying SEP and being connected (this is measured using the intra-class sparsity of the solution). We randomly generate $L$ subspaces of dimension 10 from $\\mathbb{R}^{50}$. Then, 50 unit-length random samples are drawn from each subspace and we concatenate them into a $50 \\times 50L$ data matrix. We use Relative Violation [29] to measure the violation of SEP and Gini Index [11] to measure the intra-class sparsity (see footnote 6). These quantities are defined below:\n\n$$\\mathrm{RelViolation}(C, \\mathcal{M}) = \\frac{\\sum_{(i,j) \\notin \\mathcal{M}} |C|_{i,j}}{\\sum_{(i,j) \\in \\mathcal{M}} |C|_{i,j}},$$\n\nwhere $\\mathcal{M}$ is the index set that contains all $(i, j)$ such that $x_i, x_j \\in \\mathcal{S}_\\ell$ for some $\\ell$.\n\n$\\mathrm{GiniIndex}(C, \\mathcal{M})$ is obtained by first sorting the absolute values of $C_{ij}$, $(i,j) \\in \\mathcal{M}$, into a non-decreasing sequence $\\vec{c} = [c_1, \\dots, c_{|\\mathcal{M}|}]$, then evaluating\n\n$$\\mathrm{GiniIndex}(\\mathrm{vec}(C_{\\mathcal{M}})) = 1 - 2 \\sum_{k=1}^{|\\mathcal{M}|} \\frac{c_k}{\\|\\vec{c}\\|_1} \\left( \\frac{|\\mathcal{M}| - k + 1/2}{|\\mathcal{M}|} \\right).$$\n\nNote that RelViolation takes values in $[0, \\infty)$ and SEP is attained when RelViolation is zero. Similarly, the Gini index takes its value in $[0, 1]$ and it is larger when intra-class connections are sparser.\n\nThe results for $L = 6$ and $L = 11$ are shown in Figure 1. We observe phase transitions for both metrics. When $\\lambda = 0$ (corresponding to LRR), the solution does not obey SEP even when the independence assumption is only slightly violated ($L = 6$). When $\\lambda$ is greater than a threshold, RelViolation goes to zero. These observations match Theorems 1 and 2. 
On the other hand, when $\\lambda$ is large, intra-class sparsity is high, indicating possible disconnection within the class.\n\nMoreover, we observe that there exists a range of $\\lambda$ where RelViolation reaches zero yet the sparsity level does not reach its maximum. This justifies our claim that the solution of LRSSC, taking $\\lambda$ within this range, can achieve SEP and at the same time keep the intra-class connections relatively dense. Indeed, for the subspace clustering task, a good tradeoff between separation and intra-class connection is important.\n\n6.2 Skewed data distribution and model selection\n\nIn this experiment, we use the data for $L = 6$ and combine the first two subspaces into one 20-dimensional subspace and randomly sample 10 more points from the new subspace to \u201cconnect\u201d the 100 points from the original two subspaces together. This is to simulate the situation when the data distribution is skewed, i.e., the data samples within one subspace have two dominating directions. The skewed distribution creates trouble for model selection (judging the number of subspaces), and intuitively, the graph connectivity problem might occur.\n\nWe find that model selection heuristics such as the spectral gap [28] and spectral gap ratio [14] of the normalized Laplacian are good metrics to evaluate the quality of the solution of LRSSC. Here the correct number of subspaces is 5, so the spectral gap is the difference between the 6th and 5th smallest singular values and the spectral gap ratio is the ratio of adjacent spectral gaps. The larger these quantities, the better the affinity matrix reveals that the data contains 5 subspaces.\n\nFootnote 6: We choose Gini Index over the typical $\\ell_0$ to measure sparsity as the latter is vulnerable to numerical inaccuracy.\n\nFigure 1: Illustration of the separation-sparsity trade-off. Left: 6 subspaces. 
Right: 11 subspaces.\n\nFigure 2 demonstrates how singular values change when $\\lambda$ increases. When $\\lambda = 0$ (corresponding to LRR), there is no significant drop from the 6th to the 5th singular value, hence it is impossible for either heuristic to identify the correct model. As $\\lambda$ increases, the last 5 singular values get smaller and become almost zero when $\\lambda$ is large. Then the 5-subspace model can be correctly identified using the spectral gap ratio. On the other hand, we note that the 6th singular value also shrinks as $\\lambda$ increases, which makes the spectral gap very small on the SSC side and leaves little robust margin for correct model selection against some violation of SEP. As is shown in Figure 3, the largest spectral gap and spectral gap ratio appear at around $\\lambda = 0.1$, where the solution is able to benefit from both the better separation induced by the 1-norm factor and the relatively denser connections promoted by the nuclear norm factor.\n\nFigure 2: Last 20 singular values of the normalized Laplacian in the skewed data experiment.\n\nFigure 3: Spectral Gap and Spectral Gap Ratio in the skewed data experiment.\n\n7 Conclusion and future works\n\nIn this paper, we proposed LRSSC for the subspace clustering problem and provided theoretical analysis of the method. We demonstrated that LRSSC is able to achieve perfect SEP for a wider range of problems than previously known for SSC and meanwhile maintains denser intra-class connections than SSC (hence is less likely to encounter the \u201cgraph connectivity\u201d issue). Furthermore, the results offer new understanding of SSC and LRR themselves as well as of problems such as skewed data distribution and model selection. 
An important future research question is to mathematically define the concept of graph connectivity, and establish conditions under which perfect SEP and connectivity indeed occur together for some non-empty range of $\\lambda$ for LRSSC.\n\nAcknowledgments\n\nH. Xu is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112 and NUS startup grant R-265-000-384-133.\n\nReferences\n\n[1] D. Alonso-Guti\u00e9rrez. On the isotropy constant of random convex sets. Proceedings of the American Mathematical Society, 136(9):3293\u20133300, 2008.\n\n[2] L. Bako. Identification of switched linear systems via sparse optimization. Automatica, 47(4):668\u2013677, 2011.\n\n[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends\u00ae in Machine Learning, 3(1):1\u2013122, 2011.\n\n[4] E.J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n\n[5] G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal of Computer Vision, 81(3):317\u2013330, 2009.\n\n[6] M. Elad. Sparse and redundant representations. Springer, 2010.\n\n[7] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition (CVPR\u201909), pages 2790\u20132797. IEEE, 2009.\n\n[8] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.\n\n[9] P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. In Computer Vision and Pattern Recognition (CVPR\u201911), pages 1801\u20131807. IEEE, 2011.\n\n[10] P. Gritzmann and V. Klee. 
Computational complexity of inner and outer j-radii of polytopes in finite-dimensional normed spaces. Mathematical programming, 59(1):163\u2013213, 1993.\n[11] N. Hurley and S. Rickard. Comparing measures of sparsity. Information Theory, IEEE Transactions on, 55(10):4723\u20134741, 2009.\n[12] A. Jalali, Y. Chen, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. In International Conference on Machine Learning (ICML\u201911), pages 1001\u20131008, 2011.\n[13] I.T. Jolliffe. Principal component analysis, volume 487. Springer-Verlag New York, 1986.\n[14] F. Lauer and C. Schnorr. Spectral clustering of linear subspaces for motion segmentation. In International Conference on Computer Vision (ICCV\u201909), pages 678\u2013685. IEEE, 2009.\n[15] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In Advances in Neural Information Processing Systems 24 (NIPS\u201911), pages 612\u2013620. 2011.\n[16] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 2012.\n[17] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning (ICML\u201910), pages 663\u2013670, 2010.\n[18] G. Liu, H. Xu, and S. Yan. Exact subspace segmentation and outlier detection by low-rank representation. In International Conference on Artificial Intelligence and Statistics (AISTATS\u201912), 2012.\n[19] B. Nasihatkon and R. Hartley. Graph connectivity in sparse subspace clustering. In Computer Vision and Pattern Recognition (CVPR\u201911), pages 2137\u20132144. IEEE, 2011.\n[20] A.Y. Ng, M.I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. 
In Advances in Neural Information Processing Systems 15 (NIPS\u201902), volume 2, pages 849\u2013856, 2002.\n[21] E. Richard, P. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In International Conference on Machine learning (ICML\u201912), 2012.\n[22] M. Soltanolkotabi and E.J. Candes. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195\u20132238, 2012.\n[23] R. Tron and R. Vidal. A benchmark for the comparison of 3-d motion segmentation algorithms. In Computer Vision and Pattern Recognition (CVPR\u201907), pages 1\u20138. IEEE, 2007.\n[24] R. Vidal. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52\u201368, 2011.\n[25] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945\u20131959, 2005.\n[26] R. Vidal, S. Soatto, Y. Ma, and S. Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, volume 1, pages 167\u2013172. IEEE, 2003.\n[27] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using powerfactorization and GPCA. International Journal of Computer Vision, 79(1):85\u2013105, 2008.\n[28] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395\u2013416, 2007.\n[29] Y.X. Wang and H. Xu. Noisy sparse subspace clustering. In International Conference on Machine Learning (ICML\u201913), volume 28, pages 100\u2013108, 2013.\n", "award": [], "sourceid": 80, "authors": [{"given_name": "Yu-Xiang", "family_name": "Wang", "institution": "National University of Singapore"}, {"given_name": "Huan", "family_name": "Xu", "institution": "NUS"}, {"given_name": "Chenlei", "family_name": "Leng", "institution": "University of Warwick"}]}