{"title": "Probabilistic Low-Rank Subspace Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 2744, "page_last": 2752, "abstract": "In this paper, we consider the problem of clustering data points into low-dimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in clustering and identifying outliers.", "full_text": "Probabilistic Low-Rank Subspace Clustering\n\nS. Derin Babacan\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL 61801, USA\ndbabacan@gmail.com\n\nShinichi Nakajima\nNikon Corporation\n\nTokyo, 140-8601, Japan\n\nnakajima.s@nikon.co.jp\n\nMinh N. Do\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL 61801, USA\n\nminhdo@illinois.edu\n\nAbstract\n\nIn this paper, we consider the problem of clustering data points into low-\ndimensional subspaces in the presence of outliers. We pose the problem using a\ndensity estimation formulation with an associated generative model. Based on this\nprobability model, we \ufb01rst develop an iterative expectation-maximization (EM) al-\ngorithm and then derive its global solution. 
In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization, leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that the proposed methods are very effective in subspace clustering and identifying outliers.\n\n1 Introduction\n\nModeling data using low-dimensional representations is a fundamental approach in data analysis, motivated by the inherent redundancy in many datasets and by the increased interpretability of data after dimensionality reduction. A classical approach is principal component analysis (PCA), which implicitly models the data as living in a single low-dimensional subspace within the high-dimensional ambient space. However, in many applications a more suitable model is a union of multiple low-dimensional subspaces. This modeling leads to the more challenging problem of subspace clustering, which attempts to simultaneously cluster data points into multiple subspaces and find the basis of the corresponding subspace for each cluster.\n\nMathematically, subspace clustering can be defined as follows: let Y be the M × N data matrix consisting of N vectors {y_i ∈ R^M}_{i=1}^N, which are assumed to be drawn from a union of K linear (or affine) subspaces S_k of unknown dimensions d_k = dim(S_k) with 0 < d_k < M. 
The subspace clustering problem is to find the number of subspaces K, their dimensions {d_k}_{k=1}^K, the subspace bases, and the clustering of the vectors y_i into these subspaces.\n\nSubspace clustering is a widely investigated problem due to its applications in a large number of fields, including computer vision [6, 12, 23], machine learning [11, 22] and system identification [31] (see [22, 28] for comprehensive reviews). Some of the common approaches include algebraic-geometric approaches such as generalized PCA (GPCA) [19, 29], spectral clustering [18], and mixture models [9, 26]. Recently, there has been great interest in methods based on sparse and/or low-rank representations of the data [5, 7, 8, 14–17, 25]. The general approach in these methods is to first find a sparse/low-rank representation X of the data and then apply a spectral clustering method on X. It has been shown that with appropriate modeling, X provides information about the segmentation of the vectors into the subspaces. Two common models for X are summarized below.\n\n• Sparse Subspace Clustering (SSC) [7, 25]: This approach is based on representing data points y_i as sparse linear combinations of other data points. A possible optimization formulation is\n\nmin_{D,X} β‖Y − D‖_F^2 + ‖D − DX‖_F^2 + λ‖X‖_1 subject to diag(X) = 0 , (1)\n\nwhere ‖·‖_F is the Frobenius norm and ‖·‖_1 is the l1-norm.\n\n• Low-Rank Representation (LRR) [8, 14–17]: These methods are based on a principle similar to SSC, but X is modeled as low-rank instead of sparse. A general formulation for this model is\n\nmin_{D,X} β‖Y − D‖_F^2 + ‖D − DX‖_F^2 + λ‖X‖_* , (2)\n\nwhere ‖·‖_* is the nuclear norm.\n\nIn these formulations, D is a clean dictionary and the data Y is assumed to be a noisy version of D, possibly with outliers. When β → ∞, Y = D, and thus the data itself is used as the dictionary [7, 15, 25]. If the subspaces are disjoint or independent1, the solution X in both formulations is shown to be such that X_ik ≠ 0 only if the data points y_i and y_k belong to the same subspace [7, 14, 15, 25]. That is, the sparsest/lowest-rank solution is obtained when each point y_i is represented as a linear combination of points in its own subspace. The estimated X is used to define an affinity matrix [18], such as |X| + |X^T|, and a spectral clustering algorithm, such as normalized cuts [24], is applied on this affinity to cluster the data vectors. The subspace bases can then be obtained in a straightforward manner using this clustering. These methods have also been extended to include sparse outliers.\n\nIn this paper, we develop probabilistic modeling and inference procedures based on a principle similar to LRR. Specifically, we formulate the problem using a latent variable model based on the factorized form X = AB, and develop inference procedures for estimating A, B, D (and possibly outliers), along with the associated hyperparameters. We first show a maximum-likelihood formulation of the problem, which is solved using an expectation-maximization (EM) method. We derive and analyze its global solution, and show that it is related to the closed-form solution of the rank-minimization formulation (2) in [8]. 
To incorporate automatic estimation of the latent dimensionality of the subspaces and of the algorithmic parameters, we further present two Bayesian approaches: The first one is based on the same probability model as the EM method, but additional priors are placed on the latent variables and variational Bayesian inference is employed for approximate marginalization to avoid overfitting. The second one is based on a matrix-factorization formulation, and exploits recent results on Bayesian matrix factorization [20] to achieve fast estimation that is less prone to errors due to alternating optimization. Finally, we extend both methods to handle large errors (outliers) in the data, to achieve robust estimation.\n\nCompared to deterministic methods, the proposed Bayesian methods have the advantage of automatically estimating the dimensionality and the algorithmic parameters. This is crucial in unsupervised clustering, as the parameters can have a drastic effect on the solution, especially in the presence of heavy noise and outliers. While our methods are closely related to Bayesian PCA [2, 3, 20] and mixture models [9, 26], our formulation is based on a different model and leads to robust estimation that is less dependent on the initialization, which is one of the main disadvantages of such methods.\n\n2 Probabilistic Model for Low-Rank Subspace Clustering\n\nIn the following, without loss of generality, we assume that M ≤ N and Y is full row-rank. We also assume that each subspace is sufficiently sampled, that is, for each S_i of dimension d_i, there exist at least d_i data vectors y_i in Y that span S_i. As for notation, expectations are denoted by ⟨·⟩, N is the normal distribution, and diag() denotes the diagonal of a matrix. We do not differentiate the variables from the parameters of the model, to have a unified presentation throughout the paper. We formulate the latent variable model as\n\ny_i = d_i + n_Y , i = 1, . . . , N (3)\nd_i = DAb_i + n_D , (4)\n\nwhere D is M × N, A is N × N, and n_Y, n_D are i.i.d. Gaussian noise independent of the data.\n\n1The subspaces S_k are called independent if dim(⊕_{k=1}^K S_k) = Σ_{k=1}^K dim(S_k), with ⊕ the direct sum. The subspaces are disjoint if they only intersect at the origin.\n\nThe associated probability model is given by2\n\np(y_i|d_i) = N(y_i | d_i, σ_y^2 I_M) , (5)\np(d_i|D, A, b_i) = N(d_i | DAb_i, σ_d^2 I_M) , (6)\np(b_i) = N(b_i | 0, I_N) . (7)\n\nWe model the components as independent, such that p(Y|D) = ∏_{i=1}^N p(y_i|d_i), p(D|A, B) = ∏_{i=1}^N p(d_i|D, A, b_i), and p(B) = ∏_{i=1}^N p(b_i). This model has the generative interpretation where latent vectors b_i are drawn from an isotropic Gaussian distribution and shaped by A to obtain Ab_i, which then chooses a sample of points from the dictionary D to generate the ith dictionary element d_i. In this sense, the matrix DA has a role similar to the principal subspace matrix in probabilistic principal component analysis (PPCA) [26]. However, notice that in contrast to this and related approaches such as mixtures of PPCAs [9, 26], the principal subspaces in (6) are defined using the data itself. In (5), the observations y_i are modeled as corrupted versions of the dictionary elements d_i with i.i.d. Gaussian noise. Such a separation of D and Y is not necessary if there are no outliers, as the presence of the noise n_Y and n_D makes them unidentifiable. 
However, we use this general formulation to later include outliers.\n\n2.1 An Expectation-Maximization (EM) Algorithm\n\nIn (5)–(7), the latent variables b_i can be regarded as missing data and D, A as parameters, and an EM algorithm can be devised for their joint estimation. The complete log-likelihood is given by\n\nL_C = Σ_{i=1}^N log p(y_i, b_i) (8)\n\nwith p(y_i, b_i) = p(y_i|d_i) p(d_i|D, A, b_i) p(b_i). The EM algorithm is obtained by taking the expectation of this log-likelihood with respect to (w.r.t.) B (E-step) and maximizing it w.r.t. D, A, σ_d, and σ_y (M-step). In the E-step, the distribution p(B|D, A, σ_d^2) is found as N(⟨B⟩, Σ_B) with\n\nΣ_B^{-1} = I + (1/σ_d^2) A^T D^T D A , ⟨B⟩ = Σ_B (1/σ_d^2) A^T D^T D , (9)\n\nand the expectation of the likelihood is taken w.r.t. this distribution. In the M-step, maximizing the expected log-likelihood w.r.t. D and A in an alternating fashion yields the update equations\n\nD = (1/σ_y^2) Y [ (1/σ_y^2) I + (1/σ_d^2) ⟨(I − AB)(I − AB)^T⟩_B ]^{-1} , A = ⟨B⟩^T [⟨BB^T⟩]^{-1} , (10)\n\nwith ⟨BB^T⟩ = ⟨B⟩⟨B⟩^T + N Σ_B. Finally, the estimates of σ_d^2 and σ_y^2 are found as\n\nσ_d^2 = ( ‖D − DA⟨B⟩‖_F^2 + N tr(A^T D^T D A Σ_B) ) / (MN) , σ_y^2 = ‖Y − D‖_F^2 / (MN) . (11)\n\nIn summary, the maximum-likelihood solution is obtained by an alternating iterative procedure where first the statistics of B are calculated using (9), followed by the M-step updates for D, A, σ_d, and σ_y in (10) and (11), respectively.\n\n2.2 Global Solution of the EM Algorithm\n\nAlthough the iterative EM algorithm above can be applied to estimate A, B, D, the global solutions can in fact be found in closed form. Specifically, the optimal solution is found (see the supplementary) as either A⟨B⟩ = 0 or\n\nA⟨B⟩ = V_q [ I_q − N σ_d^2 Λ̄_q^{-2} ] V_q^T , (12)\n\nwhere Λ̄_q is a q × q diagonal matrix with coefficients λ̄_j = max(λ_j, √N σ_d). Here, D = UΛV^T is the singular value decomposition (SVD) of D, and V_q contains its q right singular vectors that correspond to singular values larger than or equal to √N σ_d. Hence, the solution (12) is related to the rank-q shape interaction matrix (SIM) V_q V_q^T [6], while in addition it involves scaling of the singular vectors via thresholded singular values of D.\n\n2Here we assume that Ab_i ≠ w_i, where w_i is a zero vector with 1 as the ith coefficient, to have a proper density. This is a reasonable assumption if each subspace is sufficiently sampled and the dictionary element d_i belongs to one of them (i.e., it is not an outlier). Outliers are explicitly handled later.\n\nUsing A⟨B⟩ in (10), the singular vectors of the optimal D and Y are found to be the same, and the singular values λ_j of D are related to the singular values ξ_j of Y as\n\nξ_j = λ_j + N σ_y^2 λ_j^{-1} if λ_j > √N σ_d , ξ_j = ((σ_y^2 + σ_d^2)/σ_d^2) λ_j if λ_j ≤ √N σ_d . (13)\n\nThis is a combination of two operations: down-scaling and the solution of a quadratic equation, where the latter is a polynomial thresholding operation on the singular values ξ_j of Y (see the supplementary). Hence, the optimal D is obtained by applying the thresholding operation (13) to the singular values of Y, where the shrinkage amount is small for large singular values so that they are preserved, whereas small singular values are shrunk by down-scaling. This is an interesting result, as there is no explicit penalty on the rank of D in our modeling. As shown in [8], the nuclear norm formulation (2) leads to a similar closed-form solution, but it requires the solution of a quartic equation.\n\nFinally, at the stationary points, the noise variance σ_d^2 is found as\n\nσ_d^2 = (1/(N − q)) Σ_{q'=q+1}^N λ_{q'}^2 , (14)\n\nthat is, the average of the squared discarded singular values of D when computing DA⟨B⟩. A simple closed-form expression for σ_y^2 cannot be found due to the polynomial thresholding in (13), but it can simply be calculated using (11).\n\nIn summary, if σ_y^2 and σ_d^2 are given, the optimal D and A⟨B⟩ are found by taking the SVD of Y and applying shrinkage/thresholding operations on its singular values. However, this method requires setting σ_y^2 and σ_d^2 manually. When Y itself is used as the dictionary D (i.e., σ_y^2 = 0), an alternative method is to choose q, the total number of independent dimensions to be retained in DA⟨B⟩, calculate σ_d^2 from (14), and finally use (12) to obtain A⟨B⟩. However, when σ_y^2 ≠ 0, q cannot be set directly and a trial-and-error procedure is required to find it. Although σ_d^2 and σ_y^2 can also be estimated automatically using the iterative EM procedure in Sec. 2.1, this method is susceptible to local minima, as the trivial solution A⟨B⟩ = 0 also maximizes the likelihood.\n\nThese issues can be overcome by employing Bayesian estimation to automatically determine the effective dimensionality of D and AB. 
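For the σ_y = 0 case just described, the closed-form procedure amounts to one SVD plus a diagonal rescaling. A minimal numpy sketch (function and variable names are ours, and the toy data are hypothetical; q is supplied by the user as in the text):

```python
import numpy as np

def em_global_solution(Y, q):
    """Global EM solution for the sigma_y = 0 case (Y itself is the dictionary):
    sigma_d^2 is the average squared discarded singular value (eq. 14), and
    A<B> is a scaled rank-q shape interaction matrix (eq. 12)."""
    N = Y.shape[1]
    _, lam, Vt = np.linalg.svd(Y, full_matrices=False)
    sigma_d2 = (lam[q:] ** 2).sum() / (N - q)
    lam_bar = np.maximum(lam[:q], np.sqrt(N * sigma_d2))
    scale = 1.0 - N * sigma_d2 / lam_bar ** 2  # diag of I_q - N sigma_d^2 Lambda_bar^(-2)
    return (Vt[:q].T * scale) @ Vt[:q], sigma_d2

# Toy check on two independent rank-1 subspaces (hypothetical data):
# A<B> comes out (nearly) block diagonal, as the theory predicts.
rng = np.random.default_rng(1)
u1, u2 = rng.standard_normal((2, 6))
Y = np.hstack([np.outer(u1, rng.standard_normal(5)),
               np.outer(u2, rng.standard_normal(5))])
AB, sd2 = em_global_solution(Y, q=2)
```

Since the toy data are noise-free, the discarded singular values are numerically zero, so sigma_d2 vanishes and the solution reduces to the plain shape interaction matrix.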
We develop two methods towards this goal, which are described next.\n\n3 Variational Bayesian Low-Rank Subspace Clustering\n\nBayesian estimation of D, A and B can be achieved by treating them as latent variables to be marginalized over, to avoid overfitting and trivial solutions such as AB = 0. Here we develop such a method based on the probability model of the previous section, but with additional priors introduced on A, B and the noise variances. Before presenting our complete probability model, we first introduce the matrix-variate normal distribution, as its use significantly simplifies the algorithm derivation. For an M × N matrix X, the matrix-variate normal distribution is given by [10]\n\nN(X|M, Σ, Ω) = (2π)^{−MN/2} |Σ|^{−N/2} |Ω|^{−M/2} exp[ −(1/2) tr( Σ^{-1} (X − M) Ω^{-1} (X − M)^T ) ] , (15)\n\nwhere M is the mean, and Σ, Ω are the M × M row and N × N column covariances, respectively. To automatically determine the number of principal components in AB, we employ an automatic relevance determination mechanism [21] on the columns of A and the rows of B using the priors p(A) = N(A|0, I, C_A), p(B) = N(B|0, C_B, I), where C_A and C_B are diagonal matrices with C_A = diag(c_{A,i}) and C_B = diag(c_{B,i}), i = 1, . . . , N. Jeffrey's priors are placed on c_{A,i} and c_{B,i}, and they are assumed to be independent. To avoid scale ambiguity, the columns of A and rows of B can also be coupled using the same set of hyperparameters C_A = C_B, as in [1].\n\nFor inference, we employ the variational Bayesian (VB) method [4], which leads to a fast algorithm. Let q(D, A, B, C_A, C_B, σ_d^2, σ_y^2) be the distribution that approximates the posterior. The variational free energy is given by the functional\n\nF = ⟨ log q(D, A, B, C_A, C_B, σ_d^2, σ_y^2) − log p(Y, D, A, B, C_A, C_B, σ_d^2, σ_y^2) ⟩ . (16)\n\nUsing the mean field approximation, the approximate posterior is factorized as q(D, A, B, C_A, C_B, σ_d^2, σ_y^2) = q(D) q(A) q(B) q(C_A) q(C_B) q(σ_d^2) q(σ_y^2). Using the priors defined above with the conditional distributions in (5) and (6), the approximating distributions of D, A and B minimizing the free energy F are found as matrix-variate normal distributions3 q(D) = N(⟨D⟩, I, Ω_D), q(A) = N(⟨A⟩, Σ_A, Ω_A) and q(B) = N(⟨B⟩, Σ_B, I), with parameters\n\nΩ_D^{-1} = (1/⟨σ_y^2⟩) I_N + (1/⟨σ_d^2⟩) ⟨(I − AB)(I − AB)^T⟩ , ⟨D⟩ = (1/⟨σ_y^2⟩) Y Ω_D , (17)\nΣ_A^{-1} = (1/N) tr(C_A^{-1} Ω_A) I + (1/(N ⟨σ_d^2⟩)) tr(Ω_A ⟨BB^T⟩) ⟨D^T D⟩ , (18)\nΩ_A^{-1} = (1/N) tr(Σ_A) C_A^{-1} + (1/(N ⟨σ_d^2⟩)) tr(Σ_A ⟨D^T D⟩) ⟨BB^T⟩ , (19)\n⟨A⟩ C_A^{-1} + (1/⟨σ_d^2⟩) ⟨D^T D⟩ ⟨A⟩ ⟨BB^T⟩ = (1/⟨σ_d^2⟩) ⟨D^T D⟩ ⟨B⟩^T , (20)\nΣ_B^{-1} = C_B^{-1} + (1/⟨σ_d^2⟩) ⟨A^T D^T D A⟩ , ⟨B⟩ = Σ_B (1/⟨σ_d^2⟩) ⟨A^T D^T D⟩ . (21)\n\nThe estimate ⟨A⟩ in (20) is solved using fixed-point iterations. The hyperparameter updates are given by\n\n⟨c_{A,i}^{-1}⟩ = N / ⟨A^T A⟩_ii , ⟨c_{B,i}^{-1}⟩ = N / ⟨BB^T⟩_ii , (22)\n⟨σ_d^2⟩ = ⟨‖D − DAB‖_F^2⟩ / (MN) , ⟨σ_y^2⟩ = ⟨‖Y − D‖_F^2⟩ / (MN) . (23)\n\nExplicit forms of the required moments are given in the supplementary. In summary, the algorithm alternates between calculating the sufficient statistics of the distributions of D, A and B, and the updates of the hyperparameters c_{A,i}, c_{B,i}, σ_d^2 and σ_y^2. Convergence can be monitored during the iterations using the variational free energy F. F is also useful in model comparison, which we use for detecting outliers, as explained in Sec. 5.\n\nSimilarly to the matrix factorization approaches [2, 3, 13], automatic dimensionality selection is invoked via the hyperparameters c_{A,i} and c_{B,i}, which enforce sparsity in the columns and rows of A and B, respectively. Specifically, when a particular set of variances c_{A,i}, c_{B,i} assumes very small values, the posteriors of the ith column of A and the ith row of B will be concentrated around zero, so that the effective number of principal directions in AB is reduced. In practice, this is performed by thresholding the variances c_{A,i}, c_{B,i} with a small threshold (e.g., 10^{-10}).\n\n4 A Factorization-Based Variational Bayesian Approach\n\nAnother Bayesian method can be developed by further investigating the probability model. Essentially, the estimates of A and B are based on the factorization of D and are independent of Y. Thus, one can apply a matrix factorization method to D, and relate this factorization to DAB to find AB. 
Based on this idea, we modify the probabilistic model to p(D) = N(D|D_L D_R, I, (1/σ_d^2) I), p(D_L) = N(D_L|0, I, C_L), p(D_R) = N(D_R|0, C_R, I), where the diagonal covariances C_L and C_R are used to induce sparsity in the columns of D_L and the rows of D_R, respectively. It has been shown in [20] that when variational Bayesian inference is applied to this model, the global solution is found analytically and is given by\n\nD_L D_R = U Λ_F V^T , (24)\n\nwhere U, V contain the singular vectors of D, and Λ_F is a diagonal matrix obtained by applying a specific shrinkage method to the singular values of D [20]. The number of retained singular values is therefore determined automatically. Then, setting D_L D_R equal to DAB, we obtain the solution AB = V_f Λ_f^{-1} Λ_F V_f^T, where the subscript f denotes the retained singular values and vectors.\n\n3The optimal distribution q(A) does not have a matrix-variate normal form. However, we force it to have this form for computational efficiency (see the supplementary for details).\n\nThe only modification to the method in the previous section is to replace the estimation of A and B in (18)–(21) with the global solution V_f Λ_f^{-1} Λ_F V_f^T. Thus, this method allows us to avoid the alternating optimization for finding A and B, which can potentially get stuck in undesired local minima. Although the probability model is slightly different from the one described in the previous section, we anticipate its global solution to be closely related to the factorization-based solution.\n\n5 Robustness to Outliers\n\nDepending on the application, the outliers might take various forms. For instance, in motion tracking applications, an entire data point might become an outlier if the tracker fails at that instance. In other applications, only a subset of the coordinates might be corrupted with large errors. Both types (and possibly others) can be handled in our modeling. The only required change in the model is in the conditional distribution of the observations,\n\np(Y|D) = N(Y|D + E, σ_y^2) , (25)\n\nwhere E is the sparse outlier matrix, for which we introduce the prior\n\np(E) = N(E|0, C_E^C, C_E^R) = N(vec(E)|0, C_E^C ⊗ C_E^R) . (26)\n\nThe shapes of the column covariance matrix C_E^C and the row covariance matrix C_E^R depend on the nature of the outliers. If only entire data points might be corrupted, we can use C_E^C = I and independent terms in C_E^R, such that C_E^R = diag(c_{E,i}^R), i = 1, . . . , N. When entire coordinates can be corrupted, row-sparsity in E can be imposed using C_E^R = I and C_E^C = diag(c_{E,i}^C). In the first case, the VB estimation rule becomes q(e_i) = N(⟨e_i⟩, I, Σ_{e_i}) with\n\n⟨e_i⟩ = Σ_{e_i} (1/⟨σ_y^2⟩) (y_i − ⟨d_i⟩) , Σ_{e_i} = diag( 1/⟨σ_y^2⟩ + 1/⟨c_{E,i}^R⟩ )^{-1} , (27)\n\nwith the hyperparameter update ⟨c_{E,i}^R⟩ = ⟨e_i⟩^T⟨e_i⟩ + tr(Σ_{e_i}). The estimation rules for other outlier models can be derived in a similar manner.\n\nIn the presence of outlier data points, there is an inherent unidentifiability between AB and E which can prevent the detection of outliers and hence reduce the performance of subspace clustering. Specifically, an outlier y_i can be included in the sparse component as e_i = y_i, or included in the dictionary D with its own subspace, which leads to (AB)_ii ≈ 1. To avoid the latter case, we introduce a heuristic inspired by the birth and death method in [9]. During the iterations, data points y_i with (AB)_ii larger than a threshold (e.g., 0.95) are assigned to the sparse component e_i. 
As this might initially increase the variational energy F, we monitor its progress over a few iterations and reject this "birth" of the sparse component if F does not decrease below its original state. This method is observed to be very effective in identifying outliers and alleviating the effect of the initialization.\n\nFinally, missing values in Y can also be handled by modifying the distribution of the observations in (5) to p(y_i|d_i) = ∏_{k∈Z_i} N(y_ik | d_ik, σ_y^2), where Z_i is the set containing the indices of the observed entries in the vector y_i. The inference procedures can be modified with relative ease to accommodate this change.\n\n6 Experiments\n\nIn this section, we evaluate the performance of the three algorithms introduced above, namely, the EM method in Sec. 2.2, the variational Bayesian method in Sec. 3 (VBLR) and the factorization-based method in Sec. 4 (VBLR-Fac). We also include comparisons with deterministic subspace clustering and mixture of PPCA (MPPCA) methods. In all experiments, the estimated AB matrix is used to find the affinity matrix, and the normalized cuts algorithm [24] is applied to find the clustering and hence the subspaces.\n\nFigure 1: Clustering 1D subspaces (points in the same cluster are in the same color): (a) MPPCA [3] result, (b) the result of the EM algorithm (global solution). The Bayesian methods give results almost identical to (b).\n\nFigure 2: Accuracy of clustering 5 independent subspaces of dimension 5 for different percentages of outliers.\n\nSynthetic Data. We generated 27 line segments intersecting at the origin, as shown in Fig. 1, where each contains 800 points slightly corrupted by i.i.d. Gaussian noise of variance 0.1. Each line can be considered a separate 1D subspace, and the subspaces are disjoint but not independent. 
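This synthetic setup can be reproduced along the following lines. The 2-D ambient space, the unit segment length, and the uniformly drawn line directions are our assumptions; the paper only specifies 27 lines through the origin, 800 points each, and noise variance 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 27, 800                      # 27 one-dimensional subspaces, 800 points each

# One unit direction per line through the origin (2-D ambient space assumed).
theta = rng.uniform(0.0, np.pi, size=K)
dirs = np.stack([np.cos(theta), np.sin(theta)])          # 2 x K

coords = rng.uniform(-1.0, 1.0, size=(K, n))             # positions along each line
Y = np.hstack([np.outer(dirs[:, k], coords[k]) for k in range(K)])
Y += np.sqrt(0.1) * rng.standard_normal(Y.shape)         # noise of variance 0.1
labels = np.repeat(np.arange(K), n)                      # ground-truth clustering
```

The clustering methods below are then evaluated by comparing their segmentation of the columns of Y against `labels`.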
We first applied the mixture of PPCA [3], to which we provided the dimensions and the number of the subspaces. This method is sensitive to the proximity of the subspaces, and in all of our trials gave results similar to Fig. 1(a), where close lines are clustered together. On the other hand, the EM method accurately clusters the lines into different subspaces (Fig. 1(b)), and it is extremely efficient, involving only one SVD. Both Bayesian methods VBLR and VBLR-Fac gave similar results and accurately estimated the subspace dimensions, while the VB-variant of MPPCA [9] gave results similar to Fig. 1(a).\n\nNext, similarly to the setup in [15], we construct 5 independent subspaces {S_i} ⊂ R^50 of dimension 5, with bases U_i generated as follows: We first generate a random 50 × 5 orthogonal matrix U_1, and then rotate it with random orthonormal matrices R_i to obtain U_i = R_i U_1, i = 2, . . . , 5. The dictionary D is obtained by sampling 25 points from each subspace using D_i = U_i V_i, where the V_i are 5 × 25 matrices with elements drawn from N(0, 1). Finally, Y is obtained by corrupting D with outliers sampled from N(0, 1) and normalized to lie on the unit sphere. We applied our methods VBLR and VBLR-Fac to cluster the data into 5 groups, and compared their performance with MPPCA and LRR. The average clustering errors (over 20 trials) in Fig. 2 show that LRR and the proposed methods provide much better performance than MPPCA. VBLR and VBLR-Fac gave similar results, while VBLR-Fac converges much faster (generally about 10 vs. 100 iterations). Although LRR also gives very good results, its performance varies with its parameters. 
As an example, we included its results obtained with the optimal parameter value and with a slightly different one, where in the latter case the degradation in accuracy is evident.\n\nTable 1: Clustering errors (%) on the Hopkins155 motion database\n\nMethod  GPCA [19]  LSA [30]  SSC [7]  LRR [15]  VBLR   VBLR-Fac\nMean    30.51      8.77      3.66     1.85      1.71   1.75\nMax     55.67      38.37     37.44    37.32     32.50  35.13\nStd     11.79      9.80      7.21     5.10      4.85   4.92\n\n[Figure 2 plot: clustering accuracy (%) vs. percentage of outliers, for LRR (λ = 0.01), LRR (λ = 0.16), VBLR, VBLR-Fac and MPPCA.]\n\nReal Data with Small Corruptions. The Hopkins155 motion database [27] is frequently used to test subspace clustering methods. It consists of 156 sequences, where each contains 39 to 550 data vectors corresponding to either 2 or 3 motions. Each motion corresponds to a subspace, and each sequence is regarded as a separate clustering task. While most existing methods use a pre-processing stage that generally involves dimensionality reduction using PCA, we do not employ pre-processing and apply our Bayesian methods directly (the EM method cannot handle outliers and thus is not included in these experiments). The mean and maximum clustering errors and the standard deviation over the whole set are shown in Table 1. The proposed methods provide close to state-of-the-art performance, while competing methods require manual tuning of their parameters, which can affect their performance. For instance, the results of LRR are obtained by setting its parameter λ = 4, while changing it to λ = 2.4 gives 3.13% error [15]. The method in [8], which is similar to our EM method except that it also handles outliers, achieves an error rate of 1.44%. Finally, the deterministic method [17] achieves an error rate of 0.85% and, to our knowledge, is the best performing method on this dataset.\n\nReal Data with Large Corruptions. 
To test our methods on real data with large corruptions, we use the Extended Yale Database B [12], from which we chose the first 10 classes, containing 640 frontal face images. Each class contains 64 images, and each image is resized to 48 × 42 and stacked to generate the data vectors. Figure 3 depicts some example images, where significant corruption due to shadows and heavy noise is evident. The task is to cluster the 640 images into 10 classes. The segmentation accuracies achieved by the proposed methods and some existing methods are listed in Table 2, where it is evident that the proposed methods achieve state-of-the-art performance. Examples of the recovered clean dictionary and sparse outlier components are shown in Fig. 3.\n\nTable 2: Clustering accuracy (%) on the Extended Yale Database B\n\nMethod   LSA [30]  SSC [7]  LRR [15]  VBLR   VBLR-Fac\nAverage  31.72     37.66    67.62     62.53  69.72\n\nFigure 3: Examples of recovered clean data and large corruptions. Original images are shown in the left column (denoted by Y), the clean dictionary elements obtained by VBLR and VBLR-Fac are shown in the columns denoted by DAB, and the columns denoted by E show the corruption captured by the sparse element.\n\n7 Conclusion\n\nIn this work we presented a probabilistic treatment of low-dimensional subspace clustering. Using a latent variable formulation, we developed an expectation-maximization method and derived its global solution. We further proposed two effective Bayesian methods, both based on the automatic relevance determination principle and variational Bayesian approximation for inference. While the first one, VBLR, relies completely on alternating optimization, the second one, VBLR-Fac, makes use of the global solution of VB matrix factorization to eliminate one alternating step and leads to faster convergence. Both methods have been extended to handle sparse large corruptions in the data for robustness. 
These methods are advantageous over deterministic methods, as they are able to automatically determine the total number of principal dimensions and all required algorithmic parameters. This property is particularly important in unsupervised settings. Finally, our formulation can potentially be extended to model multiple nonlinear manifolds through the use of kernel methods.

Acknowledgments. The authors thank the anonymous reviewers for helpful comments. SDB acknowledges the Beckman Institute Postdoctoral Fellowship. SN thanks the support from MEXT Kakenhi 23120004. MND was partially supported by NSF CHE 09-57849.

References
[1] S. D. Babacan, M. Luessi, R. Molina, and A. K. Katsaggelos. Sparse Bayesian methods for low-rank matrix estimation. IEEE Trans. Signal Proc., 60(8), Aug. 2012.
[2] C. M. Bishop. Bayesian principal components. In NIPS, volume 11, pages 382–388, 1999.
[3] C. M. Bishop. Variational principal components. In Proc. of ICANN, volume 1, pages 509–514, 1999.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, abs/0912.3599, 2009.
[6] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. Int. J. Comput. Vision, 29(3):159–179, September 1998.
[7] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, pages 2790–2797, 2009.
[8] P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. In CVPR, pages 1801–1807, 2011.
[9] Z. Ghahramani and M. J. Beal. Variational inference for Bayesian mixtures of factor analysers. In NIPS, volume 12, pages 449–455, 2000.
[10] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, New York, 2000.
[11] K. Huang and S. Aviyente. Sparse representation for signal classification. In NIPS, 2006.
[12] K.-C.
Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Machine Intell., 27:684–698, 2005.
[13] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proc. of KDD Cup and Workshop, 2007.
[14] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. CoRR, abs/1010.2955, 2012.
[15] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
[16] G. Liu, H. Xu, and S. Yan. Exact subspace segmentation and outlier detection by low-rank representation. In AISTATS, 2012.
[17] G. Liu and S. Yan. Latent low-rank representation for subspace segmentation and feature extraction. In ICCV, 2011.
[18] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.
[19] Y. Ma, A. Yang, H. Derksen, and R. Fossum. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review, 50(3):413–458, 2008.
[20] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12:2583–2648, 2011.
[21] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
[22] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. In Proc. KDD, 2008.
[23] S. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories. IEEE Trans. Pattern Anal. Machine Intell., 32(10):1832–1845, 2010.
[24] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intell., 22(8):888–905, Aug. 2000.
[25] M. Soltanolkotabi and E. J. Candès.
A geometric analysis of subspace clustering with outliers. CoRR, 2011.
[26] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Comput., 11(2):443–482, February 1999.
[27] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In CVPR, June 2007.
[28] R. Vidal. Subspace clustering. IEEE Signal Process. Mag., 28(2):52–68, 2011.
[29] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). IEEE Trans. Pattern Anal. Machine Intell., 27(12):1945–1959, 2005.
[30] J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In ECCV, volume 4, pages 94–106, 2006.
[31] C. Zhang and R. R. Bitmead. Subspace system identification for training-based MIMO channel estimation. Automatica, 41:1623–1632, 2005.