{"title": "Recovery of Coherent Data via Low-Rank Dictionary Pursuit", "book": "Advances in Neural Information Processing Systems", "page_first": 1206, "page_last": 1214, "abstract": "The recently established RPCA method provides a convenient way to restore low-rank matrices from grossly corrupted observations. While elegant in theory and powerful in reality, RPCA is not an ultimate solution to the low-rank matrix recovery problem. Indeed, its performance may not be perfect even when data are strictly low-rank. This is because RPCA ignores the clustering structures of the data, which are ubiquitous in applications. As the number of clusters grows, the coherence of the data keeps increasing, and accordingly, the recovery performance of RPCA degrades. We show that the challenges raised by coherent data (i.e., data with high coherence) can be alleviated by Low-Rank Representation (LRR)~\\cite{tpami_2013_lrr}, provided that the dictionary in LRR is configured appropriately. More precisely, we mathematically prove that if the dictionary itself is low-rank then LRR is immune to the coherence parameter which increases with the underlying cluster number. This provides an elementary principle for dealing with coherent data and naturally leads to a practical algorithm for obtaining proper dictionaries in unsupervised environments. Experiments on randomly generated matrices and real motion sequences verify our claims.
See the full paper at arXiv:1404.4032.", "full_text": "Recovery of Coherent Data via Low-Rank Dictionary Pursuit

Guangcan Liu
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
gcliu@rutgers.edu

Ping Li
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
pingli@rutgers.edu

Abstract

The recently established RPCA [4] method provides a convenient way to restore low-rank matrices from grossly corrupted observations. While elegant in theory and powerful in reality, RPCA is not an ultimate solution to the low-rank matrix recovery problem. Indeed, its performance may not be perfect even when data are strictly low-rank. This is because RPCA ignores the clustering structures of the data, which are ubiquitous in applications. As the number of clusters grows, the coherence of the data keeps increasing, and accordingly, the recovery performance of RPCA degrades. We show that the challenges raised by coherent data (i.e., data with high coherence) can be alleviated by Low-Rank Representation (LRR) [13], provided that the dictionary in LRR is configured appropriately. More precisely, we mathematically prove that if the dictionary itself is low-rank then LRR is immune to the coherence parameter which increases with the underlying cluster number. This provides an elementary principle for dealing with coherent data and naturally leads to a practical algorithm for obtaining proper dictionaries in unsupervised environments. Experiments on randomly generated matrices and real motion sequences verify our claims. See the full paper at arXiv:1404.4032.

1 Introduction

Nowadays our data are often high-dimensional, massive and full of gross errors (e.g., corruptions, outliers and missing measurements).
In the presence of gross errors, the classical Principal Component Analysis (PCA) method, which is probably the most widely used tool for data analysis and dimensionality reduction, becomes brittle: a single gross error could render the estimate produced by PCA arbitrarily far from the desired one. As a consequence, it is crucial to develop new statistical tools for robustifying PCA. A variety of methods have been proposed and explored in the literature over several decades, e.g., [2, 3, 4, 8, 9, 10, 11, 12, 24, 13, 16, 19, 25]. One of the most exciting is probably the so-called RPCA (Robust Principal Component Analysis) method [4], which was built upon the exploration of the following low-rank matrix recovery problem:

Problem 1 (Low-Rank Matrix Recovery) Suppose we have a data matrix X ∈ R^{m×n} and we know it can be decomposed as

X = L0 + S0, (1.1)

where L0 ∈ R^{m×n} is a low-rank matrix each column of which is a data point drawn from some low-dimensional subspace, and S0 ∈ R^{m×n} is a sparse matrix supported on Ω ⊆ {1, ..., m} × {1, ..., n}. Except for these mild restrictions, both components are arbitrary. The rank of L0 is unknown, and the support set Ω (i.e., the locations of the nonzero entries of S0) and its cardinality (i.e., the number of nonzero entries of S0) are unknown as well. In particular, the magnitudes of the nonzero entries in S0 may be arbitrarily large. Given X, can we recover both L0 and S0, in a scalable and exact fashion?

Figure 1: Exemplifying the extra structures of low-rank data. Each entry of the data matrix is a grade that a user assigns to a movie. It is often the case that such data are low-rank, as there exist wide correlations among the grades that different users assign to the same movie.
Also, such data may possess some clustering structure, since the preferences of the same type of users are more similar to each other than to those of users with different gender, personality, culture and education background. In summary, such data (1) are often low-rank and (2) exhibit some clustering structure.

The theory of RPCA tells us that, very generally, when the low-rank matrix L0 is also incoherent (i.e., with low coherence), both the low-rank and the sparse matrices can be exactly recovered by using the following convex, potentially scalable program:

min_{L,S} ‖L‖_* + λ‖S‖_1, s.t. X = L + S, (1.2)

where ‖·‖_* is the nuclear norm [7] of a matrix, ‖·‖_1 denotes the ℓ1 norm of a matrix seen as a long vector, and λ > 0 is a parameter. Besides its elegance in theory, RPCA also has good empirical performance in many practical areas, e.g., image processing [26], computer vision [18], radar imaging [1], magnetic resonance imaging [17], etc.

While complete in theory and powerful in reality, RPCA cannot be an ultimate solution to the low-rank matrix recovery Problem 1. Indeed, the method might not produce perfect recovery even when L0 is strictly low-rank. This is because RPCA captures only the low-rankness property, which is not the only property of our data, and essentially ignores the extra structures (beyond low-rankness) widely existing in data: given the constraint that the data points (i.e., the column vectors of L0) lie on a low-dimensional subspace, it is unnecessary for the data points to be distributed on the subspace uniformly at random, and it is quite normal that the data have some extra structures which specify in more detail how the data points are distributed on the subspace. Figure 1 demonstrates a typical example of extra structures, namely the clustering structures which are ubiquitous in modern applications.
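For concreteness, the convex program (1.2) can be solved with standard first-order methods. The following minimal numpy sketch uses an augmented-Lagrangian scheme in the style of [4]; the function names, the step-size heuristic μ = mn/(4‖X‖_1) and the iteration caps are illustrative choices, not the authors' implementation:

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def soft(M, tau):
    """Entry-wise soft thresholding: proximal operator of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca(X, lam=None, max_iter=1000, tol=1e-7):
    """Sketch of RPCA (1.2): min ||L||_* + lam * ||S||_1  s.t.  X = L + S."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))       # lambda = 1/sqrt(n1), as in the paper
    mu = 0.25 * m * n / np.abs(X).sum()      # step-size heuristic (an assumption)
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                     # Lagrange multiplier
    for _ in range(max_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)    # low-rank update
        S = soft(X - L + Y / mu, lam / mu)   # sparse update
        R = X - L - S                        # constraint residual
        Y += mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return L, S

# Sanity run: rank-4 matrix plus gross +-5 errors on 5% of the entries.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((80, 4)) @ rng.standard_normal((4, 80))
mask = rng.random((80, 80)) < 0.05
S0 = np.where(mask, rng.choice([-5.0, 5.0], size=(80, 80)), 0.0)
L, S = rpca(L0 + S0)
```

Under this mild, incoherent setting the estimate L should match L0 closely; the discussion below explains why the same solver degrades as the cluster number (and hence the coherence) grows.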
Whenever the data exhibit some clustering structure, RPCA is no longer guaranteed to be perfect: as will be shown in this paper, while the rank of L0 is fixed and the underlying cluster number goes large, the coherence of L0 keeps heightening and thus, arguably, the performance of RPCA drops.

To better handle coherent data (i.e., the cases where L0 has large coherence parameters), a seemingly straightforward idea is to avoid the coherence parameters of L0. However, as explained in [4], the coherence parameters are indeed necessary (if there is no additional condition assumed on the data). This paper shall further indicate that the coherence parameters are related in nature to some extra structures intrinsically existing in L0 and therefore cannot simply be discarded. Interestingly, we show that it is possible to avoid the coherence parameters by using some additional conditions, which are easy to obey in supervised environments and can also be approximately achieved in unsupervised environments. Our study is based on the following convex program, termed Low-Rank Representation (LRR) [13]:

min_{Z,S} ‖Z‖_* + λ‖S‖_1, s.t. X = AZ + S, (1.3)

where A ∈ R^{m×d} is a size-d dictionary matrix constructed in advance¹, and λ > 0 is a parameter. In order for LRR to avoid the coherence parameters which increase with the cluster number underlying

¹It is not crucial to determine the exact value of d. Suppose Z∗ is the optimal solution with respect to Z. Then LRR uses AZ∗ to restore L0. LRR falls back to RPCA when A = I (identity matrix).
Furthermore, it can be proved that the recovery produced by LRR is the same as that of RPCA whenever the dictionary A is orthogonal.

L0, we prove that it is sufficient to construct in advance a dictionary A which is low-rank by itself. This gives a generic prescription for defending against the adverse effects of coherent data, providing an elementary criterion for learning the dictionary matrix A. Subsequently, we propose a simple and effective algorithm that utilizes the output of RPCA to construct the dictionary in LRR. Our extensive experiments on randomly generated matrices and motion data show promising results. In summary, the contributions of this paper include the following:

⋄ For the first time, this paper studies the problem of recovering low-rank yet coherent (equivalently, less incoherent) matrices from their corrupted versions. We investigate the physical regime where coherent data arise: for example, the widely existing clustering structures may lead to coherent data. We prove some basic theorems for resolving the problem, and also establish a practical algorithm that outperforms RPCA in our experimental study.

⋄ Our studies help reveal the physical meaning of coherence, which is now standard and widely used in the literature, e.g., [2, 3, 4, 25, 15]. We show that the coherence parameters are not “assumptions” needed for a proof, but rather quantities that relate in nature to the extra structures (beyond low-rankness) intrinsically existing in L0.

⋄ This paper provides insights regarding the LRR model proposed by [13]. While the special case of A = X has been extensively studied, the LRR model (1.3) with general dictionaries is not fully understood yet.
We show that LRR (1.3) equipped with proper dictionaries can well handle coherent data.

⋄ The idea of replacing L with AZ is essentially related to the spirit of matrix factorization, which has long been explored, e.g., [20, 23]. In that sense, the explorations of this paper help to understand why factorization techniques are useful.

2 Summary of Main Notations

Capital letters such as M are used to represent matrices, and accordingly, [M]_{ij} denotes the (i, j)th entry of M. The letters U, V, Ω and their variants (complements, subscripts, etc.) are reserved for left singular vectors, right singular vectors and support sets, respectively. We shall abuse the notation U (resp. V) to denote the linear space spanned by the columns of U (resp. V), i.e., the column space (resp. row space). The projection onto the column space U is denoted by P_U and given by P_U(M) = UU^T M, and similarly for the row space, P_V(M) = MVV^T. We shall also abuse the notation Ω to denote the linear space of matrices supported on Ω. Then P_Ω and P_{Ω⊥} respectively denote the projections onto Ω and Ω^c, such that P_Ω + P_{Ω⊥} = I, where I is the identity operator. The symbol (·)^+ denotes the Moore-Penrose pseudoinverse of a matrix: M^+ = V_M Σ_M^{−1} U_M^T for a matrix M with Singular Value Decomposition (SVD)² M = U_M Σ_M V_M^T.

Six different matrix norms are used in this paper. The first three norms are functions of the singular values: 1) the operator norm (i.e., the largest singular value), denoted by ‖M‖; 2) the Frobenius norm (i.e., the square root of the sum of squared singular values), denoted by ‖M‖_F; and 3) the nuclear norm (i.e., the sum of singular values), denoted by ‖M‖_*.
The other three are the ℓ1, ℓ∞ (i.e., sup-norm) and ℓ2,∞ norms of a matrix: ‖M‖_1 = Σ_{i,j} |[M]_{ij}|, ‖M‖_∞ = max_{i,j} |[M]_{ij}| and ‖M‖_{2,∞} = max_j √(Σ_i [M]_{ij}²), respectively.

The Greek letter μ and its variants (e.g., subscripts and superscripts) are reserved for the coherence parameters of a matrix. We shall also reserve two lower case letters, m and n, to respectively denote the data dimension and the number of data points, and we use the following two symbols throughout this paper:

n1 = max(m, n) and n2 = min(m, n).

3 On the Recovery of Coherent Data

In this section, we shall first investigate the physical regime that gives rise to coherent (or less incoherent) data, and then discuss the problem of recovering coherent data from corrupted observations, providing some basic principles and an algorithm for resolving the problem.

²In this paper, SVD always refers to the skinny SVD. For a rank-r matrix M ∈ R^{m×n}, its SVD is of the form U_M Σ_M V_M^T, with U_M ∈ R^{m×r}, Σ_M ∈ R^{r×r} and V_M ∈ R^{n×r}.

3.1 Coherence Parameters and Their Properties

As the rank function cannot fully capture all characteristics of L0, it is necessary to define some quantities that measure the effects of various extra structures (beyond low-rankness), such as the clustering structure demonstrated in Figure 1. The coherence parameters defined in [3, 4] are excellent exemplars of such quantities.

3.1.1 Coherence Parameters: μ1, μ2, μ3

For an m × n matrix L0 with rank r0 and SVD L0 = U0 Σ0 V0^T, some important properties can be characterized by three coherence parameters, denoted as μ1, μ2 and μ3, respectively.
The first coherence parameter, 1 ≤ μ1(L0) ≤ m, which characterizes the column space identified by U0, is defined as

μ1(L0) = (m/r0) max_{1≤i≤m} ‖U0^T e_i‖_2², (3.4)

where e_i denotes the ith standard basis vector. The second coherence parameter, 1 ≤ μ2(L0) ≤ n, which characterizes the row space identified by V0, is defined as

μ2(L0) = (n/r0) max_{1≤j≤n} ‖V0^T e_j‖_2². (3.5)

The third coherence parameter, 1 ≤ μ3(L0) ≤ mn, which characterizes the joint space identified by U0 V0^T, is defined as

μ3(L0) = (mn/r0)(‖U0 V0^T‖_∞)² = (mn/r0) max_{i,j} (|⟨U0^T e_i, V0^T e_j⟩|)². (3.6)

The analysis in RPCA [4] merges the above three parameters into a single one: μ(L0) = max{μ1(L0), μ2(L0), μ3(L0)}. As will be seen later, the behaviors of these three coherence parameters differ from each other, and hence it is more adequate to consider them individually.

3.1.2 The μ2-phenomenon

According to the analysis in [4], the success condition (regarding L0) of RPCA is

rank(L0) ≤ c_r n2 / (μ(L0)(log n1)²), (3.7)

where μ(L0) = max{μ1(L0), μ2(L0), μ3(L0)} and c_r > 0 is some numerical constant. Thus, RPCA will be less successful when the coherence parameters are considerably larger. In this subsection, we shall show that the widely existing clustering structure can enlarge the coherence parameters and, accordingly, downgrade the performance of RPCA.

Given the restriction rank(L0) = r0, the data points (i.e., the column vectors of L0) are not necessarily sampled from an r0-dimensional subspace uniformly at random. A more realistic interpretation is to consider the data points as samples from the union of k subspaces (i.e., clusters), the sum of which has dimension r0.
That is to say, there are multiple “small” subspaces inside one r0-dimensional “large” subspace, as exemplified in Figure 1. Whenever the low-rank matrix L0 also exhibits such clustering behaviors, the second coherence parameter μ2(L0) (and so μ3(L0)) will increase with the number of clusters underlying L0, as shown in Figure 2. As the coherence heightens, (3.7) suggests that the performance of RPCA will drop, as verified in Figure 2(d). Note here that the variation of μ3 is mainly due to the variation of the row space, which is characterized by μ2. We refer to the phenomena shown in Figure 2(b)∼(d) as the “μ2-phenomenon”. Readers can also refer to the full paper to see why the second coherence parameter increases with the cluster number underlying L0.

Interestingly, one may have noticed that μ1 is invariant to the variation of the cluster number, as can be seen from Figure 2(a). This is because the clustering behavior of the data points can only affect the row space, while μ1 is defined on the column space. Yet, if the row vectors of L0 also own some clustering structure, μ1 could be large as well. Such data exist widely in text documents, and we leave this as future work.

Figure 2: Exploring the influence of the cluster number, using randomly generated matrices. [Panels (a)∼(d) plot μ1(L0), μ2(L0), μ3(L0) and the recovery error of RPCA against the cluster number, which takes the values 1, 2, 5, 10, 20, 50.] The size and rank of L0 are fixed to be 500 × 500 and 100, respectively. The underlying cluster number varies from 1 to 50. For the recovery experiments, S0 is fixed as a sparse matrix with 13% nonzero entries.
(a) The first coherence parameter μ1(L0) vs the cluster number. (b) μ2(L0) vs the cluster number. (c) μ3(L0) vs the cluster number. (d) The recovery error (produced by RPCA) vs the cluster number. The numbers shown in these figures are averaged over 100 random trials. The recovery error is computed as ‖L̂0 − L0‖_F/‖L0‖_F, where L̂0 denotes an estimate of L0.

3.2 Avoiding μ2 by LRR

The μ2-phenomenon implies that the second coherence parameter μ2 is related in nature to some intrinsic structures of L0 and thus cannot be eschewed without using additional conditions. In the following, we shall figure out under what conditions the second coherence parameter μ2 (and μ3) can be avoided, such that LRR can well handle coherent data.

Main Result: We show that, when the dictionary A itself is low-rank, LRR is able to avoid μ2. Namely, the following theorem is proved without using μ2. See the full paper for a detailed proof.

Theorem 1 (Noiseless) Let A ∈ R^{m×d} with SVD A = U_A Σ_A V_A^T be a column-wisely unit-normed (i.e., ‖Ae_i‖_2 = 1, ∀i) dictionary matrix which satisfies P_{U_A}(U0) = U0 (i.e., U0 is a subspace of U_A). For any 0 < ǫ < 0.5 and some numerical constant c_a > 1, if

rank(L0) ≤ rank(A) ≤ ǫ²n2 / (c_a μ1(A) log n1) and |Ω| ≤ (0.5 − ǫ)mn, (3.8)

then with probability at least 1 − n1^{−10}, the optimal solution to the LRR problem (1.3) with λ = 1/√n1 is unique and exact, in the sense that

Z∗ = A^+ L0 and S∗ = S0,

where (Z∗, S∗) is the optimal solution to (1.3).

It is worth noting that the restriction rank(L0) ≤ O(n2/log n1) is looser than that of RPCA³, which requires rank(L0) ≤ O(n2/(log n1)²).
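The coherence parameters (3.4)–(3.6) are easy to evaluate numerically from a skinny SVD. The sketch below is our own illustration (the helper name and the rank tolerance are assumptions); it computes all three parameters and reproduces the extreme example used in Figure 3, where a rank-1 matrix has μ1 = 1 while μ2 attains its maximum n:

```python
import numpy as np

def coherence(L, tol=1e-9):
    """Compute (mu1, mu2, mu3) of a matrix L, following (3.4)-(3.6)."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    r = int((s > tol * s[0]).sum())                 # numerical rank (skinny SVD)
    U, Vt = U[:, :r], Vt[:r, :]
    m, n = L.shape
    mu1 = m / r * np.max(np.sum(U ** 2, axis=1))    # column-space coherence
    mu2 = n / r * np.max(np.sum(Vt ** 2, axis=0))   # row-space coherence
    mu3 = m * n / r * np.max(np.abs(U @ Vt)) ** 2   # joint-space coherence
    return mu1, mu2, mu3

# The coherent rank-1 example of Figure 3: one all-ones column, zero elsewhere.
L0 = np.zeros((200, 200))
L0[:, 0] = 1.0
mu1, mu2, mu3 = coherence(L0)   # mu1 = 1 while mu2 reaches its maximum n = 200
```

Running the same helper on matrices sampled from a growing number of clusters (as in Figure 2) would show μ2 and μ3 increasing while μ1 stays flat.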
The requirement of a column-wisely unit-normed dictionary (i.e., ‖Ae_i‖_2 = 1, ∀i) is purely to comply with the parameter setting λ = 1/√n1, which is consistent with RPCA. The condition P_{U_A}(U0) = U0, i.e., U0 is a subspace of U_A, is indispensable if we ask for exact recovery, because P_{U_A}(U0) = U0 is implied by the equality AZ∗ = L0. This necessary condition, together with the low-rankness condition, provides an elementary criterion for learning the dictionary matrix A in LRR. Figure 3 presents an example which further confirms our main result; that is, LRR is able to avoid μ2 as long as U0 ⊂ U_A and A is low-rank. It is also worth noting that it is unnecessary for A to satisfy U_A = U0, and that LRR is actually tolerant to the “errors” possibly existing in the dictionary.

³In terms of exact recovery, O(n2/log n1) is probably the “finest” bound one could accomplish in theory.

Figure 3: Exemplifying that LRR can avoid μ2. [Panels show X, AZ∗, S∗ and the recovery error as a function of rank(A).] In this experiment, L0 is a 200 × 200 rank-1 matrix with one column being 1 (i.e., a vector of all ones) and everything else being zero. Thus, μ1(L0) = 1 and μ2(L0) = 200. The dictionary is set as A = [1, W], where W is a 200 × p random Gaussian matrix (with varying p). As long as rank(A) = p + 1 ≤ 10, LRR with λ = 0.08 can exactly recover L0 from a grossly corrupted observation matrix X.

The program (1.3) is designed for the case where the uncorrupted observations are noiseless. In reality this assumption is often untrue, as all entries of X can be contaminated by a small amount of noise, i.e., X = L0 + S0 + N, where N is a matrix of dense Gaussian noise. In this case, the formulation of LRR (1.3) needs to be modified to

min_{Z,S} ‖Z‖_* + λ‖S‖_1, s.t. ‖X − AZ − S‖_F ≤ ε, (3.9)

where ε is a parameter that measures the noise level of the data. In the experiments of this paper, we consistently set ε = 10^{−6}‖X‖_F. In the presence of dense noise, the latent matrices L0 and S0 cannot be exactly restored. Yet we have the following theorem to guarantee the near-recovery property of the solution produced by the program (3.9):

Theorem 2 (Noisy) Suppose ‖X − L0 − S0‖_F ≤ ε. Let A ∈ R^{m×d} with SVD A = U_A Σ_A V_A^T be a column-wisely unit-normed dictionary matrix which satisfies P_{U_A}(U0) = U0 (i.e., U0 is a subspace of U_A). For any 0 < ǫ < 0.35 and some numerical constant c_a > 1, if

rank(L0) ≤ rank(A) ≤ ǫ²n2 / (c_a μ1(A) log n1) and |Ω| ≤ (0.35 − ǫ)mn, (3.10)

then with probability at least 1 − n1^{−10}, any solution (Z∗, S∗) to (3.9) with λ = 1/√n1 gives a near recovery to (L0, S0), in the sense that ‖AZ∗ − L0‖_F ≤ 8√(mn)ε and ‖S∗ − S0‖_F ≤ (8√(mn) + 2)ε.

3.3 An Unsupervised Algorithm for Matrix Recovery

To handle coherent (equivalently, less incoherent) data, Theorem 1 suggests that the dictionary matrix A should be low-rank and satisfy U0 ⊂ U_A. In a supervised environment this might not be difficult, as one could potentially use clean, well-processed training data to construct the dictionary. In an unsupervised environment, however, it will be challenging to identify a low-rank dictionary that obeys U0 ⊂ U_A. Note that U0 ⊂ U_A can be viewed as supervision information (if A is low-rank). In this paper, we introduce a heuristic algorithm that can work distinctly better than RPCA in an unsupervised environment.
As can be seen from (3.7), RPCA is actually not brittle with respect to coherent data (although its performance is degraded). Based on this observation, we propose a simple algorithm, summarized in Algorithm 1, to achieve a solid improvement over RPCA. Our idea is straightforward: we first obtain an estimate of L0 by using RPCA and then utilize that estimate to construct the dictionary matrix A in LRR. The post-processing steps (Step 2 and Step 3) that slightly modify the solution of RPCA are meant to encourage a well-conditioned dictionary, which is the circumstance favoring LRR.

Whenever the recovery produced by RPCA is already exact, Theorem 1 implies that the recovery produced by our Algorithm 1 is exact as well. That is to say, in terms of exactly recovering L0 from a given X, the success probability of our Algorithm 1 is greater than or equal to that of RPCA. From the computational perspective, Algorithm 1 does not really double the work of RPCA, although there are two convex programs in our algorithm. In fact, according to our simulations, the computational time of Algorithm 1 is usually only about 1.2 times that of RPCA. The reason is that, as explored by [13], the complexity of solving the LRR problem (1.3) is O(n²r_A) (assuming m = n), which is much lower than that of RPCA (which requires O(n³)), provided that the obtained dictionary matrix A is fairly low-rank (i.e., r_A is small).

One may have noticed that the procedure of Algorithm 1 could be made iterative, i.e., one could consider ÂZ∗ as a new estimate of L0 and use it to further update the dictionary matrix A, and so on. Nevertheless, we empirically find that such an iterative procedure often converges within two iterations. Hence, for the sake of simplicity, we do not consider iterative strategies in this paper.

Algorithm 1 Matrix Recovery
input: Observed data matrix X ∈ R^{m×n}.
adjustable parameter: λ.
1.
Solve for L̂0 by optimizing the RPCA problem (1.2) with λ = 1/√n1.
2. Estimate the rank of L̂0 by

r̂0 = #{i : σ_i > 10^{−3} σ_1},

where σ_1, σ_2, ..., σ_{n2} are the singular values of L̂0.
3. Form L̃0 by using the rank-r̂0 approximation of L̂0. That is,

L̃0 = arg min_L ‖L − L̂0‖_F², s.t. rank(L) ≤ r̂0,

which is solved by SVD.
4. Construct a dictionary Â from L̃0 by normalizing the column vectors of L̃0:

[Â]_{:,i} = [L̃0]_{:,i} / ‖[L̃0]_{:,i}‖_2, i = 1, ..., n,

where [·]_{:,i} denotes the ith column of a matrix.
5. Solve for Z∗ by optimizing the LRR problem (1.3) with A = Â and λ = 1/√n1.
output: ÂZ∗.

4 Experiments

4.1 Results on Randomly Generated Matrices

We first verify the effectiveness of our Algorithm 1 on randomly generated matrices. We generate a collection of 200 × 1000 data matrices according to the model X = P_{Ω⊥}(L0) + P_Ω(S0): Ω is a support set chosen at random; L0 is created by sampling 200 data points from each of 5 randomly generated subspaces; S0 consists of random values from Bernoulli ±1. The dimension of each subspace varies from 1 to 20 with step size 1, and thus the rank of L0 varies from 5 to 100 with step size 5. The fraction |Ω|/(mn) varies from 2.5% to 50% with step size 2.5%.
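For reference, the generation model X = P_{Ω⊥}(L0) + P_Ω(S0) just described can be sketched in numpy for one setting of the swept parameters (the Gaussian subspace bases and the specific dimension/corruption values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, dim = 200, 1000, 5, 4   # dim: subspace dimension (swept 1..20 in the paper)
frac = 0.10                      # |Omega|/(mn)       (swept 2.5%..50% in the paper)

# L0: 200 points sampled from each of k randomly generated dim-dimensional subspaces.
blocks = [rng.standard_normal((m, dim)) @ rng.standard_normal((dim, n // k))
          for _ in range(k)]
L0 = np.hstack(blocks)           # rank(L0) = k * dim (generically)

# S0: Bernoulli +/-1 values on a random support Omega.
Omega = rng.random((m, n)) < frac
S0 = np.where(Omega, rng.choice([-1.0, 1.0], size=(m, n)), 0.0)

# Observed matrix: corrupted entries replace (not add to) the clean ones.
X = np.where(Omega, S0, L0)
```

Each recovery method is then run on X and judged by the relative error of its estimate of L0, sweeping dim and frac over the grids stated above.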
For each pair of rank and support size (r0, |Ω|), we run 10 trials, resulting in a total of 4000 (20 × 20 × 10) trials.

Figure 4: Algorithm 1 vs RPCA for the task of recovering randomly generated matrices, both using λ = 1/√n1. [Panels plot the corruption fraction |Ω|/(mn) (%) against rank(L0)/n2 for RPCA and Algorithm 1, together with the success boundaries of the two methods.] A curve shown in the third subfigure is the boundary for a method to be successful: the recovery is successful for any pair (r0/n2, |Ω|/(mn)) that locates below the curve. Here, a success means ‖L̂0 − L0‖_F < 0.05‖L0‖_F, where L̂0 denotes an estimate of L0.

Figure 4 compares our Algorithm 1 to RPCA, both using λ = 1/√n1. It can be seen that, using the learned dictionary matrix, Algorithm 1 works distinctly better than RPCA. In fact, the success area (i.e., the area of the white region) of our algorithm is 47% wider than that of RPCA! We should also mention that it is possible for RPCA to be exactly successful on coherent (or less incoherent) data, provided that the rank of L0 is low enough and/or S0 is sparse enough. Our algorithm in general improves RPCA when L0 is moderately low-rank and/or S0 is moderately sparse.

4.2 Results on Corrupted Motion Sequences

We now present our experiment with 11 additional sequences attached to the Hopkins155 [21] database. In those sequences, about 10% of the entries in the data matrix of trajectories are unobserved (i.e., missed) due to vision occlusion.
We replace each missed entry with a number drawn from Bernoulli ±1, resulting in a collection of corrupted trajectory matrices for evaluating the effectiveness of matrix recovery algorithms. We perform subspace clustering on both the corrupted trajectory matrices and their recovered versions, and use the clustering error rates produced by existing subspace clustering methods as the evaluation metrics. We consider three state-of-the-art subspace clustering methods: Shape Interaction Matrix (SIM) [5], Low-Rank Representation with A = X [14] (referred to as “LRRx”) and Sparse Subspace Clustering (SSC) [6].

Table 1: Clustering error rates (%) on 11 corrupted motion sequences.

Method | Mean | Median | Maximum | Minimum | Std. | Time (sec.)
SIM | 29.19 | 27.77 | 45.82 | 12.45 | 11.74 | 0.07
RPCA + SIM | 14.82 | 8.38 | 45.78 | 0.97 | 16.23 | 9.96
Algorithm 1 + SIM | 8.74 | 3.09 | 42.61 | 0.23 | 12.95 | 11.64
LRRx | 21.38 | 22.00 | 56.96 | 0.58 | 17.10 | 1.80
RPCA + LRRx | 15.63 | 3.05 | 46.25 | 0.20 | 10.70 | 10.75
Algorithm 1 + LRRx | 7.09 | 3.06 | 32.33 | 0.22 | 10.59 | 12.11
SSC | 22.81 | 20.78 | 58.24 | 1.55 | 18.46 | 3.18
RPCA + SSC | 9.50 | 2.13 | 50.32 | 0.61 | 16.17 | 12.51
Algorithm 1 + SSC | 5.74 | 1.85 | 27.84 | 0.20 | 8.52 | 13.11

Table 1 shows the error rates of the various algorithms. Without the preprocessing of matrix recovery, all the subspace clustering methods fail to accurately categorize the trajectories of motion objects, producing error rates higher than 20%. This illustrates that it is important for motion segmentation to correct the gross corruptions possibly existing in the data matrix of trajectories. By using RPCA (λ = 1/√n1) to correct the corruptions, the clustering performance of all considered methods is improved dramatically. For example, the error rate of SSC is reduced from 22.81% to 9.50%.
By choosing an appropriate dictionary for LRR (λ = 1/√n1), the error rates can be reduced again, from 9.50% to 5.74%, which is a 40% relative improvement. These results verify the effectiveness of our dictionary learning strategy in realistic environments.

5 Conclusion and Future Work

We have studied the problem of disentangling the low-rank and sparse components of a given data matrix. Whenever the low-rank component exhibits clustering structures, the state-of-the-art RPCA method could be less successful, because RPCA prefers incoherent data, which, however, may be inconsistent with real-world data. When the number of clusters becomes large, the second and third coherence parameters grow and hence the performance of RPCA can be degraded. We have shown that the challenges arising from coherent (equivalently, less incoherent) data can be effectively alleviated by learning a suitable dictionary under the LRR framework. Namely, when the dictionary matrix is low-rank and contains information about the ground truth matrix, LRR can be immune to the coherence parameters that increase with the underlying cluster number. Furthermore, we have established a practical algorithm that outperforms RPCA in our extensive experiments.

The problem of recovering coherent data essentially concerns the robustness issues of the Generalized PCA (GPCA) [22] problem. Although the classic GPCA problem has been explored for several decades, robust GPCA is new and has not been well studied. The approach proposed in this paper is in a sense preliminary, and it is possible to develop other effective methods for learning the dictionary matrix in LRR and for handling coherent data. We leave these as future work.

Acknowledgement

Guangcan Liu was a Postdoctoral Researcher supported by NSF-DMS0808864, NSF-SES1131848, NSF-EAGER1249316, AFOSR-FA9550-13-1-0137, and ONR-N00014-13-1-0764.
Ping Li is also partially supported by NSF-III1360971 and NSF-BIGDATA1419210.

References

[1] Liliana Borcea, Thomas Callaghan, and George Papanicolaou. Synthetic aperture radar imaging and motion estimation via robust principal component analysis. arXiv, 2012.

[2] Emmanuel Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

[3] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[4] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.

[5] João Costeira and Takeo Kanade. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159–179, 1998.

[6] E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 2790–2797, 2009.

[7] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.

[8] Martin Fischler and Robert Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[9] R. Gnanadesikan and J. R. Kettenring. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28(1):81–124, 1972.

[10] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.

[11] Qifa Ke and Takeo Kanade. Robust l1 norm factorization in the presence of outliers and missing data by alternative convex programming. In IEEE Conference on Computer Vision and Pattern Recognition, pages 739–746, 2005.

[12] Fernando De la Torre and Michael J. Black.
A framework for robust subspace learning. International Journal of Computer Vision, 54(1-3):117–142, 2003.

[13] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, 2013.

[14] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, pages 663–670, 2010.

[15] Guangcan Liu, Huan Xu, and Shuicheng Yan. Exact subspace segmentation and outlier detection by low-rank representation. Journal of Machine Learning Research - Proceedings Track, 22:703–711, 2012.

[16] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.

[17] Ricardo Otazo, Emmanuel Candès, and Daniel K. Sodickson. Low-rank and sparse matrix decomposition for accelerated dynamic MRI with separation of background and dynamic components. arXiv, 2012.

[18] YiGang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2233–2246, 2012.

[19] Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel Candès. Robust subspace clustering. arXiv:1301.2603, 2013.

[20] Nathan Srebro and Tommi Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Neural Information Processing Systems, pages 5–27, 2005.

[21] Roberto Tron and René Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.

[22] René Vidal, Yi Ma, and S. Sastry.
Generalized Principal Component Analysis. Springer Verlag, 2012.

[23] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. CoFiRank - maximum margin matrix factorization for collaborative ranking. In Neural Information Processing Systems, 2007.

[24] Huan Xu, Constantine Caramanis, and Shie Mannor. Outlier-robust PCA: The high-dimensional case. IEEE Transactions on Information Theory, 59(1):546–572, 2013.

[25] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Neural Information Processing Systems, 2010.

[26] Zhengdong Zhang, Arvind Ganesh, Xiao Liang, and Yi Ma. TILT: Transform invariant low-rank textures. International Journal of Computer Vision, 99(1):1–24, 2012.