{"title": "Non-convex Robust PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1107, "page_last": 1115, "abstract": "We propose a new provable method for robust PCA, where the task is to recover a low-rank matrix, which is corrupted with sparse perturbations. Our method consists of simple alternating projections onto the set of low rank and sparse matrices with intermediate de-noising steps. We prove correct recovery of the low rank and sparse components under tight recovery conditions, which match those for the state-of-art convex relaxation techniques. Our method is extremely simple to implement and has low computational complexity. For a $m \\times n$ input matrix (say m \\geq n), our method has O(r^2 mn\\log(1/\\epsilon)) running time, where $r$ is the rank of the low-rank component and $\\epsilon$ is the accuracy. In contrast, the convex relaxation methods have a running time O(mn^2/\\epsilon), which is not scalable to large problem instances. Our running time nearly matches that of the usual PCA (i.e. non robust), which is O(rmn\\log (1/\\epsilon)). Thus, we achieve ``best of both the worlds'', viz low computational complexity and provable recovery for robust PCA. Our analysis represents one of the few instances of global convergence guarantees for non-convex methods.", "full_text": "Provable Non-convex Robust PCA\n\nPraneeth Netrapalli 1\u2217 U N Niranjan2\u2217\n\nSujay Sanghavi3 Animashree Anandkumar2\n\n1Microsoft Research, Cambridge MA. 2The University of California at Irvine.\n\n3The University of Texas at Austin. 4Microsoft Research, India.\n\nPrateek Jain4\n\nAbstract\n\nWe propose a new method for robust PCA \u2013 the task of recovering a low-rank ma-\ntrix from sparse corruptions that are of unknown value and support. Our method\ninvolves alternating between projecting appropriate residuals onto the set of low-\nrank matrices, and the set of sparse matrices; each projection is non-convex but\neasy to compute. 
In spite of this non-convexity, we establish exact recovery of the low-rank matrix, under the same conditions that are required by existing methods (which are based on convex optimization). For an m×n input matrix (m ≤ n), our method has a running time of O(r²mn) per iteration, and needs O(log(1/ε)) iterations to reach an accuracy of ε. This is close to the running time of simple PCA via the power method, which requires O(rmn) per iteration, and O(log(1/ε)) iterations. In contrast, the existing methods for robust PCA, which are based on convex optimization, have O(m²n) complexity per iteration, and take O(1/ε) iterations, i.e., exponentially more iterations for the same accuracy.\nExperiments on both synthetic and real data establish the improved speed and accuracy of our method over existing convex implementations.\n\nKeywords: Robust PCA, matrix decomposition, non-convex methods, alternating projections.\n\n1 Introduction\n\nPrincipal component analysis (PCA) is a common procedure for preprocessing and denoising, where a low rank approximation to the input matrix (such as the covariance matrix) is carried out. Although PCA is simple to implement via eigen-decomposition, it is sensitive to the presence of outliers, since it attempts to “force fit” the outliers to the low rank approximation. To overcome this, the notion of robust PCA is employed, where the goal is to remove sparse corruptions from an input matrix and obtain a low rank approximation. Robust PCA has been employed in a wide range of applications, including background modeling [LHGT04], 3d reconstruction [MZYM11], robust topic modeling [Shi13], community detection [CSX12], and so on.\nConcretely, robust PCA refers to the following problem: given an input matrix M = L∗ + S∗, the goal is to decompose it into sparse S∗ and low rank L∗ matrices. 
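As a concrete illustration of this setting (a small NumPy sketch, not from the paper; the sizes and corruption magnitudes are arbitrary), the following builds M = L∗ + S∗ and shows that plain PCA, i.e. a truncated SVD of M, "force fits" the outliers, while the ground-truth pair reconstructs M:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 40, 2

# Low-rank component L* = U V^T and a sparse corruption S*
U = rng.normal(size=(m, r))
V = rng.normal(size=(n, r))
L_star = U @ V.T
S_star = np.zeros((m, n))
idx = rng.choice(m * n, size=40, replace=False)  # ~2% of entries corrupted
S_star.flat[idx] = rng.uniform(5, 10, size=40)   # large-magnitude outliers
M = L_star + S_star

# Plain PCA: best rank-r approximation of M via truncated SVD
Um, s, Vt = np.linalg.svd(M, full_matrices=False)
L_pca = (Um[:, :r] * s[:r]) @ Vt[:r]

# PCA "force fits" the outliers, so its estimate of L* has a noticeable error,
# while the true decomposition reconstructs M up to floating point.
print(np.linalg.norm(L_pca - L_star) / np.linalg.norm(L_star))  # noticeably > 0
print(np.linalg.norm(M - L_star - S_star))                      # ~0
```

Robust PCA replaces the single truncated SVD here with the alternating projections developed in this paper.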
The seminal works of [CSPW11, CLMW11] showed that this problem can be provably solved via convex relaxation methods, under some natural conditions on the low rank and sparse components. While the theory is elegant, in practice, convex techniques are expensive to run on a large scale and have poor convergence rates. Concretely, for decomposing an m×n matrix, say with m ≤ n, the best specialized implementations (typically first-order methods) have a per-iteration complexity of O(m²n), and require O(1/ε) iterations to achieve an error of ε. In contrast, the usual PCA, which carries out a rank-r approximation of the input matrix, has O(rmn) complexity per iteration – drastically smaller when r is much smaller than m, n. Moreover, PCA requires exponentially fewer iterations for convergence: an ε accuracy is achieved with only O(log(1/ε)) iterations (assuming a constant gap in singular values).\nIn this paper, we design a non-convex algorithm which is “best of both worlds” and bridges the gap between (the usual) PCA and convex methods for robust PCA. Our method has low computational complexity similar to PCA (i.e. scaling costs and convergence rates), and at the same time, has provable global convergence guarantees, similar to the convex methods. Proving global convergence for non-convex methods is an exciting recent development in machine learning. Non-convex alternating minimization techniques have recently shown success in many settings such as matrix completion [Kes12, JNS13, Har13], phase retrieval [NJS13], dictionary learning [AAJ+13], tensor decompositions for unsupervised learning [AGH+12], and so on. \n\n∗Part of the work done while interning at Microsoft Research, India\n\n
Our current work on the analysis of non-convex methods for robust PCA is an important addition to this growing list.\n\n1.1 Summary of Contributions\n\nWe propose a simple, intuitive algorithm for robust PCA with low per-iteration cost and a fast convergence rate. We prove tight guarantees for recovery of the sparse and low rank components, which match those for the convex methods. In the process, we derive novel matrix perturbation bounds, when subject to sparse perturbations. Our experiments reveal significant gains in terms of speed-ups over the convex relaxation techniques, especially as we scale the size of the input matrices.\nOur method consists of simple alternating (non-convex) projections onto low-rank and sparse matrices. For an m×n matrix, our method has a running time of O(r²mn log(1/ε)), where r is the rank of the low rank component L∗. Thus, our method has a linear convergence rate, i.e. it requires O(log(1/ε)) iterations to achieve an error of ε. When the rank r is small, this nearly matches the complexity of PCA (which is O(rmn log(1/ε))).\nWe prove recovery of the sparse and low rank components under a set of requirements which are tight and match those for the convex techniques (up to constant factors). In particular, under the deterministic sparsity model, where each row and each column of the sparse matrix S∗ has at most an α fraction of non-zeros, we require that α = O(1/(µ²r)), where µ is the incoherence factor (see Section 3).\nIn addition to strong theoretical guarantees, in practice, our method enjoys significant advantages over the state-of-the-art solver for (1), viz., the inexact augmented Lagrange multiplier (IALM) method [CLMW11]. Our method outperforms IALM in all instances, as we vary the sparsity levels, incoherence, and rank, in terms of running time to achieve a fixed level of accuracy. 
In addition, on a real dataset involving the standard task of foreground-background separation [CLMW11], our method is significantly faster and provides visually better separation.\n\nOverview of our techniques: Our proof technique involves establishing error contraction with each projection onto the sets of low rank and sparse matrices. We first describe the proof ideas when L∗ is rank one. The first projection step is a hard thresholding procedure on the input matrix M to remove large entries, and then we perform a rank-1 projection of the residual to obtain L(1). Standard matrix perturbation results (such as Davis-Kahan) provide ℓ₂ error bounds between the singular vectors of L(1) and L∗. However, these bounds do not suffice for establishing the correctness of our method. Since the next step in our method involves hard thresholding of the residual M − L(1), we require element-wise error bounds on our low rank estimate. Inspired by the approach of Erdős et al. [EKYY13], who obtain similar element-wise bounds for the eigenvectors of sparse Erdős–Rényi graphs, we derive these bounds by exploiting the fixed point characterization of the eigenvectors¹. A Taylor series expansion reveals that bounding the perturbation between the estimated and the true eigenvectors reduces to bounding walks in a graph whose adjacency matrix corresponds to (a subgraph of) the sparse component S∗. We then show that if the graph is sparse enough, then this perturbation can be controlled, and thus, the next thresholding step results in further error contraction. We use an induction argument to show that the sparse estimate is always contained in the true support of S∗, and that there is an error contraction in each step. 
For the case where L∗ has rank r > 1, our algorithm proceeds in several stages, where we progressively compute higher rank projections which alternate with the hard thresholding steps. In stage k = 1, 2, . . . , r, we compute rank-k projections, and show that after a sufficient number of alternating projections, we reduce the error to the level of the (k + 1)th singular value of L∗, using similar arguments as in the rank-1 case. We then proceed to performing rank-(k + 1) projections which alternate with hard thresholding. This stage-wise procedure is needed for ill-conditioned matrices, since we cannot hope to recover lower eigenvectors in the beginning when there are large perturbations. Thus, we establish global convergence guarantees for our proposed non-convex robust PCA method.\n\n¹If the input matrix M is not symmetric, we embed it in a symmetric matrix and consider the eigenvectors of the corresponding matrix.\n\n1.2 Related Work\n\nGuaranteed methods for robust PCA have received a lot of attention in the past few years, starting from the seminal works of [CSPW11, CLMW11], where they showed recovery of an incoherent low rank matrix L∗ through the following convex relaxation method:\n\nConv-RPCA:  min_{L,S} ‖L‖∗ + λ‖S‖₁,  s.t. M = L + S,  (1)\n\nwhere ‖L‖∗ denotes the nuclear norm of L (the sum of its singular values). A typical solver for this convex program involves projections onto ℓ₁ and nuclear norm balls (which are convex sets). Note that the convex method can be viewed as “soft” thresholding in the standard and spectral domains, while our method involves hard thresholding in these domains.\n[CSPW11] and [CLMW11] consider two different models of sparsity for S∗. Chandrasekaran et al. 
[CSPW11] consider a deterministic sparsity model, where each row and column of the m × n matrix S has at most an α fraction of non-zero entries. For guaranteed recovery, they require α = O(1/(µ²r√n)), where µ is the incoherence level of L∗, and r is its rank. Hsu et al. [HKZ11] improve upon this result to obtain guarantees for an optimal sparsity level of α = O(1/(µ²r)). This matches the requirements of our non-convex method for exact recovery. Note that when the rank r = O(1), this allows for a constant fraction of corrupted entries. Candès et al. [CLMW11] consider a different model with random sparsity and additional incoherence constraints, viz., they require ‖UV⊤‖∞ < µ√r/n. Note that our assumption of incoherence, viz., ‖U^(i)‖ < µ√(r/n), only yields ‖UV⊤‖∞ < µ²r/n. The additional assumption enables [CLMW11] to prove exact recovery with a constant fraction of corrupted entries, even when L∗ is nearly full-rank. We note that removing the ‖UV⊤‖∞ condition for robust PCA would imply solving the planted clique problem when the clique size is less than √n [Che13]. Thus, our recovery guarantees are tight up to constants without these additional assumptions.\nA number of works have considered modified models under the robust PCA framework, e.g. [ANW12, XCS12]. For instance, Agarwal et al. [ANW12] relax the incoherence assumption to a weaker “diffusivity” assumption, which bounds the magnitude of the entries in the low rank part, but incurs an additional approximation error. 
Xu et al. [XCS12] impose a special sparsity structure where a column can either be non-zero or fully zero.\nIn terms of state-of-the-art specialized solvers, [CLMW11] implements the inexact augmented Lagrange multipliers (IALM) method and provides guidelines for parameter tuning. Other related methods such as the multi-block alternating directions method of multipliers (ADMM) have also been considered for robust PCA, e.g. [WHML13]. Recently, a multi-step multi-block stochastic ADMM method was analyzed for this problem [SAJ14], and it requires 1/ε iterations to achieve an error of ε. In addition, its convergence rate is tight in terms of scaling with respect to the problem size (m, n) and the sparsity and rank parameters, under random noise models.\nThere is only one other work which considers a non-convex method for robust PCA [KC12]. However, their result holds only for significantly more restrictive settings and does not cover the deterministic sparsity assumption that we study. Moreover, the projection step in their method can have an arbitrarily large rank, so the running time is still O(m²n), which is the same as for the convex methods. In contrast, we have an improved running time of O(r²mn).\n\n2 Algorithm\n\nIn this section, we present our algorithm for the robust PCA problem. The robust PCA problem can be formulated as the following optimization problem: find L, S s.t. ‖M − L − S‖_F ≤ ε² and:\n\n²ε is the desired reconstruction error\n\nFigure 1: Illustration of alternating projections. The goal is to find a matrix L∗ which lies in the intersection of two sets: L = {set of rank-r matrices} and S_M = {M − S, where S is a sparse matrix}. Intuitively, our algorithm alternately projects onto the above two non-convex sets, while appropriately relaxing the rank and the sparsity levels.\n\n1. L lies in the set of low-rank matrices,\n2. 
S lies in the set of sparse matrices.\n\nA natural algorithm for the above problem is to iteratively project M − L onto the set of sparse matrices to update S, and then to project M − S onto the set of low-rank matrices to update L. Alternatively, one can view the problem as that of finding a matrix L in the intersection of the following two sets: a) L = {set of rank-r matrices}, b) S_M = {M − S, where S is a sparse matrix}. Note that these projections can be done efficiently, even though the sets are non-convex. Hard thresholding (HT) is employed for projections onto sparse matrices, and singular value decomposition (SVD) is used for projections onto low rank matrices.\nRank-1 case: We first describe our algorithm for the special case when L∗ is rank 1. Our algorithm performs an initial hard thresholding to remove very large entries from the input M. Note that if we performed the projection onto rank-1 matrices without the initial hard thresholding, we would not make any progress, since it would be subject to large perturbations. We alternate between computing the rank-1 projection of M − S, and performing hard thresholding on M − L to remove entries exceeding a certain threshold. This threshold is gradually decreased as the iterations proceed, and the algorithm is run for a certain number of iterations (which depends on the desired reconstruction error).\nGeneral rank case: When L∗ has rank r > 1, a naive extension of our algorithm consists of alternating projections onto rank-r matrices and sparse matrices. However, such a method has poor performance on ill-conditioned matrices. This is because after the initial thresholding of the input matrix M, the sparse corruptions in the residual are of the order of the top singular value (with the choice of threshold as specified in the algorithm). 
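The rank-1 case above can be sketched in a few lines of NumPy (a simplified illustration, not the authors' implementation; the precise threshold schedule and iteration count are given in Algorithm 1 below, and here the threshold simply decays geometrically):

```python
import numpy as np

def rank1_altproj_sketch(M, n_iters=30, beta=None):
    """Simplified rank-1 alternating projections: hard-threshold the
    residual, then take a rank-1 SVD projection. The threshold decays
    geometrically, loosely following the schedule described in the text."""
    m, n = M.shape
    if beta is None:
        beta = 1.0 / np.sqrt(n)
    # Initial hard thresholding to remove very large entries from M
    S = np.where(np.abs(M) >= beta * np.linalg.norm(M, 2), M, 0.0)
    L = np.zeros_like(M)
    for t in range(n_iters):
        # Rank-1 projection of the residual M - S
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = s[0] * np.outer(U[:, 0], Vt[0])
        # Hard-threshold M - L with a gradually decreasing threshold
        zeta = beta * s[0] * (0.5 ** t)
        S = np.where(np.abs(M - L) >= zeta, M - L, 0.0)
    return L, S
```

On a rank-1 L∗ with sufficiently sparse corruptions, the reconstruction error ‖M − L − S‖_F empirically contracts as the threshold decreases.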
When the lower singular values are much smaller, the corresponding singular vectors are subject to relatively large perturbations and thus, we cannot make progress in improving the reconstruction error. To alleviate the dependence on the condition number, we propose an algorithm that proceeds in stages. In the kth stage, the algorithm alternates between rank-k projections and hard thresholding for a certain number of iterations. We run the algorithm for r stages, where r is the rank of L∗. Intuitively, through this procedure, we recover the lower singular values only after the input matrix is sufficiently denoised, i.e. sparse corruptions at the desired level have been removed. Figure 1 shows a pictorial representation of the alternating projections in different stages.\nParameters: As can be seen, the only real parameter to the algorithm is β, used in thresholding, which represents the “spikiness” of L∗. That is, if the user expects L∗ to be “spiky” and the sparse part to be heavily diffused, then a higher value of β can be provided. In our implementation, we found that selecting β aggressively helped speed up recovery of our algorithm. In particular, we selected β = 1/√n.\nComplexity: The complexity of each iteration within a single stage is O(kmn), since it involves calculating the rank-k approximation³ of an m×n matrix (done e.g. via vanilla PCA). The number of iterations in each stage is O(log(1/ε)) and there are at most r stages. Thus the overall complexity of the entire algorithm is O(r²mn log(1/ε)). This is drastically lower than the best known bound of O(m²n/ε) for the convex methods, and just a factor r away from the complexity of vanilla PCA.\n\n³Note that we only require a rank-k approximation of the matrix rather than the actual singular vectors. Thus, the computational complexity has no dependence on the gap between the singular values.\n\nAlgorithm 1 (L̂, Ŝ) = AltProj(M, ε, r, β): Non-convex Alternating Projections based Robust PCA\n1: Input: Matrix M ∈ R^{m×n}, convergence criterion ε, target rank r, thresholding parameter β.\n2: P_k(A) denotes the best rank-k approximation of matrix A. HT_ζ(A) denotes hard-thresholding, i.e. (HT_ζ(A))_ij = A_ij if |A_ij| ≥ ζ and 0 otherwise.\n3: Set initial threshold ζ₀ ← βσ₁(M).\n4: L(0) = 0, S(0) = HT_{ζ₀}(M − L(0))\n5: for Stage k = 1 to r do\n6:   for Iteration t = 0 to T = 10 log(nβ‖M − S(0)‖₂/ε) do\n7:     Set threshold ζ as ζ = β(σ_{k+1}(M − S(t)) + (1/2)^t σ_k(M − S(t)))   (2)\n8:     L(t+1) = P_k(M − S(t))\n9:     S(t+1) = HT_ζ(M − L(t+1))\n10:  end for\n11:  if βσ_{k+1}(L(t+1)) < ε/(2n) then\n12:    Return: L(T), S(T)   /* Return rank-k estimate if remaining part has small norm */\n13:  else\n14:    S(0) = S(T)   /* Continue to the next stage */\n15:  end if\n16: end for\n17: Return: L(T), S(T)\n\n3 Analysis\n\nIn this section, we present our main result on the correctness of AltProj. 
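The stage-wise procedure of Algorithm 1 can be transcribed into a short script (a sketch under simplifying assumptions, not the authors' Matlab implementation; the stage-exit test here checks σ_{k+1}(M − S), the spectral mass left after removing the current sparse estimate, and the default β is the aggressive 1/√n choice rather than the theoretical 4µ²r/√(mn)):

```python
import numpy as np

def alt_proj(M, eps, r, beta=None):
    """Sketch of AltProj: stage-wise alternating projections for robust PCA.
    P_k = best rank-k approximation via SVD; HT_zeta = hard thresholding."""
    m, n = M.shape
    if beta is None:
        beta = 1.0 / np.sqrt(n)  # aggressive choice discussed in Section 2

    def P_k(A, k):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return (U[:, :k] * s[:k]) @ Vt[:k]

    def HT(A, zeta):  # (HT_zeta(A))_ij = A_ij if |A_ij| >= zeta, else 0
        return np.where(np.abs(A) >= zeta, A, 0.0)

    def sigma(A, i):  # i-th largest singular value (1-indexed)
        s = np.linalg.svd(A, compute_uv=False)
        return s[i - 1] if i <= s.size else 0.0

    L = np.zeros_like(M)
    S = HT(M, beta * sigma(M, 1))  # initial threshold zeta_0 = beta * sigma_1(M)
    T = int(np.ceil(10 * np.log(max(n * beta * np.linalg.norm(M - S, 2) / eps, 2.0))))
    for k in range(1, r + 1):
        for t in range(T + 1):
            zeta = beta * (sigma(M - S, k + 1) + 0.5 ** t * sigma(M - S, k))
            L = P_k(M - S, k)
            S = HT(M - L, zeta)
        if beta * sigma(M - S, k + 1) < eps / (2 * n):
            break  # remaining part has small norm: return the rank-k estimate
    return L, S
```

Each inner iteration costs a few partial SVDs, matching the O(kmn)-per-iteration accounting above when a rank-k (rather than full) decomposition is used.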
We assume the following conditions:\n\n(L1) Rank of L∗ is at most r.\n(L2) L∗ is µ-incoherent, i.e., if L∗ = U∗Σ∗(V∗)⊤ is the SVD of L∗, then ‖(U∗)^(i)‖₂ ≤ µ√r/√m, ∀1 ≤ i ≤ m, and ‖(V∗)^(i)‖₂ ≤ µ√r/√n, ∀1 ≤ i ≤ n, where (U∗)^(i) and (V∗)^(i) denote the ith rows of U∗ and V∗ respectively.\n(S1) Each row and column of S∗ have at most an α fraction of non-zero entries, such that α ≤ 1/(512µ²r).\n\nNote that in general, it is not possible to have a unique recovery of the low-rank and sparse components. For example, if the input matrix M is both sparse and low rank, then there is no unique decomposition (e.g. M = e₁e₁⊤). The above conditions ensure uniqueness of the matrix decomposition problem.\nAdditionally, the parameter β in Algorithm 1 is set as β = 4µ²r/√(mn).\nWe now establish that our proposed algorithm recovers the low rank and sparse components under the above conditions.\nTheorem 1 (Noiseless Recovery). Under conditions (L1), (L2) and (S1), and the choice of β as above, the outputs L̂ and Ŝ of Algorithm 1 satisfy:\n\n‖L̂ − L∗‖_F ≤ ε,  ‖Ŝ − S∗‖∞ ≤ ε/√(mn),  and Supp(Ŝ) ⊆ Supp(S∗).\n\nRemark (tight recovery conditions): Our result is tight up to constants, in terms of the allowable sparsity level under the deterministic sparsity model. In other words, if we exceed the sparsity limit imposed in (S1), it is possible to construct instances where there is no unique decomposition⁴. 
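Condition (L2) is straightforward to check numerically. The helper below (an illustrative sketch; `incoherence` is a name introduced here, not from the paper) returns the smallest µ for which a given matrix satisfies (L2):

```python
import numpy as np

def incoherence(L, r=None):
    """Smallest mu such that L is mu-incoherent in the sense of (L2):
    every row of U (resp. V) has norm at most mu*sqrt(r/m) (resp. mu*sqrt(r/n))."""
    m, n = L.shape
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    if r is None:
        r = int((s > 1e-10 * s[0]).sum())  # numerical rank
    mu_U = np.linalg.norm(U[:, :r], axis=1).max() * np.sqrt(m / r)
    mu_V = np.linalg.norm(Vt[:r].T, axis=1).max() * np.sqrt(n / r)
    return max(mu_U, mu_V)

# A rank-1 matrix supported on half of the coordinates: the rows carrying
# the block concentrate the singular vector mass, giving mu = sqrt(2).
n = 8
u = np.zeros(n)
u[: n // 2] = 1.0
L = np.outer(u, u)
print(incoherence(L))  # ≈ 1.4142 (mu = sqrt(2))
```

The block-diagonal example in the footnote below Theorem 1 is exactly the µ = 1 extreme of this computation.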
⁴For instance, consider the n × n matrix which has r copies of the all-ones matrix, each of size n/r, placed across the diagonal. We see that this matrix has rank r and is incoherent with parameter µ = 1. Note that a fraction of α = O(1/r) sparse perturbations suffice to erase one of these blocks, making it impossible to recover the matrix.\n\nOur conditions (L1), (L2) and (S1) also match the conditions required by the convex method for recovery, as established in [HKZ11].\nRemark (convergence rate): Our method has a linear rate of convergence, i.e. it requires O(log(1/ε)) iterations to achieve an error of ε, and hence we provide a strongly polynomial method for robust PCA. In contrast, the best known bound for convex methods for robust PCA is O(1/ε) iterations to converge to an ε-approximate solution.\nTheorem 1 provides recovery guarantees assuming that L∗ is exactly rank-r. However, in several real-world scenarios, L∗ can be nearly rank-r. Our algorithm can handle such situations, where M = L∗ + N∗ + S∗, with N∗ being additive noise. Theorem 1 is a special case of the following theorem, which provides recovery guarantees when N∗ has small ℓ∞ norm.\nTheorem 2 (Noisy Recovery). Under conditions (L1), (L2) and (S1), and the choice of β as in Theorem 1, when the noise satisfies ‖N∗‖∞ ≤ σ_r(L∗)/(100n), the outputs L̂, Ŝ of Algorithm 1 satisfy:\n\n‖L̂ − L∗‖_F ≤ ε + 2µ²r(7‖N∗‖₂ + (8√(mn)/√r)‖N∗‖∞),\n‖Ŝ − S∗‖∞ ≤ ε/√(mn) + (2µ²r/√(mn))(7‖N∗‖₂ + (8√(mn)/√r)‖N∗‖∞), and Supp(Ŝ) ⊆ Supp(S∗).\n\n3.1 Proof Sketch\n\nWe now present the key steps in the proof of Theorem 1. A detailed proof is provided in the appendix.\nStep I: Reduce to the symmetric case, while maintaining incoherence of L∗ and sparsity of S∗. Using standard symmetrization arguments, we can reduce the problem to the symmetric case, where all the matrices involved are symmetric. See the appendix for details on this step.\nStep II: Show decay in ‖L − L∗‖∞ after projection onto the set of rank-k matrices. The t-th iterate L(t+1) of the k-th stage is given by L(t+1) = P_k(L∗ + S∗ − S(t)). Hence, L(t+1) is obtained by using the top principal components of a perturbation of L∗ given by L∗ + (S∗ − S(t)). The key step in our analysis is to show that when an incoherent and low-rank L∗ is perturbed by a sparse matrix S∗ − S(t), then ‖L(t+1) − L∗‖∞ is small and is much smaller than ‖S∗ − S(t)‖∞. The following lemma formalizes this intuition; see the appendix for a detailed proof.\nLemma 1. Let L∗, S∗ be symmetric and satisfy the assumptions of Theorem 1, and let S(t) and L(t) be the tth iterates of the kth stage of Algorithm 1. Let σ∗_1, . . . , σ∗_n be the eigenvalues of L∗, s.t. |σ∗_1| ≥ ··· ≥ |σ∗_r|. Then, the following holds:\n\n‖L(t+1) − L∗‖∞ ≤ (2µ²r/n)(|σ∗_{k+1}| + (1/2)^t |σ∗_k|),\n‖S∗ − S(t+1)‖∞ ≤ (8µ²r/n)(|σ∗_{k+1}| + (1/2)^t |σ∗_k|), and Supp(S(t+1)) ⊆ Supp(S∗).\n\nMoreover, the outputs L̂ and Ŝ of Algorithm 1 satisfy:\n\n‖L̂ − L∗‖_F ≤ ε,  ‖Ŝ − S∗‖∞ ≤ ε/n,  and Supp(Ŝ) ⊆ Supp(S∗).\n\nStep III: Show decay in ‖S − S∗‖∞ after projection onto the set of sparse matrices. We next show that if ‖L(t+1) − L∗‖∞ is much smaller than ‖S(t) − S∗‖∞, then the iterate S(t+1) also has a much smaller error (w.r.t. S∗) than S(t). The lemma above formally provides this error bound.\nStep IV: Recurse the argument. We have now reduced the ℓ∞ norm of the sparse part by a factor of half, while maintaining its sparsity. We can now go back to Steps II and III and repeat the arguments for subsequent iterations.\n\nFigure 2: Comparison of AltProj and IALM on synthetic datasets. (a) Running time of AltProj and IALM with varying α. (b) Maximum rank of the intermediate iterates of IALM. (c) Running time of AltProj and IALM with varying µ. 
(d) Running time of AltProj and IALM with varying r.\n\n4 Experiments\n\nWe now present an empirical study of our AltProj method. The goal of this study is two-fold: a) establish that our method indeed recovers the low-rank and sparse parts exactly, without significant parameter tuning, b) demonstrate that AltProj is significantly faster than Conv-RPCA (see (1)); we solve Conv-RPCA using the IALM method [CLMW11], a state-of-the-art solver [LCM10]. We implemented our method in Matlab and used a Matlab implementation of the IALM method by [LCM10].\nWe consider both synthetic experiments and experiments on real data involving the problem of foreground-background separation in a video. Each of our results for synthetic datasets is averaged over 5 runs.\nParameter Setting: Our pseudo-code (Algorithm 1) prescribes the threshold ζ in Step 4, which depends on knowledge of the singular values of the low rank component L∗. Instead, in the experiments, we set the threshold at the (t + 1)-th step of the k-th stage as ζ = µσ_{k+1}(M − S(t))/√n. For synthetic experiments, we employ the µ used for data generation, and for real-world datasets, we tune µ through cross-validation. We found that the above thresholding provides exact recovery while speeding up the computation significantly. We would also like to note that [CLMW11] sets the regularization parameter λ in Conv-RPCA (1) as 1/√n (assuming m ≤ n). However, we found that for problems with large incoherence such a parameter setting does not provide exact recovery. Instead, we set λ = µ/√n in our experiments.\nSynthetic datasets: Following the experimental setup of [CLMW11], the low-rank part L∗ = UV⊤ is generated using normally distributed U ∈ R^{m×r}, V ∈ R^{n×r}. 
Similarly, supp(S∗) is generated by sampling a uniformly random subset of [m]×[n] with size ‖S∗‖₀, and each non-zero S∗_ij is drawn i.i.d. from the uniform distribution over [r/(2√(mn)), r/√(mn)]. For increasing the incoherence of L∗, we randomly zero-out rows of U, V and then re-normalize them.\nThere are three key problem parameters for RPCA with a fixed matrix size: a) sparsity of S∗, b) incoherence of L∗, c) rank of L∗. We investigate the performance of both AltProj and IALM by varying each of the three parameters while fixing the others. In our plots (see Figure 2), we report the computational time required by each of the two methods for decomposing M into L + S up to a relative error (‖M − L − S‖_F/‖M‖_F) of 10⁻³. Figure 2 shows that AltProj scales significantly better than IALM for increasingly dense S∗. We attribute this observation to the fact that as ‖S∗‖₀ increases, the problem is “harder” and the intermediate iterates of IALM have ranks significantly larger than r. Our intuition is confirmed by Figure 2 (b), which shows that when the density (α) of S∗ is 0.4, the intermediate iterates of IALM can have rank over 500 while the rank of L∗ is only 5. We observe a similar trend for the other parameters, i.e., AltProj scales significantly better than IALM with increasing incoherence parameter µ (Figure 2 (c)) and increasing rank (Figure 2 (d)). See Appendix C for additional plots.\nReal-world datasets: Next, we apply our method to the problem of foreground-background (F-B) separation in a video [LHGT04]. The observed matrix M is formed by vectorizing each frame and stacking them column-wise. Intuitively, the background in a video is the static part and hence forms a low-rank component, while the foreground is a dynamic but sparse perturbation.\nHere, we used two benchmark datasets named Escalator and Restaurant. The Escalator dataset has 3417 frames at a resolution of 160 × 130. We first applied the standard PCA method for extracting the low-rank part. Figure 3 (b) shows the extracted background from the video.\n\n(Settings for the plots in Figure 2: (a), (b): n = 2000, r = 5, µ = 1; (c): n = 2000, r = 10, nα = 100; (d): n = 2000, µ = 1, nα = 3r.)\n\nFigure 3: Foreground-background separation in the Escalator video. (a): Original image frame. (b): Best rank-10 approximation; time taken is 3.1s. (c): Low-rank frame obtained using AltProj; time taken is 63.2s. (d): Low-rank frame obtained using IALM; time taken is 1688.9s.\n\nFigure 4: Foreground-background separation in the Restaurant video. (a): Original frame from the video. (b): Best rank-10 approximation (using PCA) of the original frame; 2.8s were required to compute the solution. (c): Low-rank part obtained using AltProj; computational time required by AltProj was 34.9s. (d): Low-rank part obtained using IALM; 693.2s required by IALM to compute the low-rank+sparse decomposition.\n\nThere are several artifacts (shadows of people near the escalator) that are not desirable. In contrast, both IALM and AltProj obtain significantly better F-B separation (see Figure 3 (c), (d)). Interestingly, AltProj removes the steps of the escalator, which are moving and arguably part of the dynamic foreground, while IALM keeps the steps in the background part. 
Also, our method is significantly faster: AltProj takes 63.2s, which is about 26 times faster than IALM's 1688.9s.
Restaurant dataset: Figure 4 shows the comparison of AltProj and IALM on a subset of the "Restaurant" dataset, where we consider the last 2055 frames at a resolution of 120×160. AltProj was around 19 times faster than IALM. Moreover, visually, the background extraction appears to be of better quality (for example, notice the blur near the top-corner counter in the IALM solution). Figure 4 (b) shows the PCA solution, which suffers from a similar blur at the top corner of the image, while the background frame extracted by AltProj does not have any noticeable artifacts.
5 Conclusion
In this work, we proposed a non-convex method for robust PCA, which consists of alternating projections onto the sets of low-rank and sparse matrices. We established global convergence of our method under conditions which match those for convex methods. At the same time, our method has much faster running times and superior experimental performance. This work opens up a number of interesting questions for future investigation. While we match the convex methods under the deterministic sparsity model, studying the random sparsity model is of interest. Our noisy recovery results assume deterministic noise; improving the results under random noise remains to be investigated. There are many decomposition problems beyond the robust PCA setting, e.g., structured sparsity models, the robust tensor PCA problem, and so on. It would be interesting to see whether global convergence can be established for non-convex methods in these settings.
Acknowledgements
AA and UN would like to acknowledge NSF grant CCF-1219234, ONR N00014-14-1-0665, and a Microsoft faculty fellowship. SS would like to acknowledge NSF grants 1302435, 0954059, 1017525 and DTRA grant HDTRA1-13-1-0024.
PJ would like to acknowledge Nikhil Srivastava and Deeparnab Chakrabarty for several insightful discussions during the course of the project.

References

[AAJ+13] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. arXiv:1310.7991, Oct. 2013.
[AGH+12] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor methods for learning latent variable models. arXiv:1210.7559, Oct. 2012.
[ANW12] A. Agarwal, S. Negahban, and M. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.
[Bha97] Rajendra Bhatia. Matrix Analysis. Springer, 1997.
[Che13] Y. Chen. Incoherence-optimal matrix completion. ArXiv e-prints, October 2013.
[CLMW11] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
[CSPW11] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
[CSX12] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in Neural Information Processing Systems, pages 2204–2212, 2012.
[EKYY13] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of Erdős–Rényi graphs I: Local semicircle law. The Annals of Probability, 2013.
[Har13] Moritz Hardt. On the provable convergence of alternating minimization for matrix completion. arXiv:1312.0925, 2013.
[HKZ11] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 2011.
[JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization.
In STOC, 2013.
[KC12] Anastasios Kyrillidis and Volkan Cevher. Matrix ALPS: Accelerated low rank and sparse matrix reconstruction. In SSP Workshop, 2012.
[Kes12] Raghunandan H. Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 2012.
[LCM10] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055, 2010.
[LHGT04] Liyuan Li, Weimin Huang, I. Y.-H. Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 2004.
[MZYM11] Hossein Mobahi, Zihan Zhou, Allen Y. Yang, and Yi Ma. Holistic 3D reconstruction of urban structures from low-rank textures. In ICCV Workshops, pages 593–600, 2011.
[NJS13] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In NIPS, pages 2796–2804, 2013.
[SAJ14] H. Sedghi, A. Anandkumar, and E. Jonckheere. Guarantees for stochastic ADMM in high dimensions. Preprint, Feb. 2014.
[Shi13] Lei Shi. Sparse additive text models with low rank background. In Advances in Neural Information Processing Systems, pages 172–180, 2013.
[WHML13] X. Wang, M. Hong, S. Ma, and Z. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv:1308.5294, 2013.
[XCS12] Huan Xu, Constantine Caramanis, and Sujay Sanghavi.
Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.