{"title": "Fast Algorithms for Robust PCA via Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 4152, "page_last": 4160, "abstract": "We consider the problem of Robust PCA in the fully and partially observed settings. Without corruptions, this is the well-known matrix completion problem. From a statistical standpoint this problem has been recently well-studied, and conditions on when recovery is possible (how many observations do we need, how many corruptions can we tolerate) via polynomial-time algorithms is by now understood. This paper presents and analyzes a non-convex optimization approach that greatly reduces the computational complexity of the above problems, compared to the best available algorithms. In particular, in the fully observed case, with $r$ denoting rank and $d$ dimension, we reduce the complexity from $O(r^2d^2\\log(1/\\epsilon))$ to $O(rd^2\\log(1/\\epsilon))$ -- a big savings when the rank is big. For the partially observed case, we show the complexity of our algorithm is no more than $O(r^4d\\log(d)\\log(1/\\epsilon))$. Not only is this the best-known run-time for a provable algorithm under partial observation, but in the setting where $r$ is small compared to $d$, it also allows for near-linear-in-$d$ run-time that can be exploited in the fully-observed case as well, by simply running our algorithm on a subset of the observations.", "full_text": "Fast Algorithms for Robust PCA via Gradient\n\nDescent\n\nXinyang Yi\u2217 Dohyung Park\u2217 Yudong Chen\u2020 Constantine Caramanis\u2217\n\u2020Cornell University\n\u2217The University of Texas at Austin\n\u2217{yixy,dhpark,constantine}@utexas.edu\n\u2020yudong.chen@cornell.edu\n\nAbstract\n\nWe consider the problem of Robust PCA in the fully and partially observed set-\ntings. 
Without corruptions, this is the well-known matrix completion problem. From a statistical standpoint this problem has recently been well studied, and the conditions under which recovery is possible (how many observations do we need, how many corruptions can we tolerate) via polynomial-time algorithms are by now understood. This paper presents and analyzes a non-convex optimization approach that greatly reduces the computational complexity of the above problems, compared to the best available algorithms. In particular, in the fully observed case, with r denoting rank and d dimension, we reduce the complexity from O(r^2 d^2 log(1/ε)) to O(r d^2 log(1/ε)) -- a big savings when the rank is large. For the partially observed case, we show the complexity of our algorithm is no more than O(r^4 d log d log(1/ε)). Not only is this the best-known run-time for a provable algorithm under partial observation, but in the setting where r is small compared to d, it also allows for near-linear-in-d run-time that can be exploited in the fully-observed case as well, by simply running our algorithm on a subset of the observations.

1 Introduction

Principal component analysis (PCA) aims to find a low-rank subspace that best approximates a data matrix Y ∈ R^{d1×d2}. The simple and standard method of PCA by singular value decomposition (SVD) fails in many modern data problems due to missing and corrupted entries, as well as the sheer scale of the problem. Indeed, SVD is highly sensitive to outliers by virtue of the squared-error criterion it minimizes. Moreover, its running time scales as O(rd^2) to recover a rank-r approximation of a d-by-d matrix.

While there have been recent results developing provably robust algorithms for PCA (e.g., [5, 26]), the running times range from O(r^2 d^2) to O(d^3) and hence are significantly worse than SVD.
Meanwhile, the literature developing sub-quadratic algorithms for PCA (e.g., [15, 14, 3]) seems unable to guarantee robustness to outliers or missing data.

Our contribution lies precisely in this area: provably robust algorithms for PCA with improved run-time. Specifically, we provide an efficient algorithm with running time that matches SVD while nearly matching the best-known robustness guarantees. In the case where rank is small compared to dimension, we develop an algorithm with running time that is nearly linear in the dimension. This last algorithm works by subsampling the data, and therefore we also show that our algorithm solves the Robust PCA problem with partial observations (a generalization of matrix completion and Robust PCA).

1.1 The Model and Related Work

We consider the following setting for robust PCA. Suppose we are given a matrix Y ∈ R^{d1×d2} that has decomposition Y = M* + S*, where M* is a rank-r matrix and S* is a sparse corruption matrix containing entries with arbitrary magnitude. The goal is to recover M* and S* from Y. To ease notation, we let d1 = d2 = d in the remainder of this section.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Provable solutions for this model are first provided in the works of [9] and [5]. They propose to solve this problem by convex relaxation:

\[
\min_{M,S}\ |||M|||_{\mathrm{nuc}} + \lambda \|S\|_1, \quad \text{s.t. } Y = M + S, \tag{1}
\]

where |||M|||_nuc denotes the nuclear norm of M. Despite analyzing the same method, the corruption models in [5] and [9] differ. In [5], the authors consider the setting where the entries of M* are corrupted at random with probability α. They show their method succeeds in exact recovery with α as large as 0.1, which indicates they can tolerate a constant fraction of corruptions.
Work in [9] considers a deterministic corruption model, where the nonzero entries of S* can have arbitrary positions, but the sparsity of each row and column does not exceed αd. They prove that exact recovery is possible with α = O(1/(µr√d)). This was subsequently further improved to α = O(1/(µr)), which is in fact optimal [11, 18]. Here, µ represents the incoherence of M* (see Section 2 for details). In this paper, we follow this latter line and focus on the deterministic corruption model.

The state-of-the-art solver [20] for (1) has time complexity O(d^3/ε) to achieve error ε, and is thus much slower than SVD, and prohibitive for even modest values of d. Work in [21] considers the deterministic corruption model, and improves this running time without sacrificing the robustness guarantee on α. They propose an alternating projection (AltProj) method to estimate the low-rank and sparse structures iteratively and simultaneously, and show their algorithm has complexity O(r^2 d^2 log(1/ε)), which is faster than the convex approach but still slower than SVD.

Non-convex approaches have recently seen numerous developments for applications in low-rank estimation, including alternating minimization (see e.g. [19, 17, 16]) and gradient descent (see e.g. [4, 12, 23, 24, 29, 30]). These works have fast running times, yet do not provide robustness guarantees. One exception is [12], where the authors analyze a row-wise ℓ1 projection method for recovering S*. Their analysis hinges on M* being positive semidefinite, and the algorithm requires prior knowledge of the ℓ1 norm of every row of S* and is thus prohibitive in practice. Another exception is [16], which analyzes alternating minimization plus an overall sparse projection. Their algorithm is shown to tolerate a corruption fraction of at most α = O(1/(µ^{2/3} r^{2/3} d)).
As we discuss in Section 1.2, we can allow S* to have much higher sparsity, α = O(1/(µ r^{1.5})), which is close to optimal. It is worth mentioning other works that obtain provable guarantees for non-convex algorithms or problems, including phase retrieval [6, 13, 28], EM algorithms [2, 25, 27], tensor decompositions [1] and second-order methods [22]. It would be interesting to bring robustness considerations to these works.

1.2 Our Contributions

In this paper, we develop efficient non-convex algorithms for robust PCA. We propose a novel algorithm based on the projected gradient method on the factorized space. We also extend it to solve robust PCA in the setting with partial observations, i.e., in addition to gross corruptions, the data matrix has a large number of missing values. Our main contributions are summarized as follows.¹

1. We propose a novel sparse estimator for the setting of deterministic corruptions. For the low-rank structure to be identifiable, it is natural to assume that deterministic corruptions are "spread out" (no more than some number in each row/column). We leverage this information in a simple but critical algorithmic idea that is tied to the ultimate complexity advantages our algorithm delivers.

2. Based on the proposed sparse estimator, we propose a projected gradient method on the matrix factorized space. While non-convex, the algorithm is shown to enjoy linear convergence under proper initialization. Along with a new initialization method, we show that robust PCA can be solved within complexity O(r d^2 log(1/ε)) while ensuring robustness α = O(1/(µ r^{1.5})). Our algorithm is thus faster than the best previously known algorithm by a factor of r, and enjoys superior empirical performance as well.

3. Algorithms for Robust PCA with partial observations still rely on a computationally expensive convex approach, as apparently this problem has evaded treatment by non-convex methods.
We consider precisely this problem. In a nutshell, we show that our gradient method succeeds (it is guaranteed to produce the subspace of M*) even when run on no more than O(µ^2 r^2 d log d) random entries of Y. The computational cost is O(µ^3 r^4 d log d log(1/ε)). When the rank r is small compared to the dimension d, this dramatically improves on our bound above, as our cost becomes nearly linear in d. We show, moreover, that this savings and robustness to erasures comes at no cost in the robustness guarantee for the deterministic (gross) corruptions. While this demonstrates our algorithm is robust to both outliers and erasures, it also provides a way to reduce computational costs even in the fully observed setting, when r is small.

¹To ease presentation, the discussion here assumes M* has constant condition number, whereas our results below show the dependence on condition number explicitly.

4. An immediate corollary of the above result provides a guarantee for exact matrix completion, with general rectangular matrices, using O(µ^2 r^2 d log d) observed entries and O(µ^3 r^4 d log d log(1/ε)) time, thereby improving on existing results in [12, 23].

Notation. For any index set Ω ⊆ [d1]×[d2], we let Ω_(i,·) := {(i, j) ∈ Ω | j ∈ [d2]} and Ω_(·,j) := {(i, j) ∈ Ω | i ∈ [d1]}. For any matrix A ∈ R^{d1×d2}, we denote its projection onto support Ω by Π_Ω(A), i.e., the (i, j)-th entry of Π_Ω(A) is equal to A_(i,j) if (i, j) ∈ Ω and zero otherwise. The i-th row and j-th column of A are denoted by A_(i,·) and A_(·,j). The (i, j)-th entry is denoted A_(i,j). The operator norm of A is |||A|||_op; the Frobenius norm of A is |||A|||_F.
The ℓa/ℓb norm of A is denoted |||A|||_{b,a}, i.e., the ℓa norm of the vector formed by the ℓb norms of the rows. For instance, ‖A‖_{2,∞} stands for max_{i∈[d1]} ‖A_(i,·)‖_2.

2 Problem Setup

We consider the problem where we observe a matrix Y ∈ R^{d1×d2} that satisfies Y = M* + S*, where M* has rank r, and S* is a corruption matrix with sparse support. Our goal is to recover M* and S*. In the partially observed setting, in addition to sparse corruptions, we have erasures. We assume that each entry of M* + S* is revealed independently with probability p ∈ (0, 1). In particular, for any (i, j) ∈ [d1]×[d2], we consider the Bernoulli model where

\[
Y_{(i,j)} = \begin{cases} (M^* + S^*)_{(i,j)}, & \text{with probability } p; \\ *, & \text{otherwise.} \end{cases} \tag{2}
\]

We denote the support of Y by Φ = {(i, j) | Y_(i,j) ≠ *}. Note that we assume S* is not adaptive to Φ. As is well understood thanks to work in matrix completion, this task is impossible in general -- we need to guarantee that M* is not both low-rank and sparse.
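As a concrete instance of this setup, one might generate synthetic data following the model above (and mirroring the synthetic experiments of Section 5); this is an illustrative sketch, not the authors' code, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha, p = 500, 5, 0.05, 0.3  # dimension, rank, corruption fraction, sampling rate

# Low-rank part M* = A B^T with Gaussian factors of variance 1/d.
A = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, r))
B = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, r))
M_star = A @ B.T

# Sparse corruption S*: each entry nonzero with probability alpha,
# magnitudes uniform in [-5r/d, 5r/d].
S_star = rng.uniform(-5 * r / d, 5 * r / d, size=(d, d)) * (rng.random((d, d)) < alpha)

# Bernoulli observation model (2): each entry revealed independently w.p. p;
# np.nan plays the role of the erasure symbol "*".
mask = rng.random((d, d)) < p
Y = np.where(mask, M_star + S_star, np.nan)
```

The unobserved entries are marked rather than zero-filled, so that downstream code can distinguish an erased entry from an observed zero.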
To avoid such identifiability issues, we make the following standard assumptions on M* and S*: (i) M* is not near-sparse or "spiky." We impose this by requiring M* to be µ-incoherent: given a singular value decomposition (SVD) M* = L*Σ*R*^⊤, we assume that

\[
\|L^*\|_{2,\infty} \le \sqrt{\frac{\mu r}{d_1}}, \qquad \|R^*\|_{2,\infty} \le \sqrt{\frac{\mu r}{d_2}}.
\]

(ii) The entries of S* are "spread out": for α ∈ [0, 1), we assume S* ∈ S_α, where

\[
\mathcal{S}_\alpha := \left\{ A \in \mathbb{R}^{d_1 \times d_2} \,\middle|\, \|A_{(i,\cdot)}\|_0 \le \alpha d_2 \text{ for all } i \in [d_1];\ \|A_{(\cdot,j)}\|_0 \le \alpha d_1 \text{ for all } j \in [d_2] \right\}. \tag{3}
\]

In other words, S* contains at most an α-fraction of nonzero entries per row and column.

3 Algorithms

For both the full and partial observation settings, our method proceeds in two phases. In the first phase, we use a new sorting-based sparse estimator to produce a rough estimate S_init of S* based on the observed matrix Y, and then find a rank-r matrix, factorized as U_0 V_0^⊤, that is a rough estimate of M* by performing SVD on (Y − S_init). In the second phase, given (U_0, V_0), we perform an iterative method to produce a series {(U_t, V_t)}_{t=0}^∞. In each step t, we first apply our sparse estimator to produce a sparse matrix S_t based on (U_t, V_t), and then perform a projected gradient descent step on the low-rank factorized space to produce (U_{t+1}, V_{t+1}). This flow is the same for full and partial observations, though a few details differ. Algorithm 1 gives the full observation algorithm, and Algorithm 2 gives the partial observation algorithm. We now describe the key details of each algorithm.

Sparse Estimation.
A natural idea is to keep those entries of the residual matrix Y − M that have large magnitude. At the same time, we need to make use of the dispersed property of S_α: every column and row contains at most an α-fraction of nonzero entries. Motivated by these two principles, we introduce the following sparsification operator. For any matrix A ∈ R^{d1×d2} and all (i, j) ∈ [d1]×[d2], we let

\[
\mathcal{T}_\alpha[A]_{(i,j)} := \begin{cases} A_{(i,j)}, & \text{if } |A_{(i,j)}| \ge |A^{(\alpha d_2)}_{(i,\cdot)}| \text{ and } |A_{(i,j)}| \ge |A^{(\alpha d_1)}_{(\cdot,j)}|, \\ 0, & \text{otherwise,} \end{cases} \tag{4}
\]

where A^{(k)}_(i,·) and A^{(k)}_(·,j) denote the elements of A_(i,·) and A_(·,j) that have the k-th largest magnitude, respectively. In other words, we choose to keep those elements that are simultaneously among the largest α-fraction of entries in the corresponding row and column. In the case of entries having identical magnitude, we break ties arbitrarily. It is thus guaranteed that T_α[A] ∈ S_α.

Algorithm 1 Fast RPCA
INPUT: Observed matrix Y with rank r and corruption fraction α; parameters γ, η; number of iterations T.
// Phase I: Initialization.
1: S_init ← T_α[Y]    // see (4) for the definition of T_α[·].
2: [L, Σ, R] ← SVD_r[Y − S_init]²
3: U_0 ← LΣ^{1/2}, V_0 ← RΣ^{1/2}. Let U, V be defined according to (7).
// Phase II: Gradient based iterations.
4: U_0 ← Π_U(U_0), V_0 ← Π_V(V_0)
5: for t = 0, 1, . . . , T − 1 do
6:   S_t ← T_{γα}[Y − U_t V_t^⊤]
7:   U_{t+1} ← Π_U(U_t − η∇_U L(U_t, V_t; S_t) − (1/2) η U_t(U_t^⊤U_t − V_t^⊤V_t))
8:   V_{t+1} ← Π_V(V_t − η∇_V L(U_t, V_t; S_t) − (1/2) η V_t(V_t^⊤V_t − U_t^⊤U_t))
9: end for
OUTPUT: (U_T, V_T)

Initialization. In the fully observed setting, we compute S_init based on Y as S_init = T_α[Y]. In the partially observed setting with sampling rate p, we let S_init = T_{2pα}[Y]. In both cases, we then set U_0 = LΣ^{1/2} and V_0 = RΣ^{1/2}, where LΣR^⊤ is an SVD of the best rank-r approximation of Y − S_init.

Gradient Method on Factorized Space. After initialization, we proceed by projected gradient descent. To do this, we define loss functions explicitly in the factored space, i.e., in terms of U, V and S:

\[
\mathcal{L}(U, V; S) := \frac{1}{2}\, ||| U V^\top + S - Y |||_F^2, \quad \text{(fully observed)} \tag{5}
\]
\[
\widetilde{\mathcal{L}}(U, V; S) := \frac{1}{2p}\, ||| \Pi_\Phi\big( U V^\top + S - Y \big) |||_F^2. \quad \text{(partially observed)} \tag{6}
\]

Recall that our goal is to recover M*, which satisfies the µ-incoherence condition. Given an SVD M* = L*Σ*R*^⊤, we expect that the solution (U, V) is close to (L*Σ*^{1/2}, R*Σ*^{1/2}) up to some rotation. In order to preserve this µ-incoherent structure, it is natural to put constraints on the row norms of U, V based on |||M*|||_op.
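Before turning to the constraint sets, the sparsification operator (4) can be made concrete. The following is a minimal NumPy sketch (function names are ours, not the authors' MATLAB implementation); the per-row/per-column thresholds use partial selection, in the spirit of the partial quick sort mentioned in Remark 1:

```python
import numpy as np

def sparse_threshold(A, alpha):
    """Illustrative sketch of the operator T_alpha in (4): keep A[i, j] only if
    its magnitude is among the top ceil(alpha * d2) in row i AND the top
    ceil(alpha * d1) in column j; zero out everything else."""
    d1, d2 = A.shape
    k_row = int(np.ceil(alpha * d2))
    k_col = int(np.ceil(alpha * d1))
    if k_row == 0 or k_col == 0:
        return np.zeros_like(A)
    absA = np.abs(A)
    # k-th largest magnitude per row / column via partial selection
    # (np.partition places the (d-k)-th smallest value at index d-k).
    row_thr = np.partition(absA, d2 - k_row, axis=1)[:, d2 - k_row][:, None]
    col_thr = np.partition(absA, d1 - k_col, axis=0)[d1 - k_col, :][None, :]
    keep = (absA >= row_thr) & (absA >= col_thr)
    return np.where(keep, A, 0.0)
```

Note that with tied magnitudes this sketch may keep slightly more than the target fraction, whereas (4) breaks ties arbitrarily; either way the kept entries are among the largest in both their row and their column.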
As |||M*|||_op is unavailable, given U_0, V_0 computed in the first phase, we rely on the sets U, V defined as

\[
\mathcal{U} := \left\{ A \in \mathbb{R}^{d_1 \times r} \,\middle|\, \|A\|_{2,\infty} \le \sqrt{\tfrac{2\mu r}{d_1}}\, |||U_0|||_{\mathrm{op}} \right\}, \quad
\mathcal{V} := \left\{ A \in \mathbb{R}^{d_2 \times r} \,\middle|\, \|A\|_{2,\infty} \le \sqrt{\tfrac{2\mu r}{d_2}}\, |||V_0|||_{\mathrm{op}} \right\}. \tag{7}
\]

Now we consider the following optimization problems with constraints:

\[
\min_{U \in \mathcal{U},\, V \in \mathcal{V},\, S \in \mathcal{S}_\alpha} \ \mathcal{L}(U, V; S) + \frac{1}{8}\, ||| U^\top U - V^\top V |||_F^2, \quad \text{(fully observed)} \tag{8}
\]
\[
\min_{U \in \mathcal{U},\, V \in \mathcal{V},\, S \in \mathcal{S}_{p\alpha}} \ \widetilde{\mathcal{L}}(U, V; S) + \frac{1}{64}\, ||| U^\top U - V^\top V |||_F^2. \quad \text{(partially observed)} \tag{9}
\]

The regularization term in the objectives above is used to encourage U and V to have the same scale. Given (U_0, V_0), we propose the following iterative method to produce the series {(U_t, V_t)}_{t=0}^∞ and {S_t}_{t=0}^∞. We give the details for the fully observed case; the partially observed case is similar. For t = 0, 1, . . ., we update S_t using the sparse estimator S_t = T_{γα}[Y − U_t V_t^⊤], followed by a projected gradient update on U_t and V_t:

\[
U_{t+1} = \Pi_{\mathcal{U}}\Big( U_t - \eta \nabla_U \mathcal{L}(U_t, V_t; S_t) - \tfrac{1}{2}\eta\, U_t (U_t^\top U_t - V_t^\top V_t) \Big),
\]
\[
V_{t+1} = \Pi_{\mathcal{V}}\Big( V_t - \eta \nabla_V \mathcal{L}(U_t, V_t; S_t) - \tfrac{1}{2}\eta\, V_t (V_t^\top V_t - U_t^\top U_t) \Big).
\]

Here α is the model parameter that characterizes the corruption fraction, and γ and η are algorithmic tuning parameters, which we specify in our analysis. Essentially, the above algorithm corresponds to applying the projected gradient method to optimize (8), where S is replaced by the aforementioned sparse estimator in each step.

²SVD_r[A] stands for computing a rank-r SVD of matrix A, i.e., finding the top r singular values and vectors of A. Note that we only need to compute the rank-r SVD approximately (see the initialization error requirement in Theorem 1), so we can leverage fast iterative approaches such as the block power method and Krylov subspace methods.

Algorithm 2 Fast RPCA with partial observations
INPUT: Observed matrix Y with support Φ; parameters τ, γ, η; number of iterations T.
// Phase I: Initialization.
1: S_init ← T_{2pα}[Π_Φ(Y)]
2: [L, Σ, R] ← SVD_r[(1/p)(Y − S_init)]
3: U_0 ← LΣ^{1/2}, V_0 ← RΣ^{1/2}. Let U, V be defined according to (7).
// Phase II: Gradient based iterations.
4: U_0 ← Π_U(U_0), V_0 ← Π_V(V_0)
5: for t = 0, 1, . . . , T − 1 do
6:   S_t ← T_{γpα}[Π_Φ(Y − U_t V_t^⊤)]
7:   U_{t+1} ← Π_U(U_t − η∇_U L̃(U_t, V_t; S_t) − (1/16) η U_t(U_t^⊤U_t − V_t^⊤V_t))
8:   V_{t+1} ← Π_V(V_t − η∇_V L̃(U_t, V_t; S_t) − (1/16) η V_t(V_t^⊤V_t − U_t^⊤U_t))
9: end for
OUTPUT: (U_T, V_T)

4 Main Results

4.1 Analysis of Algorithm 1

We begin with some definitions and notation.
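Before the formal analysis, the Phase II update above can be sketched in a few lines of NumPy. For the fully observed loss (5), the gradients are ∇_U L = (UV^⊤ + S − Y)V and ∇_V L = (UV^⊤ + S − Y)^⊤U, and the projections Π_U, Π_V onto the sets in (7) amount to clipping each row's ℓ2 norm. This is an illustrative sketch with our own function names, assuming S has already been produced by the sparse estimator:

```python
import numpy as np

def project_rows(U, radius):
    """Euclidean projection onto a row-norm ball as in (7):
    rescale any row whose l2 norm exceeds `radius`."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return U * scale

def gd_step(U, V, Y, S, eta, radius_U, radius_V):
    """One Phase II iteration (fully observed case): gradient step on the
    loss (5) plus the balancing regularizer, then row-norm projection.
    Both updates use the current (U, V), matching lines 7-8 of Algorithm 1."""
    R = U @ V.T + S - Y          # residual UV^T + S - Y
    G = U.T @ U - V.T @ V        # balancing term U^T U - V^T V
    U_new = project_rows(U - eta * (R @ V) - 0.5 * eta * (U @ G), radius_U)
    V_new = project_rows(V - eta * (R.T @ U) + 0.5 * eta * (V @ G), radius_V)
    return U_new, V_new
```

When UV^⊤ + S = Y and U^⊤U = V^⊤V, both the gradient and the regularizer vanish, so the iteration is at a fixed point, consistent with (U*, V*) being optimal for (8).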
It is important to define a proper error metric because the optimal solution corresponds to a manifold: there are many distinct pairs (U, V) that minimize (8). Given the SVD of the true low-rank matrix M* = L*Σ*R*^⊤, we let U* := L*Σ*^{1/2} and V* := R*Σ*^{1/2}. We also let σ*_1 ≥ σ*_2 ≥ . . . ≥ σ*_r be the sorted nonzero singular values of M*, and denote the condition number of M* by κ, i.e., κ := σ*_1/σ*_r. We define the estimation error d(U, V; U*, V*) as the minimal Frobenius norm between (U, V) and (U*, V*) with respect to the optimal rotation, namely

\[
d(U, V; U^*, V^*) := \min_{Q \in \mathbb{Q}_r} \sqrt{ ||| U - U^* Q |||_F^2 + ||| V - V^* Q |||_F^2 }, \tag{10}
\]

for Q_r the set of r-by-r orthonormal matrices. This metric controls the reconstruction error, as

\[
||| U V^\top - M^* |||_F \lesssim \sqrt{\sigma_1^*}\; d(U, V; U^*, V^*) \tag{11}
\]

when d(U, V; U*, V*) ≤ √(σ*_1). We denote the local region around the optimum (U*, V*) with radius ω as

\[
\mathbb{B}_2(\omega) := \left\{ (U, V) \in \mathbb{R}^{d_1 \times r} \times \mathbb{R}^{d_2 \times r} \,\middle|\, d(U, V; U^*, V^*) \le \omega \right\}.
\]

The next two theorems provide guarantees for the initialization phase and gradient iterations, respectively, of Algorithm 1.

Theorem 1 (Initialization). Consider the pair (U_0, V_0) produced in the first phase of Algorithm 1. If α ≤ 1/(16κµr), we have

\[
d(U_0, V_0; U^*, V^*) \le 28\sqrt{\kappa}\, \alpha \mu r \sqrt{r} \sqrt{\sigma_1^*}.
\]

Theorem 2 (Convergence). Consider the second phase of Algorithm 1. Suppose we choose γ = 2 and η = c/σ*_1 for any c ≤ 1/36. There exist constants c_1, c_2 such that when α ≤ c_1/(κ²µr), given any (U_0, V_0) ∈ B_2(c_2 √(σ*_r/κ)), the iterates {(U_t, V_t)}_{t=0}^∞ satisfy

\[
d^2(U_t, V_t; U^*, V^*) \le \Big(1 - \frac{c}{8\kappa}\Big)^{t} d^2(U_0, V_0; U^*, V^*).
\]

Therefore, using proper initialization and step size, the gradient iteration converges at a linear rate with a constant contraction factor 1 − O(1/κ). To obtain relative precision ε compared to the initial error, it suffices to perform O(κ log(1/ε)) iterations. Note that the step size is chosen according to 1/σ*_1. When α ≲ 1/(µ√(κr³)), Theorem 1 and the inequality (11) together imply that |||U_0 V_0^⊤ − M*|||_op ≤ (1/2) σ*_1. Hence we can set the step size as η = O(1/σ_1(U_0 V_0^⊤)), using the top singular value σ_1(U_0 V_0^⊤) of the matrix U_0 V_0^⊤.

Combining Theorems 1 and 2 implies the following result that provides an overall guarantee for Algorithm 1.

Corollary 1. Suppose that

\[
\alpha \le c \min\left\{ \frac{1}{\mu\sqrt{\kappa r^3}},\ \frac{1}{\mu \kappa^2 r} \right\}
\]

for some constant c. Then for any ε ∈ (0, 1), Algorithm 1 with T = O(κ log(1/ε)) outputs a pair (U_T, V_T) that satisfies

\[
||| U_T V_T^\top - M^* |||_F \le \varepsilon \cdot \sigma_r^*. \tag{12}
\]

Remark 1 (Time Complexity). For simplicity we assume d_1 = d_2 = d. Our sparse estimator (4) can be implemented by finding the top αd elements of each row and column via partial quick sort, which has running time O(d² log(αd)).
Performing rank-r SVD in the \ufb01rst phase and computing the\ngradient in each iteration both have complexity O(rd2).3 Algorithm 1 thus has total running time\nO(\u03bard2 log(1/\u03b5)) for achieving an \u0001 accuracy as in (12). We note that when \u03ba = O(1), our algorithm\nis orderwise faster than the AltProj algorithm in [21], which has running time O(r2d2 log(1/\u03b5)).\nMoreover, our algorithm only requires computing one singular value decomposition.\nRemark 2 (Robustness). Assuming \u03ba = O(1), our algorithm can tolerate corruption at a sparsity\nlevel up to \u03b1 = O(1/(\u00b5r\nr compared to the optimal statistical\nguarantee 1/(\u00b5r) obtained in [11, 18, 21]. This looseness is a consequence of the condition for\n(U0, V0) in Theorem 2. Nevertheless, when \u00b5r = O(1), our algorithm can tolerate a constant \u03b1\nfraction of corruptions.\n\nr)). This is worse by a factor\n\n\u221a\n\n\u221a\n\n4.2 Analysis of Algorithm 2\nWe now move to the guarantees of Algorithm 2. We show here that not only can we handle partial\nobservations, but in fact subsampling the data in the fully observed case can signi\ufb01cantly reduce the\ntime complexity from the guarantees given in the previous section without sacri\ufb01cing robustness. In\nparticular, for smaller values of r, the complexity of Algorithm 2 has near linear dependence on the\ndimension d, instead of quadratic.\nIn the following discussion, we let d := max{d1, d2}. The next two results control the quality of the\ninitialization step, and then the gradient iterations.\nTheorem 3 (Initialization, partial observations). Suppose the observed indices \u03a6 follow the Bernoulli\nmodel given in (2). Consider the pair (U0, V0) produced in the \ufb01rst phase of Algorithm 2. 
There exist\nconstants {ci}3\n\ni=1 such that for any \u0001 \u2208 (0,\n\nr/(8c1\u03ba)), if\n\n\u221a\n\n\u03b1 \u2264 1\n\n64\u03ba\u00b5r\n\n, p \u2265 c2\n\nthen we have\n\nd(U0, V0; U\u2217, V \u2217) \u2264 51\n\n\u03ba\u03b1\u00b5r\n\n\u221a\n\nwith probability at least 1 \u2212 c3d\u22121.\n\n(cid:18) \u00b5r2\n\n(cid:19) log d\n1 + 7c1\u0001(cid:112)\u03ba\u03c3\u2217\nr(cid:112)\u03c3\u2217\n\nd1 \u2227 d2\n\n1\n\u03b1\n\n,\n\n1,\n\n\u00012 +\n\u221a\n\n(13)\n\n3In fact, it suf\ufb01ces to compute the best rank-r approximation with running time independent of the eigen gap.\n\n6\n\n\fTheorem 4 (Convergence, partial observations). Suppose the observed indices \u03a6 follow the Bernoulli\nmodel given in (2). Consider the second phase of Algorithm 2. Suppose we choose \u03b3 = 3, and\n\u03b7 = c/(\u00b5r\u03c3\u2217\n\n1) for a suf\ufb01ciently small constant c. There exist constants {ci}4\n\ni=1 such that if\n\n,\n\n(14)\n\nd1 \u2227 d2\nthen with probability at least 1 \u2212 c3d\u22121, the iterates {(Ut, Vt)}\u221e\n\n\u03b1 \u2264 c1\n\u03ba2\u00b5r\n\nand p \u2265 c2\n\n\u03ba4\u00b52r2 log d\n\nt=0 satisfy\n\n(cid:18)\n\n(cid:19)t\n\n1 \u2212 c\n\n64\u00b5r\u03ba\n\nd2(U0, V0; U\u2217, V \u2217)\n\nd2(Ut, Vt; U\u2217, V \u2217) \u2264\n\n(cid:16)\n\n(cid:112)\u03c3\u2217\n\n(cid:17)\n\nfor all (U0, V0) \u2208 B2\n\nc4\n\nr /\u03ba\n\n.\n\nSetting p = 1 in the above result recovers Theorem 2 up to an additional factor \u00b5r in the contraction\nfactor. For achieving \u03b5 relative accuracy, now we need O(\u00b5r\u03ba log(1/\u03b5)) iterations. Putting Theorems\n3 and 4 together, we have the following overall guarantee for Algorithm 2.\nCorollary 2. Suppose that\n\n(cid:40)\n\n(cid:41)\n\n\u03b1 \u2264 c min\n\n1\n\u221a\n\u03bar3 ,\n\u00b5\n\n1\n\n\u00b5\u03ba2r\n\n, p \u2265 c(cid:48) \u03ba4\u00b52r2 log d\n\nd1 \u2227 d2\n\n,\n\nfor some constants c, c(cid:48). 
With probability at least 1 − O(d^{−1}), for any ε ∈ (0, 1), Algorithm 2 with T = O(µrκ log(1/ε)) outputs a pair (U_T, V_T) that satisfies

\[
||| U_T V_T^\top - M^* |||_F \le \varepsilon \cdot \sigma_r^*. \tag{15}
\]

This result shows that partial observations do not compromise robustness to sparse corruptions: as long as the observation probability p satisfies the condition in Corollary 2, Algorithm 2 enjoys the same robustness guarantees as the method using all entries. Below we provide two remarks on the sample and time complexity. For simplicity, we assume d_1 = d_2 = d and κ = O(1).

Remark 3 (Sample complexity and matrix completion). Using the lower bound on p, it is sufficient to have O(µ²r²d log d) observed entries. In the special case S* = 0, our partial observation model is equivalent to the model of exact matrix completion (see, e.g., [8]). We note that our sample complexity (i.e., the number of observations needed) matches that of completing a positive semidefinite (PSD) matrix by gradient descent as shown in [12], and is better than the non-convex matrix completion algorithms in [19] and [23]. Accordingly, our result reveals the important fact that we can obtain robustness in matrix completion without deterioration of our statistical guarantees. It is known that any algorithm for solving exact matrix completion must have sample size Ω(µrd log d) [8], and a nearly tight upper bound O(µrd log² d) is obtained in [10] by convex relaxation. While sub-optimal by a factor µr, our algorithm is much faster than convex relaxation, as shown below.

Remark 4 (Time complexity). Our sparse estimator on the sparse matrix with support Φ can be implemented via partial quick sort with running time O(pd² log(αpd)). Computing the gradient in each step involves the two terms in the objective function (9).
Computing the gradient of the first term L̃ takes time O(r|Φ|), whereas the second term takes time O(r²d). In the initialization phase, performing the rank-r SVD on a sparse matrix with support Φ can be done in time O(r|Φ|). We conclude that when |Φ| = O(µ²r²d log d), Algorithm 2 achieves the error bound (15) with running time O(µ³r⁴d log d log(1/ε)). Therefore, in the small-rank setting with r ≪ d^{1/3}, even when full observations are given, it is better to use Algorithm 2 by subsampling the entries of Y.

5 Numerical Results

In this section, we provide numerical results and compare the proposed algorithms with existing methods, including the inexact augmented Lagrange multiplier (IALM) approach [20] for solving the convex relaxation (1) and the alternating projection (AltProj) algorithm proposed in [21]. All algorithms are implemented in MATLAB,⁴ and the code for the existing algorithms is obtained from their authors. SVD computation in all algorithms uses the PROPACK library.⁵ We ran all simulations on a machine with a 32-core Intel Xeon (E5-2699) at 2.3 GHz with 240 GB RAM.

⁴Our code is available at https://www.yixinyang.org/code/RPCA_GD.zip.
⁵http://sun.stanford.edu/~rmunk/PROPACK/

Figure 1: Results on synthetic data. (a) Plot of log estimation error versus number of iterations when using gradient descent (GD) with varying sub-sampling rate p. It is conducted using d = 5000, r = 10, α = 0.1. (b) Plot of running time of GD versus dimension d with r = 10, α = 0.1, p = 0.15 r² log d/d. The low-rank matrix is recovered in all instances, and the line has slope approximately one. (c) Plot of log estimation error versus running time for different algorithms in a problem with d = 5000, r = 10, α = 0.1.

Figure 2: Foreground-background separation in the Restaurant and ShoppingMall videos.
In each row, the leftmost image is an original frame, and the other four are the separated backgrounds obtained from our algorithms with p = 1 and p = 0.2, AltProj, and IALM. The running time required by each algorithm is shown in the title.

Synthetic Datasets. We generate a square data matrix Y = M∗ + S∗ ∈ Rd×d as follows. The low-rank part M∗ is given by M∗ = AB⊤, where A, B ∈ Rd×r have entries drawn independently from a zero-mean Gaussian distribution with variance 1/d. For a given sparsity parameter α, each entry of S∗ is set to be nonzero with probability α, and the values of the nonzero entries are sampled uniformly from [−5r/d, 5r/d]. The results are summarized in Figure 1. Figure 1a shows the convergence of our algorithms for different random instances with different sub-sampling rates p. Figure 1b shows the running time of our algorithm with partially observed data. We note that our algorithm is memory-efficient: in the large-scale setting with d = 2 × 105, using approximately 0.1% of the entries is sufficient for successful recovery. In contrast, AltProj and IALM are designed to manipulate the entire matrix with d2 = 4 × 1010 entries, which is prohibitive on a single machine. Figure 1c compares our algorithms with AltProj and IALM by showing reconstruction error versus actual running time. Our algorithm requires significantly less computation to achieve the same accuracy level, and using only a subset of the entries provides additional speed-up.
Foreground-background Separation. We apply our method to the task of foreground-background (FB) separation in a video. We use two public benchmarks, the Restaurant and ShoppingMall datasets.6 Each dataset contains a video with a static background.
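The synthetic data model described above can be generated with a short script. The following is a minimal numpy sketch (the function name and the seeding convention are ours; the distributions follow the text: M∗ = AB⊤ with i.i.d. N(0, 1/d) entries, and S∗ entries nonzero with probability α, drawn uniformly from [−5r/d, 5r/d]):

```python
import numpy as np

def generate_rpca_instance(d, r, alpha, rng=None):
    """Generate Y = M* + S* following the synthetic model in the text."""
    rng = np.random.default_rng(rng)
    # Low-rank part: M* = A B^T, entries of A, B are i.i.d. N(0, 1/d)
    A = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, r))
    B = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, r))
    M = A @ B.T
    # Sparse part: each entry nonzero w.p. alpha, uniform on [-5r/d, 5r/d]
    support = rng.random((d, d)) < alpha
    values = rng.uniform(-5.0 * r / d, 5.0 * r / d, size=(d, d))
    S = np.where(support, values, 0.0)
    return M + S, M, S
```

For example, `generate_rpca_instance(5000, 10, 0.1)` produces an instance matching the setting of Figure 1a.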
By vectorizing and stacking the frames as columns of a matrix Y, the FB separation problem can be cast as RPCA, where the static background corresponds to a low-rank matrix M∗ with identical columns, and the moving objects in the video can be modeled as sparse corruptions S∗. Figure 2 shows the output of the different algorithms on two frames from the datasets. Our algorithms require significantly less running time than both AltProj and IALM. Moreover, even with 20% sub-sampling, our methods still seem to achieve better separation quality. The details of the parameter settings and additional results are deferred to the supplemental material.

6http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html

[Figure 2 panel running times: GD 49.8s / 87.3s; GD with 20% sampling 18.1s / 43.4s; AltProj 101.5s / 283.0s; IALM 434.6s / 801.4s]

References
[1] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[2] Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.
[3] Srinadh Bhojanapalli, Prateek Jain, and Sujay Sanghavi. Tighter low-rank approximation via sampling the leveraged element. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 902–920. SIAM, 2015.
[4] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. arXiv preprint arXiv:1509.03917, 2015.
[5] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright.
Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[6] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
[7] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[8] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[9] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
[10] Yudong Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909–2923, 2015.
[11] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Constantine Caramanis. Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory, 59(7):4324–4337, 2013.
[12] Yudong Chen and Martin J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
[13] Yuxin Chen and Emmanuel J. Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739–747, 2015.
[14] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2013.
[15] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations.
Journal of the ACM (JACM), 51(6):1025–1041, 2004.
[16] Quanquan Gu, Zhaoran Wang, and Han Liu. Low-rank and sparse structure pursuit via alternating minimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 600–609, 2016.
[17] Moritz Hardt. Understanding alternating minimization for matrix completion. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pages 651–660. IEEE, 2014.
[18] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.
[19] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
[20] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055v3, 2013.
[21] Praneeth Netrapalli, U. N. Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-convex robust PCA. In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.
[22] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
[23] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 270–289. IEEE, 2015.
[24] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
[25] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional EM algorithm: Statistical optimization and asymptotic normality.
In Advances in Neural Information Processing Systems, pages 2512–2520, 2015.
[26] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, May 2012.
[27] Xinyang Yi and Constantine Caramanis. Regularized EM algorithms: A unified framework and statistical guarantees. In Advances in Neural Information Processing Systems, pages 1567–1575, 2015.
[28] Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow. arXiv preprint arXiv:1603.03805, 2016.
[29] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.
[30] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, pages 109–117, 2015.