{"title": "Online PCA for Contaminated Data", "book": "Advances in Neural Information Processing Systems", "page_first": 764, "page_last": 772, "abstract": "We consider the online Principal Component Analysis (PCA) for contaminated samples (containing outliers) which are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily bad. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a $50\\%$ breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data.", "full_text": "Online PCA for Contaminated Data\n\nJiashi Feng\n\nECE Department\n\nNational University of Singapore\n\njiashi@nus.edu.sg\n\nHuan Xu\n\nME Department\n\nNational University of Singapore\n\nmpexuh@nus.edu.sg\n\nShie Mannor\nEE Department\n\nTechnion\n\nshie@ee.technion.ac.il\n\nShuicheng Yan\nECE Department\n\nNational University of Singapore\n\neleyans@nus.edu.sg\n\nAbstract\n\nWe consider the online Principal Component Analysis (PCA) where contaminated\nsamples (containing outliers) are revealed sequentially to the Principal Compo-\nnents (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA\nalgorithms fail in this case and their results can be arbitrarily skewed by the out-\nliers. 
Here we propose the online robust PCA algorithm, which is able to im-\nprove the PCs estimation upon an initial one steadily, even when faced with a\nconstant fraction of outliers. We show that the \ufb01nal result of the proposed online\nRPCA has an acceptable degradation from the optimum. Actually, under mild\nconditions, online RPCA achieves the maximal robustness with a 50% breakdown\npoint. Moreover, online RPCA is shown to be ef\ufb01cient for both storage and com-\nputation, since it need not re-explore the previous samples as in traditional robust\nPCA algorithms. This endows online RPCA with scalability for large scale data.\n\n1\n\nIntroduction\n\nIn this paper, we investigate the problem of robust Principal Component Analysis (PCA) in an online\nfashion. PCA aims to construct a low-dimensional subspace based on a set of principal components\n(PCs) to approximate all the observed samples in the least-square sense [19]. Conventionally, it\ncomputes PCs as the eigenvectors of the sample covariance matrix in batch mode, which is both\ncomputationally expensive and in particular memory exhausting, when dealing with large scale data.\nTo address this problem, several online PCA algorithms have been developed in literature [15, 23,\n10]. For online PCA, at each time instance, a new sample is revealed, and the PCs estimation is\nupdated accordingly without having to re-explore all previous samples. Signi\ufb01cant advantages of\nonline PCA algorithms include independence of their storage space requirement of the number of\nsamples, and handling newly revealed samples quite ef\ufb01ciently.\nDue to the quadratic loss used, PCA is notoriously sensitive to corrupted observations (outliers),\nand the quality of its output can suffer severely in the face of even a few outliers. Therefore, much\nwork has been dedicated to robustifying PCA [12, 2, 24, 6]. 
However, all of these methods work in batch mode and cannot handle sequentially revealed samples in the online learning framework. For instance, [24] proposed a high-dimensional robust PCA (HR-PCA) algorithm that is based on iteratively performing PCA and randomized removal. Notice that the random removal process involves calculating the order statistics over all the samples to obtain the removal probability. Therefore, all samples must be stored in memory throughout the process. This hinders its application to large scale data, for which storing all data is impractical.\n\nIn this work, we propose a novel online robust PCA algorithm to handle a contaminated sample set, i.e., a sample set that comprises both authentic samples (non-corrupted samples) and outliers (corrupted samples), which are revealed sequentially to the algorithm. Previous online PCA algorithms generally fail in this case, since they update the PCs estimation by minimizing the quadratic error w.r.t. every new sample and are thus sensitive to outliers. The outliers may manipulate the PCs estimation severely and the result can be arbitrarily bad. In contrast, the proposed online RPCA is shown to be robust to the outliers. This is achieved by a probabilistic admission/rejection procedure applied when a new sample arrives. This is different from previous online PCA methods, where each and every new sample is admitted. The probabilistic admission/rejection procedure endows online RPCA with the ability to reject more outliers than authentic samples and thus alleviates the effect of outliers and robustifies the PCs estimation. Indeed, we show that given a proper initial estimation, online RPCA is able to steadily improve its output until convergence. We further bound the deviation of the final output from the optimal solution. In fact, under mild conditions, online RPCA can be resistant to 50% outliers, namely having a 50% breakdown point. 
This is the maximal robustness that can be\nachieved by any method.\nCompared with previous robust PCA methods (typically works in batch mode), online RPCA only\nneeds to maintain a covariance matrix whose size is independent of the number of data points. Upon\naccepting a newly revealed sample, online RPCA updates the PCs estimation accordingly without\nre-exploring the previous samples. Thus, online RPCA can deal with large amounts of data with\nlow storage expense. This is in stark contrast with previous robust PCA methods which typically\nrequires to remember all samples. To the best of our knowledge, this is the \ufb01rst attempt to make\nonline PCA work for outlier-corrupted data, with theoretical performance guarantees.\n\n2 Related Work\n\nStandard PCA is performed in batch mode, and its high computational complexity may become\ncumbersome for the large datasets. To address this issue, different online learning techniques have\nbeen proposed, for example [1, 8], and many others.\nMost of current online PCA methods perform the PCs estimation in an incremental manner [8, 18,\n25]. They maintain a covariance matrix or current PCs estimation, and update it according to the\nnew sample incrementally. Those methods provide similar PCs estimation accuracy. Recently, a\nrandomized online PCA algorithm was proposed by [23], whose objective is to minimize the total\nexpected quadratic error minus the total error of the batch algorithm (i.e., the regret). However, none\nof these online PCA algorithms is robust to the outliers.\nTo overcome the sensitiveness of PCA to outliers, many robust PCA algorithms have been pro-\nposed [21, 4, 12], which can be roughly categorized into two groups. They either pursue robust\nestimation of the covariance matrix, e.g., M-estimator [17], S-estimator [22], and Minimum Co-\nvariance Determinant (MCD) estimator [21], or directly maximize certain robust estimation of uni-\nvariate variance for the projected observations [14, 3, 4, 13]. 
These algorithms inherit the robustness\ncharacteristics of the adopted estimators and are qualitatively robust. However, none of them can\nbe directly applied in online learning setting. Recently, [24] and the following work [6] propose\nhigh-dimensional robust PCA, which can achieve maximum 50% breakdown point. However, these\nmethods iteratively remove the observations or tunes the observations weights based on statistics\nobtained from the whole data set. Thus, when a new data point is revealed, these methods need to\nre-explore all of the data and become quite expensive in computation and in storage.\nThe most related works to ours are the following two works. In [15], an incremental and robust\nsubspace learning method is proposed. The method proposes to integrate the M-estimation into the\nstandard incremental PCA calculation. Speci\ufb01cally, each newly coming data point is re-weighted by\na pre-de\ufb01ned in\ufb02uence function [11] of its residual to the current estimated subspace. However, no\nperformance guarantee is provided in this work. Moreover, the performance of the proposed algo-\nrithm relies on the accuracy of PCs obtained previously. And the error will be cumulated inevitably.\nRecently, a compressive sensing based recursive robust PCA algorithm was proposed in [20]. In this\nwork, the authors focused on the case where the outliers can be modeled as sparse vectors. In con-\ntrast, we do not impose any structural assumption on the outliers. Moreover, the proposed method\nin [20] essentially solves compressive sensing optimization over a small batch of data to update the\nPCs estimation instead of using a single sample, and it is not clear how to extend the method to the\n\n2\n\n\flatter case. Recently, He et al. propose an incremental gradient descent method on Grassmannian\nmanifold for solving the robust PCA problem, named GRASTA [9]. 
However, they also focus on a different case from ours where the outliers are sparse vectors.\n\n3 The Algorithm\n\n3.1 Problem Setup\n\nGiven a set of observations $\\{y_1, \\dots, y_T\\}$ (here $T$ can be finite or infinite) which are revealed sequentially, the goal of online PCA is to estimate and update the principal components (PCs) based on the newly revealed sample $y_t$ at time instance $t$. Here, the observations are a mixture of authentic samples (non-corrupted samples) and outliers (corrupted samples). The authentic samples $z_i \\in \\mathbb{R}^p$ are generated through a linear mapping: $z_i = A x_i + n_i$. The noise $n_i$ is sampled from the normal distribution $\\mathcal{N}(0, I_p)$, and the signals $x_i \\in \\mathbb{R}^d$ are i.i.d. samples of a random variable $x$ with mean zero and variance $I_d$. Let $\\mu$ denote the distribution of $x$. The matrix $A \\in \\mathbb{R}^{p \\times d}$ and the distribution $\\mu$ are unknown. We assume $\\mu$ is absolutely continuous w.r.t. the Borel measure and spherically symmetric. In addition, $\\mu$ has light tails, i.e., there exist constants $\\alpha, C > 0$ such that $\\Pr(\\|x\\| \\ge x) \\le d \\exp\\left(1 - \\frac{Cx}{\\alpha\\sqrt{d}}\\right)$ for all $x \\ge 0$. The outliers are denoted as $o_i \\in \\mathbb{R}^p$ and in particular they are defined as follows.\n\nDefinition 1 (Outlier). A sample $o_i \\in \\mathbb{R}^p$ is an outlier w.r.t. the subspace spanned by $\\{w_j\\}_{j=1}^{d}$ if it deviates from the subspace, i.e., $\\sum_{j=1}^{d} |w_j^T o_i|^2 \\le \\Gamma_o$.\n\nIn the above definition, we assume that the basis $w_j$ and the outliers $o$ are both normalized (see Algorithm 1 step 1)-a), where all the samples are $\\ell_2$-normalized). Thus, we directly use the inner product to define $\\Gamma_o$. Namely, a sample is called an outlier if it is distant from the underlying subspace of the signal. Apart from this assumption, the outliers are arbitrary. 
In this work, we are interested in the case where the outliers are mixed with authentic samples uniformly in the data stream, i.e., taking any subset of the dataset, the outlier fraction is identical when the size of the subset is large enough.\n\nThe input to the proposed online RPCA algorithm is the sequence of observations $Y = \\{y_1, y_2, \\dots, y_T\\}$, which is the union of authentic samples $Z = \\{z_i\\}$ generated by the aforementioned linear model and outliers $O = \\{o_i\\}$. The outlier fraction in the observations is denoted as $\\lambda$. Online RPCA aims at learning the PCs robustly and the learning process proceeds in time instances. At time instance $t$, online RPCA chooses a set of principal components $\\{w_j^{(t)}\\}_{j=1}^{d}$. The performance of the estimation is measured by the Expressed Variance (E.V.) [24]:\n\n$$\\text{E.V.} \\triangleq \\frac{\\sum_{j=1}^{d} w_j^{(t)T} A A^T w_j^{(t)}}{\\sum_{j=1}^{d} w_j^T A A^T w_j}.$$\n\nHere, $w_j$ denotes the true principal components of matrix $A$. The E.V. represents the portion of the signal $Ax$ being expressed by $\\{w_j^{(t)}\\}_{j=1}^{d}$. Thus, $1 - \\text{E.V.}$ is the reconstruction error of the signal. The E.V. is a commonly used evaluation metric for PCA algorithms [24, 5]. It is always less than or equal to one, with equality achieved by a perfect recovery.\n\n3.2 Online Robust PCA Algorithm\n\nThe details of the proposed online RPCA algorithm are shown in Algorithm 1. In the algorithm, the observation sequence $Y = \\{y_1, y_2, \\dots, y_T\\}$ is sequentially partitioned into $(T' + 1)$ batches $\\{B_0, B_1, B_2, \\dots, B_{T'}\\}$. Each batch consists of $b$ observations. Since the authentic samples and outliers are mixed uniformly, the outlier fraction in each batch is also $\\lambda$. Namely, in each batch $B_i$, there are $(1 - \\lambda)b$ authentic samples and $\\lambda b$ outliers.\n\nNote that such small batch partition is only for the ease of illustration and analysis. 
Since the algorithm only involves standard PCA computation, we can employ any incremental or online PCA method [8, 15] to update the PCs estimation upon accepting a new sample. The maintained sample covariance matrix can be reset to zero every $b$ time instances. Thus the batch partition is by no means necessary in a practical implementation. In the algorithm, the initial PC estimation can be obtained through standard PCA or robust PCA [24] on a mini batch of the samples.\n\nAlgorithm 1 Online Robust PCA Algorithm\n\nInput: Data sequence $\\{y_1, \\dots, y_T\\}$, buffer size $b$.\nInitialization: Partition the data sequence into small batches $\\{B_0, B_1, \\dots, B_{T'}\\}$. Each batch contains $b$ data points. Perform PCA on the first batch $B_0$ and obtain the initial principal components $\\{w_j^{(0)}\\}_{j=1}^{d}$.\n$t = 1$. $w_j^* = w_j^{(0)}, \\forall j = 1, \\dots, d$.\nwhile $t \\le T'$ do\n  1) Initialize the sample covariance matrix: $C^{(t)} = 0$.\n  for $i = 1$ to $b$ do\n    a) Normalize the data point by its $\\ell_2$-norm: $y_i^{(t)} := y_i^{(t)} / \\|y_i^{(t)}\\|_{\\ell_2}$.\n    b) Calculate the variance of $y_i^{(t)}$ along the direction $w^{(t-1)}$: $\\delta_i = \\sum_{j=1}^{d} |w_j^{(t-1)T} y_i^{(t)}|^2$.\n    c) Accept $y_i^{(t)}$ with probability $\\delta_i$.\n    d) Scale $y_i^{(t)}$ as $y_i^{(t)} \\leftarrow y_i^{(t)} / (b\\sqrt{\\delta_i})$.\n    e) If $y_i^{(t)}$ is accepted, update $C^{(t)} \\leftarrow C^{(t)} + y_i^{(t)} y_i^{(t)T}$.\n  end for\n  2) Perform eigen-decomposition on $C^{(t)}$ and obtain the leading $d$ eigenvectors $\\{w_j^{(t)}\\}_{j=1}^{d}$.\n  3) Update the PCs as $w_j^* = w_j^{(t)}, \\forall j = 1, \\dots, d$.\n  4) $t := t + 1$.\nend while\nReturn $w^*$.\n\nWe now explain the intuition of the proposed online RPCA algorithm. 
Given an initial solution $w^{(0)}$ which is \u201ccloser\u201d to the true PC directions than to the outlier direction (see footnote 1), the authentic samples will have larger variance along the current PC direction than outliers. Thus in the probabilistic data selection process (as shown in Algorithm 1 step b) to step d)), authentic samples are more likely to be accepted than outliers. Here the step d) of scaling the samples is important for obtaining an unbiased estimator (see details in the proof of Lemma 4 in the supplementary material and [16]). Therefore, in the following PC update based on standard PCA on the accepted data, authentic samples will contribute more than the outliers. The estimated PCs will be \u201cmoved\u201d towards the true PCs gradually. This process is repeated until convergence.\n\n4 Main Results\n\nIn this section we present the theoretical performance guarantee of the proposed online RPCA algorithm (Algorithm 1). In the sequel, $w_j^{(t)}$ is the solution at the $t$-th time instance. Here without loss of generality we assume the matrix $A$ is normalized, such that the E.V. of the true principal components $w_j$ is $\\sum_{j=1}^{d} w_j^T A A^T w_j = 1$. The following theorem provides the performance guarantee of Algorithm 1 under the noisy case. The performance of $w^{(t)}$ can be measured by $H(w^{(t)}) \\triangleq \\sum_{j=1}^{d} \\|w_j^{(t)T} A\\|^2$. Let $s = \\|x\\|_2 / \\|n\\|_2$ be the signal noise ratio.\n\nTheorem 1 (Noisy Case Performance). 
There exist constants $c_1', c_2'$ which depend on the signal noise ratio $s$ and $\\epsilon_1, \\epsilon_2 > 0$ which approach zero when $s \\to \\infty$ or $b \\to \\infty$, such that if the initial solution $w_j^{(0)}$ in Algorithm 1 satisfies:\n\n$$\\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} \\left|w_j^{(0)T} o_i\\right|^2 \\le \\frac{(1-\\lambda)b(1-\\epsilon_2)}{c_2'(1-\\Gamma_o)} \\left(\\frac{1}{4}\\left(c_1'(1-\\epsilon)-\\epsilon_1\\right)^2 - \\epsilon_2\\right),$$\n\nand\n\n$$H(w^{(0)}) \\ge \\frac{1}{2}\\left(c_1'(1-2\\epsilon)-\\epsilon_1\\right) - \\sqrt{\\frac{\\left(c_1'(1-\\epsilon)+\\epsilon_1\\right)^2 - 4\\epsilon_2}{4} - \\frac{c_2' \\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 (1-\\Gamma_o)}{(1-\\lambda)b(1-\\epsilon_2)}},$$\n\nthen the performance of the solution from Algorithm 1 will be improved in each iteration, and eventually converges to:\n\n$$\\lim_{t \\to \\infty} H(w^{(t)}) \\ge \\frac{1}{2}\\left(c_1'(1-2\\epsilon)-\\epsilon_1\\right) + \\sqrt{\\frac{\\left(c_1'(1-2\\epsilon)-\\epsilon_1\\right)^2 - 4\\epsilon_2}{4} - \\frac{c_2' \\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 (1-\\Gamma_o)}{(1-\\lambda)b(1-\\epsilon_2)}}.$$\n\nHere $\\epsilon_1$ and $\\epsilon_2$ decay as $\\tilde{O}(d^{1/2} b^{-1/2} s^{-1})$, $\\epsilon$ decays as $\\tilde{O}(d^{1/2} b^{-1/2})$, and $c_1' = (s-1)^2/(s+1)^2$, $c_2' = (1+1/s)^4$.\n\nFootnote 1: In the following section, we will provide a precise description of the required closeness.\n\nRemark 1. From Theorem 1, we can observe the following:\n\n1. When the outliers vanish, the second term in the square root of the bound on $H(w^{(t)})$ is zero. $H(w^{(t)})$ will converge to $\\frac{1}{2}\\left(c_1'(1-2\\epsilon)-\\epsilon_1\\right) + \\frac{1}{2}\\sqrt{\\left(c_1'(1-2\\epsilon)-\\epsilon_1\\right)^2 - 4\\epsilon_2} < c_1'(1-2\\epsilon)-\\epsilon_1 < c_1' < 1$. Namely, the final performance is smaller than but approximates 1. Here $c_1', \\epsilon_1, \\epsilon_2$ account for the effect of noise.\n\n2. When $s \\to \\infty$, the effect of noise is eliminated, $\\epsilon_1, \\epsilon_2 \\to 0$, $c_1' \\to 1$. $H(w^{(t)})$ converges to $1 - 2\\epsilon$. Here $\\epsilon$ depends on the ratio of intrinsic dimension over the sample size, and $\\epsilon$ accounts for the statistical bias due to performing PCA on a small portion of the data.\n\n3. When the batch size increases to infinity, $\\epsilon \\to 0$, $H(w^{(t)})$ converges to 1, meaning perfect recovery.\n\nTo further investigate the behavior of the proposed online RPCA in the presence of outliers, we consider the following noiseless case. For the noiseless case, the signal noise ratio $s \\to \\infty$, and thus $c_1', c_2' \\to 1$ and $\\epsilon_1, \\epsilon_2 \\to 0$. Then we can immediately obtain the performance bound of Algorithm 1 for the noiseless case from Theorem 1.\n\nTheorem 2 (Noiseless Case Performance). Suppose there is no noise. If the initial solution $w^{(0)}$ in Algorithm 1 satisfies:\n\n$$\\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 \\le \\frac{(1-\\lambda)b}{4(1-\\Gamma_o)},$$\n\nand\n\n$$H(w^{(0)}) \\ge \\frac{1}{2} - \\sqrt{\\frac{1}{4} - \\frac{\\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 (1-\\Gamma_o)}{(1-\\lambda)b}},$$\n\nthen the performance of the solution from Algorithm 1 will be improved in each update and eventually converges to:\n\n$$\\lim_{t \\to \\infty} H(w^{(t)}) \\ge \\frac{1}{2} + \\sqrt{\\frac{1}{4} - \\frac{\\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 (1-\\Gamma_o)}{(1-\\lambda)b}}.$$\n\nRemark 2. From Theorem 2, we can observe the following:\n\n1. When the outliers are distributed on the groundtruth subspace, i.e., $\\sum_{j=1}^{d} |w_j^T o_i|^2 = 1$, the conditions become $\\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 < \\infty$ and $H(w^{(0)}) \\ge 0$. Namely, for whatever initial solution, the final performance will converge to 1.\n\n2. When the outliers are orthogonal to the groundtruth subspace, i.e., $\\sum_{j=1}^{d} |w_j^T o_i|^2 = 0$, the conditions for the initial solution become $\\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} |w_j^{(0)T} o_i|^2 \\le b(1-\\lambda)/4$ and $H(w^{(0)}) \\ge 1/2 - \\sqrt{1/4 - \\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 / ((1-\\lambda)b)}$. Hence, when the outlier fraction $\\lambda$ increases, the initial solution should be further away from outliers.\n\n3. When $0 < \\sum_{j=1}^{d} |w_j^T o_i|^2 < 1$, the performance of online RPCA is improved by at least $2\\sqrt{1/4 - \\sum_{i=1}^{\\lambda b}\\sum_{j=1}^{d} (w_j^{(0)T} o_i)^2 (1-\\Gamma_o) / ((1-\\lambda)b)}$ from its initial solution. Hence, when the initial solution is further away from the outliers, the outlier fraction is smaller, or the outliers are closer to the groundtruth subspace, the improvement is more significant. Moreover, observe that given a proper initial solution, even if $\\lambda = 0.5$, the performance of online RPCA still has a positive lower bound. Therefore, the breakdown point of online RPCA is 50%, the highest that any algorithm can achieve.\n\nDiscussion on the initial condition. In Theorem 1 and Theorem 2, a mild condition is imposed on the initial estimate. In practice, the initial estimate can be obtained by applying batch RPCA [6] or HRPCA [24] on a small subset of the data. These batch methods are able to provide an initial estimate with a performance guarantee, which may satisfy the initial condition.\n\n5 Proof of The Results\n\nWe briefly explain the proof of Theorem 1: we first show that when the PCs estimation is being improved, the variance of outliers along the PCs will keep decreasing. Then we demonstrate that each PCs update conducted by Algorithm 1 produces a better PCs estimation and decreases the impact of outliers. Such improvement will continue until convergence, and the final performance has bounded deviation from the optimum.\n\nWe provide here some concentration lemmas which are used in the proof of Theorem 1. The proofs of these lemmas are provided in the supplementary material. We first show that with high probability, both the largest and smallest eigenvalues of the signals $x_i$ in the original space converge to 1. This result is adopted from [24].\n\nLemma 1. 
There exists a constant $c$ that only depends on $\\mu$ and $d$, such that for all $\\gamma > 0$ and $b$ signals $\\{x_i\\}_{i=1}^{b}$, the following holds with high probability:\n\n$$\\sup_{w \\in \\mathcal{S}_d} \\left|\\frac{1}{b}\\sum_{i=1}^{b} (w^T x_i)^2 - 1\\right| \\le \\epsilon,$$\n\nwhere $\\epsilon = c\\alpha\\sqrt{d \\log^3 b / b}$.\n\nThe next lemma is about the sampling process in Algorithm 1 from step b) to step d). Though the sampling process is without replacement and the sampled observations are not i.i.d., the following lemma provides the concentration of the sampled observations.\n\nLemma 2 (Operator-Bernstein inequality [7]). Let $\\{z_i'\\}_{i=1}^{m}$ be a subset of $Z = \\{z_i\\}_{i=1}^{t}$, which is formed by randomly sampling without replacement from $Z$, as in Algorithm 1. Then the following statement holds\n\n$$\\left|\\sum_{i=1}^{m} w^T z_i' - \\mathbb{E}\\left(\\sum_{i=1}^{m} w^T z_i'\\right)\\right| \\le \\delta$$\n\nwith probability larger than $1 - 2\\exp(-\\delta^2/4m)$.\n\nGiven the result in Lemma 1, we obtain the concentration properties of the authentic samples as stated in the following lemma [24].\n\nLemma 3. 
If there exists \u0001 such that\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) \u2264 \u0001,\n\nsup\nw\u2208Sd\n\n|wT xi|2 \u2212 1\n\nt(cid:88)\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1\n\nt\n\ni=1\n\n6\n\n\fand the observations zi are normalized by (cid:96)2-norm, then for any w1,\u00b7\u00b7\u00b7 , wd \u2208 Sp, the following\nholds:\n\n(1 \u2212 \u0001)H(w) \u2212 2(cid:112)(1 + \u0001)H(w)/s\nt(cid:88)\nd(cid:88)\n\u2264 1\nt\nj A(cid:107)2 and s is the signal noise ratio.\nj=1 (cid:107)wT\n\nj zi)2 \u2264 (1 + \u0001)H(w) + 2(cid:112)(1 + \u0001)H(w)/s + 1/s2\n\n(1/s \u2212 1)2\n\n(1/s + 1)2\n\n(wT\n\nj=1\n\ni=1\n\n,\n\nwhere H(w) =(cid:80)d\n\nBased on Lemma 2 and Lemma 3, we obtain the following concentration results for the selected\nobservations in the Algorithm 1.\nLemma 4. If there exists \u0001 such that\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1\n\nt\n\nt(cid:88)\n\ni=1\n\nsup\nw\u2208Sd\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) \u2264 \u0001,\n\n|wT xi|2 \u2212 1\n\n\u2212 \u03b4\n\n\u2264 1\nt\n\nthen for any\n\ni=1 as in Algorithm 1,\n\nand the observations {z(cid:48)\ni}m\ni=1 are sampled from {zi}d\nw1, . . . , wd \u2208 Sp, with large probability, the following holds:\n\n(1 \u2212 \u0001)H(w) \u2212 2(cid:112)(1 + \u0001)H(w)/s\nd(cid:88)\nt(cid:88)\nwhere H(w) (cid:44)(cid:80)d\nj A(cid:107)2, s is the signal noise ratio and m is the number of sampled obser-\nj=1 (cid:107)wT\n\ni)2 \u2264 (1 + \u0001)H(w) + 2(cid:112)(1 + \u0001)H(w)/s + 1/s2\n\n(1/s + 1)2b/m\nj z(cid:48)\n(wT\n\n(1/s \u2212 1)2b/m\n\nvations in each batch and \u03b4 > 0 is a small constant.\nWe denote the set of accepted authentic samples as Zt and the set of accepted outliers as Ot from the\nt-th small batch. In the following lemma, we provide the estimation of number of accepted authentic\nsamples |Zt| and outliers |Ot|.\nLemma 5. 
For the current obtained principal components {w(t\u22121)\nauthentic samples |Zt| and outliers |Ot| satisfy\n\n}d\nj=1, the number of the accepted\n\n+ \u03b4,\n\nj=1\n\ni=1\n\nj\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) \u2264 \u03b4\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)|Zt|\n\nb\n\n(1\u2212\u03bb)b(cid:88)\n\nd(cid:88)\n\ni=1\n\nj=1\n\n\u2212 1\nb\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) \u2264 \u03b4 and\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)|Ot|\n\nb\n\n\u03bbb(cid:88)\n\nd(cid:88)\n\ni=1\n\nj=1\n\n\u2212 1\nb\n\n(w(t\u22121)\n\nj\n\nT\n\nzi)2\n\n(w(t\u22121)\n\nj\n\nT\n\noi)2\n\nwith probability at least 1 \u2212 e\u22122\u03b42b. Here \u03b4 > 0 is a small constant, \u03bb is the outlier fraction and b\nis the size of the small batch.\n\nFrom the above lemma, we can see that when the batch size b is suf\ufb01ciently large, the above estima-\ntion for |Zt| and |Ot| holds with large probability. In the following lemma, we show that when the\nalgorithm improves the PCs estimation, the impact of outliers will be decreased accordingly.\nLemma 6. For an outlier oi, an arbitrary orthogonal basis {wj}d\n{wj}d\n\nj oi and(cid:80)d\nj oi is a monotonically decreasing function of(cid:80)d\n\nj=1 and the groundtruth basis\nj oi, the\nj=1 wT\nj=1 wT\n\nj=1 which satisfy that(cid:80)d\n\nj wj \u2265 (cid:80)d\n\nj oi \u2265 (cid:80)d\n\nvalue of(cid:80)d\n\nj=1 wT\n\nj=1 wT\n\nj=1 wT\n\nj=1 wT\n\nj wj.\n\nBeing equipped by the above lemmas, we can proceed to prove Theorem 1. The details of the proof\nis deferred to the supplementary material due to the space limit.\n\n6 Simulations\n\nThe numerical study is aimed to illustrate the performance of online robust PCA algorithm. We\nfollow the data generation method in [24] to randomly generate a p \u00d7 d matrix A and then scale its\n\n7\n\n\fleading singular value to s, which is the signal noise ratio. 
A $\\lambda$ fraction of outliers are generated. Since it is hard to determine the most adversarial outlier distribution, in simulations we generate outliers that concentrate on several directions deviating from the groundtruth subspace. This makes a rather adversarial case and is suitable for investigating the robustness of the proposed online RPCA algorithm. In the simulations, in total $T$ = 10,000 samples are generated to form the sample sequence. For each parameter setting, we report the average result of 20 tests and the standard deviation. The initial solution is obtained by performing batch HRPCA [24] on the first batch. Simulation results for $p = 100$, $d = 1$, $s = 2$ and outliers distributed on one direction are shown in Figure 1. It takes around 0.5 seconds for the proposed online RPCA to process 10,000 samples of 100 dimensions, on a PC with a quad-core 2.83GHz CPU and 8GB of RAM. In contrast, HRPCA costs 237 seconds to achieve E.V. = 0.99. More simulation results for the $d > 1$ case are provided in the supplementary material due to the space limit.\n\nFrom the results, we can make the following observations. Firstly, online RPCA can improve the PC estimation steadily. With more samples being revealed, the E.V. of the online RPCA outputs keeps increasing. Secondly, the performance of online RPCA is rather robust to outliers. For example, the final result converges to E.V. $\\approx 0.95$ (HRPCA achieves 0.99) even with $\\lambda = 0.3$ for a relatively low signal noise ratio $s = 2$ as shown in Figure 1. To more clearly demonstrate the robustness of online RPCA to outliers, we implement the online PCA proposed in [23] as a baseline for the $s = 2$ case. The results are presented in Figure 1, from which we can observe that the performance of online PCA drops due to its sensitivity to newly arriving outliers. 
When the outlier fraction $\\lambda \\ge 0.1$, the online PCA cannot recover the true PC directions and the performance is as low as 0.\n\n[Figure 1 plots: eight panels for $\\lambda$ = 0.01, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30; x-axis: # batches (0 to 50); y-axis: E.V. (0 to 1); curves: Online RPCA, Online PCA.]\n\nFigure 1: Performance comparison of online RPCA (blue line) with online PCA (red line). Here $s = 2$, $p = 100$, $T$ = 10,000, $d = 1$. The outliers are distributed on a single direction.\n\n7 Conclusions\n\nIn this work, we proposed an online robust PCA (online RPCA) algorithm for samples corrupted by outliers. The online RPCA alternates between standard PCA for updating PCs and probabilistic selection of the new samples which alleviates the impact of outliers. Theoretical analysis showed that the online RPCA could improve the PC estimation steadily and provided results with bounded deviation from the optimum. To the best of our knowledge, this is the first work to investigate such online robust PCA problem with theoretical performance guarantee. The proposed online robust PCA algorithm can be applied to handle challenges imposed by the modern big data analysis.\n\nAcknowledgement\n\nJ. Feng and S. Yan are supported by the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office. H. Xu is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112 and NUS startup grant R-265-000-384-133. S. Mannor is partially supported by the Israel Science Foundation (under grant 920/12) and by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).\n\nReferences\n\n[1] J.R. Bunch and C.P. Nielsen. Updating the singular value decomposition. Numerische Mathematik, 1978.\n[2] E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? ArXiv:0912.3599, 2009.\n[3] C. Croux and A. Ruiz-Gazen. A fast algorithm for robust principal components based on projection pursuit. In COMPSTAT, 1996.\n[4] C. Croux and A. Ruiz-Gazen. High breakdown estimators for principal components: the projection-pursuit approach revisited. Journal of Multivariate Analysis, 2005.\n[5] A. d\u2019Aspremont, F. Bach, and L. Ghaoui. Optimal solutions for sparse principal component analysis. JMLR, 2008.\n[6] J. Feng, H. Xu, and S. Yan. Robust PCA in high-dimension: A deterministic approach. In ICML, 2012.\n[7] David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.\n[8] P. Hall, D. Marshall, and R. Martin. Merging and splitting eigenspace models. TPAMI, 2000.\n[9] Jun He, Laura Balzano, and John Lui. Online robust subspace tracking from partial information. arXiv preprint arXiv:1109.3827, 2011.\n[10] P. Honeine. Online kernel principal component analysis: a reduced-order model. TPAMI, 2012.\n[11] P.J. Huber, E. Ronchetti, and MyiLibrary. Robust statistics. John Wiley & Sons, New York, 1981.\n[12] M. Hubert, P.J. Rousseeuw, and K.V. Branden. Robpca: a new approach to robust principal component analysis. Technometrics, 2005.\n[13] M. Hubert, P.J. Rousseeuw, and S. Verboven. A fast method for robust principal components with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems, 2002.\n[14] G. Li and Z. Chen. 
Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and monte carlo. Journal of the American Statistical Association, 1985.\n[15] Y. Li. On incremental and robust subspace learning. Pattern recognition, 2004.\n[16] Michael W Mahoney. Randomized algorithms for matrices and data. arXiv preprint arXiv:1104.5557, 2011.\n[17] R.A. Maronna. Robust m-estimators of multivariate location and scatter. The annals of statistics, 1976.\n[18] S. Ozawa, S. Pang, and N. Kasabov. A modified incremental principal component analysis for on-line learning of feature space and classifier. PRICAI, 2004.\n[19] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 1901.\n[20] C. Qiu, N. Vaswani, and L. Hogben. Recursive robust pca or recursive sparse recovery in large but structured noise. arXiv preprint arXiv:1211.3754, 2012.\n[21] P.J. Rousseeuw. Least median of squares regression. Journal of the American statistical association, 1984.\n[22] P.J. Rousseeuw and A.M. Leroy. Robust regression and outlier detection. John Wiley & Sons Inc, 1987.\n[23] M.K. Warmuth and D. Kuzmin. Randomized online pca algorithms with regret bounds that are logarithmic in the dimension. JMLR, 2008.\n[24] H. Xu, C. Caramanis, and S. Mannor. Principal component analysis with contaminated data: The high dimensional case. In COLT, 2010.\n[25] H. Zhao, P.C. Yuen, and J.T. Kwok. A novel incremental principal component analysis and its application for face recognition. TSMC-B, 2006.\n", "award": [], "sourceid": 447, "authors": [{"given_name": "Jiashi", "family_name": "Feng", "institution": "NUS"}, {"given_name": "Huan", "family_name": "Xu", "institution": "NUS"}, {"given_name": "Shie", "family_name": "Mannor", "institution": "Technion"}, {"given_name": "Shuicheng", "family_name": "Yan", "institution": "National University of Singapore"}]}