{"title": "Optimal Shrinkage of Singular Values Under Random Data Contamination", "book": "Advances in Neural Information Processing Systems", "page_first": 6160, "page_last": 6170, "abstract": "A low rank matrix X has been contaminated by uniformly distributed noise, missing values, outliers and corrupt entries. Reconstruction of X from the singular values and singular vectors of the contaminated matrix Y is a key problem in machine learning, computer vision and data science. In this paper we show that common contamination models (including arbitrary combinations of uniform noise, missing values, outliers and corrupt entries) can be described efficiently using a single framework. We develop an asymptotically optimal algorithm that estimates X by manipulation of the singular values of Y, which applies to any of the contamination models considered. Finally, we find an explicit signal-to-noise cutoff, below which estimation of X from the singular value decomposition of Y must fail, in a well-defined sense.", "full_text": "Optimal Shrinkage of Singular Values Under Random Data Contamination\n\nDanny Barash\nSchool of Computer Science and Engineering\nHebrew University, Jerusalem, Israel\ndanny.barash@mail.huji.ac.il\n\nMatan Gavish\nSchool of Computer Science and Engineering\nHebrew University, Jerusalem, Israel\ngavish@cs.huji.ac.il\n\nAbstract\n\nA low rank matrix X has been contaminated by uniformly distributed noise, missing values, outliers and corrupt entries. Reconstruction of X from the singular values and singular vectors of the contaminated matrix Y is a key problem in machine learning, computer vision and data science. In this paper, we show that common contamination models (including arbitrary combinations of uniform noise, missing values, outliers and corrupt entries) can be described efficiently using a single framework. 
We develop an asymptotically optimal algorithm that estimates X by manipulation of the singular values of Y, which applies to any of the contamination models considered. Finally, we find an explicit signal-to-noise cutoff, below which estimation of X from the singular value decomposition of Y must fail, in a well-defined sense.\n\n1 Introduction\n\nReconstruction of low-rank matrices from noisy and otherwise contaminated data is a key problem in machine learning, computer vision and data science. Well-studied problems such as dimension reduction [3], collaborative filtering [24, 28], topic models [13], video processing [21], face recognition [35], predicting preferences [26], analytical chemistry [29] and background-foreground separation [4] all reduce, under popular approaches, to low-rank matrix reconstruction. A significant part of the literature on these problems is based on the singular value decomposition (SVD) as the underlying algorithmic component, see e.g. [7, 19, 23].\nUnderstanding and improving the behavior of the SVD in the presence of random data contamination therefore arises as a crucially important problem in machine learning. While this is certainly a classical problem [14, 17, 20], it remains of significant interest, owing in part to the emergence of low-rank matrix models for matrix completion and collaborative filtering [9, 34].\nLet X be an m-by-n unknown low-rank matrix of interest (m ≤ n), and assume that we only observe the data matrix Y, which is a contaminated or noisy version of X. Let\n\nY = Σ_{i=1}^m y_i u_i v_i′   (1)\n\nbe the SVD of the data matrix Y. Any algorithm based on the SVD essentially aims to obtain an estimate for the target matrix X from (1). 
Most practitioners simply form the Truncated SVD (TSVD) estimate [18]\n\nX̂_r = Σ_{i=1}^r y_i u_i v_i′   (2)\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nwhere r is an estimate of rank(X), whose choice in practice tends to be ad hoc [15].\nRecently, [10, 16, 32] have shown that under white additive noise, it is useful to apply a carefully designed shrinkage function η : R → R to the data singular values, and proposed estimators of the form\n\nX̂_η = Σ_{i=1}^n η(y_i) u_i v_i′ .   (3)\n\nSuch estimators are extremely simple to use, as they involve only simple manipulation of the data singular values. Interestingly, in the additive white noise case, it was shown that a unique optimal shrinkage function η(y) exists, which asymptotically delivers the same performance as the best possible rotation-invariant estimator based on the data Y [16]. Singular value shrinkage thus emerged as a simple yet highly effective method for improving the SVD in the presence of white additive noise, with the unique optimal shrinker as a natural choice for the shrinkage function. A typical form of optimal singular value shrinker is shown in Figure 1 below, left panel.\nShrinkage of singular values, an idea that can be traced back to Stein's groundbreaking work on covariance estimation from the 1970's [33], is a natural generalization of the classical TSVD. Indeed, X̂_r is equivalent to shrinkage with the hard thresholding shrinker η(y) = y · 1_{y ≥ λ}, as (2) is equivalent to\n\nX̂_λ = Σ_{i=1}^n 1_{y_i ≥ λ} y_i u_i v_i′   (4)\n\nwith a specific choice of the so-called hard threshold λ. 
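The estimators (2), (3) and (4) are all one-line manipulations of the SVD. The following NumPy sketch (ours; the paper's reference code is Matlab) implements the generic shrinkage estimator of (3) and illustrates that the TSVD of (2) coincides with hard thresholding (4) for a suitable threshold; the function names are illustrative, not from the paper:

```python
import numpy as np

def shrink_svd(Y, eta):
    """Apply a scalar shrinker eta to the singular values of Y, as in (3)."""
    U, y, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * eta(y)) @ Vt  # sum_i eta(y_i) u_i v_i'

def tsvd(Y, r):
    """Rank-r truncated SVD, as in (2): keep only the top r singular values."""
    U, y, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * y[:r]) @ Vt[:r, :]

def hard_threshold(Y, lam):
    """Hard thresholding, as in (4): keep singular values at or above lam."""
    return shrink_svd(Y, lambda y: np.where(y >= lam, y, 0.0))
```

Choosing any lam that separates the top r data singular values from the rest makes `hard_threshold(Y, lam)` identical to `tsvd(Y, r)`; the analysis below replaces such ad hoc choices with an optimal one.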
While the choice of the rank r, namely the truncation point of the TSVD, is often ad hoc and based on gut-feeling methods such as the Scree Plot method [11], its equivalent formulation, namely hard thresholding of singular values, allows formal and systematic analysis. In fact, restricting attention to hard thresholds alone, [15] has shown that under white additive noise there exists a unique asymptotically optimal choice of hard threshold for singular values. The optimal hard threshold is a systematic, rational choice for the number of singular values that should be included in a truncated SVD of noisy data. [27] has proposed an algorithm that finds η* in the presence of additive noise and missing values, but has not derived an explicit shrinker.\n\n1.1 Overview of main results\n\nIn this paper, we extend this analysis to common data contaminations that go well beyond additive white noise, including an arbitrary combination of additive noise, multiplicative noise, missing-at-random entries, uniformly distributed outliers and uniformly distributed corrupt entries.\nThe primary contribution of this paper is a formal proof that there exists a unique asymptotically optimal shrinker for singular values under uniformly random data contaminations, as well as a unique asymptotically optimal hard threshold. Our results are based on a novel, asymptotically precise description of the effect of these data contaminations on the singular values and the singular vectors of the data matrix, extending the technical contribution of [16, 27, 32] to the setting of general uniform data contamination.\n\nGeneral contamination model. We introduce the model\n\nY = A ⊙ X + B   (5)\n\nwhere X is the target matrix to be recovered, and A, B are random matrices with i.i.d. entries. 
Here, (A ⊙ B)_{i,j} = A_{i,j} B_{i,j} is the Hadamard (entrywise) product of A and B. Assume that A_{i,j} iid∼ (μ_A, σ_A²), meaning that the entries of A are drawn i.i.d. from a distribution with mean μ_A and variance σ_A², and that B_{i,j} iid∼ (0, σ_B²). In Section 2 we show that for various choices of the matrices A and B, this model represents a broad range of uniformly distributed random contaminations, including an arbitrary combination of additive noise, multiplicative noise, missing-at-random entries, uniformly distributed outliers and uniformly distributed corrupt entries. As a simple example, if B ≡ 0 and P(A_{i,j} = 1) = κ, then Y simply has missing-at-random entries.\n\nTo quantify what makes a \u201cgood\u201d singular value shrinker η for use in (3), we use the standard Mean Square Error (MSE) metric\n\nL(η|X) = || X̂_η(Y) − X ||_F² .\n\nUsing the methods of [16], our results can easily be extended to other error metrics, such as the nuclear norm or operator norm losses. Roughly speaking, an optimal shrinker η* has the property that, asymptotically as the matrix size grows,\n\nL(η*|X) ≤ L(η|X)\n\nfor any other shrinker η and any low-rank target matrix X.\nThe design of optimal shrinkers requires a subtle understanding of the random fluctuations of the data singular values y_1, . . . , y_n, which are caused by the random contamination. Such results in random matrix theory are generally hard to prove, as there are nontrivial correlations between y_i and y_j, i ≠ j. Fortunately, in most applications it is very reasonable to assume that the target matrix X is low rank. 
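As a concrete illustration of model (5), here is a minimal sketch (our own; the sizes, seed and κ are arbitrary choices, not from the paper) that draws a contaminated matrix Y = A ⊙ X + B for the missing-at-random special case B ≡ 0, P(A_{i,j} = 1) = κ:

```python
import numpy as np

rng = np.random.default_rng(1)

def contaminate(X, A, B):
    """Model (5): Y = A * X + B, with '*' the Hadamard (entrywise) product."""
    return A * X + B

# Missing-at-random as the special case B = 0, P(A_ij = 1) = kappa:
kappa = 0.7
X = np.outer(rng.standard_normal(50), rng.standard_normal(60))  # rank-1 target
A = (rng.random(X.shape) < kappa).astype(float)                 # Bernoulli(kappa) mask
Y = contaminate(X, A, np.zeros_like(X))
```

Each entry of Y either equals the corresponding entry of X (with probability κ) or is missing, i.e. zeroed out; the other primitive and composite modes of Section 2 are obtained by different draws of A and B.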
This allows us to overcome this difficulty by following [15, 27, 32] and considering an asymptotic model for low-rank X, inspired by Johnstone's Spiked Covariance Model [22], in which the correlations between y_i and y_j, for i ≠ j, vanish asymptotically.\nWe state our main results informally at first. The first main result of this paper is the existence of a unique asymptotically optimal hard threshold λ* in (4). Importantly, as E(Y) = μ_A X, to apply hard thresholding to Y = A ⊙ X + B we must from now on define\n\nX̂_λ = (1/μ_A) Σ_{i=1}^n 1_{y_i > λ} y_i u_i v_i′ .\n\nTheorem 1. (Informal.) Let X be an m-by-n low-rank matrix and assume that we observe the contaminated data matrix Y given by the general contamination model (5). Then there exists a unique optimal (def. 3) hard threshold λ* for the singular values of Y, given by\n\nλ* = σ_B √( (c + 1/c)(c + β/c) )\n\nwhere β = m/n and c = √( (1 + β + √(1 + 14β + β²)) / 2 ).\nOur second main result is the existence of a unique asymptotically optimal shrinkage function η* in (3). We calculate this shrinker explicitly:\nTheorem 2. (Informal.) Assume everything as in Theorem 1. Then there exists a unique optimal (def. 3) shrinker η* for the singular values of Y, given by\n\nη*(y) = (σ_B² / (y μ_A)) √( ((y/σ_B)² − β − 1)² − 4β )   if y ≥ σ_B(1 + √β),   and   η*(y) = 0   if y < σ_B(1 + √β).\n\nWe also discover that for each contamination model, there is a critical signal-to-noise cutoff, below which X cannot be reconstructed from the singular values and vectors of Y. 
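The quantities in Theorems 1 and 2 are explicit and cheap to evaluate. The sketch below is our own NumPy transcription of the informal formulas (not the authors' reference code); note that the shrinker vanishes at and below the bulk edge σ_B(1 + √β):

```python
import numpy as np

def optimal_hard_threshold(beta, sigma_B):
    """lambda* of Theorem 1 (informal statement), transcribed by hand."""
    c = np.sqrt((1 + beta + np.sqrt(1 + 14 * beta + beta ** 2)) / 2)
    return sigma_B * np.sqrt((c + 1 / c) * (c + beta / c))

def optimal_shrinker(y, beta, sigma_B, mu_A):
    """eta* of Theorem 2: zero at and below the bulk edge sigma_B(1 + sqrt(beta))."""
    y = np.asarray(y, dtype=float)
    edge = sigma_B * (1 + np.sqrt(beta))
    t2 = (y / sigma_B) ** 2 - beta - 1
    inner = np.maximum(t2 ** 2 - 4 * beta, 0.0)  # clip tiny negative rounding at the edge
    return np.where(y >= edge, (sigma_B ** 2 / (y * mu_A)) * np.sqrt(inner), 0.0)
```

Applying `optimal_shrinker` to the data singular values inside the estimator (3) gives the proposed reconstruction; `optimal_hard_threshold` plays the same role for the estimator (4).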
Specifically, let η_0 be the zero singular value shrinker, η_0(y) ≡ 0, so that X̂_{η_0}(Y) ≡ 0. Define the critical signal level for a shrinker η by\n\nx_critical(η) = inf_x { x : L(η|X) < L(η_0|X) }\n\nwhere X = x ũ ṽ′ is an arbitrary rank-1 matrix with singular value x. In other words, x_critical(η) is the smallest singular value of the target matrix for which η still outperforms the trivial zero shrinker η_0. As we show in Section 4, a target matrix X with a singular value below x_critical(η) cannot be reliably reconstructed using η. The critical signal level for the optimal shrinker η* is of special importance, since a target matrix X with a singular value below x_critical(η*) cannot be reliably reconstructed using any shrinker η. Restricting attention to hard thresholds only, we define x_critical(λ), the critical level for a hard threshold, similarly. Again, singular values of X that fall below x_critical(λ*) cannot be reliably reconstructed using any hard threshold.\nOur third main result is the explicit calculation of these critical signal levels:\n\nTheorem 3. (Informal.) Assume everything as in Theorem 1 and let c be as in Theorem 1. Let η* be the optimal shrinker from Theorem 2 and let λ* be the optimal hard threshold from Theorem 1. The critical signal levels for η* and λ* are given by:\n\nx_critical(η*) = (σ_B/μ_A) · β^{1/4}\nx_critical(λ*) = (σ_B/μ_A) · c .\n\nFinally, one might ask what the improvement is in terms of the mean square error that is guaranteed by using the optimal shrinker and optimal threshold. As discussed below, existing methods are either infeasible in terms of running time on medium and large matrices, or lack a theory that can predict the reconstruction mean square error. 
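The cutoffs of Theorem 3 are likewise closed-form; the small helper below (our own, reusing the constant c of Theorem 1) evaluates both:

```python
import numpy as np

def critical_levels(beta, sigma_B, mu_A):
    """x_critical for the optimal shrinker and the optimal threshold (Theorem 3)."""
    c = np.sqrt((1 + beta + np.sqrt(1 + 14 * beta + beta ** 2)) / 2)
    return (sigma_B / mu_A) * beta ** 0.25, (sigma_B / mu_A) * c
```

Since β^{1/4} < c for every 0 < β ≤ 1, the shrinker cutoff always sits below the threshold cutoff: there is a band of signal levels that the optimal shrinker can exploit but no hard threshold can.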
For lack of a better candidate, we compare the optimal shrinker and optimal threshold to the default method, namely, TSVD.\n\nTheorem 4. (Informal.) Consider β = 1, and denote the worst-case mean square error of TSVD, η* and λ* by M_TSVD, M_η* and M_λ*, respectively, over a target matrix of low rank r. Then\n\nM_TSVD = 5r (σ_B/μ_A)²\nM_η* = 2r (σ_B/μ_A)²\nM_λ* = 3r (σ_B/μ_A)² .\n\nIndeed, the optimal shrinker offers a significant performance improvement (specifically, an improvement of 3r(σ_B/μ_A)²) over the TSVD baseline.\n\nFigure 1: Left: Optimal shrinker for additive noise and missing-at-random contamination. Right: Phase plane for critical signal levels, see Section 6, Simulation 2.\n\nOur main results allow easy calculation of the optimal threshold, optimal shrinkage and signal-to-noise cutoffs for various specific contamination models. For example:\n\n1. Additive noise and missing-at-random. Let X be an m-by-n low-rank matrix. Assume that some entries are completely missing and the rest suffer white additive noise. Formally, we observe the contaminated matrix\n\nY_{i,j} = X_{i,j} + Z_{i,j} w.p. κ,   and   Y_{i,j} = 0 w.p. 1 − κ,\n\nwhere Z_{i,j} iid∼ (0, σ²), namely, follows an unknown distribution with mean 0 and variance σ². Let β = m/n. Theorem 1 implies that in this case, the optimal hard threshold for the singular values of Y is\n\nλ* = √(σ²κ) · √( (c + 1/c)(c + β/c) )\n\nwhere c = √( (1 + β + √(1 + 14β + β²)) / 2 ). In other words, the optimal location (w.r.t. mean square error) to truncate the singular values of Y, in order to recover X, is given by λ*. 
The optimal shrinker from Theorem 2 for this contamination mode may be calculated similarly, and is shown in Figure 1, left panel. By Theorem 4, the improvement in mean square error obtained by using the optimal shrinker, over the TSVD baseline, is 3rσ²/κ, quite a significant improvement.\n\n2. Additive noise and corrupt-at-random. Let X be an m-by-n low-rank matrix. Assume that some entries are irrecoverably corrupt (replaced by random entries), and the rest suffer white additive noise. Formally,\n\nY_{i,j} = X_{i,j} + Z_{i,j} w.p. κ,   and   Y_{i,j} = W_{i,j} w.p. 1 − κ,\n\nwhere Z_{i,j} iid∼ (0, σ²), W_{i,j} iid∼ (0, τ²), and τ is typically large. Let σ̃ = √( κσ² + (1 − κ)τ² ). The optimal shrinker, which should be applied to the singular values of Y, is given by:\n\nη*(y) = (σ̃² / (y κ)) √( ((y/σ̃)² − β − 1)² − 4β )   if y ≥ σ̃(1 + √β),   and   η*(y) = 0   if y < σ̃(1 + √β).\n\nBy Theorem 4, the improvement in mean square error, obtained by using the optimal shrinker, over the TSVD baseline, is 3r(κσ² + (1 − κ)τ²)/κ².\n\n1.2 Related Work\n\nThe general data contamination model we propose includes as special cases several modes extensively studied in the literature, including missing-at-random and outliers. While it is impossible to propose a complete list of algorithms to handle such data, we offer a few pointers, organized around the notions of robust principal component analysis (PCA) and matrix completion. 
To the best of our knowledge, the precise effect of general data contamination on the SVD (or the closely related PCA) has not been documented thus far. The approach we propose, based on careful manipulation of the data singular values, enjoys three distinct advantages. One, its running time is not prohibitive; indeed, it involves a small yet important modification on top of the SVD or TSVD, so that it is available whenever the SVD is available. Two, it is well understood and its performance (say, in mean square error) can be reliably predicted by the available theory. Three, to the best of our knowledge, none of the approaches below have become mainstream, and most practitioners still turn to the SVD, even in the presence of data contamination. Our approach can easily be used in practice, as it relies on the well-known and very widely used SVD, and can be implemented as a simple modification on top of existing SVD implementations.\n\nRobust Principal Component Analysis (RPCA). In RPCA, one assumes Y = X + W, where X is the low rank target matrix and W is a sparse outliers matrix. Classical approaches such as influence functions [20], multivariate trimming [17] and random sampling techniques [14] lack a formal theoretical framework and are not well understood. More modern approaches based on convex optimization [9, 34] proposed reconstructing X from Y via the nuclear norm minimization\n\nmin_X ||X||_* + λ ||Y − X||_1 ,\n\nwhose runtime and memory requirements are both prohibitively large for medium and large matrices.\n\nMatrix Completion. There are numerous heuristic approaches for data analysis in the presence of missing values [5, 30, 31]. To the best of our knowledge, there are no formal guarantees of their performance. When the target matrix is known to be low rank, the reconstruction problem is known as matrix completion. 
[7–9] and numerous other authors have shown that a semi-definite program may be used to stably recover the target matrix, even in the presence of additive noise. Here too, the runtime and memory requirements are prohibitively large for medium and large matrices, making these algorithms infeasible in practice.\n\n2 A Unified Model for Uniformly Distributed Contamination\n\nContamination modes encountered in practice are best described by a combination of primitive modes, shown in Table 1 below. These primitive contamination modes fit into a single template:\n\nDefinition 1. Let A and B be two random variables, and assume that all moments of A and B are bounded. Define the contamination link function\n\nf_{A,B}(x) = Ax + B .\n\nGiven a matrix X, define the corresponding contaminated matrix Y with entries\n\nY_{i,j} indep.∼ f_{A,B}(X_{i,j}) .   (6)\n\nNow observe that each of the primitive modes above corresponds to a different choice of the random variables A and B, as shown in Table 1. Specifically, each of the primitive modes is described by a different assignment to A and B. We employ three different random variables in these assignments: Z iid∼ (0, σ²/n), describing multiplicative or additive noise; W iid∼ (0, τ²/n), describing a large \u201coutlier\u201d measurement; and M iid∼ Bernoulli(κ), describing a random choice of \u201cdefective\u201d entries, such as a missing value, an outlier and so on.\n\nTable 1: Primitive modes fit into the model (6). 
By convention, Y is m-by-n, Z iid∼ (0, σ²/n) denotes a noise random variable, W iid∼ (0, τ²/n) denotes an outlier random variable and M iid∼ Bernoulli(κ) is a contaminated-entry random variable.\n\nmode | model | A | B | levels\ni.i.d. additive noise | Y_{i,j} = X_{i,j} + Z_{i,j} | 1 | Z | σ\ni.i.d. multiplicative noise | Y_{i,j} = X_{i,j} Z_{i,j} | Z | 0 | σ\nmissing-at-random | Y_{i,j} = M_{i,j} X_{i,j} | M | 0 | κ\noutliers-at-random | Y_{i,j} = X_{i,j} + M_{i,j} W_{i,j} | 1 | MW | κ, τ\ncorruption-at-random | Y_{i,j} = M_{i,j} X_{i,j} + (1 − M_{i,j}) W_{i,j} | M | (1 − M)W | κ, τ\n\nActual datasets rarely demonstrate a single primitive contamination mode. To adequately describe contamination observed in practice, one usually needs to combine two or more of the primitive contamination modes into a composite mode. While there is no point in enumerating all possible combinations, Table 2 offers a few notable composite examples, using the framework (6). Many other examples are possible, of course.\n\n3 Signal Model\n\nFollowing [32] and [15], as we move toward our formal results we consider an asymptotic model inspired by Johnstone's Spiked Model [22]. Specifically, we consider a sequence of increasingly large target matrices Xn, and corresponding data matrices Yn iid∼ f_{An,Bn}(Xn). We make the following assumptions regarding the matrix sequence {Xn}:\n\nA1 Limiting aspect ratio: The sequence of matrix dimensions mn × n converges: mn/n → β as n → ∞. To simplify the results, we assume 0 < β ≤ 1.\n\nA2 Fixed signal column span: Let the rank r > 0 be fixed and choose a vector x ∈ R^r with coordinates x = (x1, . . . , xr) such that x1 > . . . > xr > 0. Assume that for all n,\n\nXn = Ũn diag(x1, . . . , xr) Ṽn′\n\nis an arbitrary singular value decomposition of Xn.\n\nTable 2: Some examples of composite contamination modes and how they fit into the model (6). 
Z, W, M are the same as in Table 1.\n\nmode | A | B | levels\nadditive noise and missing-at-random | M | ZM | σ, κ\nadditive noise and corrupt-at-random | M | ZM + W(1 − M) | σ, κ, τ\nmultiplicative noise and corrupt-at-random | ZM | W(1 − M) | σ, κ, τ\nadditive noise and outliers | 1 | Z + W(1 − M) | σ, κ, τ\n\nA3 Incoherence of the singular vectors of Xn: We make one of the following two assumptions regarding the singular vectors of Xn:\n\nA3.1 Xn is random with an orthogonally invariant distribution. Specifically, Ũn and Ṽn follow the Haar distribution on orthogonal matrices of size mn and n, respectively.\n\nA3.2 The singular vectors of Xn are non-concentrated. Specifically, each left singular vector ũn,i of Xn (the i-th column of Ũn) and each right singular vector ṽn,j of Xn (the j-th column of Ṽn) satisfy¹\n\n||ũn,i||_∞ ≤ C log^D(mn)/√mn   and   ||ṽn,j||_∞ ≤ C log^D(n)/√n\n\nfor any i, j and fixed constants C, D.\n\nDefinition 2. (Signal model.) Let An iid∼ (μ_A, σ_A²/n) and Bn iid∼ (0, σ_B²/n) have bounded moments. Let Xn follow assumptions [A1]–[A3] above. We say that the matrix sequence Yn = f_{An,Bn}(Xn) follows our signal model, where f_{A,B}(X) is as in Definition 1. We further denote Xn = Σ_{i=1}^r x_i ũn,i ṽn,i′ for the singular value decomposition of Xn, and Yn = Σ_{i=1}^m yn,i un,i vn,i′ for the singular value decomposition of Yn.\n\n4 Main Results\n\nHaving described the contamination and the signal model, we can now formulate our main results. All proofs are deferred to the Supporting Information. Let Xn and Yn follow our signal model, Definition 2, and write x = (x1, . . . , xr) for the non-zero singular values of Xn. 
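Before the formal statements, the signal model of Definition 2 can be exercised numerically. The Monte-Carlo sketch below is our own (the sizes, seed, and the choice of the additive-noise-plus-missing-at-random mode of Table 2 are arbitrary); it checks the top data singular value of a rank-1 spike against its asymptotic location σ_B √((t + 1/t)(t + β/t)), t = μ_A x/σ_B, which is the inverse of the function x(y) appearing in Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, x, sigma, kappa = 200, 400, 3.0, 1.0, 0.8
beta = m / n

# Rank-1 target with Haar-like singular vectors (assumption A3.1).
u = rng.standard_normal(m)
u /= np.linalg.norm(u)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
X = x * np.outer(u, v)

# Additive noise + missing-at-random: A = M ~ Bernoulli(kappa), B = ZM (Table 2).
M = (rng.random((m, n)) < kappa).astype(float)
Z = rng.standard_normal((m, n)) * (sigma / np.sqrt(n))
Y = M * (X + Z)

# Effective parameters of model (5): mu_A = kappa, sigma_B = sigma * sqrt(kappa).
mu_A, sigma_B = kappa, sigma * np.sqrt(kappa)
t = mu_A * x / sigma_B
y1_pred = sigma_B * np.sqrt((t + 1 / t) * (t + beta / t))  # asymptotic spike location
y1_emp = np.linalg.svd(Y, compute_uv=False)[0]
```

At these (moderate) sizes the empirical top singular value already lands within a few percent of the asymptotic prediction, provided t exceeds the detection cutoff β^{1/4}.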
For a shrinker η, we write\n\nL∞(η|x) a.s.= lim_{n→∞} || X̂_η(Yn) − Xn ||_F² ,\n\nassuming the limit exists almost surely. The special case of hard thresholding at λ is denoted L∞(λ|x).\n\nDefinition 3. Optimal shrinker and optimal threshold. A shrinker η* is called optimal if\n\nL∞(η*|x) ≤ L∞(η|x)\n\nfor any shrinker η, any r ≥ 1 and any x = (x1, . . . , xr). Similarly, a threshold λ* is called optimal if L∞(λ*|x) ≤ L∞(λ|x) for any threshold λ, any r ≥ 1 and any x = (x1, . . . , xr).\n\nWith these definitions, our main results Theorem 1 and Theorem 2 become formal. To make Theorem 3 formal, we need the following lemma and definition.\n\nLemma 1. Decomposition of the asymptotic mean square error. Let Xn and Yn follow our signal model (Definition 2), write x = (x1, . . . , xr) for the non-zero singular values of Xn, and let η be the optimal shrinker. Then the limit L∞(η|x) exists almost surely, and L∞(η|x) a.s.= Σ_{i=1}^r L1(η|x_i), where\n\nL1(η|x) = x² ( 1 − (t⁴ − β)² / ((t⁴ + βt²)(t⁴ + t²)) )   for t ≥ β^{1/4},   and   L1(η|x) = x²   for t < β^{1/4},\n\nwhere t = (μ_A · x)/σ_B. 
Similarly, for a threshold λ we have L∞(λ|x) = Σ_{i=1}^r L1(λ|x_i) with\n\nL1(λ|x) = (σ_B/μ_A)² ( (t + 1/t)(t + β/t) − (t² − 2β/t²) )   for μ_A x ≥ x(λ),   and   L1(λ|x) = x²   for μ_A x < x(λ),\n\nwhere\n\nx(y) = (σ_B/(√2 μ_A)) · √( (y/σ_B)² − β − 1 + √( (1 + β − (y/σ_B)²)² − 4β ) )   for y ≥ σ_B(1 + √β) (equivalently t ≥ β^{1/4}), and x(y) = 0 otherwise.   (7)\n\n¹The incoherence assumption is widely used in related literature [6, 12, 27], and asserts that the singular vectors are spread out, so that X is not sparse and does not share singular subspaces with the noise.\n\nDefinition 4. Let η_0 be the zero singular value shrinker, η_0(y) ≡ 0, so that X̂_{η_0}(Y) ≡ 0. Let η be a singular value shrinker. The critical signal level for η is\n\nx_critical(η) = inf_x { x : L1(η|x) < L1(η_0|x) } .\n\nAs we can see, the asymptotic mean square error decomposes over the singular values of the target matrix, x1, . . . , xr. Each value x_i that falls below x_critical(η) is better estimated with the zero shrinker η_0 than with η. It follows that any x_i that falls below x_critical(η*), where η* is the optimal shrinker, cannot be reliably estimated by any shrinker η, and its corresponding data singular value y_i should simply be set to zero. This makes Theorem 3 formal.\n\n5 Estimating the model parameters\n\nIn practice, using the optimal shrinker we propose requires an estimate of the model parameters. 
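One standard route, detailed next, estimates σ_B by matching the median data singular value to the median μ_β of the Marčenko–Pastur law [15]. Since μ_β has no simple closed form, the sketch below (our own helper, not the authors' Matlab code) computes it by numerical integration and applies the resulting estimator to a pure-noise matrix whose entries have standard deviation σ_B (the convention under which σ̂_B = y_med/√(n μ_β)):

```python
import numpy as np

def mp_median(beta, grid=200_000):
    """Median of the Marcenko-Pastur law with aspect ratio 0 < beta <= 1."""
    lo, hi = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
    lam = np.linspace(lo, hi, grid)[1:-1]  # open interval avoids endpoint issues
    pdf = np.sqrt((hi - lam) * (lam - lo)) / (2 * np.pi * beta * lam)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    return lam[np.searchsorted(cdf, 0.5)]

def estimate_sigma_B(Y):
    """Median-matching estimate sigma_B-hat = y_med / sqrt(n * mu_beta), after [15]."""
    m, n = Y.shape
    if m > n:  # enforce the m <= n convention
        m, n = n, m
    y_med = np.median(np.linalg.svd(Y, compute_uv=False))
    return y_med / np.sqrt(n * mp_median(m / n))
```

Because the median singular value sits inside the noise bulk, a low-rank signal barely perturbs it, which is what makes this estimator robust in practice.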
In general, σ_B is easy to estimate from the data via a median-matching method [15], namely\n\nσ̂_B = y_med / √(n μ_β) ,\n\nwhere y_med is the median singular value of Y and μ_β is the median of the Marčenko–Pastur distribution with aspect ratio β. However, estimation of μ_A and σ_A must be considered on a case-by-case basis. For example, in the \u201cAdditive noise and missing-at-random\u201d mode (Table 2), σ_A ≡ 1 is known, and μ_A is estimated from the fraction of entries that are not missing.\n\n6 Simulation\n\nSimulations were performed to verify the correctness of our main results². For more details, see the Supporting Information.\n\n1. Critical signal level x_critical(λ*) under increasing noise. Figure 2, left panel, shows the number of data singular values y_i above x_critical(λ*), as a function of the fraction of missing values κ. Theorem 3 correctly predicts the exact values of κ at which the \u201cnext\u201d data singular value falls below x_critical(λ*).\n\n2. Phase plane for critical signal levels x_critical(η*) and x_critical(λ*). Figure 1, right panel, shows the (x, κ) plane, where x is the signal level and κ is the fraction of missing values. At each point in the plane, several independent data matrices were generated. The heatmap shows the fraction of the experiments in which the data singular value y_1 was above x_critical(η*) and x_critical(λ*). The overlaid graphs are theoretical predictions of the critical points.\n\n3. Brute-force verification of the optimal shrinker shape. Figure 2, right panel, shows the shape of the optimal shrinker (Theorem 2). We performed a brute-force search for the value of η(y) that produces the minimal mean square error. 
A brute force search, performed with\na relatively small matrix size, matches the asymptotic shape of the optimal shrinker.\n\n7 Conclusions\n\nSingular value shrinkage emerges as an effective method to reconstruct low-rank matrices from\ncontaminated data that is both practical and well understood. Through simple, carefully designed\nmanipulation of the data singular values, we obtain an appealing improvement in the reconstruction\nmean square error. While beyond our present scope, following [16], it is highly likely that the\noptimal shrinker we have developed offers the same mean square error, asymptotically, as the best\nrotation-invariant estimator based on the data, making it asymptotically the best SVD-based estimator\nfor the target matrix.\n\n2The full Matlab code that generated the \ufb01gures in this paper and in the Supporting Information is permanently\n\navailable at https://purl.stanford.edu/kp113fq0838.\n\n8\n\n\fFigure 2: Left: empirical validation of the predicted critical signal level (Simulation 1). Right:\nEmpirical validation of the optimal shrinker shape (Simulation 3).\n\nAcknowledgements\n\nDB was supported by Israeli Science Foundation grant no. 1523/16 and German-Israeli Foundation\nfor scienti\ufb01c research and development program no. I-1100-407.1-2015.\n\nReferences\n[1] Benaych-Georges, Florent and Nadakuditi, Raj Rao. The singular values and vectors of low\nrank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:\n120\u2013135, 2012. ISSN 0047259X.\n\n[2] Bloemendal, Alex, Erdos, Laszlo, Knowles, Antti, Yau, Horng Tzer, and Yin, Jun. Isotropic\nlocal laws for sample covariance and generalized Wigner matrices. Electronic Journal of\nProbability, 19(33):1\u201353, 2014. ISSN 10836489.\n\n[3] Boutsidis, Christos, Zouzias, Anastasios, Mahoney, Michael W, and Drineas, Petros. Ran-\ndomized dimensionality reduction for k-means clustering. 
IEEE Transactions on Information Theory, 61(2):1045–1062, 2015.\n\n[4] Bouwmans, Thierry, Sobral, Andrews, Javed, Sajid, Ki, Soon, and Zahzah, El-hadi. Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset. Computer Science Review, 2016. ISSN 1574-0137.\n\n[5] Buuren, Stef and Groothuis-Oudshoorn, Karin. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 2011.\n\n[6] Cai, Jian-Feng, Candes, Emmanuel J., and Shen, Zuowei. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.\n\n[7] Candes, Emmanuel J. and Plan, Yaniv. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010. ISSN 00189219.\n\n[8] Candes, Emmanuel J. and Plan, Yaniv. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.\n\n[9] Candès, Emmanuel J., Li, Xiaodong, Ma, Yi, and Wright, John. Robust principal component analysis? Journal of the ACM, 58(3):1–37, May 2011. ISSN 00045411.\n\n[10] Candes, Emmanuel J., Sing-Long, Carlos A., and Trzasko, Joshua D. Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing, 61(19):4643–4657, 2013.\n\n[11] Cattell, Raymond B. The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276, 1966.\n\n[12] Chandrasekaran, Venkat, Sanghavi, Sujay, Parrilo, Pablo A., and Willsky, Alan S. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011. ISSN 1052-6234.\n\n[13] Das, Rajarshi, Zaheer, Manzil, and Dyer, Chris. 
Gaussian LDA for topic models with word embeddings. In ACL (1), pp. 795–804, 2015.

[14] Fischler, Martin A and Bolles, Robert C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[15] Gavish, Matan and Donoho, David L. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014. ISSN 0018-9448.

[16] Gavish, Matan and Donoho, David L. Optimal shrinkage of singular values. IEEE Transactions on Information Theory, 63(4):2137–2152, 2017.

[17] Gnanadesikan, Ramanathan and Kettenring, John R. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, pp. 81–124, 1972.

[18] Golub, Gene and Kahan, William. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965.

[19] Hastie, Trevor, Tibshirani, Robert, Sherlock, Gavin, Brown, Patrick, Botstein, David, and Eisen, Michael. Imputing missing data for gene expression arrays. Technical report, Stanford University, pp. 1–9, 1999.

[20] Huber, Peter J. Robust statistics. Springer, 2011.

[21] Ji, Hui, Liu, Chaoqiang, Shen, Zuowei, and Xu, Yuhong. Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1791–1798, 2010. ISSN 1063-6919.

[22] Johnstone, Iain M. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327, 2001.

[23] Lin, Zhouchen, Chen, Minming, and Ma, Yi. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2013.

[24] Luo, Xin, Zhou, Mengchu, Xia, Yunni, and Zhu, Qingsheng.
An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. IEEE Transactions on Industrial Informatics, 10(2):1273–1284, 2014.

[25] Marcenko, V. A. and Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sbornik, 1(4):457–483, 1967.

[26] Meloun, Milan, Capek, Jindrich, Miksik, Petr, and Brereton, Richard G. Critical comparison of methods predicting the number of components in spectroscopic data. Analytica Chimica Acta, 423(1):51–68, 2000.

[27] Nadakuditi, Raj Rao. OptShrink: An algorithm for improved low-rank signal matrix denoising by optimal, data-driven singular value shrinkage. IEEE Transactions on Information Theory, 60(5):3002–3018, 2014. ISSN 0018-9448.

[28] Rao, Nikhil, Yu, Hsiang-Fu, Ravikumar, Pradeep K, and Dhillon, Inderjit S. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems, pp. 2107–2115, 2015.

[29] Rennie, Jason D. M. and Srebro, Nathan. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pp. 713–719, 2005. doi: 10.1145/1102351.1102441. URL http://doi.acm.org/10.1145/1102351.1102441.

[30] Rubin, Donald B. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489, 1996.

[31] Schafer, Joseph L. Analysis of incomplete multivariate data. CRC Press, 1997.

[32] Shabalin, Andrey A and Nobel, Andrew B. Reconstruction of a low-rank matrix in the presence of Gaussian noise. Journal of Multivariate Analysis, 118:67–76, 2013. ISSN 0047-259X.

[33] Stein, Charles M. Lectures on the theory of estimation of many parameters. Journal of Soviet Mathematics, 74(5), 1986.
URL http://link.springer.com/article/10.1007/BF01085007.

[34] Wright, John, Peng, Yigang, Ma, Yi, Ganesh, Arvind, and Rao, Shankar. Robust principal component analysis: Exact recovery of corrupted low-rank matrices. In Advances in Neural Information Processing Systems (NIPS), pp. 2080–2088, 2009.

[35] Yang, Jian, Qian, Jianjun, Luo, Lei, Zhang, Fanlong, and Gao, Yicheng. Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2016. ISSN 0162-8828.