{"title": "The Noisy Power Method: A Meta Algorithm with Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 2861, "page_last": 2869, "abstract": "We provide a new robust convergence analysis of the well-known power method for computing the dominant singular vectors of a matrix, which we call the noisy power method. Our result characterizes the convergence behavior of the algorithm when a large amount of noise is introduced after each matrix-vector multiplication. The noisy power method can be seen as a meta-algorithm that has recently found a number of important applications in a broad range of machine learning problems including alternating minimization for matrix completion, streaming principal component analysis (PCA), and privacy-preserving spectral analysis. Our general analysis subsumes several existing ad-hoc convergence bounds and resolves a number of open problems in multiple applications. A recent work of Mitliagkas et al.~(NIPS 2013) gives a space-efficient algorithm for PCA in a streaming model where samples are drawn from a spiked covariance model. We give a simpler and more general analysis that applies to arbitrary distributions. Moreover, even in the spiked covariance model our result gives quantitative improvements in a natural parameter regime. As a second application, we provide an algorithm for differentially private principal component analysis that runs in nearly linear time in the input sparsity and achieves nearly tight worst-case error bounds. Complementing our worst-case bounds, we show that the error dependence of our algorithm on the matrix dimension can be replaced by an essentially tight dependence on the coherence of the matrix. 
This result resolves the main problem left open by Hardt and Roth (STOC 2013) and leads to strong average-case improvements over the optimal worst-case bound.", "full_text": "The Noisy Power Method:\n\nA Meta Algorithm with Applications\n\nMoritz Hardt*\n\nIBM Research Almaden\n\nEric Price†\n\nIBM Research Almaden\n\nAbstract\n\nWe provide a new robust convergence analysis of the well-known power method for computing the dominant singular vectors of a matrix that we call the noisy power method. Our result characterizes the convergence behavior of the algorithm when a significant amount of noise is introduced after each matrix-vector multiplication. The noisy power method can be seen as a meta-algorithm that has recently found a number of important applications in a broad range of machine learning problems including alternating minimization for matrix completion, streaming principal component analysis (PCA), and privacy-preserving spectral analysis. Our general analysis subsumes several existing ad-hoc convergence bounds and resolves a number of open problems in multiple applications:\nStreaming PCA. A recent work of Mitliagkas et al. (NIPS 2013) gives a space-efficient algorithm for PCA in a streaming model where samples are drawn from a Gaussian spiked covariance model. We give a simpler and more general analysis that applies to arbitrary distributions, confirming experimental evidence of Mitliagkas et al. Moreover, even in the spiked covariance model our result gives quantitative improvements in a natural parameter regime. It is also notably simpler and follows easily from our general convergence analysis of the noisy power method together with a matrix Chernoff bound.\nPrivate PCA. We provide the first nearly-linear time algorithm for the problem of differentially private principal component analysis that achieves nearly tight worst-case error bounds. 
Complementing our worst-case bounds, we show that the error dependence of our algorithm on the matrix dimension can be replaced by an essentially tight dependence on the coherence of the matrix. This result resolves the main problem left open by Hardt and Roth (STOC 2013). The coherence is always bounded by the matrix dimension but often substantially smaller, thus leading to strong average-case improvements over the optimal worst-case bound.\n\n1 Introduction\n\nComputing the dominant singular vectors of a matrix is one of the most important algorithmic tasks underlying many applications including low-rank approximation, PCA, spectral clustering, dimensionality reduction, matrix completion and topic modeling. The classical problem is well-understood, but many recent applications in machine learning face the fundamental problem of approximately finding singular vectors in the presence of noise. Noise can enter the computation through a variety of sources including sampling error, missing entries, adversarial corruptions and privacy constraints. It is desirable to have one robust method for handling a variety of cases without the need for ad-hoc analyses. In this paper we consider the noisy power method, a fast general-purpose method for computing the dominant singular vectors of a matrix when the target matrix can only be accessed through inaccurate matrix-vector products.\n\n*Email: mhardt@us.ibm.com\n†Email: ecprice@cs.utexas.edu\n\nFigure 1 describes the method when the target matrix A is a symmetric d × d matrix; a generalization to asymmetric matrices is straightforward. The algorithm starts from an initial matrix X_0 ∈ R^{d×p} and iteratively attempts to perform the update rule X_ℓ → A X_ℓ. However, each such matrix product is followed by a possibly adversarially and adaptively chosen perturbation G_ℓ, leading to the update rule X_ℓ → A X_ℓ + G_ℓ. 
It will be convenient, though not necessary, to maintain that X_ℓ has orthonormal columns, which can be achieved through a QR-factorization after each update.\n\nInput: Symmetric matrix A ∈ R^{d×d}, number of iterations L, dimension p\n1. Choose X_0 ∈ R^{d×p}.\n2. For ℓ = 1 to L:\n  (a) Y_ℓ ← A X_{ℓ−1} + G_ℓ where G_ℓ ∈ R^{d×p} is some perturbation\n  (b) Let Y_ℓ = X_ℓ R_ℓ be a QR-factorization of Y_ℓ\nOutput: Matrix X_L\n\nFigure 1: Noisy Power Method (NPM)\n\nThe noisy power method is a meta-algorithm that, when instantiated with different settings of G_ℓ and X_0, adapts to a variety of applications. In fact, there have been a number of recent surprising applications of the noisy power method:\n\n1. Jain et al. [JNS13, Har14] observe that the update rule of the well-known alternating least squares heuristic for matrix completion can be considered as an instance of NPM. This led to the first provable convergence bounds for this important heuristic.\n2. Mitliagkas et al. [MCJ13] observe that NPM applies to a streaming model of principal component analysis (PCA), where it leads to a space-efficient and practical algorithm for PCA in settings where the covariance matrix is too large to process directly.\n3. Hardt and Roth [HR13] consider the power method in the context of privacy-preserving PCA, where noise is added to achieve differential privacy.\n\nIn each setting there has so far only been an ad-hoc analysis of the noisy power method. In the first setting, only local convergence is argued, that is, X_0 has to be cleverly chosen. In the second setting, the analysis only holds for the spiked covariance model of PCA. In the third application, only the case p = 1 was considered.\nIn this work we give a completely general analysis of the noisy power method that overcomes limitations of previous analyses. 
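To make the meta-algorithm concrete, here is a minimal NumPy sketch of the update rule in Figure 1 (an illustration only; the `noise` callback is a hypothetical stand-in for whatever perturbation G_ℓ a given application induces):

```python
import numpy as np

def noisy_power_method(A, p, L, noise=None, rng=None):
    """Noisy power method (Figure 1): X_l <- QR(A @ X_{l-1} + G_l)."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = A.shape[0]
    # Random initial orthonormal basis X_0 of a p-dimensional subspace.
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))
    for _ in range(L):
        G = noise(X) if noise is not None else np.zeros((d, p))
        Y = A @ X + G            # perturbed matrix-vector products
        X, _ = np.linalg.qr(Y)   # re-orthonormalize the iterate
    return X

# Sanity check: with G_l = 0 this is classical subspace iteration and
# recovers the dominant eigenspace of a symmetric matrix.
A = np.diag([5.0, 3.0, 1.0, 0.5])
X = noisy_power_method(A, p=2, L=50)
U = np.eye(4)[:, :2]             # top-2 eigenvectors of this diagonal A
err = np.linalg.norm((np.eye(4) - X @ X.T) @ U, 2)
```

Passing a nonzero `noise` callback (e.g. `lambda X: 0.01 * np.random.standard_normal(X.shape)`) exercises the perturbed regime that the analysis in this paper quantifies.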
Our result characterizes the global convergence properties of the algorithm in terms of the noise G_ℓ and the initial subspace X_0. We then consider the important case where X_0 is a randomly chosen orthonormal basis. This case is rather delicate since the initial correlation between a random matrix X_0 and the target subspace is vanishing in the dimension d for small p. Another important feature of the analysis is that it shows how X_ℓ converges towards the first k ≤ p singular vectors. Choosing p to be larger than the target dimension leads to a quantitatively stronger result. Theorem 2.3 formally states our convergence bound. Here we highlight one useful corollary to illustrate our more general result.\nCorollary 1.1. Let k ≤ p. Let U ∈ R^{d×k} represent the top k singular vectors of A and let σ_1 ≥ ··· ≥ σ_d ≥ 0 denote its singular values. Suppose X_0 is an orthonormal basis of a random p-dimensional subspace. Further suppose that at every step of NPM we have\n\n5‖G_ℓ‖ ≤ ε(σ_k − σ_{k+1})  and  5‖U^T G_ℓ‖ ≤ (σ_k − σ_{k+1}) · (√p − √(k−1)) / (τ√d)\n\nfor some fixed parameter τ and ε < 1/2. Then with all but τ^{−Ω(p+1−k)} + e^{−Ω(d)} probability, there exists an L = O((σ_k / (σ_k − σ_{k+1})) log(dτ/ε)) so that after L steps we have ‖(I − X_L X_L^T) U‖ ≤ ε.\n\nThe corollary shows that the algorithm converges in the strong sense that the entire spectral norm of U up to an ε error is contained in the space spanned by X_L. To achieve this the result places two assumptions on the magnitude of the noise. The total spectral norm of G_ℓ must be bounded by ε times the separation between σ_k and σ_{k+1}. 
This dependence on the singular value separation arises even in the classical perturbation theory of Davis-Kahan [DK70]. The second condition is specific to the power method and requires that the noise term is proportionally smaller when projected onto the space spanned by the top k singular vectors. This condition ensures that the correlation between X_ℓ and U, which is initially very small, is not destroyed by the noise addition step. If the noise term has some spherical properties (e.g. a Gaussian matrix), we expect the projection onto U to be smaller by a factor of √(k/d), since the space U is k-dimensional. In the case where p = k + Ω(k) this is precisely what the condition requires. When p = k the requirement is stronger by a factor of k. This phenomenon stems from the fact that the smallest singular value of a random p × k Gaussian matrix behaves differently in the square and the rectangular case.\nWe demonstrate the usefulness of our convergence bound with several novel results in some of the aforementioned applications.\n\n1.1 Application to memory-efficient streaming PCA\n\nIn the streaming PCA setting we receive a stream of samples z_1, z_2, ..., z_n ∈ R^d drawn i.i.d. from an unknown distribution D over R^d. Our goal is to compute the dominant k eigenvectors of the covariance matrix A = E_{z∼D} zz^T. The challenge is to do this in space linear in the output size, namely O(kd). Recently, Mitliagkas et al. [MCJ13] gave an algorithm for this problem based on the noisy power method. We analyze the same algorithm, which we restate here and call SPM:\n\nInput: Stream of samples z_1, z_2, ..., z_n ∈ R^d, iterations L, dimension p\n1. Let X_0 ∈ R^{d×p} be a random orthonormal basis. Let T = ⌊n/L⌋\n2. 
For ℓ = 1 to L:\n  (a) Compute Y_ℓ = A_ℓ X_{ℓ−1} where A_ℓ = Σ_{i=(ℓ−1)T+1}^{ℓT} z_i z_i^T\n  (b) Let Y_ℓ = X_ℓ R_ℓ be a QR-factorization of Y_ℓ\nOutput: Matrix X_L\n\nFigure 2: Streaming Power Method (SPM)\n\nThe algorithm can be executed in space O(pd) since the update step can compute the d × p matrix A_ℓ X_{ℓ−1} incrementally without explicitly computing A_ℓ. The algorithm maps to our setting by defining G_ℓ = (A_ℓ − A) X_{ℓ−1}. With this notation Y_ℓ = A X_{ℓ−1} + G_ℓ. We can apply Corollary 1.1 directly once we have suitable bounds on ‖G_ℓ‖ and ‖U^T G_ℓ‖.\nThe result of [MCJ13] is specific to the spiked covariance model. The spiked covariance model is defined by an orthonormal basis U ∈ R^{d×k} and a diagonal matrix Λ ∈ R^{k×k} with diagonal entries λ_1 ≥ λ_2 ≥ ··· ≥ λ_k > 0. The distribution D(U, Λ) is defined as the normal distribution N(0, U Λ^2 U^T + σ^2 Id_{d×d}). Without loss of generality we can scale the examples such that λ_1 = 1.\nOne corollary of our result shows that the algorithm outputs X_L such that ‖(I − X_L X_L^T) U‖ ≤ ε with probability 9/10 provided p = k + Ω(k) and the number of samples satisfies\n\nn = Θ( ((σ^6 + 1) / (ε^2 λ_k^6)) · kd ).\n\nPreviously, the same bound¹ was known with a quadratic dependence on k in the case where p = k. Here we can strengthen the bound by increasing p slightly.\nWhile we can get some improvements even in the spiked covariance model, our result is substantially more general and applies to any distribution. 
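Figure 2 translates directly into code. The following NumPy sketch (our illustration, not the implementation of [MCJ13]; the planted-subspace test stream is hypothetical toy data) makes the O(pd) space claim explicit: each block's product A_ℓ X is accumulated one rank-1 term z_i (z_i^T X) at a time, so the d × d matrix A_ℓ is never formed:

```python
import numpy as np

def streaming_power_method(stream, d, p, L, T):
    """Streaming power method (Figure 2): one pass over samples, O(pd) memory."""
    rng = np.random.default_rng(0)
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))
    it = iter(stream)
    for _ in range(L):
        Y = np.zeros((d, p))
        for _ in range(T):            # one block of T fresh samples
            z = next(it)
            Y += np.outer(z, z @ X)   # rank-1 update z (z^T X)
        X, _ = np.linalg.qr(Y)        # re-orthonormalize
    return X

# Toy stream with a planted 2-dimensional signal subspace U (hypothetical data).
d, k = 20, 2
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

def stream():
    while True:
        yield U @ (3.0 * rng.standard_normal(k)) + 0.1 * rng.standard_normal(d)

X = streaming_power_method(stream(), d=d, p=4, L=15, T=2000)
err = np.linalg.norm((np.eye(d) - X @ X.T) @ U, 2)
```

Here the per-block sampling error plays the role of G_ℓ: more samples per block means smaller noise and hence a smaller final error, as the convergence bound predicts.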
The sample complexity bound we get varies according to a technical parameter of the distribution. Roughly speaking, we get a near-linear sample complexity if the distribution is either \"round\" (as in the spiked covariance setting) or is very well approximated by a k-dimensional subspace. To illustrate the latter condition, we have the following result without making any assumptions other than scaling the distribution:\nCorollary 1.2. Let D be any distribution scaled so that Pr{‖z‖ > t} ≤ exp(−t) for every t ≥ 1. Let U represent the top k eigenvectors of the covariance matrix E zz^T and σ_1 ≥ ··· ≥ σ_d ≥ 0 its eigenvalues. Then, SPM invoked with p = k + Ω(k) outputs a matrix X_L such that with probability 9/10 we have ‖(I − X_L X_L^T) U‖ ≤ ε provided SPM receives n samples where n satisfies\n\nn = Õ( (σ_k / (ε^2 k (σ_k − σ_{k+1})^3)) · d ).\n\n¹That the bound stated in [MCJ13] has a σ^6 dependence is not completely obvious. There is a O(σ^4) in the numerator and log((σ^2 + 0.75λ_k^2)/(σ^2 + 0.5λ_k^2)) in the denominator which simplifies to O(1/σ^2) for constant λ_k and σ^2 ≥ 1.\n\nThe corollary establishes a sample complexity that's linear in d provided that the spectrum decays quickly, as is common in applications. For example, if the spectrum follows a power law so that σ_j ≈ j^{−c} for a constant c > 1/2, the bound becomes n = Õ(k^{2c+2} d / ε^2).\n\n1.2 Application to privacy-preserving spectral analysis\n\nMany applications of singular vector computation are plagued by the fact that the underlying matrix contains sensitive information about individuals. 
A successful paradigm in privacy-preserving data analysis rests on the notion of differential privacy, which requires all access to the data set to be randomized in such a way that the presence or absence of a single data item is hidden. The notion of data item varies and could either refer to a single entry, a single row, or a rank-1 matrix of bounded norm. More formally, differential privacy requires that the output distribution of the algorithm changes only slightly with the addition or deletion of a single data item. This requirement often necessitates the introduction of significant levels of noise that make the computation of various objectives challenging. Differentially private singular vector computation has been studied actively since the work of Blum et al. [BDMN05]. There are two main objectives. The first is computational efficiency. The second objective is to minimize the amount of error that the algorithm introduces.\nIn this work, we give a fast algorithm for differentially private singular vector computation based on the noisy power method that leads to nearly optimal bounds in a number of settings that were considered in previous work. The algorithm is described in Figure 3. It's a simple instance of NPM in which each noise matrix G_ℓ is a Gaussian random matrix scaled so that the algorithm achieves (ε, δ)-differential privacy (as formally defined in Definition E.1). It is easy to see that the algorithm can be implemented in time nearly linear in the number of nonzero entries of the input matrix (input sparsity). This will later lead to strong improvements in running time compared with several previous works.\n\nInput: Symmetric A ∈ R^{d×d}, L, p, privacy parameters ε, δ > 0\n1. Let X_0 be a random orthonormal basis and put σ = ε^{−1}√(4pL log(1/δ))\n2. 
For ℓ = 1 to L:\n  (a) Y_ℓ ← A X_{ℓ−1} + G_ℓ where G_ℓ ∼ N(0, ‖X_{ℓ−1}‖_∞^2 σ^2)^{d×p}.\n  (b) Compute the QR-factorization Y_ℓ = X_ℓ R_ℓ\nOutput: Matrix X_L\n\nFigure 3: Private Power Method (PPM). Here ‖X‖_∞ = max_{ij} |X_{ij}|.\n\nWe first state a general-purpose analysis of PPM that follows from Corollary 1.1.\nTheorem 1.3. Let k ≤ p. Let U ∈ R^{d×k} represent the top k singular vectors of A and let σ_1 ≥ ··· ≥ σ_d ≥ 0 denote its singular values. Then, PPM satisfies (ε, δ)-differential privacy and after L = O((σ_k / (σ_k − σ_{k+1})) log(d)) iterations we have with probability 9/10 that\n\n‖(I − X_L X_L^T) U‖ ≤ O( (σ max_ℓ ‖X_ℓ‖_∞ √(d log L) / (σ_k − σ_{k+1})) · (√p / (√p − √(k−1))) ).\n\nWhen p = k + Ω(k) the trailing factor becomes a constant. If p = k it creates a factor-k overhead. In the worst case we can always bound ‖X_ℓ‖_∞ by 1 since X_ℓ is an orthonormal basis. However, in principle we could hope that a much better bound holds provided that the target subspace U has small coordinates. Hardt and Roth [HR12, HR13] suggested a way to accomplish a stronger bound by considering a notion of coherence of A, denoted as μ(A). Informally, the coherence is a well-studied parameter that varies between 1 and n, but is often observed to be small. Intuitively, the coherence measures the correlation between the singular vectors of the matrix with the standard basis. Low coherence means that the singular vectors have small coordinates in the standard basis. 
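For concreteness, here is a minimal NumPy sketch of the PPM update in Figure 3 (illustration only; a real deployment must fix L and p before computing the noise scale σ, since σ depends on both, and the toy parameters below use a deliberately loose privacy budget ε so that the small example converges):

```python
import numpy as np

def private_power_method(A, p, L, eps, delta, rng=None):
    """Private power method (Figure 3): Gaussian noise before each QR step."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = A.shape[0]
    # Noise scale sigma = eps^{-1} sqrt(4 p L log(1/delta)), as in Figure 3.
    sigma = np.sqrt(4 * p * L * np.log(1 / delta)) / eps
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))
    for _ in range(L):
        # G_l ~ N(0, ||X_{l-1}||_inf^2 sigma^2)^{d x p}
        G = sigma * np.max(np.abs(X)) * rng.standard_normal((d, p))
        X, _ = np.linalg.qr(A @ X + G)
    return X

# Toy run: a large spectral gap tolerates the privacy noise.
A = np.diag([1000.0, 900.0, 1.0, 0.5, 0.2])
X = private_power_method(A, p=3, L=5, eps=10.0, delta=1e-6)
U = np.eye(5)[:, :2]              # top-2 eigenvectors of this diagonal A
err = np.linalg.norm((np.eye(5) - X @ X.T) @ U, 2)
```

The `np.max(np.abs(X))` factor is the ‖X_{ℓ−1}‖_∞ scaling from Figure 3; bounding it by the coherence-based quantity discussed below is exactly what drives the improvement in Theorem 1.4.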
Many results on matrix completion and robust PCA crucially rely on the assumption that the underlying matrix has low coherence [CR09, CT10, CLMW11] (though the notion of coherence here will be somewhat different).\n\nTheorem 1.4. Under the assumptions of Theorem 1.3, we have the conclusion\n\n‖(I − X_L X_L^T) U‖ ≤ O( (σ √(μ(A) log d log L) / (σ_k − σ_{k+1})) · (√p / (√p − √(k−1))) ).\n\nHardt and Roth proved this result for the case where p = 1. The extension to p > 1 lost a factor of √d in general and therefore gave no improvement over Theorem 1.3. Our result resolves the main problem left open in their work. The strength of Theorem 1.4 is that the bound is essentially dimension-free under a natural assumption on the matrix and never worse than our worst-case result. It is also known that in general the dependence on d achieved in Theorem 1.3 is best possible in the worst case (see discussion in [HR13]), so that further progress requires making stronger assumptions. Coherence is a natural such assumption. The proof of Theorem 1.4 proceeds by showing that each iterate X_ℓ satisfies ‖X_ℓ‖_∞ ≤ O(√(μ(A) log(d)/d)) and applying Theorem 1.3. To do this we exploit a non-trivial symmetry of the algorithm that we discuss in Section E.3.\n\nOther variants of differential privacy. Our discussion above applied to (ε, δ)-differential privacy under changing a single entry of the matrix. Several works consider other variants of differential privacy. It is generally easy to adapt the power method to these settings by changing the noise distribution or its scaling. To illustrate this aspect, we consider the problem of privacy-preserving principal component analysis as recently studied by [CSS12, KT13]. 
Both works consider an algorithm called the exponential mechanism. The first work gives a heuristic implementation that may not converge, while the second work gives a provably polynomial-time algorithm, though the running time is more than cubic. Our algorithm gives strong improvements in running time while giving nearly optimal accuracy guarantees, as it matches a lower bound of [KT13] up to a Õ(√k) factor. We also improve the error dependence on k by polynomial factors compared to previous work. Moreover, we get an accuracy improvement of O(√d) for the case of (ε, δ)-differential privacy, while these previous works only apply to (ε, 0)-differential privacy. Section E.2 provides formal statements.\n\n1.3 Related Work\n\nNumerical Analysis. One might expect that a suitable analysis of the noisy power method would have appeared in the numerical analysis literature. However, we are not aware of a reference and there are a number of points to consider. First, our noise model is adaptive, thus setting it apart from the classical perturbation theory of the singular vector decomposition [DK70]. Second, we think of the perturbation at each step as large, making it conceptually different from floating point errors. Third, research in numerical analysis over the past decades has largely focused on faster Krylov subspace methods. There is some theory of inexact Krylov methods, e.g., [SS07], that captures the effect of noisy matrix-vector products in this context. Related to our work are also results on the perturbation stability of the QR-factorization, since those could be used to obtain convergence bounds for subspace iteration. Such bounds, however, must depend on the condition number of the matrix that the QR-factorization is applied to. See Chapter 19.9 in [Hig02] and the references therein for background. Our proof strategy avoids this particular dependence on the condition number.\n\nStreaming PCA. 
PCA in the streaming model is related to a host of well-studied problems that we cannot survey completely here. We refer to [ACLS12, MCJ13] for a thorough discussion of prior work. Not mentioned therein is a recent work on incremental PCA [BDF13] that leads to space-efficient algorithms for computing the top singular vector; however, it's not clear how to extend their results to computing multiple singular vectors.\n\nPrivacy. There has been much work on differentially private spectral analysis, starting with Blum et al. [BDMN05], who used an algorithm known as Randomized Response, which adds a single noise matrix N either to the input matrix A or the covariance matrix AA^T. This approach appears in a number of papers, e.g. [MM09]. While often easy to analyze, it has the disadvantage that it converts sparse matrices to dense matrices and is often impractical on large data sets. Chaudhuri et al. [CSS12] and Kapralov-Talwar [KT13] use the so-called exponential mechanism to sample approximate eigenvectors of the matrix. The sampling is done using a heuristic approach without polynomial-time convergence guarantees in the first case, and using a polynomial-time algorithm in the second. Both papers achieve a tight dependence on the matrix dimension d (though the dependence on k is suboptimal in general). Most closely related to our work are the results of Hardt and Roth [HR13, HR12] that introduced matrix coherence as a way to circumvent existing worst-case lower bounds on the error. They also analyzed a natural noisy variant of power iteration for the case of computing the dominant eigenvector of A. When multiple eigenvectors are needed, their algorithm uses the well-known deflation technique. However, this step loses control of the coherence of the original matrix and hence results in suboptimal bounds. 
In fact, a √rank(A) factor is lost.\n\n1.4 Open Questions\n\nWe believe Corollary 1.1 to be a fairly precise characterization of the convergence of the noisy power method to the top k singular vectors when p = k. The main flaw is that the noise tolerance depends on the eigengap σ_k − σ_{k+1}, which could be very small. We have some conjectures for results that do not depend on this eigengap.\nFirst, when p > k, we think that Corollary 1.1 might hold using the gap σ_k − σ_{p+1} instead of σ_k − σ_{k+1}. Unfortunately, our proof technique relies on the principal angle decreasing at each step, which does not necessarily hold with the larger level of noise. Nevertheless we expect the principal angle to decrease fairly fast on average, so that X_L will contain a subspace very close to U. We are actually unaware of this sort of result even in the noiseless setting.\nConjecture 1.5. Let X_0 be a random p-dimensional basis for p > k. Suppose at every step we have\n\n100‖G_ℓ‖ ≤ ε(σ_k − σ_{p+1})  and  100‖U^T G_ℓ‖ ≤ ε(σ_k − σ_{p+1}) · (√p − √(k−1)) / √d.\n\nThen with high probability, after L = O((σ_k / (σ_k − σ_{p+1})) log(d/ε)) iterations we have\n\n‖(I − X_L X_L^T) U‖ ≤ ε.\n\nThe second way of dealing with a small eigengap would be to relax our goal. Corollary 1.1 is quite stringent in that it requires X_L to approximate the top k singular vectors U, which gets harder when the eigengap approaches zero and the kth through (p+1)st singular vectors are nearly indistinguishable. A relaxed goal would be for X_L to spectrally approximate A, that is,\n\n‖(I − X_L X_L^T) A‖ ≤ σ_{k+1} + ε.  (1)\n\nThis weaker goal is known to be achievable in the noiseless setting without any eigengap at all. In particular, [?] 
shows that (1) happens after L = O((σ_{k+1}/ε) log n) steps in the noiseless setting. A plausible extension to the noisy setting would be:\nConjecture 1.6. Let X_0 be a random 2k-dimensional basis. Suppose at every step we have\n\n‖G_ℓ‖ ≤ ε  and  ‖U^T G_ℓ‖ ≤ ε√(k/d).\n\nThen with high probability, after L = O((σ_{k+1}/ε) log d) iterations we have that\n\n‖(I − X_L X_L^T) A‖ ≤ σ_{k+1} + O(ε).\n\n1.5 Organization\n\nAll proofs can be found in the supplementary material. In the remaining space, we limit ourselves to a more detailed discussion of our convergence analysis and the application to streaming PCA. The entire section on privacy is in the supplementary material in Section E.\n\n2 Convergence of the noisy power method\n\nFigure 1 presents our basic algorithm that we analyze in this section. An important tool in our analysis is the notion of principal angles, which are useful in analyzing the convergence behavior of numerical eigenvalue methods. Roughly speaking, we will show that the tangent of the k-th principal angle between X and the top k eigenvectors of A decreases as σ_{k+1}/σ_k in each iteration of the noisy power method.\n\nDefinition 2.1 (Principal angles). Let X and Y be subspaces of R^d of dimension at least k. The principal angles 0 ≤ θ_1 ≤ ··· ≤ θ_k between X and Y and associated principal vectors x_1, ..., x_k and y_1, ..., y_k are defined recursively via\n\nθ_i(X, Y) = min { arccos( ⟨x, y⟩ / (‖x‖_2 ‖y‖_2) ) : x ∈ X, y ∈ Y, x ⊥ x_j, y ⊥ y_j for all j < i }\n\nand x_i, y_i are the x and y that give this value. 
For matrices X and Y, we use θ_k(X, Y) to denote the kth principal angle between their ranges.\n\n2.1 Convergence argument\n\nFix parameters 1 ≤ k ≤ p ≤ d. In this section we consider a symmetric d × d matrix A with singular values σ_1 ≥ σ_2 ≥ ··· ≥ σ_d. We let U ∈ R^{d×k} contain the first k eigenvectors of A. Our main lemma shows that tan θ_k(U, X) decreases multiplicatively in each step.\nLemma 2.2. Let U contain the largest k eigenvectors of a symmetric matrix A ∈ R^{d×d}, and let X ∈ R^{d×p} for p ≥ k. Let G ∈ R^{d×p} satisfy\n\n4‖U^T G‖ ≤ (σ_k − σ_{k+1}) cos θ_k(U, X)  and  4‖G‖ ≤ (σ_k − σ_{k+1}) ε\n\nfor some ε < 1. Then\n\ntan θ_k(U, AX + G) ≤ max( ε, max( ε, (σ_{k+1}/σ_k)^{1/4} ) · tan θ_k(U, X) ).\n\nWe can inductively apply the previous lemma to get the following general convergence result.\nTheorem 2.3. Let U represent the top k eigenvectors of the matrix A and γ = 1 − σ_{k+1}/σ_k. Suppose that the initial subspace X_0 and noise G_ℓ are such that\n\n5‖U^T G_ℓ‖ ≤ (σ_k − σ_{k+1}) cos θ_k(U, X_0)  and  5‖G_ℓ‖ ≤ ε(σ_k − σ_{k+1})\n\nat every stage ℓ, for some ε < 1/2. Then there exists an L ≲ (1/γ) log( tan θ_k(U, X_0) / ε ) such that for all ℓ ≥ L we have tan θ(U, X_ℓ) ≤ ε.\n\n2.2 Random initialization\n\nThe next lemma essentially follows from bounds on the smallest singular value of Gaussian random matrices [RV09].\nLemma 2.4. 
For an arbitrary orthonormal U and random subspace X, we have\n\ntan θ_k(U, X) ≤ τ√d / (√p − √(k−1))\n\nwith all but τ^{−Ω(p+1−k)} + e^{−Ω(d)} probability.\n\nWith this lemma we can prove the corollary that we stated in the introduction.\n\nProof of Corollary 1.1. By Lemma 2.4, with the desired probability we have tan θ_k(U, X_0) ≤ τ√d / (√p − √(k−1)). Hence cos θ_k(U, X_0) ≥ 1/(1 + tan θ_k(U, X_0)) ≥ (√p − √(k−1)) / (2τ√d). Rescale τ and apply Theorem 2.3 to get that tan θ_k(U, X_L) ≤ ε. Then ‖(I − X_L X_L^T) U‖ = sin θ_k(U, X_L) ≤ tan θ_k(U, X_L) ≤ ε. □\n\n3 Memory-efficient streaming PCA\n\nIn the streaming PCA setting we receive a stream of samples z_1, z_2, ... ∈ R^d. Each sample is drawn i.i.d. from an unknown distribution D over R^d. Our goal is to compute the dominant k eigenvectors of the covariance matrix A = E_{z∼D} zz^T. The challenge is to do this with small space, so we cannot store the d^2 entries of the sample covariance matrix. We would like to use O(dk) space, which is necessary even to store the output.\nThe streaming power method (Figure 2, introduced by [MCJ13]) is a natural algorithm that performs streaming PCA with O(dk) space. The question that arises is how many samples it requires to achieve a given level of accuracy, for various distributions D. Using our general analysis of the noisy power method, we show that the streaming power method requires fewer samples and applies to more distributions than was previously known. We analyze a broad class of distributions:\nDefinition 3.1. 
A distribution $\mathcal{D}$ over $\mathbb{R}^d$ is $(B, p)$-round if for every $p$-dimensional projection $P$ and all $t \ge 1$ we have
\[ \Pr_{z \sim \mathcal{D}}\{\|z\| > t\} \le \exp(-t) \quad\text{and}\quad \Pr_{z \sim \mathcal{D}}\big\{\|P z\| > t \cdot \sqrt{Bp/d}\big\} \le \exp(-t)\,. \]

The first condition just corresponds to a normalization of the samples drawn from $\mathcal{D}$. Assuming the first condition holds, the second condition always holds with $B = d/p$. For this reason our analysis in principle applies to any distribution, but the sample complexity will depend quadratically on $B$.

Let us illustrate this definition through the example of the spiked covariance model studied by [MCJ13]. The spiked covariance model is defined by an orthonormal basis $U \in \mathbb{R}^{d \times k}$ and a diagonal matrix $\Lambda \in \mathbb{R}^{k \times k}$ with diagonal entries $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0$. The distribution $\mathcal{D}(U, \Lambda)$ is defined as the normal distribution $N(0, (U \Lambda^2 U^\top + \sigma^2 \mathrm{Id}_{d \times d})/D)$, where $D = \Theta(d\sigma^2 + \sum_i \lambda_i^2)$ is a normalization factor chosen so that the distribution satisfies the norm bound. Note that the $i$-th eigenvalue of the covariance matrix is $\sigma_i = (\lambda_i^2 + \sigma^2)/D$ for $1 \le i \le k$ and $\sigma_i = \sigma^2/D$ for $i > k$. We show in Lemma D.2 that the spiked covariance model $\mathcal{D}(U, \Lambda)$ is indeed $(B, p)$-round for $B = O\big(\frac{\lambda_1^2 + \sigma^2}{\operatorname{tr}(\Lambda^2)/d + \sigma^2}\big)$, which is constant for $\sigma \gtrsim \lambda_1$. We have the following main theorem.

Theorem 3.2. Let $\mathcal{D}$ be a $(B, p)$-round distribution over $\mathbb{R}^d$ with covariance matrix $A$ whose eigenvalues are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d \ge 0$. Let $U \in \mathbb{R}^{d \times k}$ be an orthonormal basis for the eigenvectors corresponding to the first $k$ eigenvalues of $A$. Then the streaming power method SPM returns an orthonormal basis $X \in \mathbb{R}^{d \times p}$ such that $\tan \theta(U, X) \le \varepsilon$ with probability $9/10$, provided that SPM receives $n$ samples from $\mathcal{D}$ for some $n$ satisfying
\[ n \le \tilde{O}\left( \frac{B^2 \sigma_k\, k \log^2 d}{\varepsilon^2 (\sigma_k - \sigma_{k+1})^3\, d} \right) \]
if $p = k + \Theta(k)$. More generally, for all $p \ge k$ one can get the slightly stronger result
\[ n \le \tilde{O}\left( \frac{B p\, \sigma_k \max\{1/\varepsilon^2,\ Bp/(\sqrt{p} - \sqrt{k-1})^2\} \log^2 d}{(\sigma_k - \sigma_{k+1})^3\, d} \right). \]

Instantiating with the spiked covariance model gives the following:

Corollary 3.3. In the spiked covariance model $\mathcal{D}(U, \Lambda)$ the conclusion of Theorem 3.2 holds for $p = 2k$ with
\[ n = \tilde{O}\left( \frac{(\lambda_1^2 + \sigma^2)^2 (\lambda_k^2 + \sigma^2)}{\varepsilon^2 \lambda_k^6} \cdot dk \right). \]
When $\lambda_1 = O(1)$ and $\lambda_k = \Omega(1)$ this becomes $n = \tilde{O}\big(\frac{\sigma^6 + 1}{\varepsilon^2}\, dk\big)$.

We can apply Theorem 3.2 to all distributions that have exponentially concentrated norm by setting $B = d/p$. This gives the following result (Corollary 3.4).
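To make the algorithm concrete, the streaming power method can be sketched as follows. This is our own minimal NumPy illustration of the block-wise iteration (dimensions, block size, and the spiked-covariance test stream are our illustrative choices, not the exact SPM of Figure 2):

```python
import numpy as np

def streaming_power_method(samples, d, p, iters, block):
    # One pass over the stream using O(dp) memory: each block of samples
    # drives one power iteration Y = (1/block) * sum_i z_i (z_i^T X), i.e.
    # multiplication by the block's empirical covariance, without ever
    # materializing a d x d matrix.
    rng = np.random.default_rng(0)
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))
    for _ in range(iters):
        Y = np.zeros((d, p))
        for _ in range(block):
            z = next(samples)
            Y += np.outer(z, z @ X)   # rank-one update z z^T X
        X, _ = np.linalg.qr(Y / block)
    return X

# A spiked-covariance-style stream: z = U Lambda g + sigma * noise.
rng = np.random.default_rng(1)
d, k, p = 30, 2, 4
U = np.linalg.qr(rng.standard_normal((d, k)))[0]
lam, sigma = np.array([2.0, 1.5]), 0.2

def spiked_stream():
    while True:
        yield U @ (lam * rng.standard_normal(k)) + sigma * rng.standard_normal(d)

X = streaming_power_method(spiked_stream(), d, p, iters=15, block=1000)
cos_k = np.linalg.svd(U.T @ X, compute_uv=False).min()  # cos of largest principal angle
```

In the language of the noisy power method, each block plays one round with noise $G_\ell$ equal to the deviation of the block's empirical covariance from $A$ applied to $X_\ell$, so larger blocks mean smaller per-round noise.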
Let $\mathcal{D}$ be any distribution scaled such that $\Pr_{z \sim \mathcal{D}}[\|z\| > t] \le \exp(-t)$ for all $t \ge 1$. Then the conclusion of Theorem 3.2 holds for $p = 2k$ with
\[ n = \tilde{O}\left( \frac{\sigma_k}{\varepsilon^2 k (\sigma_k - \sigma_{k+1})^3} \cdot d \right). \]

If the eigenvalues follow a power law, $\sigma_j \approx j^{-c}$ for a constant $c > 1/2$, this gives an $n = \tilde{O}(k^{2c+2} d / \varepsilon^2)$ bound on the sample complexity.

References

[ACLS12] Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 861–868. IEEE, 2012.

[BDF13] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In Proc. 27th Neural Information Processing Systems (NIPS), pages 3174–3182, 2013.

[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proc. 24th PODS, pages 128–138. ACM, 2005.

[CLMW11] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11, 2011.

[CR09] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, December 2009.

[CSS12] Kamalika Chaudhuri, Anand Sarwate, and Kaushik Sinha. Near-optimal differentially private principal components. In Proc. 26th Neural Information Processing Systems (NIPS), 2012.

[CT10] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

[DK70] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer.
Anal., 7:1–46, 1970.

[Har14] Moritz Hardt. Understanding alternating minimization for matrix completion. In Proc. 55th Foundations of Computer Science (FOCS). IEEE, 2014.

[Hig02] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, 2002.

[HR12] Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proc. 44th Symposium on Theory of Computing (STOC), pages 1255–1268. ACM, 2012.

[HR13] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proc. 45th Symposium on Theory of Computing (STOC). ACM, 2013.

[JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proc. 45th Symposium on Theory of Computing (STOC), pages 665–674. ACM, 2013.

[KT13] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proc. 24th Symposium on Discrete Algorithms (SODA). ACM-SIAM, 2013.

[MCJ13] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In Proc. 27th Neural Information Processing Systems (NIPS), pages 2886–2894, 2013.

[MM09] Frank McSherry and Ilya Mironov. Differentially private recommender systems: building privacy into the net. In Proc. 15th KDD, pages 627–636. ACM, 2009.

[RV09] Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707–1739, 2009.

[SS07] Valeria Simoncini and Daniel B. Szyld. Recent computational developments in Krylov subspace methods for linear systems. Numerical Linear Algebra with Applications, 14:1–59, 2007.