{"title": "Fixed-Rank Approximation of a Positive-Semidefinite Matrix from Streaming Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1225, "page_last": 1234, "abstract": "Several important applications, such as streaming PCA and semidefinite programming, involve a large-scale positive-semidefinite (psd) matrix that is presented as a sequence of linear updates. Because of storage limitations, it may only be possible to retain a sketch of the psd matrix. This paper develops a new algorithm for fixed-rank psd approximation from a sketch. The approach combines the Nystr\u00f6m approximation with a novel mechanism for rank truncation. Theoretical analysis establishes that the proposed method can achieve any prescribed relative error in the Schatten 1-norm and that it exploits the spectral decay of the input matrix. Computer experiments show that the proposed method dominates alternative techniques for fixed-rank psd matrix approximation across a wide range of examples.", "full_text": "Fixed-Rank Approximation of a\n\nPositive-Semide\ufb01nite Matrix from Streaming Data\n\nJoel A. Tropp\n\nCaltech\n\nAlp Yurtsever\n\nEPFL\n\njtropp@caltech.edu\n\nalp.yurtsever@epfl.ch\n\nmru8@cornell.edu\n\nvolkan.cevher@epfl.ch\n\nMadeleine Udell\n\nVolkan Cevher\n\nCornell\n\nEPFL\n\nAbstract\n\nSeveral important applications, such as streaming PCA and semide\ufb01nite program-\nming, involve a large-scale positive-semide\ufb01nite (psd) matrix that is presented as a\nsequence of linear updates. Because of storage limitations, it may only be possible\nto retain a sketch of the psd matrix. This paper develops a new algorithm for\n\ufb01xed-rank psd approximation from a sketch. The approach combines the Nystr\u00f6m\napproximation with a novel mechanism for rank truncation. 
Theoretical analysis establishes that the proposed method can achieve any prescribed relative error in the Schatten 1-norm and that it exploits the spectral decay of the input matrix. Computer experiments show that the proposed method dominates alternative techniques for fixed-rank psd matrix approximation across a wide range of examples.\n\n1 Motivation\n\nIn recent years, researchers have studied many applications where a large positive-semidefinite (psd) matrix is presented as a series of linear updates. A recurring theme is that we only have space to store a small summary of the psd matrix, and we must use this information to construct an accurate psd approximation with specified rank. Here are two important cases where this problem arises.\nStreaming Covariance Estimation. Suppose that we receive a stream h_1, h_2, h_3, · · · ∈ R^n of high-dimensional vectors. The psd sample covariance matrix of these vectors has the linear dynamics\n\nA(0) ← 0 and A(i) ← (1 − i^−1) A(i−1) + i^−1 h_i h_i^*.\n\nWhen the dimension n and the number of vectors are both large, it is not possible to store the vectors or the sample covariance matrix. Instead, we wish to maintain a small summary that allows us to compute the rank-r psd approximation of the sample covariance matrix A(i) at a specified instant i. This problem and its variants are often called streaming PCA [3, 12, 14, 15, 25, 32].\nConvex Low-Rank Matrix Optimization with Optimal Storage. A primary application of semidefinite programming (SDP) is to search for a rank-r psd matrix that satisfies additional constraints. Because of storage costs, SDPs are difficult to solve when the matrix variable is large. Recently, Yurtsever et al. 
[44] exhibited the first provable algorithm, called SketchyCGM, that produces a rank-r approximate solution to an SDP using optimal storage.\nImplicitly, SketchyCGM forms a sequence of approximate psd solutions to the SDP via the iteration\n\nA(0) ← 0 and A(i) ← (1 − η_i) A(i−1) + η_i h_i h_i^*.\n\nThe step size η_i = 2/(i + 2), and the vectors h_i do not depend on the matrices A(i). In fact, SketchyCGM only maintains a small summary of the evolving solution A(i). When the iteration terminates, SketchyCGM computes a rank-r psd approximation of the final iterate using the method described by Tropp et al. [37, Alg. 9].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1.1 Notation and Background\n\nThe scalar field F = R or F = C. Define α(R) = 1 and α(C) = 0. The asterisk * is the (conjugate) transpose, and the dagger † denotes the Moore–Penrose pseudoinverse. The notation A^(1/2) refers to the unique psd square root of a psd matrix A. For p ∈ [1, ∞], the Schatten p-norm ‖·‖_p returns the ℓ_p norm of the singular values of a matrix. As usual, σ_r refers to the rth largest singular value.\nFor a nonnegative integer r, the phrase "rank-r" and its variants mean "rank at most r." For a matrix M, the symbol ⟦M⟧_r denotes a (simultaneous) best rank-r approximation of the matrix M with respect to every Schatten p-norm. We can take ⟦M⟧_r to be any r-truncated singular value decomposition (SVD) of M [24, Sec. 6]. Every best rank-r approximation of a psd matrix is psd.\n\n2 Sketching and Fixed-Rank PSD Approximation\n\nWe begin with a streaming data model for a psd matrix that evolves via a sequence of general linear updates, and we describe a randomized linear sketch for tracking the psd matrix. 
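Before formalizing the model, it may help to see the Section 1 covariance recursion concretely. The following minimal numpy sketch (sizes and data are illustrative, not taken from the paper) confirms that the recursion reproduces the batch sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 20, 50
H = rng.standard_normal((N, n))          # stream of vectors h_1, ..., h_N

# Recursion from Section 1: A(0) = 0, A(i) = (1 - 1/i) A(i-1) + (1/i) h_i h_i^*
A = np.zeros((n, n))
for i, h in enumerate(H, start=1):
    A = (1 - 1 / i) * A + np.outer(h, h) / i

# The recursion maintains the sample covariance (1/N) sum_i h_i h_i^*
assert np.allclose(A, H.T @ H / N)
```

Each update touches a single outer product, which is what makes a small linear sketch of A(i) feasible when n is large.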
To compute a fixed-rank psd approximation, we develop an algorithm based on the Nyström method [40], a technique from the literature on kernel methods. In contrast to previous approaches, our algorithm uses a distinct mechanism to truncate the rank of the approximation.\nThe Streaming Model. Fix a rank parameter r in the range 1 ≤ r ≤ n. Initially, the psd matrix A ∈ Fn×n equals a known psd matrix Ainit ∈ Fn×n. Then A evolves via a series of linear updates:\n\nA ← θ1 A + θ2 H where θi ∈ R, H ∈ Fn×n is (conjugate) symmetric.    (2.1)\n\nIn many applications, the innovation H is low-rank and/or sparse. We assume that the evolving matrix A always remains psd. At one given instant, we must produce an accurate rank-r approximation of the psd matrix A induced by the stream of linear updates.\nThe Sketch. Fix a sketch size parameter k in the range r ≤ k ≤ n. Independent from A, we draw and fix a random test matrix\n\nΩ ∈ Fn×k.    (2.2)\n\nSee Sec. 3 for a discussion of possible distributions. The sketch of the matrix A takes the form\n\nY = AΩ ∈ Fn×k.    (2.3)\n\nThe sketch (2.3) supports updates of the form (2.1):\n\nY ← θ1 Y + θ2 HΩ.    (2.4)\n\nTo find a good rank-r approximation, we must set the sketch size k larger than r. But storage costs and computation also increase with k. One of our main contributions is to clarify the role of k.\nUnder the model (2.1), it is more or less necessary to use a randomized linear sketch to track A [28]. For psd matrices, sketches of the form (2.2)–(2.3) appear explicitly in Gittens's work [16, 17, 19]. Tropp et al. [37] relies on a more complicated sketch developed in [7, 42].\nThe Nyström Approximation. The Nyström method is a general technique for low-rank psd matrix approximation. 
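(As a quick aside, the reason the sketch can track the stream is pure linearity: updating Y via (2.4) agrees exactly with sketching the updated matrix. A small numpy check, with illustrative sizes rather than the paper's implementation, verifies this.)

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 8
G = rng.standard_normal((n, n))
A = G @ G.T                               # psd starting matrix
Omega = rng.standard_normal((n, k))       # fixed random test matrix (2.2)
Y = A @ Omega                             # sketch (2.3)

S = rng.standard_normal((n, n))
H = S + S.T                               # conjugate-symmetric innovation
theta1, theta2 = 0.9, 0.1
A = theta1 * A + theta2 * H               # linear update (2.1)
Y = theta1 * Y + theta2 * (H @ Omega)     # matching sketch update (2.4)

# The updated sketch agrees with the sketch of the updated matrix
assert np.allclose(Y, A @ Omega)
```

Only Y and Omega (2kn numbers) need to be stored; A is formed here solely to verify the identity.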
Various instantiations appear in the papers [5, 11, 13, 16, 17, 19, 22, 27, 34, 40].\nHere is the application to the present situation. Given the test matrix Ω and the sketch Y = AΩ, the Nyström method constructs a rank-k psd approximation of the psd matrix A via the formula\n\nÂ_nys = Y (Ω*Y)† Y*.    (2.5)\n\nIn most work on the Nyström method, the test matrix Ω depends adaptively on A, so these approaches are not valid in the streaming setting. Gittens's framework [16, 17, 19] covers the streaming case.\nFixed-Rank Nyström Approximation: Prior Art. To construct a Nyström approximation with exact rank r from a sketch of size k, the standard approach is to truncate the center matrix to rank r:\n\nÂ_r^nysfix = Y (⟦Ω*Y⟧_r)† Y*.    (2.6)\n\nThe truncated Nyström approximation (2.6) appears in many papers, including [5, 11, 18, 34]. We have found (Sec. 5) that the truncation method (2.6) performs poorly in the present setting. This observation motivated us to search for more effective techniques.\nFixed-Rank Nyström Approximation: Proposal. The purpose of this paper is to develop, analyze, and evaluate a new approach for fixed-rank approximation of a psd matrix under the streaming model. We propose a more intuitive rank-r approximation:\n\nÂ_r = ⟦Â_nys⟧_r.    (2.7)\n\nThat is, we report a best rank-r approximation of the full Nyström approximation (2.5).\nThis "matrix nearness" approach to fixed-rank approximation appears in the papers [21, 22, 37]. The combination with the Nyström method (2.5) is entirely natural. Let us emphasize that the approach (2.7) also applies to Nyström approximations outside the streaming setting.\nSummary of Contributions. This paper contains a number of advances over the prior art:\n\n1. 
We propose a new technique (2.7) for truncating the Nystr\u00f6m approximation to rank r. This\n\nformulation differs from the published literature on \ufb01xed-rank Nystr\u00f6m approximations.\n\n2. We present a stable numerical implementation of (2.7) based on the best practices outlined\n\nin the paper [27]. This approach is essential for achieving high precision! (Sec. 3)\n\n3. We establish informative error bounds for the method (2.7). In particular, we prove that it\n\nattains (1 + \u03b5)-relative error in the Schatten 1-norm when k = \u0398(r/\u03b5). (Sec. 4)\n\n4. We document numerical experiments on real and synthetic data to demonstrate that our\nmethod dominates existing techniques [18, 37] for \ufb01xed-rank psd approximation. (Sec. 5)\nPsd matrix approximation is a ubiquitous problem, so we expect these results to have a broad impact.\nRelated Work. Randomized algorithms for low-rank matrix approximation were proposed in\nthe late 1990s and developed into a technology in the 2000s; see [22, 30, 41]. In the absence of\nconstraints, such as streaming, we recommend the general-purpose methods from [22, 23, 27].\nAlgorithms for low-rank matrix approximation in the important streaming data setting are discussed\nin [4, 7, 8, 15, 22, 37, 41, 42]. Few of these methods are designed for psd matrices.\nNystr\u00f6m methods for low-rank psd matrix approximation appear in [11, 13, 16, 17, 19, 22, 26, 34,\n37, 40, 43]. These works mostly concern kernel matrices; they do not focus on the streaming model.\nWe are only aware of a few papers [16, 17, 19, 37] on algorithms for psd matrix approximation\nthat operate under the streaming model (2.1). These papers form the comparison group.\nAfter this paper was submitted, we learned about two contemporary works [35, 39] that propose the\n\ufb01xed-rank approximation (2.7) in the context of kernel methods. 
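To make the distinction concrete, the full Nyström approximation (2.5), the standard truncation (2.6), and the proposal (2.7) can each be written in a few lines of numpy. This is an illustrative sketch, not the authors' implementation (see Sec. 3 for the numerically stable version), and the function names are ours:

```python
import numpy as np

def nystrom(Y, Omega):
    """Full rank-k Nystrom approximation (2.5): Y (Omega^* Y)^+ Y^*."""
    return Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T

def nystrom_fix(Y, Omega, r):
    """Prior art (2.6): truncate the center matrix Omega^* Y to rank r."""
    U, s, Vt = np.linalg.svd(Omega.T @ Y)
    center_r = (U[:, :r] * s[:r]) @ Vt[:r]          # r-truncated SVD
    return Y @ np.linalg.pinv(center_r) @ Y.T

def proposed(Y, Omega, r):
    """Proposal (2.7): best rank-r approximation of the full Nystrom approx."""
    A_nys = nystrom(Y, Omega)
    w, V = np.linalg.eigh(A_nys)                    # A_nys is psd
    top = np.argsort(w)[-r:]                        # r largest eigenvalues
    return (V[:, top] * w[top]) @ V[:, top].T

rng = np.random.default_rng(2)
n, r, k = 60, 3, 10
G = rng.standard_normal((n, r))
A = G @ G.T                                         # exactly rank-r psd input
Omega = rng.standard_normal((n, k))
Y = A @ Omega

# With an exactly rank-r input and k > r, both truncations recover A
assert np.allclose(proposed(Y, Omega, r), A)
assert np.allclose(nystrom_fix(Y, Omega, r), A)
```

The two truncations coincide on exactly low-rank inputs; the experiments in Sec. 5 are what separate them.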
Our research is distinctive because we focus on the streaming setting, we obtain precise error bounds, we address numerical stability, and we include an exhaustive empirical evaluation.\nFinally, let us mention two very recent theoretical papers [6, 33] that present existential results on algorithms for fixed-rank psd matrix approximation. The approach in [6] is only appropriate for sparse input matrices, while the work [33] is not valid in the streaming setting.\n\n3 Implementation\n\nDistributions for the Test Matrix. To ensure that the sketch is informative, we must draw the test matrix (2.2) at random from a suitable distribution. The choice of distribution determines the computational requirements for the sketch (2.3), the linear updates (2.4), and the matrix approximation (2.7). It also affects the quality of the approximation (2.7). Let us outline some of the most useful distributions. A full discussion is outside the scope of our work, but see [17, 19, 22, 29, 30, 37, 41].\nIsotropic Models. Mathematically, the most natural model is to construct a test matrix Ω ∈ Fn×k whose range is a uniformly random k-dimensional subspace in Fn. There are two approaches:\n\n1. Gaussian. Draw each entry of the matrix Ω ∈ Fn×k independently at random from the standard normal distribution on F.\n2. Orthonormal. Draw a Gaussian matrix G ∈ Fn×k, as above. Compute a thin orthogonal–triangular factorization G = ΩR to obtain the test matrix Ω ∈ Fn×k. Discard R.\n\nGaussian and orthonormal test matrices both require storage of kn floating-point numbers in F for the test matrix Ω and another kn floating-point numbers for the sketch Y. In both cases, the cost of multiplying a vector in Fn into Ω is Θ(kn) floating-point operations.\n\nAlgorithm 1 Sketch Initialization. 
Implements (2.2)–(2.3) with a random orthonormal test matrix.\nInput: Positive-semidefinite input matrix A ∈ Fn×n; sketch size parameter k\nOutput: Constructs test matrix Ω ∈ Fn×k and sketch Y = AΩ ∈ Fn×k\n\nlocal: Ω, Y ▷ Internal variables for NYSTROMSKETCH\nfunction NYSTROMSKETCH(A; k) ▷ Constructor\n  if F = R then Ω ← randn(n, k)\n  if F = C then Ω ← randn(n, k) + i · randn(n, k)\n  Ω ← orth(Ω) ▷ Improve numerical stability\n  Y ← AΩ\n\nAlgorithm 2 Linear Update. Implements (2.4).\nInput: Scalars θ1, θ2 ∈ R and conjugate symmetric H ∈ Fn×n\nOutput: Updates sketch to reflect linear innovation A ← θ1 A + θ2 H\n\nlocal: Ω, Y ▷ Internal variables for NYSTROMSKETCH\nfunction LINEARUPDATE(θ1, θ2, H)\n  Y ← θ1 Y + θ2 HΩ\n\nFor isotropic models, we can analyze the approximation (2.7) in detail. In exact arithmetic, Gaussian and orthonormal test matrices yield identical Nyström approximations (Supplement). In floating-point arithmetic, orthonormal matrices are more stable for large k, but we can generate Gaussian matrices with less arithmetic and communication. References for isotropic test matrices include [21, 22, 31].\nSubsampled Scrambled Fourier Transform (SSFT). One shortcoming of the isotropic models is the cost of storing the test matrix and the cost of multiplying a vector into the test matrix. We can often reduce these costs using an SSFT test matrix. An SSFT takes the form\n\nΩ = Π1 F Π2 F R ∈ Fn×k.    (3.1)\n\nThe Πi ∈ Fn×n are independent, signed permutation matrices,1 chosen uniformly at random. 
The\nmatrix F \u2208 Fn\u00d7n is a discrete Fourier transform (F = C) or a discrete cosine transform (F = R).\nThe matrix R \u2208 Fn\u00d7k is a restriction to k coordinates, chosen uniformly at random.\nAn SSFT \u2126 requires only \u0398(n) storage, but the sketch Y still requires storage of kn numbers.\nWe can multiply a vector in Fn into \u2126 using \u0398(n log n) arithmetic operations via an FFT or FCT\nalgorithm. Thus, for most choices of sketch size k, the SSFT improves over the isotropic models.\nIn practice, the SSFT yields matrix approximations whose quality is identical to those we obtain with\nan isotropic test matrix (Sec. 5). Although the analysis for SSFTs is less complete, the empirical\nevidence con\ufb01rms that the theory for isotropic models also offers excellent guidance for SSFTs.\nReferences for SSFTs and related test matrices include [1, 2, 9, 22, 29, 36, 42].\nNumerically Stable Implementation. It requires care to compute the \ufb01xed-rank approximation (2.7).\nThe supplement shows that a poor implementation may produce an approximation with 100% error!\nLet us outline a numerically stable and very accurate implementation of (2.7), based on an idea\nfrom [27, 38]. Fix a small parameter \u03bd > 0. Instead of approximating the psd matrix A directly, we\napproximate the shifted matrix A\u03bd = A + \u03bdI and then remove the shift. Here are the steps:\n\n1. Construct the shifted sketch Y\u03bd = Y + \u03bd\u2126.\n2. Form the matrix B = \u2126\u2217Y\u03bd.\n3. Compute a Cholesky decomposition B = CC\u2217.\n4. Compute E = Y\u03bdC\u22121 by back-substitution.\n5. Compute the (thin) singular value decomposition E = U \u03a3V \u2217.\n\n6. Form \u02c6Ar = U(cid:74)\u03a32 \u2212 \u03bdI(cid:75)rU\u2217.\n\n1A signed permutation has exactly one nonzero entry in each row and column; the nonzero has modulus one.\n\n4\n\n\fAlgorithm 3 Fixed-Rank PSD Approximation. 
Implements (2.7).\nInput: Matrix A in sketch must be psd; rank parameter 1 ≤ r ≤ k\nOutput: Returns factors U ∈ Fn×r with orthonormal columns and nonnegative, diagonal Λ ∈ Fr×r that form a rank-r psd approximation Â_r = UΛU* of the sketched matrix A\n\nlocal: Ω, Y ▷ Internal variables for NYSTROMSKETCH\nfunction FIXEDRANKPSDAPPROX(r)\n  ν ← µ · norm(Y) ▷ µ = 2.2 · 10^−16 in double precision\n  Y ← Y + νΩ ▷ Sketch of shifted matrix A + νI\n  B ← Ω*Y\n  C ← chol((B + B*)/2) ▷ Force symmetry\n  (U, Σ, ∼) ← svd(Y/C, 'econ') ▷ Solve least-squares problem; form thin SVD\n  U ← U(:, 1:r) and Σ ← Σ(1:r, 1:r) ▷ Truncate to rank r\n  Λ ← max{0, Σ² − νI} ▷ Square to get eigenvalues; remove shift\n  return (U, Λ)\n\nThe pseudocode addresses some additional implementation details. Related, but distinct, methods were proposed by Williams & Seeger [40] and analyzed in Gittens's thesis [17].\nPseudocode. We present detailed pseudocode for the sketch (2.2)–(2.4) and the implementation of the fixed-rank psd approximation (2.7) described above. For simplicity, we only elaborate the case of a random orthonormal test matrix; we have also developed an SSFT implementation for empirical testing. The pseudocode uses both mathematical notation and MATLAB 2017A functions.\nAlgorithms and Computational Costs. Algorithm 1 constructs a random orthonormal test matrix, and computes the sketch (2.3) of an input matrix. The test matrix and sketch require the storage of 2kn floating-point numbers. Owing to the orthogonalization step, the construction of the test matrix requires Θ(k²n) floating-point operations. 
For a general input matrix, the sketch requires Θ(kn²) floating-point operations; this cost can be removed by initializing the input matrix to zero.\nAlgorithm 2 implements the linear update (2.4) to the sketch. Nominally, the computation requires Θ(kn²) arithmetic operations, but this cost can be reduced when H has structure (e.g., low rank). Using the SSFT test matrix (3.1) also reduces this cost.\nAlgorithm 3 computes the rank-r psd approximation (2.7). This method requires additional storage of Θ(kn). The arithmetic cost is Θ(k²n) operations, which is dominated by the SVD of the matrix E.\n\n4 Theoretical Results\n\nRelative Error Bound. Our first result is an accurate bound for the expected Schatten 1-norm error in the fixed-rank psd approximation (2.7).\nTheorem 4.1 (Fixed-Rank Nyström: Relative Error). Assume 1 ≤ r < k ≤ n. Let A ∈ Fn×n be a psd matrix. Draw a test matrix Ω ∈ Fn×k from the Gaussian or orthonormal distribution, and form the sketch Y = AΩ. Then the approximation Â_r given by (2.5) and (2.7) satisfies\n\nE ‖A − Â_r‖_1 ≤ (1 + r/(k − r − α)) · ‖A − ⟦A⟧_r‖_1;    (4.1)\n\nE ‖A − Â_r‖_∞ ≤ ‖A − ⟦A⟧_r‖_∞ + r/(k − r − α) · ‖A − ⟦A⟧_r‖_1.    (4.2)\n\nThe quantities α(R) = 1 and α(C) = 0. Similar results hold with high probability.\nThe proof appears in the supplement.\nIn contrast to all previous analyses of randomized Nyström methods, Theorem 4.1 yields explicit, sharp constants. (The contemporary work [39, Thm. 1] contains only a less precise variant of (4.1).)\nAs a consequence, the formulae (4.1)–(4.2) offer an a priori mechanism for selecting the sketch size k to achieve a desired error bound. 
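As an implementation aside, the stable shift-and-remove procedure of Sec. 3 (Algorithm 3) translates directly into numpy. The sketch below is illustrative rather than the authors' MATLAB reference code; the input matrix is synthetic, and µ is set to double-precision machine epsilon as in Algorithm 3.

```python
import numpy as np

def fixed_rank_psd_approx(Y, Omega, r, mu=2.2e-16):
    """Rank-r psd approximation (2.7) from a sketch Y = A Omega, following
    the shift-and-remove procedure of Sec. 3 (Algorithm 3)."""
    nu = mu * np.linalg.norm(Y)               # small shift nu = mu * ||Y||
    Y = Y + nu * Omega                        # sketch of A + nu*I
    B = Omega.T @ Y
    C = np.linalg.cholesky((B + B.T) / 2)     # force symmetry, then factor
    E = np.linalg.solve(C, Y.T).T             # E = Y C^{-T}, so E E^* = Y B^{-1} Y^*
    U, s, _ = np.linalg.svd(E, full_matrices=False)
    lam = np.maximum(s[:r] ** 2 - nu, 0)      # square, truncate, remove shift
    return U[:, :r], lam                      # A_hat = U diag(lam) U^*

rng = np.random.default_rng(3)
n, r, k = 50, 5, 15
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
d = np.concatenate([np.ones(r), 1e-6 * rng.random(n - r)])
A = (Q * d) @ Q.T                             # psd input with effective rank r
Omega = rng.standard_normal((n, k))

U, lam = fixed_rank_psd_approx(A @ Omega, Omega, r)
A_hat = (U * lam) @ U.T

assert np.all(lam >= 0)
assert np.linalg.norm(A - A_hat) < 1e-2       # error is on the scale of the tail
```

Working with E rather than forming Y B⁻¹ Y* explicitly is what keeps the computation accurate; the shift ν guards the Cholesky factorization when B is nearly singular.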
In particular, for each ε > 0,\n\nk = (1 + ε^−1) r + α implies E ‖A − Â_r‖_1 ≤ (1 + ε) · ‖A − ⟦A⟧_r‖_1.\n\nThus, we can attain an arbitrarily small relative error in the Schatten 1-norm. In the streaming setting, the scaling k = Θ(r/ε) is optimal for this result [14, Thm. 4.2]. Furthermore, it is impossible [41, Sec. 6.2] to obtain "pure" relative error bounds in the Schatten ∞-norm unless k = Ω(n).\nThe Role of Spectral Decay. To circumvent these limitations, it is necessary to develop a different kind of error bound. Our second result shows that the fixed-rank psd approximation (2.7) automatically exploits decay in the spectrum of the input matrix.\nTheorem 4.2 (Fixed-Rank Nyström: Spectral Decay). Instate the notation and assumptions of Theorem 4.1. Then\n\nE ‖A − Â_r‖_1 ≤ ‖A − ⟦A⟧_r‖_1 + 2 min (…);\nE ‖A − Â_r‖_∞ ≤ ‖A − ⟦A⟧_r‖_∞ + 2 min (…).\n\n[…]\n\n2. Polynomial Decay. These matrices take the form\n\nA = diag(1, . . . , 1, 2^−p, 3^−p, . . . , (n − R + 1)^−p) ∈ Fn×n.\n\nThe parameter p > 0 controls the rate of polynomial decay. We consider three examples: PolyDecaySlow (p = 0.5), PolyDecayMed (p = 1), PolyDecayFast (p = 2).\n\n3. Exponential Decay. These matrices take the form\n\nA = diag(1, . . . , 1, 10^−q, 10^−2q, . . . , 10^−(n−R)q) ∈ Fn×n.\n\nThe parameter q > 0 controls the rate of exponential decay. We consider three examples: ExpDecaySlow (q = 0.1), ExpDecayMed (q = 0.25), ExpDecayFast (q = 1).\n\nApplication Examples. We also consider non-diagonal matrices inspired by the SDP algorithm [44].\n\n1. MaxCut: This is a real-valued psd matrix with dimension n = 2 000, and its effective rank R = 14. We form approximations with rank r ∈ {1, 14}. The matrix is an approximate solution to the MAXCUT SDP [20] for the sparse graph G40 [10].\n2. 
PhaseRetrieval: This is a psd matrix with dimension n = 25 921. It has exact rank 250, but its effective rank R = 5. We form approximations with rank r ∈ {1, 5}. The matrix is an approximate solution to a phase retrieval SDP; the data is drawn from our paper [44].\n\nExperimental Results. Figures 5.1–5.2 display the performance of the three fixed-rank psd approximation methods for a subcollection of the input matrices. The vertical axis is the Schatten 1-norm relative error (5.1). The variable T on the horizontal axis is proportional to the storage required for the sketch only. For the Nyström-based approximations (2.6)–(2.7), we have the correspondence T = k. For the approximation [37, Alg. 9], we set T = k + ℓ.\n\n[FIGURE 5.1: Schatten 1-norm relative error as a function of storage cost T for the application examples; data series: [TYUC17, Alg. 9], Standard (2.6), Proposed (2.7).]\n\n[FIGURE 5.2 panels: (A) LowRankLowNoise, (B) LowRankMedNoise, (C) LowRankHiNoise, (D) PolyDecayFast, (E) PolyDecayMed, (F) PolyDecaySlow, (G) ExpDecayFast, (H) ExpDecayMed, (I) ExpDecaySlow.]\n\nFIGURE 5.2: Synthetic Examples with Effective Rank R = 10, Approximation Rank r = 10, Schatten 1-Norm Error. The data series show the performance of three algorithms for rank-r psd approximation with r = 10. Solid lines are generated from the Gaussian sketch; dashed lines are from the SSFT sketch. Each panel displays the Schatten 1-norm relative error (5.1) as a function of storage cost T.\n\nThe experiments demonstrate that the proposed method (2.7) has a significant benefit over the alternatives for input matrices that admit a good low-rank approximation. It equals or improves on the competitors for almost all other examples and storage budgets. The supplement contains additional numerical results; these experiments only reinforce the message of Figures 5.1–5.2.\nConclusions. 
This paper makes the case for using the proposed fixed-rank psd approximation (2.7) in lieu of the alternatives (2.6) or [37, Alg. 9]. Theorem 4.1 shows that the proposed fixed-rank psd approximation (2.7) can attain any prescribed relative error, and Theorem 4.2 shows that it can exploit spectral decay. Furthermore, our numerical work demonstrates that the proposed approximation improves (almost) uniformly over the competitors for a range of examples. These results are timely because of the recent arrival of compelling applications, such as [44], for sketching psd matrices.\n\nAcknowledgments. The authors wish to thank Mark Tygert and Alex Gittens for helpful feedback on preliminary versions of this work. JAT gratefully acknowledges partial support from ONR Award N00014-17-1-2146 and the Gordon & Betty Moore Foundation. VC and AY were supported in part by the European Commission under Grant ERC Future Proof, SNF 200021-146750, and SNF CRSII2-147633. MU was supported in part by DARPA Award FA8750-17-2-0101.\n\nReferences\n[1] N. Ailon and B. Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.\n[2] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM J. Matrix Anal. Appl., 34(3):1301–1340, 2013.\n[3] C. Boutsidis, D. Garber, Z. Karnin, and E. Liberty. Online principal components analysis. In Proc. 26th Ann. ACM-SIAM Symp. Discrete Algorithms (SODA), pages 887–901, 2015.\n[4] C. Boutsidis, D. 
Woodruff, and P. Zhong. Optimal principal component analysis in distributed and\n\nstreaming models. In Proc. 48th ACM Symp. Theory of Computing (STOC), 2016.\n\n[5] J. Chiu and L. Demanet. Sublinear randomized algorithms for skeleton decompositions. SIAM J. Matrix\n\nAnal. Appl., 34(3):1361\u20131383, 2013.\n\n[6] K. Clarkson and D. Woodruff. Low-rank PSD approximation in input-sparsity time. In Proc. 28th Ann.\n\nACM-SIAM Symp. Discrete Algorithms (SODA), pages 2061\u20132072, Jan. 2017.\n\n[7] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In Proc. 41st ACM\n\nSymp. Theory of Computing (STOC), 2009.\n\n[8] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means\nclustering and low rank approximation. In Proc. 47th ACM Symp. Theory of Computing (STOC), pages\n163\u2013172. ACM, New York, 2015.\n\n[9] M. B. Cohen, J. Nelson, and D. P. Woodruff. Optimal Approximate Matrix Product in Terms of Stable\nRank. In 43rd Int. Coll. Automata, Languages, and Programming (ICALP), volume 55, pages 11:1\u201311:14,\n2016.\n\n[10] T. A. Davis and Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 3(1):\n\n1:1\u20131:25, 2011.\n\n[11] P. Drineas and M. W. Mahoney. On the Nystr\u00f6m method for approximating a Gram matrix for improved\n\nkernel-based learning. J. Mach. Learn. Res., 6:2153\u20132175, 2005.\n\n[12] D. Feldman, M. Volkov, and D. Rus. Dimensionality reduction of massive sparse datasets using coresets.\n\nIn Adv. Neural Information Processing Systems 29 (NIPS), 2016.\n\n[13] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nystr\u00f6m method. IEEE\n\nTrans. Pattern Anal. Mach. Intell., 26(2):214\u2013225, Jan. 2004.\n\n[14] M. Ghasemi, E. Liberty, J. M. Phillips, and D. P. Woodruff. Frequent directions: Simple and deterministic\n\nmatrix sketching. SIAM J. Comput., 45(5):1762\u20131792, 2016.\n\n[15] A. C. Gilbert, J. Y. Park, and M. B. 
Wakin. Sketched SVD: Recovering spectral features from compressed\n\nmeasurements. Available at http://arXiv.org/abs/1211.0361, Nov. 2012.\n\n[16] A. Gittens. The spectral norm error of the na\u00efve Nystr\u00f6m extension. Available at http:arXiv.org/abs/\n\n1110.5305, Oct. 2011.\n\n[17] A. Gittens. Topics in Randomized Numerical Linear Algebra. PhD thesis, California Institute of Technology,\n\n2013.\n\n[18] A. Gittens and M. W. Mahoney. Revisiting the Nystr\u00f6m method for improved large-scale machine learning.\n\nAvailable at http://arXiv.org/abs/1303.1849, Mar. 2013.\n\n[19] A. Gittens and M. W. Mahoney. Revisiting the Nystr\u00f6m method for improved large-scale machine learning.\n\nJ. Mach. Learn. Res., 17:Paper No. 117, 65, 2016.\n\n[20] M. X. Goemans and D. P. Williamson.\n\nImproved approximation algorithms for maximum cut and\nsatis\ufb01ability problems using semide\ufb01nite programming. J. Assoc. Comput. Mach., 42(6):1115\u20131145, 1995.\n[21] M. Gu. Subspace iteration randomization and singular value problems. SIAM J. Sci. Comput., 37(3):\n\nA1139\u2013A1173, 2015.\n\n[22] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms\n\nfor constructing approximate matrix decompositions. SIAM Rev., 53(2):217\u2013288, 2011.\n\n9\n\n\f[23] Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal\ncomponent analysis of large data sets. SIAM J. Sci. Comput., 33(5):2580\u20132594, 2011. ISSN 1064-8275.\ndoi: 10.1137/100804139. URL http://dx.doi.org/10.1137/100804139.\n\n[24] N. J. Higham. Matrix nearness problems and applications. In Applications of matrix theory (Bradford,\n\n1988), pages 1\u201327. Oxford Univ. Press, New York, 1989.\n\n[25] P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford. Streaming PCA: Matching matrix Bernstein and\nnear-optimal \ufb01nite sample guarantees for Oja\u2019s algorithm. In 29th Ann. Conf. 
Learning Theory (COLT),\npages 1147\u20131164, 2016.\n\n[26] S. Kumar, M. Mohri, and A. Talwalkar. Sampling methods for the Nystr\u00f6m method. J. Mach. Learn. Res.,\n\n13:981\u20131006, Apr. 2012.\n\n[27] H. Li, G. C. Linderman, A. Szlam, K. P. Stanton, Y. Kluger, and M. Tygert. Algorithm 971: An\nimplementation of a randomized algorithm for principal component analysis. ACM Trans. Math. Softw., 43\n(3):28:1\u201328:14, Jan. 2017.\n\n[28] Y. Li, H. L. Nguyen, and D. P. Woodruff. Turnstile streaming algorithms might as well be linear sketches.\n\nIn Proc. 2014 ACM Symp. Theory of Computing (STOC), pages 174\u2013183. ACM, 2014.\n\n[29] E. Liberty. Accelerated dense random projections. PhD thesis, Yale Univ., New Haven, 2009.\n[30] M. W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123\u2013224,\n\n2011.\n\n[31] P.-G. Martinsson, V. Rokhlin, and M. Tygert. A randomized algorithm for the decomposition of matrices.\n\nAppl. Comput. Harmon. Anal., 30(1):47\u201368, 2011.\n\n[32] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Adv. Neural Information\n\nProcessing Systems 26 (NIPS), pages 2886\u20132894, 2013.\n\n[33] C. Musco and D. Woodruff. Sublinear time low-rank approximation of positive semide\ufb01nite matrices.\n\nAvailable at http://arXiv.org/abs/1704.03371, Apr. 2017.\n\n[34] J. C. Platt. FastMap, MetricMap, and Landmark MDS are all Nystr\u00f6m algorithms. In Proc. 10th Int.\n\nWorkshop Arti\ufb01cial Intelligence and Statistics (AISTATS), pages 261\u2013268, 2005.\n\n[35] F. Pourkamali-Anaraki and S. Becker. Randomized clustered Nystr\u00f6m for large-scale kernel machines.\n\nAvailable at http://arXiv.org/abs/1612.06470, Dec. 2016.\n\n[36] J. A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data\n\nAnal., 3(1-2):115\u2013126, 2011.\n\n[37] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. 
Randomized single-view algorithms for low-rank\nmatrix approximation. ACM Report 2017-01, Caltech, Pasadena, Jan. 2017. Available at http://arXiv.\norg/abs/1609.00048, v1.\n\n[38] M. Tygert. Beta versions of Matlab routines for principal component analysis. Available at http:\n\n//tygert.com/software.html, 2014.\n\n[39] S. Wang, A. Gittens, and M. W. Mahoney. Scalable kernel K-means clustering with Nystr\u00f6m approximation:\n\nrelative-error bounds. Available at http://arXiv.org/abs/1706.02803, June 2017.\n\n[40] C. K. I. Williams and M. Seeger. Using the Nystr\u00f6m method to speed up kernel machines. In Adv. Neural\n\nInformation Processing Systems 13 (NIPS), 2000.\n\n[41] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci., 10\n\n(1-2):iv+157, 2014.\n\n[42] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of\n\nmatrices. Appl. Comput. Harmon. Anal., 25(3):335\u2013366, 2008.\n\n[43] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nystr\u00f6m method vs random Fourier features: A\ntheoretical and empirical comparison. In Adv. Neural Information Processing Systems 25 (NIPS), pages\n476\u2013484, 2012.\n\n[44] A. Yurtsever, M. Udell, J. A. Tropp, and V. Cevher. Sketchy decisions: Convex low-rank matrix optimization\nIn Proc. 20th Int. Conf. Arti\ufb01cial Intelligence and Statistics (AISTATS), Fort\n\nwith optimal storage.\nLauderdale, May 2017.\n\n10\n\n\f", "award": [], "sourceid": 821, "authors": [{"given_name": "Joel", "family_name": "Tropp", "institution": "Caltech"}, {"given_name": "Alp", "family_name": "Yurtsever", "institution": "\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, Switzerland"}, {"given_name": "Madeleine", "family_name": "Udell", "institution": "Cornell"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}]}