{"title": "Principal Component Projection and Regression in Nearly Linear Time through Asymmetric SVRG", "book": "Advances in Neural Information Processing Systems", "page_first": 3868, "page_last": 3878, "abstract": "Given a n-by-d data matrix A, principal component projection (PCP) and principal component regression (PCR), i.e. projection and regression restricted to the top-eigenspace of A, are fundamental problems in machine learning, optimization, and numerical analysis. In this paper we provide the first algorithms that solve these problems in nearly linear time for fixed eigenvalue distribution and large n. This improves upon previous methods which had superlinear running times when either the number of top eigenvalues or gap between the eigenspaces were large. \n\nWe achieve our results by applying rational polynomial approximations to reduce the problem to solving asymmetric linear systems which we solve by a variant of SVRG. We corroborate these findings with preliminary empirical experiments.", "full_text": "Principal Component Projection and Regression in\nNearly Linear Time through Asymmetric SVRG\n\nYujia Jin\n\nStanford Universty\n\nyujiajin@stanford.edu\n\nAaron Sidford\n\nStanford Universty\n\nsidford@stanford.edu\n\nAbstract\n\nGiven a data matrix A\u2208Rn\u00d7d, principal component projection (PCP) and princi-\npal component regression (PCR), i.e. projection and regression restricted to the\ntop-eigenspace of A, are fundamental problems in machine learning, optimization,\nand numerical analysis. 
In this paper we provide the first algorithms that solve these problems in nearly linear time for fixed eigenvalue distribution and large n. This improves upon previous methods, which have superlinear running times when both the number of top eigenvalues and the inverse gap between eigenspaces are large. We achieve our results by applying rational polynomial approximations to reduce PCP and PCR to solving asymmetric linear systems, which we solve by a variant of SVRG. We corroborate these findings with preliminary empirical experiments.\n\n1 Introduction\n\nPCA is one of the most fundamental algorithmic tools for analyzing large data sets. Given a data matrix A ∈ ℝ^{n×d} and a parameter k, the classic principal component analysis (PCA) problem asks to compute the top k eigenvectors of AᵀA. This is a core computational task in machine learning and is often used for feature selection [1–3], data visualization [4, 5], and model compression [6].\nHowever, as k becomes large, the running time of PCA can degrade. Even just writing down the output takes Ω(kd) time, and the performance of many methods degrades with k. This high computational cost for exploring large-cardinality top-eigenspaces has motivated researchers to consider prominent tasks solved by PCA, for example principal component projection (PCP), which asks to project a vector onto the top-k eigenspace, and principal component regression (PCR), which asks to solve regression restricted to this top-k eigenspace (see Section 1.2).\nRecent work [7, 8] showed that the dependence on k in solving PCP and PCR can be overcome by instead depending on the eigengap γ, defined as the ratio between the smallest eigenvalue of the space projected onto and the largest eigenvalue of the space projected orthogonal to. 
These works replace the typical poly(k)·nnz(A) dependence in the runtime with a poly(1/γ)·nnz(A) dependence (at the cost of lower-order terms), by reducing these problems to solving poly(1/γ) ridge-regression problems. Unfortunately, for large-scale problems, as data-set sizes grow so too can k and 1/γ, yielding large super-linear running times for all previously known methods (see Section 1.4). Consequently, this leaves the following fundamental open problem:\n\nCan we obtain nearly linear running times for solving PCP and PCR to high precision, i.e. with running time Õ(nnz(A)) plus an additive term depending only on the eigenvalue distribution?\n\nThe main contribution of the paper is an affirmative answer to this question. We design randomized algorithms that solve PCP and PCR with high probability in nearly linear time. Leveraging rational polynomial approximations, we reduce these problems to solving asymmetric linear systems, which we solve by a technique we call asymmetric SVRG. Further, we provide experiments demonstrating the efficacy of this method.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Approach\n\nTo obtain our results, we depart critically from the previous frameworks of Frostig et al. [7] and Allen-Zhu and Li [8] for solving PCP and PCR. These papers use polynomial approximations to the sign function to reduce PCP and PCR to solving ridge regression. Their runtime is limited by the Ω(1/γ) degree necessary for polynomial approximation of the sign function, shown by Eremenko and Yuditskii [9]. Consequently, to obtain nearly linear runtime, a new insight is required.\nIn our paper, we instead consider rational approximations to the sign function and show that these efficiently reduce PCP and PCR to solving a particular class of squared linear systems. 
The closed-form expression for the best rational approximation to the sign function was given by Zolotarev [10] and has recently been proposed for matrix sign approximation [11]. The degree of such rational functions is logarithmic in 1/γ, leading to far fewer linear systems to solve. While the squared systems [(AᵀA − cI)² + μ²I]x = b, μ > 0, induced by this rational approximation are computationally more expensive to solve than simple ridge-regression problems [AᵀA + μI]x = b, μ > 0, interestingly, we show that these systems can still be solved in nearly linear time for sufficiently large matrices. As a by-product of this analysis, we also obtain an efficient algorithm that leverages linear system solvers to apply the square root of a positive semidefinite (PSD) matrix to a vector; here we call a matrix M positive semidefinite, denoted M ⪰ 0, when xᵀMx ≥ 0 for all x.\nWe believe the solver we develop for these squared systems is of independent interest. Our solver is a variant of the stochastic variance-reduced gradient descent algorithm (SVRG) [12], modified to solve asymmetric linear systems. Our iterative method can be viewed as an instance of the variance-reduced algorithm for monotone operators discussed in Section 6 of Palaniappan and Bach [13], with a more careful analysis of the error. We also combine this method with approximate proximal point [14] or Catalyst [15] to obtain accelerated variants.\nThe conventional wisdom when solving an asymmetric system Mx = b that is not positive semidefinite (PSD) is to instead solve its PSD counterpart MᵀMx = Mᵀb. However, this operation can greatly impair the performance of stochastic methods, e.g. SVRG [12], SAG [16], etc. (See Section 3.) 
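As a quick numeric illustration of why the normal-equations trick is costly (our own example, not from the paper): passing from M to MᵀM squares every singular value, so the condition number is squared as well.

```python
import numpy as np

# Forming the normal equations M^T M x = M^T b squares all singular values of M,
# so cond(M^T M) = cond(M)^2 -- one reason iterating on M^T M can be much harder
# than working with M directly.
rng = np.random.default_rng(0)
d = 20
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = np.geomspace(1.0, 1e-3, d)      # singular values spanning 3 orders of magnitude
M = U @ np.diag(sigma) @ V.T            # cond_2(M) = 1e3 by construction

cond_M = np.linalg.cond(M)
cond_MtM = np.linalg.cond(M.T @ M)
print(cond_M, cond_MtM)                 # cond(M^T M) equals cond(M)^2
```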
The solver developed in this paper constitutes one of the few known cases where transforming a problem into asymmetric system solving enables better algorithm design and thus provides large savings (see Corollary 1). Ultimately, we believe this work on SVRG-based methods outside of convex optimization, as well as our improved PCP and PCR algorithms, may find further impact.\n\n1.2 The Problems\n\nHere we formally define the PCP (Definition 1), PCR (Definition 2), Squared Ridge Regression (Definition 3), and Square-root Computation (Definition 4) problems we consider throughout this paper. Throughout, we let A ∈ ℝ^{n×d} (n ≥ d) denote a data matrix where each row aᵢ ∈ ℝ^d is viewed as a datapoint. Our algorithms typically manipulate the positive semidefinite (PSD) matrix AᵀA. We denote the eigenvalues of AᵀA as λ1 ≥ λ2 ≥ ··· ≥ λd ≥ 0 and the corresponding eigenvectors as ν1, ν2, ···, νd ∈ ℝ^d, i.e. AᵀA = VΛVᵀ with V := (ν1, ···, νd) and Λ := diag(λ1, ···, λd). Given an eigenvalue threshold λ ∈ (0, λ1), we define P_λ := (ν1, ···, νk)(ν1, ···, νk)ᵀ as the projection matrix projecting any vector onto the top-k eigenvectors of AᵀA, i.e. span{ν1, ν2, ···, νk}, where λk is the minimum eigenvalue of AᵀA no smaller than λ, i.e. λk ≥ λ > λk+1. 
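For reference, P_λ (and hence exact PCP) can be formed directly from a full eigendecomposition at O(d³) cost; the sketch below is our own dense baseline, useful for validating approximate solvers, and is emphatically not the paper's nearly linear time method.

```python
import numpy as np

def exact_pcp(A, v, lam):
    """Dense O(d^3) reference for P_lambda v via a full eigendecomposition.

    A correctness baseline only -- not the fast algorithm developed in the paper.
    """
    evals, evecs = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    Vk = evecs[:, evals >= lam]              # eigenvectors with lambda_i >= lam
    return Vk @ (Vk.T @ v)                   # P_lambda v = V_k V_k^T v

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 8)) / 10.0
v = rng.standard_normal(8)
ev = np.linalg.eigvalsh(A.T @ A)             # ascending eigenvalues
lam = (ev[-5] + ev[-4]) / 2                  # threshold between 4th and 5th largest
p = exact_pcp(A, v, lam)                     # projection onto the top-4 eigenspace
```

Since P_λ is an orthogonal projection, applying `exact_pcp` twice leaves the result unchanged, which makes a convenient self-check.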
Unless specified otherwise, ‖·‖ is the standard ℓ2-norm of a vector or matrix.\n\nGiven γ ∈ (0, 1), the goal of a PCP algorithm is to project any given vector v = Σ_{i∈[d]} αᵢνᵢ in the desired way: mapping each eigenvector νᵢ of AᵀA with eigenvalue λᵢ in [λ(1+γ), ∞) to itself, each eigenvector νᵢ with eigenvalue λᵢ in [0, λ(1−γ)] to 0, and any eigenvector νᵢ with eigenvalue λᵢ in between the gap to anywhere between 0 and νᵢ. Formally, we define PCP as follows.\nDefinition 1 (Principal Component Projection). The principal component projection (PCP) of v ∈ ℝ^d at threshold λ is v*_λ = P_λ v. Given threshold λ and eigengap γ, an algorithm APCP(v, ε, δ) is an ε-approximate PCP algorithm if, with probability 1 − δ, its output satisfies the following:\n\n• ‖P_{(1+γ)λ}(APCP(v) − v)‖ ≤ ε‖v‖;\n• ‖(I − P_{(1−γ)λ})APCP(v)‖ ≤ ε‖v‖; and\n• ‖(P_{(1+γ)λ} − P_{(1−γ)λ})(APCP(v) − v)‖ ≤ ‖(P_{(1+γ)λ} − P_{(1−γ)λ})v‖ + ε‖v‖.  (1)\n\nThe goal of a PCR problem is to solve regression restricted to the particular eigenspace we project onto in PCP. The resulting solution should have no correlation with eigenvectors νᵢ corresponding to λᵢ ≤ λ(1 − γ), while being accurate for νᵢ corresponding to eigenvalues λᵢ ≥ λ(1 + γ). Also, it should not have too large a correlation with νᵢ corresponding to eigenvalues in (λ(1−γ), λ(1+γ)). Formally, we define the PCR problem as follows.\nDefinition 2 (Principal Component Regression). 
The principal component regression (PCR) of an arbitrary vector b ∈ ℝ^n at threshold λ is x*_λ = argmin_{x∈ℝ^d} ‖AP_λx − b‖. Given threshold λ and eigengap γ, an algorithm APCR(b, ε, δ) is an ε-approximate PCR algorithm if, with probability 1 − δ, its output satisfies the following:\n\n‖(I − P_{(1−γ)λ})APCR(b, ε)‖ ≤ ε‖b‖ and ‖A·APCR(b, ε) − b‖ ≤ ‖Ax*_{(1+γ)λ} − b‖ + ε‖b‖.  (2)\n\nWe reduce PCP and PCR to solving squared linear systems. The solvers we develop for the squared regression problem defined below are, we believe, of independent interest.\nDefinition 3 (Squared Ridge Regression Solver). Given c ∈ [0, λ1] and v ∈ ℝ^d, we consider a squared ridge regression problem whose exact solution is x* = ((AᵀA − cI)² + μ²I)⁻¹v. We call an algorithm RidgeSquare(A, c, μ², v, ε, δ) an ε-approximate squared ridge regression solver if, with probability 1 − δ, it returns a solution x̃ satisfying ‖x̃ − x*‖ ≤ ε‖v‖.\n\nUsing a similar idea of rational polynomial approximation, we also reduce the problem of applying M^{1/2} to a vector, for an arbitrarily given PSD matrix M, to solving PSD linear systems approximately.\nDefinition 4 (Square-root Computation). Given a PSD matrix M ∈ ℝ^{n×n} such that μI ⪯ M ⪯ λI and v ∈ ℝ^n, an algorithm SquareRoot(M, v, ε, δ) is an ε-approximate square-root solver if, with probability 1 − δ, it returns a solution x satisfying ‖x − M^{1/2}v‖ ≤ ε‖M^{1/2}v‖.\n\n1.3 Our Results\n\nHere we present the main results of our paper, all proved in Appendix D. 
For data matrix A ∈ ℝ^{n×d}, our running times are presented in terms of the following quantities: input sparsity nnz(A) := the number of nonzero entries in A; Frobenius norm ‖A‖²_F := Tr(AᵀA); stable rank sr(A) := ‖A‖²_F/‖A‖²₂ = ‖A‖²_F/λ1; and condition number of the top eigenspace κ := λ1/λ. When presenting running times we use Õ to hide polylogarithmic factors in the input parameters λ1, γ, ‖v‖, ‖b‖, the error rate ε, and the success probability δ.\nFor A ∈ ℝ^{n×d} (n ≥ d), v ∈ ℝ^d, b ∈ ℝ^n, without loss of generality we assume λ1 ∈ [1/2, 1]¹. Given threshold λ ∈ (0, λ1) and eigengap γ ∈ (0, 2/3], the main results of this paper are the following new running times for solving these problems.\nTheorem 1 (Principal Component Projection). For any ε ∈ (0, 1), there is an ε-approximate PCP algorithm (see Definition 1) ISPCP(A, v, λ, γ, ε, δ), specified in Algorithm 5, with runtime\n\nÕ(nnz(A) + √(nnz(A)·d·sr(A))·κ/γ).\n\nTheorem 2 (Principal Component Regression). For any ε ∈ (0, 1), there is an ε-approximate PCR algorithm (see Definition 2) ISPCR(A, b, λ, γ, ε, δ), specified in Algorithm 6, with runtime\n\nÕ(nnz(A) + √(nnz(A)·d·sr(A))·κ/γ).\n\nWe achieve these results by introducing a technique we call asymmetric SVRG to solve squared systems [(AᵀA − cI)² + μ²I]x = v with c ∈ [0, λ1]. The resulting algorithm is closely related to the SVRG algorithm for monotone operators in Palaniappan and Bach [13], but involves a more fine-grained error analysis. 
This analysis, coupled with approximate proximal point [14] or Catalyst [15], yields the following result (see Section 3 for more details).\n\n¹This can be achieved by computing a constant-factor overestimate λ̃1 of AᵀA's top eigenvalue λ1 through the power method in Õ(nnz(A)) time, and considering A ← A/√λ̃1, λ ← λ/λ̃1, b ← b/√λ̃1 instead.\n\nTheorem 3 (Squared Solver). For any ε ∈ (0, 1), there is an ε-approximate squared ridge regression solver (see Definition 3) using AsySVRG(M, v̂, z0, ε‖v‖, δ) that runs in time\n\nÕ(nnz(A) + √(nnz(A)·d·sr(A))·λ1/μ).\n\nWhen the eigenvalues of AᵀA − cI are bounded away from 0, such a solver can be utilized to solve non-PSD linear systems of the form (AᵀA − cI)x = v through preconditioning, by considering the corresponding problem (AᵀA − cI)²x = (AᵀA − cI)v (see Corollary 1).\nCorollary 1. Given c ∈ [0, λ1] satisfying (AᵀA − cI)² ⪰ μ²I with μ > 0, a non-PSD system (AᵀA − cI)x = v, and an initial point x0, there is an algorithm that returns with probability 1 − δ a solution x̃ such that ‖x̃ − (AᵀA − cI)⁻¹v‖ ≤ ε‖v‖, within runtime Õ(nnz(A) + √(nnz(A)·d·sr(A))·λ1/μ).\n\nAnother byproduct of the rational approximation used in the paper is a nearly linear runtime for computing an ε-approximate square root of a PSD matrix M ⪰ 0 applied to an arbitrary vector.\nTheorem 4 (Square-root Computation). 
For any ε ∈ (0, 1), given μI ⪯ M ⪯ λI, there is an ε-approximate square-root solver (see Definition 4) SquareRoot(M, v, ε, δ) that runs in time\n\nÕ(nnz(M) + T),\n\nwhere T is the runtime for solving (M + κI)x = v for arbitrary v ∈ ℝ^n and κ ∈ [Ω̃(μ), Õ(λ)].\n\n1.4 Comparison to Previous Work\n\nPCP and PCR: The starting point for our paper is the work of [7], which provided the first nearly linear time algorithm for the problem with constant eigengap, by reducing the problem to finding the best polynomial approximation to the sign function and solving a sequence of regression problems. It was improved by [8] and then [17]. These algorithms were the first to achieve input-sparsity runtimes for eigenspaces of any non-trivial size, but with super-linear running times whenever the eigenvalue gap is super-constant. Departing from their polynomial approximation, we use a rational function as the approximant and reduce to different subproblems, obtaining new algorithms with better running time guarantees in some regimes. See Table 1 for a comparison between those results and ours.\n\nTable 1: Comparison with previous PCP/PCR runtimes. (Notation as in Section 1.3.)\n\nAlgorithm | Runtime\nFMMS16 [7] | Õ((1/γ²)(nnz(A) + d·sr(A)·κ))\nAL17 [8], MMS18 [17] | Õ((1/γ)(nnz(A) + d·sr(A)·κ))\nTheorems 1 and 2 | Õ(nnz(A) + √(nnz(A)·d·sr(A))·κ/γ)\n\nAsymmetric SVRG and Iterative Methods for Solving Linear Systems: Variance-reduced iterative methods (e.g. SVRG [12]) are a powerful tool for improving the convergence of stochastic methods. Prior work has used SVRG to develop primal-dual algorithms for solving saddle-point problems and extended it to monotone operators [13]. 
Our asymmetric SVRG solver can be viewed as an instance of their algorithm. We obtain an improved running time analysis by performing a more fine-grained analysis exploiting problem structure. In particular, we provide Table 2, comparing the effectiveness of our asymmetric SVRG solver with classic optimization methods for solving a non-PSD system (AᵀA − cI)x = v satisfying (AᵀA − cI)² ⪰ μ²I, μ > 0 (full discussion in Section 3 and Appendix C.4).\n\nTable 2: Comparison of runtimes for solving the non-PSD system (AᵀA − cI)x = v.\n\nAlgorithm | Runtime\nAGD applied to squared counterpart | Õ(nnz(A)·λ1/μ)\nSVRG applied to squared counterpart | Õ(nnz(A) + nnz(A)^{3/4}·d^{1/4}·sr(A)^{1/2}·λ1/μ)\nAsymmetric SVRG (Corollary 1) | Õ(nnz(A) + √(nnz(A)·d·sr(A))·λ1/μ)\n\nFast Matrix Multiplication: One can also use fast matrix multiplication (FMM) to possibly speed up all runtimes for PCA, PCR, and PCP, mainly by computing AᵀA in O(nd^{ω−1}) time and an SVD of this matrix in an additional O(d^ω) time [18], where ω < 2.379 [19] is the matrix multiplication constant. Given the well-known practicality concerns of methods using fast matrix multiplication, we focus much of our comparison on methods that do not use FMM.\n\n1.5 Paper Organization\n\nThe remainder of the paper is organized as follows. In Section 2, we reduce the PCP problem² to matrix sign approximation and study the properties of the Zolotarev rational function used in the approximation. In Section 3, we develop the asymmetric and squared linear system solvers using variance reduction and prove the theoretical guarantees of Theorem 3 and, correspondingly, Corollary 1. 
In Section 4, we conduct experiments and compare with previous methods to show the efficacy of the proposed algorithms. We conclude the paper in Section 5.\n\n2 PCP through Matrix Sign Approximation\n\nHere we provide our reduction from PCP to sign function approximation. We consider the rational approximation r(x) found by Zolotarev [10] and study its properties for efficient (Theorem 5) and stable (Lemma 5) algorithm design, reducing the problem to solving squared ridge regressions. Throughout the section, we denote the sign function by sgn(x) : ℝ → ℝ, where sgn(x) = 1 whenever x > 0, sgn(x) = −1 whenever x < 0, and sgn(0) = 0. Let P_k := {a_k x^k + ··· + a_1 x + a_0 | a_k ≠ 0} denote the class of degree-k polynomials, and let R_{m,n} := {r_{m,n} | r_{m,n} = p_m/q_n, p_m ∈ P_m, q_n ∈ P_n} denote the class of (m, n)-degree (also referred to as max{m, n}-degree) rational functions.\nFor the PCP problem (see Definition 1), we need an efficient algorithm that approximately applies P_λ to any given vector v ∈ ℝ^d. Consider the shifted matrix AᵀA − λI, whose eigenvalues lie in [−1, 1] with λ mapping to 0. Previous work [7, 8] has shown that solving PCP can be reduced to finding an f(x) that approximates the sign function sgn(x) on [−1, 1], formally through the following reduction.\nLemma 1 (Reduction: from PCP to Matrix Sign Approximation). 
Given a function f(x) that 2ε-approximates sgn(x), i.e.\n\n|f(x) − sgn(x)| ≤ 2ε, ∀|x| ∈ [λγ, 1] and |f(x)| ≤ 1, ∀x ∈ [−1, 1],\n\nthen ṽ = ½(f(AᵀA − λI) + I)v is an ε-approximate PCP solution satisfying (1).\n\nHowever, instead of approximating sgn(x) with polynomials as in previous work [7, 8], where the optimal degree for achieving the condition |f(x) − sgn(x)| ≤ 2ε, ∀|x| ∈ [γ, 1] is proved to be Θ̃(1/γ) in [9], we use the Zolotarev rational function for approximation. This brings the degree down to Õ(log(1/λγ)), leading to the nearly input-sparsity runtime improvement of this paper.\nFormally, Zolotarev rationals are defined as the optimal solution r^γ_k(x) = x·p(x²)/q(x²) ∈ R_{2k+1,2k} of the optimization problem\n\nmax_{p,q∈P_k} min_{γ≤x≤1} x·p(x²)/q(x²)  s.t.  x·p(x²)/q(x²) ≤ 1, ∀x ∈ [0, 1].  (3)\n\nZolotarev [10] showed this optimization problem is (up to scaling) equivalent to solving\n\nmin_{r∈R_{2k+1,2k}} max_{|x|∈[γ,1]} |sgn(x) − r(x)|.  (4)\n\nFurther, the analytic formula of r^γ_k [10, 11] is given by\n\nr^γ_k(x) = Cx·∏_{i∈[k]} (x² + c_{2i})/(x² + c_{2i−1})  with  c_i := γ²·sn²(iK′/(2k+1); γ′)/cn²(iK′/(2k+1); γ′), i ∈ [2k],  (5)\n\nwhere C is the rescaling parameter chosen to ensure 1 − r^γ_k(γ) = −(1 − r^γ_k(1)). Note that all coefficients depend on the degree k and the range parameter γ. 
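The coefficients in (5) are Jacobi elliptic function values and are easy to evaluate numerically. The sketch below assumes the standard Zolotarev conventions — γ′ = √(1 − γ²) and K′ the complete elliptic integral of the first kind at modulus γ′ (the paper defers its exact formulas to Appendix B.1, so these conventions are our assumption):

```python
import numpy as np
from scipy.special import ellipj, ellipk

def zolotarev(k, gamma):
    """Coefficients c_1..c_2k and scaling C for r(x) = C x prod (x^2+c_2i)/(x^2+c_{2i-1}).

    Assumed conventions: gamma' = sqrt(1 - gamma^2), K' = K(gamma'); scipy's
    elliptic routines take the parameter m = modulus^2.
    """
    m = 1.0 - gamma**2                     # parameter of the complementary modulus
    Kp = ellipk(m)
    u = np.arange(1, 2 * k + 1) * Kp / (2 * k + 1)
    sn, cn, _, _ = ellipj(u, m)
    c = gamma**2 * sn**2 / cn**2           # nondecreasing: c_1 <= ... <= c_2k

    def unscaled(x):
        num = np.prod([x**2 + c[2 * i + 1] for i in range(k)], axis=0)  # c_{2i}
        den = np.prod([x**2 + c[2 * i] for i in range(k)], axis=0)      # c_{2i-1}
        return x * num / den

    # Rescale so the error equioscillates at the endpoints: 1-r(gamma) = -(1-r(1)).
    C = 2.0 / (unscaled(gamma) + unscaled(1.0))
    return c, C, lambda x: C * unscaled(x)

c, C, r = zolotarev(k=8, gamma=0.05)
xs = np.linspace(0.05, 1.0, 2001)
print(np.max(np.abs(r(xs) - 1.0)))         # error decays exponentially in k
```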
The explicit formulas for c_i, K′, γ′ are shown in Appendix B.1.\n\n²We refer the reader to Appendix D.2 for the known reduction from PCR to PCP.\n\n[Figure 1: approximation errors of the polynomial, Chebyshev, and Zolotarev rational approximations to sgn(x), all of the same degree 21, for (a) γ = 0.1, (b) γ = 0.05, (c) γ = 0.01.]\n\nThis rational function approximates sgn(x) on the range |x| ∈ [γ, 1] with error decaying exponentially in the degree, as formally characterized by the following theorem.\nTheorem 5 (Rational Approximation Error). For any given ε ∈ (0, 1), when k ≥ Ω(log(1/ε)log(1/γ)), it holds that max_{|x|∈[γ,1]} |sgn(x) − r^γ_k(x)| ≤ 2ε.\nAs a quick illustration, Fig. 1 compares the approximation errors of the Zolotarev rational function, the polynomial used in [7], and the Chebyshev polynomial used in [8], all of the same degree.\nTreating r^{λγ}_k with k ≥ Ω(log(1/ε)log(1/λγ)) as the desired f in Lemma 1, it suffices to compute\n\nr^{λγ}_k(AᵀA − λI)v = C(AᵀA − λI)·∏_{i=1}^{k} ((AᵀA − λI)² + c_{2i}I)((AᵀA − λI)² + c_{2i−1}I)⁻¹ v.\n\nTo compute this formula approximately, we need to solve squared linear systems of the form ((AᵀA − λI)² + c_{2i−1}I)x = v, whose hardness is determined by the size of c_{2i−1} (> 0): the larger c_{2i−1} is, the more positive definite (PD) the system becomes, and the faster we can solve it. The following lemma shows that the r^{λγ}_k we need to use has coefficients c_i = Ω̃(λ²γ²) when k = Θ(log(1/ε)log(1/λγ)).\nLemma 2 (Bounding c_i). 
For r^{λγ}_k, the coefficients c_i are nondecreasing in i, ∀i ∈ [2k]. Also, there exist constants 0 < β2, β3 < ∞ such that c1 ≥ β2·λ²γ²/k² and c_{2k} ≤ β3·k².\nGiven a squared ridge regression solver RidgeSquare(A, λ, c_{2i−1}, v, ε, δ) (see Section 3), we can get an ε-approximate PCP algorithm ISPCP(A, v, λ, γ, ε, δ), shown in Algorithm 5, with its theoretical guarantee in Theorem 1. Using the known reduction [7, 8] from PCR to PCP, this also gives the results in Theorem 2. We refer readers to Appendix D for the parameter choices and corresponding proofs.\n\nAlgorithm 1: ISPCP(A, v, λ, γ, ε, δ)\nInput: data matrix A, projecting vector v, threshold λ, eigengap γ, accuracy ε, probability δ.\nParameters: degree k (Theorem 5), coefficients {c_i}_{i=1}^{2k} and C (Eqn. (5)), accuracy ε1 (Appendix D).\nOutput: a vector ṽ that solves PCP ε-approximately.\n1: ṽ ← v\n2: for i ← 1 to k do\n3:   ṽ ← (AᵀA − λI)²ṽ + c_{2i}ṽ\n4:   ṽ ← RidgeSquare(A, λ, c_{2i−1}, ṽ, ε1, δ/k)\n5: ṽ ← C(AᵀA − λI)ṽ\n6: ṽ ← ½(v + ṽ)\n\n3 SVRG for Solving Asymmetric / Squared Systems\n\nIn this section, we reduce solving squared systems to solving asymmetric systems (Lemma 3) and develop SVRG-type solvers (Algorithm 2) for them. We study their theoretical guarantees in both the general case (Theorems 6 and 7) and our specific case (Theorem 8). We defer all proofs to Appendix C.\nIn Section 2, we obtained a low-degree rational function approximation at the cost of more complicated subproblems to solve. 
Indeed, instead of solving ridge-regression-type subproblems (AᵀA + λI)x = v as in previous work [7, 8], we need to solve squared systems of the following form:\n\n[(AᵀA − cI)² + μ²I]x = v, with A ∈ ℝ^{n×d}, v ∈ ℝ^d, μ > 0, c ∈ [0, λ1].  (6)\n\nWhen the squared system is ill-conditioned (i.e. when λ1/μ ≫ 1), previous state-of-the-art methods can have fairly large running times. As shown in Section 1.4 and proved in Appendix C.4, Accelerated Gradient Descent [20] applied to solving the system ((AᵀA − cI)² + μ²I)x = v gives a runtime of Õ(nnz(A)·λ1/μ), which is not nearly linear in nnz(A). Applying the standard SVRG [12] technique to the same system leads to a runtime of Õ(nnz(A) + d·sr²(A)·λ1⁴/μ⁴), where the sr²(A)·λ1⁴/μ⁴ term comes from the high variance of sampling aᵢaᵢᵀaⱼaⱼᵀ from (AᵀA)² independently.\nThus, rather than working with the squared system directly, we propose to consider an (equivalent) larger-dimensional space in which we develop estimators with lower variance, at the cost of asymmetry, formalized in the reduction below.\nLemma 3 (Reducing Squared Systems to Asymmetric Systems). 
De\ufb01ne z\u2217 as the solution to the\nfollowing asymmetric linear system:\n\nj from (A(cid:62)A)2 independently.\n\n1/\u00b54), where sr2(A)\u03bb4\n\ni aja(cid:62)\n\nI\n\n\u2212 1\n\u00b5 (A(cid:62)A\u2212cI)\n\n(cid:19)\n\n(cid:18) 0\n\n(cid:19)\n\n1\n\n\u00b5 (A(cid:62)A\u2212cI)\n\nIf we are given a solver that returns with probability 1\u2212\u03b4 a solution(cid:101)z satisfying (cid:107)(cid:101)z\u2212z\u2217(cid:107)2\u2264 \u0001 within\n\nruntime T (\u0001,\u03b4), then we can use it to get an \u0001-approximate squared ridge regression solver (see\nDe\ufb01nition 3) with runtime T (\u0001(cid:107)v(cid:107),\u03b4) .\n3.1 SVRG for General Asymmetric Linear System Solving\n\nv/\u00b52\n\nz =\n\nI\n\n.\n\n(7)\n\n(cid:18)\n\nThe general goal for this section is to solve the general asymmetric system with PSD symmetric part,\nformally de\ufb01ned as:\n\nsolve Mz = \u02c6v with \u02c6v\u2208Ra, M\u2208Ra\u00d7a, M =\n\nMi, (cid:107)Mi(cid:107)\u2264 Li,\n\n(M(cid:62) +M).(cid:23) \u00b5I\n\n(8)\n\n1\n2\n\n(cid:88)\n\ni\u2208[n]\n\nFor simplicity, we denote Tmv(Mi) as the cost of the matrix-vector product of Mix for any x and\nT = maxi\u2208[n]Tmv(Mi). 
All results in this subsection can be viewed as variants of Palaniappan and Bach [13] and can be recovered by their slightly different algorithm, which uses proximal methods.\nWe use the idea of variance-reduced sampling [12]: at step t, we sample i_t ∈ [n] with probability p_{i_t} = L_{i_t}/(Σ_{i∈[n]} L_i) independently and conduct the update\n\nz_{t+1} := z_t − (η/p_{i_t})·(M_{i_t} z_t − M_{i_t} z_0 + p_{i_t}(M z_0 − v̂)).  (9)\n\nAlgorithm 2: AsySVRG(M, v̂, z0, ε, δ)\nInput: M ∈ ℝ^{a×a}, v̂ ∈ ℝ^a, z0 ∈ ℝ^a, desired accuracy ε, probability parameter δ.\nOutput: z_0^{(Q+1)} ∈ ℝ^a.\n1: Set η = μ/(4(Σ_{i∈[n]} L_i)²), T = ⌈(Σ_{i∈[n]} L_i)²/μ²⌉, and p_i = L_i/(Σ_{i∈[n]} L_i), i ∈ [n], unless specified otherwise\n2: for q = 1 to Q = Θ(log(1/εδ)) do\n3:   for t ← 1 to T do\n4:     Sample i_t ∼ [n] according to {p_i}_{i=1}^{n}\n5:     z_t^{(q)} ← z_{t−1}^{(q)} − (η/p_{i_t})·(M_{i_t} z_{t−1}^{(q)} − M_{i_t} z_0^{(q)} + p_{i_t}(M z_0^{(q)} − v̂))\n6:   z_0^{(q+1)} ← (1/T)·Σ_{t=1}^{T} z_t^{(q)}\n\nThe full pseudocode is shown in Algorithm 2, which has the following theoretical guarantee.\nTheorem 6 (General Asymmetric SVRG Solver). For the asymmetric system Mz = v̂ in (8), there is a solver AsySVRG(M, v̂, z0, ε, δ), specified in Algorithm 2, that returns with probability ≥ 1 − δ a vector z̃ such that ‖z̃ − M⁻¹v̂‖ ≤ ε, within runtime Õ(nnz(M) + T·(Σ_{i∈[n]} L_i)²/μ²).\nUsing approximate proximal point [14] or Catalyst [15], when nnz(M) ≤ T·(Σ_{i∈[n]} L_i)²/μ², we can further improve this running time to the following:\nTheorem 7 (Accelerated Asymmetric SVRG Solver). 
Under (8), when nnz(M) ≤ T·(Σ_{i∈[n]} L_i)²/μ², the algorithm can be further accelerated to return, with probability ≥ 1 − δ, an approximate solution z̃ satisfying ‖z̃ − M⁻¹v̂‖ ≤ ε, within runtime Õ(√(nnz(M)·T)·(Σ_{i∈[n]} L_i)/μ).\n\n3.2 Asymmetric Linear System Solving for the Squared System Solver\n\nFrom Lemma 3, the asymmetric linear system we actually need to solve is Mz = v̂, where\n\nM = [ I  (1/μ)(AᵀA − cI) ; −(1/μ)(AᵀA − cI)  I ]  and  v̂ = (0, v/μ²)ᵀ.  (10)\n\nThrough the more fine-grained analysis shown in Appendix C.2, AsySVRG(M, v̂, z0, ε, δ) with particular choices of M_i, {p_i}_{i∈[n]}, η, T has a better runtime guarantee and can be accelerated using a similar idea as in the general case. This is stated formally in the following theorem.\nTheorem 8 (Particular Asymmetric SVRG Solver). 
Set p_i = ‖a_i‖²/‖A‖²_F, η = μ²/(2λ1‖A‖²_F), T = ⌈2‖A‖²_F·λ1/μ²⌉, and\n\nM_i := [ (‖a_i‖²/‖A‖²_F)·I  (1/μ)(a_i a_iᵀ − c·(‖a_i‖²/‖A‖²_F)·I) ; −(1/μ)(a_i a_iᵀ − c·(‖a_i‖²/‖A‖²_F)·I)  (‖a_i‖²/‖A‖²_F)·I ], ∀i ∈ [n].\n\nThen AsySVRG(M, v̂, z0, ε, δ), as specified in Algorithm 2, returns with probability ≥ 1 − δ an ε-approximate solution z̃ satisfying ‖z̃ − M⁻¹v̂‖ ≤ ε within runtime Õ(nnz(A) + d·sr(A)·λ1²/μ²). An accelerated variant improves the runtime to Õ(λ1·√(nnz(A)·d·sr(A))/μ) when nnz(A) ≤ d·sr(A)·λ1²/μ².\nPicking c = λ and μ² = c_{2i−1} = Ω̃(λ²γ²) (see Lemma 5) in (10), we see that under minimal transformations AsySVRG(M, v̂, z0, ε‖v‖, δ) is equivalent to an ε-approximate squared solver RidgeSquare(A, λ, c_{2i−1}, v, ε, δ), with worst-case runtime Õ(nnz(A) + d·sr(A)·κ²/γ²). (See Appendix C.3 and Algorithm 4 for details.)\n\n4 Numerical Experiments\n\nWe evaluate our proposed algorithms following the settings in Frostig et al. [7] and Allen-Zhu and Li [8]. As the runtimes in Theorems 1 and 2 show improvement over the ones in previous work [7, 8] when nnz(A)/γ ≫ d²κ²/γ², we pick the data matrix A such that κ = Θ(1) and n ≫ d/γ to corroborate the theoretical results.\nSince experiments in several papers [7, 8] have studied the reduction from PCR to PCP (see Lemma 13), here we only show results regarding solving PCP problems. 
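The synthetic eigenvalue layouts used in this section (described below, with an "away-from-λ" and a "close-to-λ" region) can be generated along the following lines; the precise sampling of the regions is our reading of the setup, so treat the boundaries as assumptions:

```python
import numpy as np

def synthetic_spectrum(d, lam, gamma, case, rng):
    """Eigenvalues for the three synthetic cases (our reading of the setup).

    away-from-lambda region: [0, lam(1-gamma)] U [lam(1+gamma), 1]
    close-to-lambda region:  lam(1-gamma)*[0.9, 1] U lam(1+gamma)*[1, 1.1]
    """
    def away(m):
        lo = rng.uniform(0.0, lam * (1 - gamma), m)
        hi = rng.uniform(lam * (1 + gamma), 1.0, m)
        return np.where(rng.integers(0, 2, m).astype(bool), lo, hi)

    def close(m):
        below = rng.uniform(0.9 * lam * (1 - gamma), lam * (1 - gamma), m)
        above = rng.uniform(lam * (1 + gamma), 1.1 * lam * (1 + gamma), m)
        return np.where(rng.integers(0, 2, m).astype(bool), below, above)

    if case == "eigengap-uniform":
        ev = away(d)
    elif case == "eigengap-skewed":
        ev = np.concatenate([away(d // 2), close(d - d // 2)])
    else:  # "no-eigengap-skewed"
        ev = np.concatenate([rng.uniform(0.0, 1.0, d // 2), close(d - d // 2)])
    return np.sort(ev)[::-1]                 # descending eigenvalues

rng = np.random.default_rng(0)
ev = synthetic_spectrum(50, lam=0.5, gamma=0.05, case="eigengap-uniform", rng=rng)
# the first two cases leave the open interval (lam(1-gamma), lam(1+gamma)) empty
```

A data matrix can then be assembled as A = U·diag(√ev)·Vᵀ with random orthonormal U, V.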
In all figures below, the y-axis denotes the relative error measured as ‖APCP(v) − P_λ v‖/‖P_λ v‖ and the x-axis denotes the total number of vector-vector products needed to achieve the corresponding accuracy.

Datasets. Similar to previous work [7, 8], we set λ = 0.5, n = 2000, d = 50 and form a matrix A = UΛ^{1/2}V ∈ R^{2000×50}. Here, U and V are random orthonormal matrices, and Λ^{1/2} contains randomly chosen singular values σ_i = √λ_i. Referring to [0, λ(1−γ)] ∪ [λ(1+γ), 1] as the away-from-λ region, and λ(1−γ)·[0.9, 1] ∪ λ(1+γ)·[1, 1.1] as the close-to-λ region, we generate the λ_i differently to simulate the following three cases:
i. Eigengap-Uniform Case: generate all λ_i uniformly in the away-from-λ region.
ii. Eigengap-Skewed Case: generate half of the λ_i uniformly in the away-from-λ region and half uniformly in the close-to-λ region.
iii. No-Eigengap-Skewed Case: uniformly generate half in [0, 1] and half in the close-to-λ region.

Algorithms. We implemented the following algorithms and compared them in the above settings:
1. polynomial: the PC-Proj algorithm in Frostig et al. [7].
2. chebyshev: the QuickPCP algorithm in Allen-Zhu and Li [8].
3. lanczos: the algorithm using the Lanczos method discussed in Section 8.1 of Musco et al. [17].
4. rational: the ISPCP algorithm (see Algorithm 5) proposed in our paper.
5. rlanczos: the algorithm using the rational Lanczos method [21] combined with ISPCP. (See Appendix E.1 for a more detailed discussion.)

Figure 2: Synthetic Data: n = 2000, d = 50, λ = 0.5, γ = 0.05. (a) Eigengap-Uniform Case; (b) Eigengap-Skewed Case; (c) No-Eigengap-Skewed Case.

Figure 3: Synthetic Data: Changing n, d = 50, λ = 0.5, γ = 0.05, No-Eigengap-Skewed Case. (a) n = 5000; (b) n = 10000; (c) n = 20000.

6.
slanczos: the algorithm using the Lanczos method [17] with the search space changed from the form f((x−λ)/(x+λ)) to f((x−λ)(x+λ)/((x−λ)² + γ(x+λ)²)) for approximation (f a polynomial, x ← AᵀA).

We remark that 1-3 are algorithms from previous work; 4 is an exact implementation of the ISPCP algorithm proposed in this paper; 5 and 6 are variants of ISPCP combined with the Lanczos method, both using the squared system solver. Algorithms 5 and 6 are explained in greater detail in Appendix E.
There are several observations from the experiments:
• For different eigenvalue distributions, our methods (4-6) generally outperform all existing methods (1-3) in most accuracy regimes in terms of the number of vector products, as shown in Fig. 2.
• In the no-eigengap case, all methods lose precision. This is due to the projection error from eigenvalues very close to the eigengap, which simply do not arise in the eigengap cases. Nevertheless, (6) is still the most accurate method with the least computational cost, as shown in Fig. 2.
• As n grows, (4) and (5) tend to enjoy similar performance, outperforming all other methods including (6), as shown in Fig. 3. This aligns with the theory that the runtimes of (4) and (5) are dominated by nnz(A) while the runtime of (6) is dominated by nnz(A)/√γ (see Theorem 12 for a theoretical analysis of slanczos), demonstrating the power of the nearly linear runtime of the proposed ISPCP.

5 Conclusion

In this paper we provided a new linear-algebraic primitive, asymmetric SVRG for solving squared ridge regression problems, and showed that it leads to nearly-linear-time algorithms for PCP and PCR. Beyond the direct running time improvements, this work shows that running time improvements can be achieved for fundamental linear-algebraic problems by leveraging stronger subroutines than standard ridge regression.
We hope the improvements we obtain for PCP, demonstrated both theoretically and empirically, are just the first instance of a more general approach to improving the running time for solving large-scale machine learning problems.

Acknowledgements This research was partially supported by NSF CAREER Award CCF-1844855 and a Stanford Graduate Fellowship. We would also like to thank the anonymous reviewers, who helped improve the completeness and readability of this paper by providing many helpful comments.

References

[1] Arnaz Malhi and Robert X Gao. PCA-based feature selection scheme for machine defect classification. IEEE Transactions on Instrumentation and Measurement, 53(6):1517–1525, 2004.

[2] Fengxi Song, Zhongwei Guo, and Dayong Mei. Feature selection using principal component analysis. In 2010 International Conference on System Science, Engineering Design and Manufacturing Informatization, volume 1, pages 27–30. IEEE, 2010.

[3] Cláudia Pascoal, M Rosario De Oliveira, Rui Valadas, Peter Filzmoser, Paulo Salvador, and António Pacheco. Robust feature selection and robust PCA for internet traffic anomaly detection. In 2012 Proceedings IEEE INFOCOM, pages 1755–1763. IEEE, 2012.

[4] Tomasz Niedoba.
Multi-parameter data visualization by means of principal component analysis (PCA) in qualitative evaluation of various coal types. Physicochemical Problems of Mineral Processing, 50, 2014.

[5] Tauno Metsalu and Jaak Vilo. ClustVis: a web tool for visualizing clustering of multivariate data using principal component analysis and heatmap. Nucleic Acids Research, 43(W1):W566–W570, 2015.

[6] Yihang Yin, Fengzheng Liu, Xiang Zhou, and Quanzhong Li. An efficient data compression model based on spatial clustering and principal component analysis in wireless sensor networks. Sensors, 15(8):19443–19465, 2015.

[7] Roy Frostig, Cameron Musco, Christopher Musco, and Aaron Sidford. Principal component projection without principal component analysis. In ICML, pages 2349–2357, 2016.

[8] Zeyuan Allen-Zhu and Yuanzhi Li. Faster principal component regression and stable matrix Chebyshev approximation. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 107–115. JMLR.org, 2017.

[9] Alexandre Eremenko and Peter Yuditskii. Uniform approximation of sgn x by polynomials and entire functions. Journal d'Analyse Mathématique, 101(1):313–324, 2007.

[10] E. I. Zolotarev. Application of elliptic functions to questions of functions deviating least and most from zero. Zap. Imp. Akad. Nauk. St. Petersburg, 30(5):1–59, 1877.

[11] Yuji Nakatsukasa and Roland W Freund. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev's functions. SIAM Review, 58(3):461–493, 2016.

[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[13] Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems.
In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.

[14] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540–2548, 2015.

[15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[16] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[17] Cameron Musco, Christopher Musco, and Aaron Sidford. Stability of the Lanczos method for matrix function approximation. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1605–1624. Society for Industrial and Applied Mathematics, 2018.

[18] Victor Y. Pan and Zhao Q. Chen. The complexity of the matrix eigenproblem. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, STOC '99, pages 507–516. ACM, 1999.

[19] Virginia Vassilevska Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Symposium on Theory of Computing, STOC 2012, pages 887–898, 2012.

[20] Yu. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.

[21] Kyle Gallivan, G Grimme, and Paul Van Dooren. A rational Lanczos algorithm for model reduction. Numerical Algorithms, 12(1):33–63, 1996.

[22] Shai Shalev-Shwartz and Tong Zhang.
Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.

[23] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194–8244, 2017.

[24] Yurii Nesterov and Sebastian U Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

[25] Naman Agarwal, Sham Kakade, Rahul Kidambi, Yin Tat Lee, Praneeth Netrapalli, and Aaron Sidford. Leverage score sampling for faster accelerated regression and ERM. arXiv preprint arXiv:1711.08426, 2017.

[26] Andrei Aleksandrovich Gonchar. Zolotarev problems connected with rational functions. Matematicheskii Sbornik, 120(4):640–654, 1969.

[27] Frank Olver, Daniel Lozier, Ronald Boisvert, and Charles Clark. NIST Handbook of Mathematical Functions. Cambridge University Press, 2010.

[28] Milton Abramowitz and Irene A Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, volume 55. Courier Corporation, 1965.

[29] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.

[30] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.