{"title": "Towards a Zero-One Law for Column Subset Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 6123, "page_last": 6134, "abstract": "There are a number of approximation algorithms for NP-hard versions of low rank approximation, such as finding a rank-$k$ matrix $B$ minimizing the sum of absolute values of differences to a given $n$-by-$n$ matrix $A$, $\\min_{\\textrm{rank-}k~B}\\|A-B\\|_1$, or more generally finding a rank-$k$ matrix $B$ which minimizes the sum of $p$-th powers of absolute values of differences, $\\min_{\\textrm{rank-}k~B}\\|A-B\\|_p^p$. Many of these algorithms are linear time columns subset selection algorithms, \nreturning a subset of $\\poly(k \\log n)$ columns whose cost is no more than a $\\poly(k)$ factor larger than the cost of the best rank-$k$ matrix. \nThe above error measures are special cases of the following general entrywise\nlow rank approximation problem: given an arbitrary function $g:\\mathbb{R} \\rightarrow \\mathbb{R}_{\\geq 0}$, find a rank-$k$ matrix $B$ which minimizes $\\|A-B\\|_g = \\sum_{i,j}g(A_{i,j}-B_{i,j})$. A natural question is which functions $g$ admit efficient approximation algorithms? Indeed, this is a central question of recent work studying generalized low rank models. In this work we give approximation algorithms for {\\it every} function $g$ which is approximately monotone and satisfies an approximate triangle inequality, and we show both of these conditions are necessary. Further, our algorithm is efficient if the function $g$ admits an efficient approximate regression algorithm. 
Our approximation algorithms handle functions which are not even scale-invariant, such as the Huber loss function, which we show have very different structural properties than $\ell_p$-norms, e.g., one can show the lack of scale-invariance causes any column subset selection algorithm to provably require a $\sqrt{\log n}$ factor larger number of columns than $\ell_p$-norms; nevertheless we design the first efficient column subset selection algorithms for such error measures.", "full_text": "Towards a Zero-One Law for Column Subset Selection

Zhao Song* (University of Washington, magic.linuxkde@gmail.com)
David P. Woodruff* (Carnegie Mellon University, dwoodruf@cs.cmu.edu)
Peilin Zhong* (Columbia University, pz2225@columbia.edu)

Abstract

There are a number of approximation algorithms for NP-hard versions of low rank approximation, such as finding a rank-$k$ matrix $B$ minimizing the sum of absolute values of differences to a given $n$-by-$n$ matrix $A$, $\min_{\textrm{rank-}k~B}\|A-B\|_1$, or more generally finding a rank-$k$ matrix $B$ which minimizes the sum of $p$-th powers of absolute values of differences, $\min_{\textrm{rank-}k~B}\|A-B\|_p^p$. Many of these algorithms are linear-time column subset selection algorithms, returning a subset of $\poly(k \log n)$ columns whose cost is no more than a $\poly(k)$ factor larger than the cost of the best rank-$k$ matrix. The above error measures are special cases of the following general entrywise low rank approximation problem: given an arbitrary function $g : \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$, find a rank-$k$ matrix $B$ which minimizes $\|A-B\|_g = \sum_{i,j} g(A_{i,j}-B_{i,j})$. A natural question is: which functions $g$ admit efficient approximation algorithms? Indeed, this is a central question of recent work studying generalized low rank models.
In this work we give approximation algorithms for every function $g$ which is approximately monotone and satisfies an approximate triangle inequality, and we show both of these conditions are necessary. Further, our algorithm is efficient if the function $g$ admits an efficient approximate regression algorithm. Our approximation algorithms handle functions which are not even scale-invariant, such as the Huber loss function, which we show have very different structural properties than $\ell_p$-norms, e.g., one can show the lack of scale-invariance causes any column subset selection algorithm to provably require a $\sqrt{\log n}$ factor larger number of columns than $\ell_p$-norms; nevertheless we design the first efficient column subset selection algorithms for such error measures.

1 Introduction

A well-studied problem in machine learning and numerical linear algebra, with applications to recommendation systems, text mining, and computer vision, is that of computing a low-rank approximation of a matrix. Such approximations reveal low-dimensional structure, provide a compact way of storing a matrix, and can quickly be applied to a vector.
A commonly used version of the problem is to compute a near optimal low-rank approximation with respect to the Frobenius norm.
That is, given an $n \times n$ input matrix $A$ and an accuracy parameter $\epsilon > 0$, output a rank-$k$ matrix $B$ with large probability so that $\|A-B\|_F^2 \leq (1+\epsilon)\|A-A_k\|_F^2$, where for a matrix $C$, $\|C\|_F^2 = \sum_{i,j} C_{i,j}^2$ is its squared Frobenius norm, and $A_k = \operatorname{argmin}_{\textrm{rank-}k~B}\|A-B\|_F$. $A_k$ can be computed exactly using the singular value decomposition (SVD), but this takes $O(n^3)$ time in practice and $n^{\omega}$ time in theory, where $\omega \approx 2.373$ is the exponent of matrix multiplication [1, 2, 3, 4]. Sárlos [5] showed how to achieve the above guarantee with constant probability in $\tilde{O}(\mathrm{nnz}(A) \cdot k/\epsilon) + n \cdot \poly(k/\epsilon)$ time, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. This was improved in [6, 7, 8, 9, 10] using sparse random projections to $O(\mathrm{nnz}(A)) + n \cdot \poly(k/\epsilon)$ time. Large sparse datasets in recommendation systems are common, such as the Bookcrossing ($100\mathrm{K} \times 300\mathrm{K}$ with $10^6$ observations) [11] and Yelp datasets ($40\mathrm{K} \times 10\mathrm{K}$ with $10^5$ observations) [12], and this is a substantial improvement over the SVD.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Robust Low Rank Approximation. To understand the role of the Frobenius norm in the algorithms above, we recall a standard motivation for this error measure. Suppose one has $n$ data points in a $k$-dimensional subspace of $\mathbb{R}^d$, where $k \ll d$. We can write these points as the rows of an $n \times d$ matrix $A^*$ which has rank $k$. The matrix $A^*$ is often called the ground truth matrix. In a number of settings, due to measurement noise or other kinds of noise, we only observe the matrix $A = A^* + \Delta$, where each entry of the noise matrix $\Delta \in \mathbb{R}^{n \times n}$ is an i.i.d. random variable from a certain mean-zero noise distribution $\mathcal{D}$.
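To make the noise model concrete, here is a minimal numpy sketch (our illustration, not from the paper) that generates $A = A^* + \Delta$ with Gaussian noise and recovers $A^*$ via the truncated SVD, the Frobenius-optimal rank-$k$ fit; the dimensions and noise level are illustrative assumptions.

```python
import numpy as np

def rank_k_svd(A, k):
    """Best rank-k approximation in Frobenius norm, via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
n, k = 200, 5
ground = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))  # A*, rank k
A = ground + 0.1 * rng.standard_normal((n, n))                      # A = A* + Delta
B = rank_k_svd(A, k)
# The Frobenius-optimal rank-k fit recovers A* far better than A itself does.
assert np.linalg.norm(B - ground) < 0.5 * np.linalg.norm(A - ground)
```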
One method for approximately recovering $A^*$ from $A$ is maximum likelihood estimation. Here one tries to find a matrix $B$ maximizing the log-likelihood $\max_{\textrm{rank-}k~B} \sum_{i,j} \log p(A_{i,j}-B_{i,j})$, where $p(\cdot)$ is the probability density function of the underlying noise distribution $\mathcal{D}$. For example, when the noise distribution is Gaussian with mean zero and variance $\sigma^2$, denoted by $N(0,\sigma^2)$, then the optimization problem is $\max_{\textrm{rank-}k~B} \sum_{i,j} \big( \log(1/\sqrt{2\pi\sigma^2}) - (A_{i,j}-B_{i,j})^2/(2\sigma^2) \big)$, which is equivalent to solving the Frobenius norm loss low rank approximation problem defined above.
The Frobenius norm loss, while having nice statistical properties for Gaussian noise, is well-known to be sensitive to outliers. Applying the same maximum likelihood framework above to other kinds of noise distributions results in minimizing other kinds of loss functions. In general, if the density function of the underlying noise $\mathcal{D}$ is $p(z) = c \cdot e^{-g(z)}$, where $c$ is a normalization constant, then the maximum likelihood estimation problem for this noise distribution becomes the following generalized entrywise loss low rank approximation problem: $\min_{\textrm{rank-}k~B} \sum_{i,j} g(A_{i,j}-B_{i,j}) = \min_{\textrm{rank-}k~B} \|A-B\|_g$, which is a central topic of recent work on generalized low-rank models [13]. For example, when the noise is Laplacian, the entrywise $\ell_1$ loss gives the maximum likelihood estimate, which is also robust to sparse outliers. A natural setting is when the noise is a mixture of small Gaussian noise and sparse outliers; this noise distribution is referred to as the Huber density. In this case the Huber loss function gives the maximum likelihood estimate [13], where the Huber function [14] is defined to be: $g(x) = x^2/(2\tau)$ if $|x| \leq \tau$, and $g(x) = |x| - \tau/2$ if $|x| > \tau$.
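As an illustration (ours, not the paper's), the Huber function above can be implemented directly; the example contrasts it with the squared loss on residuals containing one sparse outlier. The parameterization $g(x) = x^2/(2\tau)$ for $|x| \leq \tau$ and $|x| - \tau/2$ otherwise is continuous at $|x| = \tau$.

```python
def huber(x, tau=1.0):
    """Huber loss: quadratic near the origin, linear in the tails.
    Continuous at |x| = tau since tau^2/(2*tau) = tau - tau/2."""
    ax = abs(x)
    return ax * ax / (2 * tau) if ax <= tau else ax - tau / 2

residuals = [0.1, -0.2, 0.05, 100.0]  # small noise plus one sparse outlier
sq_cost = sum(r * r for r in residuals)
hub_cost = sum(huber(r) for r in residuals)
# The outlier dominates the squared loss but enters the Huber loss only linearly.
assert hub_cost < sq_cost / 50
```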
Another nice property of the Huber error measure is that it is differentiable everywhere, unlike the $\ell_1$-norm, yet still enjoys the robustness properties as one moves away from the origin, making it less sensitive to outliers than the $\ell_2$-norm. There are many other kinds of loss functions, known as M-estimators [15], which are widely used as loss functions in robust statistics [16].
Although several specific cases have been studied, such as entrywise $\ell_p$ loss [17, 18, 19, 20, 21], weighted entrywise $\ell_2$ loss [22], and cascaded $\ell_p(\ell_2)$ loss [23, 24], the landscape of general entrywise loss functions remains elusive. There are no results known for any loss function which is not scale-invariant, much less any kind of characterization of which loss functions admit efficient algorithms. This is despite the importance of these loss functions; we refer the reader to [13] for a survey of generalized low rank models. This motivates the main question in our work:

Question 1.1 (General Loss Functions). For a given approximation factor $\alpha > 1$, which functions $g$ allow for efficient low-rank approximation algorithms? Formally, given an $n \times d$ matrix $A$, can we find a rank-$k$ matrix $B$ for which $\|A-B\|_g \leq \alpha \min_{\textrm{rank-}k~B'} \|A-B'\|_g$, where for a matrix $C$, $\|C\|_g = \sum_{i \in [n], j \in [d]} g(C_{i,j})$? What if we also allow $B$ to have rank $\poly(k \log n)$?

For Question 1.1, one has $g(x) = |x|^p$ for $p$-norms, and note the Huber loss function also fits into this framework. Allowing $B$ to have slightly larger rank than $k$, namely $\poly(k \log n)$, is often sufficient for applications, as it still allows for the space savings and computational gains outlined above. These are referred to as bicriteria approximations and are the focus of our work.
Notation.
Before we present our results, let us briefly introduce the notation. For $n \in \mathbb{Z}_{\geq 0}$, let $[n]$ denote the set $\{1, 2, \cdots, n\}$. Let $A \in \mathbb{R}^{n \times m}$. $A_i$ and $A^j$ denote the $i$th column and the $j$th row of $A$, respectively. Let $P \subseteq [m]$, $Q \subseteq [n]$. $A_P$ denotes the matrix composed of the columns of $A$ with column indices in $P$. Similarly, $A^Q$ denotes the matrix composed of the rows of $A$ with row indices in $Q$. Let $S$ be a set and $s \in \mathbb{Z}_{\geq 0}$. We use $\binom{S}{s}$ to denote the set of all size-$s$ subsets of $S$.

Table 1: Example functions satisfying both structural properties.

  Function                  g(x)                                                                        ati_{g,t}    mon_g
  Huber                     x^2/2 if |x| <= tau;  tau(|x| - tau/2) if |x| > tau                         O(t)         1
  l_p (p >= 1)              |x|^p / p                                                                   O(t^{p-1})   1
  l_1 - l_2                 2(sqrt(1 + x^2/2) - 1)                                                      O(t)         1
  Geman-McClure             x^2 / (2 + 2x^2)                                                            O(t)         1
  "Fair"                    tau^2 (|x|/tau - log(1 + |x|/tau))                                          O(t)         1
  Tukey                     (tau^2/6)(1 - (1 - (x/tau)^2)^3) if |x| <= tau;  tau^2/6 if |x| > tau       O(t)         1
  Cauchy                    (tau^2/2) log(1 + (x/tau)^2)                                                O(t)         1
  Quantile (tau in (0,1))   tau*x if x >= 0;  (tau - 1)x if x < 0                                       O(t)         max(tau/(1-tau), (1-tau)/tau)

1.1 Our Results

We study low rank approximation with respect to general error measures. Our algorithm is a column subset selection algorithm, returning a small subset of columns which span a good low rank approximation. Column subset selection has the benefit of preserving sparsity and interpretability, as described above.
We give a "zero-one law" for such column subset selection problems. We describe two properties of the function $g$ that we need to obtain our low rank approximation algorithms.
We also show that if we are missing any one of the properties, then we can find an example function $g$ for which there is no good column subset selection (see Appendix B).
Since we obtain column subset selection algorithms for a wide class of functions, our algorithms must necessarily be bicriteria and have approximation factor at least $\poly(k)$. Indeed, a special case of our class of functions includes entrywise $\ell_1$-low rank approximation, for which it was shown in Theorem G.27 of [18] that any subset of $\poly(k)$ columns incurs an approximation error of at least $k^{\Omega(1)}$. We also show that for entrywise Huber-low rank approximation, already for $k = 1$, $\sqrt{\log n}$ columns are needed to obtain any constant factor approximation, thus showing that for some of the functions we consider, a dependence on $n$ in our column subset size is necessary.
We note that previously, for almost all such functions, it was not known how to obtain any non-trivial approximation factor with any sublinear number of columns.

1.1.1 A Zero-One Law

We first state three general properties, the first two of which are structural properties and are necessary and sufficient for obtaining a good approximation from a small subset of columns. The third property is needed for efficient running time.
Approximate triangle inequality. For $t \in \mathbb{Z}_{>0}$, we say a function $g : \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$ satisfies the $\mathrm{ati}_{g,t}$-approximate triangle inequality if for any $x_1, x_2, \cdots, x_t \in \mathbb{R}$, $g\big(\sum_{i=1}^t x_i\big) \leq \mathrm{ati}_{g,t} \cdot \sum_{i=1}^t g(x_i)$.
Monotone property. For any parameter $\mathrm{mon}_g \geq 1$, we say a function $g : \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$ is $\mathrm{mon}_g$-monotone if for any $x, y \in \mathbb{R}$ with $0 \leq |x| \leq |y|$, we have $g(x) \leq \mathrm{mon}_g \cdot g(y)$.
Many functions, including most M-estimators [15] and the quantile function [26], satisfy the above two properties. See Table 1 for several examples.
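A quick numerical sanity check (our sketch, using the Table 1 form of Huber with $\tau = 1$) that Huber is $1$-monotone and satisfies the approximate triangle inequality with $\mathrm{ati}_{g,t} = t$, an $O(t)$ bound:

```python
import random

def huber(x, tau=1.0):
    """Table 1 convention: x^2/2 for |x| <= tau, tau*(|x| - tau/2) otherwise."""
    ax = abs(x)
    return ax * ax / 2 if ax <= tau else tau * (ax - tau / 2)

random.seed(0)
t = 5
for _ in range(2000):
    xs = [random.uniform(-10.0, 10.0) for _ in range(t)]
    # approximate triangle inequality: g(sum x_i) <= t * sum g(x_i), an O(t) bound
    assert huber(sum(xs)) <= t * sum(huber(x) for x in xs) + 1e-9
    # 1-monotonicity: 0 <= |x| <= |y| implies g(x) <= g(y)
    x, y = sorted(xs[:2], key=abs)
    assert huber(x) <= huber(y) + 1e-12
```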
We refer the reader to the supplementary material, namely Appendix B, for the necessity of these two properties. Our next property is not structural, but rather states that if the loss function has an efficient regression algorithm, then that suffices to efficiently find a small subset of columns spanning a good low rank approximation.
Regression property. We say a function $g : \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$ has the $(\mathrm{reg}_{g,d}, T_{\mathrm{reg},g,n,d,m})$-regression property if the following holds: given two matrices $A \in \mathbb{R}^{n \times d}$ and $B \in \mathbb{R}^{n \times m}$, for each $i \in [m]$, let $\mathrm{OPT}_i$ denote $\min_{x \in \mathbb{R}^d} \|Ax - B_i\|_g$. There is an algorithm that runs in $T_{\mathrm{reg},g,n,d,m}$ time, outputs a matrix $X' \in \mathbb{R}^{d \times m}$ such that $\|AX'_i - B_i\|_g \leq \mathrm{reg}_{g,d} \cdot \mathrm{OPT}_i$ for all $i \in [m]$, and outputs a vector of estimated regression costs $v \in \mathbb{R}^m$ such that $\mathrm{OPT}_i \leq v_i \leq \mathrm{reg}_{g,d} \cdot \mathrm{OPT}_i$ for all $i \in [m]$. The success probability is at least $1 - 1/\poly(nm)$.
Some functions for which regression itself is non-trivial are, e.g., the $\ell_0$-loss function and the Tukey function. The $\ell_0$-loss function corresponds to the nearest codeword problem over the reals and has slightly better than an $O(d)$-approximation ([27, 28], see also [20]). For the Tukey function, [29] shows that Tukey regression is NP-hard, and it also gives approximation algorithms. For discussion of regression solvers, we refer the reader to Appendix C.

Zero-one law (sufficient conditions): For any function, as long as the above three general properties hold, we can provide an efficient algorithm, as our following main theorem shows.
Theorem 1.2. Given a matrix $A \in \mathbb{R}^{n \times n}$, let $k \geq 1$, $k' = 2k + 1$. Let $g : \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$ denote a function satisfying the $\mathrm{ati}_{g,k'}$-approximate triangle inequality, the $\mathrm{mon}_g$-monotone property, and the $(\mathrm{reg}_{g,k'}, T_{\mathrm{reg},g,n,k',n})$-regression property.
Let $\mathrm{OPT} = \min_{\textrm{rank-}k~A'} \|A' - A\|_g$. There is an algorithm that runs in $\tilde{O}(n + T_{\mathrm{reg},g,n,k',n})$ time and outputs a set $S \subseteq [n]$ with $|S| = O(k \log n)$ such that, with probability at least $0.99$,
$\min_{X \in \mathbb{R}^{|S| \times n}} \|A_S X - A\|_g \leq \mathrm{ati}_{g,k'} \cdot \mathrm{mon}_g \cdot \mathrm{reg}_{g,k'} \cdot O(k \log k) \cdot \mathrm{OPT}.$
Although the input matrix $A$ in the above statement is a square matrix, it is straightforward to extend the result to the rectangular case. By the above theorem, we can obtain a good subset of columns. To further get a low rank matrix $B$ which is a good low rank approximation to $A$, it suffices to take an additional $T_{\mathrm{reg},g,n,|S|,n}$ time to solve the regression problem.
Zero-one law (necessary conditions): In Appendix B.1, we show how to construct a monotone function without the approximate triangle inequality such that it is not possible to obtain a good low rank approximation by selecting a small subset of columns.
In Appendix B.2, we discuss a function which satisfies the approximate triangle inequality but is not monotone. We show that for some matrices, there is no small subset of columns which gives a good low rank approximation for such a loss function.

1.1.2 Lower Bound on the Number of Columns

One may wonder if the $\log n$ blowup in rank is necessary in our theorem. We show some dependence on $n$ is necessary by showing that for the important Huber loss function, at least $\sqrt{\log n}$ columns are required in order to obtain a constant factor approximation for $k = 1$:

Theorem 1.3.
Let $H(x)$ denote the following function: $H(x) = x^2$ if $|x| < 1$, and $H(x) = |x|$ if $|x| \geq 1$. For any $n \geq 1$, there is a matrix $A \in \mathbb{R}^{n \times n}$ such that, if we select $o(\sqrt{\log n})$ columns to fit the entire matrix, there is no $O(1)$-approximation, i.e., for any subset $S \subseteq [n]$ with $|S| = o(\sqrt{\log n})$,
$\min_{X \in \mathbb{R}^{|S| \times n}} \|A_S X - A\|_H \geq \omega(1) \cdot \min_{\textrm{rank-}1~A'} \|A' - A\|_H.$
Notice that the above function $H(x)$ is always a constant-factor approximation to the Huber function (see Table 1) with $\tau = 1$. Thus, the hardness also holds for the Huber function. For more discussion of our lower bound, we refer the reader to Appendix D.

1.2 Overview of our Approach and Related Work

Low Rank Approximation for General Functions. A natural approach to low rank approximation is "column subset selection", which has been extensively studied in numerical linear algebra [30, 31, 32, 33, 34, 35, 36, 37, 18, 38]. One can take the column subset selection algorithm for $\ell_p$-low rank approximation in [19] and try to adapt it to general loss functions. Namely, their argument shows that for any matrix $A \in \mathbb{R}^{n \times n}$ there exists a subset $S$ of $k$ columns of $A$, denoted by $A_S \in \mathbb{R}^{n \times k}$, for which there exists a $k \times n$ matrix $V$ for which $\|A_S V - A\|_p^p \leq (k+1)^p \min_{\textrm{rank-}k~B'} \|A - B'\|_p^p$; we refer the reader to Theorem 3 of [19]. Given the existence of such a subset $S$, a natural next idea is to then sample a set $T$ of $k$ columns of $A$ uniformly at random.
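The claim that $H$ is a constant-factor approximation to Huber can be checked mechanically; the sketch below (our illustration) uses the Table 1 Huber with $\tau = 1$ and verifies the pointwise ratio lies in $(1, 2]$.

```python
def H(x):
    """The hard-instance loss: quadratic inside (-1, 1), linear outside."""
    return x * x if abs(x) < 1 else abs(x)

def huber(x, tau=1.0):
    ax = abs(x)
    return ax * ax / 2 if ax <= tau else tau * (ax - tau / 2)

# H is within a factor of 2 of the Huber function with tau = 1, everywhere.
for x in [1e-6, 0.3, 0.99, 1.0, 2.5, 100.0]:
    ratio = H(x) / huber(x)
    assert 1.0 < ratio <= 2.0
```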
It is then likely the case that if we look at a random column $A_i$, (1) with probability $1/(k+1)$, $i$ is not among the subset $S$ of $k$ columns out of the $k+1$ columns $T \cup \{i\}$ defining the optimal rank-$k$ approximation to the submatrix $A_{T \cup \{i\}}$, and (2) with probability at least $1/2$, the best rank-$k$ approximation to $A_{T \cup \{i\}}$ has cost at most
$(2(k+1)/n) \cdot \min_{\textrm{rank-}k~B'} \|A - B'\|_p^p. \quad (1)$
Indeed, (1) follows from $T \cup \{i\}$ being a uniformly random subset of $k+1$ columns, while (2) follows from a Markov bound. The argument in Theorem 7 of [19] is then able to "prune" a $1/(k+1)$ fraction of columns (this can be optimized to a constant fraction) in expectation, by "covering" them with the random set $T$. Recursing on the remaining columns, this procedure stops after $k \log n$ iterations, giving a column subset of size $O(k^2 \log n)$ (which can be optimized to $O(k \log n)$) and an $O(k)$-approximation.
The proof in [19] of the existence of a subset $S$ of $k$ columns of $A$ spanning a $(k+1)$-approximation above is quite general, and one might suspect it generalizes to a large class of error functions. Suppose, for example, that $k = 1$. The idea there is to write $A = A^* + \Delta$, where $A^* = U \cdot V$ is the optimal rank-1 $\ell_p$-low rank approximation to $A$. One then "normalizes" by the error, defining $\tilde{A}^*_i = A^*_i / \|\Delta_i\|_p$ and letting $s$ be such that $\|\tilde{A}^*_s\|_p$ is largest. The rank-1 subset $S$ is then just $A_s$.
Note that since $\tilde{A}^*$ has rank 1 and $\|\tilde{A}^*_s\|_p$ is largest, one can write $\tilde{A}^*_j = \alpha_j \cdot \tilde{A}^*_s$ for every $j \neq s$ with $|\alpha_j| \leq 1$. The fact that $|\alpha_j| \leq 1$ is crucial; indeed, consider what happens when we try to "approximate" $A_j$ by $A_s \cdot \alpha_j \|\Delta_j\|_p / \|\Delta_s\|_p$. Then
$\|A_j - A_s \alpha_j \|\Delta_j\|_p / \|\Delta_s\|_p\|_p \leq \|A_j - A^*_j\|_p + \|A^*_j - (A^*_s + \Delta_s)\alpha_j \|\Delta_j\|_p / \|\Delta_s\|_p\|_p = \|\Delta_j\|_p + \|\Delta_s \alpha_j \|\Delta_j\|_p / \|\Delta_s\|_p\|_p,$
and since the $p$-norm is monotonically increasing and $|\alpha_j| \leq 1$, the latter is at most $\|\Delta_j\|_p + \|\Delta_s \,\|\Delta_j\|_p / \|\Delta_s\|_p\|_p$. So far, all we have used about the $p$-norm is the monotone increasing property, so one could hope that the argument could be generalized to a much wider class of functions.
However, at this point the proof uses that the $p$-norm has scale-invariance, and so $\|\Delta_s \,\|\Delta_j\|_p / \|\Delta_s\|_p\|_p = \|\Delta_j\|_p \cdot \|\Delta_s / \|\Delta_s\|_p\|_p = \|\Delta_j\|_p$, so the total is at most $2\|\Delta_j\|_p$, giving an overall 2-approximation (recall $k = 1$). But what would happen for a general, not necessarily scale-invariant function $g$? We need to bound $\|\Delta_s \,\|\Delta_j\|_g / \|\Delta_s\|_g\|_g$. If we could bound this by $O(\|\Delta_j\|_g)$, we would obtain the same conclusion as before, up to constant factors.
Consider, though, the "reverse Huber function": $g(x) = x^2$ if $x \geq 1$ and $g(x) = |x|$ for $x \leq 1$. Suppose that $\Delta_s$ and $\Delta_j$ were just 1-dimensional vectors, i.e., real numbers, so we need to bound $g(\Delta_s g(\Delta_j) / g(\Delta_s))$ by $O(g(\Delta_j))$. Suppose $\Delta_s = 1$. Then $g(\Delta_s) = 1$ and $g(\Delta_s g(\Delta_j) / g(\Delta_s)) = g(g(\Delta_j))$, and if $\Delta_j = n$, then $g(g(\Delta_j)) = n^4 = g(\Delta_j)^2$, much larger than the $O(g(\Delta_j))$ we were aiming for.
Maybe the analysis can be slightly changed to correct for these normalization issues? This is not the case, as we show that unlike for $\ell_p$-low rank approximation, for the reverse Huber function there is no subset of 2 columns of $A$ obtaining better than an $n^{1/4}$-approximation factor (see Section D.2 for more details). Further, the lack of scale invariance not only breaks the argument in [19]; it shows that combinatorially such functions $g$ behave very differently than $\ell_p$-norms. We show more generally there exist functions, in particular the Huber function, for which one needs to choose $\Omega(\sqrt{\log n})$ columns to obtain a constant factor approximation; we describe this more below. Perhaps more surprisingly, we show a subset of $O(\log n)$ columns suffices to obtain a constant factor approximation to the best rank-1 approximation for any function $g(x)$ which is approximately monotone and has the approximate triangle inequality, the latter implying for any constant $C > 0$ and any $x \in \mathbb{R}_{\geq 0}$, $g(Cx) = O(g(x))$. For $k > 1$, these conditions become: (1) $g(x)$ is monotone non-decreasing in $x$, (2) $g(x)$ is within a $\poly(k)$ factor of $g(-x)$, and (3) for any real number $x \in \mathbb{R}_{\geq 0}$, $g(O(kx)) \leq \poly(k) \cdot g(x)$. We show it is possible to obtain an $O(k^2 \log k)$ approximation with $O(k \log n)$ columns.
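The blow-up in the reverse-Huber calculation above is easy to reproduce numerically (our sketch; we take the even extension $g(|x|)$ of the definition in the text):

```python
def rev_huber(x):
    """'Reverse Huber': linear near the origin, quadratic in the tails
    (even extension of the definition in the text)."""
    ax = abs(x)
    return ax * ax if ax >= 1 else ax

n = 1000.0
delta_s, delta_j = 1.0, n
# The normalized fitting cost g(delta_s * g(delta_j) / g(delta_s)) explodes:
cost = rev_huber(delta_s * rev_huber(delta_j) / rev_huber(delta_s))
assert cost == n ** 4                    # g(g(n)) = n^4
assert cost == rev_huber(delta_j) ** 2   # quadratically larger than g(delta_j)
```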
We give the intuition and main lemma statements for our result in Section 2, deferring proofs to the supplementary material.
Even for $\ell_p$-low rank approximation, our algorithms slightly improve and correct a minor error in [19], which claims in Theorem 7 an $O(k)$-approximation with $O(k \log n)$ columns for $\ell_p$-low rank approximation. However, their algorithm actually gives an $O(k \log n)$-approximation with $O(k \log n)$ columns. In [19] it was argued that one expects to pay a cost of $O(k/n) \cdot \min_{\textrm{rank-}k~B'} \|A - B'\|_p^p$ per column as in (1), and since each column is only counted in one iteration, summing over the columns gives $O(k) \cdot \min_{\textrm{rank-}k~B'} \|A - B'\|_p^p$ total cost. The issue is that the value of $n$ is changing in each iteration, so if in the $i$-th iteration it is $n_i$, then we could pay $n_i \cdot O(k/n_i) \cdot \min_{\textrm{rank-}k~B'} \|A - B'\|_p^p = O(k) \cdot \min_{\textrm{rank-}k~B'} \|A - B'\|_p^p$ in each of $O(\log n)$ iterations, giving an $O(k \log n)$ approximation ratio.
In contrast, our algorithm achieves an $O(k \log k)$ approximation ratio for $\ell_p$-low rank approximation as a special case, which gives the first $O(1)$-approximation in nearly linear time for any constant $k$ for $\ell_p$-norms. Our analysis is finer in that we show not only do we expect to pay a cost of $O(k/n_i) \cdot \min_{\textrm{rank-}k~B'} \|A - B'\|_p^p$ per column in iteration $i$, we pay $O(k/n_i)$ times the cost of the best rank-$k$ approximation to $A$ after the most costly $n/k$ columns have been removed; thus we pay $O(k/n_i)$ times a residual cost with the top $n/k$ columns removed. This ultimately implies any column's cost can contribute in at most $O(\log k)$ of the $O(\log n)$ recursive calls, replacing an $O(\log n)$ factor with an $O(\log k)$ factor in the approximation ratio. This also gives the first $\poly(k)$-approximation for $\ell_0$-low rank approximation, studied in [20], improving the $O(k^2 \log(n/k))$-approximation there to $O(k^2 \log k)$ and giving the first constant approximation for constant $k$.

Algorithm 1 Low rank approximation algorithm for general functions
1: procedure GeneralFunctionLowRankApprox($A \in \mathbb{R}^{n \times n}$, $k \in \mathbb{Z}_{\geq 1}$, $g : \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$)
2:   Initialization: $T_0 \leftarrow [n]$, $i \leftarrow 1$, $r \leftarrow 0$
3:   while $|T_{i-1}| \geq 1000k$ do
4:     for $j = 1 \rightarrow \log n$ do
5:       Sample $S_i^{(j)}$ from $\binom{T_{i-1}}{2k}$ uniformly at random
6:       Solve the $\mathrm{reg}_{g,2k}$-approximate regression $\min_{x \in \mathbb{R}^{2k}} \|A_{S_i^{(j)}} x - A_t\|_g$ for each $t \in T_{i-1} \setminus S_i^{(j)}$, and let $v_{i,t}^{(j)}$ be the $\mathrm{reg}_{g,2k}$-estimated regression cost  (see Section 1.1.1 for the regression property)
7:       $R_i^{(j)} \leftarrow \{t \mid v_{i,t}^{(j)}$ is among the smallest $|T_{i-1} \setminus S_i^{(j)}|/20$ values of $\{v_{i,t'}^{(j)} \mid t' \in T_{i-1} \setminus S_i^{(j)}\}\}$
8:       $c_i^{(j)} \leftarrow \sum_{t \in R_i^{(j)}} v_{i,t}^{(j)}$
9:     end for
10:    $j^* \leftarrow \arg\min_{j \in [\log n]} c_i^{(j)}$
11:    $S_i \leftarrow S_i^{(j^*)}$, $R_i \leftarrow R_i^{(j^*)}$, $T_i \leftarrow T_{i-1} \setminus (S_i \cup R_i)$
12:    $r \leftarrow i$, $i \leftarrow i + 1$
13:  end while
14:  return $S = T_r \cup \bigcup_{i \in [r]} S_i$  (it is easy to see $r \leq O(\log n)$ from the above procedure)
15: end procedure

2 Algorithm for General Loss Low Rank Approximation

Our algorithm is presented in Algorithm 1. First, let us briefly analyze the running time. Consider fixed $i \in [r]$, $j \in [\log n]$. Sampling $S_i^{(j)}$ takes $O(k)$ time. Solving the $\mathrm{reg}_{g,2k}$-approximate regression $\min_x \|A_{S_i^{(j)}} x - A_t\|_g$ for all $t \in T_{i-1} \setminus S_i^{(j)}$ takes $T_{\mathrm{reg},g,n,2k,|T_{i-1} \setminus S_i^{(j)}|} \leq T_{\mathrm{reg},g,n,2k+1,n}$ time. Since finding the $|T_{i-1} \setminus S_i^{(j)}|/20$ smallest element can be done in $O(n)$ time, $R_i^{(j)}$ can be computed in $O(n)$ time. Thus the inner loop takes $O(n + T_{\mathrm{reg},g,n,2k+1,n})$ time.
Since $r = O(\log n)$, the total running time over all $i, j$ is $O((n + T_{\mathrm{reg},g,n,2k+1,n}) \log^2 n)$. In the remainder of the section, we will sketch the proof of correctness. For the missing proofs, we refer the reader to Appendix A.

2.1 Properties of Uniform Column Sampling

Let us first introduce some useful notation. Consider a rank-$k$ matrix $M^* \in \mathbb{R}^{n \times m}$. For a set $H \subseteq [m]$, let $R_{M^*}(H) \subseteq H$ be a set such that
$R_{M^*}(H) = \arg\max_{P : P \subseteq H} \big\{ |\det((M^*)^Q_P)| \;\big|\; |P| = |Q| = \mathrm{rank}(M^*_H),\ Q \subseteq [n] \big\},$
where $\det(C)$ denotes the determinant of a square matrix $C$. Notice that in the above formula, the maximum is over all possible choices of $P$ and $Q$, while $R_{M^*}(H)$ only takes the value of the corresponding $P$. By Cramer's rule, if we use a linear combination of the columns of $M^*_{R_{M^*}(H)}$ to express any column of $M^*_H$, the absolute value of every fitting coefficient will be at most 1. For example, consider a rank-$k$ matrix $M^* \in \mathbb{R}^{n \times (k+1)}$ and $H = [k+1]$. Let $P \subseteq [k+1]$, $Q \subseteq [n]$, $|P| = |Q| = k$ be such that $|\det((M^*)^Q_P)|$ is maximized. Since $M^*$ has rank $k$, we know $\det((M^*)^Q_P) \neq 0$ and thus the columns of $M^*_P$ are independent. Let $i \in [k+1] \setminus P$. Then the linear equation $M^*_P x = M^*_i$ is feasible and there is a unique solution $x$. Furthermore, by Cramer's rule, $x_j = \det((M^*)^Q_{[k+1] \setminus \{j\}}) / \det((M^*)^Q_P)$. Since $|\det((M^*)^Q_P)| \geq |\det((M^*)^Q_{[k+1] \setminus \{j\}})|$, we have $\|x\|_\infty \leq 1$.
Consider an arbitrary matrix $M \in \mathbb{R}^{n \times m}$.
We can write $M = M^* + N$, where $M^* \in \mathbb{R}^{n \times m}$ is an arbitrary rank-$k$ matrix, and $N \in \mathbb{R}^{n \times m}$ is the residual matrix. The following lemma shows that, if we randomly choose a subset $H \subseteq [m]$ of $2k$ columns, and we randomly look at another column $i$, then with constant probability, the absolute values of all the coefficients of using a linear combination of the columns of $M^*_H$ to express $M^*_i$ are at most 1, and furthermore, if we use the same coefficients to use columns of $M_H$ to fit $M_i$, then the fitting cost is proportional to $\|N_H\|_g + \|N_i\|_g$.
Lemma 2.1. Given a matrix $M \in \mathbb{R}^{n \times m}$ and a parameter $k \geq 1$, let $M^* \in \mathbb{R}^{n \times m}$ be an arbitrary rank-$k$ matrix. Let $N = M - M^*$. Let $H \subseteq [m]$ be a uniformly random subset of size $2k$, and let $i$ denote a uniformly random index sampled from $[m] \setminus H$. Then (I) $\Pr[i \notin R_{M^*}(H \cup \{i\})] \geq 1/2$; (II) if $i \notin R_{M^*}(H \cup \{i\})$, then there exist $|H|$ coefficients $\alpha_1, \alpha_2, \cdots, \alpha_{|H|}$ for which $M^*_i = \sum_{j=1}^{|H|} \alpha_j (M^*_H)_j$, $|\alpha_j| \leq 1$ for all $j \in [|H|]$, and $\min_{x \in \mathbb{R}^{|H|}} \|M_H x - M_i\|_g \leq \mathrm{ati}_{g,|H|+1} \cdot \mathrm{mon}_g \cdot \big( \|N_i\|_g + \sum_{j=1}^{|H|} \|(N_H)_j\|_g \big).$
Notice that part (II) of the above lemma does not depend on any randomness of $H$ or $i$.
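Lemma 2.1's coefficient bound comes from the max-determinant choice of $R_{M^*}(H)$ together with Cramer's rule; the brute-force sketch below (our illustration, feasible only for tiny sizes) picks the maximizing $(P, Q)$ and checks that every fitting coefficient has absolute value at most 1.

```python
import numpy as np
from itertools import combinations

def max_det_subset(M, k):
    """Brute-force search for column set P and row set Q (|P| = |Q| = k)
    maximizing |det(M[Q, P])|; only feasible for tiny matrices."""
    n, m = M.shape
    best_val, best = -1.0, None
    for P in combinations(range(m), k):
        for Q in combinations(range(n), k):
            d = abs(np.linalg.det(M[np.ix_(Q, P)]))
            if d > best_val:
                best_val, best = d, (P, Q)
    return best

rng = np.random.default_rng(0)
k = 3
M = rng.standard_normal((6, k)) @ rng.standard_normal((k, k + 1))  # rank k, k+1 cols
P, Q = max_det_subset(M, k)
i = next(j for j in range(k + 1) if j not in P)
x = np.linalg.solve(M[np.ix_(Q, P)], M[np.ix_(Q, [i])]).ravel()
assert np.max(np.abs(x)) <= 1 + 1e-9            # Cramer: coefficients in [-1, 1]
assert np.allclose(M[:, list(P)] @ x, M[:, i])  # exact fit, since M has rank k
```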
By applying part (I) of the above lemma, it is enough to prove that if we randomly choose a subset H of 2k columns, then for a constant fraction of the columns i, M*_i can be expressed by a linear combination of the columns of M*_H in which the absolute values of all the fitting coefficients are at most 1. Because of Cramer's rule, it thus suffices to prove the following lemma.

Lemma 2.2.

    Pr_{H ∼ ([m] choose 2k)} [ |{ i | i ∈ [m] \ H, i ∉ R_{M*}(H ∪ {i}) }| ≥ (m − 2k)/4 ] ≥ 1/4.

2.2 Correctness of the Algorithm

We write the input matrix A as A* + Δ, where A* ∈ R^{n×n} is the best rank-k approximation to A, and Δ ∈ R^{n×n} is the residual matrix with respect to A*. Then ‖Δ‖_g = Σ_{i=1}^n ‖Δ_i‖_g is the optimal cost. As shown in Algorithm 1, our approach iteratively eliminates all the columns. In each iteration, we sample a subset of columns and use these columns to fit the other columns. We then drop a constant fraction of the columns which have a good fitting cost. Suppose the indices of the columns surviving after the i-th outer iteration are T_i = {t_{i,1}, t_{i,2}, ..., t_{i,m_i}} ⊆ [n]. Without loss of generality, we can assume ‖Δ_{t_{i,1}}‖_g ≥ ‖Δ_{t_{i,2}}‖_g ≥ ... ≥ ‖Δ_{t_{i,m_i}}‖_g. The following claim shows that if we randomly sample 2k column indices H from T_i, then the cost of Δ_H will not be large.

Claim 2.3.
If |T_i| = m_i ≥ 1000k, then

    Pr_{H ∼ (T_i choose 2k)} [ Σ_{j∈H} ‖Δ_j‖_g ≤ 400 · (k/m_i) · Σ_{j=m_i/(100k)}^{m_i} ‖Δ_{t_{i,j}}‖_g ] ≥ 19/20.

By an averaging argument, in the following claim, we can show that there is a constant fraction of columns in T_i whose optimal cost is also small.

Claim 2.4. If |T_i| = m_i ≥ 1000k, then

    |{ t_{i,j} | t_{i,j} ∈ T_i, ‖Δ_{t_{i,j}}‖_g ≥ (20/m_i) · Σ_{j'=m_i/(100k)}^{m_i} ‖Δ_{t_{i,j'}}‖_g }| ≤ m_i/5.

By combining Lemma 2.2 and part (II) of Lemma 2.1 with the above two claims, it is sufficient to prove the following core lemma. It says that if we randomly choose a subset of 2k columns from T_i, then we can fit a constant fraction of the columns of T_i with a small cost.

Lemma 2.5. If |T_i| = m_i ≥ 1000k, then

    Pr_{H ∼ (T_i choose 2k)} [ |{ j | j ∈ T_i, min_{x ∈ R^{|H|}} ‖A_H x − A_j‖_g ≤ C_1 · (1/m_i) · Σ_{j'=m_i/(100k)}^{m_i} ‖Δ_{t_{i,j'}}‖_g }| ≥ m_i/20 ] ≥ 1/5,

where C_1 = 500 · k · ati_{g,|S|+1} · mon_g.

Let us briefly explain why the above lemma is enough to prove the correctness of our algorithm. For each column j ∈ [n], either column j is in T_r and is selected by the end of the algorithm, or there exists i < r such that j ∈ T_i \ T_{i+1}. If j ∈ T_i \ T_{i+1}, then by the above lemma, we can show that with high probability, min_x ‖A_{S_{i+1}} x − A_j‖_g ≤ O(C_1 ‖Δ‖_g / |T_i|). Thus, min_X ‖A_{S_{i+1}} X − A_{T_i\T_{i+1}}‖_g ≤ O(C_1 ‖Δ‖_g). This directly gives an O(rC_1) = O(C_1 log n) approximation.
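For intuition, the elimination loop can be sketched as follows. This is our simplified rendering, not the paper's exact Algorithm 1: it substitutes ordinary least squares for the approximate g-regression oracle and fixes illustrative constants, so the formal guarantees do not directly transfer; the Huber cost is used only to score the fits.

```python
import numpy as np

def huber(r, tau=1.0):
    """Entrywise Huber cost of a residual vector."""
    a = np.abs(r)
    return float(np.where(a <= tau, 0.5 * a ** 2, tau * (a - 0.5 * tau)).sum())

def select_columns(A, k, reps=20, loss=huber, rng=None):
    """Repeatedly sample 2k surviving columns, fit the rest, and drop the
    best-fit half; stop once at most 4k columns survive."""
    rng = np.random.default_rng(0) if rng is None else rng
    T = list(range(A.shape[1]))   # surviving column indices
    S = []                        # selected column indices
    while len(T) > 4 * k:
        best = None
        for _ in range(reps):     # retry to find a good 2k-subset H
            H = [int(j) for j in rng.choice(T, size=2 * k, replace=False)]
            rest = [j for j in T if j not in H]
            X, *_ = np.linalg.lstsq(A[:, H], A[:, rest], rcond=None)
            costs = [loss(A[:, H] @ X[:, t] - A[:, j]) for t, j in enumerate(rest)]
            score = sum(sorted(costs)[: len(rest) // 2])  # cost of best-fit half
            if best is None or score < best[0]:
                best = (score, H, rest, costs)
        _, H, rest, costs = best
        S += H                        # keep the sampled columns
        order = np.argsort(costs)     # drop the well-fit half of the rest
        T = [rest[t] for t in order[len(rest) // 2:]]
    return S + T
```

Since each round removes roughly half of the surviving columns, there are r = O(log n) rounds, and the output has on the order of k log n columns, matching the bi-criteria behavior described above.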
For the detailed proof of Theorem 1.2, we refer the reader to Appendix A.

3 Experiments

We show that with Huber loss low rank approximation, it is possible to outperform the SVD and entrywise ℓ1 low rank approximation on certain noise distributions. Even bi-criteria solutions can work very well. This motivates our study of general entrywise loss functions.

Suppose the noise in the input matrix is a mixture of small Gaussian noise and sparse outliers. Consider an extreme case: the data matrix A ∈ R^{n×n} is a block diagonal matrix which contains three blocks: one block has size n1 × n1 (n1 = Θ(n)) and has uniformly small noise (every entry is Θ(1/√n)), another block has only one entry, which is a large outlier (with value Θ(n^0.8)), and the third block is the ground truth matrix of size n3 × n3 (n3 = Θ(n^0.6)), where the absolute value of each entry is at least 1/n^{o(1)} and at most n^{o(1)}. If we apply Frobenius norm rank-1 approximation, then since (n^0.8)² > (n^0.6)² · n^{o(1)} and (n^0.8)² > n² · (1/√n)², we can only learn the large outlier. If we apply entrywise ℓ1 norm rank-1 approximation, then since n² · 1/√n > (n^0.6)² · n^{o(1)} and n² · 1/√n > n^0.8, we can only learn the uniformly small noise. But if we apply Huber loss rank-1 approximation, then we can learn the ground truth matrix.

A natural question is: can bi-criteria Huber loss low rank approximation also learn the ground truth matrix under certain noise distributions? We ran experiments to answer this question.

Parameters. In each iteration, we choose 2k columns to fit the remaining columns, and we drop the half of the columns with the smallest regression cost. In each iteration, we repeat 20 times to find the best 2k columns. When at most 4k columns remain, we finish our algorithm. We choose to optimize the Huber loss function, i.e., f(x) = x²/2 for |x| ≤ 1, and f(x) = |x| − 1/2 for |x| > 1.

Data. We evaluate our algorithms on input data matrices A ∈ R^{n×n} of several sizes, for n ∈ {200, 300, 400, 500}. For rank-1 bi-criteria solutions, the output rank is given in Table 2.

Table 2: The output rank of our algorithm for different input sizes and for k = 1.

    n            200   300   400   500
    Output rank   12    12    14    14

A is constructed as a block diagonal matrix with three blocks. The first block has size (4/5)n × (4/5)n. It contains many copies of k' different columns, where k' is equal to the output rank corresponding to n (see Table 2). Each entry of such a column is uniformly drawn from {−5/√n, 5/√n}. The second block is the ground truth matrix. It is generated as 1/√k' · U V^⊤, where U, V ∈ R^{n×k'} are two i.i.d. random Gaussian matrices. The last block is a k' × k' diagonal matrix where each diagonal entry is a sparse outlier with magnitude of absolute value 5 · n^0.8.

Experimental Results. We compare our algorithm with Frobenius norm low rank approximation and the entrywise ℓ1 loss low rank approximation algorithm [18]. To make them comparable, we set the target rank of the previous algorithms to be the output rank of our algorithm. In Figure 1, we can see that the ground truth matrix is well covered by our Huber loss low rank approximation. In Figure 2, we show that our algorithm indeed gives a good solution with respect to the Huber loss.

Figure 1: The input data has size 500 × 500. The color indicates the logarithmic magnitude of the absolute value of each entry. (a) is the input matrix. It contains 3 blocks on its diagonal. The top-left one has uniformly small noise. The central one is the ground truth. The bottom-right one contains sparse outliers. Each block has rank 14.
So the rank of the input matrix is 3 × 14 = 42. (b) is the entrywise ℓ1 loss rank-14 approximation given by [18]. As shown above, it mainly covers the small noise, but loses the information of the ground truth. (c) is the Frobenius norm rank-14 approximation given by the top 14 singular vectors. As shown in the figure, it mainly covers the outliers; however, it also loses the information of the ground truth. (d) is the rank-1 bi-criteria solution given by our algorithm. As we can see, it covers the ground truth matrix quite well.

Figure 2: The Huber loss given by different algorithms. The red bar is for the entrywise ℓ1 low rank approximation algorithm [18]. The green bar is for traditional PCA. The blue bar is for our algorithm. For input sizes n = 200, 300, all the algorithms output rank-12 approximations. For input sizes n = 400, 500, all the algorithms output rank-14 approximations.

Acknowledgments. David P. Woodruff was supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562. Part of this work was done while he was visiting the Simons Institute for the Theory of Computing. Peilin Zhong was supported in part by NSF grants (CCF-1703925, CCF-1421161, CCF-1714818, CCF-1617955 and CCF-1740833), Simons Foundation (#491119 to Alexandr Andoni), Google Research Award and a Google Ph.D. fellowship. Part of this work was done while Zhao Song and Peilin Zhong were interns at IBM Research - Almaden and while Zhao Song was visiting the Simons Institute for the Theory of Computing.

References

[1] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.

[2] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the nineteenth annual ACM symposium on Theory of computing, pages 1–6. ACM, 1987.

[3] Virginia Vassilevska Williams. Multiplying matrices faster than Coppersmith-Winograd.
In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 887–898. ACM, 2012.

[4] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation, pages 296–303. ACM, 2014.

[5] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 143–152, 2006.

[6] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference, STOC'13, Palo Alto, CA, USA, June 1-4, 2013, pages 81–90. https://arxiv.org/pdf/1207.6365, 2013.

[7] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 91–100. ACM, https://arxiv.org/pdf/1210.3135, 2013.

[8] Jelani Nelson and Huy L Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, https://arxiv.org/pdf/1211.1002, 2013.

[9] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in euclidean space. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 499–508, 2015.

[10] Michael B. Cohen.
Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Arlington, VA, USA, January 10-12, 2016, pages 278–287, 2016.

[11] Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th international conference on World Wide Web, pages 22–32. ACM, 2005.

[12] Yelp. Yelp dataset. http://www.yelp.com/dataset_challenge, 2014.

[13] Madeleine Udell, Corinne Horn, Reza Zadeh, Stephen Boyd, et al. Generalized low rank models. Foundations and Trends® in Machine Learning, 9(1):1–118, 2016.

[14] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

[15] Zhengyou Zhang. Parameter estimation techniques: A tutorial with application to conic fitting. Image and Vision Computing, 15(1):59–76, 1997.

[16] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.

[17] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[18] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.

[19] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, and David P Woodruff. Algorithms for ℓp low rank approximation. In ICML. arXiv preprint arXiv:1705.06730, 2017.

[20] Karl Bringmann, Pavel Kolev, and David P. Woodruff. Approximation algorithms for ℓ0-low rank approximation.
In Advances in Neural Information Processing Systems (NIPS), pages 6651–6662, 2017.

[21] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P. Woodruff. A PTAS for ℓp-low rank approximation. In SODA, 2019.

[22] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.

[23] Amit Deshpande, Kasturi R. Varadarajan, Madhur Tulsiani, and Nisheeth K. Vishnoi. Algorithms and hardness for subspace approximation. CoRR, abs/0912.1403, 2009.

[24] Kenneth L Clarkson and David P Woodruff. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 310–329. IEEE, https://arxiv.org/pdf/1510.06073, 2015.

[25] Zhao Song, Ruosong Wang, Lin F Yang, Hongyang Zhang, and Peilin Zhong. Efficient symmetric norm regression via linear sketching. arXiv preprint arXiv:1910.01788, 2019.

[26] Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, pages 33–50, 1978.

[27] Piotr Berman and Marek Karpinski. Approximating minimum unsatisfiability of linear equations. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 514–516, 2002.

[28] Noga Alon, Rina Panigrahy, and Sergey Yekhanin. Deterministic approximation algorithms for the nearest codeword problem. In Algebraic Methods in Computational Complexity, 2009.

[29] Kenneth L. Clarkson, Ruosong Wang, and David P. Woodruff. Dimensionality reduction for Tukey regression. In ICML, 2019.

[30] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-row-based methods.
In Algorithms - ESA 2006, 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006, Proceedings, pages 304–314, 2006.

[31] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: Column-based methods. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 9th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX 2006 and 10th International Workshop on Randomization and Computation, RANDOM 2006, Barcelona, Spain, August 28-30 2006, Proceedings, pages 316–326, 2006.

[32] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM J. Matrix Analysis Applications, 30(2):844–881, 2008.

[33] Christos Boutsidis, Michael W Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 968–977. Society for Industrial and Applied Mathematics, https://arxiv.org/pdf/0812.4293, 2009.

[34] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near optimal column-based matrix reconstruction. In IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 305–314. https://arxiv.org/pdf/1103.0995, 2011.

[35] Ahmed K Farahat, Ahmed Elgohary, Ali Ghodsi, and Mohamed S Kamel. Distributed column subset selection on MapReduce. In 2013 IEEE 13th International Conference on Data Mining (ICDM), pages 171–180. IEEE, 2013.

[36] Christos Boutsidis and David P Woodruff. Optimal CUR matrix decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 353–362. ACM, https://arxiv.org/pdf/1405.7910, 2014.

[37] Yining Wang and Aarti Singh.
Column subset selection with missing data via active sampling. In The 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1033–1041, 2015.

[38] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA. https://arxiv.org/pdf/1704.08246, 2019.

[39] Jiyan Yang, Xiangrui Meng, and Michael W. Mahoney. Quantile regression for large-scale applications. SIAM J. Scientific Computing, 36(5), 2014.

[40] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML. https://arxiv.org/pdf/1706.03175.pdf, 2017.

[41] Peter Bartlett, Dave Helmbold, and Phil Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations. In International Conference on Machine Learning, pages 520–529, 2018.

[42] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In ICML. arXiv preprint arXiv:1802.02547, 2018.

[43] Kenneth L Clarkson and David P Woodruff. Sketching for M-estimators: A unified approach to robust regression. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 921–939. SIAM, 2015.

[44] David P. Woodruff and Qin Zhang. Subspace embeddings and ℓp-regression using exponential random variables. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 546–567, 2013.