{"title": "Optimal Sketching for Kronecker Product Regression and Low Rank Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 4737, "page_last": 4748, "abstract": "We study the Kronecker product regression problem, in which the design matrix is a Kronecker product of two or more matrices. Formally, given $A_i \\in \\R^{n_i \\times d_i}$ for $i=1,2,\\dots,q$ where $n_i \\gg d_i$ for each $i$, and $b \\in \\R^{n_1 n_2 \\cdots n_q}$, let $\\mathcal{A} = A_1 \\otimes A_2 \\otimes \\cdots \\otimes A_q$. Then for $p \\in [1,2]$, the goal is to find $x \\in \\R^{d_1 \\cdots d_q}$ that approximately minimizes $\\|\\mathcal{A}x - b\\|_p$. Recently, Diao, Song, Sun, and Woodruff (AISTATS, 2018) gave an algorithm which is faster than forming the Kronecker product $\\mathcal{A} \\in \\R^{n_1 \\cdots n_q \\times d_1 \\cdots d_q}$. Specifically, for $p=2$ they achieve a running time of $O(\\sum_{i=1}^q \\texttt{nnz}(A_i) + \\texttt{nnz}(b))$, where $\\texttt{nnz}(A_i)$ is the number of non-zero entries in $A_i$. Note that $\\texttt{nnz}(b)$ can be as large as $\\Theta(n_1 \\cdots n_q)$. For $p=1,$ $q=2$ and $n_1 = n_2$, they achieve a worse bound of $O(n_1^{3/2} \\text{poly}(d_1d_2) + \\texttt{nnz}(b))$. In this work, we provide significantly faster algorithms. For $p=2$, our running time is $O(\\sum_{i=1}^q \\texttt{nnz}(A_i))$, which has no dependence on $\\texttt{nnz}(b)$. For $p<2$, our running time is $O(\\sum_{i=1}^q \\texttt{nnz}(A_i) + \\texttt{nnz}(b))$, which matches the prior best running time for $p=2$. We also consider the related all-pairs regression problem, where given $A \\in \\R^{n \\times d}, b \\in \\R^n$, we want to solve $\\min_{x \\in \\R^d} \\|\\bar{A}x - \\bar{b}\\|_p$, where $\\bar{A} \\in \\R^{n^2 \\times d}, \\bar{b} \\in \\R^{n^2}$ consist of all pairwise differences of the rows of $A,b$.
We give an $O(\\texttt{nnz}(A))$ time algorithm for $p \\in [1,2]$, improving the $\\Omega(n^2)$ time required to form $\\bar{A}$. Finally, we initiate the study of Kronecker product low rank and low-trank approximation. For input $\\mathcal{A}$ as above, we give $O(\\sum_{i=1}^q \\texttt{nnz}(A_i))$ time algorithms, which is much faster than computing $\\mathcal{A}$.", "full_text": "Optimal Sketching for Kronecker Product Regression and Low Rank Approximation

Huaian Diao∗ Rajesh Jayaram† Zhao Song‡ Wen Sun§ David P. Woodruff¶

Abstract

We study the Kronecker product regression problem, in which the design matrix is a Kronecker product of two or more matrices. Formally, given Ai ∈ R^{ni×di} for i = 1, 2, ..., q where ni ≫ di for each i, and b ∈ R^{n1 n2···nq}, let A = A1 ⊗ A2 ⊗ ··· ⊗ Aq. Then for p ∈ [1, 2], the goal is to find x ∈ R^{d1···dq} that approximately minimizes ∥Ax − b∥p. Recently, Diao, Song, Sun, and Woodruff (AISTATS, 2018) gave an algorithm which is faster than forming the Kronecker product A ∈ R^{n1···nq × d1···dq}. Specifically, for p = 2 they achieve a running time of O(∑_{i=1}^q nnz(Ai) + nnz(b)), where nnz(Ai) is the number of non-zero entries in Ai. Note that nnz(b) can be as large as Θ(n1···nq). For p = 1, q = 2 and n1 = n2, they achieve a worse bound of O(n1^{3/2} poly(d1 d2) + nnz(b)). In this work, we provide significantly faster algorithms. For p = 2, our running time is O(∑_{i=1}^q nnz(Ai)), which has no dependence on nnz(b). For p < 2, our running time is O(∑_{i=1}^q nnz(Ai) + nnz(b)), which matches the prior best running time for p = 2. We also consider the related all-pairs regression problem, where given A ∈ R^{n×d}, b ∈ R^n, we want to solve min_{x∈R^d} ∥Āx − b̄∥p, where Ā ∈ R^{n²×d}, b̄ ∈ R^{n²} consist of all pairwise differences of the rows of A, b. We give an O(nnz(A)) time algorithm for p ∈ [1, 2], improving the Ω(n²) time required to form Ā. Finally, we initiate the study of Kronecker product low rank and low-trank approximation. For input A as above, we give O(∑_{i=1}^q nnz(Ai)) time algorithms, which is much faster than computing A.

1 Introduction

In the q-th order Kronecker product regression problem, one is given matrices A1, A2, ..., Aq, where Ai ∈ R^{ni×di}, as well as a vector b ∈ R^{n1 n2···nq}, and the goal is to obtain a solution to the optimization problem:

min_{x ∈ R^{d1 d2···dq}} ∥(A1 ⊗ A2 ⊗ ··· ⊗ Aq)x − b∥p,

where p ∈ [1, 2], and for a vector x ∈ R^n the ℓp norm is defined by ∥x∥p = (∑_{i=1}^n |xi|^p)^{1/p}. For p = 2, this is known as least squares regression, and for p = 1 this is known as least absolute deviation regression.

∗hadiao@nenu.edu.cn. Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, China.
†rkjayara@cs.cmu.edu. Carnegie Mellon University. Rajesh Jayaram would like to thank support from the Office of Naval Research (ONR) grant N00014-18-1-2562. This work was partly done while Rajesh Jayaram was visiting the Simons Institute for the Theory of Computing.
‡zhaosong@uw.edu. University of Washington. This work was partly done while Zhao Song was visiting the Simons Institute for the Theory of Computing.
§sun.wen@microsoft.com. Microsoft Research New York.
¶dwoodruf@cs.cmu.edu. Carnegie Mellon University. David Woodruff would like to thank support from the Office of Naval Research (ONR) grant N00014-18-1-2562. This work was also partly done while David Woodruff was visiting the Simons Institute for the Theory of Computing.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Kronecker product regression is a special case of ordinary regression in which the design matrix is highly structured. Namely, the design matrix is the Kronecker product of two or more smaller matrices. Such Kronecker product matrices naturally arise in applications such as spline regression, signal processing, and multivariate data fitting. We refer the reader to [VL92, VLP93, GVL13] for further background and applications of Kronecker product regression. As discussed in [DSSW18], Kronecker product regression also arises in structured blind deconvolution problems [OY05], and in the bivariate problem of surface fitting and multidimensional density smoothing [EM06].

A recent work of Diao, Song, Sun, and Woodruff [DSSW18] utilizes sketching techniques to output an x ∈ R^{d1 d2···dq} with objective function value at most (1 + ε) times larger than optimal, for both least squares and least absolute deviation Kronecker product regression. Importantly, their time complexity is faster than the time needed to explicitly compute the product A1 ⊗ ··· ⊗ Aq.
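The computational leverage here comes from the fact that a Kronecker product supports fast implicit operations. As a minimal illustration (our own NumPy sketch, not code from the paper), the matrix-vector product (A1 ⊗ A2)x can be applied without ever materializing A1 ⊗ A2, using the standard vec identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, d1, n2, d2 = 6, 3, 5, 4
A1 = rng.standard_normal((n1, d1))
A2 = rng.standard_normal((n2, d2))
x = rng.standard_normal(d1 * d2)

# Naive: materialize the (n1*n2) x (d1*d2) Kronecker product.
naive = np.kron(A1, A2) @ x

# Implicit: with row-major flattening, (A1 kron A2) x = vec(A1 X A2^T),
# where X is x reshaped to d1 x d2. This costs O(n1 d1 d2 + n1 n2 d2)
# arithmetic instead of O(n1 n2 d1 d2).
X = x.reshape(d1, d2)
implicit = (A1 @ X @ A2.T).reshape(-1)

assert np.allclose(naive, implicit)
```

The same identity is what makes it plausible to beat the cost of forming the product: the factors A1, A2 are touched separately, never the full n1n2 × d1d2 matrix.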
We note that sketching itself is a powerful tool for compressing extremely high dimensional data, and has been used in a number of tensor related problems, e.g., [SWZ16, LHW17, DSSW18, SWZ19b, AKK+20].

For least squares regression, the algorithm of [DSSW18] achieves O(∑_{i=1}^q nnz(Ai) + nnz(b) + poly(d/ε)) time, where nnz(C) for a matrix C denotes the number of non-zero entries of C. Note that the focus is on the over-constrained regression setting, where ni ≫ di for each i, and so the goal is to have a small running time dependence on the ni's. We remark that over-constrained regression has been the focus of a large body of work over the past decade, which primarily attempts to design fast regression algorithms in the big data (large sample size) regime; see, e.g., [Mah11, Woo14] for surveys.

Observe that explicitly forming the matrix A1 ⊗ ··· ⊗ Aq would take ∏_{i=1}^q nnz(Ai) time, which can be as large as ∏_{i=1}^q ni di, and so the results of [DSSW18] offer a large computational advantage. Unfortunately, since b ∈ R^{n1 n2···nq}, we can have nnz(b) = ∏_{i=1}^q ni, and therefore nnz(b) is likely to be the dominant term in the running time. This leaves open the question of whether it is possible to solve this problem in time sub-linear in nnz(b), with a dominant term of O(∑_{i=1}^q nnz(Ai)).

For least absolute deviation regression, the bounds of [DSSW18] are still an improvement over computing A1 ⊗ ··· ⊗ Aq, though worse than their bounds for least squares regression. The authors focus on q = 2 and the special case n = n1 = n2. Here, they obtain a running time of O(n^{3/2} poly(d1 d2/ε) + nnz(b)).⁶ This leaves open the question of whether an input-sparsity O(nnz(A1) + nnz(A2) + nnz(b) + poly(d1 d2/ε)) time algorithm exists.

All-Pairs Regression In this work, we also study the related all-pairs regression problem.
Given A ∈ R^{n×d}, b ∈ R^n, the goal is to approximately solve the ℓp regression problem min_x ∥Āx − b̄∥p, where Ā ∈ R^{n²×d} is the matrix formed by taking all pairwise differences of the rows of A (and b̄ is defined similarly). For p = 1, this is known as the rank regression estimator, which has a long history in statistics. It is closely related to the renowned Wilcoxon rank test [WL09], and enjoys the desirable property of being robust, with substantial efficiency gains with respect to heavy-tailed random errors, while maintaining high efficiency for Gaussian errors [WKL09, WL09, WPB+18, Wan19a]. In many ways, its properties are more desirable in practice than those of the Huber M-estimator [WPB+18, Wan19b]. Recently, the all-pairs loss function was also used by [WPB+18] as an alternative approach to overcoming the challenges of tuning parameter selection for the Lasso algorithm. However, the rank regression estimator is computationally intensive to compute, even for moderately sized data, since the standard procedure (for p = 1) is to solve a linear program with O(n²) constraints. In this work, we demonstrate the first highly efficient algorithm for this estimator.

⁶We remark that while the nnz(b) term is not written in the Theorem of [DSSW18], their approach of leverage score sampling from a well-conditioned basis requires one to sample from a well-conditioned basis of [A1 ⊗ A2, b] for a subspace embedding. As stated, their algorithm only sampled from [A1 ⊗ A2]. To fix this omission, their algorithm would require an additional nnz(b) time to leverage score sample from the augmented matrix.

Low-Rank Approximation Finally, in addition to regression, we extend our techniques to the Low Rank Approximation (LRA) problem. Here, given a large data matrix A, the goal is to find a low rank matrix B which well-approximates A.
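For concreteness (our own NumPy illustration, not code from the paper): under the Frobenius norm, the best rank-k approximation of any matrix is given by its truncated SVD, by the Eckart–Young theorem, and its error is determined by the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 20, 8, 3
A = rng.standard_normal((n, d))

# Best rank-k approximation in Frobenius norm: keep the top-k singular triples.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * s[:k] @ Vt[:k]

assert np.linalg.matrix_rank(B) == k
opt_err = np.linalg.norm(A - B)
# The error equals the norm of the discarded singular values...
assert np.isclose(opt_err, np.sqrt((s[k:] ** 2).sum()))
# ...and no other rank-k matrix does better (here: a random rank-k candidate).
cand = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
assert opt_err <= np.linalg.norm(A - cand) + 1e-9
```

The expensive step is the SVD of the full matrix; the sketching algorithms discussed below aim to avoid ever touching the full matrix when it has Kronecker structure.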
LRA is useful in numerous applications, such as compressing massive datasets to their primary components for storage, denoising, and fast matrix-vector products. Thus, designing fast algorithms for approximate LRA has become a large and highly active area of research; see [Woo14] for a survey. For an incomplete list of recent work using sketching techniques for LRA, see [CW13, MM13, NN13, BW14, CW15b, CW15a, RSW16, BWZ16, SWZ17, MW17, CGK+17, LHW17, SWZ18, BW18, SWZ19a, SWZ19b, SWZ19c, BBB+19, IVWW19] and the references therein.

Motivated by the importance of LRA, we initiate the study of low-rank approximation of Kronecker product matrices. Given q matrices A1, ..., Aq where Ai ∈ R^{ni×di}, ni ≫ di, and A = ⊗_{i=1}^q Ai, the goal is to output a rank-k matrix B ∈ R^{n×d} such that ∥B − A∥²_F ≤ (1 + ε) OPT_k, where OPT_k is the cost of the best rank-k approximation, n = n1···nq, and d = d1···dq. Here ∥A∥²_F = ∑_{i,j} A²_{i,j}. The fastest general purpose algorithms for this problem run in time O(nnz(A) + poly(dk/ε)) [CW13]. However, as in regression, if A = ⊗_{i=1}^q Ai, we have nnz(A) = ∏_{i=1}^q nnz(Ai), which grows very quickly. Instead, one might hope to obtain a running time of O(∑_{i=1}^q nnz(Ai) + poly(dk/ε)).

1.1 Our Contributions

Our main contribution is an input sparsity time (1 + ε)-approximation algorithm for Kronecker product regression for every p ∈ [1, 2] and q ≥ 2. Given Ai ∈ R^{ni×di} for i = 1, ..., q, and b ∈ R^n where n = ∏_{i=1}^q ni, together with an accuracy parameter ε ∈ (0, 1/2) and failure probability δ > 0, the goal is to output a vector x′ ∈ R^d, where d = ∏_{i=1}^q di, such that ∥(A1 ⊗ ··· ⊗ Aq)x′ − b∥p ≤ (1 + ε) min_x ∥(A1 ⊗ ··· ⊗ Aq)x − b∥p holds with probability at least 1 − δ. For p = 2, our algorithm runs in Õ(∑_{i=1}^q nnz(Ai) + poly(dδ⁻¹/ε)) time.⁷ Notice that this is sub-linear in the input size, since it does not depend on nnz(b). For p < 2, the running time is Õ((∑_{i=1}^q nnz(Ai) + nnz(b) + poly(d/ε)) log(1/δ)).

Observe that in both cases, this running time is significantly faster than the time to write down A1 ⊗ ··· ⊗ Aq. For p = 2, up to logarithmic factors, the running time is the same as the time required to simply read each of the Ai. Moreover, in the setting p < 2, q = 2 and n1 = n2 considered in [DSSW18], our algorithm offers a substantial improvement over their running time of O(n^{3/2} poly(d1 d2/ε)). We empirically evaluate our Kronecker product regression algorithm on exactly the same datasets as those used in [DSSW18]. For p ∈ {1, 2}, the accuracy of our algorithm is nearly the same as that of [DSSW18], while the running time is significantly faster.

For the all-pairs (or rank) regression problem, we first note that for A ∈ R^{n×d}, one can rewrite Ā ∈ R^{n²×d} as the difference of Kronecker products Ā = A ⊗ 1n − 1n ⊗ A, where 1n ∈ R^n is the all ones vector. Since Ā is not a Kronecker product itself, our earlier techniques for Kronecker product regression are not directly applicable.
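This rewriting is easy to check numerically. The NumPy sketch below (ours, not the paper's code) confirms that A ⊗ 1n − 1n ⊗ A contains exactly the pairwise row differences of A; note that the exact sign and row-ordering convention depends on how the n² rows are indexed, which does not affect the regression objective since every ordered pair of rows appears.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
A = rng.standard_normal((n, d))
ones = np.ones((n, 1))

# A kron 1_n repeats each row of A n times; 1_n kron A tiles A n times.
A_bar = np.kron(A, ones) - np.kron(ones, A)
assert A_bar.shape == (n * n, d)

# With NumPy's ordering, row i + j*n of the difference is A[j] - A[i],
# so every pairwise row difference appears exactly once.
for i in range(n):
    for j in range(n):
        assert np.allclose(A_bar[i + j * n], A[j] - A[i])
```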
Therefore, we utilize new ideas, in addition to careful sketching techniques, to obtain an Õ(nnz(A) + poly(d/ε)) time algorithm for p ∈ [1, 2], which improves substantially on the O(n²d) time required to even compute Ā, by a factor of at least n.

Our main technical contribution, for both our ℓp regression algorithm and the rank regression problem, is a novel and highly efficient ℓp sampling algorithm. Specifically, for the rank regression problem we demonstrate, for a given x ∈ R^d, how to independently sample s entries of the vector Āx = y ∈ R^{n²} from the ℓp distribution (|y1|^p/∥y∥p^p, ..., |y_{n²}|^p/∥y∥p^p) in Õ(nd + poly(ds)) time. For the ℓp regression problem, we demonstrate the same result when y = (A1 ⊗ ··· ⊗ Aq)x − b ∈ R^{n1···nq}, in time Õ(∑_{i=1}^q nnz(Ai) + nnz(b) + poly(ds)). This result allows us to sample a small number of rows of the input to use in our sketch. Our algorithm draws from a large number of disparate sketching techniques, such as the dyadic trick for quickly finding heavy hitters [CM05, KNPW11, LNNT16, NS19], and the precision sampling framework from the streaming literature [AKO11].

For the Kronecker Product Low-Rank Approximation (LRA) problem, we give an input sparsity O(∑_{i=1}^q nnz(Ai) + poly(dk/ε))-time algorithm which computes a rank-k matrix B such that ∥B − ⊗_{i=1}^q Ai∥²_F ≤ (1 + ε) min_{rank-k B′} ∥B′ − ⊗_{i=1}^q Ai∥²_F. Note again that the dominant term ∑_{i=1}^q nnz(Ai) is substantially smaller than the nnz(A) = ∏_{i=1}^q nnz(Ai) time required to write down the Kronecker product A, which is also the running time of state-of-the-art general purpose LRA algorithms [CW13, MM13, NN13]. Thus, our results demonstrate that substantially faster algorithms for approximate LRA are possible for inputs with a Kronecker product structure.

Finally, motivated by [VL00], we use our techniques to solve the low-trank approximation problem, where we are given an arbitrary matrix A ∈ R^{n^q×n^q}, and the goal is to output a trank-k matrix B ∈ R^{n^q×n^q} such that ∥B − A∥F is minimized. Here, the trank of a matrix B is the smallest integer k such that B can be written as a sum of k matrices, where each matrix is the Kronecker product of q matrices with dimensions n×n. Compressing a matrix A to a low-trank approximation yields many of the same benefits as LRA, such as compact representation, fast matrix-vector products, and fast matrix multiplication, and thus is applicable in many of the settings where LRA is used. Using similar sketching ideas, we provide an O(∑_{i=1}^q nnz(Ai) + poly(d1···dq/ε)) time algorithm for this problem under various loss functions. Our results for low-trank approximation can be found in the full version of this work.

⁷For a function f(n, d, ε, δ), Õ(f) = O(f · poly(log n)).

2 Preliminaries

Notation For a tensor A ∈ R^{n1×n2×n3}, we use ∥A∥p to denote the entry-wise ℓp norm of A, i.e., ∥A∥p = (∑_{i1,i2,i3} |A_{i1,i2,i3}|^p)^{1/p}. For n ∈ N, let [n] = {1, 2, ..., n}. For a matrix A, let A_{i,*} denote the i-th row of A, and A_{*,j} the j-th column. For a, b ∈ R and ε ∈ (0, 1), we write a = (1 ± ε)b to denote (1 − ε)b ≤ a ≤ (1 + ε)b.
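The low-trank problem above generalizes a classical observation associated with [VL00]: for q = 2, after a fixed rearrangement of entries, a Kronecker product becomes a rank-1 matrix, so a best trank-k approximation reduces to a best rank-k approximation of the rearranged matrix. The following NumPy check of the rearrangement identity is our own illustration; the reshape indices encode our assumption of row-major block ordering.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p, q = 3, 2, 2, 3
B = rng.standard_normal((m, n))
C = rng.standard_normal((p, q))

def rearrange(A, m, n, p, q):
    """Map an (m*p x n*q) matrix A to R(A) of shape (m*n x p*q), whose
    (i*n + j)-th row is the row-major flattening of block
    A[i*p:(i+1)*p, j*q:(j+1)*q]."""
    return A.reshape(m, p, n, q).transpose(0, 2, 1, 3).reshape(m * n, p * q)

# R(B kron C) is exactly the rank-1 outer product vec(B) vec(C)^T,
# so ||A - B kron C||_F = ||R(A) - vec(B) vec(C)^T||_F for any A.
R = rearrange(np.kron(B, C), m, n, p, q)
assert np.allclose(R, np.outer(B.ravel(), C.ravel()))
assert np.linalg.matrix_rank(R) == 1
```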
We now define the various sketching matrices used by our algorithms.

Stable Transformations We will utilize the well-known p-stable distribution Dp (see [Nol07, Ind06] for further discussion), which exists for p ∈ (0, 2]. For p ∈ (0, 2), X ∼ Dp is defined by its characteristic function E_X[exp(√−1 · tX)] = exp(−|t|^p), and can be efficiently generated to a fixed precision [Nol07, KNW10]. For p = 2, D2 is just the standard Gaussian distribution, and for p = 1, D1 is the Cauchy distribution. The distribution Dp has the property that if z1, ..., zn ∼ Dp are i.i.d. and a ∈ R^n, then ∑_{i=1}^n zi ai ∼ z∥a∥p, where ∥a∥p = (∑_{i=1}^n |ai|^p)^{1/p} and z ∼ Dp. This property will allow us to utilize sketches with entries independently drawn from Dp to preserve the ℓp norm.

Definition 2.1 (Dense p-stable Transform, [CDMI+13, SW11]). Let p ∈ [1, 2]. Let S = σ · C ∈ R^{m×n}, where σ is a scalar, and each entry of C ∈ R^{m×n} is chosen independently from Dp.

We will also need a sparse version of the above.

Definition 2.2 (Sparse p-Stable Transform, [MM13, CDMI+13]). Let p ∈ [1, 2]. Let Π = σ · SC ∈ R^{m×n}, where σ is a scalar, S ∈ R^{m×n} has each column chosen independently and uniformly from the m standard basis vectors of R^m, and C ∈ R^{n×n} is a diagonal matrix with diagonal entries chosen independently from the standard p-stable distribution. For any matrix A ∈ R^{n×d}, ΠA can be computed in O(nnz(A)) time.

One nice property of p-stable transformations is that they provide low-distortion ℓp embeddings.

Lemma 2.3 (Theorem 1.4 of [WW19]; see also Theorems 2 and 4 of [MM13] for earlier work⁸). Fix A ∈ R^{n×d}, and let S ∈ R^{k×n} be a sparse or dense p-stable transform for p ∈ [1, 2), with k = Θ(d²/δ).
Then with probability 1 − δ, for all x ∈ R^d:

∥Ax∥p ≤ ∥SAx∥p ≤ O(d log d)∥Ax∥p

We simply call a matrix S ∈ R^{k×n} a low distortion ℓp embedding for A ∈ R^{n×d} if it satisfies the above inequality for all x ∈ R^d.

Leverage Scores & Well-Conditioned Bases We now introduce the notions of ℓ2 leverage scores and well-conditioned bases for a matrix A ∈ R^{n×d}.

Definition 2.4 (ℓ2-Leverage Scores, [Woo14, BSS12]). Given a matrix A ∈ R^{n×d}, let A = Q · R denote the QR factorization of the matrix A. For each i ∈ [n], we define σi = ∥(AR⁻¹)_i∥²₂ / ∥AR⁻¹∥²_F, where (AR⁻¹)_i ∈ R^d is the i-th row of the matrix AR⁻¹ ∈ R^{n×d}. We say that σ ∈ R^n is the ℓ2 leverage score vector of A.

Definition 2.5 ((ℓp, α, β) Well-Conditioned Basis, [Cla05]). Given a matrix A ∈ R^{n×d}, we say U ∈ R^{n×d} is an (ℓp, α, β) well-conditioned basis for the column span of A if the columns of U span the columns of A, and if for any x ∈ R^d, we have α∥x∥p ≤ ∥Ux∥p ≤ β∥x∥p, where α ≤ 1 ≤ β. If β/α = d^{O(1)}, then we simply say that U is an ℓp well-conditioned basis for A.

Fact 2.6 ([WW19, MM13]). Let A ∈ R^{n×d}, and let SA ∈ R^{k×d} be a low distortion ℓp embedding of A (see Lemma 2.3), where k = O(d²/δ). Let SA = QR be the QR decomposition of SA. Then AR⁻¹ is an ℓp well-conditioned basis with probability 1 − δ.

⁸In discussion with the authors of these works, the original O((d log d)^{1/p}) distortion factors stated in these papers should be replaced with O(d log d); as we do not optimize the poly(d) factors in our analysis, this does not affect our bounds.

Algorithm 1 Our ℓ2 Kronecker Product Regression Algorithm
1: procedure ℓ2 KRONECKER REGRESSION({Ai, ni, di}_{i∈[q]}, b)
2:   d ← ∏_{i=1}^q di, n ← ∏_{i=1}^q ni, m ← Θ(d/(δε²)).
3:   Compute approximate leverage scores σ̃i(Aj) for all j ∈ [q], i ∈ [nj].
4:   Construct a diagonal leverage score sampling matrix D ∈ R^{n×n} with m non-zero entries.   ▷ Theorem 3.1
5:   Compute (via the pseudo-inverse) x̂ = argmin_{x∈R^d} ∥D(A1 ⊗ A2 ⊗ ··· ⊗ Aq)x − Db∥2.
6:   return x̂
7: end procedure

Algorithm 2 Our ℓp Kronecker Product Regression Algorithm, 1 ≤ p < 2
1: procedure O(1)-APPROXIMATE ℓp REGRESSION({Ai, ni, di}_{i∈[q]}, b)
2:   d ← ∏_{i=1}^q di, n ← ∏_{i=1}^q ni.
3:   for i = 1, ..., q do
4:     si ← O(q di²).
5:     Generate a sparse p-stable transform Si ∈ R^{si×ni}.   ▷ Definition 2.2
6:     Take the QR factorization SiAi = QiRi to obtain Ri ∈ R^{di×di}.   ▷ Lemma 2.3, Fact 2.6
7:     Let Z ∈ R^{di×τ} be a dense p-stable transform for τ = Θ(log(n)).   ▷ Definition 2.1
8:     for j = 1, ..., ni do
9:       a_{i,j} ← median_{η∈[τ]} (|(Ai Ri⁻¹ Z)_{j,η}|/θp)^p, where θp is the median of Dp.
10:     end for
11:   end for
12:   Define a distribution D = {q′_1, ..., q′_n} by q′_{∑_{i=1}^q ji ∏_{l=1}^{i−1} nl} = ∏_{i=1}^q a_{i,ji}.
13:   Let Π ∈ R^{n×n} denote a diagonal sampling matrix, where Π_{i,i} = 1/qi^{1/p} with probability qi = min{1, r1 q′_i} and 0 otherwise, where r1 = Θ(d³/ε²).
14:   Let x′ ∈ R^d denote the solution of min_{x∈R^d} ∥Π(A1 ⊗ A2 ⊗ ··· ⊗ Aq)x − Πb∥p.   ▷ [DDH+09]
15:   return x′
16: end procedure
17: procedure (1 + ε)-APPROXIMATE ℓp REGRESSION(x′ ∈ R^d)
18:   Implicitly define ρ = (A1 ⊗ A2 ⊗ ··· ⊗ Aq)x′ − b ∈ R^n.
19:   Compute a diagonal sampling matrix Σ ∈ R^{n×n} such that Σ_{i,i} = 1/αi^{1/p} with probability αi = min{1, max{qi, r2|ρi|^p/∥ρ∥p^p}}, where r2 = Θ(d³/ε³).
20:   Compute x̂ = argmin_{x∈R^d} ∥Σ(A1 ⊗ A2 ⊗ ··· ⊗ Aq)x − Σb∥p (via convex optimization methods, e.g., [BCLL18, AKPS19, LSZ19]).   ▷ Theorem 3.2
21:   return x̂
22: end procedure

3 Kronecker Product Regression

We first introduce our algorithm for p = 2. Our algorithm for 1 ≤ p < 2 is given in Section 3.1. Our regression algorithm for p = 2 is formally stated in Algorithm 1.
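A structural fact that makes leverage-score sampling tractable here is that ℓ2 leverage scores multiply across Kronecker factors: since the orthogonal projection onto the column span of A1 ⊗ A2 is the Kronecker product of the factors' projections, each leverage score of A1 ⊗ A2 is a product of one leverage score of A1 and one of A2. A minimal NumPy check (our sketch, not the paper's implementation; we use the unnormalized scores, i.e., Definition 2.4 without the ∥AR⁻¹∥²_F denominator):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, d1, n2, d2 = 8, 2, 7, 3
A1 = rng.standard_normal((n1, d1))
A2 = rng.standard_normal((n2, d2))

def leverage_scores(A):
    # sigma_i = ||(A R^{-1})_i||_2^2 with A = QR; equivalently the squared
    # row norms of Q, i.e., the diagonal of the projection matrix Q Q^T.
    Q, _ = np.linalg.qr(A)
    return (Q ** 2).sum(axis=1)

# Leverage scores of A1 kron A2 are the pairwise products of the factors'
# leverage scores, so they never require forming the big matrix.
direct = leverage_scores(np.kron(A1, A2))
factored = np.kron(leverage_scores(A1), leverage_scores(A2))
assert np.allclose(direct, factored)
```

Normalizing either side by its sum (which equals d1·d2, the rank) recovers the convention of Definition 2.4 without affecting the product structure.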
Recall that our input design matrix is A = ⊗_{i=1}^q Ai, where Ai ∈ R^{ni×di}, and we are also given b ∈ R^{n1···nq}. Let n = ∏_{i=1}^q ni and d = ∏_{i=1}^q di. The crucial insight of the algorithm is that one can approximately compute the leverage scores of A given only good approximations to the leverage scores of each Ai. Applying this fact gives an efficient algorithm for sampling rows of A with probability proportional to their leverage scores. Following standard arguments, we will show that by restricting the regression problem to the sampled rows, we can obtain our desired (1 ± ε)-approximate solution efficiently. Our main theorem for this section is stated below.

Theorem 3.1 (Kronecker product ℓ2 regression). Let D ∈ R^{n×n} be the diagonal row sampling matrix generated in Algorithm 1, with m = Θ(d/(δε²)) non-zero entries, and let A = ⊗_{i=1}^q Ai, where Ai ∈ R^{ni×di}, and b ∈ R^n, where n = ∏_{i=1}^q ni and d = ∏_{i=1}^q di. Let x̂ = argmin_{x∈R^d} ∥DAx − Db∥2, and let x* = argmin_{x∈R^d} ∥Ax − b∥2. Then with probability 1 − δ, we have ∥Ax̂ − b∥2 ≤ (1 + ε)∥Ax* − b∥2. Moreover, the total running time required to compute x̂ is Õ(∑_{i=1}^q nnz(Ai) + (dq/(δε))^{O(1)}).⁹

3.1 Kronecker Product ℓp Regression

We now consider ℓp regression for 1 ≤ p < 2. Our algorithm is stated formally in Algorithm 2. Our main theorem is as follows.

Theorem 3.2 (Main result, ℓp (1 + ε)-approximate regression). Fix 1 ≤ p < 2. Then for any constant q = O(1), given matrices A1, A2, ..., Aq, where Ai ∈ R^{ni×di}, let n = ∏_{i=1}^q ni and d = ∏_{i=1}^q di. Let x̂ ∈ R^d be the output of Algorithm 2. Then

∥(A1 ⊗ A2 ⊗ ··· ⊗ Aq)x̂ − b∥p ≤ (1 + ε) min_{x∈R^d} ∥(A1 ⊗ A2 ⊗ ··· ⊗ Aq)x − b∥p

holds with probability at least 1 − δ. In addition, our algorithm takes Õ((∑_{i=1}^q nnz(Ai) + nnz(b) + poly(d log(1/δ)/ε)) log(1/δ)) time to output x̂ ∈ R^d.

Our high-level approach follows that of [DDH+09]. Namely, we first obtain a vector x′ which is an O(1)-approximation to the optimal solution. This is done by first constructing (implicitly) a matrix U ∈ R^{n×d} that is a well-conditioned basis for the design matrix A1 ⊗ ··· ⊗ Aq. We then efficiently sample rows of U with probability proportional to their ℓp norm (which must be done without even explicitly computing most of U). We then use the results of [DDH+09] to demonstrate that solving the regression problem constrained to these sampled rows gives a solution x′ ∈ R^d such that ∥(A1 ⊗ ··· ⊗ Aq)x′ − b∥p ≤ 8 min_{x∈R^d} ∥(A1 ⊗ ··· ⊗ Aq)x − b∥p.

We define the residual error ρ = (A1 ⊗ ··· ⊗ Aq)x′ − b ∈ R^n of x′. Our goal is to sample additional rows i ∈ [n] with probability proportional to their residual error |ρi|^p/∥ρ∥p^p, and solve the regression problem restricted to the sampled rows. However, we cannot afford to compute even a small fraction of the entries in ρ (even when b is dense, and certainly not when b is sparse). So to
So to\ncarry out this sampling ef\ufb01ciently, we design an involved, multi-part sketching and sampling routine.\nThis sampling technique is the main technical contribution of this section, and relies on a number of\ntechniques, such as the Dyadic trick for quickly \ufb01nding heavy hitters from the streaming literature,\nand a careful pre-processing step to avoid a poly(d)-blow up in the runtime. Given these samples,\n\nwe can obtain the solution(cid:98)x after solving the regression problem on the sampled rows, and the fact\n\nthat this gives a (1 + \u0001) approximate solution will follow from Theorem 6 of [DDH+09].\n\n4 All-Pairs Regression\nGiven a matrix A \u2208 Rn\u00d7d and b \u2208 Rn, let \u00afA \u2208 Rn2\u00d7d be the matrix such that \u00afAi+(j\u22121)n,\u2217 =\nAi,\u2217 \u2212 Aj,\u2217, and let \u00afb \u2208 Rn2 be de\ufb01ned by \u00afbi+(j\u22121)n = bi \u2212 bj. Thus, \u00afA consists of all pairwise\ndifferences of rows of A, and \u00afb consists of all pairwise differences of rows of b,. The (cid:96)p all pairs\nregression problem on the inputs A, b is to solve minx\u2208Rd (cid:107) \u00afAx \u2212 \u00afb(cid:107)p.\n\n9We remark that the exponent of d in the runtime can be bounded by 3. To see this, \ufb01rst note that the main\ncomputation taking place is the leverage score computation. For a q input matrices, we need to generate the\nleverage scores to precision \u0398(1/q), and the complexity to achieve this is O(d3/q4) by the results of [CW13].\nThe remaining computation is to compute the pseudo-inverse of a d/\u00012 \u00d7 d matrix, which requires O(d3/\u00012)\ntime, so the additive term in the Theorem can be replaced with O(d3/\u00012 + d3/q4).\n\n6\n\n\fFirst note that this problem has a close connection to Kronecker product regression. Namely, the\nmatrix \u00afA can be written \u00afA = A \u2297 1n \u2212 1n \u2297 A, where 1n \u2208 Rn is the all 1\u2019s vector. Similarly,\n\u00afb = b \u2297 1n \u2212 1n \u2297 b. 
For simplicity, we now drop the subscript and write 1 = 1n.

Our algorithm is given formally in Algorithm 3. The main technical step takes place on line 7, where we sample rows of the matrix (F ⊗ 1 − 1 ⊗ F)R⁻¹ with probability proportional to their ℓp norms. This is done by an involved sampling procedure described in the full version of this work. We summarize the guarantee of our algorithm in the following theorem.

Theorem 4.1. Given A ∈ R^{n×d} and b ∈ R^n, for p ∈ [1, 2], let Ā = A ⊗ 1 − 1 ⊗ A ∈ R^{n²×d} and b̄ = b ⊗ 1 − 1 ⊗ b ∈ R^{n²}. Then there is an algorithm that outputs x̂ ∈ R^d such that with probability 1 − δ we have ∥Āx̂ − b̄∥p ≤ (1 + ε) min_{x∈R^d} ∥Āx − b̄∥p. The running time is Õ(nnz(A) + (d/(εδ))^{O(1)}).

Algorithm 3 Our All-Pairs Regression Algorithm
1: procedure ALL-PAIRS REGRESSION(A, b)
2:   F ← [A, b] ∈ R^{n×(d+1)}, r ← poly(d/ε).
3:   Generate sparse p-stable transforms S1, S2 ∈ R^{k×n} for k = poly(d/(εδ)).
4:   Sketch (S1 ⊗ S2)(F ⊗ 1 − 1 ⊗ F).
5:   Compute the QR decomposition: (S1 ⊗ S2)(F ⊗ 1 − 1 ⊗ F) = QR.
6:   Let M = (F ⊗ 1 − 1 ⊗ F)R⁻¹, and σi = ∥M_{i,*}∥p^p/∥M∥p^p.
7:   Obtain a row sampling diagonal matrix Π such that Π_{i,i} = 1/q̃i^{1/p} independently with probability qi ≥ min{1, rσi}, where q̃i = (1 ± ε²)qi.
8:   return x̂, where x̂ = argmin_{x∈R^d} ∥Π(Āx − b̄)∥p.
9: end procedure

5 Low Rank Approximation of Kronecker Product Matrices

We now consider low rank approximation of Kronecker product matrices.
Given q matrices A1, A2, . . . , Aq, where Ai ∈ R^{ni×di}, the goal is to output a rank-k matrix B ∈ R^{n×d}, where n = ∏_{i=1}^q ni and d = ∏_{i=1}^q di, such that ‖B − A‖_F ≤ (1 + ε) OPT_k, where OPT_k = min_{rank-k A′} ‖A′ − A‖_F and A = ⊗_{i=1}^q Ai. Our approach employs the Count-Sketch distribution of matrices [CW13, Woo14]. A Count-Sketch matrix S is generated as follows. Each column of S contains exactly one non-zero entry. The non-zero entry is placed in a uniformly random row, and its value is either 1 or −1, chosen uniformly at random.

Our algorithm is as follows. We sample q independent Count-Sketch matrices S1, . . . , Sq, with Si ∈ R^{ki×ni}, where k1 = · · · = kq = Θ(qk²/ε²). We then compute M = (⊗_{i=1}^q Si)A, and let U ∈ R^{k×d} be the matrix of the top k right singular vectors of M. Finally, we output B = AU^⊤U in factored form (as q + 1 separate matrices, A1, A2, . . . , Aq, U) as the desired rank-k approximation to A. The following theorem demonstrates the correctness of this algorithm.

Theorem 5.1. For any constant q ≥ 2, there is an algorithm which runs in time O((∑_{i=1}^q nnz(Ai) + d · poly(k/ε)) log(1/δ)) and outputs a rank-k matrix B in factored form such that ‖B − A‖_F ≤ (1 + ε) OPT_k with probability 1 − δ.

6 Numerical Simulations

In our numerical simulations, we compare our algorithms to two baselines: (1) brute force, i.e., directly solving the regression problem without sketching, and (2) the sketching-based methods developed in [DSSW18]. All methods were implemented in Matlab on a Linux machine. We remark that in our implementation, we simplified some of the steps of our theoretical algorithm, such as the residual sampling algorithm used in Alg. 2.
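As a concrete illustration, the sketch-and-solve procedure of Theorem 5.1 can be prototyped in a few lines; a toy NumPy sketch for q = 2, with hypothetical dimensions and sketch sizes (an illustration of the idea, not the paper's Matlab implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def count_sketch(k, n):
    # Dense Count-Sketch (illustration only): one +/-1 per column, in a random row.
    S = np.zeros((k, n))
    S[rng.integers(0, k, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

n1, d1, n2, d2, k = 40, 4, 30, 3, 2   # toy sizes; k is the target rank
ki = 25                                # sketch rows per factor (heuristic choice here)
A1 = rng.standard_normal((n1, d1))
A2 = rng.standard_normal((n2, d2))

# M = (S1 kron S2) A, computed as (S1 A1) kron (S2 A2) by the mixed-product
# property -- the n1*n2 x d1*d2 matrix A is never formed in this step.
S1, S2 = count_sketch(ki, n1), count_sketch(ki, n2)
M = np.kron(S1 @ A1, S2 @ A2)

# U holds the top-k right singular vectors of M; the output is B = A U^T U.
U = np.linalg.svd(M, full_matrices=False)[2][:k]

A = np.kron(A1, A2)                    # formed here only to verify the output
B = A @ U.T @ U
err = np.linalg.norm(B - A)            # Frobenius error of the rank-k output
opt = np.sqrt((np.linalg.svd(A, compute_uv=False)[k:] ** 2).sum())

assert np.linalg.matrix_rank(B) <= k   # B has rank at most k
assert err >= opt - 1e-8               # no rank-k matrix can beat OPT_k
```

In the real algorithm B is of course kept in factored form (A1, . . . , Aq, U) rather than materialized.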
We found that in practice, even with these simplifications, our algorithms already demonstrated substantial improvements over prior work.

Following the experimental setup in [DSSW18], we generate matrices A1 ∈ R^{300×15}, A2 ∈ R^{300×15}, and b ∈ R^{300²}, such that all entries of A1, A2, b are sampled i.i.d. from a normal distribution. Note that A1 ⊗ A2 ∈ R^{90000×225}. We define Tbf to be the time of the brute force algorithm, Told to be the time of the algorithms from [DSSW18], and Tours to be the time of our algorithms. We are interested in the time ratios with respect to the brute force algorithm and the algorithms from [DSSW18], defined as rt = Tours/Tbf and r′t = Tours/Told. The goal is to show that our methods are significantly faster than both baselines, i.e., that both rt and r′t are significantly less than 1.

Table 1: Results for ℓ2- and ℓ1-regression with respect to different sketch sizes m.

        m      m/n    re      r′e     rt     r′t
  ℓ2   8100    .09   2.48%   1.51%   0.05   0.22
       12100   .13   1.55%   0.98%   0.06   0.24
       16129   .18   1.20%   0.71%   0.07   0.08
  ℓ1   2000    .02   7.72%   9.10%   0.02   0.59
       4000    .04   4.26%   4.00%   0.03   0.75
       8000    .09   1.85%   1.6%    0.07   0.83
       12000   .13   1.29%   0.99%   0.09   0.79
       16000   .18   1.01%   0.70%   0.14   0.90

We are also interested in the quality of the solutions computed from our algorithms, compared to the brute force method and the method from [DSSW18]. Denote the solution from our method as xour, the solution from the brute force method as xbf, and the solution from the method in [DSSW18] as xold.
We define the relative residual percentages re and r′e to be:

re = 100 · |‖A xour − b‖ − ‖A xbf − b‖| / ‖A xbf − b‖,    r′e = 100 · |‖A xold − b‖ − ‖A xbf − b‖| / ‖A xbf − b‖,

where A = A1 ⊗ A2. The goal is to show that re is close to zero, i.e., that our approximate solution is comparable to the optimal solution in terms of minimizing the error ‖Ax − b‖.

Throughout the simulations, we use a moderate input matrix size so that we can accommodate the brute force algorithm and compare to the exact solution. We consider varying values of m, where m denotes the size of the sketch (number of rows) used in either the algorithms of [DSSW18] or the algorithms in this paper. We also include a column m/n in the table, which is the ratio between the size of the sketch and the number of rows of the original matrix A1 ⊗ A2; note that in this case n = 90000.

Simulation Results for ℓ2. We first compare our algorithm, Alg. 1, to the baselines under the ℓ2 norm. In our implementation, minx ‖Ax − b‖2 is solved by the Matlab backslash operator A\b. Table 1 summarizes the comparison between our approach and the two baselines. The numbers are averaged over 5 random trials. First of all, we notice that our method in general provides slightly less accurate solutions than the method in [DSSW18], i.e., re > r′e in this case. However, compared to the brute force algorithm, our method still generates relatively accurate solutions, especially when m is large: e.g., the relative residual percentage w.r.t. the optimal solution is around 1% when m ≈ 16000.
On the other hand, as suggested by our theoretical improvements for ℓ2, our method is significantly faster than the method from [DSSW18], consistently across all sketch sizes m. Note that when m ≈ 16000, our method is around 10 times faster than the method in [DSSW18]. For small m, our approach is around 5 times faster than the method in [DSSW18].

Simulation Results for ℓ1. We compare our algorithm, Alg. 2, to two baselines under the ℓ1-norm. The first is a brute-force solution, and the second is the algorithm from [DSSW18]. For minx ‖Ax − b‖1, the brute-force solution is obtained via a linear programming solver in Gurobi [GO16]. Table 1 summarizes the comparison of our approach to the two baselines under the ℓ1-norm. The statistics are averaged over 5 random trials. Compared to the brute force algorithm, our method is consistently around 10 times faster, while in general we have a relative residual percentage around 1%. Compared to the method from [DSSW18], our approach is consistently faster (around 1.3 times faster). Note that our method has slightly higher accuracy than the one from [DSSW18] when the sketch size is small, but slightly worse accuracy when the sketch size increases.

Acknowledgments
The authors would like to thank Lan Wang and Ruosong Wang for a helpful discussion. The authors would like to thank Lan Wang for introducing the All-Pairs Regression problem to us.

References

[AKK+20] Thomas D. Ahle, Michael Kapralov, Jakob B. T. Knudsen, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In SODA. Merged version of https://arxiv.org/pdf/1909.01410.pdf and https://arxiv.org/pdf/1909.01821.pdf, 2020.

[AKO11] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling.
In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pages 363–372. IEEE, https://arxiv.org/pdf/1011.1263, 2011.

[AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for ℓp-norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1405–1424. SIAM, 2019.

[BBB+19] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P. Woodruff. A PTAS for ℓp-low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 747–766. SIAM, 2019.

[BCLL18] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee, and Yuanzhi Li. An homotopy method for ℓp regression provably beyond self-concordance and in input-sparsity time. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1130–1137. ACM, 2018.

[BSS12] Joshua Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721. https://arxiv.org/pdf/0808.0163, 2012.

[BW14] Christos Boutsidis and David P. Woodruff. Optimal CUR matrix decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 353–362. ACM, https://arxiv.org/pdf/1405.7910, 2014.

[BW18] Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices. In Advances in Neural Information Processing Systems, pages 3782–3792, 2018.

[BWZ16] Christos Boutsidis, David P. Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 236–249.
ACM, https://arxiv.org/pdf/1504.06729, 2016.

[CDMI+13] Kenneth L. Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, Xiangrui Meng, and David P. Woodruff. The fast Cauchy transform and faster robust linear regression. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 466–477. Society for Industrial and Applied Mathematics, https://arxiv.org/pdf/1207.4684, 2013.

[CGK+17] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, and David P. Woodruff. Algorithms for ℓp low rank approximation. In ICML. arXiv preprint arXiv:1705.06730, 2017.

[Cla05] Kenneth L. Clarkson. Subgradient and sampling algorithms for ℓ1 regression. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 257–266, 2005.

[CM05] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81–90. https://arxiv.org/pdf/1207.6365, 2013.

[CW15a] Kenneth L. Clarkson and David P. Woodruff. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 310–329. IEEE, https://arxiv.org/pdf/1510.06073, 2015.

[CW15b] Kenneth L. Clarkson and David P. Woodruff. Sketching for m-estimators: A unified approach to robust regression. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 921–939. SIAM, 2015.

[DDH+09] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for ℓp regression.
SIAM Journal on Computing, 38(5):2060–2078, 2009.

[DSSW18] Huaian Diao, Zhao Song, Wen Sun, and David P. Woodruff. Sketching for Kronecker product regression and p-splines. In AISTATS, 2018.

[EM06] Paul H. C. Eilers and Brian D. Marx. Multidimensional density smoothing with p-splines. In Proceedings of the 21st International Workshop on Statistical Modelling, 2006.

[GO16] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.

[GVL13] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, 2013.

[Ind06] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307–323, 2006.

[IVWW19] Piotr Indyk, Ali Vakilian, Tal Wagner, and David P. Woodruff. Sample-optimal low-rank approximation of distance matrices. In COLT, 2019.

[KNPW11] Daniel M. Kane, Jelani Nelson, Ely Porat, and David P. Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing (STOC), pages 745–754. ACM, 2011.

[KNW10] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1161–1178. SIAM, 2010.

[LHW17] Xingguo Li, Jarvis Haupt, and David Woodruff. Near optimal sketching of low-rank tensor regression. In Advances in Neural Information Processing Systems, pages 3466–3476, 2017.

[LNNT16] Kasper Green Larsen, Jelani Nelson, Huy L. Nguyễn, and Mikkel Thorup. Heavy hitters via cluster-preserving clustering. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 61–70. IEEE, 2016.

[LSZ19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang.
Solving empirical risk minimization in the current matrix multiplication time. In COLT. https://arxiv.org/pdf/1905.04447.pdf, 2019.

[Mah11] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

[MM13] Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 91–100. ACM, https://arxiv.org/pdf/1210.3135, 2013.

[MW17] Cameron Musco and David P. Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 672–683. IEEE, 2017.

[NN13] Jelani Nelson and Huy L. Nguyễn. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, https://arxiv.org/pdf/1211.1002, 2013.

[Nol07] John P. Nolan. Stable distributions. 2007.

[NS19] Vasileios Nakos and Zhao Song. Stronger L2/L2 compressed sensing; without iterating. In Proceedings of the 51st Annual ACM Symposium on Theory of Computing (STOC), 2019.

[OY05] S. Oh, S. Kwon, and J. Yun. A method for structured linear total least norm on blind deconvolution problem. Applied Mathematics and Computing, 19:151–164, 2005.

[RSW16] Ilya Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.

[SW11] Christian Sohler and David P. Woodruff. Subspace embeddings for the ℓ1-norm with applications. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing (STOC), pages 755–764. ACM, 2011.

[SWZ16] Zhao Song, David P.
Woodruff, and Huan Zhang. Sublinear time orthogonal tensor decomposition. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS) 2016, December 5-10, 2016, Barcelona, Spain, pages 793–801, 2016.

[SWZ17] Zhao Song, David P. Woodruff, and Peilin Zhong. Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.

[SWZ18] Zhao Song, David P. Woodruff, and Peilin Zhong. Towards a zero-one law for entrywise low rank approximation. arXiv preprint arXiv:1811.01442, 2018.

[SWZ19a] Zhao Song, David P. Woodruff, and Peilin Zhong. Average case column subset selection for entrywise ℓ1-norm loss. In NeurIPS, 2019.

[SWZ19b] Zhao Song, David P. Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA 2019. https://arxiv.org/pdf/1704.08246, 2019.

[SWZ19c] Zhao Song, David P. Woodruff, and Peilin Zhong. Towards a zero-one law for column subset selection. In NeurIPS, 2019.

[VL92] Charles F. Van Loan. Computational Frameworks for the Fast Fourier Transform, volume 10 of Frontiers in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1992.

[VL00] Charles F. Van Loan. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123(1-2):85–100, 2000.

[VLP93] Charles F. Van Loan and N. Pitsianis. Approximation with Kronecker products. In Linear Algebra for Large Scale and Real-Time Applications (Leuven, 1992), volume 232 of NATO Adv. Sci. Inst. Ser. E Appl. Sci., pages 293–314. Kluwer Acad. Publ., Dordrecht, 1993.

[Wan19a] Lan Wang. A new tuning-free approach to high-dimensional regression. 2019.

[Wan19b] Lan Wang. Personal communication, 2019.

[WKL09] Lan Wang, Bo Kai, and Runze Li.
Local rank inference for varying coefficient models. Journal of the American Statistical Association, 104(488):1631–1645, 2009.

[WL09] Lan Wang and Runze Li. Weighted Wilcoxon-type smoothly clipped absolute deviation method. Biometrics, 65(2):564–571, 2009.

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

[WPB+18] Lan Wang, Bo Peng, Jelena Bradic, Runze Li, and Yunan Wu. A tuning-free robust and efficient approach to high-dimensional regression. Technical report, School of Statistics, University of Minnesota, 2018.

[WW19] Ruosong Wang and David P. Woodruff. Tight bounds for ℓp oblivious subspace embeddings. In SODA, 2019.