{"title": "SPALS: Fast Alternating Least Squares via Implicit Leverage Scores Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 729, "abstract": "Tensor CANDECOMP/PARAFAC (CP) decomposition is a powerful but computationally challenging tool in modern data analytics. In this paper, we show ways of sampling intermediate steps of alternating minimization algorithms for computing low rank tensor CP decompositions, leading to the sparse alternating least squares (SPALS) method. Specifically, we sample the Khatri-Rao product, which arises as an intermediate object during the iterations of alternating least squares. This product captures the interactions between different tensor modes, and forms the main computational bottleneck for solving many tensor related tasks. By exploiting the spectral structures of the matrix Khatri-Rao product, we provide efficient access to its statistical leverage scores. When applied to the tensor CP decomposition, our method leads to the first algorithm that runs in sublinear time per-iteration and approximates the output of deterministic alternating least squares algorithms. Empirical evaluations of this approach show significant speedups over existing randomized and deterministic routines for performing CP decomposition.
On a tensor of size 2.4m by 6.6m by 92k with over 2 billion nonzeros formed by Amazon product reviews, our routine converges in two minutes to the same error as deterministic ALS.", "full_text": "SPALS: Fast Alternating Least Squares via Implicit Leverage Scores Sampling

Dehua Cheng (University of Southern California, dehua.cheng@usc.edu), Richard Peng (Georgia Institute of Technology, rpeng@cc.gatech.edu), Ioakeim Perros (Georgia Institute of Technology, perros@gatech.edu), Yan Liu (University of Southern California, yanliu.cs@usc.edu)

Abstract

Tensor CANDECOMP/PARAFAC (CP) decomposition is a powerful but computationally challenging tool in modern data analytics. In this paper, we show ways of sampling intermediate steps of alternating minimization algorithms for computing low rank tensor CP decompositions, leading to the sparse alternating least squares (SPALS) method. Specifically, we sample the Khatri-Rao product, which arises as an intermediate object during the iterations of alternating least squares. This product captures the interactions between different tensor modes, and forms the main computational bottleneck for solving many tensor related tasks. By exploiting the spectral structures of the matrix Khatri-Rao product, we provide efficient access to its statistical leverage scores. When applied to the tensor CP decomposition, our method leads to the first algorithm that runs in sublinear time per-iteration and approximates the output of deterministic alternating least squares algorithms. Empirical evaluations of this approach show significant speedups over existing randomized and deterministic routines for performing CP decomposition. On a tensor of size 2.4m × 6.6m × 92k with over 2 billion nonzeros formed by Amazon product reviews, our routine converges in two minutes to the same error as deterministic ALS.

1 Introduction

Tensors, a.k.a.
multidimensional arrays, appear frequently in many applications, including spatial-temporal data modeling [40], signal processing [12, 14], deep learning [29] and more. Low-rank tensor decomposition [21] is a fundamental tool for understanding and extracting information from tensor data, and it has been actively studied in recent years. Developing scalable and provable algorithms for most tensor processing tasks is challenging due to the non-convexity of the objective [18, 21, 16, 1]. Especially in the era of big data, scalable low-rank tensor decomposition algorithms that run in nearly linear or even sublinear time in the input data size have become an absolute must for commanding the full power of tensor analytics. For instance, the Amazon review data [24] yield a 2,440,972 × 6,643,571 × 92,626 tensor with 2 billion nonzero entries after preprocessing. Such data sets pose scalability challenges even for the simplest tensor decomposition tasks.

There are multiple well-defined tensor ranks [21]. In this paper, we focus on the tensor CANDECOMP/PARAFAC (CP) decomposition [17, 3], where the low-rank tensor is modeled by a summation over many rank-1 tensors. Due to its simplicity and interpretability, tensor CP decomposition, which finds the best rank-R approximation of the input tensor, often by minimizing the squared loss, has been widely adopted in many applications [21].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The matrix Khatri-Rao product (KRP) captures the interactions between different tensor modes in the CP decomposition, and it is essential for understanding many tensor related tasks. For instance, in the alternating least squares (ALS) algorithm, which has been the workhorse for solving the tensor CP decomposition problem, a compact representation of the KRP can directly reduce the computational cost.
ALS is a simple and parameter-free algorithm that optimizes the target rank-R tensor by updating its factor matrices in a block coordinate descent fashion. In each iteration, the computational bottleneck is solving a least squares regression problem, where the size of the design matrix, a KRP of factor matrices, is n^2 × n for an n × n × n tensor. While least squares regression is one of the most studied problems, solving it exactly requires at least O(n^2) operations [23], which can be larger than the size of the input data for sparse tensors. For instance, the Amazon review data with 2 × 10^9 nonzeros lead to a computational cost on the order of 10^12 per iteration. Exploiting the structure of the KRP can reduce this cost to linear in the input size, which for large-scale applications is still expensive for an iterative algorithm.

An effective way of speeding up such numerical computations is randomization [23, 38], where in the optimal case the computational cost can be uncorrelated with the ambient size of the input data. By exploring the connection between the spectral structure of the design matrix and the KRP of the factor matrices, we provide efficient access to the statistical leverage scores of the design matrix. This allows us to propose the SPALS algorithm, which samples rows of the KRP in a nearly-optimal manner. This near optimality is twofold: 1) the estimates of leverage scores that we use have many tight cases; 2) the operation of sampling a row can be performed efficiently. The latter requirement is far from trivial: note that even when the optimal sampling probability is given, drawing a sample may require O(n^2) preprocessing. Our result on the spectral structure of the design matrix allows us to achieve both criteria simultaneously, leading to the first ALS algorithm with sublinear per-iteration cost and provable approximation guarantees. Our contributions can be summarized as follows:

1.
We show a close connection between the statistical leverage scores of the matrix Khatri-Rao product and the scores of the input matrices. This yields efficient and accurate leverage score estimates for importance sampling;

2. Our algorithm achieves state-of-the-art computational efficiency while provably approximating the ALS algorithm for computing CP tensor decompositions. The running time of each iteration of our algorithm is Õ(nR^3), sublinear in the input size for large tensors;

3. Our theoretical results on the spectral structure of the KRP can also be applied to other tensor related applications such as stochastic gradient descent [26] and higher-order singular value decomposition (HOSVD) [13].

We formalize the definitions in Section 2 and present our main results on leverage score estimation of the KRP in Section 3. The SPALS algorithm and its theoretical analysis are presented in Section 4. We discuss connections with previous works in Section 5. In Section 6, we empirically evaluate this algorithm and its variants on both synthetic and real world data. We conclude and discuss our work in Section 7.

2 Notation and Background

Vectors are represented by boldface lowercase letters, such as a, b, c; matrices are represented by boldface capital letters, such as A, B, C; tensors are represented by boldface calligraphic capital letters, such as T. Without loss of generality, in this paper we focus our discussion on 3-mode tensors, but our results and algorithms generalize easily to higher-order tensors.

The i-th entry of a vector is denoted by a_i, element (i, j) of a matrix A is denoted by A_ij, and element (i, j, k) of a tensor T ∈ R^{I×J×K} is denoted by T_ijk.
For notational simplicity, we assume that (i, j) also represents the paired index i + (j − 1)I, between 1 and IJ, where the values I and J should be clear from the context.

For a tensor T ∈ R^{I×J×K}, we denote the tensor norm by ||T||, i.e., ||T|| = sqrt(Σ_{i,j,k=1}^{I,J,K} T_ijk^2).

Special Matrix Products. Our manipulation of tensors as matrices revolves around several matrix products. Our main focus is the matrix Khatri-Rao product (KRP) ⊙, where for a pair of matrices A ∈ R^{I×R} and B ∈ R^{J×R}, A ⊙ B ∈ R^{(IJ)×R} has element ((i, j), r) equal to A_ir B_jr.

We also utilize the matrix Kronecker product ⊗ and the elementwise matrix product ∗. More details on these products can be found in Appendix A and [21].

Tensor Matricization. Here we consider only mode-n matricization. For n = 1, 2, 3, the mode-n matricization of a tensor T ∈ R^{I×J×K} is denoted by T_(n). For instance, T_(3) ∈ R^{K×IJ}, where element (k, (i, j)) is T_ijk.

Tensor CP Decomposition. The tensor CP decomposition [17, 3] expresses a tensor as the sum of a number of rank-one tensors, e.g.,

T = Σ_{r=1}^R a_r ∘ b_r ∘ c_r,

where ∘ denotes the outer product, T ∈ R^{I×J×K}, and a_r ∈ R^I, b_r ∈ R^J, c_r ∈ R^K for r = 1, 2, ..., R. A tensor CP decomposition will be compactly represented as ⟦A, B, C⟧, where A ∈ R^{I×R}, B ∈ R^{J×R}, and C ∈ R^{K×R} are the factor matrices and a_r, b_r, c_r are their r-th columns, respectively, i.e., ⟦A, B, C⟧_ijk = Σ_{r=1}^R A_ir B_jr C_kr.
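The Khatri-Rao product defined above is easy to realize explicitly for small examples. A minimal NumPy sketch (illustration only — it materializes all IJ rows, which is exactly what SPALS avoids at scale; note it uses the row-major pairing (i, j) → iJ + j rather than the column-major convention of the text):

```python
import numpy as np

def khatri_rao(A, B):
    """Columnwise Kronecker product: row (i, j) of A ⊙ B is the
    elementwise product of row i of A and row j of B."""
    (I, R), (J, R2) = A.shape, B.shape
    assert R == R2, "factor matrices must share the column dimension R"
    # build the (I, J, R) array of products, then flatten the paired row index
    return np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((5, 2))
KR = khatri_rao(A, B)

assert KR.shape == (4 * 5, 2)
# element ((i, j), r) equals A[i, r] * B[j, r]
i, j = 3, 1
assert np.allclose(KR[i * 5 + j], A[i] * B[j])
```

The quadratic blow-up is visible already here: two factors with I and J rows produce IJ rows, which is what makes explicit construction prohibitive for the tensors discussed above.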
As in the matrix case, each rank-1 component is usually interpreted as a hidden factor, which captures the interactions between all dimensions in the simplest way.

Given a tensor T ∈ R^{I×J×K} along with a target rank R, the goal is to find a rank-R tensor, specified by its factor matrices A ∈ R^{I×R}, B ∈ R^{J×R}, C ∈ R^{K×R}, that is as close to T as possible:

min_{A,B,C} ||T − ⟦A, B, C⟧||^2 = Σ_{i,j,k} ( T_ijk − Σ_{r=1}^R A_ir B_jr C_kr )^2.

Alternating Least Squares Algorithm. A widely used method for computing the CP decomposition is the alternating least squares (ALS) algorithm. It iteratively minimizes over one of the factor matrices with the others fixed. For instance, when the factors A and B are fixed, algebraic manipulation shows that the best choice of C can be obtained by solving the least squares regression

min_C ||X C^T − T_(3)^T||^2,    (1)

where the design matrix X = B ⊙ A is the KRP of A and B, and T_(3) is the matricization of T [21].

3 Near-optimal Leverage Score Estimation for the Khatri-Rao Product

As shown in Section 2, the matrix KRP captures the essential interactions between the factor matrices in the tensor CP decomposition. Extracting these interactions compactly is challenging because the size of the KRP of two matrices is significantly larger than the input matrices. For example, for the Amazon review data, the KRP of two factor matrices contains 10^12 rows, which is much larger than the data set itself with 10^9 nonzeros.

Importance sampling is one of the most powerful tools for obtaining sample-efficient randomized data reductions with strong guarantees. However, effective implementation requires comprehensive knowledge of the objects to be sampled: the KRP of the factor matrices.
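For reference, the exact ALS update of Equation (1) can be sketched in a few lines of NumPy. This is a hypothetical dense implementation for tiny tensors; it exploits the identity X^T X = A^T A ∗ B^T B so the Gram matrix never touches the IJ-row design matrix, and it uses a row-major pairing of the (i, j) index:

```python
import numpy as np

def khatri_rao(A, B):
    I, R = A.shape
    J, _ = B.shape
    return np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)

def als_update_C(T, A, B):
    """One ALS step for the third factor: min_C ||X C^T - T_(3)^T||_F^2
    solved via the normal equations C = T_(3) X (X^T X)^{-1}."""
    I, J, K = T.shape
    T3 = T.transpose(2, 0, 1).reshape(K, I * J)  # mode-3 matricization, (k, (i, j)) entry is T[i, j, k]
    X = khatri_rao(A, B)                         # (IJ) x R design matrix under row-major pairing
    G = (A.T @ A) * (B.T @ B)                    # Gram matrix X^T X, computed in O((I + J) R^2)
    return np.linalg.solve(G, X.T @ T3.T).T

# sanity check: on an exactly rank-2 tensor, one update recovers the true factor
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((d, 2)) for d in (6, 5, 4))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
C_new = als_update_C(T, A, B)
assert np.allclose(C_new, C)
```

The expensive part is the product `X.T @ T3.T`, which costs on the order of nnz(T)·R; the sampling scheme of Section 4 replaces it with an estimate from a few rows.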
In this section, we provide an efficient and effective toolset for estimating the statistical leverage scores of the KRP of factor matrices, giving a direct way of applying importance sampling, one of the most important tools in randomized matrix algorithms, to tensor CP decomposition related applications.

In the remainder of this section, we first define and discuss the optimal importance weights, the statistical leverage scores, in the context of ℓ2-regression. Then we propose and prove our near-optimal leverage score estimation routine.

3.1 Leverage Score Sampling for ℓ2-regression

It is known that, when p ≪ n, subsampling the rows of a design matrix X ∈ R^{n×p} according to its statistical leverage scores and solving the regression on the samples provides an efficient approximate solution to the least squares problem min_z ||Xz − y||_2^2, with strong theoretical guarantees [23].

Definition 3.1 (Statistical Leverage Score). Given an n × r matrix X with n > r, let U denote the n × r matrix consisting of the top-r left singular vectors of X. Then the quantity

τ_i = ||U_{i,:}||_2^2,

where U_{i,:} denotes the i-th row of U, is the statistical leverage score of the i-th row of X.

The statistical leverage score of a row captures the importance of that row in forming the linear subspace. Its optimality for solving ℓ2-regression can be explained by the subspace projection nature of linear regression.

Leverage score sampling alone does not yield an efficient algorithm for the optimization problem in Equation (1), due to the difficulty of computing the statistical leverage scores. But the reduction to the matrix setting allows for speedups using a variety of tools. In particular, sketching [6, 25, 27] or iterative sampling [22, 9] lead to routines that run in input-sparsity time: O(nnz) plus the cost of solving a least squares problem of size O(r log n).
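Definition 3.1 translates directly into code. A small sketch (assuming a dense SVD is affordable, which is precisely what fails for the IJ-row KRP):

```python
import numpy as np

def leverage_scores(X):
    """Statistical leverage scores: squared row norms of the left singular factor U."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
tau = leverage_scores(X)

# the scores sum to the rank of X, and each score lies in [0, 1]
assert np.isclose(tau.sum(), 3.0)
assert np.all((tau >= 0) & (tau <= 1 + 1e-12))
```

Both checked properties follow from U U^T being an orthogonal projection onto the column space of X, which is also why these scores are the right importance weights for subspace-preserving row sampling.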
However, directly applying these methods still requires at least one pass over T at each iteration, which would dominate the overall cost.

3.2 Near-optimal Leverage Score Estimation

As discussed in the previous section, the KRPs of factor matrices capture the interaction between two modes in the tensor CP decomposition, e.g., the design matrix B ⊙ A in the linear regression problem. To extract a compact representation of this interaction, the statistical leverage scores of B ⊙ A provide an informative distribution over the rows, which can be utilized to randomly select important subsets of rows.

For a matrix with IJ rows in total, e.g., B ⊙ A, the calculation of statistical leverage scores is in general prohibitively expensive. However, due to the special structure of the KRP B ⊙ A, an upper bound on the statistical leverage scores, which suffices to obtain the same guarantee using slightly more samples, can be efficiently estimated, as shown in Theorem 3.2.

Theorem 3.2 (Khatri-Rao Bound). For matrices A ∈ R^{I×R} and B ∈ R^{J×R}, where I > R and J > R, let τ^A_i and τ^B_j be the statistical leverage scores of the i-th row of A and the j-th row of B, respectively. Then, for the statistical leverage score τ^{A⊙B}_{i,j} of the (iJ + j)-th row of the matrix A ⊙ B, we have

τ^{A⊙B}_{i,j} ≤ τ^A_i τ^B_j.

Proof. Let the singular value decompositions of A and B be A = U^a Λ^a (V^a)^T and B = U^b Λ^b (V^b)^T, where U^a ∈ R^{I×R}, U^b ∈ R^{J×R}, and Λ^a, Λ^b, V^a, V^b ∈ R^{R×R}. By the definition of the Khatri-Rao product, we have

A ⊙ B = [A_{:,1} ⊗ B_{:,1}, ..., A_{:,R} ⊗ B_{:,R}] ∈ R^{IJ×R},

where ⊗ is the Kronecker product. By the form of the SVD and Lemma B.1, we have

A ⊙ B = [U^a Λ^a (V^a_{1,:})^T ⊗ U^b Λ^b (V^b_{1,:})^T, ..., U^a Λ^a (V^a_{R,:})^T ⊗ U^b Λ^b (V^b_{R,:})^T]
      = [(U^a Λ^a) ⊗ (U^b Λ^b)] ((V^a)^T ⊙ (V^b)^T)
      = [U^a ⊗ U^b] [Λ^a ⊗ Λ^b] ((V^a)^T ⊙ (V^b)^T)
      = [U^a ⊗ U^b] S,

where S = [Λ^a ⊗ Λ^b] ((V^a)^T ⊙ (V^b)^T) ∈ R^{R^2×R}. So the SVD of A ⊙ B can be constructed using the SVD of S = U^s Λ^s (V^s)^T, and the leverage scores of A ⊙ B can be computed from [U^a ⊗ U^b] U^s:

H = [U^a ⊗ U^b] U^s (U^s)^T [U^a ⊗ U^b]^T,    (2)

and for the index k = iJ + j, we have

τ^{A⊙B}_{i,j} = H_{k,k} = e_k^T H e_k ≤ ||[U^a ⊗ U^b]^T e_k||_2^2    (3)
= Σ_{p=1}^R Σ_{q=1}^R (U^a_{i,p})^2 (U^b_{j,q})^2 = (Σ_{p=1}^R (U^a_{i,p})^2)(Σ_{q=1}^R (U^b_{j,q})^2) = τ^A_i τ^B_j,    (4)

where e_k is the k-th natural basis vector. The inequality holds because H ⪯ [U^a ⊗ U^b] [U^a ⊗ U^b]^T.

Algorithm 1: Sample a row from B ⊙ A and T_(3) (with mixing parameter β ∈ (0, 1)).
1. Draw a Bernoulli random variable z ∼ Bernoulli(β).
2. If z = 0: draw i ∼ Multi(τ^A_1/R, ..., τ^A_I/R) and j ∼ Multi(τ^B_1/R, ..., τ^B_J/R).
3. Else: draw an entry (i, j, k) from the nonzero entries, with probability proportional to T_ijk^2.
4. Return the (jI + i)-th row of B ⊙ A and of T_(3), with weight IJ p_{i,j}.

For the rank-R CP decomposition, the sum of the leverage scores over all rows of B ⊙ A equals R. The sum of our upper bounds relaxes this to R^2, which means that we now need Õ(R^2) samples instead of Õ(R). This result generalizes directly to the Khatri-Rao product of the factors of k-dimensional tensors. The proof is provided in Appendix C.

Theorem 3.3. For matrices A^(k) ∈ R^{I_k×R}, where I_k > R for k = 1, ..., K, let τ^(k)_i be the statistical leverage score of the i-th row of A^(k). Then, for the (∏_k I_k)-by-R matrix A^(1) ⊙ A^(2) ⊙ ⋯ ⊙ A^(K), we have

τ^{1:K}_{i_1,...,i_K} ≤ ∏_{k=1}^K τ^(k)_{i_k},

where τ^{1:K}_{i_1,...,i_K} denotes the statistical leverage score of the row of A^(1) ⊙ A^(2) ⊙ ⋯ ⊙ A^(K) corresponding to the i_k-th row of A^(k) for k = 1, ..., K.

Our estimation enables the development of efficient numerical algorithms and is nearly optimal in three ways:

1. The estimates can be calculated in sublinear time, given that max{I, J, K} = o(nnz(T)). For instance, for the Amazon review data, we have max{I, J, K} ≈ 10^6 ≪ nnz(T) ≈ 10^9;
2. The form of the estimates allows efficient sample drawing. In fact, the row index can be drawn efficiently by considering each mode independently;
3. The estimates are tight up to a factor of R, and R is a modest constant for low-rank decomposition. Therefore the estimates allow sample-efficient importance sampling.

4 SPALS: Sampling Alternating Least Squares

A direct application of our results on KRP leverage score estimation is an efficient version of the ALS algorithm for tensor CP decomposition, whose computational bottleneck is solving the optimization problem (1). Our main algorithmic result is a way to obtain a high quality O(r^2 log n) row sample of X without explicitly constructing the matrix X. This is motivated by a recent work that implicitly generates sparsifiers for multistep random walks [4]. In particular, we sample the rows of X, the KRP of A and B, using products of quantities computed on the corresponding rows of A and B, which provide a rank-1 approximation to the optimal importance weights: the statistical leverage scores.
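The bound of Theorem 3.2 is easy to verify numerically on small random factors. A sketch (row-major pairing (i, j) → iJ + j throughout; the final check contrasts the KRP bound with the Kronecker product, for which the product of scores is exact — cf. Theorem 4.2 in Section 4.3):

```python
import numpy as np

def leverage_scores(X):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(3)
I, J, R = 8, 7, 3
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))

tau_A, tau_B = leverage_scores(A), leverage_scores(B)
bound = np.outer(tau_A, tau_B).reshape(-1)   # tau^A_i * tau^B_j for each paired row

# Khatri-Rao product: the true scores are bounded by products of factor scores
KR = np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)
tau_KR = leverage_scores(KR)
assert np.all(tau_KR <= bound + 1e-9)        # Theorem 3.2
assert np.isclose(tau_KR.sum(), R)           # true scores sum to R ...
assert np.isclose(bound.sum(), R**2)         # ... while the bound's total mass is R^2

# Kronecker product: the same product of scores is exact
tau_kron = leverage_scores(np.kron(A, B))
assert np.allclose(tau_kron, np.kron(tau_A, tau_B))
```

The R-vs-R^2 total mass is exactly the "slightly more samples" overhead discussed after Algorithm 1.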
This leads to a sublinear-time sampling routine, and implies that we can approximate the progress of each ALS step in time linear in the size of the factor being updated, which can be sublinear in the number of non-zeros in T. In the remainder of this section, we present our algorithm SPALS and prove its approximation guarantee. We also discuss its extension to other tensor related applications.

4.1 Sampling Alternating Least Squares

The optimal solution to optimization problem (1) is

C = T_(3) (B ⊙ A) (A^T A ∗ B^T B)^{-1}.

We separate the calculation into two parts: (1) T_(3) (B ⊙ A), and (2) (A^T A ∗ B^T B)^{-1}, where ∗ denotes the elementwise matrix product. The latter inverts the Gram matrix of the Khatri-Rao product, which can be computed efficiently due to its R × R size. We will mostly focus on evaluating the former expression.

We perform the matrix multiplication by drawing a few rows from both T_(3)^T and B ⊙ A and constructing the final solution from this subset of rows. The rows of B ⊙ A can be indexed by (i, j) for i = 1, ..., I and j = 1, ..., J, corresponding to the i-th and j-th rows of A and B, respectively. That is, our sampling problem can be seen as sampling the entries of an I × J matrix P = {p_{i,j}}_{i,j}. We define the sampling probability p_{i,j} as

p_{i,j} = (1 − β) τ^A_i τ^B_j / R^2 + β Σ_{k=1}^K T_ijk^2 / ||T||^2,    (5)

where β ∈ (0, 1). The first term is a rank-1 component of the matrix P. When the input tensor is sparse, the second term is sparse, so P admits a sparse-plus-low-rank structure and can easily be sampled as a mixture of two simple distributions. The sampling algorithm is described in Algorithm 1. Note that sampling by the leverage scores of the design matrix B ⊙ A alone provides a guaranteed but worse approximation for each step [23].
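Algorithm 1 with the mixture of Equation (5) can be sketched as follows. This is a hypothetical standalone illustration: `tau_A`/`tau_B` are precomputed leverage scores, the nonzeros of T are given as index/value arrays with the mode-3 index already dropped, and the rescaling weight attached to each sample is left to the caller:

```python
import numpy as np

def sample_krp_rows(tau_A, tau_B, nz_ij, nz_vals, beta, m, rng):
    """Draw m paired row indices (i, j) from the mixture of Eq. (5).

    With prob. 1 - beta: i ~ tau^A_i / R and j ~ tau^B_j / R, independently;
    with prob. beta:     the (i, j) of a nonzero entry, with prob. ~ T_ijk^2."""
    pA = tau_A / tau_A.sum()              # tau^A sums to R, so this is tau^A_i / R
    pB = tau_B / tau_B.sum()
    pT = nz_vals**2 / np.sum(nz_vals**2)  # data-dependent term over nonzeros
    rows = np.empty((m, 2), dtype=np.int64)
    for t in range(m):
        if rng.random() < beta:           # sparse term: pick a nonzero entry
            rows[t] = nz_ij[rng.choice(len(pT), p=pT)]
        else:                             # rank-1 term: modes drawn independently
            rows[t] = rng.choice(len(pA), p=pA), rng.choice(len(pB), p=pB)
    return rows

rng = np.random.default_rng(5)
tau_A = np.array([0.9, 0.6, 0.5])         # toy leverage scores with R = 2
tau_B = np.array([0.8, 0.7, 0.5])
nz_ij = np.array([[0, 0], [2, 1]])        # (i, j) indices of nonzero entries of T
nz_vals = np.array([3.0, 1.0])            # the corresponding entry values
rows = sample_krp_rows(tau_A, tau_B, nz_ij, nz_vals, beta=0.5, m=100, rng=rng)
assert rows.shape == (100, 2) and rows.min() >= 0 and rows.max() <= 2
```

The key property is that the rank-1 term never enumerates the IJ pairs: each mode is sampled independently, which is what makes the per-sample cost logarithmic rather than quadratic.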
Since the design matrix itself is formed from the two factor matrices, i.e., it does not directly utilize the information in the data, we design the second term for the worst-case scenario.

When R ≪ n and n ≪ nnz(T), where n = max(I, J, K), we can afford to calculate τ^A_i and τ^B_j exactly in each iteration. So the distribution corresponding to the first term can be efficiently sampled with preparation cost Õ(r^2 n + r^3) and per-sample cost O(log n). Note that the second term requires a one-time O(nnz(T)) preprocessing before the first iteration.

4.2 Approximation Guarantees

We define the following conditions:

C1. The sampling probability p_{i,j} satisfies p_{i,j} ≥ θ_1 τ^{A⊙B}_{i,j} / R for some constant θ_1;
C2. The sampling probability p_{i,j} satisfies p_{i,j} ≥ θ_2 Σ_{k=1}^K T_ijk^2 / ||T||^2 for some constant θ_2.

The proposed probabilities p_{i,j} in Equation (5) satisfy both conditions, with θ_1 = (1 − β)/R and θ_2 = β. We can now prove our main approximation result.

Theorem 4.1. For a tensor T ∈ R^{I×J×K} with n = max(I, J, K) and any factor matrices for the first two dimensions A ∈ R^{I×R} and B ∈ R^{J×R}: if a step of ALS on the third dimension gives C_opt, then a step of SPALS that samples m = Θ(R^2 log n / ε^2) rows produces C satisfying

||T − ⟦A, B, C⟧||^2 < ||T − ⟦A, B, C_opt⟧||^2 + ε ||T||^2.

Proof. Denote the sample-and-rescale matrix by S ∈ R^{m×IJ}. By Corollary E.3, we have ||T_(3) (B ⊙ A) − T_(3) S^T S (B ⊙ A)|| ≤ ε ||T||. Together with Lemma E.1, the claim follows.

Note that the approximation error of our algorithm does not accumulate over iterations.
As with the stochastic gradient descent algorithm, errors incurred in earlier iterations can be corrected in subsequent iterations.

4.3 Extensions to Other Tensor Related Applications

Importance Sampling SGD for CP Decomposition. We can incorporate importance sampling into the stochastic gradient descent algorithm for CP decomposition. The gradient has the form

∂/∂C ||T − ⟦A, B, C⟧||^2 = 2 C (A^T A ∗ B^T B) − 2 T_(3) (B ⊙ A).

By sampling rows according to the proposed distribution, we reduce the per-step variance via importance sampling [26]. Our result addresses the computational difficulty of finding the appropriate importance weights.

Sampling ALS for Higher-Order Singular Value Decomposition (HOSVD). For solving the HOSVD [13] of a tensor, the Kronecker product is involved instead of the Khatri-Rao product. In Appendix D, we prove a similar leverage score result for the Kronecker product. In fact, for the Kronecker product, our "approximation" gives the exact leverage scores.

Theorem 4.2. For matrices A ∈ R^{I×M} and B ∈ R^{J×N}, where I > M and J > N, let τ^A_i and τ^B_j be the statistical leverage scores of the i-th row of A and the j-th row of B, respectively. Then, for the matrix A ⊗ B ∈ R^{IJ×MN} with statistical leverage score τ^{A⊗B}_{i,j} for the (iJ + j)-th row, we have

τ^{A⊗B}_{i,j} = τ^A_i τ^B_j.

5 Related Works

CP decomposition is one of the simplest and most easily interpretable tensor decompositions. Fitting it in an ALS fashion is still considered the state of the art in the recent tensor analytics literature [37]. The most widely used implementation of ALS is the MATLAB Tensor Toolbox [21]. It directly evaluates the analytic solution of the ALS steps. There is a line of work on speeding up this procedure in distributed/parallel/MapReduce settings [20, 19, 5, 33]. Such approaches are compatible with our approach, as we directly reduce the number of steps by sampling.
A similar connection holds for works achieving more efficient computation of the KRP steps of the ALS algorithm, such as [32].

The applicability of randomized numerical linear algebra tools to tensors was studied during their development [28]. Within the context of sampling-based tensor decomposition, early work was published in [36, 35], focusing however on the Tucker decomposition. In [30], sampling is used as a means of extracting small representative sub-tensors from the initial input, which are further decomposed via standard ALS and carefully merged to form the output. Another work, based on a-priori sampling of the input tensor, can be found in [2]. However, recent developments in randomized numerical linear algebra have often focused on over-constrained regression problems or low-rank matrices. The incorporation of such tools into tensor analytics routines is fairly recent [31, 37].

Most closely related to our algorithm are the routines from [37], which gave a sketch-based CP decomposition inspired by the earlier work in [31]. Both approaches only need to examine the factorization at each iteration, followed by a number of updates that depends only on the rank. A main difference is that the sketches in [37] move the non-zeros around, while our sampling approach instead removes many entries. Their algorithm also performs a subsequent FFT step, while our routine always works on subsets of the matricizations. Our method is much more suitable for sparse tensors. Also, our routine can be considered data-dependent randomization, which enjoys better approximation accuracy than [37] in the worst case.

For a direct comparison, the method in [37] and ours both require O(nnz(T)) preprocessing at the beginning. Then, for each iteration, our method requires Õ(nr^3) operations, compared with O(r(n + Bb log b) + r^3) for [37]. Here B and b are parameters of the sketches in [37] and need to be tuned for various applications.
Depending on the target accuracy, b can be as large as the input size: on the cube synthetic tensors with n = 10^3 that the experiments in [37] focused on, b was set between 2^14 (≈ 1.6 × 10^4) and 2^16 (≈ 6.6 × 10^4) in order to converge to good relative errors.

From a distance, our method can be viewed as incorporating randomization into the intermediate steps of algorithms, and as a higher dimensional analog of weighted SGD algorithms [39]. Compared to more global uses of randomization [38], these more piecemeal invocations have several advantages. For high dimensional tensors, sketching methods need to preserve all dimensions, while the intermediate problems only involve matrices and can often be reduced to smaller dimensions. For approximating a rank-R tensor in d dimensions to error ε, this represents the difference between poly(R, ε) and (R/ε)^d. Furthermore, the lower cost of each step of alternating minimization makes it much easier to increase accuracy in the last few steps, leading to algorithms that behave the same way in the limit. The wealth of work on reducing the sizes of matrices while preserving objectives such as ℓp norms, hinge losses, and M-estimators [11, 10, 8, 7] also suggests that this approach can be directly adapted to much wider ranges of settings and objectives.

6 Experimental Results

We implemented and evaluated our algorithms in a single machine setting. The source code is available online at https://github.com/dehuacheng/SpAls. Experiments were run on a single machine with two Intel Xeon E5-2630 v3 CPUs and 256GB of memory. All methods are implemented in C++ with OpenMP parallelization. We report averages over 5 trials.

Dense Synthetic Tensors. We start by comparing our method against the sketching based algorithm from [37] in the single thread setting, as in their evaluation.
The synthetic data we tested are third-order tensors with dimension n = 1000, as described in [37]. We generated a rank-1000 tensor with harmonically decreasing weights on the rank-1 components. Then, after normalization, random Gaussian noise with noise-to-signal ratio nsr = 0.1, 1, 10 was added. As in previous experimental evaluations [37], we set the target rank to r = 10. The performance is given in Table 1a. We vary the sampling rate of our algorithm, i.e., SPALS(α) samples α r^2 log^2 n rows at each iteration.

Table 1a: Running times per iteration (seconds) and errors of various alternating least squares implementations. Each cell lists time / error.

Method         | nsr = 0.1   | nsr = 1     | nsr = 10
ALS-dense      | 64.8 / 0.27 | 66.2 / 1.08 | 67.6 / 10.08
sketch(20, 14) | 6.50 / 0.45 | 4.70 / 1.37 | 4.90 / 11.11
sketch(40, 16) | 16.0 / 0.30 | 12.7 / 1.13 | 12.4 / 10.27
ALS-sparse     | 501 / 0.24  | 512 / 1.09  | 498 / 10.15
SPALS(0.3)     | 1.76 / 0.20 | 1.93 / 1.14 | 1.92 / 10.40
SPALS(1)       | 5.79 / 0.18 | 5.64 / 1.10 | 5.94 / 10.21
SPALS(3.0)     | 15.9 / 0.21 | 16.1 / 1.09 | 16.16 / 10.15

Table 1b: Relative error and running times per iteration on the Amazon review tensor, with dimensions 2.44e6 × 6.64e6 × 9.26e4 and 2.02 billion non-zeros.

Method     | error | time
ALS-sparse | 0.981 | 142
SPALS(0.3) | 0.987 | 6.97
SPALS(1)   | 0.983 | 15.7
SPALS(3.0) | 0.982 | 38.9

On these instances, a call to SPALS with rate α sampled about 4.77α × 10^3 rows and, as the tensor is dense, touched about 4.77α × 10^6 entries. The correspondence between running times and rates demonstrates the sublinear runtime of SPALS at low sampling rates. Compared with [37], our algorithm employs a data-dependent random sketch with minimal overhead, which yields significantly better precision with a similar amount of computation.

Sparse Data Tensor. Our original motivation for SPALS was handling large sparse data tensors. We ran our algorithm on a large-scale tensor generated from the Amazon review data [24].
Its size, and the convergence of SPALS under various parameter settings, are shown in Table 1b. We conducted these experiments in parallel with 16 threads. The Amazon data tensor has a much higher noise-to-signal ratio than our other experiments, which is common for large-scale data tensors: running deterministic ALS with rank 10 on it leads to a relative error of 98.1%. SPALS converges rapidly towards a good approximation in only a small fraction of the time taken by the ALS algorithm.

7 Discussion

Our experiments show that SPALS provides notable speedups over previous CP decomposition routines on both dense and sparse data. There are two main sources of speedup: (1) the low target rank and moderate individual dimensions enable us to compute leverage scores efficiently; and (2) the simple representation of the sampled form allows us to reuse mostly code from existing ALS routines with minimal computational overhead. It is worth noting that in the dense case, the total number of entries accessed during all 20 iterations is far fewer than the size of T. Nonetheless, the adaptive nature of the sampling scheme means all the information from T is taken into account while generating the first and subsequent iterates. From a randomized algorithms perspective, the sublinear-time sampling steps bear strong resemblance to stochastic optimization routines [34]. We believe that systematically investigating such connections can lead to more direct links between tensors and randomized numerical linear algebra, and in turn to further algorithmic improvements.

Acknowledgments

This work is supported in part by the U.S. Army Research Office under grant number W911NF-15-1-0491 and NSF Research Grants IIS-1254206 and IIS-1134990. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agency or the U.S. Government.

References

[1] B. Barak, J. A. Kelner, and D. Steurer.
Dictionary learning and tensor decomposition via the sum-of-squares method. In STOC, 2015.

[2] S. Bhojanapalli and S. Sanghavi. A new sampling technique for tensors. ArXiv e-prints, 2015.

[3] J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 1970.

[4] D. Cheng, Y. Cheng, Y. Liu, R. Peng, and S.-H. Teng. Spectral sparsification of random-walk matrix polynomials. arXiv preprint arXiv:1502.03496, 2015.

[5] J. H. Choi and S. Vishwanathan. DFacTo: Distributed factorization of tensors. In NIPS, 2014.

[6] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, 2013.

[7] K. L. Clarkson and D. P. Woodruff. Input sparsity and hardness for robust subspace approximation. In FOCS, 2015.

[8] K. L. Clarkson and D. P. Woodruff. Sketching for M-estimators: A unified approach to robust regression. In SODA, 2015.

[9] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. Uniform sampling for matrix approximation. In ITCS, 2015.

[10] M. B. Cohen and R. Peng. ℓp row sampling by Lewis weights. In STOC, 2015.

[11] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 2009.

[12] L. De Lathauwer and B. De Moor. From matrix to tensor: Multilinear algebra and signal processing. In Institute of Mathematics and Its Applications Conference Series, 1998.

[13] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 2000.

[14] V. De Silva and L.-H. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl., 2008.

[15] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 2011.

[16] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In COLT, 2015.

[17] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.

[18] C. J. Hillar and L.-H. Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 2013.

[19] I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos. HaTen2: Billion-scale tensor decompositions. In ICDE, 2015.

[20] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos. GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries. In KDD, 2012.

[21] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 2009.

[22] M. Li, G. Miller, and R. Peng. Iterative row sampling. In FOCS, 2013.

[23] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 2011.

[24] J. McAuley and J. Leskovec. Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys, 2013.

[25] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC, 2013.

[26] D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In NIPS, 2014.

[27] J. Nelson and H. L. Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS, 2013.

[28] N. H. Nguyen, P. Drineas, and T. D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. CoRR, 2010.

[29] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In NIPS, 2015.

[30] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos. ParCube: Sparse parallelizable tensor decompositions. In Machine Learning and Knowledge Discovery in Databases. Springer, 2012.

[31] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In KDD, 2013.

[32] A.-H. Phan, P. Tichavsky, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 2013.

[33] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In 29th IEEE International Parallel & Distributed Processing Symposium, 2015.

[34] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. JFAA, 2009.

[35] J. Sun, S. Papadimitriou, C.-Y. Lin, N. Cao, S. Liu, and W. Qian. MultiVis: Content-based social network exploration through multi-way visual analysis. In SDM. SIAM, 2009.

[36] C. E. Tsourakakis. MACH: Fast randomized tensor decompositions. In SDM. SIAM, 2010.

[37] Y. Wang, H.-Y. Tung, A. J. Smola, and A. Anandkumar. Fast and guaranteed tensor decomposition via sketching. In NIPS, 2015.

[38] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 2014.

[39] J. Yang, Y. Chow, C. Ré, and M. W. Mahoney. Weighted SGD for ℓp regression with randomized preconditioning. In SODA, 2016.

[40] R. Yu, D. Cheng, and Y. Liu. Accelerated online low rank tensor learning for multivariate spatiotemporal streams. In ICML, pages 238-247, 2015.