{"title": "Sublinear Time Orthogonal Tensor Decomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 801, "abstract": "A recent work (Wang et. al., NIPS 2015) gives the fastest known algorithms for orthogonal tensor decomposition with provable guarantees. Their algorithm is based on computing sketches of the input tensor, which requires reading the entire input. We show in a number of cases one can achieve the same theoretical guarantees in sublinear time, i.e., even without reading most of the input tensor. Instead of using sketches to estimate inner products in tensor decomposition algorithms, we use importance sampling. To achieve sublinear time, we need to know the norms of tensor slices, and we show how to do this in a number of important cases. For symmetric tensors $ T = \\sum_{i=1}^k \\lambda_i u_i^{\\otimes p}$ with $\\lambda_i > 0$ for all i, we estimate such norms in sublinear time whenever p is even. For the important case of p = 3 and small values of k, we can also estimate such norms. For asymmetric tensors sublinear time is not possible in general, but we show if the tensor slice norms are just slightly below $\\| T \\|_F$ then sublinear time is again possible. One of the main strengths of our work is empirical - in a number of cases our algorithm is orders of magnitude faster than existing methods with the same accuracy.", "full_text": "Sublinear Time Orthogonal Tensor Decomposition\u2217\n\nZhao Song\u2021\nHuan Zhang(cid:63)\n\u2021Dept. of Computer Science, University of Texas, Austin, USA\n\nDavid P. Woodruff\u2020\n\n\u2020IBM Almaden Research Center, San Jose, USA\n\n(cid:63)Dept. of Electrical and Computer Engineering, University of California, Davis, USA\nzhaos@utexas.edu, dpwoodru@us.ibm.com, ecezhang@ucdavis.edu\n\nAbstract\n\nimportant cases. For symmetric tensors T =(cid:80)k\n\nA recent work (Wang et. 
al., NIPS 2015) gives the fastest known algorithms\nfor orthogonal tensor decomposition with provable guarantees. Their algorithm\nis based on computing sketches of the input tensor, which requires reading the\nentire input. We show in a number of cases one can achieve the same theoretical\nguarantees in sublinear time, i.e., even without reading most of the input tensor.\nInstead of using sketches to estimate inner products in tensor decomposition\nalgorithms, we use importance sampling. To achieve sublinear time, we need\nto know the norms of tensor slices, and we show how to do this in a number of\n\u2297p\ni with \u03bbi > 0 for all i, we\nestimate such norms in sublinear time whenever p is even. For the important case\nof p = 3 and small values of k, we can also estimate such norms. For asymmetric\ntensors sublinear time is not possible in general, but we show if the tensor slice\nnorms are just slightly below (cid:107) T(cid:107)F then sublinear time is again possible. One of\nthe main strengths of our work is empirical - in a number of cases our algorithm is\norders of magnitude faster than existing methods with the same accuracy.\n\ni=1 \u03bbiu\n\n1\n\nIntroduction\n\nTensors are a powerful tool for dealing with multi-modal and multi-relational data. In recommendation\nsystems, often using more than two attributes can lead to better recommendations. This could occur,\nfor example, in Groupon where one could look at users, activities, and time (season, time of day,\nweekday/weekend, etc.), as three attributes to base predictions on (see [13] for a discussion). Similar\nto low rank matrix approximation, we seek a tensor decomposition to succinctly store the tensor and\nto apply it quickly. A popular decomposition method is the canonical polyadic decomposition, i.e.,\nthe CANDECOMP/PARAFAC (CP) decomposition, where the tensor is decomposed into a sum of\nrank-1 components [9]. 
We refer the reader to [23], where applications of CP including data mining, computational neuroscience, and statistical learning for latent variable models are mentioned.

A natural question, given the emergence of large data sets, is whether such decompositions can be performed quickly. There are a number of works on this topic [17, 16, 7, 11, 10, 4, 20]. Most related to ours are several recent works of Wang et al. [23] and Tung et al. [18], in which it is shown how to significantly speed up orthogonal tensor decomposition using the randomized technique of linear sketching [15]. In this work we also focus on orthogonal tensor decomposition. The idea in [23] is to create a succinct sketch of the input tensor, from which one can then perform implicit tensor decomposition by approximating inner products in existing decomposition methods. Existing methods, like the power method, involve computing the inner product of a vector, which is now a rank-1 matrix, with another vector, which is now a slice of a tensor. Such inner products can be approximated much faster by instead computing the inner product of the sketched vectors, which have significantly lower dimension. One can also replace the sketching with sampling to approximate inner products; we discuss some sampling schemes [17, 4] below and compare them to our work.

*Full version appears on arXiv, 2017. ‡Work done while visiting IBM Almaden.
†Supported by XDATA DARPA Air Force Research Laboratory contract FA8750-12-C-0323.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1.1 Our Contributions

We show in a number of important cases, one can achieve the same theoretical guarantees as in the work of Wang et al. [23] (which was applied later by Tung et al. [18]), in sublinear time, that is, without reading most of the input tensor. 
While previous work needs to walk through the input at least once to create a sketch, we show one can instead perform importance sampling of the tensor based on the current iterate, together with reading a few entries of the tensor which help us learn the norms of tensor slices. We use a version of ℓ2-sampling for our importance sampling. One source of speedup in our work and in Wang et al. [23] comes from approximating inner products in iterations in the robust tensor power method (see below). To estimate ⟨u, v⟩ for n-dimensional vectors u and v, their work computes sketches S(u) and S(v) and approximates ⟨u, v⟩ ≈ ⟨S(u), S(v)⟩. Instead, if one has u, one can sample coordinates i proportional to u_i², which is known as ℓ2-sampling [14, 8]. One estimates ⟨u, v⟩ as v_i‖u‖₂²/u_i, which is unbiased and has variance O(‖u‖₂²‖v‖₂²). These guarantees are similar to those using sketching, though the constants are significantly smaller (see below), and unlike sketching, one does not need to read the entire tensor to perform such sampling.

Symmetric Tensors: As in [23], we focus on orthogonal tensor decomposition of symmetric tensors, though we explain the extension to the asymmetric case below. Symmetric tensors arise in engineering applications, for example, to represent the symmetric tensor field of stress, strain, and anisotropic conductivity. Another example is diffusion MRI, in which one uses symmetric tensors to describe diffusion in the brain or other parts of the body. In spectral methods, symmetric tensors are exactly those that come up in Latent Dirichlet Allocation problems. 
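The ℓ2-sampling estimator just described is straightforward to realize in code. The following sketch is our own illustration (not the authors' implementation; function names are ours): it draws an index i with probability u_i²/‖u‖₂² and averages the unbiased per-sample estimate v_i‖u‖₂²/u_i.

```python
import random

def l2_sample_index(u):
    # Sample index i with probability u_i^2 / ||u||_2^2 (l2-sampling).
    norm_sq = sum(x * x for x in u)
    t = random.uniform(0.0, norm_sq)
    acc = 0.0
    for i, x in enumerate(u):
        acc += x * x
        if acc >= t:
            return i
    return len(u) - 1

def estimate_inner_product(u, v, num_samples):
    # Unbiased estimate of <u, v>: average v_i * ||u||_2^2 / u_i over
    # l2-sampled indices i; variance is O(||u||_2^2 ||v||_2^2 / L).
    norm_sq = sum(x * x for x in u)
    total = 0.0
    for _ in range(num_samples):
        i = l2_sample_index(u)
        total += v[i] * norm_sq / u[i]
    return total / num_samples
```

Note that only the coordinates of v at sampled positions are read, which is what makes the approach attractive when v is a row of a huge implicit vector such as a flattened tensor slice.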
Although one can symmetrize a tensor using simple matrix operations (see, e.g., [1]), we cannot do this in sublinear time.

In orthogonal tensor decomposition of a symmetric tensor, there is an underlying n × n × ··· × n tensor T* = Σ_{i=1}^k λ_i v_i^{⊗p}, and the input tensor is T = T* + E, where ‖E‖₂ ≤ ε. We have λ₁ > λ₂ > ··· > λ_k > 0 and that {v_i}_{i=1}^k is a set of orthonormal vectors. The goal is to reconstruct approximations v̂_i to the vectors v_i, and approximations λ̂_i to the λ_i. Our results naturally generalize to tensors with different lengths in different dimensions. For simplicity, we first focus on order p = 3.

In the robust tensor power method [1], one generates a random initial vector u, and performs T update steps û = T(I, u, u)/‖T(I, u, u)‖₂, where

T(I, u, u) = [ Σ_{j=1}^n Σ_{ℓ=1}^n T_{1,j,ℓ} u_j u_ℓ, Σ_{j=1}^n Σ_{ℓ=1}^n T_{2,j,ℓ} u_j u_ℓ, ···, Σ_{j=1}^n Σ_{ℓ=1}^n T_{n,j,ℓ} u_j u_ℓ ].

The matrices T_{1,*,*}, . . . , T_{n,*,*} are referred to as the slices. The vector û typically converges to the top eigenvector in a small number of iterations, and one often chooses a small number L of random initial vectors to boost confidence. Successive eigenvectors can be found by deflation. The algorithm and analysis immediately extend to higher order tensors.

We use ℓ2-sampling to estimate T(I, u, u). To achieve the same guarantees as in [23], for typical settings of parameters (constant k and several eigenvalue assumptions), naïvely one needs to take O(n²) ℓ2-samples from u for each slice in each iteration, resulting in Ω(n³) time and destroying our sublinearity. 
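For reference, the exact (non-sampled) power update is easy to state in code. The sketch below is our own dense toy version on a Python list-of-lists tensor (the function and parameter names are ours, not the paper's): it repeats û = T(I, u, u)/‖T(I, u, u)‖₂ and finally reports the eigenvalue estimate λ = T(u, u, u).

```python
import math
import random

def contract_tIuu(T, u):
    # T(I, u, u)_i = sum_{j,l} T[i][j][l] * u[j] * u[l].
    n = len(u)
    return [sum(T[i][j][l] * u[j] * u[l] for j in range(n) for l in range(n))
            for i in range(n)]

def power_method(T, num_iters=30, seed=1, u0=None):
    # Robust tensor power update: u <- T(I, u, u) / ||T(I, u, u)||_2,
    # starting from u0 if given, else from a random Gaussian vector.
    n = len(T)
    rng = random.Random(seed)
    u = list(u0) if u0 is not None else [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(num_iters):
        w = contract_tIuu(T, u)
        nrm = math.sqrt(sum(x * x for x in w))
        u = [x / nrm for x in w]
    # Eigenvalue estimate lambda = T(u, u, u).
    lam = sum(T[i][j][l] * u[i] * u[j] * u[l]
              for i in range(n) for j in range(n) for l in range(n))
    return lam, u
```

Each exact iteration touches all n³ entries; the paper's point is that the contraction can instead be estimated from a sublinear number of ℓ2-samples.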
We observe that if we additionally knew the squared norms ‖T_{1,*,*}‖_F², . . . , ‖T_{n,*,*}‖_F², then we could take O(n²) ℓ2-samples in total, where we take (‖T_{i,*,*}‖_F²/‖T‖_F²) · O(n²) ℓ2-samples from the i-th slice in expectation. Perhaps in some applications such norms are known or cheap to compute in a single pass, but without further assumptions, how can one obtain such norms in sublinear time?

If T is a symmetric tensor, then T_{j,j,j} = Σ_{i=1}^k λ_i v_{i,j}³ + E_{j,j,j}. Note that if there were no noise, then we could read off approximations to the slice norms, since ‖T_{j,*,*}‖_F² = Σ_{i=1}^k λ_i² v_{i,j}², and so T_{j,j,j}^{2/3} is an approximation to ‖T_{j,*,*}‖_F² up to factors depending on k and the eigenvalues. However, there is indeed noise. To obtain non-trivial guarantees, the robust tensor power method assumes ‖E‖₂ = O(1/n), where

‖E‖₂ = sup_{‖u‖₂=‖v‖₂=‖w‖₂=1} E(u, v, w) = sup_{‖u‖₂=‖v‖₂=‖w‖₂=1} Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^n E_{i,j,k} u_i v_j w_k,

which in particular implies |E_{j,j,j}| = O(1/n). This assumption comes from the Θ(1/√n)-correlation of the random initial vector to v₁. This noise bound does not trivialize the problem; indeed, E_{j,j,j} can be chosen adversarially subject to |E_{j,j,j}| = O(1/n), and if the v_i were random unit vectors and the λ_i and k were constant, then Σ_{i=1}^k λ_i v_{i,j}³ = O(1/n^{3/2}), which is small enough to be completely masked by the noise E_{j,j,j}. Nevertheless, there is a lot of information about the slice norms. Indeed, suppose k = 1, λ₁ = Θ(1), and ‖T‖_F = 1. 
Then T_{j,j,j} = Θ(v_{1,j}³) + E_{j,j,j}, and one can show ‖T_{j,*,*}‖_F² = λ₁²v_{1,j}² ± O(1/n). Again using that |E_{j,j,j}| = O(1/n), this implies ‖T_{j,*,*}‖_F² = ω(n^{−2/3}) if and only if T_{j,j,j} = ω(1/n), and therefore one would notice this by reading T_{j,j,j}. There can only be o(n^{2/3}) slices j for which ‖T_{j,*,*}‖_F² = ω(n^{−2/3}), since ‖T‖_F² = 1. Therefore, for each of them we can afford to take O(n²) ℓ2-samples and still have an O(n^{2+2/3}) = o(n³) sublinear running time. The remaining slices all have ‖T_{j,*,*}‖_F² = O(n^{−2/3}), and therefore if we also take O(n^{4/3}) ℓ2-samples from every slice, we will also estimate the contribution to T(I, u, u) from these slices well. This is also a sublinear O(n^{2+1/3}) number of samples.

While the previous paragraph illustrates the idea for k = 1, for k = 2 we need to read more than the T_{j,j,j} entries to decide how many ℓ2-samples to take from a slice. The analysis is more complicated because of sign cancellations. Even for k = 2 we could have T_{j,j,j} = λ₁v_{1,j}³ + λ₂v_{2,j}³ + E_{j,j,j}, and if v_{1,j} = −v_{2,j} then we may not detect that ‖T_{j,*,*}‖_F² is large. We fix this by also reading the entries T_{i,j,j}, T_{j,i,j}, and T_{j,j,i} for every i and j. This is still only O(n²) entries, and so we are still sublinear time. Without additional assumptions, we only give a formal analysis of this for k ∈ {1, 2}. More importantly, if instead of third-order symmetric tensors we consider p-th order symmetric tensors for even p, we do not have such sign cancellations. In this case we do not have any restrictions on k for estimating slice norms. 
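The sign-cancellation phenomenon for k = 2 and p = 3, the fix of reading the extra O(n²) entries, and the absence of cancellation for even p can all be seen on a tiny noiseless example (our own construction, for illustration only):

```python
import math

def sym_entry(lams, vs, idx):
    # Entry T_idx of the noiseless symmetric tensor sum_r lam_r * v_r^{(tensor p)},
    # where p = len(idx).
    return sum(l * math.prod(v[i] for i in idx) for l, v in zip(lams, vs))

v1 = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
v2 = [-1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]   # orthonormal to v1, v2[0] = -v1[0]
lams = [1.0, 1.0]

# p = 3: the diagonal entry T_{0,0,0} vanishes by sign cancellation ...
t000 = sym_entry(lams, [v1, v2], (0, 0, 0))

# ... even though slice 0 carries half of ||T||_F^2 = lam1^2 + lam2^2 = 2:
slice0_sq = sum(sym_entry(lams, [v1, v2], (0, j, k)) ** 2
                for j in range(3) for k in range(3))

# Reading T_{0,0,1} (one of the extra entries of the form T_{j,j,i}) reveals the mass:
t001 = sym_entry(lams, [v1, v2], (0, 0, 1))

# Even order p = 4: every term lam_r * v_r[0]^4 is nonnegative, so no cancellation:
t0000 = sym_entry(lams, [v1, v2], (0, 0, 0, 0))
```

Here t000 is exactly zero while slice 0 has squared Frobenius norm 1, and the even-order diagonal entry t0000 stays bounded away from zero.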
One does need to show after de\ufb02ation, the slice norms can still be\nestimated; this holds because the eigenvectors and eigenvalues are estimated suf\ufb01ciently well.\nWe also give several per-iteration optimizations of our algorithm, based on careful implementations\nof generating a sorted list of random numbers and random permutations. We \ufb01nd empirically (see\nbelow) that we are much faster per iteration than previous sketching algorithms, in addition to not\nhaving to read the entire input tensor in a preprocessing step.\n\n1,j + \u03bb2v3\n\nAsymmetric Tensors: For asymmetric tensors, e.g., 3rd-order tensors of the form(cid:80)k\n\ni=1 \u03bbiui \u2297 vi \u2297\nwi, it is impossible to achieve sublinear time in general, since it is hard to distinguish T = ei\u2297ej\u2297ek\nfor random i, j, k \u2208 {1, 2, . . . , n} from T = 0\u22973. We make a necessary and suf\ufb01cient assumption\nthat all the entries of the ui are less than n\u2212\u03b3 for an arbitrarily small constant \u03b3 > 0. In this case, all\nslice norms are o(n\u2212\u03b3) and by taking O(n2\u2212\u03b3) samples from each slice we achieve sublinear time.\nWe can also apply such an assumption to symmetric tensors.\nEmpirical Results: One of the main strengths of our work is our empirical results. In each iteration\nwe approximate T(I, u, u) a total of B times independently and take the median to increase our\ncon\ufb01dence. In the notation of [23], B corresponds to the number of independent sketches used.\nWhile the median works empirically, there are some theoretical issues with it discussed in Remark 4.\nAlso let b be the total number of (cid:96)2-samples we take per iteration, which corresponds to the sketch\nsize in the notation of [23]. We found that empirically we can set B and b to be much smaller than\nthat in [23] and achieve the same error guarantees. 
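The role of B can be sketched concretely: take B independent importance-sampling estimates, each from a share of the b samples, and return their median. The toy version below is our own illustration (names and parameters are ours), not the paper's implementation:

```python
import bisect
import random
import statistics

def l2_estimate(u, v, rng, num_samples):
    # One importance-sampling estimate of <u, v>, l2-sampling indices from u.
    cum = []
    acc = 0.0
    for x in u:
        acc += x * x
        cum.append(acc)
    norm_sq = cum[-1]
    total = 0.0
    for _ in range(num_samples):
        i = bisect.bisect_left(cum, rng.uniform(0.0, norm_sq))
        i = min(i, len(u) - 1)
        total += v[i] * norm_sq / u[i]
    return total / num_samples

def median_of_estimates(u, v, b, B, seed=0):
    # Median of B independent estimates of b // B samples each; the median
    # damps the effect of an occasional outlier estimate.
    rng = random.Random(seed)
    return statistics.median(l2_estimate(u, v, rng, b // B) for _ in range(B))
```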
One explanation for this is that the variance bound we obtain via importance sampling is a factor of 4³ = 64 smaller than in [23], and for p-th order tensors, a factor of 4^p smaller.

To give an idea of how much smaller we can set b and B, to achieve roughly the same squared residual norm error on the synthetic data sets of dimension 1200 for finding a good rank-1 approximation, the algorithm of [23] would need to set parameters b = 2^16 and B = 50, whereas we can set b = 10 × 1200 and B = 5. Our running time is 2.595 seconds and we have no preprocessing time, whereas the algorithm of [23] has a running time of 116.3 seconds and 55.34 seconds of preprocessing time. We refer the reader to Table 1 in Section 3. In total we are over 50 times faster.

We also demonstrate our algorithm in a real-world application using real datasets, even when the datasets are sparse. Namely, we consider a spectral algorithm for Latent Dirichlet Allocation [1, 2] which uses tensor decomposition as its core computational step. We show a significant speedup can be achieved on tensors occurring in applications such as LDA, and we refer the reader to Table 2 in Section 3. For example, on the wiki [23] dataset with a tensor dimension of 200, we run more than 5 times faster than the sketching-based method.

Previous Sampling Algorithms: Previous sampling-based schemes of [17, 4] do not achieve our guarantees, because [17] uses uniform sampling, which does not work for tensors with spiky elements, while the non-uniform sampling in [4] requires touching all of the entries in the tensor and making two passes over it.

Notation: Let [n] denote {1, 2, . . . , n}. Let ⊗ denote the outer product, and u^{⊗3} = u ⊗ u ⊗ u. Let T ∈ R^{n^p}, where p is the order of tensor T and n is the dimension of tensor T. 
Let ⟨A, B⟩ denote the entry-wise inner product between two tensors A, B ∈ R^{n^p}, i.e., ⟨A, B⟩ = Σ_{i₁=1}^n Σ_{i₂=1}^n ··· Σ_{i_p=1}^n A_{i₁,i₂,···,i_p} · B_{i₁,i₂,···,i_p}. For a tensor A ∈ R^{n^p}, ‖A‖_F = (Σ_{i₁=1}^n Σ_{i₂=1}^n ··· Σ_{i_p=1}^n A_{i₁,···,i_p}²)^{1/2}. For a random variable X, let E[X] denote its expectation and V[X] its variance (if these quantities exist).

2 Main Results

We explain the details of our main results in this section. First, we state the importance sampling lemmas for our tensor application. Second, we explain how to quickly produce a list of random tuples according to a certain distribution needed by our algorithm. Third, we combine the first and the second parts to get a fast way of approximating tensor contractions, which are used as subroutines in each iteration of the robust tensor power method. We then provide our main theoretical results, and show how to estimate the slice norms needed by our main algorithm.

Importance sampling lemmas. Approximating an inner product is a simple application of importance sampling. Tensor contraction T(u, v, w) can be regarded as the inner product between two n³-dimensional vectors, and thus importance sampling can be applied. Lemma 1 suggests that we can take a few samples according to their importance, e.g., we can sample T_{i,j,k} u_i v_j w_k with probability |u_i v_j w_k|²/(‖u‖₂²‖v‖₂²‖w‖₂²). As long as the number of samples is large enough, it will approximate the true tensor contraction Σ_{i,j,k} T_{i,j,k} u_i v_j w_k with small variance after a final rescaling.

Lemma 1. Suppose random variable X = T_{i,j,k} u_i v_j w_k/(p_i q_j r_k) with probability p_i q_j r_k, where p_i = |u_i|²/‖u‖₂², q_j = |v_j|²/‖v‖₂², and r_k = |w_k|²/‖w‖₂², and we take L i.i.d. samples of X, denoted X₁, X₂, ···, X_L. Let Y = (1/L) Σ_{ℓ=1}^L X_ℓ. Then (1) E[Y] = ⟨T, u ⊗ v ⊗ w⟩, and (2) V[Y] ≤ (1/L)‖T‖_F² · ‖u ⊗ v ⊗ w‖_F².

Similarly, we also have importance sampling for each slice T_{i,*,*}, i.e., "face" of T.

Lemma 2. For all i ∈ [n], suppose random variable X^i = T_{i,j,k} v_j w_k/(q_j r_k) with probability q_j r_k, where q_j = |v_j|²/‖v‖₂² and r_k = |w_k|²/‖w‖₂², and we take L_i i.i.d. samples of X^i, say X^i_1, X^i_2, ···, X^i_{L_i}. Let Y^i = (1/L_i) Σ_{ℓ=1}^{L_i} X^i_ℓ. Then (1) E[Y^i] = ⟨T_{i,*,*}, v ⊗ w⟩ and (2) V[Y^i] ≤ (1/L_i)‖T_{i,*,*}‖_F² ‖v ⊗ w‖_F².

Generating importance samples in linear time. We need an efficient way to sample indices of a vector based on their importance. We view this problem as follows: imagine [0, 1] is divided into z "bins" with different lengths corresponding to the probability of selecting each bin, where z is the number of indices in a probability vector. We generate m random numbers uniformly from [0, 1] and see which bin each random number belongs to. If a random number is in bin i, we sample the i-th index of a vector. There are known algorithms [6, 19] to solve this problem in O(z + m) time.

We give an alternative algorithm GENRANDTUPLES. Our algorithm combines Bentley and Saxe's algorithm [3] for efficiently generating m sorted random numbers in O(m) time, and Knuth's shuffling algorithm [12] for generating a random permutation of [m] in O(m) time. We use the notation CUMPROB(v, w) and CUMPROB(u, v, w) for the algorithm creating the distributions on R^{n²} and R^{n³} of Lemma 2 and Lemma 1, respectively. 
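These two ingredients can be sketched compactly (our own illustration, with hypothetical function names): Bentley and Saxe's method emits m uniforms that are already sorted, in O(m) time, by walking down from the maximum order statistic; a single sweep then matches the sorted numbers against the z cumulative bin boundaries in O(z + m) total time; and a final Knuth (Fisher-Yates) shuffle restores a random order.

```python
import random

def sorted_uniforms(m, rng):
    # Bentley & Saxe: m uniforms in sorted order, O(m) time, no sorting step.
    # cur *= U^{1/k} walks down from the maximum to the k-th order statistic.
    out = [0.0] * m
    cur = 1.0
    for k in range(m, 0, -1):
        cur *= rng.random() ** (1.0 / k)
        out[k - 1] = cur
    return out

def sample_bins(probs, m, rng):
    # Draw m indices from distribution `probs` in O(len(probs) + m):
    # sweep the sorted uniforms once against the cumulative bin boundaries.
    xs = sorted_uniforms(m, rng)
    res = [0] * m
    acc, i = 0.0, 0
    for idx, p in enumerate(probs):
        acc += p
        while i < m and xs[i] <= acc:
            res[i] = idx
            i += 1
    while i < m:  # guard against floating-point round-off in acc
        res[i] = len(probs) - 1
        i += 1
    rng.shuffle(res)  # Knuth shuffle: random order in O(m)
    return res
```

Because the uniforms arrive sorted, no binary search per sample is needed, which is the source of the O(z + m) rather than O(m log z) cost.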
We note that naïvely applying previous algorithms would require z = O(n²) and z = O(n³) time to form these two distributions, but we can take O(m) samples from them implicitly in O(n + m) time.

Fast approximate tensor contractions. We propose a fast way to approximately compute tensor contractions T(I, v, w) and T(u, v, w) with a sublinear number of samples of T, as shown in Algorithm 1 and Algorithm 2. Naïvely computing tensor contractions using all of the entries of T gives an exact answer but could take n³ time. Also, to keep our algorithm sublinear time, we never explicitly compute the deflated tensor; rather we represent it implicitly and sample from it.

Algorithm 1 Subroutine for approximate tensor contraction T(I, v, w)
1: function APPROXTIVW(T, v, w, n, B, {b̂_i})
2: (q̃, r̃) ← CUMPROB(v, w)
3: for d = 1 → B do
4:   L ← GENRANDTUPLES(Σ_{i=1}^n b̂_i, q̃, r̃)
5:   for i = 1 → n do
6:     s_i^(d) ← 0
7:     for ℓ = 1 → b̂_i do
8:       (j, k) ← L_{(i−1)b+ℓ}
9:       s_i^(d) ← s_i^(d) + T_{i,j,k}·v_j·w_k/(q_j·r_k)
10: T̂(I, v, w)_i ← median_{d∈[B]} s_i^(d)/b̂_i, ∀i ∈ [n]
11: return T̂(I, v, w)

Algorithm 2 Subroutine for approximate tensor contraction T(u, v, w)
1: function APPROXTUVW(T, u, v, w, n, B, b̂)
2: (p̃, q̃, r̃) ← CUMPROB(u, v, w)
3: for d = 1 → B do
4:   L ← GENRANDTUPLES(b̂, p̃, q̃, r̃)
5:   s^(d) ← 0
6:   for (i, j, k) ∈ L do
7:     s^(d) ← s^(d) + T_{i,j,k}·u_i·v_j·w_k/(p_i·q_j·r_k)
8:   s^(d) ← s^(d)/b̂
9: T̂(u, v, w) ← median_{d∈[B]} s^(d)
10: return T̂(u, v, w)

The following theorem gives the error bounds of APPROXTIVW and APPROXTUVW (in Algorithms 1 and 2). Let b̂_i be the number of samples we take from slice i ∈ [n] in APPROXTIVW, and let b̂ denote the total number of samples in our algorithm.

Theorem 3. For T ∈ R^{n×n×n} and u ∈ R^n with ‖u‖₂ = 1, define the number ε_{1,T}(u) = T̂(u, u, u) − T(u, u, u) and the vector ε_{2,T}(u) = T̂(I, u, u) − T(I, u, u). For any b > 0, if b̂_i ≳ b·‖T_{i,*,*}‖_F²/‖T‖_F² for all i ∈ [n], then the following bounds hold¹:

E[|ε_{1,T}(u)|²] = O(‖T‖_F²/b), and E[‖ε_{2,T}(u)‖₂²] = O(n‖T‖_F²/b).  (1)

In addition, for any fixed ω ∈ R^n with ‖ω‖₂ = 1, E[⟨ω, ε_{2,T}(u)⟩²] = O(‖T‖_F²/b). Eq. (1) can be obtained by observing that each random variable [ε_{2,T}(u)]_i is independent, and so

V[⟨ω, ε_{2,T}(u)⟩] = Σ_{i=1}^n ω_i²·‖T_{i,*,*}‖_F²/b̂_i ≲ (Σ_{i=1}^n ω_i²)·‖T‖_F²/b = ‖T‖_F²/b.

Remark 4. In [23], the coordinate-wise median of B estimates of T(I, v, w) is used to boost the success probability. There appears to be a gap [21] in their argument, as it is unclear how to achieve (1) after taking a coordinate-wise median, which is (7) in Theorem 1 of [23]. To fix this, we instead pay a factor proportional to the number of iterations in Algorithm 3 in the sample complexity b̂. Since we have expectation bounds on the quantities in Theorem 3, we can apply a Markov bound and a union bound across all iterations. 
This suffices for our main theorem concerning sublinear time below. One can obtain high probability bounds by running Algorithm 3 multiple times independently, and taking coordinate-wise medians of the output eigenvectors. Empirically, our algorithm works even if we take the median in each iteration, which is done in line 10 in Algorithm 1.

Replacing Theorem 1 in [23] by our Theorem 3, the rest of the analysis in [23] is unchanged. Our Algorithm 3 is the same as the sketching-based robust tensor power method in [23], except for lines 10, 12, 15, and 17, where the sketching-based approximate tensor contraction is replaced by our importance sampling procedures APPROXTUVW and APPROXTIVW. Rather than use Theorem 2 of Wang et al. [23], the main theorem concerning the correctness of the robust tensor decomposition algorithm, we use a recent improvement of it by Wang and Anandkumar in Theorems 4.1 and 4.2 of [22], which states general guarantees for any algorithm satisfying per iteration noise guarantees. These theorems also remove many of the earlier eigenvalue assumptions in Theorem 2 of [23].

Theorem 5 (Theorems 4.1 and 4.2 of [22]). Suppose T = T* + E, where T* = Σ_{i=1}^k λ_i v_i^{⊗3} with λ_i > 0 and orthonormal basis vectors {v₁, . . . , v_k} ⊆ R^n, n ≥ k. Let λ_max, λ_min be the largest and smallest values in {λ_i}_{i=1}^k, and let {λ̂_i, v̂_i}_{i=1}^k be outputs of the robust tensor power method. There exist absolute constants K₀, C₀, C₁, C₂, C₃ > 0 such that if E satisfies

|E(v_i, u_t^(τ), u_t^(τ))| ≤ min{ε/√k, C₀λ_min/n},  ‖E(I, u_t^(τ), u_t^(τ))‖₂ ≤ ε,  (2)

for all i ∈ [k], t ∈ [T], and τ ∈ [L], and furthermore

ε ≤ C₁·λ_min/√k,  T = Ω(log(λ_max n/ε)),  L ≥ max{K₀, k} log(max{K₀, k}),

then with probability at least 9/10, there exists a permutation π : [k] → [k] such that

‖v_i − v̂_{π(i)}‖₂ ≤ C₃ε/λ_i,  |λ_i − λ̂_{π(i)}| ≤ C₂ε,  ∀i = 1, ···, k.

¹For two functions f, g, we use the shorthand f ≲ g (resp. ≳) to indicate that f ≤ Cg (resp. ≥) for some absolute constant C.

Algorithm 3 Our main algorithm
1: function IMPORTANCESAMPLINGRB(T, n, B, b)
2: if s_i are known, where ‖T_{i,*,*}‖_F² ≲ s_i, then
3:   b̂_i ← b·s_i/‖T‖_F², ∀i ∈ [n]
4: else
5:   b̂_i ← b/n, ∀i ∈ [n]
6: b̂ = Σ_{i=1}^n b̂_i
7: for ℓ = 1 → L do
8:   u^(ℓ) ← INITIALIZATION
9:   for t = 1 → T do
10:    u^(ℓ) ← APPROXTIVW(T, u^(ℓ), u^(ℓ), n, B, {b̂_i})
11:    u^(ℓ) ← u^(ℓ)/‖u^(ℓ)‖₂
12:  λ^(ℓ) ← APPROXTUVW(T, u^(ℓ), u^(ℓ), u^(ℓ), n, B, b̂)
13: ℓ* ← arg max_{ℓ∈[L]} λ^(ℓ), u* ← u^(ℓ*)
14: for t = 1 → T do
15:   u* ← APPROXTIVW(T, u*, u*, n, B, {b̂_i})
16:   u* ← u*/‖u*‖₂
17: λ* ← APPROXTUVW(T, u*, u*, u*, n, B, b̂)
18: return λ*, u*

Figure 1: Running time with growing dimension. (a) Sketching vs. importance sampling. (b) Preprocessing time.

Combining the previous theorem with our importance sampling analysis, we obtain:

Theorem 6 (Main). Assume the notation of Theorem 5. For each j ∈ [k], suppose we take b̂^(j) = Σ_{i=1}^n b̂_i^(j) samples during the power iterations for recovering λ̂_j and v̂_j, where the number of samples for slice i is b̂_i^(j) ≳ bkT·‖[T − Σ_{l=1}^{j−1} λ̂_l v̂_l^{⊗3}]_{i,*,*}‖_F² / ‖T − Σ_{l=1}^{j−1} λ̂_l v̂_l^{⊗3}‖_F², where b ≳ n‖T‖_F²/min{ε/√k, λ_min/n}². Then the output guarantees of Theorem 5 hold for Algorithm 3 with constant probability. Our total time is O(LTk²b̂) and the space is O(nk), where b̂ = max_{j∈[k]} b̂^(j).

In Theorem 3, if we require b̂_i = b‖T_{i,*,*}‖_F²/‖T‖_F², we need to scan the entire tensor to compute ‖T_{i,*,*}‖_F², making our algorithm not sublinear. With the following mild assumption, our algorithm is sublinear when sampling uniformly (b̂_i = b/n) without computing ‖T_{i,*,*}‖_F²:

Theorem 7 (Bounded slice norm). 
There is a constant α > 0, a constant β ∈ (0, 1], and a sufficiently small constant γ > 0, such that, for any 3rd order tensor T = T* + E ∈ R^{n³} with rank(T*) ≤ n^γ and λ_k ≥ 1/n^γ, if ‖T_{i,*,*}‖_F² ≤ (1/n^β)·‖T‖_F² for all i ∈ [n], and E satisfies (2), then Algorithm 3 runs in O(n^{3−α}) time.

The condition β ∈ (0, 1] is a practical one. When β = 1, all tensor slices have equal Frobenius norm. The case β = 0 only occurs when ‖T_{i,*,*}‖_F = ‖T‖_F, i.e., all except one slice is zero. This theorem can also be applied to asymmetric tensors, since the analysis in [23] can be extended to them.

For certain cases, we can remove the bounded slice norm assumption. The idea is to take a sublinear number of samples from the tensor to obtain upper bounds on all slice norms. In the full version, we extend the algorithm and analysis of the robust tensor power method to p > 3 by replacing contractions T(u, v, w) and T(I, v, w) with T(u₁, u₂, ···, u_p) and T(I, u₂, ···, u_p). As outlined in Section 1, when p is even, because we do not have sign cancellations we can show:

Theorem 8 (Even order). There is a constant α > 0 and a sufficiently small constant γ > 0, such that the following holds for any even order-p tensor T = T* + E ∈ R^{n^p} with rank(T*) ≤ n^γ, p ≤ n^γ, and λ_k ≥ 1/n^γ. For any sufficiently large constant c₀, there exists a sufficiently small constant c > 0 such that for any ε ∈ (0, cλ_k/(c₀p²kn^{(p−2)/2})), if E satisfies ‖E‖₂ ≤ ε/(c₀√n), Algorithm 3 runs in O(n^{p−α}) time.

As outlined in Section 1, for p = 3 and small k we can take sign considerations into account:

Theorem 9 (Low rank). There is a constant α > 0 and a sufficiently small constant γ > 0 such that for any symmetric tensor T = T* + E ∈ R^{n³} with E satisfying (2), rank(T*) ≤ 2, and λ_k ≥ 1/n^γ, Algorithm 3 runs in O(n^{3−α}) time.

3 Experiments

3.1 Experiment Setup and Datasets

Our implementation shares the same code base¹ as the sketching-based robust tensor power method proposed in [23]. We ran our experiments on an i7-5820K CPU with 64 GB of memory in single-threaded mode. We ran two versions of our algorithm: the version with pre-scanning scans the full tensor to accurately measure per-slice Frobenius norms and makes samples for each slice in proportion to its Frobenius norm in APPROXTIVW; the version without pre-scanning assumes that the Frobenius norm of each slice is bounded by (1/n^α)·‖T‖_F², α ∈ (0, 1], and uses b/n samples per slice, where b is the total number of samples our algorithm makes, analogous to the sketch length b in [23].

Synthetic datasets. 
We \ufb01rst generated an orthonormal basis {vi}k\ni=1 and then computed the synthetic\n, with \u03bb1 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbk. Then we normalized T\u2217 such that (cid:107) T\u2217 (cid:107)F = 1,\nn1.5 ) for i \u2264 j \u2264 l. Then\nand added a symmetric Gaussian noise tensor E where Eijl \u223c N (0, \u03c3\n\u03c3 controls the noise-to-signal ratio and we kept it as 0.01 in all our synthetic tensors. For the\neigenvalues \u03bbi, we generated three different decays: inverse decay \u03bbi = 1\ni , inverse square decay\nk . We also set k = 100 when generating tensors, since higher\n\u03bbi = 1\nrank eigenvalues were almost indistinguishable from the added noise. To show the scalability of our\nalgorithm, we generated tensors with different dimensions: n = 200, 400, 600, 800, 1000, 1200.\nReal-life datasets. Latent Dirichlet Allocation [5] (LDA) is a powerful generative statistical model\nfor topic modeling. A spectral method has been proposed to solve LDA models [1, 2] and the most\ncritical step in spectral LDA is to decompose a symmetric K \u00d7 K \u00d7 K tensor with orthogonal\neigenvectors, where K is the number of modeled topics. We followed the steps in [1, 18] and built\na K \u00d7 K \u00d7 K tensor TLDA for each dataset, and then ran our algorithms directly on TLDA to see\nhow it works on those tensors in real applications. In our experiments we keep K = 200. We used\nthe two same datasets as the previous work [23]: Wiki and Enron, as well as four additional real-life\ndatasets. We refer the reader to our GitHub repository 2 for our code and full results.\n3.2 Results\nWe considered running time and the squared residual norm to evaluate the performance of our\nF denote the squared residual\nnorm where {(\u03bb1, u1, v1, w1),\u00b7\u00b7\u00b7 , (\u03bbk, uk, vk, wk)} are the eigenvalue/eigenvectors obtained by the\nrobust power method. 
To reduce the experiment time we looked only for the first eigenvalue and eigenvector, but our algorithm is capable of finding any number of eigenvalues/eigenvectors. We list the pre-scanning time as preprocessing time in the tables. It depends only on the tensor dimension $n$ and, unlike the sketching-based method, it does not depend on $b$. Pre-scanning time is very short, because it requires only one pass of sequential access over the tensor, which is very efficient on hardware.

Sublinear time verification. Our theoretical result suggests that the total number of samples $b_{\text{no-prescan}}$ for our algorithm without pre-scanning is $n^{1-\alpha}$ ($\alpha \in (0, 1]$) times larger than $b_{\text{prescan}}$ for our algorithm with pre-scanning. But in experiments we observe that when $b_{\text{no-prescan}} = b_{\text{prescan}}$ both algorithms achieve very similar accuracy, indicating that in practice $\alpha \approx 1$.

Synthetic datasets. We ran our algorithm on a large number of synthetic tensors with different dimensions and different eigengaps. Table 1 shows results for a tensor with 1200 dimensions and 100 non-zero eigenvalues decaying as $\lambda_i = \frac{1}{i^2}$. To reach roughly the same residual norm, the running time of our algorithm is over 50 times faster than that of the sketching-based robust tensor power method, thanks to the fact that we usually need a relatively small $B$ and $b$ to get a good residual, and the hidden constant factor in the running time of sampling is much smaller than that of sketching.

Our algorithm scales well on large tensors due to its sublinear nature. In Figure 1(a), for the sketching-based method we kept $b = 2^{16}$, with $B = 30$ for $n \leq 800$ and $B = 50$ for $n > 800$ (larger $n$ requires more sketches to observe a reasonable recovery). For our algorithm, we chose $b$ and $B$ such that for each $n$, our residual norm is on par with or better than the sketching-based method. Our algorithm needs much less time than the sketching-based one over all dimensions. Another advantage of our algorithm is that it has zero or very minimal preprocessing steps. In Figure 1(b), we can see how the preprocessing time needed to prepare sketches grows as the dimension increases. For applications where only the first few eigenvectors are needed, the preprocessing time could be a large overhead.

Real-life datasets. Due to the small tensor dimension (200), our algorithm shows less speedup here than on synthetic data, but it is still 2∼6 times faster than the sketching-based method on each of the six real-life datasets, achieving the same squared residual norm. Table 2 reports results for one of the datasets under many different settings of $(b, B)$.

¹http://yining-wang.com/fftlda-code.zip
²https://github.com/huanzhang12/sampling_tensor_decomp/
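The two sampling variants compared above can be made concrete with a toy importance-sampling estimator of the contraction $T(I, v, w)_i = \sum_{j,l} T_{ijl} v_j w_l$, the workhorse of the power iteration: the total budget $b$ is split evenly across slices (the no-pre-scan variant) or in proportion to squared slice Frobenius norms (the pre-scan variant). This is a simplified sketch for illustration only, not the paper's APPROXTIVW routine, and the function name is our own.

```python
import numpy as np

def approx_T_I_vw(T, v, w, b, slice_norms=None, rng=None):
    """Unbiased estimate of T(I, v, w)_i = sum_{j,l} T[i,j,l] * v[j] * w[l].

    Each slice i gets m_i samples: m_i = b/n when slice_norms is None (no
    pre-scan), or m_i proportional to ||T_{i,*,*}||_F^2 (pre-scan). Within a
    slice, (j, l) is drawn uniformly, so n^2 * mean(T[i,j,l] v[j] w[l]) is an
    unbiased estimate of the slice's sum.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = T.shape[0]
    if slice_norms is None:
        counts = np.full(n, max(1, b // n))              # b/n samples per slice
    else:
        p = slice_norms ** 2 / np.sum(slice_norms ** 2)
        counts = np.maximum(1, (b * p).astype(int))      # proportional allocation
    out = np.empty(n)
    for i in range(n):
        j = rng.integers(0, n, size=counts[i])
        l = rng.integers(0, n, size=counts[i])
        out[i] = n * n * np.mean(T[i, j, l] * v[j] * w[l])
    return out
```

The exact contraction is `np.einsum('ijl,j,l->i', T, v, w)`, and the pre-scanned slice norms are `np.linalg.norm(T.reshape(n, -1), axis=1)`; concentrating the budget on heavy slices lowers the variance when the slice norms are far from uniform.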
Like in the synthetic datasets, we also empirically observe that the constant $b$ needed in importance sampling is much smaller than the $b$ used in sketching to get the same error guarantee.

Sketching-based robust power method: n = 1200, λ_i = 1/i²
   b      Squared residual norm        Running time (s)         Preprocessing time (s)
          B=10     B=30     B=50       B=10    B=30    B=50     B=10    B=30    B=50
  2^10    1.010    1.014    0.5437     0.6114  2.423   4.374    5.361   15.85   26.08
  2^12    1.020    0.2271   0.1549     1.344   4.563   8.022    5.978   17.23   28.31
  2^14    0.1513   0.1097   0.1003     4.928   15.51   27.87    8.788   24.72   40.4
  2^16    0.1065   0.09242  0.08936    22.28   69.7    116.3    13.76   34.74   55.34

Importance-sampling-based robust power method (without pre-scanning): n = 1200, λ_i = 1/i²
   b      Squared residual norm        Running time (s)         Preprocessing time (s)
          B=10     B=30     B=50       B=10    B=30    B=50     B=10    B=30    B=50
  5n      0.08684  0.08637  0.08639    2.595   8.3     15.46    0.0     0.0     0.0
  10n     0.08784  0.08671  0.08627    4.42    13.68   25.84    0.0     0.0     0.0
  20n     0.08704  0.08700  0.08618    8.02    24.51   46.37    0.0     0.0     0.0
  30n     0.08697  0.08645  0.08625    11.63   35.35   66.71    0.0     0.0     0.0
  40n     0.08653  0.08664  0.08611    15.19   46.12   87.24    0.0     0.0     0.0

Importance-sampling-based robust power method (with pre-scanning): n = 1200, λ_i = 1/i²
   b      Squared residual norm        Running time (s)         Preprocessing time (s)
          B=10     B=30     B=50       B=10    B=30    B=50     B=10    B=30    B=50
  5n      0.08657  0.08684  0.08636    3.1     10.47   18       2.234   2.236   2.234
  10n     0.08741  0.08677  0.08668    5.427   17.43   30.26    2.232   2.233   2.233
  20n     0.08648  0.08624  0.08634    9.843   31.42   54.49    2.226   2.226   2.226
  30n     0.08635  0.08634  0.08615    14.33   45.4    63.85    2.226   2.224   2.227
  40n     0.08622  0.08652  0.08619    18.68   59.32   82.83    2.225   2.225   2.225

Table 1: Synthetic tensor decomposition using the robust tensor power method. We use an order-3 normalized dense tensor with dimension n = 1200 and σ = 0.01 noise added. We run sketching-based and sampling-based methods to find the first eigenvalue and eigenvector by setting L = 50, T = 30 and varying B and b.

Sketching-based robust power method: dataset wiki, ‖T‖²_F = 2.135e+07
   b      Squared residual norm     Running time (s)     Preprocessing time (s)
          B=10       B=30           B=10     B=30        B=10     B=30
  2^10    2.091e+07  1.951e+07      0.2346   0.8749      0.3698   1.146
  2^11    1.971e+07  1.938e+07      0.4354   1.439       0.5623   1.623
  2^12    1.947e+07  1.930e+07      1.035    2.912       0.9767   2.729
  2^13    1.931e+07  1.927e+07      2.04     5.94        1.286    3.699
  2^14    1.928e+07  1.926e+07      4.577    13.93       1.692    4.552

Importance-sampling-based robust power method (without pre-scanning): dataset wiki, ‖T‖²_F = 2.135e+07
   b      Squared residual norm     Running time (s)     Preprocessing time (s)
          B=10       B=30           B=10     B=30        B=10     B=30
  5n      1.931e+07  1.928e+07      0.1727   0.2535      0.0      0.0
  10n     1.931e+07  1.929e+07      0.2408   0.3167      0.0      0.0
  20n     1.935e+07  1.926e+07      0.4226   0.4275      0.0      0.0
  30n     1.929e+07  1.926e+07      0.5783   0.6493      0.0      0.0
  40n     1.928e+07  1.925e+07      1.045    1.121       0.0      0.0

Importance-sampling-based robust power method (with pre-scanning): dataset wiki, ‖T‖²_F = 2.135e+07
   b      Squared residual norm     Running time (s)     Preprocessing time (s)
          B=10       B=30           B=10     B=30        B=10      B=30
  5n      1.931e+07  1.930e+07      0.4376   1.168       0.01038   0.01103
  10n     1.928e+07  1.930e+07      0.6357   1.8         0.0104    0.01044
  20n     1.931e+07  1.927e+07      1.083    2.962       0.01102   0.01042
  30n     1.929e+07  1.925e+07      1.457    4.049       0.01102   0.01043
  40n     1.929e+07  1.925e+07      1.905    5.246       0.01105   0.01105

Table 2: Tensor decomposition in LDA on the wiki dataset. The tensor is generated by spectral LDA with dimension 200 × 200 × 200. It is symmetric but not normalized. We fix L = 50, T = 30 and vary B and b.

References

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 15(1):2773–2832, 2014.

[2] A. Anandkumar, Y.-k. Liu, D. J. Hsu, D. P. Foster, and S. M. Kakade. A spectral algorithm for latent Dirichlet allocation. In NIPS, pages 917–925, 2012.

[3] J. L. Bentley and J. B. Saxe. Generating sorted lists of random numbers. ACM Transactions on Mathematical Software (TOMS), 6(3):359–364, 1980.

[4] S. Bhojanapalli and S. Sanghavi. A new sampling technique for tensors. CoRR, abs/1502.05023, 2015.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan.
Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.

[6] K. Bringmann and K. Panagiotou. Efficient sampling methods for discrete distributions. In International Colloquium on Automata, Languages, and Programming, pages 133–144. Springer, 2012.

[7] J. H. Choi and S. Vishwanathan. DFacTo: Distributed factorization of tensors. In NIPS, pages 1296–1304, 2014.

[8] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. J. ACM, 59(5):23, 2012.

[9] R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.

[10] F. Huang, U. N. Niranjan, M. U. Hakeem, P. Verma, and A. Anandkumar. Fast detection of overlapping communities via online tensor methods on GPUs. CoRR, abs/1309.0787, 2013.

[11] U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries. In KDD, pages 316–324, 2012.

[12] D. E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA, pages 229–279, 1969.

[13] A. Moitra. Tensor decompositions and their applications, 2014.

[14] M. Monemizadeh and D. P. Woodruff. 1-pass relative-error Lp-sampling with applications. In SODA, pages 1143–1160, 2010.

[15] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In KDD, pages 239–247, 2013.

[16] A. H. Phan, P. Tichavský, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834–4846, 2013.

[17] C. E. Tsourakakis. MACH: fast randomized tensor decompositions. In SDM, pages 689–700, 2010.

[18] H.-Y. F. Tung, C.-Y. Wu, M. Zaheer, and A. J. Smola. Spectral methods for the hierarchical Dirichlet process. 2015.

[19] A. J. Walker.
An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253–256, 1977.

[20] C. Wang, X. Liu, Y. Song, and J. Han. Scalable moment-based inference for latent Dirichlet allocation. In ECML-PKDD, pages 290–305, 2014.

[21] Y. Wang. Personal communication, 2016.

[22] Y. Wang and A. Anandkumar. Online and differentially-private tensor decomposition. CoRR, abs/1606.06237, 2016.

[23] Y. Wang, H.-Y. Tung, A. J. Smola, and A. Anandkumar. Fast and guaranteed tensor decomposition via sketching. In NIPS, pages 991–999, 2015.