{"title": "Fitting Low-Rank Tensors in Constant Time", "book": "Advances in Neural Information Processing Systems", "page_first": 2473, "page_last": 2481, "abstract": "In this paper, we develop an algorithm that approximates the residual error of Tucker decomposition, one of the most popular tensor decomposition methods, with a provable guarantee. Given an order-$K$ tensor $X\\in\\mathbb{R}^{N_1\\times\\cdots\\times N_K}$, our algorithm randomly samples a constant number $s$ of indices for each mode and creates a ``mini'' tensor $\\tilde{X}\\in\\mathbb{R}^{s\\times\\cdots\\times s}$, whose elements are given by the intersection of the sampled indices on $X$. Then, we show that the residual error of the Tucker decomposition of $\\tilde{X}$ is sufficiently close to that of $X$ with high probability. This result implies that we can figure out how much we can fit a low-rank tensor to $X$ \\emph{in constant time}, regardless of the size of $X$. This is useful for guessing the favorable rank of Tucker decomposition. Finally, we demonstrate how the sampling method works quickly and accurately using multiple real datasets.", "full_text": "Fitting Low-Rank Tensors in Constant Time\n\nNational Institute of Advanced Industrial Science and Technology\n\nKohei Hayashi\u2217\n\nRIKEN AIP\n\nhayashi.kohei@gmail.com\n\nYuichi Yoshida\u2020\n\nNational Institute of Informatics\n\nyyoshida@nii.ac.jp\n\nAbstract\n\nIn this paper, we develop an algorithm that approximates the residual error of\nTucker decomposition, one of the most popular tensor decomposition methods,\nwith a provable guarantee. Given an order-K tensor X \u2208 RN1\u00d7\u00b7\u00b7\u00b7\u00d7NK , our\nalgorithm randomly samples a constant number s of indices for each mode and\ncreates a \u201cmini\u201d tensor \u02dcX \u2208 Rs\u00d7\u00b7\u00b7\u00b7\u00d7s, whose elements are given by the intersection\nof the sampled indices on X. 
Then, we show that the residual error of the Tucker\ndecomposition of $\\tilde{X}$ is sufficiently close to that of $X$ with high probability. This\nresult implies that we can figure out how much we can fit a low-rank tensor to $X$ in\nconstant time, regardless of the size of $X$. This is useful for guessing the favorable\nrank of Tucker decomposition. Finally, we demonstrate how the sampling method\nworks quickly and accurately using multiple real datasets.\n\n1 Introduction\n\nTensor decomposition is a fundamental tool for dealing with array-structured data. Using tensor\ndecomposition, a tensor (or a multidimensional array) is approximated with multiple tensors in\nlower-dimensional space using a multilinear operation. This drastically reduces disk and memory\nusage. We say that a tensor is of order K if it is a K-dimensional array; each dimension is called a\nmode in tensor terminology.\nAmong the many existing tensor decomposition methods, Tucker decomposition [18] is a popular\nchoice. To some extent, Tucker decomposition is analogous to singular-value decomposition (SVD):\nas SVD decomposes a matrix into left and right singular vectors that interact via singular values,\nTucker decomposition of an order-K tensor consists of K factor matrices that interact via the\nso-called core tensor. The key difference between SVD and Tucker decomposition is that, with the latter,\nthe core tensor need not be diagonal and its \u201crank\u201d can differ for each mode k = 1, . . . , K. In this\npaper, we refer to the size of the core tensor, which is a K-tuple, as the Tucker rank of a Tucker\ndecomposition.\nWe are usually interested in obtaining factor matrices and a core tensor to minimize the residual error, that is,\nthe error between the input and low-rank approximated tensors. Sometimes, however, knowing the\nresidual error itself is an important task. 
The residual error tells us how well the low-rank approximation\nsuits the input tensor, and is particularly useful for predetermining the Tucker rank. In real\napplications, Tucker ranks are not explicitly given, and we must select them by considering the\nbalance between space usage and approximation accuracy. For example, if the selected Tucker rank\nis too small, we risk losing essential information in the input tensor. On the other hand, if the selected\nTucker rank is too large, the computational cost of computing the Tucker decomposition (even if we\nallow for approximation methods) increases considerably along with space usage. As with the case of\nthe matrix rank, one might think that a reasonably good Tucker rank can be found using a grid search.\nUnfortunately, grid searches for Tucker ranks are challenging because, for an order-K tensor, the\nTucker rank consists of K free parameters and the search space grows exponentially in K. Hence,\nwe want to evaluate each grid point as quickly as possible.\n\n\u2217Supported by ONR N62909-17-1-2138.\n\u2020Supported by JSPS KAKENHI Grant Number JP17H04676 and JST ERATO Grant Number JPMJER1305,\nJapan.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nAlgorithm 1\nInput: Random access to a tensor $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$, Tucker rank $(R_1, \\ldots, R_K)$, and $\\epsilon, \\delta \\in (0, 1)$.\nfor k = 1 to K do\n    $S_k \\leftarrow$ a sequence of $s = s(\\epsilon, \\delta)$ indices uniformly and independently sampled from $[N_k]$.\nConstruct a mini-tensor $X|_{S_1,\\ldots,S_K}$.\nReturn $\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K})$.\n\nHowever, although several practical algorithms have been proposed, such as the higher-order\northogonal iteration (HOOI) [7], they are not sufficiently scalable. For each mode, HOOI iteratively\napplies SVD to an unfolded tensor, a matrix that is reshaped from the input tensor. 
Given an $N_1 \\times \\cdots \\times N_K$ tensor, the computational cost is hence $O(K \\max_k N_k \\cdot \\prod_k N_k)$, which depends crucially\non the input size $N_1, \\ldots, N_K$. Although there are several approximation algorithms [8, 21, 17], their\ncomputational costs are still intensive. Consequently, we cannot search for good Tucker ranks. Rather,\nwe can only check a few candidates.\n\n1.1 Our Contributions\n\nWhen finding a good Tucker rank with a grid search, we need only the residual error. More specifically,\ngiven an order-K tensor $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$ and integers $R_k \\le N_k$ ($k = 1, \\ldots, K$), we consider the\nfollowing rank-$(R_1, \\ldots, R_K)$ Tucker-fitting problem. For an integer $n \\in \\mathbb{N}$, let $[n]$ denote the set\n$\\{1, 2, \\ldots, n\\}$. Then, we want to compute the following normalized residual error:\n\n$$\\ell_{R_1,\\ldots,R_K}(X) := \\min_{G \\in \\mathbb{R}^{R_1 \\times \\cdots \\times R_K}, \\{U^{(k)} \\in \\mathbb{R}^{N_k \\times R_k}\\}_{k \\in [K]}} \\frac{\\left\\| X - [[G; U^{(1)}, \\ldots, U^{(K)}]] \\right\\|_F^2}{\\prod_{k \\in [K]} N_k}, \\qquad (1)$$\n\nwhere $[[G; U^{(1)}, \\ldots, U^{(K)}]] \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$ is an order-K tensor, defined as\n\n$$[[G; U^{(1)}, \\ldots, U^{(K)}]]_{i_1 \\cdots i_K} = \\sum_{r_1 \\in [R_1], \\ldots, r_K \\in [R_K]} G_{r_1 \\cdots r_K} \\prod_{k \\in [K]} U^{(k)}_{i_k r_k}$$\n\nfor every $i_1 \\in [N_1], \\ldots, i_K \\in [N_K]$. Here, $G$ is the core tensor, and $U^{(1)}, \\ldots, U^{(K)}$ are the factor\nmatrices. Note that we are not concerned with computing the minimizer. Rather, we only want\nto compute the minimum value. In addition, we do not need the exact minimum. Indeed, a rough\nestimate still helps to narrow down promising rank candidates. The question here is how quickly we\ncan compute the normalized residual error $\\ell_{R_1,\\ldots,R_K}(X)$ with moderate accuracy.\nWe shed light on this question by considering a simple sampling-based algorithm. 
Given an order-K\ntensor $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$, Tucker rank $(R_1, \\ldots, R_K)$, and sample size $s \\in \\mathbb{N}$, we sample a sequence\nof indices $S_k = (x^k_1, \\ldots, x^k_s)$ uniformly and independently from $\\{1, \\ldots, N_k\\}$ for each mode $k \\in [K]$.\nThen, we construct a mini-tensor $X|_{S_1,\\ldots,S_K} \\in \\mathbb{R}^{s \\times \\cdots \\times s}$ such that $(X|_{S_1,\\ldots,S_K})_{i_1,\\ldots,i_K} = X_{x^1_{i_1},\\ldots,x^K_{i_K}}$.\nFinally, we compute $\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K})$ using a solver, such as HOOI, and then\noutput the obtained value. The details are provided in Algorithm 1.\nIn this paper, we show that Algorithm 1 achieves our ultimate goal: with a provable guarantee, the\ntime complexity remains constant. Assume each rank parameter $R_k$ is sufficiently smaller than\nthe dimension of each mode $N_k$. Then, given error and confidence parameters $\\epsilon, \\delta \\in (0, 1)$, there\nexists a constant $s = s(\\epsilon, \\delta)$ such that the approximated residual $\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K})$ is close to\nthe original one $\\ell_{R_1,\\ldots,R_K}(X)$, to within $\\epsilon$ with a probability of at least $1 - \\delta$. Note that the time\ncomplexity for computing $\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K})$ does not depend on the input size $N_1, \\ldots, N_K$ but\nrather on the sample size $s$, meaning that the algorithm runs in constant time, regardless of the input\nsize.\nThe main component in our proof is the weak version of Szemer\u00e9di\u2019s regularity lemma [9], which\nroughly states that any tensor can be well approximated by a tensor consisting of a constant number\nof blocks whose entries in the same block are equal. Then, we can show that $X|_{S_1,\\ldots,S_K}$ is a good\nsketch of the original tensor, because by sampling $s$ many indices for each mode, we can hit each\nblock a sufficient number of times. 
It follows that $\\ell_{R_1,\\ldots,R_K}(X)$ and $\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K})$ are close.\nTo formalize this argument, we want to measure the \u201cdistance\u201d between $X$ and $X|_{S_1,\\ldots,S_K}$, and we\nwant to show that it is small. To this end, we exploit graph limit theory, first described by Lov\u00e1sz and\nSzegedy [13] (see also [12]), in which we measure the distance between two graphs on a different\nnumber of vertices by considering continuous versions called dikernels. Hayashi and Yoshida [10]\nused graph limit theory to develop a constant-time algorithm that minimizes quadratic functions\ndescribed by matrices and vectors. We further extend this theory to tensors to analyze the Tucker\nfitting problem.\nWith both synthetic and real datasets, we numerically evaluate our algorithm. The results show that\nour algorithm overwhelmingly outperforms other approximation methods in terms of both speed and\naccuracy.\n\n2 Preliminaries\n\nTensors Let $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$ be a tensor. Then, we define the Frobenius norm of $X$ as $\\|X\\|_F = \\sqrt{\\sum_{i_1,\\ldots,i_K} X_{i_1 \\cdots i_K}^2}$,\nthe max norm of $X$ as $\\|X\\|_{\\max} = \\max_{i_1 \\in [N_1],\\ldots,i_K \\in [N_K]} |X_{i_1 \\cdots i_K}|$, and the cut\nnorm of $X$ as $\\|X\\|_{\\square} = \\max_{S_1 \\subseteq [N_1],\\ldots,S_K \\subseteq [N_K]} \\left| \\sum_{i_1 \\in S_1,\\ldots,i_K \\in S_K} X_{i_1 \\cdots i_K} \\right|$. We note that these norms\nsatisfy the triangle inequalities.\nFor a vector $v \\in \\mathbb{R}^n$ and a sequence $S = (x_1, \\ldots, x_s)$ of indices in $[n]$, we define the restriction\n$v|_S \\in \\mathbb{R}^s$ of $v$ such that $(v|_S)_i = v_{x_i}$ for $i \\in [s]$. Let $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$ be a tensor, and\n$S_k = (x^k_1, \\ldots, x^k_s)$ be a sequence of indices in $[N_k]$ for each mode $k \\in [K]$. Then, we define the\nrestriction $X|_{S_1,\\ldots,S_K} \\in \\mathbb{R}^{s \\times \\cdots \\times s}$ of $X$ to $S_1 \\times \\cdots \\times S_K$ such that $(X|_{S_1,\\ldots,S_K})_{i_1 \\cdots i_K} = X_{x^1_{i_1},\\ldots,x^K_{i_K}}$\nfor each $i_1, \\ldots, i_K \\in [s]$.\n\nHyper-dikernels We call a (measurable) function $\\mathcal{W} : [0, 1]^K \\to \\mathbb{R}$ a (hyper-)dikernel of order $K$.\nWe can regard a dikernel as a tensor whose indices are specified by real values in $[0, 1]$. We stress\nthat the term \u201cdikernel\u201d has nothing to do with kernel methods used in machine learning.\n\nFor two functions $f, g : [0, 1] \\to \\mathbb{R}$, we define their inner product as $\\langle f, g \\rangle = \\int_0^1 f(x) g(x) \\, dx$. For a\nsequence of functions $f^{(1)}, \\ldots, f^{(K)}$, we define their tensor product $\\bigotimes_{k \\in [K]} f^{(k)} : [0, 1]^K \\to \\mathbb{R}$ as\n$\\bigotimes_{k \\in [K]} f^{(k)}(x_1, \\ldots, x_K) = \\prod_{k \\in [K]} f^{(k)}(x_k)$, which is an order-$K$ dikernel.\nLet $\\mathcal{W} : [0, 1]^K \\to \\mathbb{R}$ be a dikernel. Then, we define the Frobenius norm of $\\mathcal{W}$ as $\\|\\mathcal{W}\\|_F = \\sqrt{\\int_{[0,1]^K} \\mathcal{W}(x)^2 \\, dx}$,\nthe max norm of $\\mathcal{W}$ as $\\|\\mathcal{W}\\|_{\\max} = \\max_{x \\in [0,1]^K} |\\mathcal{W}(x)|$, and the cut norm of\n$\\mathcal{W}$ as $\\|\\mathcal{W}\\|_{\\square} = \\sup_{S_1,\\ldots,S_K \\subseteq [0,1]} \\left| \\int_{S_1 \\times \\cdots \\times S_K} \\mathcal{W}(x) \\, dx \\right|$. Again, we note that these norms satisfy\nthe triangle inequalities. For two dikernels $\\mathcal{W}$ and $\\mathcal{W}'$, we define their inner product as $\\langle \\mathcal{W}, \\mathcal{W}' \\rangle = \\int_{[0,1]^K} \\mathcal{W}(x) \\mathcal{W}'(x) \\, dx$.\nLet $\\lambda$ be the Lebesgue measure. A map $\\pi : [0, 1] \\to [0, 1]$ is said to be measure-preserving if the\npre-image $\\pi^{-1}(X)$ is measurable for every measurable set $X$, and $\\lambda(\\pi^{-1}(X)) = \\lambda(X)$. A measure-preserving\nbijection is a measure-preserving map whose inverse map exists and is also measurable\n(and, in turn, also measure-preserving). For a measure-preserving bijection $\\pi : [0, 1] \\to [0, 1]$ and\na dikernel $\\mathcal{W} : [0, 1]^K \\to \\mathbb{R}$, we define a dikernel $\\pi(\\mathcal{W}) : [0, 1]^K \\to \\mathbb{R}$ as $\\pi(\\mathcal{W})(x_1, \\ldots, x_K) = \\mathcal{W}(\\pi(x_1), \\ldots, \\pi(x_K))$.\n\nFor a tensor $G \\in \\mathbb{R}^{R_1 \\times \\cdots \\times R_K}$ and vector-valued functions $\\{F^{(k)} : [0, 1] \\to \\mathbb{R}^{R_k}\\}_{k \\in [K]}$, we define\nan order-$K$ dikernel $[[G; F^{(1)}, \\ldots, F^{(K)}]] : [0, 1]^K \\to \\mathbb{R}$ as\n\n$$[[G; F^{(1)}, \\ldots, F^{(K)}]](x_1, \\ldots, x_K) = \\sum_{r_1 \\in [R_1],\\ldots,r_K \\in [R_K]} G_{r_1,\\ldots,r_K} \\prod_{k \\in [K]} F^{(k)}(x_k)_{r_k}.$$\n\nWe note that $[[G; F^{(1)}, \\ldots, F^{(K)}]]$ is a continuous analogue of Tucker decomposition.\n\nTensors and hyper-dikernels We can construct the dikernel $\\mathcal{X} : [0, 1]^K \\to \\mathbb{R}$ from a tensor $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$\nas follows. For an integer $n \\in \\mathbb{N}$, let $I^n_1 = [0, \\frac{1}{n}]$, $I^n_2 = (\\frac{1}{n}, \\frac{2}{n}]$, $\\ldots$, $I^n_n = (\\frac{n-1}{n}, 1]$.\nFor $x \\in [0, 1]$, we define $i_n(x) \\in [n]$ as the unique integer $i$ such that $x \\in I^n_i$. Then, we define\n$\\mathcal{X}(x_1, \\ldots, x_K) = X_{i_{N_1}(x_1) \\cdots i_{N_K}(x_K)}$. The main motivation for creating a dikernel from a tensor is\nthat, in doing so, we can define the distance between two tensors $X$ and $Y$ of different sizes via the\ncut norm, that is, $\\|\\mathcal{X} - \\mathcal{Y}\\|_{\\square}$.\nLet $\\mathcal{W} : [0, 1]^K \\to \\mathbb{R}$ be a dikernel and $S_k = (x^k_1, \\ldots, x^k_s)$ for $k \\in [K]$ be sequences of elements\nin $[0, 1]$. Then, we define a dikernel $\\mathcal{W}|_{S_1,\\ldots,S_K} : [0, 1]^K \\to \\mathbb{R}$ as follows: We first extract a tensor\n$W \\in \\mathbb{R}^{s \\times \\cdots \\times s}$ by setting $W_{i_1 \\cdots i_K} = \\mathcal{W}(x^1_{i_1}, \\ldots, x^K_{i_K})$. Then, we define $\\mathcal{W}|_{S_1,\\ldots,S_K}$ as the dikernel\nconstructed from $W$.\n\n3 Correctness of Algorithm 1\n\nIn this section, we prove the correctness of Algorithm 1.\nThe following sampling lemma states that dikernels and their sampling versions are close in the cut\nnorm with high probability.\nLemma 3.1. Let $\\mathcal{W}^1$, . . . 
, $\\mathcal{W}^T : [0, 1]^K \\to [-L, L]$ be dikernels. Let $S_1, \\ldots, S_K$ be sequences of\n$s$ elements uniformly and independently sampled from $[0, 1]$. Then, with a probability of at least\n$1 - \\exp(-\\Omega_K(s^2 (T / \\log_2 s)^{1/(K-1)}))$, there exists a measure-preserving bijection $\\pi : [0, 1] \\to [0, 1]$\nsuch that, for every $t \\in [T]$, we have\n\n$$\\|\\mathcal{W}^t - \\pi(\\mathcal{W}^t|_{S_1,\\ldots,S_K})\\|_{\\square} = L \\cdot O_K\\left( (T / \\log_2 s)^{1/(2K-2)} \\right),$$\n\nwhere $O_K(\\cdot)$ and $\\Omega_K(\\cdot)$ hide factors depending on $K$.\nWe now consider the dikernel counterpart to the Tucker fitting problem, in which we want to compute\nthe following:\n\n$$\\ell_{R_1,\\ldots,R_K}(\\mathcal{X}) := \\inf_{G \\in \\mathbb{R}^{R_1 \\times \\cdots \\times R_K}, \\{f^{(k)} : [0,1] \\to \\mathbb{R}^{R_k}\\}_{k \\in [K]}} \\left\\| \\mathcal{X} - [[G; f^{(1)}, \\ldots, f^{(K)}]] \\right\\|_F^2. \\qquad (2)$$\n\nThe following lemma states that the Tucker fitting problem and its dikernel counterpart have the same\noptimum values.\nLemma 3.2. Let $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$ be a tensor, and let $R_1, \\ldots, R_K \\in \\mathbb{N}$ be integers. Then, we have\n\n$$\\ell_{R_1,\\ldots,R_K}(X) = \\ell_{R_1,\\ldots,R_K}(\\mathcal{X}).$$\n\nFor a set of vector-valued functions $F = \\{f^{(k)} : [0, 1] \\to \\mathbb{R}^{R_k}\\}_{k \\in [K]}$, we define $\\|F\\|_{\\max} = \\max_{k \\in [K], r \\in [R_k], x \\in [0,1]} f^{(k)}_r(x)$.\nFor real values $a, b, c \\in \\mathbb{R}$, $a = b \\pm c$ is shorthand for $b - c \\le a \\le b + c$.\nFor a dikernel $\\mathcal{X} : [0, 1]^K \\to \\mathbb{R}$, we define a dikernel $\\mathcal{X}^2 : [0, 1]^K \\to \\mathbb{R}$ as $\\mathcal{X}^2(x) = \\mathcal{X}(x)^2$\nfor every $x \\in [0, 1]^K$. The following lemma states that if $\\mathcal{X}$ and $\\mathcal{Y}$ are close in the cut norm, then the\noptimum values of the Tucker fitting problem regarding them are also close.\nLemma 3.3. Let $\\mathcal{X}, \\mathcal{Y} : [0, 1]^K \\to \\mathbb{R}$ be dikernels with $\\|\\mathcal{X} - \\mathcal{Y}\\|_{\\square} \\le \\epsilon$ and $\\|\\mathcal{X}^2 - \\mathcal{Y}^2\\|_{\\square} \\le \\epsilon$. For\nintegers $R_1$, . . . 
, $R_K \\in \\mathbb{N}$, we have\n\n$$\\ell_{R_1,\\ldots,R_K}(\\mathcal{X}) = \\ell_{R_1,\\ldots,R_K}(\\mathcal{Y}) \\pm 2\\epsilon \\left( 1 + R \\left( \\|G_{\\mathcal{X}}\\|_{\\max} \\|F_{\\mathcal{X}}\\|_{\\max}^K + \\|G_{\\mathcal{Y}}\\|_{\\max} \\|F_{\\mathcal{Y}}\\|_{\\max}^K \\right) \\right),$$\n\nwhere $(G_{\\mathcal{X}}, F_{\\mathcal{X}} = \\{f^{(k)}_{\\mathcal{X}}\\}_{k \\in [K]})$ and $(G_{\\mathcal{Y}}, F_{\\mathcal{Y}} = \\{f^{(k)}_{\\mathcal{Y}}\\}_{k \\in [K]})$ are solutions to the problem (2) on\n$\\mathcal{X}$ and $\\mathcal{Y}$, respectively, whose objective values exceed the infima by at most $\\epsilon$, and $R = \\prod_{k \\in [K]} R_k$.\n\nIt is well known that the Tucker fitting problem has a minimizer for which the factor matrices are\northonormal. Thus, we have the following guarantee for the approximation error of Algorithm 1.\nTheorem 3.4. Let $X \\in \\mathbb{R}^{N_1 \\times \\cdots \\times N_K}$ be a tensor, $R_1, \\ldots, R_K$ be integers, and $\\epsilon, \\delta \\in (0, 1)$. For\n$s(\\epsilon, \\delta) = 2^{\\Theta(1/\\epsilon^{2K-2})} + \\Theta(\\log \\frac{1}{\\delta} \\log \\log \\frac{1}{\\delta})$, we have the following. Let $S_1, \\ldots, S_K$ be sequences of\nindices as defined in Algorithm 1. Let $(G^*, U^*_1, \\ldots, U^*_K)$ and $(\\tilde{G}^*, \\tilde{U}^*_1, \\ldots, \\tilde{U}^*_K)$ be minimizers of\nthe problem (1) on $X$ and $X|_{S_1,\\ldots,S_K}$ for which the factor matrices are orthonormal, respectively.\nThen, with a probability of at least $1 - \\delta$, we have\n\n$$\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K}) = \\ell_{R_1,\\ldots,R_K}(X) \\pm O(\\epsilon L^2 (1 + 2MR)),$$\n\nwhere $L = \\|X\\|_{\\max}$, $M = \\max\\{\\|G^*\\|_{\\max}, \\|\\tilde{G}^*\\|_{\\max}\\}$, and $R = \\prod_{k \\in [K]} R_k$.\n\nWe remark that, for the matrix case (i.e., $K = 2$), $\\|G^*\\|_{\\max}$ and $\\|\\tilde{G}^*\\|_{\\max}$ are equal to the maximum\nsingular values of the original and sampled matrices, respectively.\n\nProof. We apply Lemma 3.1 to $\\mathcal{X}$ and $\\mathcal{X}^2$. 
Then, with a probability of at least $1 - \\delta$, there exists a\nmeasure-preserving bijection $\\pi : [0, 1] \\to [0, 1]$ such that\n\n$$\\|\\mathcal{X} - \\pi(\\mathcal{X}|_{S_1,\\ldots,S_K})\\|_{\\square} \\le \\epsilon L \\quad \\text{and} \\quad \\|\\mathcal{X}^2 - \\pi(\\mathcal{X}^2|_{S_1,\\ldots,S_K})\\|_{\\square} \\le \\epsilon L^2.$$\n\nIn what follows, we assume that this has happened. Then, by Lemma 3.3 and the fact that\n$\\ell_{R_1,\\ldots,R_K}(\\mathcal{X}|_{S_1,\\ldots,S_K}) = \\ell_{R_1,\\ldots,R_K}(\\pi(\\mathcal{X}|_{S_1,\\ldots,S_K}))$, we have\n\n$$\\ell_{R_1,\\ldots,R_K}(\\mathcal{X}|_{S_1,\\ldots,S_K}) = \\ell_{R_1,\\ldots,R_K}(\\mathcal{X}) \\pm \\epsilon L^2 \\left( 1 + 2R \\left( \\|G\\|_{\\max} \\|F\\|_{\\max}^K + \\|\\tilde{G}\\|_{\\max} \\|\\tilde{F}\\|_{\\max}^K \\right) \\right),$$\n\nwhere $(G, F = \\{f^{(k)}\\}_{k \\in [K]})$ and $(\\tilde{G}, \\tilde{F} = \\{\\tilde{f}^{(k)}\\}_{k \\in [K]})$ are as in the statement of Lemma 3.3.\nFrom the proof of Lemma 3.2, we can assume that $\\|G\\|_{\\max} = \\|G^*\\|_{\\max}$, $\\|\\tilde{G}\\|_{\\max} = \\|\\tilde{G}^*\\|_{\\max}$,\n$\\|F\\|_{\\max} \\le 1$, and $\\|\\tilde{F}\\|_{\\max} \\le 1$ (owing to the orthonormality of $U^*_1, \\ldots, U^*_K$ and $\\tilde{U}^*_1, \\ldots, \\tilde{U}^*_K$). It\nfollows that\n\n$$\\ell_{R_1,\\ldots,R_K}(\\mathcal{X}|_{S_1,\\ldots,S_K}) = \\ell_{R_1,\\ldots,R_K}(\\mathcal{X}) \\pm \\epsilon L^2 \\left( 1 + 2R \\left( \\|G^*\\|_{\\max} + \\|\\tilde{G}^*\\|_{\\max} \\right) \\right). \\qquad (3)$$\n\nThen, we have\n\n$$\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K}) = \\ell_{R_1,\\ldots,R_K}(\\mathcal{X}|_{S_1,\\ldots,S_K}) \\qquad \\text{(By Lemma 3.2)}$$\n$$= \\ell_{R_1,\\ldots,R_K}(\\mathcal{X}) \\pm \\epsilon L^2 \\left( 1 + 2R \\left( \\|G^*\\|_{\\max} + \\|\\tilde{G}^*\\|_{\\max} \\right) \\right) \\qquad \\text{(By (3))}$$\n$$= \\ell_{R_1,\\ldots,R_K}(X) \\pm \\epsilon L^2 \\left( 1 + 2R \\left( \\|G^*\\|_{\\max} + \\|\\tilde{G}^*\\|_{\\max} \\right) \\right). \\qquad \\text{(By Lemma 3.2)}$$\n\nHence, we obtain the desired result.\n\n4 Related Work\n\nTo solve Tucker decomposition, several randomized algorithms have been proposed. 
A popular\napproach involves using a truncated or randomized SVD. For example, Zhou et al. [21] proposed\na variant of HOOI with randomized SVD. Another approach is based on tensor sparsification.\nTsourakakis [17] proposed MACH, which randomly picks elements of the input tensor and\nsubstitutes zero for them, each with a probability of $1 - p$, where $p \\in (0, 1]$ is an approximation parameter.\nMoreover, several authors proposed CUR-type Tucker decomposition, which approximates the input\ntensor by sampling tensor tubes [6, 8].\nUnfortunately, these methods do not significantly reduce the computational cost. Randomized\nSVD approaches reduce the computational cost of multiple SVDs from $O(K \\max_k N_k \\cdot \\prod_k N_k)$ to\n$O(K \\max_k R_k \\cdot \\prod_k N_k)$, but they still depend on $\\prod_k N_k$. CUR-type approaches require the same\ntime complexity. In MACH, to obtain accurate results, we need to set $p$ as a constant, for instance\n$p = 0.1$ [17]. Although this will improve the runtime by a constant factor, the dependency on $\\prod_k N_k$\ndoes not change.\n\nFigure 1: Synthetic data: computed residual errors for various Tucker ranks. The vertical axis\nindicates the approximated residual error $\\ell_{R_1,\\ldots,R_K}(X|_{S_1,\\ldots,S_K})$. The error bar indicates the standard\ndeviation over ten trials with different random seeds, which affected both data generation and\nsampling.\n\n5 Experiments\n\nFor the experimental evaluation, we slightly modified our sampling algorithm. In Algorithm 1, the\nindices are sampled using sampling with replacement (i.e., the same indices can be sampled more\nthan once). Although this sampling method is theoretically sound, we risk obtaining redundant\ninformation by sampling the same index several times. To avoid this issue, we used sampling without\nreplacement, i.e., each index was sampled at most once. Furthermore, if the dimension of a mode\nwas smaller than the sampling size, we used all the coordinates. 
That is, we sampled $\\min(s, N_k)$\nindices for each mode $k \\in [K]$. Note that both sampling methods, with and without replacement, are\nalmost equivalent when the input size $N_1, \\ldots, N_K$ is sufficiently larger than $s$ (i.e., the probability\nthat a previously sampled index is sampled approaches zero).\n\n5.1 Synthetic Data\nWe first demonstrate the accuracy of our method using synthetic data. We prepared $N \\times N \\times N$\ntensors for $N \\in \\{100, 200, 400, 800\\}$, with a Tucker rank of (15, 15, 15). Each element of the core\n$G \\in \\mathbb{R}^{15 \\times 15 \\times 15}$ and the factor matrices $U^{(1)}, U^{(2)}, U^{(3)} \\in \\mathbb{R}^{N \\times 15}$ was drawn from a standard\nnormal distribution. We set $Y = [[G; U^{(1)}, U^{(2)}, U^{(3)}]]$. Then, we generated $X \\in \\mathbb{R}^{N \\times N \\times N}$ as\n$X_{ijk} = Y_{ijk} / \\|Y\\|_F + 0.1 \\epsilon_{ijk}$, where $\\epsilon_{ijk}$ follows the standard normal distribution for $i, j, k \\in [N]$.\nNamely, $X$ had a low-rank structure, though some small noise was added. Subsequently, $X$ was\ndecomposed using our method with various Tucker ranks $(R, R, R)$ for $R \\in \\{11, 12, \\ldots, 20\\}$ and\nthe sample size $s \\in \\{20, 40, 80\\}$.\nThe results (see Figure 1) show that our method behaved ideally. That is, the error was high when\n$R$ was less than the true rank, 15, and it was almost zero when $R$ was greater than or equal to the\ntrue rank. Note that the scale of the estimated residual error seems to depend on $s$, i.e., small $s$ tends\nto yield a small residual error. This implies our method underestimates the residual error when $s$ is\nsmall.\n\n5.2 Real Data\n\nTo evaluate how our method worked against real data tensors, we used eight datasets [1, 2, 4, 11,\n14, 19] described in Table 1, where the \u201cfluor\u201d dataset is order-4 and the others are order-3 tensors.\nDetails regarding the data are provided in the Supplementary material. Before the experiment, we\nnormalized each data tensor by its norm $\\|X\\|_F$. 
To evaluate the approximation accuracy, we used\nthe residual error computed by HOOI, as implemented in Python by Nickel (footnote 3), as the \u201ctrue\u201d residual error (footnote 4). As baselines, we used the two\nrandomized methods introduced in Section 4: randomized SVD [21] and MACH [17]. We denote our\nmethod by \u201csample$s$\u201d, where $s$ indicates the sample size (e.g., sample40 denotes our method with\n$s = 40$). Similarly, \u201cmach$p$\u201d refers to MACH with sparsification probability set at $p$. For all the\napproximation methods, we used the HOOI implementation to solve Tucker decomposition. Every\ndata tensor was decomposed with Tucker rank $(R_1, \\ldots, R_K)$ on the grid $R_k \\in \\{5, 10, 15, 20\\}$ for\n$k \\in [K]$.\n\n3 https://github.com/mnick/scikit-tensor\n4 Note that, though no approximation is used in HOOI, the objective function (1) is nonconvex and it is not\nguaranteed to converge to the global minimum. The obtained solution can be different from the ground truth.\n\n[Figure 1 plot: panels N = 100, 200, 400, 800; horizontal axis: R (11 to 20); vertical axis: residual error; legend: s = 20, 40, 80]\n\nTable 1: Real Datasets.\n\nDataset | Size | Total # of elements\nmovie_gray | 120 \u00d7 160 \u00d7 107 | 2.0M\nEEM | 28 \u00d7 13324 \u00d7 8 | 2.9M\nfluorescence | 299 \u00d7 301 \u00d7 41 | 3.6M\nbonnie | 89 \u00d7 97 \u00d7 549 | 4.7M\nfluor | 405 \u00d7 136 \u00d7 19 \u00d7 5 | 5.2M\nwine | 44 \u00d7 2700 \u00d7 200 | 23.7M\nBCI_Berlin | 4001 \u00d7 59 \u00d7 1400 | 0.3G\nvisor | 16818 \u00d7 288 \u00d7 384 | 1.8G\n\nFigure 2: Real data: (approximated) residual errors for various Tucker ranks.\n\nFigure 2 shows the residual error for order-3 data (footnote 5). It shows that the random projection tends to\noverestimate the decomposition error. On the other hand, except for the wine dataset, our method\nstably estimated the residual error with reasonable approximation errors. For the wine dataset, our\nmethod estimated a very small value, far from the correct value. 
This result makes sense, however,\nbecause the wine dataset is sparse (where 90% of the elements are zero) and the residual error is too\nsmall. Table 2 shows the absolute error from HOOI averaged over all rank settings. In most of the\ndatasets, our methods achieved the lowest error.\n\n5 Here we exclude the results of the EEM dataset because its size is too small and we were unable to run the\nexperiment with all the Tucker rank settings. Also, the results of MACH on some datasets are excluded owing to\nconsiderable errors.\n\n[Figure 2 plot: panels movie_gray, fluorescence, bonnie, wine, BCI_Berlin, visor; horizontal axis: Tucker rank (5x5x5 to 20x20x20); vertical axis: residual error; methods: hooi, mach0.1, mach0.3, randsvd, sample40, sample80]\n\nTable 2: Real data: absolute error of HOOI\u2019s and other\u2019s residual errors averaged over ranks. 
The\nbest and the second best results are shown in bold and italic, respectively.\n\nDataset | mach0.1 | mach0.3 | randsvd | sample40 | sample80\nmovie_gray | 0.084 \u00b1 0.038 | 0.020 \u00b1 0.010 | 0.004 \u00b1 0.003 | 0.001 \u00b1 0.001 | 0.000 \u00b1 0.000\nEEM | 2.370 \u00b1 0.792 | 0.587 \u00b1 0.210 | 0.018 \u00b1 0.029 | 0.003 \u00b1 0.003 | 0.003 \u00b1 0.003\nfluorescence | 0.569 \u00b1 0.204 | 0.129 \u00b1 0.053 | 0.024 \u00b1 0.023 | 0.004 \u00b1 0.005 | 0.002 \u00b1 0.002\nbonnie | 1.170 \u00b1 0.412 | 0.300 \u00b1 0.121 | 0.012 \u00b1 0.011 | 0.004 \u00b1 0.002 | 0.003 \u00b1 0.001\nfluor | 0.611 \u00b1 0.307 | 0.148 \u00b1 0.083 | 0.009 \u00b1 0.007 | 0.003 \u00b1 0.001 | 0.002 \u00b1 0.001\nwine | 6.826 \u00b1 0.733 | 1.417 \u00b1 0.191 | 0.012 \u00b1 0.009 | 0.008 \u00b1 0.006 | 0.007 \u00b1 0.006\nBCI_Berlin | 0.193 \u00b1 0.039 | 0.048 \u00b1 0.013 | 0.057 \u00b1 0.020 | 0.065 \u00b1 0.022 | 0.055 \u00b1 0.007\nvisor | 0.002 \u00b1 0.001 | 0.000 \u00b1 0.000 | 0.007 \u00b1 0.003 | 0.003 \u00b1 0.001 | 0.001 \u00b1 0.001\n\nTable 3: Real data: Kendall\u2019s tau against the ranking of Tucker ranks obtained by HOOI.\n\nDataset | mach0.1 | mach0.3 | randsvd | sample40 | sample80\nmovie_gray | -0.07 | 0.04 | 0.1 | 0.71 | 0.73\nEEM | 0.64 | 0.68 | 0.77 | 0.79 | 0.91\nfluorescence | 0.08 | 0.02 | 0.28 | 0.61 | 0.77\nbonnie | -0.05 | -0.01 | 0.33 | 0.27 | 0.67\nfluor | 0.77 | 0.73 | 0.83 | 0.93 | 0.89\nwine | 0.12 | 0.12 | -0.02 | 0.04 | 0.15\nBCI_Berlin | 0.08 | 0.09 | 0.02 | 0.18 | 0.45\nvisor | 0.07 | 0.18 | 0.11 | 0.64 | 0.7\n\nTable 4: Real data: runtime averaged over Tucker ranks (in seconds).\n\nDataset | hooi | mach0.1 | mach0.3 | randsvd | sample40 | sample80\nmovie_gray | 0.71 | 32.19 | 85.13 | 0.33 | 0.13 | 0.25\nEEM | 3447.97 | 7424.8 | 7938.75 | 2212.54 | 0.11 | 0.11\nfluorescence | 2.67 | 30.05 | 73.52 | 1.47 | 0.13 | 0.23\nbonnie | 9.13 | 25.99 | 56.56 | 2.32 | 0.11 | 0.41\nfluor | 3.2 | 34.54 | 98.63 | 1.43 | 0.2 | 0.43\nwine | 142.34 | 95.28 | 212.19 | 41.94 | 0.12 | 0.23\nBCI_Berlin | 428.13 | 2765.88 | 7830 | 82.43 | 0.2 | 0.45\nvisor | 10034.96 | 27897.85 | 27769.53 | 1950.45 | 0.13 | 0.26\n\nNext, we evaluated the correctness of the order of Tucker ranks. For rank determination, it is important\nthat the rankings of Tucker ranks in terms of residual errors are consistent between the original and\nthe sampled tensors. For example, if the rank-(15, 15, 5) Tucker decomposition of the original tensor\nachieves a lower error than the rank-(5, 15, 15) Tucker decomposition, this order relation should\nbe preserved in the sampled tensor. We evaluated this using Kendall\u2019s tau coefficient between the\nrankings of Tucker ranks obtained by HOOI and the others. Kendall\u2019s tau coefficient takes the\nvalue +1 when the two rankings are the same, and \u22121 when they are opposite. Table 3 shows the\nresults. We can see that, again, our method outperformed the others.\nTable 4 shows the runtime averaged over all the rank settings. It shows that our method is consistently\nthe fastest. Note that MACH was slower than normal Tucker decomposition. This is possibly because\nit must create an additional sparse tensor, which requires $O(\\prod_k N_k)$ time complexity.\n\n6 Discussion\n\nOne might point out by way of criticism that the residual error is not a satisfying measure for\ndetermining rank. In machine learning and statistics, it is common to choose hyperparameters based\non the generalization error or its estimator, such as cross-validation (CV) error, rather than the training\nerror (i.e., the residual error in Tucker decomposition). 
Unfortunately, our approach cannot be used\nto estimate the CV error, because what we can obtain is the minimum of the training error, whereas CV requires\nus to plug in the minimizers. An alternative is to use information criteria such as the Akaike [3] and\nBayesian [15] information criteria. These criteria are given by the penalty term, which consists of\nthe number of parameters and samples (footnote 6), and the maximum log-likelihood. Because the maximum\nlog-likelihood is equivalent to the residual error, our method can approximate these criteria.\nPython code of our algorithm is available at: https://github.com/hayasick/CTFT.\n\nReferences\n[1] E. Acar, R. Bro, and B. Schmidt. New exploratory clustering tool. Journal of Chemometrics, 22(1):91,\n2008.\n\n[2] E. Acar, E. E. Papalexakis, G. G\u00fcrdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro.\nStructure-revealing data fusion. BMC bioinformatics, 15(1):239, 2014.\n\n[3] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control,\n19(6):716\u2013723, 1974.\n\n[4] B. Blankertz, G. Dornhege, M. Krauledat, K.-R. M\u00fcller, and G. Curio. The non-invasive berlin brain\u2013computer\ninterface: fast acquisition of effective performance in untrained subjects. NeuroImage, 37(2):539\u2013550,\n2007.\n\n[5] C. Borgs, J. T. Chayes, L. Lov\u00e1sz, V. T. S\u00f3s, and K. Vesztergombi. Convergent sequences of dense graphs I:\nSubgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801\u20131851, 2008.\n\n[6] C. F. Caiafa and A. Cichocki. Generalizing the column\u2013row matrix decomposition to multi-way arrays.\nLinear Algebra and its Applications, 433(3):557\u2013573, 2010.\n\n[7] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(r1, r2, . . . , rn) approximation\nof higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324\u20131342,\n2000.\n\n[8] P. Drineas and M. W. Mahoney. 
A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and Its Applications, 420(2):553\u2013571, 2007.\n\n[9] A. Frieze and R. Kannan. The regularity lemma and approximation schemes for dense problems. In FOCS, pages 12\u201320, 1996.\n\n[10] K. Hayashi and Y. Yoshida. Minimizing quadratic functions in constant time. In NIPS, pages 2217\u20132225, 2016.\n\n[11] A. J. Lawaetz, R. Bro, M. Kamstrup-Nielsen, I. J. Christensen, L. N. J\u00f8rgensen, and H. J. Nielsen. Fluorescence spectroscopy as a potential metabonomic tool for early detection of colorectal cancer. Metabolomics, 8(1):111\u2013121, 2012.\n\n[12] L. Lov\u00e1sz. Large Networks and Graph Limits. American Mathematical Society, 2012.\n\n[13] L. Lov\u00e1sz and B. Szegedy. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96(6):933\u2013957, 2006.\n\n[14] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, volume 3, pages 32\u201336, 2004.\n\n[15] G. Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461\u2013464, 1978.\n\n[16] R. J. Steele and A. E. Raftery. Performance of Bayesian model selection criteria for Gaussian mixture models. Frontiers of Statistical Decision Making and Bayesian Analysis, 2:113\u2013130, 2010.\n\n[17] C. E. Tsourakakis. Mach: Fast randomized tensor decompositions. In ICDM, pages 689\u2013700, 2010.\n\n[18] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279\u2013311, 1966.\n\n[19] R. Vezzani and R. Cucchiara. Video surveillance online repository (visor): an integrated framework. Multimedia Tools and Applications, 50(2):359\u2013380, 2010.\n\n[20] S. Watanabe. Algebraic geometry and statistical learning theory, volume 25. Cambridge University Press, 2009.\n\n[21] G. Zhou, A. Cichocki, and S. Xie. Decomposition of big tensors with low multilinear rank. 
arXiv preprint arXiv:1412.1885, 2014.\n\n\u2076 For models with multiple solutions, such as Tucker decomposition, the penalty term can differ from the standard form [20]. Still, these criteria are useful in practice (see, e.g., [16]).\n", "award": [], "sourceid": 1447, "authors": [{"given_name": "Kohei", "family_name": "Hayashi", "institution": "AIST / RIKEN"}, {"given_name": "Yuichi", "family_name": "Yoshida", "institution": "National Institute of Informatics and Preferred Infrastructure, Inc."}]}