{"title": "On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 3930, "page_last": 3939, "abstract": "Tensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. Despite its advantage, we face two crucial limitations when we apply the TT decomposition to machine learning problems: the lack of statistical theory and of scalable algorithms. In this paper, we address these limitations. First, we introduce a convex relaxation of the TT decomposition problem and derive its error bound for the tensor completion task. Next, we develop a randomized optimization method, in which the time complexity is as efficient as the space complexity. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method with a real higher-order tensor.", "full_text": "On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm\n\nMasaaki Imaizumi\nInstitute of Statistical Mathematics\nRIKEN Center for Advanced Intelligence Project\nimaizumi@ism.ac.jp\n\nTakanori Maehara\nRIKEN Center for Advanced Intelligence Project\ntakanori.maehara@riken.jp\n\nKohei Hayashi\nNational Institute of Advanced Industrial Science and Technology\nRIKEN Center for Advanced Intelligence Project\nhayashi.kohei@gmail.com\n\nAbstract\n\nTensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. Despite its advantage, we face two crucial limitations when we apply the TT decomposition to machine learning problems: the lack of statistical theory and of scalable algorithms. In this paper, we address these limitations. First, we introduce a convex relaxation of the TT decomposition problem and derive its error bound for the tensor completion task. 
Next, we develop a randomized optimization method, in which the time complexity is as efficient as the space complexity. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method with a real higher-order tensor.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1 Introduction\n\nTensor decomposition is an essential tool for dealing with data represented as multidimensional arrays, or simply, tensors. Through tensor decomposition, we can determine latent factors of an input tensor in a low-dimensional multilinear space, which saves the storage cost and enables predicting missing elements. Note that a different multilinear interaction among latent factors defines a different tensor decomposition model, which yields several variations of tensor decomposition. For general purposes, however, either the Tucker decomposition [29] or the CANDECOMP/PARAFAC (CP) decomposition [8] model is commonly used.\n\nIn the past three years, an alternative tensor decomposition model, called tensor train (TT) decomposition [21], has actively been studied in machine learning communities for tasks such as approximating inference on a Markov random field [18], modeling supervised learning [19, 24], analyzing the restricted Boltzmann machine [4], and compressing deep neural networks [17]. A key property is that, for higher-order tensors, TT decomposition provides a more space-saving representation, called the TT format, while preserving the representation power. Given an order-K tensor (i.e., a K-dimensional tensor), the space complexity of Tucker decomposition is exponential in K, whereas that of TT decomposition 
Further, on TT format, several mathematical operations including the basic linear\nalgebra operations can be performed ef\ufb01ciently [21].\nDespite its potential importance, we face two crucial limitations when applying this decomposition\nto a much wider class of machine learning problems. First, its statistical performance is unknown.\nIn Tucker decomposition and its variants, many authors addressed the generalization error and\nderived statistical bounds (e.g. [28, 27]). For example, Tomioka et al.[28] clarify the way in which\nusing the convex relaxation of Tucker decomposition, the generalization error is affected by the\nrank (i.e., the dimensionalities of latent factors), dimension of an input, and number of observed\nelements. In contrast, such a relationship has not been studied for TT decomposition yet. Second,\nstandard TT decomposition algorithms, such as alternating least squares (ALS) [6, 30] , require a\nhuge computational cost. The main bottleneck arises from the singular value decomposition (SVD)\noperation to an \u201cunfolding\u201d matrix, which is reshaped from the input tensor. The size of the unfolding\nmatrix is huge and the computational cost grows exponentially in K.\nIn this paper, we tackle the above issues and present a scalable yet statistically-guaranteed TT\ndecomposition method. We \ufb01rst introduce a convex relaxation of the TT decomposition problem and\nits optimization algorithm via the alternating direction method of multipliers (ADMM). Based on\nthis, a statistical error bound for tensor completion is derived, which achieves the same statistical\nef\ufb01ciency as the convex version of Tucker decomposition does. Next, because the ADMM algorithm\nis not suf\ufb01ciently scalable, we develop an alternative method by using a randomization technique. At\nthe expense of losing the global convergence property, the dependency of K on the time complexity\nis reduced from exponential to quadratic. 
In addition, we show that a similar error bound is still guaranteed. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method using a real higher-order tensor.\n\n2 Preliminaries\n\n2.1 Notation\n\nLet X = R^{I_1 × ··· × I_K} be the space of order-K tensors, where I_k denotes the dimensionality of the k-th mode for k = 1, ..., K. For brevity, we define I'\n\nis a hyperparameter for a step size. We stop the iteration when the convergence criterion is satisfied (e.g., as suggested by Tomioka et al. [28]). Since the Schatten TT norm (4) is convex, the sequence of the variables of ADMM is guaranteed to converge to the optimal solution ([5, Theorem 5.1]). We refer to this algorithm as TT-ADMM.\n\nTT-ADMM requires huge resources in terms of both time and space. For the time complexity, the proximal operation of the Schatten TT norm, namely the SVD thresholding of V_k^1, yields the dominant complexity, which is O(I^{3K/2}) time. For the space complexity, we have O(K) variables of size O(I^K), which requires O(KI^K) space.\n\n4 Alternating Minimization with Randomization\n\nIn this section, we consider reducing the space complexity for handling higher-order tensors. The idea is simple: we only maintain the TT format of the input tensor rather than the input tensor itself. This leads to the following optimization problem:\n\nmin_G [ (1/(2n)) ||Y − X(X(G))||^2 + λ_n ||X(G)||_{s,T} ],  (6)\n\nRemember that G = {G_k}_k is the set of TT components and X(G) is the tensor given by the TT format with G. Now we only need to store the TT components G, which drastically improves the space efficiency.\n\n4.1 Randomized Schatten TT norm\n\nWe approximate the optimization of the Schatten TT norm. To avoid the computation of exponentially large-scale SVDs in the Schatten TT norm, we employ a technique called the "very sparse random projection" [12]. 
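The dominant SVD-thresholding proximal step mentioned above (the proximal operator of the Schatten-1, i.e. nuclear, norm) can be sketched generically as follows; this is a standard singular-value soft-thresholding routine, not the authors' implementation, and the matrix size is an arbitrary illustration:

```python
import numpy as np

def prox_schatten1(V, tau):
    """Soft-threshold the singular values of V: the proximal operator of
    tau * (Schatten-1 norm). The SVD itself is the dominant cost, which is
    why this step becomes expensive for huge unfolding matrices."""
    U, sigma, Wt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - tau, 0.0)) @ Wt

rng = np.random.default_rng(1)
V = rng.standard_normal((40, 30))
X = prox_schatten1(V, tau=5.0)
```

Because small singular values are set exactly to zero, the output is low-rank, which is the mechanism by which Schatten-norm regularization induces low TT rank.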
The main idea is that, if the size of a matrix is sufficiently larger than its rank, then its singular values (and vectors) are well preserved even after projection by a sparse random matrix. This motivates us to use the Schatten TT norm over the random projection.\n\nAs a preliminary, we introduce tensors for the random projection. Let D_1, D_2 ∈ N be the sizes of the matrix after projection. For each k = 1, ..., K−1, let Π_{k,1} ∈ R^{D_1 × I_1 × ··· × I_k} be a tensor whose elements are independently and identically distributed as follows:\n\n[Π_{k,1}]_{d_1,i_1,...,i_k} = +√(s/D_1) with probability 1/(2s), 0 with probability 1 − 1/s, and −√(s/D_1) with probability 1/(2s),  (7)\n\nfor i_1, ..., i_k and d_1 = 1, ..., D_1. Here, s > 0 is a hyperparameter controlling sparsity. Similarly, we introduce a tensor Π_{k,2} ∈ R^{D_2 × I_{k+1} × ··· × I_K} that is defined in the same way as Π_{k,1}. With Π_{k,1} and Π_{k,2}, let P_k : X → R^{D_1 × D_2} be a random projection operator whose elements are defined as follows:\n\n[P_k(X)]_{d_1,d_2} = Σ_{j_1=1}^{I_1} ··· Σ_{j_K=1}^{I_K} [Π_{k,1}]_{d_1,j_1,...,j_k} [X]_{j_1,...,j_K} [Π_{k,2}]_{d_2,j_{k+1},...,j_K}.  (8)\n\nNote that we can compute the above projection by using the facts that X has the TT format and the projection matrices are sparse. Let π_j^{(k)} be the set of indexes of the non-zero elements of Π_{k,j}. 
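A minimal sketch of sampling such a very sparse projection, flattened to a D_1 × (I_1···I_k) matrix for simplicity; the ±√(s/D_1) scaling follows the distribution (7) as reconstructed here, and the sizes (D_1 = 10, s = 20) mirror the experimental settings reported later, so treat them as illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def very_sparse_projection(d_out, d_in, s=20.0):
    """Entries are +(s/d_out)**0.5 w.p. 1/(2s), -(s/d_out)**0.5 w.p. 1/(2s),
    and 0 w.p. 1 - 1/s, following the very sparse random projection of [12]."""
    u = rng.random((d_out, d_in))
    mat = np.zeros((d_out, d_in))
    mat[u < 1.0 / (2.0 * s)] = (s / d_out) ** 0.5
    mat[u > 1.0 - 1.0 / (2.0 * s)] = -(s / d_out) ** 0.5
    return mat

P = very_sparse_projection(10, 5000, s=20.0)
density = np.count_nonzero(P) / P.size  # roughly 1/s = 0.05
```

Only about a fraction 1/s of the entries is non-zero, which is what makes the projection of a TT-format tensor cheap: the sums in (8) run only over the non-zero index sets.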
Then, using the TT representation of X, (8) is rewritten as\n\n[P_k(X(G))]_{d_1,d_2} = Σ_{(j_1,...,j_k) ∈ π_1^{(k)}} Σ_{(j_{k+1},...,j_K) ∈ π_2^{(k)}} [Π_{k,1}]_{d_1,j_1,...,j_k} [G_1]_{j_1} ··· [G_k]_{j_k} [G_{k+1}]_{j_{k+1}} ··· [G_K]_{j_K} [Π_{k,2}]_{d_2,j_{k+1},...,j_K}.\n\nIf the projection matrices have only S non-zero elements (i.e., S = |π_1^{(k)}| = |π_2^{(k)}|), the computational cost of the above equation is O(D_1 D_2 S K R^3).\n\nThe next theorem guarantees that the Schatten-1 norm of P_k(X) approximates the original one.\n\nTheorem 1. Suppose X ∈ X has TT rank (R_1, ..., R_{K−1}). Consider the reshaping operator Q_k in (4), and the random operator P_k as in (8) with tensors Π_{k,1} and Π_{k,2} defined as in (7). If D_1, D_2 ≥ max{R_k, 4(log(6R_k) + log(1/ε))/ε^2}, and all the singular vectors u of Q_k(X) are well-spread in the sense that Σ_j |u_j|^3 ≤ ε/(1.6k√s), then\n\n((1 − ε)/R_k) ||Q_k(X)||_s ≤ ||P_k(X)||_s ≤ (1 + ε) ||Q_k(X)||_s,\n\nwith probability at least 1 − ε.\n\nNote that the well-spread condition can be seen as a stronger version of the incoherence assumption, which will be discussed later.\n\n4.2 Alternating Minimization\n\nNote that the new problem (6) is non-convex, because X(G) does not form a convex set on X. However, if we fix all of G except for G_k, the problem becomes convex with respect to G_k. Combining this with the random projection, we obtain the following minimization problem:\n\nmin_{G_k} [ (1/(2n)) ||Y − X(X(G))||^2 + (λ_n/(K−1)) Σ_{k'=1}^{K−1} ||P_{k'}(X(G))||_s ].  (9)\n\nWe solve this by the ADMM method for each k = 1, ..., K. Let g_k ∈ R^{I_k R_{k−1} R_k} be the vectorization of G_k, and let W_{k'} ∈ R^{D_1 × D_2} be a matrix for the randomly projected matrix. The augmented Lagrangian function is then given by L_k(g_k, {W_{k'}}_{k'=1}^{K−1}, {α_{k'}}_{k'=1}^{K−1}), where {α_{k'} ∈ R^{D_1 D_2}}_{k'=1}^{K−1} are the Lagrange multipliers. Starting from initial points (g_k^{(0)}, {W_{k'}^{(0)}}_{k'=1}^{K−1}, {α_{k'}^{(0)}}_{k'=1}^{K−1}), the ℓ-th ADMM step is written as follows:\n\ng_k^{(ℓ+1)} = (Ω^T Ω/n + η Σ_{k'=1}^{K−1} Ψ_{k'}^T Ψ_{k'})^{−1} (Ω^T Y/n + Σ_{k'=1}^{K−1} Ψ_{k'}^T (η Ṽ_k(W_{k'}^{(ℓ)}) − α_{k'}^{(ℓ)})),\n\nW_{k'}^{(ℓ+1)} = prox_{λ_n/η} (Ṽ_k^{−1}(Ψ_{k'} g_k^{(ℓ+1)} + α_{k'}^{(ℓ)}/η)), k' = 1, ..., K−1,\n\nα_{k'}^{(ℓ+1)} = α_{k'}^{(ℓ)} + η (Ψ_{k'} g_k^{(ℓ+1)} − Ṽ_k(W_{k'}^{(ℓ+1)})), k' = 1, ..., K−1.\n\nHere, Ψ_{k'} ∈ R^{D_1 D_2 × I_k R_{k−1} R_k} is the matrix imitating the mapping G_k ↦ P_{k'}(X(G_k; G \ {G_k})), Ṽ_k is a vectorizing operator of a D_1 × D_2 matrix, and Ω is an n × I_k R_{k−1} R_k matrix of the operator X ↦ X(·; G \ {G_k}) with respect to g_k. Similarly to the convex approach, we iterate the ADMM steps until convergence. We refer to this algorithm as TT-RAM, where RAM stands for randomized alternating minimization.\n\nThe time complexity of TT-RAM at the ℓ-th iteration is O((n + KD^2) K I^2 R^4); the details are deferred to the Supplementary material. The space complexity is O(n + K I^2 R^4), where O(n) is for Y and O(K I^2 R^4) is for the parameters.\n\n5 Theoretical Analysis\n\nIn this section, we investigate how the TT rank and the number of observations affect the estimation error. Note that all the proofs of this section are deferred to the Supplementary material.\n\n5.1 Convex Solution\n\nTo analyze the statistical error of the convex problem (5), we assume the incoherence of the reshaped version of X*.\n\nAssumption 2 (Incoherence Assumption). There exists k ∈ {1, ..., K} such that the matrix Q_k(X*) has orthogonal singular vectors {u_r ∈ R^{I_{≤k}}, v_r ∈ R^{I_{k<}}}_{r=1}^{R_k} satisfying\n\nmax_{1 ≤ i ≤ I_{≤k}} ||P_U e_i|| ≤ (μ R_k / I_{≤k})^{1/2} and max_{1 ≤ i ≤ I_{k<}} ||P_V e_i|| ≤ (μ R_k / I_{k<})^{1/2}\n\nfor k = 1, ..., K, where P_U and P_V denote the projections onto the spans of {u_r}_r and {v_r}_r, and μ ≥ 1 is an incoherence parameter.\n\nLet C_m, C_A, C_B > 0 and 0 < γ < 1 be some constants. 
If n ≥ C_m μ^2 R_{k'} max{I_{≤k'}, I_{k'<}} log^3 max{I_{≤k'}, I_{k'<}}, and the number of iterations t satisfies t ≥ (log γ)^{−1} {log(C_B λ_n (K−1)(1+ε) Σ_k √R_k) − log C}, then, with probability at least 1 − ε − (max{I_{≤k'}, I_{k'<}})^{−3} and for λ_n ≥ ||X*(E)||_∞/n,\n\n||X(Ĝ) − X(G*)||_F ≤ C_A (1+ε) λ_n Σ_{k=1}^{K−1} √R_k.  (10)\n\nAgain, we can obtain the consistency of TT-RAM by setting λ_n → 0 as n increases. Since the setting of λ_n corresponds to that of Theorem 3, the speed of convergence of TT-RAM in terms of n is equivalent to that of TT-ADMM.\n\nCompared with the convex approach (Theorem 3), the error rate becomes slightly worse. Here, the term λ_n Σ_{k=1}^{K−1} √R_k in (10) comes from the estimation by the alternating minimization, and it increases linearly in K. This is because there are K optimization problems and their errors are accumulated in the final solution. The term (1+ε) in (10) comes from the random projection. The size of the error ε can be made arbitrarily small by controlling the parameters of the random projection, D_1, D_2 and s.\n\n6 Related Work\n\nTo solve the tensor completion problem with TT decomposition, Wang et al. [30] and Grasedyck et al. [6] developed algorithms that iteratively solve minimization problems with respect to G_k for each k = 1, ..., K. Unfortunately, the adaptivity of the TT rank is not well discussed. [30] assumed that the TT rank is given. Grasedyck et al. [6] proposed a grid search method. However, the TT rank is determined by a single parameter (i.e., R_1 = ··· = R_{K−1}) and the search method lacks generality. Furthermore, the scalability problem remains in both methods: they require more than O(I^K) space. Phien et al. [22] proposed a convex optimization method using the Schatten TT norm. However, because they employed an alternating-type optimization method, the global convergence of their method is not guaranteed. 
Moreover, since they maintain X directly and perform reshapings of X several times, their method requires O(I^K) time.\n\nTable 1 highlights the differences between the existing methods and ours. We emphasize that our study is the first attempt to analyze the statistical performance of TT decomposition. In addition, TT-RAM is the only method whose time and space complexities both avoid exponential growth in K.\n\nMethod | Global Convergence | Rank Adaptivity | Time Complexity | Space Complexity | Statistical Bounds\nTCAM-TT [30] | - | - | O(nIKR^4) | O(I^K) | -\nADF for TT [6] | - | (search) | O(KIR^3 + nKR^2) | O(I^K) | -\nSiLRTC-TT [22] | - | ✓ | O(I^{3K/2}) | O(KI^K) | -\nTT-ADMM | ✓ | ✓ | O(KI^{3K/2}) | O(I^K) | ✓\nTT-RAM | - | ✓ | O((n + KD^2)KI^2R^4) | O(n + KI^2R^4) | ✓\n\nTable 1: Comparison of TT completion algorithms, where R is a parameter for the TT rank such that R = R_1 = ··· = R_{K−1}, I = I_1 = ··· = I_K is the dimension, K is the number of modes, n is the number of observed elements, and D is the dimension of the random projection.\n\n7 Experiments\n\n7.1 Validation of Statistical Efficiency\n\nUsing synthetic data, we verify the theoretical bounds derived in Theorems 3 and 5. We first generate TT components G*; each component G*_k is generated as G*_k = G†_k/||G†_k||_F, where each element of G†_k is sampled from the i.i.d. standard normal distribution. Then we generate Y by following the generative model (2) with the observation ratio n/Π_k I_k = 0.5 and the noise variance 0.01. We prepare two tensors of different sizes: an order-4 tensor of size 8 × 8 × 10 × 10 and an order-5 tensor of size 5 × 5 × 7 × 7 × 7. For the order-4 tensor, the TT rank is set as (R_1, R_2, R_3), where R_1, R_2, R_3 ∈ {3, 5, 7}. For the order-5 tensor, the TT rank is set as (R_1, R_2, R_3, R_4), where R_1, R_2, R_3, R_4 ∈ {2, 4}. For estimation, we set the sizes of G_k and Π_k as 10, which is larger than the true TT rank. The regularization coefficient λ_n is selected from {1, 3, 5}. The parameters for the random projection are set as s = 20 and D_1 = D_2 = 10.\n\nFigure 1: Synthetic data: the estimation error ||X̂ − X*||_F against the SRR Σ_k √R_k with the order-4 tensor (K = 4) and the order-5 tensor (K = 5). For each rank and n, we measure the error over 10 trials with different random seeds, which affect both the missing pattern and the initial points.\n\nTable 2: Electricity data: the prediction error and the runtime (in seconds).\n\nMethod | K=5 Error | K=5 Time | K=7 Error | K=7 Time | K=8 Error | K=8 Time | K=10 Error | K=10 Time\nTucker | 0.219 | 7.125 | 0.371 | 610.61 | N/A | N/A | N/A | N/A\nTCAM-TT | 0.219 | 2.174 | 0.928 | 27.497 | 0.928 | 146.651 | N/A | N/A\nADF for TT | 0.998 | 1.221 | 1.160 | 23.211 | 1.180 | 278.712 | N/A | N/A\nSiLRTC-TT | 0.339 | 1.478 | 0.928 | 206.708 | N/A | N/A | N/A | N/A\nTT-ADMM | 0.221 | 0.289 | 1.019 | 154.991 | 1.061 | 2418.00 | N/A | N/A\nTT-RAM | 0.219 | 4.644 | 0.928 | 4.726 | 0.928 | 7.654 | 1.173 | 7.968\n\nFigure 1 shows the relation between the estimation error and the sum of root ranks (SRR) Σ_k √R_k. The result of TT-ADMM shows that the empirical errors are linearly related to the SRR, as indicated by the theoretical result. 
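The synthetic-data generation described above can be sketched as follows; the core shapes and the normalization G*_k = G†_k/||G†_k||_F follow the text, while the observation step is a simplified stand-in for the generative model (2), whose exact form is not shown in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)

I_dims = (8, 8, 10, 10)   # the order-4 tensor from the experiment
ranks = (1, 3, 5, 7, 1)   # TT rank (R1, R2, R3) = (3, 5, 7), boundary ranks 1

# G*_k = G_dagger_k / ||G_dagger_k||_F with i.i.d. standard normal entries.
cores = []
for k, I_k in enumerate(I_dims):
    G = rng.standard_normal((ranks[k], I_k, ranks[k + 1]))
    cores.append(G / np.linalg.norm(G))

# Contract the cores into the full tensor X(G*).
X = cores[0]
for G in cores[1:]:
    X = np.tensordot(X, G, axes=([-1], [0]))
X = X.squeeze(axis=(0, -1))

# Observe half of the elements with additive Gaussian noise (variance 0.01);
# unobserved entries are left at zero here for simplicity.
mask = rng.random(X.shape) < 0.5
Y = np.where(mask, X + 0.1 * rng.standard_normal(X.shape), 0.0)
```

The observation ratio of 0.5 and the noise variance 0.01 match the settings quoted above; a completion method would then be run on (Y, mask) to recover X.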
The result of TT-RAM roughly replicates the theoretical relationship.\n\n7.2 Higher-Order Markov Chain for Electricity Data\n\nWe apply the proposed tensor completion methods to the analysis of the electricity consumption data [13]. The dataset contains time-series measurements of household electric power consumption for every minute from December 2006 to November 2010, and it contains over 200,000 observations.\n\nA higher-order Markov chain is a suitable method to represent long-term dependency, and it is a common tool in time-series analysis [7] and natural language processing [9]. Let {W_t}_t be discrete-time random variables taking values in a finite set B; the order-K Markov chain describes the conditional distribution of W_t given {W_τ}_τ