{"title": "Singleshot : a scalable Tucker tensor decomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 6304, "page_last": 6315, "abstract": "This paper introduces a new approach for the scalable Tucker decomposition\nproblem. Given a tensor X, the proposed method infers the latent factors\nby processing one subtensor drawn from X at a time. The key principle of our\napproach is based on the recursive computation of the gradient and on a cyclic update of the factors involving only one single step of gradient descent. We further improve the\ncomputational efficiency of this algorithm by proposing an inexact gradient version.\nThese two algorithms are backed with theoretical guarantees of convergence and\nconvergence rate under mild conditions. The scalability of the proposed approaches,\nwhich can be easily extended to handle some common constraints encountered in\ntensor decomposition (e.g. non-negativity), is proven via numerical experiments on\nboth synthetic and real data sets.", "full_text": "Singleshot : a scalable Tucker tensor decomposition\n\nAbraham Traor\u00e9\nLITIS EA4108\nUniversity of Rouen Normandy\nabraham.traore@etu.univ-rouen.fr\n\nMaxime B\u00e9rar\nLITIS EA4108\nUniversity of Rouen Normandy\nmaxime.berar@univ-rouen.fr\n\nAlain Rakotomamonjy\nLITIS EA4108\nUniversity of Rouen Normandy\nCriteo AI Lab, Criteo Paris\nalain.rakoto@insa-rouen.fr\n\nAbstract\n\nThis paper introduces a new approach for the scalable Tucker decomposition\nproblem. Given a tensor X, the proposed algorithm, named Singleshot, performs\nthe inference task by processing one subtensor drawn from X at a time.\nThe key principle of our approach is based on the recursive computation of the\ngradient and on a cyclic update of the latent factors involving only one single step of\ngradient descent. 
We further improve the computational efficiency of Singleshot by\nproposing an inexact gradient version named Singleshotinexact. The two algorithms\nare backed with theoretical guarantees of convergence and convergence rates under\nmild conditions. The scalability of the proposed approaches, which can be easily\nextended to handle some common constraints encountered in tensor decomposition\n(e.g. non-negativity), is proven via numerical experiments on both synthetic and\nreal data sets.\n\n1 Introduction\n\nThe recovery of information-rich and task-relevant variables hidden behind observation data (commonly\nreferred to as latent variables) is a fundamental task that has been extensively studied in\nmachine learning. In many applications, the dataset we are dealing with naturally presents different\nmodes (or dimensions) and thus can be naturally represented by multidimensional arrays (also called\ntensors). The recent interest in efficient techniques to deal with such datasets is motivated by the\nfact that the methodologies that matricize the data and then apply matrix factorization give a flattened\nview of the data and often cause a loss of the internal structure information. Hence, to mitigate the\nextent of this loss, it is more favorable to process a multimodal data set in its own domain, i.e. the tensor\ndomain, to obtain a multiple perspective view of the data rather than a flattened one.\nTensors are a generalization of matrices and the related decomposition techniques are promising\ntools for exploratory analysis of multidimensional data in diverse disciplines including signal processing\n[11], social network analysis [28], etc. The two most common decompositions used for tensor\nanalysis are the Tucker decomposition [43] and the Canonical Polyadic Decomposition, also named\nCPD [16, 6]. 
These decompositions are used to infer multilinear relationships from multidimensional\ndatasets as they make it possible to extract hidden (latent) components and investigate the relationships among\nthem.\nIn this paper, we focus on the Tucker decomposition, motivated by the fact that this decomposition\nand its variants have been successfully used in many real applications [24, 19]. Our technical goal\nis to develop a scalable Tucker decomposition technique in a static setting (the tensor is fixed).\nSuch an objective is relevant in a situation where it is not possible to load in memory the tensor of\ninterest or when the decomposition process may result in memory overflow generated by intermediate\ncomputations [20, 31].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Related work and main limitations\n\nDivide-and-conquer type methods (i.e. which divide the data set into sub-parts) have already been\nproposed for the scalable Tucker decomposition problem, with the goal of efficiently decomposing\na large fixed tensor (static setting). There are mainly three trends for these methods: distributed\nmethods, sequential processing of small subsets of entries drawn from the tensor, and the computation of\nthe tensor-matrix product in a piecemeal fashion by adaptively selecting the order of the computations.\nA variant of the Tucker-ALS has been proposed in [31]; it solves each alternate step of the Tucker\ndecomposition by processing intermediate data on the fly, thereby reducing the memory footprint of the\nalgorithm. Several other approaches following the same principles are given in [5, 9, 4, 33] while\nothers consider some sampling strategies [29, 36, 14, 39, 48, 18, 47, 35, 27, 10, 25] or distributed\napproaches [49, 7, 34]. One major limitation related to these algorithms is their lack of genericness\n(i.e. 
they cannot be extended to incorporate some constraints such as non-negativity).\nAnother set of techniques for large-scale Tucker decomposition in a static setting focuses on designing\nboth deterministic and randomized algorithms in order to alleviate the computational burden of the\ndecomposition. An approach proposed by [4] performs an alternate minimization and reduces the\nscale of the intermediate problems via the incorporation of sketching operators. In the same vein,\none can reduce the computational burden of the standard method HOSVD through randomization\nand by estimating the orthonormal basis via the so-called range finder algorithm [51]. This class\nof approaches encompasses other methods that can be either randomized [8, 30, 13, 37, 42, 46] or\ndeterministic [40, 2, 38, 3, 50, 17, 26, 32]. The main limitation of these methods essentially stems\nfrom the fact that they use the whole data set at once (instead of dividing it), which makes them\ninapplicable when the tensor does not fit in the available memory.\nFrom a theoretical point of view, among all these works, some algorithms are backed up with\nconvergence results [4] or have quality of approximation guarantees materialized by a recovery bound\n[1]. However, there is still a lack of convergence rate analysis for the scalable Tucker problem.\n\n1.2 Main contributions\n\nIn contrast to the works described above, our contributions are the following ones:\n\u2022 We propose a new approach for the scalable Tucker decomposition problem, denoted as Singleshot,\nleveraging coordinate gradient descent [41] and sequential processing of data chunks, amenable\nto constrained optimization.\n\u2022 In order to improve the computational efficiency of Singleshot, we introduce an inexact gradient\nvariant, denoted as Singleshotinexact. 
This inexact approach can be further extended so as to make\nit able to decompose a tensor growing in every mode and in an online fashion.\n\u2022 From a theoretical standpoint, we establish for Singleshot an ergodic convergence rate of $O(1/\\sqrt{K})$ ($K$: maximum number of iterations) to a stationary point and, for Singleshotinexact, a convergence rate of $O(1/k)$ ($k$ being the iteration number) to a minimizer.\n\u2022 We provide experimental analyses showing that our approaches are able to decompose bigger\ntensors than competitors without compromising efficiency. From a streaming tensor decomposition\npoint of view, our Singleshot extension is competitive with its competitor.\n\n2 Notations & Definitions\n\nAn $N$-order tensor is denoted by a boldface Euler script letter $\\mathcal{X} \\in \\mathbb{R}^{I_1 \\times \\cdots \\times I_N}$. The matrices are\ndenoted by bold capital letters (e.g. $\\mathbf{A}$). The identity matrix is denoted by $\\mathbf{Id}$. The $j$th row of a\nmatrix $\\mathbf{A} \\in \\mathbb{R}^{J \\times L}$ is denoted by $\\mathbf{A}_{j,:}$ and the transpose of a matrix $\\mathbf{A}$ by $\\mathbf{A}^\\top$.\nMatricization is the process of reordering all the elements of a tensor into a matrix. The mode-$n$\nmatricization of a tensor $\\mathcal{X}$, denoted $[\\mathcal{X}]^{(n)}$, arranges the mode-$n$ fibers to be the columns of the resulting\nmatrix $\\mathbf{X}^{(n)} \\in \\mathbb{R}^{I_n \\times (\\prod_{m \\neq n} I_m)}$. The mode-$n$ product of a tensor $\\mathcal{G} \\in \\mathbb{R}^{J_1 \\times \\cdots \\times J_N}$ with a matrix\n$\\mathbf{A} \\in \\mathbb{R}^{I_n \\times J_n}$, denoted by $\\mathcal{G} \\times_n \\mathbf{A}$, yields a tensor of the same order $\\mathcal{B} \\in \\mathbb{R}^{J_1 \\times \\cdots \\times J_{n-1} \\times I_n \\times J_{n+1} \\times \\cdots \\times J_N}$\nwhose mode-$n$ matricized form is defined by $\\mathbf{B}^{(n)} = \\mathbf{A}\\mathbf{G}^{(n)}$. For a tensor $\\mathcal{X} \\in \\mathbb{R}^{I_1 \\times \\cdots \\times I_N}$, its\n$i_n$th subtensor with respect to the mode $n$ is denoted by $\\mathcal{X}^n_{i_n} \\in \\mathbb{R}^{I_1 \\times \\cdots \\times I_{n-1} \\times 1 \\times I_{n+1} \\times \\cdots \\times I_N}$. This\nsubtensor is an $N$-order tensor defined via the mapping between its mode-$n$ matricization $[\\mathcal{X}^n_{i_n}]^{(n)}$ and\nthe $i_n$th row of $\\mathbf{X}^{(n)}$, i.e. the tensor $\\mathcal{X}^n_{i_n}$ is obtained by reshaping the $i_n$th row of $\\mathbf{X}^{(n)}$, with the target\nshape $(I_1, \\ldots, I_{n-1}, 1, I_{n+1}, \\ldots, I_N)$. The set of integers from $n$ to $N$ is denoted by $\\mathbb{I}^n_N = \\{n, \\ldots, N\\}$:\nif $n = 1$, the set is simply denoted by $\\mathbb{I}_N$. The set of integers from 1 to $N$ with $n$ excluded is\ndenoted by $\\mathbb{I}_{N \\neq n} = \\{1, \\ldots, n-1, n+1, \\ldots, N\\}$. Let us define the tensor $\\mathcal{G} \\in \\mathbb{R}^{J_1 \\times \\cdots \\times J_N}$ and $N$\nmatrices $\\{\\mathbf{A}^{(m)} \\in \\mathbb{R}^{I_m \\times J_m}\\}_{m \\in \\mathbb{I}_N}$. The product of $\\mathcal{G}$ with the matrices $\\mathbf{A}^{(m)}$, $m \\in \\mathbb{I}_N$, denoted by\n$\\mathcal{G} \\times_1 \\mathbf{A}^{(1)} \\times_2 \\cdots \\times_N \\mathbf{A}^{(N)}$, will be alternatively expressed by:\n\n$$\\mathcal{G} \\times_{m \\in \\mathbb{I}_N} \\mathbf{A}^{(m)} = \\mathcal{G} \\times_{m \\in \\mathbb{I}_{n-1}} \\mathbf{A}^{(m)} \\times_n \\mathbf{A}^{(n)} \\times_{q \\in \\mathbb{I}^{n+1}_N} \\mathbf{A}^{(q)} = \\mathcal{G} \\times_{m \\in \\mathbb{I}_{N \\neq n}} \\mathbf{A}^{(m)} \\times_n \\mathbf{A}^{(n)}.$$\n\nThe Frobenius norm of a tensor $\\mathcal{X} \\in \\mathbb{R}^{I_1 \\times \\cdots \\times I_N}$, denoted by $\\|\\mathcal{X}\\|_F$, is defined by\n$\\|\\mathcal{X}\\|_F = \\left(\\sum_{1 \\leq i_n \\leq I_n,\\, 1 \\leq n \\leq N} \\mathcal{X}^2_{i_1, \\cdots, i_N}\\right)^{1/2}$. The same definition holds for matrices.\n\n3 Piecewise tensor decomposition: Singleshot\n\n3.1 Tucker decomposition and problem statement\n\nGiven a tensor $\\mathcal{X} \\in \\mathbb{R}^{I_1 \\times \\cdots \\times I_N}$, the Tucker decomposition aims at the following approximation:\n\n$$\\mathcal{X} \\approx \\mathcal{G} \\times_{m \\in \\mathbb{I}_N} \\mathbf{A}^{(m)}, \\quad \\mathcal{G} \\in \\mathbb{R}^{J_1 \\times \\cdots \\times J_N}, \\quad \\mathbf{A}^{(m)} \\in \\mathbb{R}^{I_m \\times J_m}$$\n\nThe tensor $\\mathcal{G}$ is generally named the core tensor and the matrices $\\{\\mathbf{A}^{(m)}\\}_{m \\in \\mathbb{I}_N}$ the loading matrices.\nWith orthogonality constraints on the loading matrices, this decomposition can be seen as the\nmultidimensional version of the singular value decomposition [23].\nA natural way to tackle this problem is to infer the latent factors $\\mathcal{G}$ and $\\{\\mathbf{A}^{(m)}\\}_{m \\in \\mathbb{I}_N}$ in such a\nway that the discrepancy is low. 
Thus, the decomposition of $\\mathcal{X}$ is usually obtained by solving the\nfollowing optimization problem:\n\n$$\\min_{\\mathcal{G}, \\mathbf{A}^{(1)}, \\cdots, \\mathbf{A}^{(N)}} \\left\\{ f\\left(\\mathcal{G}, \\mathbf{A}^{(1)}, \\cdots, \\mathbf{A}^{(N)}\\right) \\triangleq \\frac{1}{2} \\left\\| \\mathcal{X} - \\mathcal{G} \\times_{m \\in \\mathbb{I}_N} \\mathbf{A}^{(m)} \\right\\|_F^2 \\right\\} \\qquad (1)$$\n\nOur goal in this work is to solve the above problem for large tensors while addressing two potential\nissues: the processing of a tensor that does not fit into the available memory and the memory\noverflow generated by intermediate operations during the decomposition process [21].\nFor this objective, we leverage a reformulation of the problem (1) in terms of subtensors drawn\nfrom $\\mathcal{X}$ with respect to one mode (which we suppose to be predefined), the final objective being to\nset up a divide-and-conquer type approach for the inference task. Let us consider a fixed integer $n$ (in\nthe sequel, $n$ will be referred to as the splitting mode). The objective function can then be rewritten\nin the following form (see supplementary, property 2):\n\n$$f\\left(\\mathcal{G}, \\mathbf{A}^{(1)}, \\cdots, \\mathbf{A}^{(N)}\\right) = \\sum_{i_n=1}^{I_n} \\frac{1}{2} \\left\\| \\mathcal{X}^n_{i_n} - \\mathcal{G} \\times_{m \\in \\mathbb{I}_{N \\neq n}} \\mathbf{A}^{(m)} \\times_n \\mathbf{A}^{(n)}_{i_n,:} \\right\\|_F^2 \\qquad (2)$$\n\nMore generally, the function $f$ given by (1) can be expressed in terms of subtensors drawn with\nrespect to every mode (see supplementary material, property 3). For simplicity, we only\naddress the case of subtensors drawn with respect to one mode; the general case can be derived\nfollowing the same principle (see supplementary material, section 5).\n\n3.2 Singleshot\n\nSince the problem (1) does not admit any analytic solution, we propose a numerical resolution\nbased on coordinate gradient descent [41]. 
The underlying idea is based on a cyclic update over\neach of the variables $\\mathcal{G}, \\mathbf{A}^{(1)}, \\ldots, \\mathbf{A}^{(N)}$ while fixing the others at their last updated values, each\nupdate being performed via a single iteration of gradient descent.\n\nAlgorithm 1 Singleshot\nInputs: $\\mathcal{X}$ tensor of interest, $n$ splitting mode, $\\{\\mathbf{A}^{(m)}_0\\}_{1 \\leq m \\leq N}$ initial loading matrices\nOutput: $\\mathcal{G}$, $\\{\\mathbf{A}^{(m)}\\}_{1 \\leq m \\leq N}$\nInitialization: $k = 0$\n1: while a predefined stopping criterion is not met do\n2:   Compute optimal step $\\eta^G_k$\n3:   $\\mathcal{G}_{k+1} \\leftarrow \\mathcal{G}_k - \\eta^G_k D^G_k$ with $D^G_k$ given by (4)\n4:   for $p$ from 1 to $N$ do\n5:     Compute optimal step $\\eta^p_k$\n6:     $\\mathbf{A}^{(p)}_{k+1} \\leftarrow \\mathbf{A}^{(p)}_k - \\eta^p_k D^p_k$ with $D^p_k$ given by (5),(6)\n7:   end for\n8: end while\n\nMore formally, given at iteration\n$k$ the value $\\mathcal{G}_k, \\mathbf{A}^{(1)}_k, \\cdots, \\mathbf{A}^{(N)}_k$ of the latent factors, the derivatives $D^G_k$ and $D^p_k$ of $f$ with respect\nto the core tensor and the $p$th loading matrix, respectively evaluated at $\\left(\\mathcal{G}_k, \\mathbf{A}^{(1)}_k, \\ldots, \\mathbf{A}^{(N)}_k\\right)$ and\n$\\left(\\mathcal{G}_{k+1}, \\mathbf{A}^{(1)}_{k+1}, \\cdots, \\mathbf{A}^{(p-1)}_{k+1}, \\mathbf{A}^{(p)}_k, \\ldots, \\mathbf{A}^{(N)}_k\\right)$, are given by:\n\n$$D^G_k = \\partial_{\\mathcal{G}} f\\left(\\mathcal{G}_k, \\left\\{\\mathbf{A}^{(m)}_k\\right\\}_1^N\\right), \\quad D^p_k = \\partial_{\\mathbf{A}^{(p)}} f\\left(\\mathcal{G}_{k+1}, \\left\\{\\mathbf{A}^{(m)}_{k+1}\\right\\}_1^{p-1}, \\left\\{\\mathbf{A}^{(q)}_k\\right\\}_p^N\\right) \\qquad (3)$$\n\nThe resulting cyclic update algorithm, named Singleshot, is summarized in Algorithm 1. A naive\nimplementation of the gradient computation would result in a memory overflow problem. In what\nfollows, we show that the derivatives $D^G_k$ and $D^p_k$, $1 \\leq p \\leq N$, given by the equation (3) can be\ncomputed by processing a single subtensor at a time, making Algorithm 1 amenable to sequential\nprocessing of subtensors. 
Discussions on how the step sizes are obtained will be provided in Section 4.\nDerivative with respect to $\\mathcal{G}$. The derivative with respect to the core tensor is given by (details in\nProperty 7 of supplementary material):\n\n$$D^G_k = \\sum_{i_n=1}^{I_n} \\underbrace{\\mathcal{R}_{i_n} \\times_{m \\in \\mathbb{I}_{N \\neq n}} \\left(\\mathbf{A}^{(m)}_k\\right)^\\top \\times_n \\left(\\left(\\mathbf{A}^{(n)}_k\\right)_{i_n,:}\\right)^\\top}_{\\theta_{i_n}}, \\quad \\mathcal{R}_{i_n} = -\\mathcal{X}^n_{i_n} + \\mathcal{G}_k \\times_{m \\in \\mathbb{I}_{N \\neq n}} \\mathbf{A}^{(m)}_k \\times_n \\left(\\mathbf{A}^{(n)}_k\\right)_{i_n,:} \\qquad (4)$$\n\nIt is straightforward to see that $D^G_k$ (given by the equation (4)) is the last term of the recursive\nsequence $\\left\\{\\left(D^G_k\\right)^j\\right\\}_{1 \\leq j \\leq I_n}$ defined as $\\left(D^G_k\\right)^j = \\left(D^G_k\\right)^{j-1} + \\theta_j$, with $\\left(D^G_k\\right)^0$ being the null tensor.\nAn important observation is that the additive term $\\theta_j$ (given by the equation (4)) depends only on one\nsingle subtensor $\\mathcal{X}^n_j$. This is the key of our approach since it allows the computation of $D^G_k$ through\nthe sequential processing of a single subtensor $\\mathcal{X}^n_j$ at a time.\nDerivatives with respect to $\\mathbf{A}^{(p)}$, $p \\neq n$ ($n$ being the splitting mode). For those derivatives, we\ncan exactly follow the same reasoning, given in detail in Property 9 of the Supplementary material,\nand obtain for $p < n$ (the case $p > n$ yields a similar formula):\n\n$$D^p_k = \\sum_{i_n=1}^{I_n} \\left(-\\left(\\mathbf{X}^n_{i_n}\\right)^{(p)} + \\mathbf{A}^{(p)}_k \\mathbf{B}^{(p)}_{i_n}\\right) \\left(\\mathbf{B}^{(p)}_{i_n}\\right)^\\top \\qquad (5)$$\n\nThe matrices $\\left(\\mathbf{X}^n_{i_n}\\right)^{(p)}$ and $\\mathbf{B}^{(p)}_{i_n}$ represent respectively the mode-$p$ matricized forms of the $i_n$th\nsubtensor $\\mathcal{X}^n_{i_n}$ and of the tensor $\\mathcal{B}_{i_n}$, which is defined by:\n\n$$\\mathcal{B}_{i_n} = \\mathcal{G}_{k+1} \\times_{m \\in \\mathbb{I}_{p-1}} \\mathbf{A}^{(m)}_{k+1} \\times_p \\mathbf{Id} \\times_{q \\in \\mathbb{I}^{p+1}_{N \\neq n}} \\mathbf{A}^{(q)}_k \\times_n \\left(\\mathbf{A}^{(n)}_k\\right)_{i_n,:}$$\n\nwith $\\mathbf{Id} \\in \\mathbb{R}^{J_p \\times J_p}$ being the identity matrix. With a similar reasoning as for the derivative with\nrespect to the core, it is straightforward to see that $D^p_k$ can be computed by processing a single\nsubtensor at a time.\n\nDerivative with respect to $\\mathbf{A}^{(n)}$ ($n$ being the splitting mode). The derivative with respect to the\nmatrix $\\mathbf{A}^{(n)}$ can be computed via the row-wise stacking of independent terms, namely the derivatives\nwith respect to the rows $\\mathbf{A}^{(n)}_{j,:}$, and the derivative of $f$ with respect to $\\mathbf{A}^{(n)}_{j,:}$ depends only on $\\mathcal{X}^n_j$.\nIndeed, let us consider $1 \\leq j \\leq I_n$. In the expression of the objective function $f$ given by the equation\n(2), the only term that depends on $\\mathbf{A}^{(n)}_{j,:}$ is $\\left\\| \\mathcal{X}^n_j - \\mathcal{G} \\times_{m \\in \\mathbb{I}_{n-1}} \\mathbf{A}^{(m)} \\times_n \\mathbf{A}^{(n)}_{j,:} \\times_{q \\in \\mathbb{I}^{n+1}_N} \\mathbf{A}^{(q)} \\right\\|_F^2$; thus\nthe derivative of $f$ with respect to $\\mathbf{A}^{(n)}_{j,:}$ depends only on $\\mathcal{X}^n_j$ and is given by (see property 8 in the\nsupplementary material):\n\n$$\\partial_{\\mathbf{A}^{(n)}_{j,:}} f\\left(\\mathcal{G}, \\left\\{\\mathbf{A}^{(m)}\\right\\}_1^N\\right) = -\\left(\\left(\\mathbf{X}^n_j\\right)^{(n)} - \\mathbf{A}^{(n)}_{j,:} \\mathbf{B}^{(n)}\\right) \\mathbf{B}^{(n)\\top} \\qquad (6)$$\n\nThe matrices $\\left(\\mathbf{X}^n_j\\right)^{(n)} \\in \\mathbb{R}^{1 \\times \\prod_{k \\neq n} I_k}$ and $\\mathbf{B}^{(n)}$ respectively represent the mode-$n$ matricized form of\nthe tensors $\\mathcal{X}^n_j$ and $\\mathcal{B}$, with $\\mathcal{B} = \\mathcal{G} \\times_{p \\in \\mathbb{I}_{n-1}} \\mathbf{A}^{(p)} \\times_n \\mathbf{Id} \\times_{q \\in \\mathbb{I}^{n+1}_N} \\mathbf{A}^{(q)}$, $\\mathbf{Id} \\in \\mathbb{R}^{J_n \\times J_n}$: identity matrix.\n\nRemark 1. For one-mode subtensors, it is relevant to choose $n$ such that $I_n$ is the largest dimension\nsince this yields the smallest subtensors. We stress that all entries of the tensor $\\mathcal{X}$ have been\nentirely processed when running Algorithm 1 and our key action is the sequential processing of\nsubtensors $\\mathcal{X}^n_{i_n}$. In addition, if one subtensor does not fit in the available memory, the recursion,\nas shown in section 5 of the supplementary material, can still be applied to subtensors of the form\n$\\mathcal{X}_{\\theta_1, \\ldots, \\theta_N}$, $\\theta_m \\subset \\{1, 2, \\ldots, I_m\\}$, with $(\\mathcal{X}_{\\theta_1, \\ldots, \\theta_N})_{i_1, \\ldots, i_N} = \\mathcal{X}_{i_1, \\ldots, i_N}$, $(i_1, \\ldots, i_N) \\in \\theta_1 \\times \\cdots \\times \\theta_N$, $\\times$\nreferring to the Cartesian product.\n\n3.3 Singleshotinexact\n\nWhile all of the subtensors $\\mathcal{X}^n_{i_n}$, $1 \\leq i_n \\leq I_n$, are considered in the Singleshot algorithm for the\ncomputation of the gradient, in Singleshotinexact, we propose to use only a subset of them for the\nsake of reducing computational time. The principle is to use for the gradients computation only\n$B_k < I_n$ subtensors. 
Let us consider the set $SET_k$ (of cardinality $B_k$) composed of the integers\nrepresenting the indexes of the subtensors that are going to be used at iteration $k$. The numerical\nresolution scheme is identical to the one described by Algorithm 1 except for the definition of $D^G_k$\nand $D^p_k$, $p \\neq n$, which are respectively replaced by $\\widehat{D}^G_k$ and $\\widehat{D}^p_k$ defined by:\n\n$$\\widehat{D}^G_k = \\sum_{i_n \\in SET_k} \\mathcal{R}_{i_n} \\times_{m \\in \\mathbb{I}_{N \\neq n}} \\left(\\mathbf{A}^{(m)}_k\\right)^\\top \\times_n \\left(\\left(\\mathbf{A}^{(n)}_k\\right)_{i_n,:}\\right)^\\top \\qquad (7)$$\n\n$$\\widehat{D}^p_k = \\sum_{i_n \\in SET_k} \\left(-\\left(\\mathbf{X}^n_{i_n}\\right)^{(p)} + \\mathbf{A}^{(p)}_k \\mathbf{B}^{(p)}_{i_n}\\right) \\left(\\mathbf{B}^{(p)}_{i_n}\\right)^\\top \\qquad (8)$$\n\nFor the theoretical convergence, the descent steps are defined as $\\eta^G_k / B_k$ and $\\eta^p_k / B_k$, $1 \\leq p \\leq N$. It is\nworth highlighting that the derivative $\\widehat{D}^n_k$ ($n$ being the mode with respect to which the subtensors are\ndrawn) is sparse: Singleshotinexact amounts to minimizing $f$ defined by (2) by dropping the terms\n$\\left\\| \\mathcal{X}^n_j - \\mathcal{G} \\times_{m \\in \\mathbb{I}_{N \\neq n}} \\mathbf{A}^{(m)} \\times_n \\mathbf{A}^{(n)}_{j,:} \\right\\|_F^2$ with $j \\notin SET_k$; thus, the rows $\\left(\\widehat{D}^n_k\\right)_{j,:}$, $j \\notin SET_k$, are\nall equal to zero.\n\n3.4 Discussions\n\nFirst, we discuss the space complexity needed by our algorithms supposing that the subtensors\nare drawn with respect to one mode. Let us denote by $n$ the splitting mode. For Singleshot and\nSingleshotinexact, at any given time, we only need to have in memory the tensor $\\mathcal{X}^n_j$ of size\n$\\prod_{m \\in \\mathbb{I}_{N \\neq n}} I_m = I_1 \\cdots I_{n-1} I_{n+1} \\cdots I_N$, the matrices $\\left\\{\\mathbf{A}^{(m)}\\right\\}_{m \\in \\mathbb{I}_{N \\neq n}}$, $\\mathbf{A}^{(n)}_{i_n,:}$ and the previous iterate of\nthe gradient. Thus, the complexity in space is $\\prod_{m \\in \\mathbb{I}_{N \\neq n}} I_m + \\sum_{m \\neq n} I_m J_m + J_n + AT$ with $AT$\nbeing the space complexity of the previous gradient iterate: for the core update, $AT = \\prod_{m \\in \\mathbb{I}_N} J_m$\nand for a matrix $\\mathbf{A}^{(m)}$, $AT = I_m J_m$. If the recursion used for the derivatives computation is applied\nto subtensors of the form $\\mathcal{X}_{\\theta_1, \\cdots, \\theta_N}$, the space complexity is smaller than these complexities.\nAnother variant of Singleshotinexact, named Singleshotonline, can be derived to address an interesting problem that has received\nlittle attention so far [4], that is the decomposition of a tensor streaming in every mode with a single\npass constraint (i.e. each chunk of data is processed only once). This is\nenabled by the inexact gradient computation which uses only subtensors that are needed. In the\nstreaming context, the gradient is computed based only on the available subtensor.\nPositivity constraints are among the most frequently encountered constraints in tensor computation and we can\nsimply handle them via the so-called projected gradient descent [45]. This operation does\nnot alter the space complexity with respect to the unconstrained case, since no additional storage is\nrequired, but it increases the complexity in time. See section 3 in the supplementary\nmaterial for the algorithmic details of the proposed variants.\n\n4 Theoretical result\n\nLet us consider the minimization problem (1):\n\n$$\\min_{\\mathcal{G}, \\mathbf{A}^{(1)}, \\ldots, \\mathbf{A}^{(N)}} f\\left(\\mathcal{G}, \\mathbf{A}^{(1)}, \\ldots, \\mathbf{A}^{(N)}\\right)$$\n\nBy denoting the block-wise derivative by $\\partial_x f$, the derivative of $f$, denoted $\\nabla f$ and defined by\n$(\\partial_{\\mathcal{G}} f, \\partial_{\\mathbf{A}^{(1)}} f, \\ldots, \\partial_{\\mathbf{A}^{(N)}} f)$, is an element of $\\mathbb{R}^{J_1 \\times \\cdots \\times J_N} \\times \\mathbb{R}^{I_1 \\times J_1} \\times \\cdots \\times \\mathbb{R}^{I_N \\times J_N}$ endowed with the norm\n$\\|\\cdot\\|_*$ defined as the sum of the Frobenius norms. Besides, let us consider, for writing simplicity, the alternative\nnotations of $f(\\mathcal{G}, \\mathbf{A}^{(1)}, \\cdots, \\mathbf{A}^{(N)})$ given by: $f\\left(\\mathcal{G}, \\left\\{\\mathbf{A}^{(m)}\\right\\}_1^N\\right)$, $f\\left(\\mathcal{G}, \\left\\{\\mathbf{A}^{(m)}\\right\\}_1^p, \\left\\{\\mathbf{A}^{(q)}\\right\\}_{p+1}^N\\right)$.\nFor the theoretical guarantees, whose details have been reported in the supplementary material, we\nconsider the following assumptions:\nAssumption 1. Uniform boundedness. The $n$th subtensors are bounded: $\\|\\mathcal{X}^n_{i_n}\\|_F \\leq \\rho$.\nAssumption 2. Boundedness of factors. We consider the domain $\\mathcal{G} \\in D_g$, $\\mathbf{A}^{(m)} \\in D_m$ with:\n\n$$D_g = \\left\\{\\|\\mathcal{G}_a\\|_F \\leq \\alpha\\right\\}, \\quad D_m = \\left\\{\\|\\mathbf{A}^{(m)}_a\\|_F \\leq \\alpha\\right\\}$$\n\n4.1 Convergence result of Singleshot\n\nFor the convergence analysis, we consider the following definitions of the descent steps $\\eta^G_k$ and $\\eta^p_k$ at\nthe $(k+1)$th iteration:\n\n$$\\eta^G_k = \\arg\\min_{\\eta \\in \\left[\\frac{\\delta_1}{\\sqrt{K}}, \\frac{\\delta_2}{\\sqrt{K}}\\right]} \\left(\\eta - \\frac{\\delta_1}{\\sqrt{K}}\\right) \\phi_g(\\eta), \\quad \\eta^p_k = \\arg\\min_{\\eta \\in \\left[\\frac{\\delta_1}{\\sqrt{K}}, \\frac{\\delta_2}{\\sqrt{K}}\\right]} \\left(\\eta - \\frac{\\delta_1}{\\sqrt{K}}\\right) \\phi_p(\\eta) \\qquad (9)$$\n\n$$\\phi_g(\\eta) = f\\left(\\mathcal{G}_k - \\eta D^G_k, \\left\\{\\mathbf{A}^{(m)}_k\\right\\}_1^N\\right) - f\\left(\\mathcal{G}_k, \\left\\{\\mathbf{A}^{(m)}_k\\right\\}_1^N\\right)$$\n\n$$\\phi_p(\\eta) = f\\left(\\mathcal{G}_{k+1}, \\left\\{\\mathbf{A}^{(m)}_{k+1}\\right\\}_1^{p-1}, \\mathbf{A}^{(p)}_k - \\eta D^p_k, \\left\\{\\mathbf{A}^{(q)}_k\\right\\}_{p+1}^N\\right) - f\\left(\\mathcal{G}_{k+1}, \\left\\{\\mathbf{A}^{(m)}_{k+1}\\right\\}_1^{p-1}, \\left\\{\\mathbf{A}^{(q)}_k\\right\\}_p^N\\right)$$\n\nwith $\\delta_2 > \\delta_1 > 0$ being user-defined parameters. The motivation of the problems given by the\nequation (9) is to ensure a decrease of the objective function after each update. 
Also note that, the\nminimization problems related to \u03b7G\nk are well de\ufb01ned since all the factors involved in their\nde\ufb01nitions are known at the computation stage of Gk+1 and A(p)\nk+1 and correspond to the minimization\nof a continuous function on a compact set.\n\nk and \u03b7p\n\n6\n\n\fAlong with Assumption 1 and Assumption 2, as well as the de\ufb01nitions given by (9), we assume that:\n\n\u03b41\u221a\nK\n\n< \u03b7G\n\nk \u2264 \u03b42\u221a\n\nK\n\nand\n\n\u03b41\u221a\nK\n\n< \u03b7p\n\nk \u2264 \u03b42\u221a\n\nK\n\n(10)\n\nThis simply amounts to consider that the solutions of the minimization problems de\ufb01ned by the\nequation (9) are not attained at the lower bound of the interval. Under this framework, we establish,\nas in standard non-convex settings [12], an ergodic convergence rate. Precisely, we prove that:\n\nk=0\n\n1\nK\n\n(cid:110)\n\n(cid:110)\n\nGk,\n\nGk,\n\nA(m)\n\nA(m)\n\n(cid:107)\u2207f\n\nwith (cid:107)\u2207f\n\n\u2203K0 \u2265 1,\u2200K \u2265 K0,\n\n(cid:19)\n(cid:111)N\n2 +(cid:80)N\n\n(cid:18)\n(cid:111)N\n(cid:16)\nsupremun of f,(cid:107)\u2202A(p) f (G,(cid:8)A(m)(cid:9))(cid:107)F ,(cid:107)\u2202Gf (G,(cid:8)A(m)(cid:9))(cid:107)F on the compact set Dg \u00d7 D1.. 
\u00d7 DN\nthe rate O(cid:16) 1\u221a\n\n)(cid:107)F ,\n, \u0393, \u0393p, \u0393g \u2265 0 being respectively the\n\nThis result proves that Singleshot converges ergodically to a point where the gradient is equal to 0 at\n\n(cid:107)2\u2217 \u2264 (N + 1)\u2206\u221a\n(cid:110)\n\np=1 (cid:107)\u2202A(p)f (Gk,\n\n(cid:107)\u2217 = (cid:107)\u2202Gf (Gk,\n\np=1(1 + 2\u0393 + \u03b12N \u03932\n\n2\u0393 + \u03b12N \u03932\n\n\u2206 = \u03b42\n\u03b42\n1\n\n1\np\u03b42\n2)\n\n1\ng\u03b42\n\nA(m)\n\nA(m)\n\n(cid:17)\n\n(11)\n\nK\n\n.\n\nk\n\nk\n\nk\n\n1\n\nK\n\n(cid:110)\n(cid:111)N\n\nk\n\n(cid:19)\n(cid:111)N\n)(cid:107)F +(cid:80)N\n(cid:17)\n\n1\n\nK\u22121(cid:88)\n\n(cid:18)\n\n4.2 Convergence result for Singleshotinexact\nLet us consider that (cid:96)j(A(N )) (cid:44) 1\n\n2(cid:107)X n\n\nk+1)j,: \u00d7q\nq\u2208In+1\nN\u22121\nk for A(N ) is de\ufb01ned by the following minimization problem:\n\nj \u2212 Gk+1 \u00d7m\nm\u2208In\u22121\n\nk+1 \u00d7n (A(n)\nA(m)\n\nk+1 \u00d7N A(N )(cid:107)2\nA(q)\n\nF\n\nand that the step \u03b7N\n\n(cid:18)\n\u03b7 \u2212 1\n(cid:111)N\n(cid:110)\n4K \u03b3\n\n(cid:19)\n(cid:19)\n\nA(m)\n\nk\n\n1\n\n\u03c6(\u03b7)\n\n+ \u03bbf\n\n(cid:18)\n\nGk+1,\n\n(cid:110)\n\nA(m)\nk+1\n\n(cid:111)N\u22121\n\n1\n\n(12)\n\n(cid:19)\n\n, \u02dc\u03c6(\u03b7)\n\n\u03c6(\u03b7) = f\n\n(cid:18)\n\n(cid:110)\nGk+1,\nk \u2212 \u03b7\n\n(cid:111)N\u22121\n(cid:80)\n\n1\n\nA(m)\nk+1\n\n\u03b7N\nk = arg min\n\u03b7\u2208[ 1\n4K\u03b3 , 1\n\u2212 f\n\n, A(N )\n\n(cid:19)\n\n(cid:18)\n\nK\u03b3 ]\nGk,\n\nk\n\nk\n\nBk\n\nj\u2208SET k\n\n2. Non-vanishing gradient with respect to A(N ) \u2202A(N ) f\n\n\u2202A(N )(cid:96)j, \u2202A(N ) (cid:96)j being the derivative of (cid:96)j evaluated at A(N )\n\nand \u02dc\u03c6(\u03b7) = A(N )\n.\nThe parameters \u03bb > 0, \u03b3 > 1 represent user-de\ufb01ned parameters. In addition to Assumption 1 and\nAssumption 2, we consider the following three additional assumptions:\nk \u2264 1\n1. 
Descent step related to the update of A(N ). We assume that\n(cid:111)N\u22121\nthat the solution of problem (12) is not attained at the lower bound of the interval.\n(cid:54)= 0.\nA(m)\nk+1\n\u2202A(N )(cid:96)j (cid:54)= 0 and the set\n\n(cid:18)\n(cid:110)\nThis condition ensures the existence of a set SET k such that(cid:80)\n3. Choice of the number of subtensors Bk. We suppose that In \u00d7(cid:113) 1\n(cid:19)\n(cid:111)N\n\nconsidered for the problem (12) is one of such sets.\n\nThis condition In > 2 ensures that In\n\nGk+1,\nj\u2208SET k\n\n\u2264 Bk and In > 2.\n\n\u03b12N , which means\n\n(cid:113) 1\n\n4K\u03b3 < \u03b7N\n\n2 + 1\nIn\n\n2 + 1\nIn\n\n, A(N )\n\n< In.\n\n(cid:19)\n\n(cid:18)\n\n(cid:110)\n\nk\n\n1\n\n1\n\nWith these assumptions at hand, the sequence \u2206k = f\n\n\u2212 fmin veri\ufb01es:\n\n(cid:18)\n\nA(m)\n\nk\n\n1\n\nGk,\n\n(cid:19)\n\n\u2200k > k0 = 1 +\n\n1\n\nlog\n\n1\n\nlog(1 + \u03bb)\n\nlog(1 + \u03bb)\n\n, \u2206k \u2264 \u22061 + \u03b6(\u03bb, \u03c1, \u03b1, In)\n\nk \u2212 k0\n\n(13)\n\nwith log being the logarithmic function, fmin representing the minimizer of f, a continuous function\nde\ufb01ned on the compact set Dg \u00d7 D1 \u00d7 .... \u00d7 DN and \u03b6a function of \u03bb, \u03c1, \u03b1, In. The parameter k0 is\ngenerated by\nwell-de\ufb01ned since \u03bb > 0. This result ensures that the sequence\n\n(cid:110)Gk, A(1)\nSingleshotinexact converges to the set of minimizers of f at the rate O(cid:0) 1\n(cid:1)\n\nk , ., A(N )\n\n(cid:111)\n\nk\n\nRemark 2. The problems de\ufb01ned by the equations (9) and (12), which solutions are global and can\nbe solved by simple heuristics (e.g. 
Golden section), are not in contradiction with our approach since\nthey can be solved by processing a single subtensor at a time due to the expression of f given by (2).\n\nFigure 1: Approximation error and running time for the unconstrained decomposition algorithms.\nFrom left to right: first and second figures represent Movielens, third and fourth represent Enron. As\nM grows, missing markers for a given algorithm mean that it ran out of memory. The core G rank\nis (5, 5, 5).\n\nFigure 2: Approximation error and running time for the non-negative decomposition algorithms.\nFrom left to right: first and second figures represent Movielens, third and fourth represent Enron. As\nM grows, missing markers for a given algorithm mean that it ran out of memory. The core G rank\nis (5, 5, 5).\n\n5 Numerical experiments\n\nOur goal here is to illustrate that for small tensors, our algorithms Singleshot and Singleshotinexact,\nand their positive variants, are comparable to some state-of-the-art decomposition methods. Then, as the\ntensor size grows, we show that they are the only ones that are scalable. The competitors we have\nconsidered include an SVD-based iterative algorithm [44] (denoted GreedyHOOI), a very recent alternate\nminimization approach based on sketching [4] (named Tensorsketch) and randomization-based\nmethods [51] (Algorithm 2 in [51] named Scalrandomtucker and Algorithm 1 in [51] with positivity\nconstraints named posScalrandomtucker). Other materials related to the numerical experiments are\nprovided in section 4 of the supplementary material. 
For the tensor computation, we have used\nthe TensorLy tool [22].\nThe experiments are performed on the Movielens dataset [15], from which we construct a 3-order\ntensor whose modes represent timesteps, movies and users, and on the Enron email dataset, from which\na 3-order tensor is constructed, the first and second modes representing the sender and recipients of\nthe emails and the third one denoting the most frequent words used in the miscellaneous emails. For\nMovielens, we set up a recommender system for which we report a mean average precision (MAP)\nobtained over a test set (averaged over five 50-50 train-test random splits) and for Enron, we report\nan error (AE) on a test set (with the same size as the training one) for a regression problem. As our\ngoal is to analyze the scalability of the different methods as the tensor to decompose grows, we have\narbitrarily set the size of the Movielens and Enron tensors to M \u00d7 M \u00d7 200 and M \u00d7 M \u00d7 610, M\nbeing a user-defined parameter. Experiments have been run on macOS with 32 GB of memory.\nAnother important objective is to show the robustness of our approach with respect to the assumptions\nand the definitions related to the descent steps laid out in section 4, which is of primary importance\nsince the minimization problems defining these steps can be time-consuming in practice for large\ntensors. This is why, for our algorithms, the descent steps are fixed in advance. For\nSingleshot, the steps are fixed to $10^{-6}$. For Singleshotinexact, the steps are respectively fixed to\n$10^{-6}$ and $10^{-8}$ for Enron and Movielens. Regarding the computation of the inexact gradient for\nSingleshotinexact, the elements in SET k are generated randomly without replacement with the same\ncardinality for any k. 
The number of slices is chosen according to the third assumption in Section 4.2.\nIn the charts, the unconstrained versions of Singleshot and Singleshotinexact are suffixed with\n\"unconstrained\" and the positive variants with \"positive\".\n\n[Plots for Figures 1 and 2. x-axis: Dimension M. y-axes: Mean Average Precision (MAP), Approximation error (AE) and Running time (s). Unconstrained curves: GreedyHOOI, Scalrandomtucker, Tensorsketch, Singleshotinexact unconstrained, Singleshot unconstrained. Non-negative curves: posScalrandomtucker, Singleshotinexact positive, Singleshot positive.]\n\nFigure 3: Comparing the online versions of Tensorsketch and Singleshot with positivity constraints on the\nEnron dataset. (Left) Approximation error. (Right) Running time.\n\nFigure 1 presents the results we obtain for these two datasets. First, we can note that the performances,\nin terms of MAP or AE, are rather equivalent for the different methods. 
Regarding the running\ntime, Scalrandomtucker is the best performing algorithm, being an order of magnitude more\nefficient than the other approaches. However, all competing methods run out of memory on tensors with\ndimension M = 4000 for Enron and M \u2265 2800 for Movielens.\nIn contrast, our Singleshot methods are still able to decompose those tensors, although the running time\nis large. As expected, Singleshotinexact is more computationally efficient than Singleshot.\nFigure 2 displays the approximation error and the running time for Singleshot and Singleshotinexact\nwith positivity constraints and for a randomized decomposition approach with non-negativity constraints,\ndenoted here as posScalrandomtucker, for Enron and Movielens. The decomposition quality\nfavors Singleshotpositive for both Movielens and Enron. In addition, when the tensor size is small\nenough, posScalrandomtucker is very computationally efficient, being one order of magnitude faster\nthan our Singleshot approaches on Enron. However, posScalrandomtucker is not able to decompose\nvery large tensors and runs out of memory, unlike Singleshot.\nTo illustrate the online capability of our algorithm, we have considered a tensor of size 20000 \u00d7\n2000 \u00d7 50 constructed from Enron, which is artificially divided into slices drawn with respect to\nthe first and the second modes. The core rank is (R, R, R). We compare the online variant of our\napproach with positivity constraints, named Singleshotonlinepositive, to the online version\nof Tensorsketch [4], denoted Tensorsketchonline. Figure 3 shows the approximation error and running time for both algorithms.\nWhile of equivalent performance, our method is faster as our proposed update schemes, based on one\nsingle step of gradient descent, are more computationally efficient than a full alternate minimization.\n\nRemark 3. 
Other assessments are provided in the supplementary material: comparisons with other\nrecent divide-and-conquer type approaches are provided, the non-nullity of the gradient with respect\nto A^{(n)} is numerically shown, and finally, the expected behavior of Singleshotinexact is demonstrated,\ni.e. \u201cthe higher the number of subtensors in the gradient approximation, the better performance we\nget\u201d.\n\n6 Conclusion\n\nWe have introduced two new algorithms named Singleshot and Singleshotinexact for scalable Tucker\ndecomposition with convergence rate guarantees: for Singleshot, we have established a convergence\nrate to the set of minimizers of O(1/\u221aK) (K being the maximum number of iterations) and for\nSingleshotinexact, a convergence rate of O(1/k) (k being the iteration number). Besides, we have\nproposed a new approach for a problem that has drawn little attention so far, that is, the Tucker\ndecomposition under the single pass constraint (with no need to resort to the past data) of a tensor\nstreaming with respect to every mode. In future works, we aim at applying the principle of Singleshot\nto other decomposition problems different from Tucker.\n\n[Plots for Figure 3. x-axis: Rank R. y-axes: Approximation Error (AE) and Running time (s). Curves: Tensorsketchonline, Singleshotonline positive.]\n\nAcknowledgments\n\nThis work was supported by grants from the Normandie Projet GRR-DAISI, European funding\nFEDER DAISI and LEAUDS ANR-18-CE23-0020 Project of the French National Research Agency\n(ANR).\n\nReferences\n[1] Woody Austin, Grey Ballard, and Tamara G. Kolda. Parallel tensor compression for large-scale\nscientific data. 2016 IEEE International Parallel and Distributed Processing Symposium, 2016.\n\n[2] Grey Ballard, Alicia Klin, and Tamara G. Kolda. 
Tuckermpi: A parallel c++/mpi software\npackage for large-scale data compression via the tucker tensor decomposition. arXiv, 2019.\n\n[3] Muthu Manikandan Baskaran, Beno\u00eet Meister, Nicolas Vasilache, and Richard Lethin. Efficient\nand scalable computations with sparse tensors. 2012 IEEE Conference on High Performance\nExtreme Computing, pages 1\u20136, 2012.\n\n[4] Stephen Becker and Osman Asif Malik. Low-rank tucker decomposition of large tensors using\ntensorsketch. Advances in Neural Information Processing Systems, pages 10117\u201310127, 2018.\n\n[5] Cesar F. Caiafa and Andrzej Cichocki. Generalizing the column-row matrix decomposition to\nmultiway arrays. In Linear Algebra and its Applications, volume 433, pages 557\u2013573, 2010.\n\n[6] Raymond B. Cattell. \u201cParallel proportional profiles\u201d and other principles for determining the\nchoice of factors by rotation. Psychometrika, 9(4):267\u2013283, 1944.\n\n[7] Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Prakash Murali, Shivmaran S.\nPandian, Yogish Sabharwal, and Dheeraj Sreedhar. On optimizing distributed tucker decomposition\nfor sparse tensors. In Proceedings of the 2018 International Conference on Supercomputing,\npages 374\u2013384, 2018.\n\n[8] Maolin Che and Yimin Wei. Randomized algorithms for the approximations of tucker and the\ntensor train decompositions. Advances in Computational Mathematics, pages 1\u201334, 2018.\n\n[9] Dongjin Choi, Jun-Gi Jang, and Uksong Kang. Fast, accurate, and scalable method for sparse\ncoupled matrix-tensor factorization. CoRR, 2017.\n\n[10] Dongjin Choi and Lee Sael. Snect: Scalable network constrained tucker decomposition for\nintegrative multi-platform data analysis. CoRR, 2017.\n\n[11] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Nonnegative Matrix\nand Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind\nSource Separation. 
Wiley Publishing, 2009.\n\n[12] Dan Alistarh et al. The convergence of sparsified gradient methods. Advances in Neural\nInformation Processing Systems, pages 5977\u20135987, 2018.\n\n[13] Petros Drineas and Michael W. Mahoney. A randomized algorithm for a tensor-based generalization\nof the singular value decomposition. Linear Algebra and its Applications, 420:553\u2013571,\n2007.\n\n[14] D\u00f3ra Erd\u00f6s and Pauli Miettinen. Walk \u2019n\u2019 merge: A scalable algorithm for boolean tensor\nfactorization. 2013 IEEE 13th International Conference on Data Mining, pages 1037\u20131042,\n2013.\n\n[15] F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. TiiS,\n5:19:1\u201319:19, 2015.\n\n[16] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys.,\n6(1):164\u2013189, 1927.\n\n[17] Inah Jeon, Evangelos E. Papalexakis, Uksong Kang, and Christos Faloutsos. Haten2: Billion-scale\ntensor decompositions. 2015 IEEE 31st International Conference on Data Engineering,\npages 1047\u20131058, 2015.\n\n[18] Oguz Kaya and Bora U\u00e7ar. High performance parallel algorithms for the tucker decomposition\nof sparse tensors. 2016 45th International Conference on Parallel Processing (ICPP), pages\n103\u2013112, 2016.\n\n[19] Tamara Kolda and Brett Bader. The tophits model for higher-order web link analysis. Workshop\non Link Analysis, Counterterrorism and Security, 7, 2006.\n\n[20] Tamara G. Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining.\nIn Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages\n363\u2013372, 2008.\n\n[21] Tamara G. Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining.\nIn Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM \u201908,\npages 363\u2013372, 2008.\n\n[22] Jean Kossaifi, Yannis Panagakis, and Maja Pantic. 
Tensorly: Tensor learning in python. arXiv,\n2018.\n\n[23] Lieven De Lathauwer et al. A multilinear singular value decomposition. SIAM J. Matrix Anal.\nAppl., 21(4):1253\u20131278, 2000.\n\n[24] Lieven De Lathauwer and Joos Vandewalle. Dimensionality reduction in higher-order signal\nprocessing and rank-(r1, r2,...,rn) reduction in multilinear algebra. Linear Algebra and its\nApplications, 391:31\u201355, 2004.\n\n[25] Dongha Lee, Jaehyung Lee, and Hwanjo Yu. Fast tucker factorization for large-scale tensor\ncompletion. 2018 IEEE International Conference on Data Mining (ICDM), pages 1098\u20131103,\n2018.\n\n[26] Xiaoshan Li, Hua Zhou, and Lexin Li. Tucker tensor regression and neuroimaging analysis.\nStatistics in Biosciences, 04 2013.\n\n[27] Xinsheng Li, K. Sel\u00e7uk Candan, and Maria Luisa Sapino. M2td: Multi-task tensor decomposition\nfor sparse ensemble simulations. 2018 IEEE 34th International Conference on Data\nEngineering (ICDE), pages 1144\u20131155, 2018.\n\n[28] Ching-Yung Lin, Nan Cao, Shi Xia Liu, Spiros Papadimitriou, J Sun, and Xifeng Yan. Smallblue:\nSocial network analysis for expertise search and collective intelligence. ICDE, pages 1483\u20131486,\n2009.\n\n[29] Michael W. Mahoney, Mauro Maggioni, and Petros Drineas. Tensor-cur decompositions for\ntensor-based data. In SIGKDD, pages 327\u2013336, 2006.\n\n[30] Carmeliza Navasca and Deonnia N. Pompey. Random projections for low multilinear rank\ntensors. In Visualization and Processing of Higher Order Descriptors for Multi-Valued Data,\npages 93\u2013106, 2015.\n\n[31] Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu. S-hot:\nScalable high-order tucker decomposition. In Proceedings of the Tenth ACM International\nConference on Web Search and Data Mining, pages 761\u2013770, 2017.\n\n[32] Sejoon Oh, Namyong Park, Lee Sael, and Uksong Kang. Scalable tucker factorization for\nsparse tensors - algorithms and discoveries. 
2018 IEEE 34th International Conference on Data\nEngineering (ICDE), pages 1120\u20131131, 2018.\n\n[33] Moonjeong Park, Jun-Gi Jang, and Sael Lee. Vest: Very sparse tucker factorization of large-scale\ntensors. 04 2019.\n\n[34] Namyong Park, Sejoon Oh, and U Kang. Fast and scalable method for distributed boolean\ntensor factorization. In The VLDB Journal, pages 1\u201326, 2019.\n\n[35] Ioakeim Perros, Robert Chen, Richard Vuduc, and J Sun. Sparse hierarchical tucker factorization\nand its application to healthcare. pages 943\u2013948, 2015.\n\n[36] Kijung Shin, Lee Sael, and U Kang. Fully scalable methods for distributed tensor factorization.\nIEEE Trans. on Knowl. and Data Eng., 29(1):100\u2013113, January 2017.\n\n[37] Nicholas D. Sidiropoulos, Evangelos E. Papalexakis, and Christos Faloutsos. Parallel randomly\ncompressed cubes (paracomp): A scalable distributed architecture for big tensor decomposition.\n2014.\n\n[38] Shaden Smith and George Karypis. Accelerating the tucker decomposition with compressed\nsparse tensors. In Euro-Par, 2017.\n\n[39] Jimeng Sun, Spiros Papadimitriou, Ching-Yung Lin, Nan Cao, Mengchen Liu, and Weihong\nQian. Multivis: Content-based social network exploration through multi-way visual analysis.\nIn SDM, 2009.\n\n[40] Jimeng Sun, Dacheng Tao, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Incremental\ntensor analysis: Theory and applications. TKDD, 2:11:1\u201311:37, 2008.\n\n[41] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable\nminimization. In Mathematical Programming, volume 117, pages 387\u2013423, 2007.\n\n[42] Charalampos E. Tsourakakis. Mach: Fast randomized tensor decompositions. In SDM, 2009.\n\n[43] L. R. Tucker. Implications of factor analysis of three-way matrices for measurement of change.\nC.W. Harris (Ed.), Problems in Measuring Change, University of Wisconsin Press, pages\n122\u2013137, 1963.\n\n[44] Yangyang Xu. 
On the convergence of higher-order orthogonal iteration. Linear and Multilinear\nAlgebra, pages 2247\u20132265, 2017.\n\n[45] Rafal Zdunek et al. Fast nonnegative matrix factorization algorithms using projected gradient\napproaches for large-scale problems. Intell. Neuroscience, 2008:3:1\u20133:13, 2008.\n\n[46] Qibin Zhao, Liqing Zhang, and Andrzej Cichocki. Bayesian sparse tucker models for dimension\nreduction and tensor completion. CoRR, 2015.\n\n[47] Shandian Zhe, Yuan Qi, Youngja Park, Zenglin Xu, Ian Molloy, and Suresh Chari. Dintucker:\nScaling up gaussian process models on large multidimensional arrays. In AAAI, 2016.\n\n[48] Shandian Zhe, Zenglin Xu, Xinqi Chu, Yuan Qi, and Youngja Park. Scalable nonparametric\nmultiway data analysis. In AISTATS, 2015.\n\n[49] Shandian Zhe, Kai Zhang, Pengyuan Wang, Kuang-chih Lee, Zenglin Xu, Yuan Qi, and Zoubin\nGhahramani. Distributed flexible nonlinear tensor factorization. In Proceedings of the 30th\nInternational Conference on Neural Information Processing Systems, NIPS\u201916, pages 928\u2013936,\n2016.\n\n[50] Guoxu Zhou et al. Efficient nonnegative tucker decompositions: Algorithms and uniqueness.\nIEEE Transactions on Image Processing, 24(12):4990\u20135003, 2015.\n\n[51] Guoxu Zhou, Andrzej Cichocki, and Shengli Xie. Decomposition of big tensors with low\nmultilinear rank. CoRR, 2014.", "award": [], "sourceid": 3398, "authors": [{"given_name": "Abraham", "family_name": "Traore", "institution": "University of Rouen"}, {"given_name": "Maxime", "family_name": "Berar", "institution": "Universit\u00e9 de Rouen"}, {"given_name": "Alain", "family_name": "Rakotomamonjy", "institution": "Universit\u00e9 de Rouen Normandie   Criteo AI Lab"}]}