{"title": "Tensorizing Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 442, "page_last": 450, "abstract": "Deep neural networks currently demonstrate state-of-the-art performance in several domains. At the same time, models of this class are very demanding in terms of computational resources. In particular, a large amount of memory is required by commonly used fully-connected layers, making it hard to use the models on low-end devices and stopping the further increase of the model size. In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved. In particular, for the Very Deep VGG networks we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times leading to the compression factor of the whole network up to 7 times.", "full_text": "Tensorizing Neural Networks\n\nAlexander Novikov1,4\n\nDmitry Podoprikhin1\n\nAnton Osokin2\n\nDmitry Vetrov1,3\n\n1Skolkovo Institute of Science and Technology, Moscow, Russia\n\n2INRIA, SIERRA project-team, Paris, France\n\n3National Research University Higher School of Economics, Moscow, Russia\n\n4Institute of Numerical Mathematics of the Russian Academy of Sciences, Moscow, Russia\n\nnovikov@bayesgroup.ru\n\npodoprikhin.dmitry@gmail.com\n\nanton.osokin@inria.fr\n\nvetrovd@yandex.ru\n\nAbstract\n\nDeep neural networks currently demonstrate state-of-the-art performance in several\ndomains. At the same time, models of this class are very demanding in terms\nof computational resources. In particular, a large amount of memory is required\nby commonly used fully-connected layers, making it hard to use the models on\nlow-end devices and stopping the further increase of the model size. 
In this paper\nwe convert the dense weight matrices of the fully-connected layers to the Tensor\nTrain [17] format such that the number of parameters is reduced by a huge factor\nand at the same time the expressive power of the layer is preserved. In particular,\nfor the Very Deep VGG networks [21] we report the compression factor of the\ndense weight matrix of a fully-connected layer up to 200000 times leading to the\ncompression factor of the whole network up to 7 times.\n\n1\n\nIntroduction\n\nDeep neural networks currently demonstrate state-of-the-art performance in many domains of large-\nscale machine learning, such as computer vision, speech recognition, text processing, etc. These\nadvances have become possible because of algorithmic advances, large amounts of available data,\nand modern hardware. For example, convolutional neural networks (CNNs) [13, 21] show by a large\nmargin superior performance on the task of image classi\ufb01cation. These models have thousands of\nnodes and millions of learnable parameters and are trained using millions of images [19] on powerful\nGraphics Processing Units (GPUs).\nThe necessity of expensive hardware and long processing time are the factors that complicate the\napplication of such models on conventional desktops and portable devices. Consequently, a large\nnumber of works tried to reduce both hardware requirements (e. g. memory demands) and running\ntimes (see Sec. 2).\nIn this paper we consider probably the most frequently used layer of the neural networks: the fully-\nconnected layer. This layer consists in a linear transformation of a high-dimensional input signal to a\nhigh-dimensional output signal with a large dense matrix de\ufb01ning the transformation. 
For example,\nin modern CNNs the dimensions of the input and output signals of the fully-connected layers are\nof the order of thousands, bringing the number of parameters of the fully-connected layers up to\nmillions.\nWe use a compact multilinear format \u2013 Tensor-Train (TT-format) [17] \u2013 to represent the dense\nweight matrix of the fully-connected layers using few parameters while keeping enough flexibility\nto perform signal transformations. The resulting layer is compatible with the existing training\nalgorithms for neural networks because all the derivatives required by the back-propagation\nalgorithm [18] can be computed using the properties of the TT-format. We call the resulting layer a\nTT-layer and refer to a network with one or more TT-layers as TensorNet.\nWe apply our method to popular network architectures proposed for several datasets of different\nscales: MNIST [15], CIFAR-10 [12], ImageNet [13]. We experimentally show that the networks\nwith the TT-layers match the performance of their uncompressed counterparts while the weight\nmatrix of a compressed fully-connected layer requires up to 200 000 times fewer parameters,\ndecreasing the size of the whole network by a factor of up to 7.\nThe rest of the paper is organized as follows. We start with a review of the related work in Sec. 2.\nWe introduce necessary notation and review the Tensor Train (TT) format in Sec. 3. In Sec. 4\nwe apply the TT-format to the weight matrix of a fully-connected layer and in Sec. 5 derive all\nthe equations necessary for applying the back-propagation algorithm. In Sec. 6 we present the\nexperimental evaluation of our ideas followed by a discussion in Sec. 7.\n\n2 Related work\n\nWith a sufficient amount of training data, big models usually outperform smaller ones. 
However, state-of-the-art\nneural networks have reached the hardware limits in terms of both the computational power and\nthe memory.\nIn particular, modern networks reached the memory limit with 89% [21] or even 100% [25] memory\noccupied by the weights of the fully-connected layers so it is not surprising that numerous attempts\nhave been made to make the fully-connected layers more compact. One of the most straightforward\napproaches is to use a low-rank representation of the weight matrices. Recent studies show that\nthe weight matrix of the fully-connected layer is highly redundant and by restricting its matrix rank\nit is possible to greatly reduce the number of parameters without a significant drop in the predictive\naccuracy [6, 20, 25].\nAn alternative approach to the problem of model compression is to tie random subsets of weights\nusing special hashing techniques [4]. The authors reported the compression factor of 8 for a two-layered\nnetwork on the MNIST dataset without loss of accuracy. Memory consumption can also be\nreduced by using lower numerical precision [1] or allowing fewer possible carefully chosen parameter\nvalues [9].\nIn our paper we generalize the low-rank ideas. Instead of searching for a low-rank approximation of\nthe weight matrix we treat it as a multi-dimensional tensor and apply the Tensor Train decomposition\nalgorithm [17]. This framework has already been successfully applied to several data-processing\ntasks, e. g. [16, 27].\nAnother possible advantage of our approach is the ability to use more hidden units than was available\nbefore. A recent work [2] shows that it is possible to construct wide and shallow (i. e. not deep)\nneural networks with performance close to the state-of-the-art deep CNNs by training a shallow\nnetwork on the outputs of a trained deep network. 
They report the improvement of performance\nwith the increase of the layer size and used up to 30 000 hidden units while restricting the matrix\nrank of the weight matrix in order to be able to keep and to update it during the training. Restricting\nthe TT-ranks of the weight matrix (in contrast to the matrix rank) allows to use much wider layers\npotentially leading to the greater expressive power of the model. We demonstrate this effect by\ntraining a very wide model (262 144 hidden units) on the CIFAR-10 dataset that outperforms other\nnon-convolutional networks.\nMatrix and tensor decompositions were recently used to speed up the inference time of CNNs [7,\n14]. While we focus on fully-connected layers, Lebedev et al. [14] used the CP-decomposition to\ncompress a 4-dimensional convolution kernel and then used the properties of the decomposition to\nspeed up the inference time. This work shares the same spirit with our method and the approaches\ncan be readily combined.\nGilboa et al. exploit the properties of the Kronecker product of matrices to perform fast matrix-by-\nvector multiplication [8]. These matrices have the same structure as TT-matrices with unit TT-ranks.\nCompared to the Tucker format [23] and the canonical format [3], the TT-format is immune to\nthe curse of dimensionality and its algorithms are robust. Compared to the Hierarchical Tucker\nformat [11], TT is quite similar but has simpler algorithms for basic operations.\n\n3 TT-format\n\nThroughout this paper we work with arrays of different dimensionality. We refer to the one-\ndimensional arrays as vectors, the two-dimensional arrays \u2013 matrices, the arrays of higher dimen-\nsions \u2013 tensors. Bold lower case letters (e. g. a) denote vectors, ordinary lower case letters (e. g.\na(i) = ai) \u2013 vector elements, bold upper case letters (e. g. A) \u2013 matrices, ordinary upper case letters\n(e. g. A(i, j)) \u2013 matrix elements, calligraphic bold upper case letters (e. g. 
A) \u2013 tensors, and\nordinary calligraphic upper case letters (e. g. $\\mathcal{A}(\\boldsymbol{i}) = \\mathcal{A}(i_1, \\dots, i_d)$) \u2013 tensor elements, where $d$ is\nthe dimensionality of the tensor $\\boldsymbol{\\mathcal{A}}$.\nWe will call arrays explicit to highlight cases when they are stored explicitly, i. e. by enumeration of\nall the elements.\nA $d$-dimensional array (tensor) $\\boldsymbol{\\mathcal{A}}$ is said to be represented in the TT-format [17] if for each dimension\n$k = 1, \\dots, d$ and for each possible value of the $k$-th dimension index $j_k = 1, \\dots, n_k$ there\nexists a matrix $\\boldsymbol{G}_k[j_k]$ such that all the elements of $\\boldsymbol{\\mathcal{A}}$ can be computed as the following matrix\nproduct:\n\n$$\\mathcal{A}(j_1, \\dots, j_d) = \\boldsymbol{G}_1[j_1] \\boldsymbol{G}_2[j_2] \\cdots \\boldsymbol{G}_d[j_d]. \\tag{1}$$\n\nAll the matrices $\\boldsymbol{G}_k[j_k]$ related to the same dimension $k$ are restricted to be of the same\nsize $r_{k-1} \\times r_k$. The values $r_0$ and $r_d$ equal 1 in order to keep the matrix product (1) of size $1 \\times 1$. In\nwhat follows we refer to the representation of a tensor in the TT-format as the TT-representation or\nthe TT-decomposition. The sequence $\\{r_k\\}_{k=0}^{d}$ is referred to as the TT-ranks of the TT-representation\nof $\\boldsymbol{\\mathcal{A}}$ (or the ranks for short), its maximum \u2013 as the maximal TT-rank of the TT-representation\nof $\\boldsymbol{\\mathcal{A}}$: $r = \\max_{k=0,\\dots,d} r_k$. The collections of the matrices $(\\boldsymbol{G}_k[j_k])_{j_k=1}^{n_k}$ corresponding to the same\ndimension (technically, 3-dimensional arrays $\\boldsymbol{G}_k$) are called the cores.\nOseledets [17, Th. 2.1] shows that for an arbitrary tensor $\\boldsymbol{\\mathcal{A}}$ a TT-representation exists but is not\nunique. The ranks among different TT-representations can vary and it is natural to seek a representation\nwith the lowest ranks.\nWe use the symbols $G_k[j_k](\\alpha_{k-1}, \\alpha_k)$ to denote the element of the matrix $\\boldsymbol{G}_k[j_k]$ in the position\n$(\\alpha_{k-1}, \\alpha_k)$, where $\\alpha_{k-1} = 1, \\dots, r_{k-1}$, $\\alpha_k = 1, \\dots, r_k$. Equation (1) can be equivalently rewritten\nas the sum of the products of the elements of the cores:\n\n$$\\mathcal{A}(j_1, \\dots, j_d) = \\sum_{\\alpha_0, \\dots, \\alpha_d} G_1[j_1](\\alpha_0, \\alpha_1) \\cdots G_d[j_d](\\alpha_{d-1}, \\alpha_d). \\tag{2}$$\n\nThe representation of a tensor $\\boldsymbol{\\mathcal{A}}$ via the explicit enumeration of all its elements requires storing\n$\\prod_{k=1}^{d} n_k$ numbers compared with $\\sum_{k=1}^{d} n_k r_{k-1} r_k$ numbers if the tensor is stored in the TT-format.\nThus, the TT-format is very efficient in terms of memory if the ranks are small.\nAn attractive property of the TT-decomposition is the ability to efficiently perform several types\nof operations on tensors if they are in the TT-format: basic linear algebra operations, such as the\naddition of a constant and the multiplication by a constant, the summation and the entrywise product\nof tensors (the results of these operations are tensors in the TT-format generally with the increased\nranks); computation of global characteristics of a tensor, such as the sum of all elements and the\nFrobenius norm. See [17] for a detailed description of all the supported operations.\n\n3.1 TT-representations for vectors and matrices\nThe direct application of the TT-decomposition to a matrix (2-dimensional tensor) coincides with\nthe low-rank matrix format and the direct TT-decomposition of a vector is equivalent to explicitly\nstoring its elements. To be able to efficiently work with large vectors and matrices the TT-format\nfor them is defined in a special manner. Consider a vector $\\boldsymbol{b} \\in \\mathbb{R}^N$, where $N = \\prod_{k=1}^{d} n_k$. We\ncan establish a bijection $\\mu$ between the coordinate $\\ell \\in \\{1, \\dots, N\\}$ of $\\boldsymbol{b}$ and a $d$-dimensional vector-index\n$\\mu(\\ell) = (\\mu_1(\\ell), \\dots, \\mu_d(\\ell))$ of the corresponding tensor $\\boldsymbol{\\mathcal{B}}$, where $\\mu_k(\\ell) \\in \\{1, \\dots, n_k\\}$. The\ntensor $\\boldsymbol{\\mathcal{B}}$ is then defined by the corresponding vector elements: $\\mathcal{B}(\\mu(\\ell)) = b_\\ell$. Building a TT-representation\nof $\\boldsymbol{\\mathcal{B}}$ allows us to establish a compact format for the vector $\\boldsymbol{b}$. 
We refer to it as a\nTT-vector.\n\nNow we define a TT-representation of a matrix $\\boldsymbol{W} \\in \\mathbb{R}^{M \\times N}$, where $M = \\prod_{k=1}^{d} m_k$ and\n$N = \\prod_{k=1}^{d} n_k$. Let bijections $\\nu(t) = (\\nu_1(t), \\dots, \\nu_d(t))$ and $\\mu(\\ell) = (\\mu_1(\\ell), \\dots, \\mu_d(\\ell))$ map\nrow and column indices $t$ and $\\ell$ of the matrix $\\boldsymbol{W}$ to the $d$-dimensional vector-indices whose $k$-th\ndimensions are of length $m_k$ and $n_k$ respectively, $k = 1, \\dots, d$. From the matrix $\\boldsymbol{W}$ we can form\na $d$-dimensional tensor $\\boldsymbol{\\mathcal{W}}$ whose $k$-th dimension is of length $m_k n_k$ and is indexed by the tuple\n$(\\nu_k(t), \\mu_k(\\ell))$. The tensor $\\boldsymbol{\\mathcal{W}}$ can then be converted into the TT-format:\n\n$$W(t, \\ell) = \\mathcal{W}((\\nu_1(t), \\mu_1(\\ell)), \\dots, (\\nu_d(t), \\mu_d(\\ell))) = \\boldsymbol{G}_1[\\nu_1(t), \\mu_1(\\ell)] \\cdots \\boldsymbol{G}_d[\\nu_d(t), \\mu_d(\\ell)], \\tag{3}$$\n\nwhere the matrices $\\boldsymbol{G}_k[\\nu_k(t), \\mu_k(\\ell)]$, $k = 1, \\dots, d$, serve as the cores with the tuple $(\\nu_k(t), \\mu_k(\\ell))$\nbeing an index. Note that a matrix in the TT-format is not restricted to be square. Although the index-vectors\n$\\nu(t)$ and $\\mu(\\ell)$ are of the same length $d$, the sizes of the domains of the dimensions can vary.\nWe call a matrix in the TT-format a TT-matrix.\nAll operations available for the TT-tensors are applicable to the TT-vectors and the TT-matrices as\nwell (for example one can efficiently sum two TT-matrices and get the result in the TT-format).\nAdditionally, the TT-format allows one to efficiently perform the matrix-by-vector (matrix-by-matrix)\nproduct. If only one of the operands is in the TT-format, the result would be an explicit vector (matrix); if\nboth operands are in the TT-format, the operation would be even more efficient and the result would\nbe given in the TT-format as well (generally with the increased ranks). 
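
As an illustration of Eq. (3) and of the TT-matrix-by-explicit-vector product, the following is a minimal pure-Python sketch. It is not the MATLAB implementation used in this paper. It assumes 0-based indices, a column-major index mapping (first dimension changing fastest) and cores stored as dictionaries of small matrices. The names unravel_col_major, row_times_matrix, tt_matrix_element and tt_matvec, as well as the toy cores, are hypothetical and chosen only for this example.

```python
from math import prod


def unravel_col_major(index, shape):
    # 0-based flat index -> multi-index, first dimension changing fastest
    # (the convention of a column-major reshape).
    multi = []
    for size in shape:
        multi.append(index % size)
        index //= size
    return tuple(multi)


def row_times_matrix(vec, mat):
    # (1 x p row vector) times (p x q matrix stored as a list of rows).
    return [sum(vec[a] * mat[a][b] for a in range(len(mat)))
            for b in range(len(mat[0]))]


def tt_matrix_element(cores, row_shape, col_shape, t, l):
    # Eq. (3): W(t, l) = G_1[i_1, j_1] ... G_d[i_d, j_d].
    # cores[k][(i, j)] is an r_{k-1} x r_k matrix and r_0 = r_d = 1.
    i = unravel_col_major(t, row_shape)
    j = unravel_col_major(l, col_shape)
    vec = [1.0]
    for k, core in enumerate(cores):
        vec = row_times_matrix(vec, core[(i[k], j[k])])
    return vec[0]


def tt_matvec(cores, row_shape, col_shape, x):
    # y = W x without forming W: sum out one column index j_k at a time.
    # After step k the keys are (i_1, ..., i_k, j_{k+1}, ..., j_d) and each
    # value is a row vector of length r_k.
    state = {unravel_col_major(l, col_shape): [value]
             for l, value in enumerate(x)}
    for k, core in enumerate(cores):
        new_state = {}
        for key, vec in state.items():
            prefix, j_k, suffix = key[:k], key[k], key[k + 1:]
            for i_k in range(row_shape[k]):
                term = row_times_matrix(vec, core[(i_k, j_k)])
                acc = new_state.setdefault(prefix + (i_k,) + suffix,
                                           [0.0] * len(term))
                for b, value in enumerate(term):
                    acc[b] += value
        state = new_state
    return [state[unravel_col_major(t, row_shape)][0]
            for t in range(prod(row_shape))]


# Toy 4 x 4 TT-matrix (illustrative values only): d = 2, m = n = (2, 2),
# TT-ranks (1, 2, 1).
shape = (2, 2)
cores = [
    {(i, j): [[i + 1.0, j + 1.0]] for i in range(2) for j in range(2)},
    {(i, j): [[1.0], [float(i * j)]] for i in range(2) for j in range(2)},
]
x = [1.0, 2.0, 3.0, 4.0]
y = tt_matvec(cores, shape, shape, x)
print(y)
```

The dictionary-based contraction is written for readability rather than speed; a practical implementation would express each step as a reshape followed by a dense matrix product.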
For the case of the TT-matrix-by-explicit-vector\nproduct $\\boldsymbol{c} = \\boldsymbol{W} \\boldsymbol{b}$, the computational complexity is $O(d\\, r^2 m \\max\\{M, N\\})$,\nwhere $d$ is the number of the cores of the TT-matrix $\\boldsymbol{W}$, $m = \\max_{k=1,\\dots,d} m_k$, $r$ is the maximal\nrank and $N = \\prod_{k=1}^{d} n_k$ is the length of the vector $\\boldsymbol{b}$.\nThe ranks and, correspondingly, the efficiency of the TT-format for a vector (matrix) depend on the\nchoice of the mapping $\\mu(\\ell)$ (mappings $\\nu(t)$ and $\\mu(\\ell)$) between vector (matrix) elements and the underlying\ntensor elements. In what follows we use a column-major MATLAB reshape command$^1$\nto form a $d$-dimensional tensor from the data (e. g. from a multichannel image), but one can choose\na different mapping.\n\n4 TT-layer\n\nIn this section we introduce the TT-layer of a neural network. In short, the TT-layer is a fully-connected\nlayer with the weight matrix stored in the TT-format. We will refer to a neural network\nwith one or more TT-layers as TensorNet.\nFully-connected layers apply a linear transformation to an $N$-dimensional input vector $\\boldsymbol{x}$:\n\n$$\\boldsymbol{y} = \\boldsymbol{W} \\boldsymbol{x} + \\boldsymbol{b}, \\tag{4}$$\n\nwhere the weight matrix $\\boldsymbol{W} \\in \\mathbb{R}^{M \\times N}$ and the bias vector $\\boldsymbol{b} \\in \\mathbb{R}^M$ define the transformation.\nA TT-layer consists in storing the weights $\\boldsymbol{W}$ of the fully-connected layer in the TT-format, which allows\nusing hundreds of thousands (or even millions) of hidden units while having a moderate number of\nparameters. To control the number of parameters one can vary the number of hidden units as well\nas the TT-ranks of the weight matrix.\nA TT-layer transforms a $d$-dimensional tensor $\\boldsymbol{\\mathcal{X}}$ (formed from the corresponding vector $\\boldsymbol{x}$) to the $d$-dimensional\ntensor $\\boldsymbol{\\mathcal{Y}}$ (which corresponds to the output vector $\\boldsymbol{y}$). We assume that the weight matrix\n$\\boldsymbol{W}$ is represented in the TT-format with the cores $\\boldsymbol{G}_k[i_k, j_k]$. The linear transformation (4) of a\nfully-connected layer can be expressed in the tensor form:\n\n$$\\mathcal{Y}(i_1, \\dots, i_d) = \\sum_{j_1, \\dots, j_d} \\boldsymbol{G}_1[i_1, j_1] \\cdots \\boldsymbol{G}_d[i_d, j_d]\\, \\mathcal{X}(j_1, \\dots, j_d) + \\mathcal{B}(i_1, \\dots, i_d). \\tag{5}$$\n\nDirect application of the TT-matrix-by-vector operation for the Eq. (5) yields the computational\ncomplexity of the forward pass $O(d\\, r^2 m \\max\\{m, n\\}^d) = O(d\\, r^2 m \\max\\{M, N\\})$.\n\n5 Learning\n\nNeural networks are usually trained with the stochastic gradient descent algorithm where the gradient\nis computed using the back-propagation procedure [18]. Back-propagation allows one to compute\nthe gradient of a loss-function $L$ with respect to all the parameters of the network. The method starts\nwith the computation of the gradient of $L$ w.r.t. the output of the last layer and proceeds sequentially\nthrough the layers in the reversed order while computing the gradient w.r.t. the parameters and the\ninput of the layer making use of the gradients computed earlier. Applied to the fully-connected layers (4)\nthe back-propagation method computes the gradients w.r.t. the input $\\boldsymbol{x}$ and the parameters\n$\\boldsymbol{W}$ and $\\boldsymbol{b}$ given the gradients $\\frac{\\partial L}{\\partial \\boldsymbol{y}}$ w.r.t. the output $\\boldsymbol{y}$:\n\n$$\\frac{\\partial L}{\\partial \\boldsymbol{x}} = \\boldsymbol{W}^{\\top} \\frac{\\partial L}{\\partial \\boldsymbol{y}}, \\quad \\frac{\\partial L}{\\partial \\boldsymbol{W}} = \\frac{\\partial L}{\\partial \\boldsymbol{y}} \\boldsymbol{x}^{\\top}, \\quad \\frac{\\partial L}{\\partial \\boldsymbol{b}} = \\frac{\\partial L}{\\partial \\boldsymbol{y}}. \\tag{6}$$\n\nIn what follows we derive the gradients required to use the back-propagation algorithm with the TT-layer.\nTo compute the gradient of the loss function w.r.t. the bias vector $\\boldsymbol{b}$ and w.r.t. the input vector\n$\\boldsymbol{x}$ one can use equations (6). 
The latter can be applied using the matrix-by-vector product (where the\nmatrix is in the TT-format) with the complexity of $O(d\\, r^2 n \\max\\{m, n\\}^d) = O(d\\, r^2 n \\max\\{M, N\\})$.\n\n$^1$ http://www.mathworks.com/help/matlab/ref/reshape.html\n\nOperation | Time | Memory\nFC forward pass | $O(MN)$ | $O(MN)$\nTT forward pass | $O(d\\, r^2 m \\max\\{M, N\\})$ | $O(r \\max\\{M, N\\})$\nFC backward pass | $O(MN)$ | $O(MN)$\nTT backward pass | $O(d^2 r^4 m \\max\\{M, N\\})$ | $O(r^3 \\max\\{M, N\\})$\n\nTable 1: Comparison of the asymptotic complexity and memory usage of an $M \\times N$ TT-layer and\nan $M \\times N$ fully-connected layer (FC). The input and output tensor shapes are $m_1 \\times \\dots \\times m_d$ and\n$n_1 \\times \\dots \\times n_d$ respectively ($m = \\max_{k=1,\\dots,d} m_k$) and $r$ is the maximal TT-rank.\n\nTo perform a step of stochastic gradient descent one can use equation (6) to compute the gradient\nof the loss function w.r.t. the weight matrix $\\boldsymbol{W}$, convert the gradient matrix into the TT-format\n(with the TT-SVD algorithm [17]) and then add this gradient (multiplied by a step size) to the\ncurrent estimate of the weight matrix: $\\boldsymbol{W}_{k+1} = \\boldsymbol{W}_k + \\gamma_k \\frac{\\partial L}{\\partial \\boldsymbol{W}}$. However, the direct computation of\n$\\frac{\\partial L}{\\partial \\boldsymbol{W}}$ requires $O(MN)$ memory. A better way to learn the TensorNet parameters is to compute the\ngradient of the loss function directly w.r.t. the cores of the TT-representation of $\\boldsymbol{W}$.\nIn what follows we use shortened notation for prefix and postfix sequences of indices: $\\boldsymbol{i}_k^- := (i_1, \\dots, i_{k-1})$,\n$\\boldsymbol{i}_k^+ := (i_{k+1}, \\dots, i_d)$, $\\boldsymbol{i} = (\\boldsymbol{i}_k^-, i_k, \\boldsymbol{i}_k^+)$. We also introduce notations for partial\ncore products:\n\n$$\\boldsymbol{P}_k^-[\\boldsymbol{i}_k^-, \\boldsymbol{j}_k^-] := \\boldsymbol{G}_1[i_1, j_1] \\cdots \\boldsymbol{G}_{k-1}[i_{k-1}, j_{k-1}], \\qquad \\boldsymbol{P}_k^+[\\boldsymbol{i}_k^+, \\boldsymbol{j}_k^+] := \\boldsymbol{G}_{k+1}[i_{k+1}, j_{k+1}] \\cdots \\boldsymbol{G}_d[i_d, j_d]. \\tag{7}$$\n\nWe now rewrite the definition of the TT-layer transformation (5) for any $k = 2, \\dots, d - 1$:\n\n$$\\mathcal{Y}(\\boldsymbol{i}) = \\mathcal{Y}(\\boldsymbol{i}_k^-, i_k, \\boldsymbol{i}_k^+) = \\sum_{\\boldsymbol{j}_k^-, j_k, \\boldsymbol{j}_k^+} \\boldsymbol{P}_k^-[\\boldsymbol{i}_k^-, \\boldsymbol{j}_k^-]\\, \\boldsymbol{G}_k[i_k, j_k]\\, \\boldsymbol{P}_k^+[\\boldsymbol{i}_k^+, \\boldsymbol{j}_k^+]\\, \\mathcal{X}(\\boldsymbol{j}_k^-, j_k, \\boldsymbol{j}_k^+) + \\mathcal{B}(\\boldsymbol{i}). \\tag{8}$$\n\nThe gradient of the loss function $L$ w.r.t. the $k$-th core in the position $[\\tilde{i}_k, \\tilde{j}_k]$ can be computed\nusing the chain rule:\n\n$$\\underbrace{\\frac{\\partial L}{\\partial \\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]}}_{r_{k-1} \\times r_k} = \\sum_{\\boldsymbol{i}} \\frac{\\partial L}{\\partial \\mathcal{Y}(\\boldsymbol{i})} \\frac{\\partial \\mathcal{Y}(\\boldsymbol{i})}{\\partial \\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]}. \\tag{9}$$\n\nGiven the gradient matrices $\\frac{\\partial \\mathcal{Y}(\\boldsymbol{i})}{\\partial \\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]}$ the summation (9) can be done explicitly in $O(M r_{k-1} r_k)$\ntime, where $M$ is the length of the output vector $\\boldsymbol{y}$.\nWe now show how to compute the matrix $\\frac{\\partial \\mathcal{Y}(\\boldsymbol{i})}{\\partial \\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]}$ for any values of the core index $k \\in \\{1, \\dots, d\\}$\nand $\\tilde{i}_k \\in \\{1, \\dots, m_k\\}$, $\\tilde{j}_k \\in \\{1, \\dots, n_k\\}$. For any $\\boldsymbol{i} = (i_1, \\dots, i_d)$ such that $i_k \\neq \\tilde{i}_k$ the value\nof $\\mathcal{Y}(\\boldsymbol{i})$ does not depend on the elements of $\\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]$, making the corresponding gradient $\\frac{\\partial \\mathcal{Y}(\\boldsymbol{i})}{\\partial \\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]}$\nequal zero. Similarly, any summand in the Eq. (8) such that $j_k \\neq \\tilde{j}_k$ does not affect the gradient\n$\\frac{\\partial \\mathcal{Y}(\\boldsymbol{i})}{\\partial \\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]}$. These observations allow us to consider only $i_k = \\tilde{i}_k$ and $j_k = \\tilde{j}_k$.\n$\\mathcal{Y}(\\boldsymbol{i}_k^-, \\tilde{i}_k, \\boldsymbol{i}_k^+)$ is a linear function of the core $\\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]$ and its gradient equals the following expression:\n\n$$\\frac{\\partial \\mathcal{Y}(\\boldsymbol{i}_k^-, \\tilde{i}_k, \\boldsymbol{i}_k^+)}{\\partial \\boldsymbol{G}_k[\\tilde{i}_k, \\tilde{j}_k]} = \\sum_{\\boldsymbol{j}_k^-, \\boldsymbol{j}_k^+} \\underbrace{\\left(\\boldsymbol{P}_k^-[\\boldsymbol{i}_k^-, \\boldsymbol{j}_k^-]\\right)^{\\top}}_{r_{k-1} \\times 1} \\underbrace{\\left(\\boldsymbol{P}_k^+[\\boldsymbol{i}_k^+, \\boldsymbol{j}_k^+]\\right)^{\\top}}_{1 \\times r_k} \\mathcal{X}(\\boldsymbol{j}_k^-, \\tilde{j}_k, \\boldsymbol{j}_k^+). \\tag{10}$$\n\nWe denote the partial sum vector as $\\boldsymbol{R}_k[\\boldsymbol{j}_k^-, \\tilde{j}_k, \\boldsymbol{i}_k^+] \\in \\mathbb{R}^{r_k}$:\n\n$$\\boldsymbol{R}_k[j_1, \\dots, j_{k-1}, \\tilde{j}_k, i_{k+1}, \\dots, i_d] = \\boldsymbol{R}_k[\\boldsymbol{j}_k^-, \\tilde{j}_k, \\boldsymbol{i}_k^+] = \\sum_{\\boldsymbol{j}_k^+} \\boldsymbol{P}_k^+[\\boldsymbol{i}_k^+, \\boldsymbol{j}_k^+]\\, \\mathcal{X}(\\boldsymbol{j}_k^-, \\tilde{j}_k, \\boldsymbol{j}_k^+).$$\n\n[Figure 1 plot: test error in % (logarithmic axis) versus the number of parameters in the weight matrix of the first layer (logarithmic axis). Legend: TT-layer with tensor shapes 32 \u00d7 32, 4 \u00d7 8 \u00d7 8 \u00d7 4, 4 \u00d7 4 \u00d7 4 \u00d7 4 \u00d7 4, 2 \u00d7 2 \u00d7 8 \u00d7 8 \u00d7 2 \u00d7 2 and 2^10; matrix rank; uncompressed.]\n\nFigure 1: The experiment on the MNIST dataset. We use a two-layered neural network and substitute\nthe first 1024 \u00d7 1024 fully-connected layer with the TT-layer (solid lines) and with the matrix rank\ndecomposition based layer (dashed line). The solid lines of different colors correspond to different\nways of reshaping the input and output vectors to tensors (the shapes are reported in the legend). To\nobtain the points of the plots we vary the maximal TT-rank or the matrix rank.\n\nVectors $\\boldsymbol{R}_k[\\boldsymbol{j}_k^-, \\tilde{j}_k, \\boldsymbol{i}_k^+]$ for all the possible values of $k$, $\\boldsymbol{j}_k^-$, $\\tilde{j}_k$ and $\\boldsymbol{i}_k^+$ can be computed via dynamic\nprogramming (by pushing sums w.r.t. each $j_{k+1}, \\dots, j_d$ inside the equation and summing\nout one index at a time) in $O(d\\, r^2 m \\max\\{M, N\\})$. Substituting these vectors into (10) and using\n(again) dynamic programming yields all the necessary matrices for summation (9). 
The overall\ncomputational complexity of the backward pass is O(d2 r4 m max{M, N}).\nThe presented algorithm reduces to a sequence of matrix-by-matrix products and permutations of\ndimensions and thus can be accelerated on a GPU device.\n\n6 Experiments\n6.1 Parameters of the TT-layer\nIn this experiment we investigate the properties of the TT-layer and compare different strategies for\nsetting its parameters: dimensions of the tensors representing the input/output of the layer and the\nTT-ranks of the compressed weight matrix. We run the experiment on the MNIST dataset [15] for\nthe task of handwritten-digit recognition. As a baseline we use a neural network with two fully-\nconnected layers (1024 hidden units) and recti\ufb01ed linear unit (ReLU) achieving 1.9% error on the\ntest set. For more reshaping options we resize the original 28 \u00d7 28 images to 32 \u00d7 32.\nWe train several networks differing in the parameters of the single TT-layer. The networks contain\nthe following layers: the TT-layer with weight matrix of size 1024\u00d71024, ReLU, the fully-connected\nlayer with the weight matrix of size 1024 \u00d7 10. We test different ways of reshaping the input/output\ntensors and try different ranks of the TT-layer. As a simple compression baseline in the place of\nthe TT-layer we use the fully-connected layer such that the rank of the weight matrix is bounded\n(implemented as follows: the two consecutive fully-connected layers with weight matrices of sizes\n1024 \u00d7 r and r\u00d71024, where r controls the matrix rank and the compression factor). The results of\nthe experiment are shown in Figure 1. We conclude that the TT-ranks provide much better \ufb02exibility\nthan the matrix rank when applied at the same compression level. In addition, we observe that the\nTT-layers with too small number of values for each tensor dimension and with too few dimensions\nperform worse than their more balanced counterparts.\n\nComparison with HashedNet [4]. 
We consider a two-layered neural network with 1024 hidden\nunits and replace both fully-connected layers by the TT-layers. By setting all the TT-ranks in the\nnetwork to 8 we achieved the test error of 1.6% with 12 602 parameters in total and by setting all\nthe TT-ranks to 6 the test error of 1.9% with 7 698 parameters. Chen et al. [4] report results on the\nsame architecture. By tying random subsets of weights they compressed the network by the factor\nof 64 to the 12 720 parameters in total with the test error equal 2.79%.\n\n6.2 CIFAR-10\nCIFAR-10 dataset [12] consists of 32 \u00d7 32 3-channel images assigned to 10 different classes: airplane,\nautomobile, bird, cat, deer, dog, frog, horse, ship, truck. The dataset contains 50000 train and\n10000 test images. Following [10] we preprocess the images by subtracting the mean and performing\nglobal contrast normalization and ZCA whitening.\nAs a baseline we use the CIFAR-10 Quick [22] CNN, which consists of convolutional, pooling and\nnon-linearity layers followed by two fully-connected layers of sizes 1024 \u00d7 64 and 64 \u00d7 10. We fix\nthe convolutional part of the network and substitute the fully-connected part by a 1024 \u00d7 N TT-layer\nfollowed by ReLU and by an N \u00d7 10 fully-connected layer. With N = 3125 hidden units (contrary\nto 64 in the original network) we achieve the test error of 23.13% without fine-tuning which is\nslightly better than the test error of the baseline (23.25%). The TT-layer treated input and output\nvectors as 4 \u00d7 4 \u00d7 4 \u00d7 4 \u00d7 4 and 5 \u00d7 5 \u00d7 5 \u00d7 5 \u00d7 5 tensors respectively. All the TT-ranks equal\n8, making the number of the parameters in the TT-layer equal 4 160. The compression rate of the\nTensorNet compared with the baseline w.r.t. all the parameters is 1.24. In addition, substituting\nboth fully-connected layers by the TT-layers yields the test error of 24.39% and reduces the number\nof parameters of the fully-connected layer matrices by the factor of 11.9 and the total parameter\nnumber by the factor of 1.7.\nFor comparison, in [6] the fully-connected layers in a CIFAR-10 CNN were compressed by the\nfactor of at most 4.7 times with the loss of about 2% in accuracy.\n\nArchitecture | TT-layers compr. | vgg-16 compr. | vgg-19 compr. | vgg-16 top 1 | vgg-16 top 5 | vgg-19 top 1 | vgg-19 top 5\nFC FC FC | 1 | 1 | 1 | 30.9 | 11.2 | 29.0 | 10.1\nTT4 FC FC | 50 972 | 3.9 | 3.5 | 31.2 | 11.2 | 29.8 | 10.4\nTT2 FC FC | 194 622 | 3.9 | 3.5 | 31.5 | 11.5 | 30.4 | 10.9\nTT1 FC FC | 713 614 | 3.9 | 3.5 | 33.3 | 12.8 | 31.9 | 11.8\nTT4 TT4 FC | 37 732 | 7.4 | 6 | 32.2 | 12.3 | 31.6 | 11.7\nMR1 FC FC | 3 521 | 3.9 | 3.5 | 99.5 | 97.6 | 99.8 | 99\nMR5 FC FC | 704 | 3.9 | 3.5 | 81.7 | 53.9 | 79.1 | 52.4\nMR50 FC FC | 70 | 3.7 | 3.4 | 36.7 | 14.9 | 34.5 | 15.8\n\nTable 2: Substituting the fully-connected layers with the TT-layers in vgg-16 and vgg-19 networks\non the ImageNet dataset. FC stands for a fully-connected layer; TT\u25a1 stands for a TT-layer with\nall the TT-ranks equal \u201c\u25a1\u201d; MR\u25a1 stands for a fully-connected layer with the matrix rank restricted\nto \u201c\u25a1\u201d. We report the compression rate of the TT-layers matrices and of the whole network in the\nsecond, third and fourth columns. The top 1 and top 5 columns report the validation errors in %.\n\n6.2.1 Wide and shallow network\nWith a sufficient amount of hidden units, even a neural network with two fully-connected layers and\nsigmoid non-linearity can approximate any decision boundary [5]. Traditionally, very wide shallow\nnetworks are not considered because of high computational and memory demands and the over-fitting\nrisk. TensorNet can potentially address both issues. 
We use a three-layered TensorNet of\nthe following architecture: the TT-layer with the weight matrix of size 3 072 \u00d7 262 144, ReLU, the\nTT-layer with the weight matrix of size 262 144 \u00d7 4 096, ReLU, the fully-connected layer with the\nweight matrix of size 4 096 \u00d7 10. We report the test error of 31.47% which is (to the best of our\nknowledge) the best result achieved by a non-convolutional neural network.\n\n6.3 ImageNet\n\nIn this experiment we evaluate the TT-layers on a large scale task. We consider the 1000-class\nImageNet ILSVRC-2012 dataset [19], which consists of 1.2 million training images and 50 000\nvalidation images. We use the deep CNNs vgg-16 and vgg-19 [21] as the reference models$^2$. Both\nnetworks consist of two parts: the convolutional and the fully-connected parts. In both\nnetworks the second part consists of 3 fully-connected layers with weight matrices of sizes 25088 \u00d7\n4096, 4096 \u00d7 4096 and 4096 \u00d7 1000.\nIn each network we substitute the first fully-connected layer with the TT-layer. To do this we reshape\nthe 25088-dimensional input vectors to the tensors of the size 2 \u00d7 7 \u00d7 8 \u00d7 8 \u00d7 7 \u00d7 4 and the 4096-dimensional\noutput vectors to the tensors of the size 4 \u00d7 4 \u00d7 4 \u00d7 4 \u00d7 4 \u00d7 4. The remaining fully-connected\nlayers are initialized randomly. The parameters of the convolutional parts are kept fixed\nas trained by Simonyan and Zisserman [21]. We train the TT-layer and the fully-connected layers\non the training set. In Table 2 we vary the ranks of the TT-layer and report the compression factor of\nthe TT-layers (vs. the original fully-connected layer), the resulting compression factor of the whole\nnetwork, and the top 1 and top 5 errors on the validation set. In addition, we report a configuration in which the second\nfully-connected layer is also substituted with a TT-layer (TT4 TT4 FC in Table 2). 
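
The compression factors of a single TT-layer in Table 2 follow from counting the entries of the cores, that is, the sum over k of m_k n_k r_{k-1} r_k. The short sketch below illustrates this count for the tensor shapes given above. It assumes that all inner TT-ranks are equal and that the bias vector is not counted; the function names tt_matrix_num_params and uniform_ranks are hypothetical and introduced only for this example.

```python
def tt_matrix_num_params(row_shape, col_shape, ranks):
    # Number of stored values: sum over k of m_k * n_k * r_{k-1} * r_k,
    # where ranks = (r_0, ..., r_d) and r_0 = r_d = 1.
    assert len(row_shape) == len(col_shape) == len(ranks) - 1
    return sum(m * n * ranks[k] * ranks[k + 1]
               for k, (m, n) in enumerate(zip(row_shape, col_shape)))


def uniform_ranks(d, r):
    # All inner TT-ranks equal r; the boundary ranks are fixed to 1.
    return [1] + [r] * (d - 1) + [1]


# Shapes from Sec. 6.3: 25088 = 2*7*8*8*7*4 inputs, 4096 = 4**6 outputs.
in_shape = (2, 7, 8, 8, 7, 4)
out_shape = (4, 4, 4, 4, 4, 4)
dense_params = 25088 * 4096

for r in (4, 2, 1):
    n_params = tt_matrix_num_params(out_shape, in_shape, uniform_ranks(6, r))
    print(r, n_params, round(dense_params / n_params))
# prints:
# 4 2016 50972
# 2 528 194622
# 1 144 713614
```

The same count gives the 4 160 parameters quoted in Sec. 6.2 for the CIFAR-10 TT-layer with 4 × 4 × 4 × 4 × 4 inputs, 5 × 5 × 5 × 5 × 5 outputs and all TT-ranks equal to 8.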
As a baseline compression method we constrain the matrix\nrank of the weight matrix of the first fully-connected layer using the approach of [2].\n\n$^2$ After we had started to experiment on the vgg-16 network the vgg-* networks have been improved by\nthe authors. Thus, we report the results on a slightly outdated version of vgg-16 and the up-to-date version of\nvgg-19.\n\nType | 1 im. time (ms) | 100 im. time (ms)\nCPU fully-connected layer | 16.1 | 97.2\nCPU TT-layer | 1.2 | 94.7\nGPU fully-connected layer | 2.7 | 33\nGPU TT-layer | 1.9 | 12.9\n\nTable 3: Inference time for a 25088 \u00d7 4096 fully-connected layer and its corresponding TT-layer\nwith all the TT-ranks equal 4. The memory usage for feeding forward one image is 392MB for the\nfully-connected layer and 0.766MB for the TT-layer.\n\nIn Table 2 we observe that the TT-layer in the best case manages to reduce the number of the\nparameters in the matrix W of the largest fully-connected layer by a factor of 194 622 (from 25088 \u00d7\n4096 parameters to 528) while increasing the top 5 error from 11.2 to 11.5. The compression\nfactor of the whole network remains at the level of 3.9 because the TT-layer stops being the storage\nbottleneck. By compressing the largest of the remaining layers the compression factor goes up\nto 7.4. The baseline method when providing similar compression rates significantly increases the\nerror.\nFor comparison, consider the results of [26] obtained for the compression of the fully-connected layers\nof the Krizhevsky-type network [13] with the Fastfood method. The model achieves compression\nfactors of 2-3 without decreasing the network error.\n\n6.4 Implementation details\nIn all experiments we use our MATLAB extension$^3$ of the MatConvNet framework$^4$ [24]. For the\noperations related to the TT-format we use the TT-Toolbox$^5$ implemented in MATLAB as well. 
The\nexperiments were performed on a computer with a quad-core Intel Core i5-4460 CPU, 16 GB RAM\nand a single NVIDIA GeForce GTX 980 GPU. We report the running times and the memory usage of\nthe forward pass of the TT-layer and the baseline fully-connected layer in Table 3.\nWe train all the networks with stochastic gradient descent with momentum (coefficient 0.9). We\ninitialize all the parameters of the TT- and fully-connected layers with Gaussian noise and put\nL2-regularization (weight 0.0005) on them.\n\n7 Discussion and future work\n\nRecent studies indicate high redundancy in the current neural network parametrization. To exploit\nthis redundancy we propose to use the TT-decomposition framework on the weight matrix of a\nfully-connected layer and to use the cores of the decomposition as the parameters of the layer. This\nallows us to train fully-connected layers compressed by up to 200 000\u00d7 compared with the\nexplicit parametrization without a significant error increase. Our experiments show that it is possible\nto capture complex dependencies within the data by using much more compact representations. On\nthe other hand, it becomes possible to use much wider layers than was feasible before, and the\npreliminary experiments on the CIFAR-10 dataset show that wide and shallow TensorNets achieve\npromising results (setting a new state of the art for non-convolutional neural networks).\nAnother appealing property of the TT-layer is faster inference time (compared with the corresponding\nfully-connected layer). All in all, a wide and shallow TensorNet can become a time- and memory-efficient\nmodel to use in real-time applications and on mobile devices.\nThe main limiting factor for the size of an M \u00d7 N fully-connected layer is its number of parameters MN.\nThe limiting factor for an M \u00d7 N TT-layer is the maximal linear size max{M, N}. 
As future work\nwe plan to consider the inputs and outputs of layers in the TT-format, thus completely eliminating\nthe dependency on M and N and allowing billions of hidden units in a TT-layer.\n\nAcknowledgements. We would like to thank Ivan Oseledets for valuable discussions. A. Novikov,\nD. Podoprikhin, and D. Vetrov were supported by RFBR project No. 15-31-20596 (mol-a-ved) and by\nMicrosoft: Moscow State University Joint Research Center (RPD 1053945). A. Osokin was supported\nby the MSR-INRIA Joint Center. The results of the tensor toolbox application (in Sec. 6) are\nsupported by Russian Science Foundation No. 14-11-00659.\n\nFootnote 3: https://github.com/Bihaqo/TensorNet\nFootnote 4: http://www.vlfeat.org/matconvnet/\nFootnote 5: https://github.com/oseledets/TT-Toolbox\n\nReferences\n[1] K. Asanovi\u0107 and N. Morgan, \u201cExperimental determination of precision requirements for back-propagation training of artificial neural networks,\u201d International Computer Science Institute, Tech. Rep., 1991.\n[2] J. Ba and R. Caruana, \u201cDo deep nets really need to be deep?\u201d in Advances in Neural Information Processing Systems 27 (NIPS), 2014, pp. 2654\u20132662.\n[3] J. D. Carroll and J. J. Chang, \u201cAnalysis of individual differences in multidimensional scaling via n-way generalization of Eckart-Young decomposition,\u201d Psychometrika, vol. 35, pp. 283\u2013319, 1970.\n[4] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, \u201cCompressing neural networks with the hashing trick,\u201d in International Conference on Machine Learning (ICML), 2015, pp. 2285\u20132294.\n[5] G. Cybenko, \u201cApproximation by superpositions of a sigmoidal function,\u201d Mathematics of control, signals and systems, pp. 303\u2013314, 1989.\n[6] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, \u201cPredicting parameters in deep learning,\u201d in Advances in Neural Information Processing Systems 26 (NIPS), 2013, pp. 2148\u20132156.\n[7] E. 
Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, \u201cExploiting linear structure within convolutional networks for efficient evaluation,\u201d in Advances in Neural Information Processing Systems 27 (NIPS), 2014, pp. 1269\u20131277.\n[8] E. Gilboa, Y. Saat\u00e7i, and J. P. Cunningham, \u201cScaling multidimensional inference for structured gaussian processes,\u201d arXiv preprint, no. 1209.4120, 2012.\n[9] Y. Gong, L. Liu, M. Yang, and L. Bourdev, \u201cCompressing deep convolutional networks using vector quantization,\u201d arXiv preprint, no. 1412.6115, 2014.\n[10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, \u201cMaxout networks,\u201d in International Conference on Machine Learning (ICML), 2013, pp. 1319\u20131327.\n[11] W. Hackbusch and S. K\u00fchn, \u201cA new scheme for the tensor representation,\u201d J. Fourier Anal. Appl., vol. 15, pp. 706\u2013722, 2009.\n[12] A. Krizhevsky, \u201cLearning multiple layers of features from tiny images,\u201d Master\u2019s thesis, Computer Science Department, University of Toronto, 2009.\n[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classification with deep convolutional neural networks,\u201d in Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1097\u20131105.\n[14] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, \u201cSpeeding-up convolutional neural networks using fine-tuned CP-decomposition,\u201d in International Conference on Learning Representations (ICLR), 2014.\n[15] Y. LeCun, C. Cortes, and C. J. C. Burges, \u201cThe MNIST database of handwritten digits,\u201d 1998.\n[16] A. Novikov, A. Rodomanov, A. Osokin, and D. Vetrov, \u201cPutting MRFs on a Tensor Train,\u201d in International Conference on Machine Learning (ICML), 2014, pp. 811\u2013819.\n[17] I. V. Oseledets, \u201cTensor-Train decomposition,\u201d SIAM J. Scientific Computing, vol. 33, no. 
5, pp. 2295\u20132317, 2011.\n[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, \u201cLearning representations by back-propagating errors,\u201d Nature, vol. 323, no. 6088, pp. 533\u2013536, 1986.\n[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, \u201cImagenet large scale visual recognition challenge,\u201d International Journal of Computer Vision (IJCV), 2015.\n[20] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, \u201cLow-rank matrix factorization for deep neural network training with high-dimensional output targets,\u201d in International Conference of Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 6655\u20136659.\n[21] K. Simonyan and A. Zisserman, \u201cVery deep convolutional networks for large-scale image recognition,\u201d in International Conference on Learning Representations (ICLR), 2015.\n[22] J. Snoek, H. Larochelle, and R. P. Adams, \u201cPractical bayesian optimization of machine learning algorithms,\u201d in Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 2951\u20132959.\n[23] L. R. Tucker, \u201cSome mathematical notes on three-mode factor analysis,\u201d Psychometrika, vol. 31, no. 3, pp. 279\u2013311, 1966.\n[24] A. Vedaldi and K. Lenc, \u201cMatconvnet \u2013 convolutional neural networks for MATLAB,\u201d in Proceeding of the ACM Int. Conf. on Multimedia.\n[25] J. Xue, J. Li, and Y. Gong, \u201cRestructuring of deep neural network acoustic models with singular value decomposition,\u201d in Interspeech, 2013, pp. 2365\u20132369.\n[26] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, \u201cDeep fried convnets,\u201d arXiv preprint, no. 1412.7149, 2014.\n[27] Z. Zhang, X. Yang, I. V. Oseledets, G. E. Karniadakis, and L. 
Daniel, \u201cEnabling high-dimensional hierarchical uncertainty quantification by ANOVA and tensor-train decomposition,\u201d Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, pp. 63\u201376, 2014.\n", "award": [], "sourceid": 336, "authors": [{"given_name": "Alexander", "family_name": "Novikov", "institution": "Skolkovo Institute of Science and Technology"}, {"given_name": "Dmitrii", "family_name": "Podoprikhin", "institution": "Skolkovo Institute of Science and Technology"}, {"given_name": "Anton", "family_name": "Osokin", "institution": "Inria"}, {"given_name": "Dmitry", "family_name": "Vetrov", "institution": "Skoltech, Moscow"}]}