{"title": "Compression-aware Training of Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 856, "page_last": 867, "abstract": "In recent years, great progress has been made in a variety of application domains thanks to the development of increasingly deeper neural networks. Unfortunately, the huge number of units of these networks makes them expensive both computationally and memory-wise. To overcome this, exploiting the fact that deep networks are over-parametrized, several compression strategies have been proposed. These methods, however, typically start from a network that has been trained in a standard manner, without considering such a future compression. In this paper, we propose to explicitly account for compression in the training process. To this end, we introduce a regularizer that encourages the parameter matrix of each layer to have low rank during training. We show that accounting for compression during training allows us to learn much more compact, yet at least as effective, models than state-of-the-art compression techniques.", "full_text": "Compression-aware Training of Deep Networks\n\nJose M. Alvarez\n\nToyota Research Institute\n\nLos Altos, CA 94022\n\njose.alvarez@tri.global\n\nMathieu Salzmann\n\nEPFL - CVLab\n\nLausanne, Switzerland\n\nmathieu.salzmann@epfl.ch\n\nAbstract\n\nIn recent years, great progress has been made in a variety of application domains\nthanks to the development of increasingly deeper neural networks. Unfortunately,\nthe huge number of units of these networks makes them expensive both computa-\ntionally and memory-wise. To overcome this, exploiting the fact that deep networks\nare over-parametrized, several compression strategies have been proposed. These\nmethods, however, typically start from a network that has been trained in a stan-\ndard manner, without considering such a future compression. 
In this paper, we propose to explicitly account for compression in the training process. To this end, we introduce a regularizer that encourages the parameter matrix of each layer to have low rank during training. We show that accounting for compression during training allows us to learn much more compact, yet at least as effective, models than state-of-the-art compression techniques.\n\n1 Introduction\n\nWith the increasing availability of large-scale datasets, recent years have witnessed a resurgence of interest for Deep Learning techniques. Impressive progress has been made in a variety of application domains, such as speech, natural language and image processing, thanks to the development of new learning strategies [15, 53, 30, 45, 26, 3] and of new architectures [31, 44, 46, 23]. In particular, these architectures tend to become ever deeper, with hundreds of layers, each containing hundreds or even thousands of units.\nWhile it has been shown that training such very deep architectures is typically easier than training smaller ones [24], it is also well-known that they are highly over-parameterized. In essence, this means that equally good results could in principle be obtained with more compact networks. Automatically deriving such equivalent, compact models would be highly beneficial in runtime- and memory-sensitive applications, e.g., to deploy deep networks on embedded systems with limited hardware resources. As a consequence, many methods have been proposed to compress existing architectures.\nAn early trend for such compression consisted of removing individual parameters [33, 22] or entire units [36, 29, 38] according to their influence on the output.
Unfortunately, such an analysis of individual parameters or units quickly becomes intractable in the presence of very deep networks. Therefore, currently, one of the most popular compression approaches amounts to extracting low-rank approximations either of individual units [28] or of the parameter matrix/tensor of each layer [14]. This latter idea is particularly attractive, since, as opposed to the former one, it reduces the number of units in each layer. In essence, the above-mentioned techniques aim to compress a network that has been pre-trained. There is, however, no guarantee that the parameter matrices of such pre-trained networks truly have low rank. Therefore, these methods typically truncate some of the relevant information, thus resulting in a loss of prediction accuracy, and, more importantly, do not necessarily achieve the best possible compression rates.\nIn this paper, we propose to explicitly account for compression while training the initial deep network. Specifically, we introduce a regularizer that encourages the parameter matrix of each layer to have low rank in the training loss, and rely on a stochastic proximal gradient descent strategy to optimize the network parameters. In essence, and by contrast with methods that aim to learn uncorrelated units to prevent overfitting [5, 54, 40], we seek to learn correlated ones, which can then easily be pruned in a second phase. Our compression-aware training scheme therefore yields networks that are well adapted to the following post-processing stage. As a consequence, we achieve higher compression rates than the above-mentioned techniques at virtually no loss in prediction accuracy.\nOur approach constitutes one of the very few attempts at explicitly training a compact network from scratch.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
In this context, the work of [4] proposed to learn correlated units by making use of additional noise outputs. This strategy, however, is only guaranteed to have the desired effect for simple networks and has only been demonstrated on relatively shallow architectures. In the contemporary work [51], units are coordinated via a regularizer acting on all pairs of filters within a layer. While effective, exploiting all pairs can quickly become cumbersome in the presence of large numbers of units. Recently, group sparsity has also been employed to obtain compact networks [2, 50]. Such a regularizer, however, acts on individual units, without explicitly aiming to model their redundancies. Here, we show that accounting for interactions between the units within a layer allows us to obtain more compact networks. Furthermore, using such a group sparsity prior in conjunction with our compression-aware strategy lets us achieve even higher compression rates.\nWe demonstrate the benefits of our approach on several deep architectures, including the 8-layer DecomposeMe network of [1] and the 50-layer ResNet of [23]. Our experiments on ImageNet and ICDAR show that we can achieve compression rates of more than 90%, thus hugely reducing the number of required operations at inference time.\n\n2 Related Work\n\nIt is well-known that deep neural networks are over-parametrized [13]. While, given sufficient training data, this seems to facilitate the training procedure, it also has two potential drawbacks. First, over-parametrized networks can easily suffer from overfitting. Second, even when they can be trained successfully, the resulting networks are expensive both computationally and memory-wise, thus making their deployment on platforms with limited hardware resources, such as embedded systems, challenging.
Over the years, much effort has been made to overcome these two drawbacks.\nIn particular, much progress has been made to reduce over\ufb01tting, for example by devising new\noptimization strategies, such as DropOut [45] or MaxOut [16]. In this context, other works have\nadvocated the use of different normalization strategies, such as Batch Normalization [26], Weight\nNormalization [42] and Layer Normalization [3]. Recently, there has also been a surge of methods\naiming to regularize the network parameters by making the different units in each layer less correlated.\nThis has been achieved by designing new activation functions [5], by explicitly considering the\npairwise correlations of the units [54, 37, 40] or of the activations [9, 52], or by constraining the\nweight matrices of each layer to be orthonormal [21].\nIn this paper, we are more directly interested in addressing the second drawback, that is, the large\nmemory and runtime required by very deep networks. To tackle this, most existing research has\nfocused on pruning pre-trained networks. In this context, early works have proposed to analyze the\nsaliency of individual parameters [33, 22] or units [36, 29, 38, 34], so as to measure their impact on\nthe output. Such a local analysis, however, quickly becomes impractically expensive when dealing\nwith networks with millions of parameters.\nAs a consequence, recent works have proposed to focus on more global methods, which analyze\nlarger groups of parameters simultaneously. In this context, the most popular trend consists of\nextracting low-rank approximations of the network parameters. In particular, it has been shown that\nindividual units can be replaced by rank 1 approximations, either via a post-processing step [28, 46]\nor directly during training [1, 25]. Furthermore, low-rank approximations of the complete parameter\nmatrix/tensor of each layer were computed in [14], which has the bene\ufb01t of reducing the number of\nunits in each layer. 
The resulting low-rank representation can then be fine-tuned [32], or potentially even learned from scratch [47], given the rank of each layer in the network. With the exception of this last work, which assumes that the ranks are known, these methods, however, aim to approximate a given pre-trained model. In practice, however, the parameter matrices of this model might not have low rank. Therefore, the resulting approximations yield some loss of accuracy and, more importantly, will typically not correspond to the most compact networks. Here, we propose to explicitly learn a low-rank network from scratch, but without having to manually define the rank of each layer a priori.\nTo this end, and in contrast with the above-mentioned methods that aim to minimize correlations, we rather seek to maximize correlations between the different units within each layer, such that many of these units can be removed in a post-processing stage. In [4], additional noise outputs were introduced in a network to similarly learn correlated filters. This strategy, however, is only justified for simple networks and was only demonstrated on relatively shallow architectures. The contemporary work [51] introduced a penalty during training to learn correlated units. This, however, was achieved by explicitly computing all pairwise correlations, which quickly becomes cumbersome in very deep networks with wide layers. By contrast, our approach makes use of a low-rank regularizer that can effectively be optimized by proximal stochastic gradient descent.\nOur approach belongs to the relatively small group of methods that explicitly aim to learn a compact network during training, i.e., not as a post-processing step. Other methods have proposed to make use of sparsity-inducing techniques to cancel out individual parameters [49, 10, 20, 19, 35] or units [2, 50, 55].
These methods, however, act, at best, on individual units, without considering the relationships between multiple units in the same layer. Variational inference [17] has also been used to explicitly compress the network. However, the priors and posteriors used in these approaches will typically zero out individual weights. Our experiments demonstrate that accounting for the interactions between multiple units allows us to obtain more compact networks.\nAnother line of research aims to quantize the weights of deep networks [48, 12, 18]. Note that, in a sense, this research direction is orthogonal to ours, since one could still further quantize our compact networks. Furthermore, with the recent progress in efficient hardware handling floating-point operations, we believe that there is also high value in designing non-quantized compact networks.\n\n3 Compression-aware Training of Deep Networks\n\nIn this section, we introduce our approach to explicitly encouraging compactness while training a deep neural network. To this end, we propose to make use of a low-rank regularizer on the parameter matrix in each layer, which inherently aims to maximize the compression rate when computing a low-rank approximation in a post-processing stage. In the following, we focus on convolutional neural networks, because the popular visual recognition models tend to rely less and less on fully-connected layers, and, more importantly, the inference time of such models is dominated by the convolutions in the first few layers. Note, however, that our approach still applies to fully-connected layers.\nTo introduce our approach, let us first consider the l-th layer of a convolutional network, and denote its parameters by \u03b8l \u2208 R^{Kl \u00d7 Cl \u00d7 dH_l \u00d7 dW_l}, where Cl and Kl are the number of input and output channels, respectively, and dH_l and dW_l are the height and width of each convolutional kernel. Alternatively, these parameters can be represented by a matrix \u02c6\u03b8l \u2208 R^{Kl \u00d7 Sl} with Sl = Cl dH_l dW_l. Following [14], a network can be compacted via a post-processing step performing a singular value decomposition of \u02c6\u03b8l and truncating the 0, or small, singular values. In essence, after this step, the parameter matrix can be approximated as \u02c6\u03b8l \u2248 Ul Ml^T, where Ul is a Kl \u00d7 rl matrix representing the basis kernels, with rl \u2264 min(Kl, Sl), and Ml is an Sl \u00d7 rl matrix that mixes the activations of these basis kernels.\nBy making use of a post-processing step on a network trained in the usual way, however, there is no guarantee that, during training, many singular values have become near-zero. Here, we aim to explicitly account for this post-processing step during training, by seeking to obtain a parameter matrix such that rl << min(Kl, Sl). To this end, given N training input-output pairs (xi, yi), we formulate learning as the regularized minimization problem\n\nmin_\u0398 (1/N) \u2211_{i=1}^{N} \u2113(yi, f(xi, \u0398)) + r(\u0398) ,   (1)\n\nwhere \u0398 encompasses all network parameters, \u2113(\u00b7,\u00b7) is a supervised loss, such as the cross-entropy, and r(\u00b7) is a regularizer encouraging the parameter matrix in each layer to have low rank.\nSince explicitly minimizing the rank of a matrix is NP-hard, following the matrix completion literature [7, 6], we make use of a convex relaxation in the form of the nuclear norm.
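The SVD-based compaction described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' Torch-7 code; the function names and tensor sizes are our own, and only the shapes follow the notation in the text (a Kl x Cl x dH_l x dW_l tensor flattened to Kl x Sl).

```python
import numpy as np

def factorize_layer(theta, rank):
    """Flatten a K x C x dH x dW conv tensor into the K x S matrix
    (S = C*dH*dW) and approximate it with a rank-r factorization
    theta_hat ~ U @ M.T, as in the SVD post-processing step."""
    K = theta.shape[0]
    theta_hat = theta.reshape(K, -1)             # K x S parameter matrix
    U, s, Vt = np.linalg.svd(theta_hat, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]                 # K x r basis kernels
    M_r = Vt[:rank].T                            # S x r mixing matrix
    return U_r, M_r

# Toy layer: 64 filters, 32 input channels, 3x3 kernels (sizes made up).
theta = np.random.randn(64, 32, 3, 3)
U_r, M_r = factorize_layer(theta, rank=16)
```

With the full rank (here 64) the factorization reproduces the original matrix exactly; truncating the small singular values trades a little accuracy for fewer parameters.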
This lets us write our regularizer as\n\nr(\u0398) = \u03c4 \u2211_{l=1}^{L} ||\u02c6\u03b8l||_* ,   (2)\n\nwhere \u03c4 is a hyper-parameter setting the influence of the regularizer, and the nuclear norm is defined as ||\u02c6\u03b8l||_* = \u2211_{j=1}^{rank(\u02c6\u03b8l)} \u03c3_l^j, with \u03c3_l^j the singular values of \u02c6\u03b8l.\nIn practice, to minimize (1), we make use of proximal stochastic gradient descent. Specifically, this amounts to minimizing the supervised loss only for one epoch, with learning rate \u03c1, and then applying the proximity operator of our regularizer. In our case, this can be achieved independently for each layer. For layer l, this proximity operator corresponds to solving\n\n\u03b8_l^* = argmin_{\u00af\u03b8l} (1/(2\u03c1)) ||\u00af\u03b8l \u2212 \u02c6\u03b8l||_F^2 + \u03c4 ||\u00af\u03b8l||_* ,   (3)\n\nwhere \u02c6\u03b8l is the current estimate of the parameter matrix for layer l. As shown in [6], the solution to this problem can be obtained by soft-thresholding the singular values of \u02c6\u03b8l, which can be written as\n\n\u03b8_l^* = Ul \u03a3l(\u03c1\u03c4) Vl^T , where \u03a3l(\u03c1\u03c4) = diag([(\u03c3_l^1 \u2212 \u03c1\u03c4)_+, . . . , (\u03c3_l^{rank(\u02c6\u03b8l)} \u2212 \u03c1\u03c4)_+]) ,   (4)\n\nUl and Vl are the left- and right-singular vectors of \u02c6\u03b8l, and (\u00b7)_+ corresponds to taking the maximum between the argument and 0.\n\n3.1 Low-rank and Group-sparse Layers\n\nWhile, as shown in our experiments, the low-rank solution discussed above significantly reduces the number of parameters in the network, it does not affect the original number of input and output channels Cl and Kl.
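The proximal update of Eqs. (3)-(4), i.e., soft-thresholding the singular values by \u03c1\u03c4, can be sketched as follows. This is an illustrative NumPy version (the matrix size and the \u03c1, \u03c4 values are arbitrary), not the authors' implementation.

```python
import numpy as np

def nuclear_prox(theta_hat, rho, tau):
    """Proximity operator of tau * ||.||_* with step size rho (Eq. (4)):
    shrink every singular value by rho*tau and clamp at zero."""
    U, s, Vt = np.linalg.svd(theta_hat, full_matrices=False)
    return (U * np.maximum(s - rho * tau, 0.0)) @ Vt

def nuclear_norm(A):
    """Sum of singular values (Eq. (2) for a single layer, tau = 1)."""
    return np.linalg.svd(A, compute_uv=False).sum()

# One proximal SGD round: run supervised SGD for an epoch with learning
# rate rho, then apply the operator to each layer's parameter matrix.
rng = np.random.default_rng(0)
theta_hat = rng.standard_normal((64, 288))       # toy 64 x 288 layer matrix
theta_star = nuclear_prox(theta_hat, rho=0.1, tau=5.0)
```

Every singular value above the threshold \u03c1\u03c4 is reduced by exactly that amount, and the ones below it are zeroed, which is what drives whole directions (and eventually whole units) to zero during training.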
By contrast, the group-sparsity based methods [2, 50] discussed in Section 2 cancel out entire units, thus reducing these numbers, but do not consider the interactions between multiple units in the same layer, and would therefore typically not benefit from a post-processing step such as the one of [14]. Here, we propose to make the best of both worlds to obtain low-rank parameter matrices, some of whose units have explicitly been removed.\nTo this end, we combine the sparse group Lasso regularizer used in [2] with the low-rank one described above. This lets us re-define the regularizer in (1) as\n\nr(\u0398) = \u2211_{l=1}^{L} ( (1\u2212\u03b1) \u03bbl \u221aPl \u2211_{n=1}^{Kl} ||\u03b8_l^n||_2 + \u03b1 \u03bbl ||\u03b8l||_1 ) + \u03c4 \u2211_{l=1}^{L} ||\u02c6\u03b8l||_* ,   (5)\n\nwhere Kl is the number of units in layer l, \u03b8_l^n denotes the vector of parameters for unit n in layer l, Pl is the size of this vector (the same for all units in a layer), \u03b1 \u2208 [0,1] balances the influence of sparsity terms on groups vs. individual parameters, and \u03bbl is a layer-wise hyper-parameter. In practice, following [2], we use only two different values of \u03bbl; one for the first few layers and one for the remaining ones.\nTo learn our model with this new regularizer consisting of two main terms, we make use of the incremental proximal descent approach proposed in [39], which has the benefit of having a lower memory footprint than parallel proximal methods. The proximity operator for the sparse group Lasso regularizer also has a closed-form solution, derived in [43] and provided in [2].\n\n3.2 Benefits at Inference\n\nOnce our model is trained, we can obtain a compact network for faster and more memory-efficient inference by making use of a post-processing step.
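For reference, the value of the combined regularizer in Eq. (5) can be computed as below. This is a sketch under our own conventions (each row of the Kl x Sl layer matrix is one unit \u03b8_l^n, so Pl is the row length); the function and variable names are ours, not the authors'.

```python
import numpy as np

def combined_regularizer(layer_mats, lams, alpha, tau):
    """r(Theta) from Eq. (5): per-layer sparse group Lasso over units
    (rows) plus the nuclear norm, summed over all layers."""
    total = 0.0
    for theta, lam in zip(layer_mats, lams):
        P = theta.shape[1]                                 # params per unit
        group = np.sqrt(P) * np.linalg.norm(theta, axis=1).sum()
        l1 = np.abs(theta).sum()
        nuclear = np.linalg.svd(theta, compute_uv=False).sum()
        total += (1 - alpha) * lam * group + alpha * lam * l1 + tau * nuclear
    return total

# Toy check: a single 3 x 3 identity "layer", group term only.
r_val = combined_regularizer([np.eye(3)], lams=[1.0], alpha=0.0, tau=0.0)
```

The group term pushes whole rows (units) to zero, the L1 term sparsifies individual weights, and the nuclear term correlates the surviving units so that the SVD step can remove further directions.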
In particular, to account for the low rank of the parameter matrix of each layer, we make use of the SVD-based approach of [14]. Specifically, for each layer l, we compute the SVD of the parameter matrix as \u02c6\u03b8l = \u02dcUl \u02dc\u03a3l \u02dcVl^T and only keep the rl singular values that are either non-zero, thus incurring no loss, or larger than a pre-defined threshold, at some potential loss. The parameter matrix can then be represented as \u02c6\u03b8l = Ul Ml, with Ul \u2208 R^{Cl dH_l dW_l \u00d7 rl} and Ml = \u03a3l Vl \u2208 R^{rl \u00d7 Kl}. In essence, every layer is decomposed into two layers. This incurs significant memory and computational savings if rl (Cl dH_l dW_l + Kl) << Cl dH_l dW_l Kl.\nFurthermore, additional savings can be achieved when using the sparse group Lasso regularizer discussed in Section 3.1. Indeed, in this case, the zeroed-out units can explicitly be removed, thus yielding only \u02c6Kl filters, with \u02c6Kl < Kl. Note that, except for the first layer, units have also been removed from the previous layer, thus reducing Cl to a lower \u02c6Cl. Furthermore, thanks to our low-rank regularizer, the remaining, non-zero units will form a parameter matrix that still has low rank, and can thus also be decomposed. This results in a total of rl (\u02c6Cl dH_l dW_l + \u02c6Kl) parameters.\nIn our experiments, we select the rank rl based on the percentage el of the energy (i.e., the sum of singular values) that we seek to capture by our low-rank approximation. This percentage plays an important role in the trade-off between runtime/memory savings and drop of prediction accuracy. In our experiments, we use the same percentage for all layers.\n\n4 Experimental Settings\n\nDatasets: For our experiments, we used two image classification datasets: ImageNet [41] and ICDAR, the character recognition dataset introduced in [27].
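The energy-based rank selection and the resulting parameter count described in Section 3.2 can be sketched as follows. These are our own illustrative helpers (the layer sizes in the example are made up), not code from the paper.

```python
import numpy as np

def select_rank(singular_values, energy_pct):
    """Smallest rank r whose leading singular values capture
    `energy_pct` percent of the total energy (sum of singular values)."""
    s = np.asarray(singular_values, dtype=float)
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, energy_pct / 100.0) + 1)

def decomposed_params(C, dH, dW, K, r):
    """Parameters of the two layers after the SVD split, r*(C*dH*dW + K),
    to be compared against the original C*dH*dW*K."""
    return r * (C * dH * dW + K)

# Toy spectrum: with el = 100% the trailing zero singular value is dropped
# at no loss; with el = 80% the approximation keeps fewer directions.
s = np.array([10.0, 5.0, 3.0, 1.0, 0.5, 0.0])
r80 = select_rank(s, 80.0)
```

The inequality in the text then decides whether the split actually pays off: for a 32-channel 3x3 layer with 64 filters, rank 16 gives 16*(288 + 64) = 5632 parameters versus the original 18432.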
ImageNet is a large-scale dataset\ncomprising over 15 million labeled images split into 22,000 categories. We used the ILSVRC-\n2012 [41] subset consisting of 1000 categories, with 1.2 million training images and 50,000 validation\nimages. The ICDAR dataset consists of 185,639 training samples combining real and synthetic\ncharacters and 5,198 test samples coming from the ICDAR2003 training set after removing all\nnon-alphanumeric characters. The images in ICDAR are split into 36 categories. The use of ICDAR\nhere was motivated by the fact that it is fairly large-scale, but, in contrast with ImageNet, existing\narchitectures haven\u2019t been heavily tuned to this data. As such, one can expect our approach consisting\nof training a compact network from scratch to be even more effective on this dataset.\n\nNetwork Architectures:\nIn our experiments, we make use of architectures where each kernel\nin the convolutional layers has been decomposed into two 1D kernels [1], thus inherently having\nrank-1 kernels. Note that this is orthogonal to the purpose of our low-rank regularizer, since, here,\nwe essentially aim at reducing the number of kernels, not the rank of individual kernels. The\ndecomposed layers yield even more compact architectures that require a lower computational cost for\ntraining and testing while maintaining or even improving classi\ufb01cation accuracy. In the following, a\nconvolutional layer refers to a layer with 1D kernels, while a decomposed layer refers to a block of\ntwo convolutional layers using 1D vertical and horizontal kernels, respectively, with a non-linearity\nand batch normalization after each convolution.\nLet us consider a decomposed layer consisting of C and K input and output channels, respectively.\nLet \u00afv and \u00afhT be vectors of length dv and dh, respectively, representing the kernel size of each 1D\nfeature map. In this paper, we set dh = dv \u2261 d. 
Furthermore, let \u03d5(\u00b7) be a non-linearity, and xc denote the c-th input channel of the layer. In this setting, the activation of the i-th output channel fi can be written as\n\nfi = \u03d5( bh_i + \u2211_{l=1}^{L} \u00afh_il^T \u2217 [ \u03d5( bv_l + \u2211_{c=1}^{C} \u00afv_lc \u2217 xc ) ] ) ,   (6)\n\nwhere L is the number of vertical filters, corresponding to the number of input channels for the horizontal filters, and bv_l and bh_l are biases.\nWe report results with two different models using such decomposed layers: DecomposeMe [1] and ResNets [23]. In all cases, we make use of batch normalization after each convolutional layer^1. We rely on rectified linear units (ReLU) [31] as non-linearities, although some initial experiments suggest that slightly better performance can be obtained with exponential linear units [8]. For DecomposeMe, we used two different Dec8 architectures, whose specific numbers of units are provided in Table 1.
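Eq. (6) amounts to a vertical 1D convolution, a non-linearity, then a horizontal 1D convolution. A minimal single-channel sketch (C = K = L = 1, zero biases; `corr_valid` is our toy stand-in for the framework's convolution, and all names are ours):

```python
import numpy as np

def corr_valid(x, k):
    """2-D 'valid' cross-correlation, the operation CNNs call convolution."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + h, j:j + w] * k).sum()
    return out

def relu(z):
    return np.maximum(z, 0.0)

def decomposed_layer(x, v, h, bv=0.0, bh=0.0):
    """Eq. (6) with one input channel and one filter per stage:
    vertical filter v (as a column), ReLU, horizontal filter h (as a row)."""
    inner = relu(bv + corr_valid(x, v[:, None]))
    return relu(bh + corr_valid(inner, h[None, :]))

# Non-negative toy data so the inner ReLU is inactive.
x = np.random.default_rng(1).random((8, 8))
v = np.random.default_rng(2).random(3)
h = np.random.default_rng(3).random(3)
out = decomposed_layer(x, v, h)
```

With the inner ReLU inactive, the two 1D passes match a full 2D convolution with the rank-1 kernel v h^T, which is exactly the decomposition these architectures exploit.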
For residual networks, we used a decomposed ResNet-50, and empirically verified that the use of 1D kernels instead of the standard ones had no significant impact on classification accuracy.\n\nImplementation details: For the comparison to be fair, all models, including the baselines, were trained from scratch on the same computer using the same random seed and the same framework. More specifically, we used the torch-7 multi-gpu framework [11].\n^1 We empirically found the use of batch normalization after each convolutional layer to have more impact with our low-rank regularizer than with group sparsity or with no regularizer, in which cases the computational cost can be reduced by using a single batch normalization after each decomposed layer.\n\nLayer:    1v | 1h | 2v | 2h | 3v | 3h | 4v | 4h | 5v | 5h | 6v | 6h | 7v | 7h | 8v | 8h\nDec256_8: 32/11 | 64/11 | 128/5 | 192/5 | 256/3 | 384/3 | 256/3 | 256/3 | 256/3 | 256/3 | 256/3 | 256/3 | 256/3 | 256/3 | 256/3 | 256/3\nDec512_8: 32/11 | 64/11 | 128/5 | 192/5 | 256/3 | 384/3 | 256/3 | 256/3 | 512/3 | 512/3 | 512/3 | 512/3 | 512/3 | 512/3 | 512/3 | 512/3\nDec512_3: 48/9 | 96/9 | 160/9 | 256/9 | 512/8 | 512/8 | \u2013 | \u2013 | \u2013 | \u2013 | \u2013 | \u2013 | \u2013 | \u2013 | \u2013 | \u2013\n\nTable 1: Different DecomposeMe architectures used on ImageNet and ICDAR. Each entry represents the number of filters and their dimension.\n\nConf\u03bb:    0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11\n1v to 2h: \u2013 | 0.0127 | 0.051 | 0.204 | 0.255 | 0.357 | 0.051 | 0.051 | 0.051 | 0.051 | 0.051 | 0.153\n3v to 8h: \u2013 | 0.0127 | 0.051 | 0.204 | 0.255 | 0.357 | 0.357 | 0.408 | 0.510 | 0.255 | 0.765 | 0.51\n\nTable 2: Sparse group Lasso hyper-parameter configurations. The first row provides \u03bb for the first four convolutional layers, while the second one shows \u03bb for the remaining layers.
The first five configurations correspond to using the same regularization penalty for all the layers, while the latter ones define weaker penalties on the first two layers, as suggested in [2].\n\nFor ImageNet, training was done on a DGX-1 node using two P100 GPUs in parallel. We used stochastic gradient descent with a momentum of 0.9 and a batch size of 180 images. The models were trained using an initial learning rate of 0.1, multiplied by 0.1 every 20 iterations for the small models (Dec256_8 in Table 1) and every 30 iterations for the larger models (Dec512_8 in Table 1). For ICDAR, we trained each network on a single TitanX-Pascal GPU for a total of 55 epochs with a batch size of 256 and 1,000 iterations per epoch. We followed the same experimental setting as in [2]: the initial learning rate was set to 0.1 and multiplied by 0.1. We used a momentum of 0.9.\nFor DecomposeMe networks, we only performed basic data augmentation, consisting of random crops and random horizontal flips with probability 0.5. At test time, we used a single central crop. For ResNets, we used the standard data augmentation advocated for in [23]. In practice, in all models, we also included weight decay with a penalty strength of 1e\u22124 in our loss function. We observed empirically that adding this weight decay prevents the weights from growing overly large between every two computations of the proximity operator.\nIn terms of hyper-parameters, for our low-rank regularizer, we considered four values: \u03c4 \u2208 {0, 1, 5, 10}. For the sparse group Lasso term, we initially set the same \u03bb for every layer to analyze the effect of combining both types of regularization. Then, in a second experiment, we followed the experimental set-up proposed in [2], where the first two decomposed layers have a lower penalty. In addition, we set \u03b1 = 0.2 to favor promoting sparsity at group level rather than at parameter level.
The sparse group Lasso hyper-parameter values are summarized in Table 2.\n\nComputational cost: While a convenient measure of computational cost is the forward time, this measure is highly hardware-dependent. Nowadays, hardware is heavily optimized for current architectures and does not necessarily reflect the concept of any-time-computation. Therefore, we focus on analyzing the number of multiply-accumulate operations (MAC). Let a convolution be defined as fi = \u03d5( bi + \u2211_{j=1}^{C} W_ij \u2217 xj ), where each W_ij is a 2D kernel of dimensions dH \u00d7 dW and i \u2208 [1, . . . , K]. Considering a naive convolution algorithm, the number of MACs for a convolutional layer is equal to P C K dH dW, where P is the number of pixels in the output feature map. Therefore, it is important to reduce CK whenever P is large. That is, reducing the number of units in the first convolutional layers has more impact than in the later ones.\n\n5 Experimental Results\n\nParameter sensitivity and comparison to other methods on ImageNet: We first analyze the effect of our low-rank regularizer on its own and jointly with the sparse group Lasso one on MACs and accuracy. To this end, we make use of the Dec256_8 model on ImageNet, and measure the impact of varying both \u03c4 and \u03bb in Eq. 5. Note that using \u03c4 = \u03bb = 0 corresponds to the standard model, and \u03c4 = 0 and \u03bb \u2260 0 to the method of [2]. Below, we report results obtained without and with the post-processing step described in Section 3.2. Note that applying such a post-processing on the standard model corresponds to the compression technique of [14]. Fig. 1 summarizes the results of this analysis.\n\nFigure 1: Parameter sensitivity for Dec256_8 on ImageNet. (a) Accuracy as a function of the regularization strength. (b) MACs directly after training. (c) MACs after the post-processing step of Section 3.2 for el = {100%, 80%}. In all the figures, isolated points represent the models trained without the sparse group Lasso regularizer. The red point corresponds to the baseline, where no low-rank or sparsity regularization was applied. The specific sparse group Lasso hyper-parameters for each configuration Conf\u03bb are given in Table 2.\n\nFigure 2: Effect of the low-rank regularizer on its own on Dec256_8 on ImageNet. (Left) Number of units per layer. (Right) Effective rank per layer for (top) el=100% and (bottom) el=80%. Note that, on its own, our low-rank regularizer already helps cancel out entire units, thus inherently performing model selection.\n\nIn Fig. 1(a), we can observe that accuracy remains stable for a wide range of values of \u03c4 and \u03bb. In fact, there are even small improvements in accuracy when a moderate regularization is applied.\nFigs. 1(b,c) depict the MACs without and with applying the post-processing step discussed in Section 3.2. As expected, the MACs decrease as the weights of the regularizers increase. Importantly, however, Figs. 1(a,b) show that several models can achieve a high compression rate at virtually no loss in accuracy. In Fig. 1(c), we provide the curves after post-processing with two different energy percentages el = {100%, 80%}. Keeping all the energy tends to incur an increase in MAC, since the inequality defined in Section 3.2 is then not satisfied anymore. Recall, however, that, without post-processing, the resulting models are still more compact than and as accurate as the baseline one. With el = 80%, while a small drop in accuracy typically occurs, the gain in MAC is significantly larger. Altogether, these experiments show that, by providing more compact models, our regularizer lets us consistently reduce the computational cost over the baseline.\nInterestingly, by looking at the case where Conf\u03bb = 0 in Fig.
1(b), we can see that we already significantly reduce the number of operations when using our low-rank regularizer only, even without post-processing. This is due to the fact that, even in this case, a significant number of units are automatically zeroed-out. Empirically, we observed that, for moderate values of \u03c4, the number of zeroed-out singular values corresponds to complete units going to zero. This can be observed in Fig. 2(left), where we show the number of non-zero units for each layer. In Fig. 2(right), we further show the effective rank of each layer before and after post-processing.\n\nMethod:   hyper-params | # Params | top-1\nBaseline: \u2013 | 3.7M | 88.6%\n[14]:     el = 90% | 3.6M | 88.5%\n[2]:      \u2013 | 525K | 89.6%\nOurs:     \u03c4 = 15, el = 90% | 728K | 88.8%\nOurs+[2]: \u03c4 = 15, el = 90% | 318K | 89.7%\nOurs+[2]: \u03c4 = 15, el = 100% | 454K | 90.5%\n\nTable 3: Comparison to other methods on ICDAR.\n\nImageNet, Dec512_8: Top-1 | Low-Rank appx. Params / MAC | no SVD Params / MAC\n-el=80%:  66.8 | \u221253.5 / \u221246.2 | \u221239.5 / \u221225.3\n-el=100%: 67.6 | \u221221.1 / \u22124.8 | \u221239.5 / \u221225.3\n\nICDAR, Dec512_3: Top-1 | Low-Rank appx. Params / MAC | no SVD Params / MAC\n-el=80%:  89.6 | \u221291.9 / \u221292.9 | \u221289.2 / \u221281.6\n-el=100%: 90.8 | \u221285.3 / \u221286.8 | \u221289.2 / \u221281.6\n\nTable 4: Accuracy and compression rates for Dec512_8 models on ImageNet (top) and Dec512_3 on ICDAR (bottom). The number of parameters and MACs are given in % relative to the baseline model (i.e., without any regularizer). A negative value indicates reduction with respect to the baseline. The accuracy of the baseline is 67.0 for ImageNet and 89.3 for ICDAR.\n\nModel:            el = 80% | el = 100% | no SVD | baseline (\u03c4 = 0)\nDec256_8 -\u03c4 = 1:  97.33 | 125.44 | 94.60 | 94.70\nDec256_8 -\u03c4 = 5:  88.33 | 119.27 | 90.55 | 94.70\nDec256_8 -\u03c4 = 10: 85.78 | 110.35 | 91.36 | 94.70\n\nTable 5: Forward time in milliseconds (ms) using a Titan X (Pascal).
We report the average over 50 forward passes using a batch size of 256. A large batch size minimizes the effect of memory overheads due to non-hardware optimizations.

Comparison to other approaches on ICDAR: We now compare our results with existing approaches on the ICDAR dataset. As a baseline, we consider the Dec512_3 model trained using SGD and L2 regularization for 75 epochs. For comparison, we consider the post-processing approach of [14] with el = 90%, the group-sparsity regularization approach proposed in [2], and three different instances of our model: first, using τ = 15, no group-sparsity, and el = 90%; then, two instances combining our low-rank regularizer with group-sparsity (Section 3.1), with el = 90% and el = 100%. In this case, the models are trained for 55 epochs and then reloaded and fine-tuned for 20 more epochs. Table 3 summarizes these results. The comparison with [14] clearly evidences the benefits of our compression-aware training strategy. Furthermore, these results show the benefits of further combining our low-rank regularizer with the group-sparsity one of [2].
In addition, we also compare our approach with L1 and L2 regularizers on the same dataset and with the same experimental setup. Pruning the weights of the baseline models with a threshold of 1e-4 resulted in 1.5M zeroed-out parameters for the L2 regularizer and 2.8M zeroed-out parameters for the L1 regularizer. However, these zeroed-out weights are sparsely located within units (neurons). Applying our post-processing step (low-rank approximation with el = 100%) to these results yielded models with 3.6M and 3.2M parameters for the L2 and L1 regularizers, respectively. The top-1 accuracy of these two models after post-processing was 87% and 89%, respectively. Using a stronger L1 regularizer resulted in lower top-1 accuracy.
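The SVD-based post-processing step referred to throughout, i.e., truncating each layer's parameter matrix to the smallest rank that retains a fraction el of the squared singular-value energy, can be sketched as follows. This is an illustrative sketch only: the function name is ours, el is given here as a fraction rather than a percentage, and the actual experiments decompose convolutional layers rather than a dense toy matrix.

```python
import numpy as np

def low_rank_compress(W, el=0.9):
    """Replace W by two thin factors, keeping the smallest rank whose
    squared singular values retain a fraction `el` of the total energy."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(energy, el)) + 1   # smallest rank reaching el
    A = U[:, :r] * s[:r]    # (m, r) factor, singular values folded in
    B = Vt[:r, :]           # (r, n) factor
    return A, B, r

# Toy check: a nearly rank-1 matrix collapses to rank 1 at el = 0.9.
rng = np.random.default_rng(0)
W = np.outer(rng.standard_normal(64), rng.standard_normal(32))
W += 1e-3 * rng.standard_normal((64, 32))       # small full-rank noise
A, B, r = low_rank_compress(W, el=0.9)
print(r)                                 # → 1
print(np.allclose(W, A @ B, atol=0.1))   # → True
```

Storing A and B instead of W is only worthwhile when r(m + n) < mn, which is the inequality of Section 3.2 that el = 100% tends to violate.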
By comparison, our approach yields a model with 3.4M zeroed-out parameters after post-processing and a top-1 accuracy of 90%. Empirically, we found the benefits of our approach to hold for varying regularizer weights.

Results with larger models: In Table 4, we provide the accuracies and MACs for our approach and the baseline on ImageNet and ICDAR for Dec512_8 models. Note that using our low-rank regularizer yields more compact networks than the baselines for similar or higher accuracies. In particular, for ImageNet, we achieve reductions in the number of parameters of more than 20% and more than 50% for el = 100% and el = 80%, respectively. For ICDAR, these reductions are around 90% in both cases.
We now focus on our results with a ResNet-50 model on ImageNet. For post-processing, we used el = 90% for all these experiments, which resulted in virtually no loss of accuracy. The baseline corresponds to a top-1 accuracy of 74.7% and 18M parameters. Applying the post-processing step to this baseline resulted in a compression rate of 4%. By contrast, our approach with the low-rank regularizer yields a top-1 accuracy of 75.0% for a compression rate of 20.6%, and with group sparsity and low-rank jointly, a top-1 accuracy of 75.2% for a compression rate of 27%. By comparison, applying [2] to the same model yields an accuracy of 74.5% for a compression rate of 17%.

           reload   Num. parameters    top-1      Total
           epoch    no SVD   Total     accuracy   train-time
Baseline   –        –        3.7M      88.4%      1.69h
r5         5        3.2M     3.71M     89.8%      1.81h
r15        15       210K     2.08M     90.0%      0.77h
r25        25       218K     1.60M     90.0%      0.88h
r35        35       222K     1.52M     89.0%      0.99h
r45        45       324K     1.24M     90.1%      1.12h
r55        55       388K     1.24M     89.2%      1.26h
r65        65       414K     1.23M     87.7%      1.36h

Figure 3: Forward-backward training time in milliseconds when varying the reload epoch for Dec512_3 on ICDAR. (Left) Forward-backward time per batch in milliseconds (with a batch size of 32). (Right) Summary of the results of each experiment. Note that we could reduce the training time from 1.69 hours (baseline) to 0.77 hours by reloading the model at the 15th epoch. This corresponds to a relative training-time speed-up of 54.5% and yields a 2% improvement in top-1 accuracy.

Inference time: While MACs represent the number of operations, we are also interested in the inference time of the resulting models. Table 5 summarizes several representative inference times for different instances of our experiments. Interestingly, there is a significant reduction in inference time when we only remove the zeroed-out neurons from the model. This is a direct consequence of the pruning effect, especially in the first layers. However, there is no significant reduction in inference time when post-processing our model via a low-rank decomposition. The main reason for this is that modern hardware is designed to compute convolutions with far fewer operations than a naive algorithm. Furthermore, the actual computational cost depends not only on the number of floating-point operations but also on the memory bandwidth. In modern architectures, decomposing a convolutional layer into a convolution and a matrix multiplication involves, with current hardware, additional intermediate computations, as one cannot reuse convolutional kernels. Nevertheless, we believe that our approach remains beneficial for embedded systems using customized hardware, such as FPGAs.

Additional benefits at training time: So far, our experiments have demonstrated the effectiveness of our approach at test time. Empirically, we found that our approach is also beneficial for training: pruning the network after only a few epochs (e.g., 15), then reloading and training the pruned network, which becomes much more efficient. Specifically, Figure 3 (right) summarizes the effect of varying the reload epoch for a model relying on both low-rank and group-sparsity.
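The pruning step used in this reload strategy amounts to removing the units that the regularizer has driven to zero and rebuilding a smaller network before resuming training. The following is a minimal sketch for fully-connected layers; the function name, tolerance, and toy matrices are ours, and the actual models are convolutional.

```python
import numpy as np

def prune_dead_units(weights, tol=1e-4):
    """Drop units whose incoming weights were zeroed out during training.

    `weights` is a list of per-layer matrices of shape (n_out, n_in).
    Unit j of layer l owns row j of weights[l] and column j of
    weights[l+1]; both are removed when the row is (near) zero.
    With ReLU activations and no biases, the pruned network computes
    exactly the same function.
    """
    pruned = [W.copy() for W in weights]
    for l in range(len(pruned) - 1):
        alive = np.abs(pruned[l]).max(axis=1) > tol   # rows with signal
        pruned[l] = pruned[l][alive]                  # drop dead rows
        pruned[l + 1] = pruned[l + 1][:, alive]       # and matching columns
    return pruned

# Toy example: zero out unit 1 of the first layer, then prune.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4)); W1[1] = 0.0
W2 = rng.standard_normal((2, 3))
P1, P2 = prune_dead_units([W1, W2])
print(P1.shape, P2.shape)   # → (2, 4) (2, 2)
```

The smaller network returned here is what gets reloaded and trained for the remaining epochs, which is where the training-time savings below come from.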
We were able to reduce the training time (with a batch size of 32, training for 100 epochs) from 1.69 to 0.77 hours (a relative speedup of 54.5%). The accuracy also improved by 2%, and the number of parameters was reduced from 3.7M (baseline) to 210K (a relative 94.3% reduction). We found this behavior to be stable across a wide range of regularization parameters. If we seek to maintain accuracy compared to the baseline, we found that we could achieve a compression rate of 95.5% (up to 96% for an accuracy drop of 0.5%), which corresponds to a training time reduced by up to 60%.

6 Conclusion
In this paper, we have proposed to explicitly account for a post-processing compression stage when training deep networks. To this end, we have introduced a regularizer in the training loss to encourage the parameter matrix of each layer to have low rank. We have further studied the case where this regularizer is combined with a sparsity-inducing one to achieve even higher compression. Our experiments have demonstrated that our approach can achieve higher compression rates than state-of-the-art methods, thus evidencing the benefits of taking compression into account during training. The SVD-based technique that motivated our approach is only one specific choice of compression strategy. In the future, we will therefore study how regularizers corresponding to other such compression mechanisms can be incorporated in our framework.

References
[1] J. M. Alvarez and L. Petersson. DecomposeMe: Simplifying ConvNets for end-to-end learning. CoRR, abs/1606.05426, 2016.
[2] J. M. Alvarez and M. Salzmann. Learning the number of neurons in neural networks. In NIPS, 2016.
[3] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[4] M. Babaeizadeh, P. Smaragdis, and R. H. Campbell. NoiseOut: A simple way to prune neural networks. In NIPS EMDNN Workshop, 2016.
[5] Y. Bengio and J. S. Bergstra.
Slow, decorrelated features for pretraining complex cell-like networks. In NIPS, pages 99-107, 2009.
[6] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. on Optimization, 20(4):1956-1982, Mar. 2010.
[7] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. CoRR, abs/0805.4471, 2008.
[8] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015.
[9] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra. Reducing overfitting in deep networks by decorrelating representations. In ICLR, 2016.
[10] M. D. Collins and P. Kohli. Memory bounded deep convolutional networks. CoRR, abs/1412.1442, 2014.
[11] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[12] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
[13] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting parameters in deep learning. CoRR, abs/1306.0543, 2013.
[14] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[15] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, EECS Department, University of California, Berkeley, Mar. 2010.
[16] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[17] A. Graves. Practical variational inference for neural networks. In NIPS, 2011.
[18] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan.
Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015.
[19] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
[20] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015.
[21] M. Harandi and B. Fernando. Generalized backpropagation, étude de cas: Orthogonality. CoRR, abs/1611.05927, 2016.
[22] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In ICNN, 1993.
[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[24] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.
[25] Y. Ioannou, D. P. Robertson, J. Shotton, R. Cipolla, and A. Criminisi. Training CNNs with low-rank filters for efficient image classification. CoRR, abs/1511.06744, 2015.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[27] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014.
[28] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference, 2014.
[29] C. Ji, R. R. Snapp, and D. Psaltis. Generalizing smoothness constraints from discrete samples. Neural Computation, 2(2):188-197, June 1990.
[30] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[32] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky.
Speeding-up convolutional neural networks using fine-tuned CP-decomposition. CoRR, abs/1412.6553, 2014.
[33] Y. LeCun, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS, 1990.
[34] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In CVPR, 2015.
[35] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016.
[36] M. Mozer and P. Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In NIPS, 1988.
[37] H. Pan and H. Jiang. Learning convolutional neural networks using hybrid orthogonal projection and estimation. CoRR, abs/1606.05929, 2016.
[38] R. Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740-747, Sep. 1993.
[39] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In ICML, 2012.
[40] P. Rodríguez, J. Gonzàlez, G. Cucurull, J. M. Gonfaus, and X. Roca. Regularizing CNNs with locally constrained decorrelations. In ICLR, 2017.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
[42] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR, abs/1602.07868, 2016.
[43] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.
[44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.
[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[47] C. Tai, T. Xiao, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. CoRR, abs/1511.06067, 2015.
[48] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. CoRR, abs/1702.04008, 2017.
[49] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.
[50] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[51] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. CoRR, abs/1703.09746, 2017.
[52] W. Xiong, B. Du, L. Zhang, R. Hu, and D. Tao. Regularizing deep convolutional neural networks with a structured decorrelation constraint. In IEEE Int. Conf. on Data Mining (ICDM), 2016.
[53] M. D. Zeiler. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012.
[54] S. Zhang and H. Jiang. Hybrid orthogonal projection and estimation (HOPE): A new framework to probe and learn neural networks. CoRR, abs/1502.00702, 2015.
[55] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.