{"title": "Learning sparse neural networks via sensitivity-driven regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 3878, "page_last": 3888, "abstract": "The ever-increasing number of parameters in deep neural networks poses challenges for memory-limited applications. Regularize-and-prune methods aim at meeting these challenges by sparsifying the network weights. In this context we quantify the output sensitivity to the parameters (i.e. their relevance to the network output) and introduce a regularization term that gradually lowers the absolute value of parameters with low sensitivity.  Thus, a very large fraction of the parameters approach zero and are eventually set to zero by simple thresholding. Our method surpasses most of the recent techniques both in terms of sparsity and error rates. In some cases, the method reaches twice the sparsity obtained by other techniques at equal error rates.", "full_text": "Learning Sparse Neural Networks\nvia Sensitivity-Driven Regularization\n\nEnzo Tartaglione\nPolitecnico di Torino\n\nTorino, Italy\n\ntartaglioneenzo@gmail.com\n\nSkjalg Leps\u00f8y\n\nNuance Communications\n\nTorino, Italy\n\nAttilio Fiandrotti\n\nPolitecnico di Torino, Torino, Italy\nT\u00e9l\u00e9com ParisTech, Paris, France\n\nGianluca Francini\n\nTelecom Italia\nTorino, Italy\n\nAbstract\n\nThe ever-increasing number of parameters in deep neural networks poses challenges\nfor memory-limited applications. Regularize-and-prune methods aim at meeting\nthese challenges by sparsifying the network weights. In this context we quantify\nthe output sensitivity to the parameters (i.e. their relevance to the network output)\nand introduce a regularization term that gradually lowers the absolute value of\nparameters with low sensitivity. Thus, a very large fraction of the parameters\napproach zero and are eventually set to zero by simple thresholding. 
Our method\nsurpasses most of the recent techniques both in terms of sparsity and error rates. In\nsome cases, the method reaches twice the sparsity obtained by other techniques at\nequal error rates.\n\n1 Introduction\n\nDeep neural networks achieve state-of-the-art performance in a number of tasks by means of complex\narchitectures. Let us define the complexity of a neural network as the number of its learnable\nparameters. The complexity of architectures such as VGGNet [1] and the SENet-154 [2] lies in the\norder of $10^8$ parameters, hindering their deployability on portable and embedded devices, where\nstorage, memory and bandwidth resources are limited.\nThe complexity of a neural network can be reduced by promoting sparse interconnection structures.\nEmpirical evidence shows that deep architectures often need to be over-parametrized (having\nmore parameters than training examples) in order to be successfully trained [3, 4, 5]. However, once\ninput-output relations are properly represented by a complex network, such a network may form a\nstarting point for finding a simpler, sparser, but sufficient architecture [4, 5].\nRecently, regularization has been proposed as a principle for promoting sparsity during training. In\ngeneral, regularization replaces unstable (ill-posed) problems with nearby and stable (well-posed)\nones by introducing additional information about what a solution should be like [6]. This is often\ndone by adding a term $R$ to the original objective function $L$. Letting $\\theta$ denote the network parameters\nand $\\lambda$ the regularization factor, the problem\n\n$$\\text{minimize } L(\\theta) \\text{ with respect to } \\theta \\tag{1}$$\n\nis recast as\n\n$$\\text{minimize } L(\\theta) + \\lambda R(\\theta) \\text{ with respect to } \\theta. \\tag{2}$$\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nStability and generalization are strongly related or even equivalent, as shown by Mukherjee et al. 
[7].\nRegularization therefore also helps ensure that a properly trained network generalizes well on unseen\ndata.\nSeveral known methods aim at reaching sparsity via regularization terms that are more or less\nspecifically designed for the goal. Examples are found in [8, 9, 10].\nThe original contribution of this work is a regularization and pruning method that takes advantage of\noutput sensitivity to each parameter. This measure quantifies the change in network output that is\nbrought about by a change in the parameter. The proposed method gradually moves the less sensitive\nparameters towards zero, avoiding harmful modifications to sensitive, therefore important, parameters.\nWhen a parameter value approaches zero and drops below a threshold, the parameter is set to zero,\nyielding the desired sparsity of interconnections.\nFurthermore, our method incurs minimal computational overhead, since the sensitivity is a simple\nfunction of a by-product of back-propagation. Image classification experiments show that our method\nimproves sparsity with respect to competing state-of-the-art techniques. According to our evidence,\nthe method also improves generalization.\nThe rest of this paper is organized as follows. In Sec. 2 we review the relevant literature concerning\nsparse neural architectures. Next, in Sec. 3 we describe our supervised method for training a neural\nnetwork such that its interconnection matrix is sparse. Then, in Sec. 4 we experiment with our\nproposed training scheme over different network architectures. The experiments show that our\nproposed method achieves a tenfold reduction in the network complexity while leaving the network\nperformance unaffected. Finally, Sec. 5 draws the conclusions while providing further directions for\nfuture research.\n\n2 Related work\n\nSparse neural architectures have been the focus of intense research recently due to the advantages they\nentail. For example, Zhu et al. 
[11] have shown that a large sparse architecture improves the network\ngeneralization ability in a number of different scenarios. A number of approaches towards sparse\ninterconnection matrices have been proposed. For example, Liu et al. [12] propose to recast\nmulti-dimensional convolutional operations into bidimensional equivalents, resulting in a final reduction of\nthe required parameters. Another approach involves the design of an objective function to minimize the\nnumber of features in the convolutional layers. Wen et al. [8] propose a regularizer based on group\nlasso whose task is to cluster filters. However, such approaches are specific to convolutional layers,\nwhereas the bulk of network complexity often lies in the fully connected layers.\nA direct strategy to introduce sparsity in neural networks is $l_0$ regularization, which however entails\nsolving a highly complex optimization problem (e.g., Louizos et al. [13]).\nRecently, a technique based on soft weight sharing has been proposed to reduce the memory footprint\nof whole networks (Ullrich et al. [10]). However, it limits the number of possible parameter\nvalues, resulting in sub-optimal network performance.\nAnother approach involves making input signals sparse in order to use smaller architectures. Inserting\nautoencoder layers at the beginning of the neural network (Ranzato et al. [14]) or modeling \u2018receptive\nfields\u2019 to preprocess input signals for image classification (Culurciello et al. [15]) are two clear\nexamples of how a sparse, properly-correlated input signal can make the learning problem easier.\nIn the last few years, dropout techniques have also been employed to ease sparsification.\nMolchanov et al. [16] propose variational dropout to promote sparsity. This approach also\nprovides a Bayesian interpretation of Gaussian dropout. A similar but differently principled approach\nwas proposed by Theis et al. [17]. 
However, such a technique does not achieve state-of-the-art test error in fully-connected\narchitectures.\nThe proposal of Han et al. [9] consists of steps that are similar to those of our method. It is a\nthree-stage approach in which first, a network learns a coarse measurement of the importance of each connection,\nminimizing some target loss function; second, all the connections less important than a threshold\nare pruned; third and finally, the resulting network is retrained with standard backpropagation to\nlearn the actual weights. An application of such a technique can be found in [18]. Their experiments\nshow reduced complexity and, in part, better performance, achieved by avoiding network\nover-parametrization.\nIn this work, we propose to selectively prune each network parameter using the knowledge of\nsensitivity. Engelbrecht et al. [19] and Mrazova et al. [20, 21] previously proposed sensitivity-based\nstrategies for learning sparse architectures. In their work, however, the sensitivity is defined as the\nvariation of the network output with respect to a variation of the network inputs. Conversely, in our\nwork we define the sensitivity of a parameter as the variation of the network output with respect to\nthe parameter, pruning parameters with low sensitivity as they contribute little to the network output.\n\n3 Sensitivity-based regularization\n\nIn this section, we first formulate the sensitivity of a network with respect to a single network\nparameter. Next, we insert a sensitivity-based term in the update rule. Then, we derive a general\nper-parameter formulation of a regularization term based on sensitivity, with ReLU-activated neural networks\nas a particular case. Finally, we propose a general training procedure aiming for sparsity.\nAs we will experimentally see in Sec. 
4, our technique not only sparsifies the network, but improves\nits generalization ability as well.\n\n3.1 Some notation\n\nHere we introduce the terminology and the notation used in the rest of this work. Let a feed-forward,\nacyclic, multi-layer artificial neural network be composed of $N$ layers, with $x_{n-1}$ being the input of\nthe $n$-th network layer and $x_n$ its output, for integer $n \\in [1, N]$. We identify with $n = 0$ the input layer and with\n$n = N$ the output layer; the other values of $n$ indicate the hidden layers. The $n$-th layer has learnable\nparameters, indicated by $w_n$ (which can be biases or weights; in our notation, $\\theta = \\bigcup_{n=1}^{N} w_n$). In order to identify the $i$-th parameter\nat layer $n$, we write $w_{n,i}$.\nThe output of the $n$-th layer can be described as\n\n$$x_n = f_n\\left[g_n\\left(x_{n-1}, w_n\\right)\\right], \\tag{3}$$\n\nwhere $g_n(\\cdot)$ is usually some affine function and $f_n(\\cdot)$ is the activation function at layer $n$. In the\nfollowing, $x_0$ indicates the network input. Let us indicate the output of the network as $y = x_N \\in \\mathbb{R}^C$,\nwith $C \\in \\mathbb{N}$. Similarly, $y^*$ indicates the target (expected) network output associated with $x_0$.\n\nFigure 1: Generic layer representation (Fig. 1a) and the case of a fully connected layer in detail\n(Fig. 1b, here we have $w_n \\in \\mathbb{R}^{m \\times p}$). Biases may also be included.\n\n3.2 Sensitivity definition\n\nWe are interested in evaluating how influential a generic parameter $w_{n,i}$ is in determining the $k$-th\noutput of the network (given an input of the network).\nLet the weight $w_{n,i}$ vary by a small amount $\\Delta w_{n,i}$, such that the output varies by $\\Delta y$. For small\n$\\Delta w_{n,i}$, we have, for each element,\n\n$$\\Delta y_k \\approx \\Delta w_{n,i} \\frac{\\partial y_k}{\\partial w_{n,i}} \\tag{4}$$\n\nby a Taylor series expansion. A weighted sum of the variation in all output elements is then\n\n$$\\sum_{k=1}^{C} \\alpha_k \\left|\\Delta y_k\\right| = \\left|\\Delta w_{n,i}\\right| \\sum_{k=1}^{C} \\alpha_k \\left|\\frac{\\partial y_k}{\\partial w_{n,i}}\\right| \\tag{5}$$\n\nwhere $\\alpha_k > 0$. The sum on the right-hand side is a key quantity for the regularization, so we define it\nas the sensitivity:\n\nDefinition 1 (Sensitivity) The sensitivity of the network output with respect to the $(n, i)$-th network\nparameter is\n\n$$S(y, w_{n,i}) = \\sum_{k=1}^{C} \\alpha_k \\left|\\frac{\\partial y_k}{\\partial w_{n,i}}\\right|, \\tag{6}$$\n\nwhere the coefficients $\\alpha_k$ are positive and constant.\n\nThe choice of coefficients $\\alpha_k$ will depend on the application at hand. In Subsec. 3.5 we propose two\nchoices of coefficients that will be used in the experiments.\nIf the sensitivity with respect to a given parameter is small, then a small change of that parameter\ntowards zero causes a very small change in the network output. After such a change, and if the\nsensitivity computed at the new value is still small, then the parameter may be moved towards zero\nagain. Such an operation can be paired naturally with a procedure like gradient descent, as we propose\nbelow. Towards this end, we introduce the insensitivity function $\\bar{S}$\n\n$$\\bar{S}(y, w_{n,i}) = 1 - S(y, w_{n,i}). \\tag{7}$$\n\nThe range of such a function is $(-\\infty, 1]$, and the lower its value, the more relevant the parameter. We\nobserve that having $\\bar{S} < 0 \\Leftrightarrow S > 1$ means that a weight change brings about an output change\nthat is bigger than the weight change itself (see (5)). In this case we say the output is super-sensitive\nto the weight. In our framework we are not interested in promoting sparsity for such a class\nof parameters; on the contrary, they are very relevant for generating the output. We want to focus\nour attention on all those parameters whose variation is not significantly felt by the output\n($\\sum_k \\alpha_k |\\Delta y_k| < |\\Delta w_{n,i}|$), i.e. those to which the output is sub-sensitive. Hence, we introduce a bounded\ninsensitivity\n\n$$\\bar{S}_b(y, w_{n,i}) = \\max\\left[0, \\bar{S}(y, w_{n,i})\\right] \\tag{8}$$\n\nhaving $\\bar{S}_b \\in [0, 1]$.\n\n3.3 The update rule\n\nAs already hinted at, a parameter with small sensitivity may safely be moved some way towards zero.\nThis can be done by subtracting a product of the parameter itself and its insensitivity measure (recall\nthat $\\bar{S}_b$ is between 0 and 1), appropriately scaled by some small factor $\\lambda$.\nSuch a subtraction may be carried out simultaneously with the step towards steepest descent, effectively\nmodifying SGD to incorporate the push of less \u2018important\u2019 parameters towards small values.\nThis brings us to the operation at the core of our method: the rule for updating each weight. At the\n$t$-th update iteration, the $i$-th weight in the $n$-th layer will be updated as\n\n$$w_{n,i}^{t} := w_{n,i}^{t-1} - \\eta \\frac{\\partial L}{\\partial w_{n,i}^{t-1}} - \\lambda w_{n,i}^{t-1} \\bar{S}_b\\left(y, w_{n,i}^{t-1}\\right) \\tag{9}$$\n\nwhere $L$ is a loss function, as in (1) and (2). Here we see why the bounded insensitivity is not allowed\nto become negative: this would push some weights (the super-sensitive ones) away from\nzero.\nBelow we show that each of the two correction terms dominates over the other in different phases of\nthe training. The supplementary material treats this matter in more detail.\nThe derivative in the first correction term of (9) (disregarding $\\eta$) can be factorized\nas\n\n$$\\frac{\\partial L}{\\partial w_{n,i}} = \\frac{\\partial L}{\\partial y} \\frac{\\partial y}{\\partial w_{n,i}} \\tag{10}$$\n\nwhich is a scalar product of two vectors: the derivative of the loss with respect to the output elements\nand the derivative of the output elements with respect to the parameter in question. By the H\u00f6lder\ninequality, we have that\n\n$$\\left|\\frac{\\partial L}{\\partial w_{n,i}}\\right| \\leq \\max_k \\left|\\frac{\\partial L}{\\partial y_k}\\right| \\left\\|\\frac{\\partial y}{\\partial w_{n,i}}\\right\\|_1. \\tag{11}$$\n\nFurthermore, if the loss function $L$ is the composition of the cross-entropy and the softmax function,\nthe derivative of $L$ with respect to any $y_k$ cannot exceed 1 in absolute value. The inequality in (11)\nthen simplifies to\n\n$$\\left|\\frac{\\partial L}{\\partial w_{n,i}}\\right| \\leq \\left\\|\\frac{\\partial y}{\\partial w_{n,i}}\\right\\|_1. \\tag{12}$$\n\nWe note that the $l_1$ norm on the right is proportional to the sensitivity of (6), provided that all\ncoefficients $\\alpha_k$ are equal (as in (17) in a later section). Otherwise the $l_1$ norm is equivalent to the\nsensitivity. For the following, we think of the $l_1$ norm on the right in (12) as a multiple of the\nsensitivity.\nBy (7), the insensitivity is complementary to the sensitivity. The bounded insensitivity is simply a\nrestriction of the insensitivity to non-negative values (8).\nNow we return to the two correction terms in the update rule of (9). If the first correction term is large,\nthen by (12) the sensitivity must also be large. A large sensitivity implies a small (or zero) bounded\ninsensitivity. Therefore a large first correction term implies a small or zero second correction term.\nThis typically happens in early phases of training, when the loss can be greatly reduced by changing\na weight, i.e. when $\\partial L / \\partial w_{n,i}$ is large.\n\nConversely, if the loss function is near a minimum, then the first correction term is very small. 
In\nthis situation, the above equations do not imply anything about the magnitude of the sensitivity. The\nbounded insensitivity may be near 1 for some weights, thus the second correction term will dominate.\nThese weights will be moved towards zero in proportion to $\\lambda$. Sec. 4 shows that this indeed happens\nfor a large number of weights.\nThe parameter cannot be moved all the way to zero in one update, as the insensitivity may change\nwhen $w_{n,i}$ changes; it must be recomputed at each new updated value of the parameter. The factor $\\lambda$\nshould therefore be (much) smaller than 1.\n\n3.4 Cost function formulation\n\nThe update rule of (9) does provide the \u201cadditional information\u201d typical of regularization methods.\nIndeed, this method amounts to the addition of a regularization function $R$ to an original loss\nfunction, as in (2). Since (9) specifies how a parameter is updated through the derivative of $R$, an\nintegration of the update term will \u2018restore\u2019 the regularization term. The result is readily interpreted\nfor ReLU-activated networks [3].\nTowards this end, we define the overall regularization term as a sum over all parameters\n\n$$R(\\theta) = \\sum_{n} \\sum_{i} R_{n,i}(w_{n,i}) \\tag{13}$$\n\nand integrate each term over $w_{n,i}$\n\n$$R_{n,i}(w_{n,i}) = \\int w_{n,i} \\bar{S}_b(y, w_{n,i}) \\, dw_{n,i}. \\tag{14}$$\n\nIf we solve (14) we find\n\n$$R_{n,i}(w_{n,i}) = H\\left[\\bar{S}(y, w_{n,i})\\right] \\frac{w_{n,i}^2}{2} \\cdot \\left[1 - \\sum_{k=1}^{C} \\alpha_k \\, \\mathrm{sign}\\left(\\frac{\\partial y_k}{\\partial w_{n,i}}\\right) \\sum_{m=1}^{\\infty} (-1)^{m+1} \\frac{w_{n,i}^{m-1}}{(m+1)!} \\frac{\\partial^m y_k}{\\partial w_{n,i}^m}\\right] \\tag{15}$$\n\nwhere $H(\\cdot)$ is the Heaviside (one-step) function. (15) holds for any feedforward neural network\nhaving any activation function.\nNow suppose that all activation functions are rectified linear units. Their derivative is the step function;\nthe higher order derivatives are therefore zero. This results in dropping all the $m > 1$ terms in (15).\nThus, the regularization term for ReLU-activated networks reduces to\n\n$$R_{n,i}(w_{n,i}) = \\frac{w_{n,i}^2}{2} \\bar{S}(y, w_{n,i}). \\tag{16}$$\n\nThe first factor in this expression is the square of the weight, showing the relation to Tikhonov\nregularization. The other factor is a selection and damping mechanism. Only the sub-sensitive\nweights are influenced by the regularization, in proportion to their insensitivity.\n\n3.5 Types of sensitivity\n\nIn general, (6) allows for different kinds of sensitivity, depending on the values assumed by the $\\alpha_k$. This\nfreedom permits some adaptation to the learning problem in question.\nIf all the $C$ outputs assume the same \u201crelevance\u201d (all $\\alpha_k = \\frac{1}{C}$) we say we have an unspecific\nformulation\n\n$$S^{\\mathrm{unspec}}(y, w_{n,i}) = \\frac{1}{C} \\sum_{k=1}^{C} \\left|\\frac{\\partial y_k}{\\partial w_{n,i}}\\right|. \\tag{17}$$\n\nThis formulation does not require any information about the training examples.\nAnother possibility, applicable to classification problems, does take into account some extra\ninformation. In this formulation we let only one term count, namely the one that corresponds to the desired\noutput class for the given input $x_0$. The factors $\\alpha_k$ are therefore taken as the elements in the one-hot\nencoding for the desired output $y^*$. In this case we speak of specific sensitivity:\n\n$$S^{\\mathrm{spec}}(y, y^*, w_{n,i}) = \\sum_{k=1}^{C} y_k^* \\left|\\frac{\\partial y_k}{\\partial w_{n,i}}\\right|. \\tag{18}$$\n\nThe experiments in Sec. 4 regard classification problems, and we apply both of the above types of\nsensitivity.\n\n3.6 Training procedure\n\nOur technique ideally aims to set a great number of parameters to zero. However, according to our\nupdate rule (9), less sensitive parameters approach zero but seldom reach it exactly. For this reason,\nwe introduce a threshold $T$. 
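As a minimal sketch of how these pieces fit together, assume that the partial derivatives of the outputs with respect to one weight are already available as a plain list (in practice they are a by-product of back-propagation). The code computes the unspecific and specific sensitivities of Sec. 3.5, applies the update rule (9) with a vanishing loss gradient, and shows that the weight shrinks geometrically without reaching zero until a threshold is applied. All function names and numeric values are illustrative assumptions and are not taken from the original implementation; in particular, the regularization factor is chosen much larger than in the experiments so that the decay is visible in few iterations.

```python
def unspecific_sensitivity(dy_dw):
    # Eq. (17): mean absolute derivative over the C outputs (alpha_k = 1/C).
    return sum(abs(d) for d in dy_dw) / len(dy_dw)


def specific_sensitivity(dy_dw, target_one_hot):
    # Eq. (18): only the output of the desired class counts (alpha_k = y*_k).
    return sum(t * abs(d) for d, t in zip(dy_dw, target_one_hot))


def update_weight(w, grad_loss, sensitivity, eta, lam):
    # Eq. (9), using the bounded insensitivity max(0, 1 - S) of Eq. (8).
    bounded_insensitivity = max(0.0, 1.0 - sensitivity)
    return w - eta * grad_loss - lam * w * bounded_insensitivity


def threshold(w, t):
    # Pruning step: a weight whose magnitude is below T is set to zero.
    return 0.0 if abs(w) < t else w


# Hypothetical derivatives of C = 4 outputs with respect to one weight.
dy_dw = [0.02, -0.01, 0.0, 0.05]
s_unspec = unspecific_sensitivity(dy_dw)
s_spec = specific_sensitivity(dy_dw, [0, 0, 0, 1])

# With a vanishing loss gradient the weight only decays towards zero.
w = 0.01
for _ in range(1000):
    w = update_weight(w, 0.0, s_unspec, eta=0.1, lam=1e-2)
w_pruned = threshold(w, t=1e-3)
```

After the loop the weight is tiny but still strictly positive; only the thresholding step turns it into an exact zero.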
If\n\n$$|w_{n,i}| < T$$\n\nthe method prunes the parameter $w_{n,i}$, i.e. sets it to zero. Accordingly, the threshold in the very early stages must be kept to very\nlow values (or must be introduced afterwards). Our training procedure is divided into two different\nsteps:\n\n1. Reaching the performance: in this phase we train the network in order to get to the target\nperformance. Here, any training procedure may be adopted: this makes our method suitable\nalso for pre-trained models and, unlike other state-of-the-art techniques, it can be applied at\nany moment of training.\n\n2. Sparsify: thresholding is introduced and applied to the network. The learning process\nstill advances, but at the end of every training epoch all the weights of the network are\nthresholded. The procedure is stopped when the network performance drops below a given\ntarget performance.\n\nTable 1: LeNet300 network trained over the MNIST dataset\n\nMethod | Remaining FC1 | Remaining FC2 | Remaining FC3 | Remaining total | Memory footprint | Compression ratio ($\\lvert\\theta\\rvert / \\lvert\\theta_{\\neq 0}\\rvert$) | Top-1 error\nHan et al. [9] | 8% | 9% | 26% | 21.76k | 87.04kB | 12.2x | 1.6%\nProposed ($S^{\\mathrm{unspec}}$) | 2.25% | 11.93% | 69.3% | 9.55k | 34.2kB | 27.87x | 1.65%\nProposed ($S^{\\mathrm{spec}}$) | 4.78% | 24.75% | 73.8% | 19.39k | 77.56kB | 13.73x | 1.56%\n\nLouizos et al. [13] | 9.95% | 9.68% | 33% | 26.64k | 106.57kB | 12.2x | 1.8%\nSWS [10] | N/A | N/A | N/A | 11.19k | 44.76kB | 23x | 1.94%\nSparse VD [16] | 1.1% | 2.7% | 38% | 3.71k | 14.84kB | 68x | 1.92%\nDNS [24] | 1.8% | 1.8% | 5.5% | 4.72k | 18.88kB | 56x | 1.99%\nProposed ($S^{\\mathrm{unspec}}$) | 0.93% | 1.12% | 5.9% | 2.53k | 10.12kB | 103x | 1.95%\nProposed ($S^{\\mathrm{spec}}$) | 1.12% | 1.88% | 13.4% | 3.26k | 13.06kB | 80x | 1.96%\n\n4 Results\n\nIn this section we experiment with our regularization method in different supervised image\nclassification tasks. Namely, we experiment training a number of well-known neural network architectures\nover a number of different image datasets. 
For each trained network we measure the sparsity\nwith layer granularity and the corresponding memory footprint assuming single precision float\nrepresentation of each parameter. Our method is implemented in the Julia language and experiments are\nperformed using the Knet package [22].\n\n4.1 LeNet300 and LeNet5 on MNIST\n\nTo start with, we experiment training the fully connected LeNet300 and the convolutional LeNet5\nover the standard MNIST dataset [23] (60k training images and 10k test images). We use SGD with a\nlearning rate $\\eta = 0.1$, a regularization factor $\\lambda = 10^{-5}$ and a thresholding value $T = 10^{-3}$\nunless otherwise specified. No other sparsity-promoting method (dropout, batch normalization) is\nused.\nTable 1 reports the results of the experiments over the LeNet300 network in two successive moments\nduring the training procedure (see footnote 2). The top half of the table refers to the network trained up to the point\nwhere the error decreases to 1.6%, the best error reported in [9]. Our method achieves twice the\nsparsity of [9] (27.87x vs. 12.2x compression ratio) for comparable error. The bottom half of the table\nrefers to the network further trained up to the point where the error settles around 1.95%, the mean\nbest error reported in [10, 16, 24]. Also in this case, our method shows almost doubled sparsity over\nthe nearest competitor for similar error (103x vs. 68x compression ratio of [16]).\nTable 2 shows the corresponding results for LeNet5 trained until the Top-1 error reaches about 0.77%\n(best error reported by [9]).\nIn this case, when compared to the work of Han et al., our method achieves far better sparsity (51.07x\nvs. 11.87x compression ratio) for a comparable error. 
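The memory footprint and compression ratio reported in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below assumes 4 bytes per remaining parameter (single precision), decimal kilobytes, and a dense LeNet300 weight count of 784*300 + 300*100 + 100*10 = 266,200 with biases excluded. The last two points are assumptions, chosen because they reproduce the reported values; the function names are illustrative.

```python
BYTES_PER_PARAM = 4  # single-precision float


def memory_footprint_kb(remaining_params):
    # Storage needed for the non-zero parameters, in decimal kilobytes.
    return remaining_params * BYTES_PER_PARAM / 1000.0


def compression_ratio(total_params, remaining_params):
    # Ratio of original to remaining parameter count (the higher, the better).
    return total_params / remaining_params


# Dense LeNet300 weights (assumption: biases are not counted).
lenet300_weights = 784 * 300 + 300 * 100 + 100 * 10

# Han et al. row of Table 1: 21.76k remaining parameters.
han_kb = memory_footprint_kb(21760)
han_ratio = compression_ratio(lenet300_weights, 21760)

# Proposed method, unspecific sensitivity, top half: 9.55k remaining parameters.
ours_ratio = compression_ratio(lenet300_weights, 9550)
```

Under these assumptions the sketch gives a footprint of 87.04 kB and ratios of about 12.2 and 27.87, matching the corresponding entries of Table 1.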
We observe how in all the previous experiments\nthe largest gains stem from the first fully connected layer, where most of the network parameters lie.\nHowever, if we compare our results to other state-of-the-art sparsifiers, we see that our technique does\nnot achieve the highest compression rates. Most prominently, Sparse VD obtains higher compression\nat better performance, as is the case for the convolutional layers.\n\nFootnote 2: $\\lvert\\theta\\rvert / \\lvert\\theta_{\\neq 0}\\rvert$ is the compression ratio, i.e. the ratio between the number of parameters in the original network\n(cardinality of $\\theta$) and the number of remaining parameters after sparsification (the higher, the better).\n\nTable 2: LeNet5 network trained over the MNIST dataset\n\nMethod | Remaining Conv1 | Remaining Conv2 | Remaining FC1 | Remaining FC2 | Remaining total | Memory footprint | Compression ratio ($\\lvert\\theta\\rvert / \\lvert\\theta_{\\neq 0}\\rvert$) | Top-1 error\nHan et al. [9] | 66% | 12% | 8% | 19% | 36.28k | 145.12kB | 11.9x | 0.77%\nProp. ($S^{\\mathrm{unspec}}$) | 67.6% | 11.8% | 0.9% | 31.0% | 8.43k | 33.72kB | 51.1x | 0.78%\nProp. ($S^{\\mathrm{spec}}$) | 72.6% | 12.0% | 1.7% | 37.4% | 10.28k | 41.12kB | 41.9x | 0.8%\nLouizos et al. [13] | 45% | 36% | 0.4% | 5% | 6.15k | 24.6kB | 70x | 1.0%\nSWS [10] | N/A | N/A | N/A | N/A | 2.15k | 8.6kB | 200x | 0.97%\nSparse VD [16] | 33% | 2% | 0.2% | 5% | 1.54k | 6.16kB | 280x | 0.75%\nDNS [24] | 14% | 3% | 0.7% | 4% | 3.88k | 15.52kB | 111x | 0.91%\n\nFigure 2: Loss on test set across epochs for LeNet300 trained on MNIST with different regularizers\n(without thresholding): our method enables improved generalization over $l_2$-regularization.\n\nLast, we investigate how our sensitivity-based regularization term affects the network generalization\nability, which is the ultimate goal of regularization. As we focus on the effects of the regularization\nterm, no thresholding or pruning is applied and we consider the unspecific sensitivity formulation in\n(17). 
We experiment over four formulations of the regularization term $R(\\theta)$: no regularizer ($\\lambda = 0$),\nweight decay (Tikhonov, $l_2$ regularization), $l_1$ regularization, and our sensitivity-based regularizer.\nFig. 2 shows the value of the loss function $L$ (cross-entropy) during training. Without regularization,\nthe loss increases after some epochs, indicating sharp overfitting. With $l_1$-regularization, some\noverfitting cannot be avoided, whereas $l_2$-regularization prevents overfitting. However, our\nsensitivity-based regularizer is even more effective than $l_2$-regularization, achieving lower error. As seen from\n(16), our regularization factor can be interpreted as an improved $l_2$ term with an additional factor\npromoting sparsity proportionally to each parameter\u2019s insensitivity.\n\n[Figure 2 plot: test loss (0.05 to 0.11) versus epoch (0 to 1000) for SGD, SGD+L2, SGD+Sensitivity and SGD+L1]\n\n4.2 VGG-16 on ImageNet\n\nFinally, we experiment on the far more complex VGG-16 [1] network over the larger ImageNet [25]\ndataset. VGG-16 is a deep network with 13 convolutional and 3 fully connected layers and more than\n100M parameters, while ImageNet consists of 224x224 24-bit colour images of 1000 different types\nof objects. In this case, we skip the initial training step as we used the open-source Keras pretrained\nmodel [1]. For the sparsity step we have used SGD with $\\eta = 10^{-3}$ and $\\lambda = 10^{-5}$ for the specific\nsensitivity, $\\lambda = 10^{-6}$ for the unspecific sensitivity.\nAs the previous experiments revealed that our method enables improved sparsification for comparable error,\nhere we train the network up to the point where the Top-1 error is minimized. In this case our method\nenables a 1.08% reduction in Top-5 error (9.80% vs 10.88%) for comparable sparsification, supporting the\nfinding that our method improves a network\u2019s ability to generalize as shown in Fig. 
2.\n\nTable 3: VGG16 network trained on the ImageNet dataset\n\nMethod | Remaining Conv | Remaining FC | Remaining total | Memory footprint | Compression ratio ($\\lvert\\theta\\rvert / \\lvert\\theta_{\\neq 0}\\rvert$) | Top-1 error | Top-5 error\nHan et al. [9] | 32.77% | 4.61% | 10.35M | 41.4 MB | 13.33x | 31.34% | 10.88%\nProp. ($S^{\\mathrm{unspec}}$) | 64.73% | 2.9% | 11.34M | 45.36 MB | 12.17x | 29.29% | 9.80%\nProp. ($S^{\\mathrm{spec}}$) | 56.49% | 2.56% | 9.77M | 39.08 MB | 14.12x | 30.92% | 10.06%\n\n5 Conclusions\n\nIn this work we have proposed a sensitivity-based regularize-and-prune method for the supervised\nlearning of sparse network topologies. Namely, we have introduced a regularization term that\nselectively drives towards zero parameters that are less sensitive, i.e. have little influence on the\nnetwork output, and thus can be pruned without affecting the network performance. The regularization\nderivation is completely general and applicable to any optimization problem; it is also efficient,\nintroducing minimal computational overhead as it makes use of the Jacobian matrices\ncomputed during backpropagation.\nOur proposed method enables more effective sparsification than other regularization-based methods\nfor both the specific and the unspecific formulation of the sensitivity in fully-connected architectures.\nIt was empirically observed that for the experiments on MNIST $S^{\\mathrm{unspec}}$ reaches higher sparsity than\n$S^{\\mathrm{spec}}$, while on ImageNet and on a deeper neural network (VGG16) $S^{\\mathrm{spec}}$ is able to reach the highest\nsparsity.\nMoreover, our regularization seems to have a beneficial impact on the generalization of the network.\nHowever, in convolutional architectures the proposed technique is surpassed by one sparsifying\ntechnique. 
This might be explained by the fact that our sensitivity term does not take into account\nshared parameters.\nFuture work involves an investigation into the observed improvement of generalization, a study of the\ntrade-offs between specific and unspecific sensitivity, and the extension of the sensitivity term to the\ncase of shared parameters.\n\nAcknowledgments\n\nThe authors would like to thank the anonymous reviewers for their valuable comments and suggestions.\nThis work was done at the Joint Open Lab Cognitive Computing and was supported by a fellowship\nfrom TIM.\n\nReferences\n\n[1] Karen Simonyan and Andrew Zisserman, \u201cVery deep convolutional networks for large-scale\nimage recognition,\u201d arXiv preprint arXiv:1409.1556, 2014.\n\n[2] Jie Hu, Li Shen, and Gang Sun, \u201cSqueeze-and-excitation networks,\u201d in Conference on Computer\nVision and Pattern Recognition, CVPR, 2018.\n\n[3] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, \u201cDeep sparse rectifier neural networks,\u201d\nin Proceedings of the 14th International Conference on Artificial Intelligence and Statistics\n(AISTATS), 2011, pp. 315\u2013323.\n\n[4] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz, \u201cSGD learns\nover-parameterized networks that provably generalize on linearly separable data,\u201d arXiv preprint\narXiv:1710.10174, 2017.\n\n[5] Hrushikesh N Mhaskar and Tomaso Poggio, \u201cDeep vs. shallow networks: An approximation\ntheory perspective,\u201d Analysis and Applications, vol. 14, no. 06, pp. 829\u2013848, 2016.\n\n[6] Charles W. Groetsch, Inverse Problems in the Mathematical Sciences, Vieweg, 1993.\n\n[7] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin, \u201cLearning theory: stability\nis sufficient for generalization and necessary and sufficient for consistency of empirical risk\nminimization,\u201d Advances in Computational Mathematics, vol. 25, pp. 
161\u2013193, 2006.\n\n[8] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, \u201cLearning structured sparsity\nin deep neural networks,\u201d in Advances in Neural Information Processing Systems, 2016, pp.\n2074\u20132082.\n\n[9] Song Han, Jeff Pool, John Tran, and William Dally, \u201cLearning both weights and connections\nfor efficient neural network,\u201d in Advances in Neural Information Processing Systems, 2015, pp.\n1135\u20131143.\n\n[10] Karen Ullrich, Edward Meeds, and Max Welling, \u201cSoft weight-sharing for neural network\ncompression,\u201d arXiv preprint arXiv:1702.04008, 2017.\n\n[11] Michael Zhu and Suyog Gupta, \u201cTo prune, or not to prune: exploring the efficacy of pruning\nfor model compression,\u201d arXiv preprint arXiv:1710.01878, 2017.\n\n[12] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky, \u201cSparse\nconvolutional neural networks,\u201d in Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, 2015, pp. 806\u2013814.\n\n[13] Christos Louizos, Max Welling, and Diederik P Kingma, \u201cLearning sparse neural networks\nthrough L_0 regularization,\u201d arXiv preprint arXiv:1712.01312, 2017.\n\n[14] Y-Lan Boureau, Yann LeCun, et al., \u201cSparse feature learning for deep belief networks,\u201d in\nAdvances in Neural Information Processing Systems, 2008, pp. 1185\u20131192.\n\n[15] Eugenio Culurciello, Ralph Etienne-Cummings, and Kwabena A Boahen, \u201cA biomorphic digital\nimage sensor,\u201d IEEE Journal of Solid-State Circuits, vol. 38, no. 2, pp. 
281\u2013294, 2003.\n\n[16] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov, \u201cVariational dropout sparsifies deep\nneural networks,\u201d arXiv preprint arXiv:1701.05369, 2017.\n\n[17] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Husz\u00e1r, \u201cFaster gaze prediction\nwith dense networks and Fisher pruning,\u201d arXiv preprint arXiv:1801.05787, 2018.\n\n[18] Song Han, Huizi Mao, and William J Dally, \u201cDeep compression: Compressing deep neural\nnetworks with pruning, trained quantization and Huffman coding,\u201d arXiv preprint arXiv:1510.00149,\n2015.\n\n[19] Andries P Engelbrecht and Ian Cloete, \u201cA sensitivity analysis algorithm for pruning feedforward\nneural networks,\u201d in IEEE International Conference on Neural Networks. IEEE, 1996,\nvol. 2, pp. 1274\u20131278.\n\n[20] Iveta Mr\u00e1zov\u00e1 and Zuzana Reitermanov\u00e1, \u201cA new sensitivity-based pruning technique for\nfeed-forward neural networks that improves generalization,\u201d in The 2011 International Joint\nConference on Neural Networks (IJCNN). IEEE, 2011, pp. 2143\u20132150.\n\n[21] Iveta Mr\u00e1zov\u00e1 and Marek Kukacka, \u201cCan deep neural networks discover meaningful pattern\nfeatures?,\u201d Procedia Computer Science, vol. 12, pp. 194\u2013199, 2012.\n\n[22] Deniz Yuret, \u201cKnet: beginning deep learning with 100 lines of Julia,\u201d in Machine Learning\nSystems Workshop at NIPS 2016, 2016.\n\n[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, \u201cGradient-based learning applied to document\nrecognition,\u201d Proceedings of the IEEE, vol. 86, no. 11, pp. 2278\u20132324, Nov. 1998.\n\n[24] Yiwen Guo, Anbang Yao, and Yurong Chen, \u201cDynamic network surgery for efficient DNNs,\u201d in\nAdvances in Neural Information Processing Systems, 2016, pp. 
1379\u20131387.\n\n[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei,\n\u201cImageNet large scale visual recognition challenge,\u201d International Journal of Computer Vision,\nvol. 115, no. 3, pp. 211\u2013252, Dec. 2015.\n", "award": [], "sourceid": 1911, "authors": [{"given_name": "Enzo", "family_name": "Tartaglione", "institution": "Politecnico di Torino"}, {"given_name": "Skjalg", "family_name": "Leps\u00f8y", "institution": "Telecom Italia"}, {"given_name": "Attilio", "family_name": "Fiandrotti", "institution": "POLITO"}, {"given_name": "Gianluca", "family_name": "Francini", "institution": "Telecom Italia"}]}