{"title": "Global Sparse Momentum SGD for Pruning Very Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6382, "page_last": 6394, "abstract": "Deep Neural Network (DNN) is powerful but computationally expensive and memory intensive, thus impeding its practical usage on resource-constrained front-end devices. DNN pruning is an approach for deep model compression, which aims at eliminating some parameters with tolerable performance degradation. In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning. Concretely, given a global compression ratio, we categorize all the parameters into two parts at each training iteration which are updated using different rules. In this way, we gradually zero out the redundant parameters, as we update them using only the ordinary weight decay but no gradients derived from the objective function. As a departure from prior methods that require heavy human works to tune the layer-wise sparsity ratios, prune by solving complicated non-differentiable problems or finetune the model after pruning, our method is characterized by 1) global compression that automatically finds the appropriate per-layer sparsity ratios; 2) end-to-end training; 3) no need for a time-consuming re-training process after pruning; and  4) superior capability to find better winning tickets which have won the initialization lottery.", "full_text": "Global Sparse Momentum SGD for Pruning Very\n\nDeep Neural Networks\n\nXiaohan Ding 1 Guiguang Ding 1 Xiangxin Zhou 2\n\nYuchen Guo 1, 3\n\nJungong Han 4\n\nJi Liu 5\n\n1 Beijing National Research Center for Information Science and Technology (BNRist);\n\nSchool of Software, Tsinghua University, Beijing, China\n\n2 Department of Electronic Engineering, Tsinghua University, Beijing, China\n\n3 Department of Automation, Tsinghua University;\n\nInstitute for Brain and Cognitive Sciences, Tsinghua University, 
Beijing, China\n\n4 WMG Data Science, University of Warwick, Coventry, United Kingdom\n\n5 Kwai Seattle AI Lab, Kwai FeDA Lab, Kwai AI platform\ndinggg@tsinghua.edu.cn\ndxh17@mails.tsinghua.edu.cn\n\nxx-zhou16@mails.tsinghua.edu.cn\n\nyuchen.w.guo@gmail.com\n\njungonghan77@gmail.com\n\nji.liu.uwisc@gmail.com\n\nAbstract\n\nDeep Neural Network (DNN) is powerful but computationally expensive and\nmemory intensive, thus impeding its practical usage on resource-constrained front-\nend devices. DNN pruning is an approach for deep model compression, which\naims at eliminating some parameters with tolerable performance degradation. In\nthis paper, we propose a novel momentum-SGD-based optimization method to\nreduce the network complexity by on-the-\ufb02y pruning. Concretely, given a global\ncompression ratio, we categorize all the parameters into two parts at each training\niteration which are updated using different rules. In this way, we gradually zero\nout the redundant parameters, as we update them using only the ordinary weight\ndecay but no gradients derived from the objective function. As a departure from\nprior methods that require heavy human works to tune the layer-wise sparsity\nratios, prune by solving complicated non-differentiable problems or \ufb01netune the\nmodel after pruning, our method is characterized by 1) global compression that\nautomatically \ufb01nds the appropriate per-layer sparsity ratios; 2) end-to-end training;\n3) no need for a time-consuming re-training process after pruning; and 4) superior\ncapability to \ufb01nd better winning tickets which have won the initialization lottery.\n\n1\n\nIntroduction\n\nThe recent years have witnessed great success of Deep Neural Network (DNN) in many real-world\napplications. However, today\u2019s very deep models have been accompanied by millions of parameters,\nthus making them dif\ufb01cult to be deployed on computationally limited devices. 
In this context,\nDNN pruning approaches have attracted much attention, where we eliminate some connections (i.e.,\nindividual parameters) [21, 22, 31], or channels [32], thus the required storage space and computations\ncan be reduced. This paper is focused on connection pruning, but the proposed method can be easily\ngeneralized to structured pruning (e.g., neuron-, kernel- or \ufb01lter-level). In order to reach a good\ntrade-off between accuracy and model size, many pruning methods have been proposed, which can\nbe categorized into two typical paradigms. 1) Some researchers [13, 18, 21, 22, 26, 31, 32, 39, 41]\npropose to prune the model by some means to reach a certain level of compression ratio, then \ufb01netune\nit using ordinary SGD to restore the accuracy. 2) The other methods seek to produce sparsity in the\nmodel through a customized learning procedure [1, 12, 33, 34, 51, 54, 56].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThough the existing methods have achieved great success in pruning, there are some typical drawbacks.\nSpeci\ufb01cally, when we seek to prune a model in advance and \ufb01netune it, we confront two problems:\n\u2022 The layer-wise sparsity ratios are inherently tricky to set as hyper-parameters. Many\nprevious works [21, 24, 26, 32] have shown that some layers in a DNN are sensitive to\npruning, but some can be pruned signi\ufb01cantly without degrading the model in accuracy. As\na consequence, it requires prior knowledge to tune the layer-wise hyper-parameters in order\nto maximize the global compression ratio without unacceptable accuracy drop.\n\u2022 The pruned models are dif\ufb01cult to train, and we cannot predict the \ufb01nal accuracy after\n\ufb01netuning. E.g., the \ufb01lter-level-pruned models can be easily trapped into a bad local minima,\nand sometimes cannot even reach a similar level of accuracy with a counterpart trained from\nscratch [10, 38]. 
And in the context of connection pruning, the sparser the network, the\nslower the learning and the lower the eventual test accuracy [15].\n\nOn the other hand, pruning by learning is not easier due to:\n\n\u2022 In some cases we introduce a hyper-parameter to control the trade-off, which does not\ndirectly re\ufb02ect the resulting compression ratio. For instance, MorphNet [17] uses group\nLasso [44] to zero out some \ufb01lters for structured pruning, where a key hyper-parameter is\nthe Lasso coef\ufb01cient. However, given a speci\ufb01c value of the coef\ufb01cient, we cannot predict\nthe \ufb01nal compression ratio before the training ends. Therefore, when we target at a speci\ufb01c\neventual compression ratio, we have to try multiple coef\ufb01cient values in advance and choose\nthe one that yields the result closest to our expectation.\n\u2022 Some methods prune by solving an optimization problem which directly concerns the\nsparsity. As the problem is non-differentiable, it cannot be solved using SGD-based methods\nin an end-to-end manner. A more detailed discussion will be provided in Sect. 3.2.\n\nIn this paper, we seek to overcome the drawbacks discussed above by directly altering the gradient\n\ufb02ow based on momentum SGD, which explicitly concerns the eventual compression ratio and can\nbe implemented via end-to-end training. Concretely, we use \ufb01rst-order Taylor series to measure the\nimportance of a parameter by estimating how much the objective function value will be changed\nby removing it [41, 49]. Based on that, given a global compression ratio, we categorize all the\nparameters into two parts that will be updated using different rules, which is referred to as activation\nselection. For the unimportant parameters, we perform passive update with no gradients derived\nfrom the objective function but only the ordinary weight decay (i.e., (cid:96)-2 regularization) to penalize\ntheir values. 
On the other hand, via active update, the critical parameters are updated using both the\nobjective-function-related gradients and the weight decay to maintain the model accuracy. Such a\nselection is conducted at each training iteration, so that a deactivated connection gets a chance to\nbe reactivated at the next iteration. Through continuous momentum-accelerated passive updates we\ncan make most of the parameters in\ufb01nitely close to zero, such that pruning them causes no damage\nto the model\u2019s accuracy. Owing to this, there is no need for a \ufb01netuning process. In contrast, some\npreviously proposed regularization terms can only reduce the parameters to some extent, thus pruning\nstill degrades the model. Our contributions are summarized as follows.\n\n\u2022 For lossless pruning and end-to-end training, we propose to directly alter the gradient \ufb02ow,\nwhich is clearly distinguished with existing methods that either add a regularization term or\nseek to solve some non-differentiable optimization problems.\n\u2022 We propose Global Sparse Momentum SGD (GSM), a novel SGD optimization method,\nwhich splits the update rule of momentum SGD into two parts. GSM-based DNN pruning\nrequires a sole global eventual compression ratio as hyper-parameter and can automatically\ndiscover the appropriate per-layer sparsity ratios to achieve it.\n\u2022 Seen from the experiments, we have validated the capability of GSM to achieve high com-\npression ratios on MNIST, CIFAR-10 [29] and ImageNet [9] as well as \ufb01nd better winning\ntickets [15]. The codes are available at https://github.com/DingXiaoH/GSM-SGD.\n\n2 Related work\n\n2.1 Momentum SGD\n\nStochastic gradient descent only takes the \ufb01rst order derivatives of the objective function into account\nand not the higher ones [28]. 
Momentum is a popular technique used along with SGD, which accumulates the gradients of the past steps to determine the direction to go, instead of using only the gradient of the current step. I.e., momentum gives SGD a short-term memory [16]. Formally, let L be the objective function, w be a single parameter, \u03b1 be the learning rate, \u03b2 be the momentum coefficient which controls the percentage of the gradient retained every iteration, \u03b7 be the ordinary weight decay coefficient (e.g., 1 \u00d7 10\u22124 for ResNets [23]), the update rule is\n\nz(k+1) \u2190 \u03b2z(k) + \u03b7w(k) + \u2202L/\u2202w(k) ,\nw(k+1) \u2190 w(k) \u2212 \u03b1z(k+1) .\n\n(1)\n\nThere is a popular story about momentum [16, 42, 45, 48]: gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima. In this paper, we use momentum as an accelerator to boost the passive updates.\n\n2.2 DNN pruning and other techniques for compression and acceleration\n\nDNN pruning seeks to remove some parameters without significant accuracy drop, which can be categorized into unstructured and structured techniques based on the pruning granularity. Unstructured pruning (a.k.a. connection pruning) [7, 21, 22, 31] targets at significantly reducing the number of non-zero parameters, resulting in a sparse model, which can be stored using much less space, but cannot effectively reduce the computational burdens on off-the-shelf hardware and software platforms. On the other hand, structured pruning removes structures (e.g., neurons, kernels or whole filters) from DNN to obtain practical speedup. 
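For reference, the momentum update rule of Sect. 2.1 (Formula 1) amounts to a few lines of plain Python; the function name and the toy quadratic objective below are illustrative, not from the paper:

```python
def momentum_sgd_step(w, z, grad, alpha=5e-3, beta=0.98, eta=5e-4):
    """One step of momentum SGD with ordinary weight decay (Formula 1):
    z <- beta*z + eta*w + dL/dw, then w <- w - alpha*z."""
    z = beta * z + eta * w + grad
    w = w - alpha * z
    return w, z

# toy quadratic objective L(w) = w^2 / 2, so dL/dw = w
w, z = 1.0, 0.0
for _ in range(2000):
    w, z = momentum_sgd_step(w, z, grad=w)
# w is driven close to the minimum at 0
```

The momentum buffer z accumulates a running direction, which is what GSM later exploits to accelerate the passive (weight-decay-only) updates.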
E.g., channel pruning [10, 11, 32, 35, 37, 38] cannot achieve an\nextremely high compression ratio of the model size, but can convert a wide CNN into a narrower\n(but still dense) one to reduce the memory and computational costs. In real-world applications,\nunstructured and structured pruning are often used together to achieve the desired trade-off.\nThis paper is focused on connection pruning (but the proposed method can be easily generalized to\nstructured pruning), which has attracted much attention since Han et al. [21] pruned DNN connections\nbased on the magnitude of parameters and restored the accuracy via ordinary SGD. Some inspiring\nworks have improved the paradigm of pruning-and-\ufb01netuning by splicing connections as they become\nimportant again [18], directly targeting at the energy consumption [55], utilizing per-layer second\nderivatives [13], etc. The other learning-based pruning methods will be discussed in Sect. 3.2\nApart from pruning, we can also compress and accelerate DNN in other ways. Some works [2, 46, 57]\ndecompose or approximate parameter tensors; quantization and binarization techniques [8, 19, 20, 36]\napproximate a model using fewer bits per parameter; knowledge distillation [3, 25, 43] transfers\nknowledge from a big network to a smaller one; some researchers seek to speed up convolution with\nthe help of perforation [14], FFT [40, 50] or DCT [53]; Wang et al. [52] compact feature maps by\nextracting information via a Circulant matrix.\n\n3 GSM: Global Sparse Momentum SGD\n\n3.1 Formulation\n\nWe \ufb01rst clarify the notations in this paper. For a fully-connected layer with p-dimensional input and\nq-dimensional output, we use W \u2208 Rp\u00d7q to denote the kernel matrix. 
For a convolutional layer with kernel tensor K \u2208 Rh\u00d7w\u00d7r\u00d7s, where h and w are the height and width of the convolution kernel, r and s are the numbers of input and output channels, respectively, we unfold the tensor K into W \u2208 Rhwr\u00d7s. Let N be the number of all such layers, we use \u0398 = [Wi] (\u22001 \u2264 i \u2264 N) to denote the collection of all such kernel matrices, and the global compression ratio C is given by\n\nC = |\u0398| / ||\u0398||0 ,\n\n(2)\n\nwhere |\u0398| is the size of \u0398 and ||\u0398||0 is the \u2113-0 norm, i.e., the number of non-zero entries. Let L, X, Y be the accuracy-related loss function (e.g., cross entropy for classification tasks), test examples and labels, respectively, we seek to obtain a good trade-off between accuracy and model size by achieving a high compression ratio C without unacceptable increase in the loss L(X, Y, \u0398).\n\n3.2 Rethinking learning-based pruning\n\nThe optimization target or direction of ordinary DNN training is to minimize the objective function only, but when we seek to produce a sparse model via a customized learning procedure, the key is to deviate the original training direction by taking into account the sparsity of the parameters. Through training, the sparsity emerges progressively, and we eventually reach the expected trade-off between accuracy and model size, which is usually controlled by one or a series of hyper-parameters.\n\n3.2.1 Explicit trade-off as constrained optimization\n\nThe trade-off can be explicitly modeled as a constrained optimization problem [56], e.g.,\n\nminimize_\u0398 L(X, Y, \u0398) + \u03a3_{i=1}^{N} gi(Wi) ,\n\n(3)\n\nwhere gi is an indicator function,\n\ngi(W) = 0 if ||W||0 \u2264 li , +\u221e otherwise ,\n\n(4)\n\nand li is the required number of non-zero parameters at layer i. 
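The global compression ratio of Eq. 2 is straightforward to compute over the collection of kernel matrices; a small NumPy sketch (the function name is ours):

```python
import numpy as np

def compression_ratio(kernels):
    """Global compression ratio C = |Theta| / ||Theta||_0 (Eq. 2):
    total number of entries over the number of non-zero entries,
    summed over all kernel matrices."""
    total = sum(W.size for W in kernels)
    nonzero = sum(int(np.count_nonzero(W)) for W in kernels)
    return total / nonzero

# toy example: two kernel matrices, 3 of 12 entries non-zero
theta = [np.array([[0., 2., 0.], [0., 0., 0.]]),
         np.array([[1., 0., 0.], [0., 0., 3.]])]
print(compression_ratio(theta))  # -> 4.0
```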
Since the second term of the objective function is non-differentiable, the problem cannot be settled analytically or by stochastic gradient descent, but can be tackled by alternately applying SGD and solving the non-differentiable problem, e.g., using ADMM [6]. In this way, the training direction is deviated, and the trade-off is obtained.\n\n3.2.2 Implicit trade-off using regularizations\n\nIt is a common practice to apply some extra differentiable regularizations during training to reduce the magnitude of some parameters, such that removing them causes less damage [1, 21, 54]. Let R(\u0398) be the magnitude-related regularization term and \u03bb be a trade-off hyper-parameter, the problem is\n\nminimize_\u0398 L(X, Y, \u0398) + \u03bbR(\u0398) .\n\n(5)\n\nHowever, the weaknesses are two-fold. 1) Some common regularizations, e.g., \u2113-1, \u2113-2 and Lasso [44], cannot literally zero out the entries in \u0398, but can only reduce their magnitude to some extent, such that removing them still degrades the performance. We refer to this phenomenon as the magnitude plateau. The cause is simple: for a specific trainable parameter w, when its magnitude |w| is large at the beginning, the gradient derived from R, i.e., \u03bb\u2202R/\u2202w, overwhelms \u2202L/\u2202w, thus |w| is gradually reduced. However, as |w| shrinks, \u2202R/\u2202w diminishes too, such that the reducing tendency of |w| plateaus when \u03bb\u2202R/\u2202w approaches \u2202L/\u2202w, and w maintains a relatively small magnitude. 2) The hyper-parameter \u03bb does not directly reflect the resulting compression ratio, thus we may need to make several attempts to gain some empirical knowledge before we obtain the model with our expected compression ratio.\n\n3.3 Global sparse gradient flow via momentum SGD\n\nTo overcome the drawbacks of the two paradigms discussed above, we intend to explicitly control the eventual compression ratio via end-to-end training by directly altering the gradient flow of momentum SGD to deviate the training direction in order to achieve a high compression ratio as well as maintain the accuracy. Intuitively, we seek to use the gradients to guide the few active parameters in order to minimize the objective function, and penalize most of the parameters to push them infinitely close to zero. Therefore, the first thing is to find a proper metric to distinguish the active part. Given a global compression ratio C, we use Q = |\u0398|/C to denote the number of non-zero entries in \u0398. At each training iteration, we feed a mini-batch of data into the model, compute the gradients using the ordinary chain rule, calculate the metric values for every parameter, perform active update on Q parameters with the largest metric values and passive update on the others. In order to make GSM feasible on very deep models, the metrics should be calculated using only the original intermediate computational results, i.e., the parameters and gradients, but no second-order derivatives. Inspired by two preceding methods which utilized first-order Taylor series for greedy channel pruning [41, 49], we define the metric in a similar manner. 
Formally, at each training iteration with a mini-batch of examples x and labels y, let T(x, y, w) be the metric value of a specific parameter w, we have\n\nT(x, y, w) = |w \u2202L(x, y, \u0398)/\u2202w| .\n\n(6)\n\nThe theory is that for the current mini-batch, we expect to reduce those parameters which can be removed with less impact on L(x, y, \u0398). Using the Taylor series, if we set a specific parameter w to 0, the loss value becomes\n\nL(x, y, \u0398w\u21900) = L(x, y, \u0398) \u2212 (\u2202L(x, y, \u0398)/\u2202w)(0 \u2212 w) + o(w^2) .\n\n(7)\n\nIgnoring the higher-order term, we have\n\n|L(x, y, \u0398w\u21900) \u2212 L(x, y, \u0398)| = |w \u2202L(x, y, \u0398)/\u2202w| = T(x, y, w) ,\n\n(8)\n\nwhich is an approximation of the change in the loss value if w is zeroed out.\nWe rewrite the update rule of momentum SGD (Formula 1). At the k-th training iteration with a mini-batch of examples x and labels y on a specific layer with kernel W, the update rule is\n\nZ(k+1) \u2190 \u03b2Z(k) + \u03b7W(k) + B(k) \u25e6 \u2202L(x, y, \u0398)/\u2202W(k) ,\nW(k+1) \u2190 W(k) \u2212 \u03b1Z(k+1) ,\n\n(9)\n\nwhere \u25e6 is the element-wise multiplication (a.k.a. Hadamard product), and B(k) is the mask matrix,\n\nB(k)m,n = 1 if T(x, y, W(k)m,n) \u2265 the Q-th greatest value in T(x, y, \u0398(k)) , and 0 otherwise .\n\n(10)\n\nWe refer to the computation of B for each kernel as activation selection. Obviously, there are exactly Q ones in all the mask matrices, and GSM degrades to ordinary momentum SGD when Q = |\u0398|. Of note is that GSM is model-agnostic because it makes no assumptions on the model structure or the form of loss function. 
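A minimal NumPy sketch of one GSM iteration, combining the saliency metric (Eq. 6), the masked momentum update (Eq. 9) and activation selection (Eq. 10). Function and variable names are ours, gradients are assumed to be given, and ties at the threshold are handled loosely (a tie could activate slightly more than Q parameters):

```python
import numpy as np

def gsm_step(kernels, moments, grads, Q, alpha=5e-3, beta=0.98, eta=5e-4):
    """One GSM iteration. Saliency T = |w * dL/dw| is computed globally
    over all kernels; the Q parameters with the largest saliency receive
    active updates (gradient + weight decay), the rest receive passive
    updates (weight decay only), both accelerated by momentum."""
    # global activation selection: threshold is the Q-th greatest saliency
    saliency = np.concatenate([np.abs(W * G).ravel()
                               for W, G in zip(kernels, grads)])
    thresh = np.sort(saliency)[-Q]
    new_kernels, new_moments = [], []
    for W, Z, G in zip(kernels, moments, grads):
        B = (np.abs(W * G) >= thresh).astype(W.dtype)  # mask, Eq. 10
        Z = beta * Z + eta * W + B * G                 # Eq. 9, first line
        W = W - alpha * Z                              # Eq. 9, second line
        new_kernels.append(W)
        new_moments.append(Z)
    return new_kernels, new_moments
```

With Q equal to the total parameter count the mask is all ones and the step reduces to ordinary momentum SGD, matching the remark above.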
I.e., the calculation of gradients via back propagation is model-related, of course, but it is model-agnostic to use them for GSM pruning.\n\n3.4 GSM enables implicit reactivation and fast continuous reduction\n\nAs GSM conducts activation selection at each training iteration, it allows the penalized connections to be reactivated, if they are found to be critical to the model again. Compared to two previous works which explicitly insert a splicing [18] or restoring [55] stage into the entire pipeline to rewire the mistakenly pruned connections, GSM features simpler implementation and end-to-end training.\nHowever, as will be shown in Sect. 4.4, re-activation only happens on a minority of the parameters, but most of them undergo a series of passive updates, thus keep moving towards zero. As we would like to know how many training iterations are needed to make the parameters small enough to realize lossless pruning, we need to predict the eventual value of a parameter w after k passive updates, given \u03b1, \u03b7 and \u03b2. We can use Formula 1 to predict w(k), which is practical but cumbersome. In our common use cases where z(0) = 0 (from the very beginning of training), k is large (at least tens of thousands), and \u03b1\u03b7 is small (e.g., \u03b1 = 5 \u00d7 10\u22123, \u03b7 = 5 \u00d7 10\u22124), we have observed an empirical formula which is precise enough (Fig. 1) to approximate the resulting value,\n\nw(k)/w(0) \u2248 (1 \u2212 \u03b1\u03b7/(1 \u2212 \u03b2))^k .\n\n(11)\n\nIn practice, we fix \u03b7 (e.g., 1 \u00d7 10\u22124 for ResNets [23] and DenseNets [27]) and adjust \u03b1 just as we do for ordinary DNN training, and use \u03b2 = 0.98 or \u03b2 = 0.99 for 50\u00d7 or 100\u00d7 faster zeroing-out. When the training is completed, we prune the model globally by only preserving Q parameters in \u0398 with the largest magnitude. We decide the number of training iterations k using Eq. 
11 based on an empirical observation that with (1 \u2212 \u03b1\u03b7/(1 \u2212 \u03b2))^k < 1 \u00d7 10\u22124, such a pruning operation causes no accuracy drop on very deep models like ResNet-56 and DenseNet-40.\n\nFigure 1: Value of a parameter w after continuous passive updates with \u03b1 = 5 \u00d7 10\u22123, \u03b7 = 5 \u00d7 10\u22124, assuming w(0) = 1. First and second figures: the actual value of w obtained using Formula 1 with different momentum coefficient \u03b2. Note the logarithmic scale of the second figure. Clearly, a larger \u03b2 can accelerate the reduction of parameter value. Third and fourth figures: the value approximated by Eq. 11 and the difference \u2206 = wactual \u2212 wapprox with \u03b2 = 0.98 as the representative.\n\nMomentum is critical for GSM-based pruning to be completed with acceptable time cost. As most of the parameters continuously grow in the same direction determined by the weight decay (i.e., towards zero), such a tendency accumulates in the momentum, thus the zeroing-out process is significantly accelerated. On the other hand, if a parameter does not always vary in the same direction, raising \u03b2 affects its training dynamics less. In contrast, if we increase the learning rate \u03b1 for faster zeroing-out, the critical parameters which are hovering around the global minima will significantly deviate from their current values reached with a much lower learning rate before.\n\n4 Experiments\n\n4.1 Pruning results and comparisons\n\nWe evaluate GSM by pruning several common benchmark models on MNIST, CIFAR-10 [29] and ImageNet [9], and comparing with the reported results from several recent competitors. For each trial, we start from a well-trained base model and apply GSM training on all the layers simultaneously.\nMNIST. We first experiment on MNIST with LeNet-300-100 and LeNet-5 [30]. LeNet-300-100 is a three-layer fully-connected network with 267K parameters, which achieves 98.19% Top1 accuracy. LeNet-5 is a convolutional network which comprises two convolutional layers and two fully-connected layers, contains 431K parameters and delivers 99.21% Top1 accuracy. To achieve 60\u00d7 and 125\u00d7 compression, we set Q = 267K/60 = 4.4K for LeNet-300-100 and Q = 431K/125 = 3.4K for LeNet-5, respectively. We use momentum coefficient \u03b2 = 0.99 and a batch size of 256. The learning rate schedule is \u03b1 = 3 \u00d7 10\u22122, 3 \u00d7 10\u22123, 3 \u00d7 10\u22124 for 160, 40 and 40 epochs, respectively. After GSM training, we conduct lossless pruning and test on the validation dataset. As shown in Table 1, GSM can produce highly sparse models which still maintain the accuracy. By further raising the compression ratio on LeNet-5 to 300\u00d7, we only observe a minor accuracy drop (0.15%), which suggests that GSM can yield reasonable performance with extremely high compression ratios.\nCIFAR-10. We present the results of another set of experiments on CIFAR-10 in Table 2 using ResNet-56 [23] and DenseNet-40 [27]. We use \u03b2 = 0.98, a batch size of 64 and learning rate \u03b1 = 5 \u00d7 10\u22123, 5 \u00d7 10\u22124, 5 \u00d7 10\u22125 for 400, 100 and 100 epochs, respectively. We adopt the standard data augmentation including padding to 40 \u00d7 40, random cropping and left-right flipping. Though ResNet-56 and DenseNet-40 are significantly deeper and more complicated, GSM can also reduce the parameters by 10\u00d7 and still maintain the accuracy.\nImageNet. We prune ResNet-50 to verify GSM on large-scale image recognition applications. We use a batch size of 64 and train the model with \u03b1 = 1 \u00d7 10\u22123, 1 \u00d7 10\u22124, 1 \u00d7 10\u22125 for 40, 10 and 10 epochs, respectively. 
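Returning to the zeroing-out analysis of Sect. 3.4, Eq. 11 can be inverted to estimate how many consecutive passive updates shrink a parameter below a given fraction of its initial magnitude. A small sketch under the paper's hyper-parameter settings; the function name is ours:

```python
import math

def passive_steps_needed(target_ratio, alpha=5e-3, eta=5e-4, beta=0.98):
    """Invert Eq. 11, w(k)/w(0) ~ (1 - alpha*eta/(1 - beta))^k, to get
    the number k of consecutive passive updates that shrinks a
    parameter's magnitude to target_ratio of its initial value."""
    decay = 1.0 - alpha * eta / (1.0 - beta)
    return math.ceil(math.log(target_ratio) / math.log(decay))

# e.g., shrink to 1e-4 of the initial magnitude (the lossless-pruning
# threshold used above); a larger beta needs roughly half the steps
k_098 = passive_steps_needed(1e-4, beta=0.98)
k_099 = passive_steps_needed(1e-4, beta=0.99)
```

This mirrors the observation that β = 0.98 or β = 0.99 yields roughly 50× or 100× faster zeroing-out than plain weight decay.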
We compare the results with L-OBS [13], which is the only comparable previous method that reported experimental results on ResNet-50, to the best of our knowledge. Obviously, GSM outperforms L-OBS by a clear margin (Table 3). We assume that the effectiveness of GSM on such a very deep network is due to its capability to discover the appropriate layer-wise sparsity ratios, given a desired global compression ratio. In contrast, L-OBS performs pruning layer by layer using the same compression ratio. This assumption is further verified in Sect. 4.2.\n\nTable 1: Pruning results on MNIST.\n\nModel | Result | Base Top1 | Pruned Top1 | Origin / Remain Params | Compress Ratio | Non-zero Ratio\nLeNet-300 | Han et al. [21] | 98.36 | 98.41 | 267K / 22K | 12.1\u00d7 | 8.23%\nLeNet-300 | L-OBS [13] | 98.24 | 98.18 | 267K / 18.6K | 14.2\u00d7 | 7%\nLeNet-300 | Zhang et al. [56] | 98.4 | 98.4 | 267K / 11.6K | 23.0\u00d7 | 4.34%\nLeNet-300 | DNS [18] | 97.72 | 98.01 | 267K / 4.8K | 55.6\u00d7 | 1.79%\nLeNet-300 | GSM | 98.19 | 98.18 | 267K / 4.4K | 60.0\u00d7 | 1.66%\nLeNet-5 | Han et al. [21] | 99.20 | 99.23 | 431K / 36K | 11.9\u00d7 | 8.35%\nLeNet-5 | L-OBS [13] | 98.73 | 98.73 | 431K / 3.0K | 14.1\u00d7 | 7%\nLeNet-5 | Srinivas et al. [47] | 99.20 | 99.19 | 431K / 22K | 19.5\u00d7 | 5.10%\nLeNet-5 | Zhang et al. [56] | 99.2 | 99.2 | 431K / 6.05K | 71.2\u00d7 | 1.40%\nLeNet-5 | DNS [18] | 99.09 | 99.09 | 431K / 4.0K | 107.7\u00d7 | 0.92%\nLeNet-5 | GSM | 99.21 | 99.22 | 431K / 3.4K | 125.0\u00d7 | 0.80%\nLeNet-5 | GSM | 99.21 | 99.06 | 431K / 1.4K | 300.0\u00d7 | 0.33%\n\nTable 2: Pruning results on CIFAR-10.\n\nModel | Result | Base Top1 | Pruned Top1 | Origin / Remain Params | Compress Ratio | Non-zero Ratio\nResNet-56 | GSM | 94.05 | 94.10 | 852K / 127K | 6.6\u00d7 | 15.0%\nResNet-56 | GSM | 94.05 | 93.80 | 852K / 85K | 10.0\u00d7 | 10.0%\nDenseNet-40 | GSM | 93.86 | 94.07 | 1002K / 150K | 6.6\u00d7 | 15.0%\nDenseNet-40 | GSM | 93.86 | 94.02 | 1002K / 125K | 8.0\u00d7 | 12.5%\nDenseNet-40 | GSM | 93.86 | 93.90 | 1002K / 100K | 10.0\u00d7 | 10.0%\n\n4.2 GSM for automatic layer-wise sparsity ratio decision\n\nModern DNNs usually contain tens or even hundreds of layers. As the architectures deepen, it becomes increasingly impractical to set the layer-wise sparsity ratios manually to reach a desired global compression ratio. Therefore, the research community is soliciting techniques which can automatically discover the appropriate sparsity ratios on very deep models. 
In practice, we noticed\nthat if directly pruning a single layer of the original model by a \ufb01xed ratio results in a signi\ufb01cant\naccuracy reduction, GSM automatically chooses to prune it less, and vice versa.\nIn this subsection, we present a quantitative analysis of the sensitivity to pruning, which is an\nunderlying property of a layer de\ufb01ned via a natural proxy: the accuracy reduction caused by pruning a\ncertain ratio of parameters from it. We \ufb01rst evaluate such sensitivity via single-layer pruning attempts\nwith different pruning ratios (Fig. 2). E.g., for the curve labeled as \u201cprune 90%\u201d of LeNet-5, we \ufb01rst\nexperiment on the \ufb01rst layer by setting 90% of the parameters with smaller magnitude to zero then\ntesting on the validation set. Then we restore the \ufb01rst layer, prune the second layer and test. The\nsame procedure is applied to the third and fourth layers. After that, we use different pruning ratios\nof 99%, 99.5%, 99.7%, and obtain three curves in the same way. From such experiments we learn\nthat the \ufb01rst layer is far more sensitive than the third, as pruning 99% of the parameters from the \ufb01rst\nlayer reduces the Top1 accuracy by around 85% (i.e., to hardly above 10%), but doing so on the third\nlayer only slightly degrades the accuracy by 3%.\nThen we show the resulting layer-wise non-zero ratio of the GSM-pruned models (125\u00d7 pruned\nLeNet-5 and 6.6\u00d7 pruned DenseNet-40, as reported in Table. 1, 2) as another proxy for sensitivity,\nof which the curves are labeled as \u201cGSM discovered\u201d in Fig. 2. 
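The single-layer pruning attempts described above can be sketched as follows; `evaluate` is an assumed callback that returns validation accuracy for a given set of kernel matrices, and all names are illustrative:

```python
import numpy as np

def prune_one_layer(kernels, layer_idx, ratio):
    """Zero out the given fraction of smallest-magnitude entries in one
    layer, leaving the other layers intact (a probe, not real pruning)."""
    pruned = [W.copy() for W in kernels]
    W = pruned[layer_idx]
    k = int(ratio * W.size)
    if k > 0:
        thresh = np.sort(np.abs(W).ravel())[k - 1]
        W[np.abs(W) <= thresh] = 0.0
    return pruned

def sensitivity_curve(kernels, evaluate, ratio=0.99):
    """Accuracy drop caused by pruning `ratio` of each layer in turn:
    prune one layer, evaluate, restore, move to the next layer."""
    base = evaluate(kernels)
    return [base - evaluate(prune_one_layer(kernels, i, ratio))
            for i in range(len(kernels))]
```

A large drop for a layer marks it as sensitive; the paper's second proxy is then the per-layer non-zero ratio that GSM itself discovers.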
As the two curves vary in the same tendency across layers as the others, we find out that the sensitivities measured in the two proxies are closely related, which suggests that GSM automatically decides to prune the sensitive layers less (e.g., the 14th, 27th and 40th layer in DenseNet-40, which perform the inter-stage transitions [27]) and the insensitive layers more in order to reach the desired global compression ratio, eliminating the need for heavy human effort to tune the sparsity ratios as hyper-parameters.\n\nTable 3: Pruning results on ImageNet.\n\nModel | Result | Base Top1 / Top5 | Pruned Top1 / Top5 | Origin / Remain Params | Compress Ratio | Non-zero Ratio\nResNet-50 | L-OBS [13] | - / \u2248 92 | - / \u2248 92 | 25.5M / 16.5M | 1.5\u00d7 | 65%\nResNet-50 | L-OBS [13] | - / \u2248 92 | - / \u2248 85 | 25.5M / 11.4M | 2.2\u00d7 | 45%\nResNet-50 | GSM | 75.72 / 92.75 | 75.33 / 92.47 | 25.5M / 6.3M | 4.0\u00d7 | 25%\nResNet-50 | GSM | 75.72 / 92.75 | 74.30 / 91.98 | 25.5M / 5.1M | 5.0\u00d7 | 20%\n\nFigure 2: The layer sensitivity scores estimated by both layer-wise pruning attempts and GSM on LeNet-5 (left) and DenseNet-40 (right). Though the numeric values of the two sensitivity proxies on the same layer are not comparable, they vary in the same tendency across layers.\n\nFigure 3: (a) Original accuracy. (b) Pruned accuracy. (c) Ratio under 1 \u00d7 10\u22123. (d) Ratio under 1 \u00d7 10\u22124. The accuracy curves obtained by evaluating both the original model and the globally 8\u00d7 pruned one, and the ratio of parameters of which the magnitude is under 1 \u00d7 10\u22123 or 1 \u00d7 10\u22124, respectively, using different values of momentum coefficient \u03b2. Best viewed in color.\n\n4.3 Momentum for accelerating parameter zeroing-out\n\nWe investigate the role momentum plays in GSM by only varying the momentum coefficient \u03b2 and keeping all the other training configurations the same as the 8\u00d7 pruned DenseNet-40 in Sect. 4.1. During training, we evaluate the model both before and after pruning every 8000 iterations (i.e., 10.24 epochs). We also present in Fig. 3 the global ratio of parameters with magnitude under 1 \u00d7 10\u22123 and 1 \u00d7 10\u22124, respectively. As can be observed, a large momentum coefficient can drastically increase the ratio of small-magnitude parameters. E.g., with a target compression ratio of 8\u00d7 and \u03b2 = 0.98, GSM can make 87.5% of the parameters close to zero (under 1 \u00d7 10\u22124) in around 150 epochs, thus pruning the model causes no damage. And with \u03b2 = 0.90, 400 epochs are not enough to effectively zero the parameters out, thus pruning degrades the accuracy to around 65%. 
On the other hand, as a larger β value brings more rapid structural change in the model, the original accuracy decreases at the beginning but increases when such change becomes stable and the training converges.

4.4 GSM for implicit connection reactivation

GSM implicitly implements connection rewiring by performing activation selection at each iteration to restore the parameters which have been wrongly penalized (i.e., have gone through at least one passive update). We investigate the significance of doing so by pruning DenseNet-40 by 8× again using β = 0.98 and the same training configurations as before, but without re-selection (Fig. 4). Concretely, we use the mask matrices computed at the first iteration to guide the updates until the end of training. It is observed that if re-selection is canceled, the training loss becomes higher and the accuracy is degraded. This is because the first selection eliminates some connections which are not critical for the first iteration but may be important for the subsequent input examples. Without re-selection, GSM insists on zeroing out such parameters, leading to lower accuracy.
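A single GSM-style iteration with re-selection can be sketched as follows. This is a schematic under our own assumptions (a first-order |w·g| importance score and a global top-k active set), not the authors' implementation:

```python
import numpy as np

def gsm_step(w, grad, v, k, beta=0.98, lr=5e-3, wd=5e-4):
    """One sketched GSM iteration: re-select the active set globally,
    then apply the two-branch momentum update."""
    score = np.abs(w * grad)            # importance proxy (our assumption)
    mask = np.zeros_like(w)
    mask[np.argsort(-score)[:k]] = 1.0  # global top-k stay active this step
    # Active entries: objective gradient plus weight decay ("active update").
    # Passive entries: weight decay only, so they drift toward zero but can
    # re-enter the active set at a later re-selection.
    v = beta * v + wd * w + mask * grad
    w = w - lr * v
    return w, v, mask
```

Because the mask is recomputed from the current gradients at every call, a parameter penalized at one iteration can be reactivated later, which is the re-selection behavior studied in this section.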
By depicting the reactivation ratio (i.e., the ratio of the number of parameters which switch from passive to active to the total number of parameters) at the re-selection of each training iteration, we learn that reactivation happens on a minority of the connections, and the ratio decreases gradually, such that the training converges and the desired sparsity ratio is obtained.

Figure 4: The training process of GSM both with and without re-selection. (a) Original accuracy. (b) Pruned accuracy. (c) Training loss. (d) Reactivation ratio.

Table 4: Eventual Top1 accuracy of the winning tickets training (step 5).

Model      Compression ratio   Magnitude tickets   GSM tickets
LeNet-300  60×                 97.39               98.22
LeNet-5    125×                97.60               99.04
LeNet-5    300×                11.35               98.88

4.5 GSM for more powerful winning lottery tickets

Frankle and Carbin [15] reported that the parameters which are found to be important after training are actually important at the very beginning (after random initialization but before training); these are referred to as the winning tickets, because they have won the initialization lottery.
It is discovered that if we 1) randomly initialize a network parameterized by Θ0, 2) train and obtain Θ, 3) prune some parameters from Θ, resulting in a subnetwork parameterized by Θ′, 4) reset the remaining parameters in Θ′ to their initial values in Θ0, which are referred to as the winning tickets Θ̂, and 5) fix the other parameters to zero and train Θ̂ only, we may attain a level of accuracy comparable to that of the trained-then-pruned model Θ′. In that work, the third step is accomplished by simply preserving the parameters with the largest magnitude in Θ. We found that GSM can find a better set of winning tickets, as training the GSM-discovered tickets yields higher eventual accuracy than training those found by magnitude (Table 4). Concretely, we only replace step 3 by a pruning process via GSM on Θ and use the resulting non-zero parameters as Θ′; all the other experimental settings are kept the same for comparability. Interestingly, 100% of the parameters in the first fully-connected layer of LeNet-5 are pruned by 300× magnitude-pruning, such that the found winning tickets are not trainable at all, but GSM can still find reasonable winning tickets. More experimental details can be found in the code. Two possible explanations for the superiority of GSM are that 1) GSM distinguishes the unimportant parameters by activation selection much earlier (at each iteration) than the magnitude-based criterion (after the completion of training), and 2) GSM decides the final winning tickets in a way that is robust to mistakes (i.e., via activation re-selection).
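The five-step protocol can be summarized in code; `train` below is a toy stand-in (gradient descent on a fixed quadratic, not real network training) and `prune_by_magnitude` plays the role of the baseline step 3, which GSM replaces with its own pruning:

```python
import numpy as np

def train(w, mask=None, steps=200, lr=0.1):
    """Toy stand-in for network training: descend a fixed quadratic,
    restricting updates to the masked (surviving) parameters."""
    target = np.linspace(-1.0, 1.0, w.size)
    for _ in range(steps):
        g = w - target
        if mask is not None:
            g = g * mask              # step 5: only the tickets are trained
        w = w - lr * g
    return w

def prune_by_magnitude(w, keep):
    """Baseline step 3: keep the `keep` largest-magnitude parameters."""
    mask = np.zeros_like(w)
    mask[np.argsort(-np.abs(w))[:keep]] = 1.0
    return mask

theta0 = np.random.default_rng(1).normal(size=10)  # 1) random init
theta = train(theta0.copy())                       # 2) train
mask = prune_by_magnitude(theta, keep=3)           # 3) prune (GSM replaces this)
tickets = theta0 * mask                            # 4) reset survivors to init
final = train(tickets, mask=mask)                  # 5) fix the rest at zero, train tickets
```

Swapping step 3 for a GSM pruning run, while keeping steps 1, 2, 4, and 5 identical, is exactly the comparison reported in Table 4.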
The intuition is that since we expect to find the parameters that have "won the initialization lottery", the timing when we make the decision should be closer to when the initialization takes place, and we wish to correct the mistakes immediately when we become aware of the wrong decisions. Frankle and Carbin also noted that it might bring benefits to prune as early as possible [15], which is precisely what GSM does, as GSM keeps pushing the unimportant parameters continuously to zero from the very beginning.

5 Conclusion

We proposed Global Sparse Momentum SGD (GSM) to directly alter the gradient flow for DNN pruning, which splits the ordinary momentum-SGD-based update into two parts: the active update uses the gradients derived from the objective function to maintain the model's accuracy, and the passive update only performs momentum-accelerated weight decay to push the redundant parameters infinitely close to zero. GSM is characterized by end-to-end training, easy implementation, lossless pruning, implicit connection rewiring, the ability to automatically discover the appropriate per-layer sparsity ratios in modern very deep neural networks, and the capability to find powerful winning tickets.

Acknowledgement

We sincerely thank all the reviewers for their comments. This work was supported by the National Key R&D Program of China (No. 2018YFC0807500), the National Natural Science Foundation of China (No. 61571269, No.
61971260), the National Postdoctoral Program for Innovative Talents (No. BX20180172), and the China Postdoctoral Science Foundation (No. 2018M640131). Corresponding authors: Guiguang Ding, Jungong Han.

References

[1] Jose M Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.

[2] Jose M Alvarez and Mathieu Salzmann. Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pages 856–867, 2017.

[3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.

[4] Yoshua Bengio and Yann LeCun, editors. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[5] Yoshua Bengio and Yann LeCun, editors. 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

[6] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[7] Giovanna Castellano, Anna Maria Fanelli, and Marcello Pelillo. An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8(3):519–531, 1997.

[8] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.

[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[10] Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal SGD for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4943–4953, 2019.

[11] Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. Approximated oracle filter pruning for destructive CNN width optimization. In International Conference on Machine Learning, pages 1607–1616, 2019.

[12] Xiaohan Ding, Guiguang Ding, Jungong Han, and Sheng Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[13] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.

[14] Mikhail Figurnov, Aizhan Ibraimova, Dmitry P Vetrov, and Pushmeet Kohli. PerforatedCNNs: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947–955, 2016.

[15] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Bengio and LeCun [5].

[16] Gabriel Goh. Why momentum really works. Distill, 2017.

[17] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1586–1595, 2018.

[18] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs.
In Advances in Neural Information Processing Systems, pages 1379–1387, 2016.

[19] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.

[20] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

[21] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[22] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171, 1993.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[24] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.

[25] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[26] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

[27] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269. IEEE Computer Society, 2017.

[28] Ayoosh Kathuria. Intro to optimization in deep learning: Momentum, RMSProp and Adam. https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/, 2018.

[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[30] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[31] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.

[32] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[33] Shaohui Lin, Rongrong Ji, Yuchao Li, Cheng Deng, and Xuelong Li. Towards compact convnets via structure-sparsity regularized filter pruning. arXiv preprint arXiv:1901.07827, 2019.

[34] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.

[35] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Tim Kwang-Ting Cheng, and Jian Sun. MetaPruning: Meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258, 2019.

[36] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm.
In Proceedings of the European Conference on Computer Vision (ECCV), pages 722–737, 2018.

[37] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763. IEEE, 2017.

[38] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In Bengio and LeCun [5].

[39] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.

[40] Michaël Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[41] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

[42] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[43] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In Bengio and LeCun [4].

[44] Volker Roth and Bernd Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pages 848–855. ACM, 2008.

[45] Heinz Rutishauser.
Theory of gradient methods. In Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems, pages 24–49. Springer, 1959.

[46] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.

[47] Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 138–145, 2017.

[48] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.

[49] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and Fisher pruning. arXiv preprint arXiv:1801.05787, 2018.

[50] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. In Bengio and LeCun [4].

[51] Huan Wang, Qiming Zhang, Yuehai Wang, and Haoji Hu. Structured pruning for efficient convnets via incremental regularization. arXiv preprint arXiv:1811.08390, 2018.

[52] Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Beyond filters: Compact feature map for portable deep model. In International Conference on Machine Learning, pages 3703–3711, 2017.

[53] Yunhe Wang, Chang Xu, Shan You, Dacheng Tao, and Chao Xu. CNNpack: Packing convolutional neural networks in the frequency domain.
In Advances in Neural Information Processing Systems, pages 253–261, 2016.

[54] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

[55] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5687–5695, 2017.

[56] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2018.

[57] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2016.