{"title": "Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2133, "page_last": 2144, "abstract": "Filter pruning is one of the most effective ways to accelerate and compress convolutional neural networks (CNNs). In this work, we propose a global filter pruning algorithm called Gate Decorator, which transforms a vanilla CNN module by multiplying its output by the channel-wise scaling factors (i.e. gate). When the scaling factor is set to zero, it is equivalent to removing the corresponding filter. We use Taylor expansion to estimate the change in the loss function caused by setting the scaling factor to zero and use the estimation for the global filter importance ranking. Then we prune the network by removing those unimportant filters. After pruning, we merge all the scaling factors into its original module, so no special operations or structures are introduced. Moreover, we propose an iterative pruning framework called Tick-Tock to improve pruning accuracy. The extensive experiments demonstrate the effectiveness of our approaches. For example, we achieve the state-of-the-art pruning ratio on ResNet-56 by reducing 70% FLOPs without noticeable loss in accuracy. For ResNet-50 on ImageNet, our pruned model with 40% FLOPs reduction outperforms the baseline model by 0.31% in top-1 accuracy. 
Various datasets are used, including CIFAR-10, CIFAR-100, CUB-200, ImageNet ILSVRC-12 and PASCAL VOC 2011.", "full_text": "Gate Decorator: Global Filter Pruning Method for\nAccelerating Deep Convolutional Neural Networks\n\nZhonghui You 1\nPeking University\n\nzhonghui@pku.edu.cn\n\nKun Yan 1\n\nPeking University\n\nkyan2018@pku.edu.cn\n\nJinmian Ye\nMomenta\n\njinmian.y@gmail.com\n\nMeng Ma 2, *\n\nPeking University\n\nmameng@pku.edu.cn\n\nPing Wang 1, 2, 3, *\nPeking University\npwang@pku.edu.cn\n\n1 School of Software and Microelectronics, Peking University\n\n2 National Engineering Research Center for Software Engineering, Peking University\n\n3 Key Laboratory of High Con\ufb01dence Software Technologies (PKU), Ministry of Education\n\nAbstract\n\nFilter pruning is one of the most effective ways to accelerate and compress convo-\nlutional neural networks (CNNs). In this work, we propose a global \ufb01lter pruning\nalgorithm called Gate Decorator, which transforms a vanilla CNN module by\nmultiplying its output by the channel-wise scaling factors (i.e. gate). When the\nscaling factor is set to zero, it is equivalent to removing the corresponding \ufb01lter. We\nuse Taylor expansion to estimate the change in the loss function caused by setting\nthe scaling factor to zero and use the estimation for the global \ufb01lter importance\nranking. Then we prune the network by removing those unimportant \ufb01lters. After\npruning, we merge all the scaling factors into its original module, so no special\noperations or structures are introduced. Moreover, we propose an iterative pruning\nframework called Tick-Tock to improve pruning accuracy. The extensive experi-\nments demonstrate the effectiveness of our approaches. For example, we achieve\nthe state-of-the-art pruning ratio on ResNet-56 by reducing 70% FLOPs without\nnoticeable loss in accuracy. 
For ResNet-50 on ImageNet, our pruned model with\n40% FLOPs reduction outperforms the baseline model by 0.31% in top-1 accuracy.\nVarious datasets are used, including CIFAR-10, CIFAR-100, CUB-200, ImageNet\nILSVRC-12 and PASCAL VOC 2011.\n\n1\n\nIntroduction\n\nIn recent years, we have witnessed the remarkable achievements of CNNs in many computer vision\ntasks [40, 48, 37, 51, 24]. With the support of powerful modern GPUs, CNN models can be designed\nto be larger and more complex for better performance. However, the large amount of computation and\nstorage consumption prevents the deployment of state-of-the-art models to the resource-constrained\ndevices such as mobile phones or the Internet of Things (IoT) devices. The constraints mainly come\nfrom three aspects [28]: 1) Model size. 2) Run-time memory. 3) Number of computing operations.\nTake the widely used VGG-16 [39] model as an example. The model has up to 138 million parameters\nand consumes more than 500MB storage space. To infer an image with resolution of 224\u00d7224, the\nmodel requires more than 16 billion \ufb02oating point operations (FLOPs) and 93MB extra run-time\n\n* corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: An illustration of \ufb01lter pruning. The i-th layer has 4 \ufb01lters (i.e. channels). If we remove\none of the \ufb01lters, the corresponding feature map will disappear, and the input of the \ufb01lters in the\n(i + 1)-th layer changes from 4 channels to 3 channels.\n\nmemory to store the intermediate output, which is a heavy burden for the low-end devices. Therefore,\nnetwork compression and acceleration methods have aroused great interest in research.\nRecent studies on model compression and acceleration can be divided into four categories: 1)\nQuantization [34, 55, 54]. 2) Fast convolution [2, 41]. 3) Low rank approximation [7, 8, 50]. 4) Filter\npruning [1, 30, 15, 25, 28, 33, 56, 52]. 
Among these methods, filter pruning (a.k.a. channel pruning) has received widespread attention due to its notable advantages. First, filter pruning is a universal technique that can be applied to various types of CNN models. Second, filter pruning does not change the design philosophy of the model, which makes it easy to combine with other compression and acceleration techniques. Furthermore, no specialized hardware or software is needed for the pruned network to achieve acceleration.

Neural pruning was first introduced by Optimal Brain Damage (OBD) [23, 10], in which LeCun et al. found that some neurons could be deleted without noticeable loss in accuracy. For CNNs, we prune the network at the filter level, so we call this technique Filter Pruning (Figure 1). The studies on filter pruning can be separated into two classes: 1) Layer-by-layer pruning [30, 15, 56]. 2) Global pruning [25, 28, 33, 52]. The layer-by-layer pruning approaches remove the filters of one particular layer at a time until certain conditions are met, and then minimize the feature reconstruction error of the next layer. But pruning filters layer by layer is time-consuming, especially for deep networks. Besides, a pre-defined pruning ratio must be set for each layer, which deprives the filter pruning algorithm of its usefulness for neural architecture search; we discuss this in Section 4.3.
On the other hand, the global pruning method removes unimportant filters regardless of which layer they are in. The advantage of global filter pruning is that we do not need to set the pruning ratio for each layer: given an overall pruning objective, the algorithm reveals the optimal network structure it finds. The key to global pruning methods is to solve the global filter importance ranking (GFIR) problem.

In this work, we propose a novel global filter pruning method, which includes two components: the first is the Gate Decorator algorithm to solve the GFIR problem; the second is the Tick-Tock pruning framework to boost pruning accuracy. Specifically, we show how to apply the Gate Decorator to Batch Normalization [19], and we call the modified module Gated Batch Normalization (GBN). It should be noted that the modules transformed by Gate Decorator are designed to serve the temporary purpose of pruning. Given a pre-trained model, we convert the BN modules to GBN before pruning. When the pruning ends, we turn GBN back to vanilla BN. In this way, no special operations or structures are introduced. Extensive experiments demonstrate the effectiveness of our approach. We achieve the state-of-the-art pruning ratio on ResNet-56 [11] by reducing 70% of the FLOPs without noticeable loss in accuracy. On ImageNet [4], we reduce 40% of the FLOPs of ResNet-50 [11] while increasing the top-1 accuracy by 0.31%. Our contributions can be summarized as follows:

(a) We propose a global filter pruning pipeline composed of two parts: one is the Gate Decorator algorithm designed to solve the GFIR problem, and the other is the Tick-Tock pruning framework to boost pruning accuracy. Besides, we propose the Group Pruning technique to solve the Constrained Pruning problem encountered when pruning networks with shortcuts, such as ResNet [11].

(b) Experimental results show that our approach outperforms state-of-the-art methods.
We also extensively study the properties of the GBN algorithm and the Tick-Tock framework. Furthermore, we demonstrate that the global filter pruning method can be viewed as a task-driven network architecture search algorithm.

2 Related work

Filter Pruning  Filter pruning is a promising solution to accelerate CNNs. Numerous inspiring works prune the filters by evaluating their importance. Heuristic metrics have been proposed, such as the magnitude of convolution kernels [25] and the average percentage of zero activations (APoZ) [17]. Luo et al. [30] and He et al. [15] use Lasso regression to select the filters that minimize the next layer's feature reconstruction error. Yu et al. [52], on the other hand, optimize the reconstruction error of the final response layer and propagate an importance score to each filter. Molchanov et al. [33] apply Taylor expansion to evaluate the effect of filters on the final loss function. Another category of works trains the network under certain restrictions that zero out some filters or reveal redundancy in them. Zhuang et al. [56] obtain good results by applying additional discrimination-aware losses to fine-tune the pre-trained model and keeping the filters that contribute to the discriminative power. However, the discrimination-aware losses are designed for classification tasks, which limits their scope of application. Liu et al. [28] and Ye et al. [49] apply scaling factors to each filter and add a sparse constraint to the loss in the training or fine-tuning stage. Ding et al. [6] propose a new optimization method that forces several filters to reach the same value after training, and then safely removes the redundant filters.
These methods need to train the model from scratch, which can be time-consuming for large datasets.

Other Methods  Quantization methods compress the network by reducing the number of distinct parameter values. [34, 3] quantize 32-bit floating point parameters into binary or ternary values, but these aggressive quantization strategies usually come with accuracy loss. [55, 54] show that with a moderate quantization strategy, the quantized network can even outperform the full-precision network. Recently, new designs of convolution have been proposed. Chen et al. [2] designed a plug-and-play convolutional unit named OctConv, which factorizes the mixed feature maps by their frequencies. Experimental results demonstrate that OctConv can improve the accuracy of the model while reducing the computation. Low-rank decomposition methods [7, 8, 5] approximate network weights with multiple lower-rank matrices. Another popular research direction for accelerating networks is to explore the design of network architectures, and many computation-efficient architectures [16, 36, 53, 32] have been proposed for mobile devices. These networks are designed by human experts. To automate this design process, neural architecture search (NAS) has recently received widespread attention; many approaches have been proposed, including reinforcement-learning-based [57], gradient-based [47, 27] and evolution-based [35] methods. It should be noted that the Gate Decorator algorithm we propose is orthogonal to the methods described in this subsection; that is, Gate Decorator can be combined with these methods to achieve higher compression and acceleration rates.

3 Method

In this section, we first introduce the Gate Decorator (GD) to solve the GFIR problem and show how to apply GD to Batch Normalization [19]. Then we propose an iterative pruning framework called Tick-Tock for better pruning accuracy.
Finally, we introduce the Group Pruning technique to solve the Constrained Pruning problem encountered when pruning networks with shortcuts.

3.1 Problem Definition and Gate Decorator

Formally, let $\mathcal{L}(X, Y; \theta)$ denote the loss function used to train the model, where $X$ is the input data, $Y$ is the corresponding label, and $\theta$ is the set of model parameters. We use $K$ to represent the set of all filters of the network. Filter pruning chooses a subset of filters $k \subset K$ and removes their parameters $\theta^{-}_{k}$ from the network. We denote the remaining parameters as $\theta^{+}_{k}$, so that $\theta^{+}_{k} \cup \theta^{-}_{k} = \theta$. To minimize the loss increase, we need to carefully choose $k^*$ by solving the following optimization problem:

$$k^* = \arg\min_{k} \left| \mathcal{L}(X, Y; \theta) - \mathcal{L}(X, Y; \theta^{+}_{k}) \right| \quad \text{s.t.} \; \|k\|_0 > 0 \quad (1)$$

where $\|k\|_0$ is the number of elements of $k$. A simple way to solve this problem is to try every possible $k$ and choose the one that has the least effect on the loss. But this requires calculating $\Delta \mathcal{L} = \left| \mathcal{L}(X, Y; \theta) - \mathcal{L}(X, Y; \theta^{+}_{k}) \right|$ a total of $\|K\|_0$ times for just one iteration of pruning, which is infeasible for deep models that have tens of thousands of filters. To solve this problem, we propose the Gate Decorator to evaluate the importance of filters efficiently.

Figure 2: An illustration of the Tick-Tock pruning framework. The Tick phase is executed on a subset of the training data, and the convolution kernels are set to non-updatable. The Tock phase uses the full training data and adds the sparse constraint on $\phi$ to the loss function.

Assuming that feature map $z$ is the output of the filter $k$, we multiply $z$ by a trainable scaling factor $\phi \in \mathbb{R}$ and use $\hat{z} = \phi z$ for further calculations.
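Concretely, the gating above just scales each filter's feature map by its own scalar gate; setting a gate to zero reproduces the effect of removing that filter. A minimal pure-Python sketch (shapes and names are illustrative, not the paper's code):

```python
# Hedged sketch of channel-wise gating: each output channel of a conv layer
# is multiplied by a scalar gate phi. phi == 0 zeroes the channel, which is
# equivalent to pruning the corresponding filter.

def gate_feature_maps(feature_maps, gates):
    """feature_maps: one list of floats per channel; gates: one phi per channel."""
    assert len(feature_maps) == len(gates)
    return [[phi * v for v in channel]
            for channel, phi in zip(feature_maps, gates)]

# 3 channels with 2 "pixels" each; the middle filter is effectively pruned.
z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zhat = gate_feature_maps(z, [1.0, 0.0, 0.5])   # zhat[1] == [0.0, 0.0]
```

Because the gates are plain multiplicative factors, they can later be folded back into the preceding module, which is why no extra structure remains after pruning.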
When the gate $\phi$ is zero, it is equivalent to pruning the filter $k$. Using Taylor expansion, we can approximately evaluate the $\Delta \mathcal{L}$ caused by the pruning. First, for notational convenience, we rewrite $\Delta \mathcal{L}$ as Eq. (2), in which $\Omega$ includes $X$, $Y$ and all of the model parameters except $\phi$, so that $\mathcal{L}_\Omega(\phi)$ is a unary function w.r.t. $\phi$:

$$\Delta \mathcal{L}_\Omega(\phi) = \left| \mathcal{L}_\Omega(\phi) - \mathcal{L}_\Omega(0) \right| \quad (2)$$

Then we use the Taylor series to expand $\mathcal{L}_\Omega(0)$ in Eqs. (3)-(4):

$$\mathcal{L}_\Omega(0) = \sum_{p=0}^{P} \frac{\mathcal{L}^{(p)}_\Omega(\phi)}{p!} (0 - \phi)^p + R_P(\phi) \quad (3)$$

$$= \mathcal{L}_\Omega(\phi) - \phi \nabla_\phi \mathcal{L}_\Omega + R_1(\phi) \quad (4)$$

Combining Eq. (2) and Eq. (4), we get

$$\Delta \mathcal{L}_\Omega(\phi) = \left| \phi \nabla_\phi \mathcal{L}_\Omega - R_1(\phi) \right| \approx \left| \phi \nabla_\phi \mathcal{L}_\Omega \right| = \left| \frac{\delta \mathcal{L}}{\delta \phi} \phi \right| \quad (5)$$

$R_1$ is the Lagrange remainder, and we neglect this term because it would require a massive amount of calculation. Now we are able to solve the GFIR problem based on Eq. (5), which can be easily computed during back-propagation. For each filter $k_i \in K$, we use $\Theta(\phi_i)$ calculated by Eq. (6) as its importance score, where $D$ is the training set:

$$\Theta(\phi_i) = \sum_{(X,Y) \in D} \left| \frac{\delta \mathcal{L}(X, Y; \theta)}{\delta \phi_i} \phi_i \right| \quad (6)$$

Specifically, we apply the Gate Decorator to Batch Normalization [19] and use it in our experiments. We call the modified module Gated Batch Normalization (GBN). We chose the BN module for two reasons: 1) The BN layer follows the convolution layer in most cases, so we can easily find the correspondence between the filters and the feature maps of the BN layer.
2) We can take advantage of the scaling factor $\gamma$ in BN to provide a ranking clue for $\phi$ (see Appendix A for details). GBN is defined in Eq. (7), in which $\vec{\phi}$ is a vector of gates and $c$ is the channel size of $z_{in}$. Moreover, for networks that do not use BN, we can also directly apply the Gate Decorator to the convolution; the definition of Gated Convolution can be found in Appendix B.

$$\hat{z} = \frac{z_{in} - \mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B} + \epsilon}}; \quad z_{out} = \vec{\phi}\,(\gamma \hat{z} + \beta), \quad \vec{\phi} \in \mathbb{R}^c \quad (7)$$

3.2 Tick-Tock Pruning Framework

In this section, we introduce an iterative pruning framework, called Tick-Tock (Figure 2), to improve pruning accuracy. The Tick step is designed to achieve the following goals: 1) Speed up the pruning process. 2) Calculate the importance score $\Theta$ of each filter. 3) Fix the internal covariate shift problem [19] caused by previous pruning. In the Tick phase, we train the model on a small subset of the training data for one epoch, during which only the gate $\phi$ and the final linear layer are updatable, to avoid overfitting on the small dataset. $\Theta$ is calculated during backward propagation according to Eq. (6). After training, we sort all the filters by their importance score $\Theta$ and remove a portion of the least important filters.

Figure 3: An illustration of Group Pruning. GBNs with the same color belong to the same group.

The Tick phase can be repeated $T$ times before the Tock phase comes in. The Tock phase is designed to fine-tune the network to reduce the accumulation of errors caused by removing filters. Besides, a sparse constraint on $\phi$ is added to the loss function during training, which helps to reveal the unimportant filters and to calculate $\Theta$ more accurately.
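The Tick-Tock schedule described above can be sketched as the following control-flow skeleton. This is a hedged illustration: the per-Tick pruning fraction and the counting logic are hypothetical stand-ins, and real training and fine-tuning steps are elided; only the loop structure (T Ticks per Tock) and the L1 sparse term on the gates follow the text.

```python
# Hedged sketch of the Tick-Tock loop. Each Tick ranks filters and removes a
# small fraction; after T Ticks, one Tock fine-tunes with an L1 penalty on
# the gates. Numbers of filters stand in for the real model.

def l1_penalty(gates, lam):
    # Sparse constraint added to the loss during the Tock phase.
    return lam * sum(abs(phi) for phi in gates)

def tick_tock(num_filters, rounds, T, prune_per_tick):
    """Return how many filters remain, plus a log of (phase, remaining)."""
    log = []
    remaining = num_filters
    for _ in range(rounds):
        for _ in range(T):                        # Tick: rank, then prune
            pruned = max(1, int(remaining * prune_per_tick))
            remaining -= pruned
            log.append(("tick", remaining))
        log.append(("tock", remaining))           # Tock: sparse fine-tune
    return remaining, log

remaining, log = tick_tock(num_filters=1000, rounds=1, T=10, prune_per_tick=0.01)
```

The 1% per-Tick fraction matches the VGG setting reported later in the experiments; the actual Tick also freezes the convolution kernels and trains only the gates and the final linear layer.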
The loss used in Tock is shown in Eq. (8):

$$\mathcal{L}_{tock} = \mathcal{L} + \lambda \sum_{\phi \in \Phi} |\phi| \quad (8)$$

Finally, we fine-tune the pruned network to obtain better performance. There are two differences between the Tock step and the Fine-tune step: 1) Fine-tune usually trains for more epochs than Tock. 2) Fine-tune does not add the sparse constraint to the loss function.

3.3 Group Pruning for the Constrained Pruning Problem

ResNet [11] and its variants [18, 46, 43] contain shortcut connections, which apply element-wise addition to the feature maps produced by two residual blocks. If we prune the filters of each layer independently, the feature maps in the shortcut connection may become misaligned. Several solutions have been proposed: [25, 30] bypass these troublesome layers and only prune the internal layers of the residual blocks; [28, 15] insert an additional sampler before the first convolution layer in each residual block and leave the last convolution layer unpruned. However, avoiding the troublesome layers limits the pruning ratio, and the sampler solution adds new structures to the network, which introduces additional computational latency.

To solve the misalignment problem, we propose Group Pruning: we assign the GBNs connected by pure shortcut connections to the same group. A pure shortcut connection is a shortcut with no convolution layer on the side branch, as shown in Figure 3. A group can be viewed as a virtual GBN whose members all share the same pruning pattern, and the importance score of the filters in the group is the sum over its members, as shown in Eq. (9).
$g$ is one of the GBN members in the group $G$, and the ranking of the $j$-th filter of all members in $G$ is determined by $\Theta(\phi^{G}_{j})$:

$$\Theta(\phi^{G}_{j}) = \sum_{g \in G} \Theta(\phi^{g}_{j}) \quad (9)$$

3.4 Comparison with Similar Work

PCNN [33] also uses Taylor expansion to solve the GFIR problem. The proposed Gate Decorator differs from PCNN in three aspects: 1) Since no scaling factors are introduced, PCNN evaluates a filter's importance score by summing the first-degree Taylor polynomials of each element in its feature map, which accumulates estimation error. 2) PCNN cannot take advantage of the sparse constraint due to the lack of a scaling factor; according to our experiments, however, the sparse constraint plays an important role in boosting pruning accuracy. 3) A score normalization across layers is essential for PCNN, but not for Gate Decorator. This is because PCNN uses the accumulation method to calculate the importance score, which makes the scale of the scores vary with the size of the feature maps across layers. We abandon score normalization since our scores are globally comparable, and normalization would introduce new estimation errors.

4 Experiments

In this section, we first introduce the datasets and general implementation details used in our experiments. We then confirm the effectiveness of the proposed method by comparing it with several state-of-the-art approaches. Finally, we explore in detail the role of each component.

4.1 Implementation Details

Datasets. Various datasets are used in our experiments, including CIFAR-10 [20], CIFAR-100 [20], CUB-200 [45], ImageNet ILSVRC-12 [4] and PASCAL VOC 2011 [31]. The CIFAR-10 [20] dataset contains 50K training images and 10K test images for 10 classes.
The CIFAR-100 [20] dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. The CUB-200 [45] dataset consists of nearly 6,000 training images and 5,700 test images covering 200 bird species. ImageNet ILSVRC-12 [4] contains 1.28 million training images and 50K test images for 1000 classes. The PASCAL VOC 2011 [31] segmentation dataset and its extended dataset SBD [9] are used, which together provide 8,498 training images and 2,857 test images in 20 categories.

Baseline training. Three types of popular network architectures are adopted: VGGNet [39], ResNet [11] and FCN [38]. Since VGGNet is originally designed for the ImageNet classification tasks, for the CIFAR and CUB-200 tasks we use the fully convolutional version of VGGNet taken from [22], which we denote VGG-M. All networks are trained using SGD, with weight decay and momentum set to $10^{-4}$ and 0.9, respectively. We train our CIFAR and ImageNet baseline models following the setup in [11]. For the CIFAR datasets, the model is trained for 160 epochs with a batch size of 128; the initial learning rate is set to 0.1 and divided by 10 at epochs 80 and 120. Besides, the simple data augmentation described in [11] is also adopted: random cropping and random horizontal flipping of the training images. For ImageNet, we train the baseline model for 90 epochs with a batch size of 256; the initial learning rate is set to 0.1 and divided by 10 every 30 epochs. We follow the widely used data augmentation in [21]: images are resized to 256×256, then a 224×224 area is randomly cropped from the original image or its horizontal reflection for training. Testing is done on the center crop of 224×224 pixels. For the semantic segmentation task, we train an FCN-32s [38] network taken from [44] for 11 epochs.

Tick-Tock settings.
Since ResNet is more compact than VGG, we prune 0.2% of the filters of ResNet and 1% of the filters of VGG (including FCN) in each Tick stage. The Tick stage can be performed on a subset of the training data to speed up the pruning. For the ImageNet task, we randomly draw 100 images per class to form the subset. In the case of CIFAR and CUB-200, we use all the training data in Tick due to the small scale of the datasets. In all of our experiments, $T$ is set to 10, which means that one Tock operation is performed after every 10 Tick operations, and we train the network with the sparse constraint for 10 epochs in the Tock phase. If not stated otherwise, we use the following learning rate schedule: the learning rate used in Tick is set to $10^{-3}$; for the Tock phase, we use the 1-cycle [42] strategy, which linearly increases the learning rate from $10^{-3}$ to $10^{-2}$ in the first half of the iterations and then linearly decreases it from $10^{-2}$ to $10^{-3}$; for the Fine-tune phase, we use the same learning rate strategy as the Tock phase and train the network for 40 epochs.

4.2 Overall Comparisons

ResNet-56 on CIFAR-10. Table 1 shows the pruning results of ResNet-56 on CIFAR-10. We compare GBN with various pruning algorithms, and we can see that GBN achieves the state-of-the-art pruning ratio without noticeable loss in accuracy. Our pruned ResNet-56 with 60% FLOPs

Metric       | Li et al. [25] | NISP [52] | DCP-A [56] | CP [15] | AMC [14] | C-SGD [6] | GBN-40 | GBN-30
FLOPs ↓%     | 27.6           | 43.6      | 47.1       | 50.0    | 50.0     | 60.8      | 60.1   | 70.3
Params ↓%    | 13.7           | 42.6      | 70.3       | -       | -        | -         | 53.5   | 66.7
Accuracy ↓%  | -0.02          | 0.03      | -0.01      | 1.00    | 0.90     | -0.23     | -0.33  | 0.03

Table 1: The pruning results of ResNet-56 [11] on CIFAR-10 [20]. The baseline accuracy is 93.1%.

Table 2: The pruning results of ResNet-50 [11] on the ImageNet [4] dataset.
"P.Top-1" and "P.Top-5" denote the top-1 and top-5 single-center-crop accuracy of the pruned model on the validation set. "[Top-1]↓" and "[Top-5]↓" denote the decrease in accuracy of the pruned model compared to its unpruned baseline. "Global" identifies whether the method is a global filter pruning algorithm.

Method          | Global | P.Top-1 | P.Top-5 | [Top-1]↓ | [Top-5]↓ | FLOPs ↓% | Param ↓%
ThiNet-70 [30]  | no     | 72.04   | 90.67   | 0.84     | 0.47     | 36.75    | 33.72
SFP [12]        | no     | 74.61   | 92.06   | 1.54     | 0.81     | 41.80    | -
GBN-60          | yes    | 76.19   | 92.83   | -0.31    | -0.16    | 40.54    | 31.83
NISP [52]       | yes    | -       | -       | 0.89     | -        | 44.01    | 43.82
FPGM [13]       | no     | 74.83   | 92.32   | 1.32     | 0.55     | 53.50    | -
ThiNet-50 [30]  | no     | 71.01   | 90.02   | 1.87     | 1.12     | 55.76    | 51.56
DCP [56]        | no     | 74.95   | 92.32   | 1.06     | 0.61     | 55.76    | 51.45
GDP [26]        | yes    | 71.89   | 90.71   | 3.24     | 1.59     | 51.30    | -
GBN-50          | yes    | 75.18   | 92.41   | 0.67     | 0.26     | 55.06    | 53.40

reduction outperforms the baseline by 0.33% in test accuracy. When reducing the FLOPs by 70%, the test accuracy is only 0.03% lower than that of the baseline model.

ResNet-50 on ILSVRC-12. To validate the effectiveness of the proposed method on large-scale datasets, we further prune the widely used ResNet-50 [11] on the ILSVRC-12 [4] dataset. The results are shown in Table 2. We fine-tune the pruned network for 60 epochs with a batch size of 256. The learning rate is initially set to 0.01 and divided by 10 at epochs 36, 48 and 54. We also test the acceleration of the pruned networks in wall-clock time. The inference speed of the baseline ResNet-50 model on a single Titan X Pascal is 864 images per second, using a batch size of 64 and an image resolution of 224×224. By reducing 40% of the FLOPs, GBN-60 can process 1127 images per second (30%↑), and GBN-50 achieves 1237 images per second (43%↑).

FCN on the PASCAL VOC 2011.
Since the Gate Decorator does not make any assumptions about the loss function, and no additional operations or structures are introduced into the pruned model, the proposed method can easily be applied to various computer vision tasks. We test the effect of the proposed method on the semantic segmentation task by pruning an FCN-32s [38] network on the extended PASCAL VOC 2011 dataset [31, 9] (see Appendix C for details). Compared to the baseline, the pruned network reduces the FLOPs by 27% and the number of parameters by 73% while maintaining the mIoU (62.84%→62.86%).

4.3 More Explorations

Comparisons on the GFIR problem. To verify the effectiveness of the Gate Decorator in solving the global filter importance ranking problem, we compare it with two other global filter pruning methods, Slim [28] and PCNN [33]. For the baseline model, we employ the VGG-16-M [22] model pre-trained on ImageNet [4] and train it on the CUB-200 [45] dataset for 90 epochs with a batch size of 64. The initial learning rate is set to $3 \times 10^{-3}$ and divided by 3 every 30 epochs. All pruning tests are based on the same baseline model. The sparse constraints on the scaling factors are adopted in the Tock phase (except for PCNN), and the $\lambda$ for the sparse constraint is set to $10^{-3}$.

From the results shown in Figure 4, we make the following observations: 1) The accuracy of the model pruned by Slim [28] fluctuates dramatically. This is because Slim ranks filters by the magnitude of their factors, which is insufficient according to our analysis. Besides, Slim can also benefit from the sparse constraint in the Tock phase. 2) PCNN [33] and the proposed GBN produce smoother curves because the gradient is taken into account. GBN outperforms PCNN by a large margin due to the differences discussed in Section 3.4.

Global filter pruning as a NAS method.
The global filter pruning method automatically determines the channel size of each layer of the CNN model, and it can cooperate with popular NAS methods [47, 27, 57, 35] to obtain more efficient network architectures for a specific task. Figure 5 shows two compressed networks with the same amount of computation. The baseline model is a VGG-16-M network trained on CIFAR-100 [20] with a test accuracy of 73.19%.

Figure 4: The pruning results of VGG-16-M [22] on the CUB-200 [45] dataset. The reported results are the model test accuracy before the Fine-tune phase. Slim [28] and PCNN [33] are compared.

The "shrunk" network halves the channel size of all convolution layers, so its FLOPs become 1/4 of the baseline. We train the "shrunk" network for 320 epochs from scratch, and its test accuracy drops by 1.98% compared to the baseline. The "pruned" network is the result of pruning the baseline model with the Tick-Tock framework, which drops the accuracy by only 1.30%. If we reinitialize the "pruned" network and train it from scratch, the accuracy reaches 71.02%. More importantly, the number of parameters of the "pruned" network is only 1/3 of that of the "shrunk" one. Comparing their structures, we find that the redundancy in the deep layers is unnecessary, while the middle layers appear to be more important, which runs counter to common intuition. This experiment therefore demonstrates that our pruning method can be viewed as a task-driven network architecture search algorithm, which is also consistent with the conclusion presented in [29].

Effectiveness of the Tick-Tock framework. To figure out the impact of the Tick-Tock framework, we integrate GBN with three different pruning schemas. The test accuracy of the pruned models is shown in Table 3. In the One-Shot mode, we calculate the global filter ranking only once and prune the model to a certain FLOPs level without reconstruction.
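As a concrete illustration of the One-Shot schema just described, the sketch below accumulates each gate's importance score via Eq. (6), i.e. the sum over batches of the absolute gradient-gate product, then globally ranks the filters and drops the lowest-ranked fraction. The gradient values here are toy inputs, not produced by a real framework.

```python
# Hedged sketch of global ranking + One-Shot pruning. `phi` holds one gate
# per filter; `grads_per_batch` holds hypothetical dL/dphi values per batch.

def importance_scores(phi, grads_per_batch):
    """Eq. (6): Theta_i = sum over batches of |dL/dphi_i * phi_i|."""
    return [sum(abs(g[i] * phi[i]) for g in grads_per_batch)
            for i in range(len(phi))]

def one_shot_prune(scores, prune_ratio):
    """Return the indices of filters kept after removing the lowest-scored."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:int(len(scores) * prune_ratio)])
    return [i for i in range(len(scores)) if i not in pruned]

phi = [0.9, 0.1, 0.5, 0.01]
grads = [[0.2, 0.3, 0.1, 0.5], [0.1, 0.2, 0.4, 0.3]]
scores = importance_scores(phi, grads)
kept = one_shot_prune(scores, prune_ratio=0.5)   # keeps filters 0 and 2
```

Note the ranking is global: filters from all layers compete in a single sorted list, so no per-layer pruning ratio is needed.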
In the Tick-Only mode, Tick is repeated until the FLOPs of the network fall below a certain threshold; in each Tick step, GBN recalculates the global filter ranking and removes 1% of the filters. We apply the full pruning pipeline in the Tick-Tock mode. Also, since the Tick-Tock pipeline involves additional training, we double the epochs to 320 when training the "scratch" model for a fair comparison.

At the same FLOPs threshold, the models pruned by Tick-Tock and Tick-Only are significantly better than One-Shot in both the "fine-tune" and "scratch" results, which shows that iterative pruning is more accurate. Comparing Tick-Only with Tick-Tock: on the one hand, the accuracy of the models trained from scratch is comparable, because similar network structures are preserved (see Appendix E for details); on the other hand, the Tock phase enhances the performance of the pruned model, which benefits from the sparse constraint. When reducing 40% of the FLOPs, the pruned model achieves 74.6% accuracy on the test set, which is 1.4% higher than the unpruned model.

FLOPs-  | GBN with One-Shot         | GBN with Tick-Only        | GBN with Tick-Tock
Pruned  | Param  Finetune  Scratch  | Param  Finetune  Scratch  | Param  Finetune  Scratch
40%     | 79.3%  71.8      73.1     | 68.5%  73.0      73.7     | 69.0%  74.6      73.7
60%     | 92.0%  62.1      68.0     | 86.2%  71.4      72.9     | 85.5%  73.2      73.0
80%     | 97.5%  57.7      59.9     | 95.0%  68.4      69.6     | 94.7%  71.2      69.9

Table 3: The test results of the VGG-16-M [22] model on the CIFAR-100 [20] dataset under different pruning schemas. The accuracy of the unpruned baseline model is 73.2%. "Param" denotes the percentage of parameters that have been removed. "Finetune" represents the test accuracy of the pruned model after fine-tuning. "Scratch" shows the test result of a randomly initialized model that has the same architecture as the pruned one.
When training the \"Scratch\" model, we doubled the epochs to 320.\n\n8\n\n01020304050607080901020304050607080Accuracy (%)FLOPs Reduction (%) GBN-Tick GBN-TickTock Slim-Tick Slim-TickTock PCNN-Tick PCNN-TickTock01020304050607080901020304050607080Accuracy (%)Parameters Reduction (%) GBN-Tick GBN-TickTock Slim-Tick Slim-TickTock PCNN-Tick PCNN-TickTock\fFigure 5: An illustration of two network architectures with the same FLOPs.\n\n5 Conclusion\n\nIn this work, we propose three components to serve the purpose of global \ufb01lter pruning: 1) The Gate\nDecorator algorithm to solve the global \ufb01lter importance ranking (GFIR) problem. 2) The Tick-Tock\nframework to boost pruning accuracy. 3) The Group Pruning method to solve the constrained pruning\nproblem. We show that the global \ufb01lter pruning method can be viewed as a task-driven network\narchitecture search algorithm. Extensive experiments show the proposed method outperforms several\nstate-of-the-art \ufb01lter pruning methods.\n\n6 Acknowledgement\n\nThis work was supported by National Key R&D Program of China (Grant no.2017YFB1200700),\nPeking University Medical Cross Research Seed Fund and National Natural Science Foundation of\nChina (Grant no.61701007).\n\nReferences\n[1] Jose M. Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances\n\nin Neural Information Processing Systems (NeurIPS), pages 2262\u20132270, 2016.\n\n[2] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan,\nand Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with\noctave convolution. arXiv, abs/1904.05049, 2019.\n\n[3] Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and\n\nactivations constrained to +1 or -1. arXiv, abs/1602.02830, 2016.\n\n[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical\nimage database. 
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[5] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NeurIPS), pages 1269–1277, 2014.

[6] Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal SGD for pruning very deep convolutional networks with complicated structure. arXiv, abs/1904.03837, 2019.

[7] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems (NeurIPS), pages 1379–1387, 2016.

[8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016.

[9] Bharath Hariharan, Pablo Arbelaez, Lubomir D. Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV), pages 991–998, 2011.

[10] Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon: Extensions and performance comparison. In Advances in Neural Information Processing Systems (NeurIPS), pages 263–270, 1993.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[12] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks.
In International Joint Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.

[13] Yang He, Ping Liu, Ziwei Wang, and Yi Yang. Pruning filter via geometric median for deep convolutional neural networks acceleration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2019.

[14] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: automl for model compression and acceleration on mobile devices. In European Conference on Computer Vision (ECCV), pages 815–832, 2018.

[15] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision (ICCV), pages 1398–1406, 2017.

[16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv, abs/1704.04861, 2017.

[17] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv, abs/1607.03250, 2016.

[18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132–7141, 2018.

[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 448–456, 2015.

[20] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1106–1114, 2012.

[22] kuangliu.
95.16% on cifar10 with pytorch. https://github.com/kuangliu/pytorch-cifar.

[23] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), pages 598–605, 1989.

[24] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114, 2017.

[25] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), 2017.

[26] Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang Zhang. Accelerating convolutional networks via global & dynamic filter pruning. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2425–2432, 2018.

[27] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.

[28] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.

[29] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2019.

[30] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision (ICCV), pages 5068–5076, 2017.

[31] Everingham M., Van Gool L., Williams C.
K. I., Winn J., and Zisserman A. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.

[32] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet V2: practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision (ECCV), pages 122–138, 2018.

[33] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. In International Conference on Learning Representations (ICLR), 2017.

[34] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pages 525–542, 2016.

[35] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. In Association for the Advancement of Artificial Intelligence (AAAI), 2019.

[36] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.

[37] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.

[38] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 39(4):640–651, 2017.

[39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
In International Conference on Learning Representations (ICLR), 2015.

[40] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems (NeurIPS), pages 9333–9343, 2018.

[41] Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P. Namboodiri. Hetconv: Heterogeneous kernel-based convolutions for deep cnns. arXiv, abs/1903.04120, 2019.

[42] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. arXiv, abs/1708.07120, 2017.

[43] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Association for the Advancement of Artificial Intelligence (AAAI), pages 4278–4284, 2017.

[44] Kentaro Wada. Pytorch implementation of fully convolutional networks. https://github.com/wkentaro/pytorch-fcn.

[45] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.

[46] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995, 2017.

[47] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture search. In International Conference on Learning Representations (ICLR), 2019.

[48] Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. In IEEE International Conference on Computer Vision (ICCV), pages 1290–1299, 2017.

[49] Jianbo Ye, Xin Lu, Zhe Lin, and James Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers.
In International Conference on Learning Representations (ICLR), 2018.

[50] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9378–9387, 2018.

[51] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision (ECCV), pages 334–349, 2018.

[52] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. NISP: pruning networks using neuron importance score propagation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 9194–9203, 2018.

[53] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018.

[54] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. In International Conference on Learning Representations (ICLR), 2017.

[55] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian D. Reid. Towards effective low-bitwidth convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7920–7928, 2018.

[56] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 883–894, 2018.

[57] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le.
Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8697–8710, 2018.