{"title": "Learning both Weights and Connections for Efficient Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1135, "page_last": 1143, "abstract": "Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9\u00d7, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG-16 found that the total number of parameters can be reduced by 13\u00d7, from 138 million to 10.3 million, again with no loss of accuracy.", "full_text": "Learning both Weights and Connections for Ef\ufb01cient\n\nNeural Networks\n\nSong Han\n\nStanford University\n\nsonghan@stanford.edu\n\nJeff Pool\nNVIDIA\n\njpool@nvidia.com\n\nJohn Tran\nNVIDIA\n\njohntran@nvidia.com\n\nWilliam J. Dally\nStanford University\n\nNVIDIA\n\ndally@stanford.edu\n\nAbstract\n\nNeural networks are both computationally intensive and memory intensive, making\nthem dif\ufb01cult to deploy on embedded systems. Also, conventional networks \ufb01x\nthe architecture before training starts; as a result, training cannot improve the\narchitecture. 
To address these limitations, we describe a method to reduce the\nstorage and computation required by neural networks by an order of magnitude\nwithout affecting their accuracy by learning only the important connections. Our\nmethod prunes redundant connections using a three-step method. First, we train\nthe network to learn which connections are important. Next, we prune the unim-\nportant connections. Finally, we retrain the network to \ufb01ne tune the weights of the\nremaining connections. On the ImageNet dataset, our method reduced the number\nof parameters of AlexNet by a factor of 9\u00d7, from 61 million to 6.7 million, without\nincurring accuracy loss. Similar experiments with VGG-16 found that the total\nnumber of parameters can be reduced by 13\u00d7, from 138 million to 10.3 million,\nagain with no loss of accuracy.\n\n1\n\nIntroduction\n\nNeural networks have become ubiquitous in applications ranging from computer vision [1] to speech\nrecognition [2] and natural language processing [3]. We consider convolutional neural networks used\nfor computer vision tasks which have grown over time. In 1998 Lecun et al. designed a CNN model\nLeNet-5 with less than 1M parameters to classify handwritten digits [4], while in 2012, Krizhevsky\net al. [1] won the ImageNet competition with 60M parameters. Deepface classi\ufb01ed human faces with\n120M parameters [5], and Coates et al. [6] scaled up a network to 10B parameters.\nWhile these large neural networks are very powerful, their size consumes considerable storage,\nmemory bandwidth, and computational resources. For embedded mobile applications, these resource\ndemands become prohibitive. Figure 1 shows the energy cost of basic arithmetic and memory\noperations in a 45nm CMOS process. From this data we see the energy per connection is dominated\nby memory access and ranges from 5pJ for 32 bit coef\ufb01cients in on-chip SRAM to 640pJ for 32bit\ncoef\ufb01cients in off-chip DRAM [7]. 
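As a back-of-envelope check of the access energies quoted above (5pJ per 32-bit SRAM access, 640pJ per 32-bit DRAM access), the following sketch turns a per-access energy into a power estimate; the workload (a 1-billion-connection network evaluated at 20Hz) is illustrative:

```python
# Back-of-envelope power estimate for streaming weights from memory,
# using the per-access energies quoted in the text (45nm CMOS, [7]).

DRAM_J_PER_ACCESS = 640e-12  # joules per 32-bit off-chip DRAM access
SRAM_J_PER_ACCESS = 5e-12    # joules per 32-bit on-chip SRAM access

def memory_power_watts(connections, frames_per_second,
                       joules_per_access=DRAM_J_PER_ACCESS):
    """Power needed just to fetch every weight once per inference frame."""
    return connections * frames_per_second * joules_per_access

print(f"{memory_power_watts(1e9, 20):.1f} W")                      # 12.8 W from DRAM
print(f"{memory_power_watts(1e9, 20, SRAM_J_PER_ACCESS):.1f} W")   # 0.1 W from SRAM
```

The two-orders-of-magnitude gap between the SRAM and DRAM estimates is what motivates fitting the pruned model entirely on chip.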
Large networks do not \ufb01t in on-chip storage and hence require the more costly DRAM accesses. Running a 1 billion connection neural network, for example, at 20Hz would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access - well beyond the power envelope of a typical mobile device. Our goal in pruning networks is to reduce the energy required to run such large networks so they can run in real time on mobile devices. The model size reduction from pruning also facilitates storage and transmission of mobile applications incorporating DNNs.\n\nOperation | Energy [pJ] | Relative Cost\n32 bit int ADD | 0.1 | 1\n32 bit float ADD | 0.9 | 9\n32 bit Register File | 1 | 10\n32 bit int MULT | 3.1 | 31\n32 bit float MULT | 3.7 | 37\n32 bit SRAM Cache | 5 | 50\n32 bit DRAM Memory | 640 | 6400\n\nFigure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.\n\nTo achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. 
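A minimal sketch of this train-prune-retrain loop on a single dense layer (illustrative pure-Python code, not the paper's Caffe implementation; the `retrain` function is a stand-in for the SGD updates a real framework would run, with the pruning mask kept applied):

```python
import random
import statistics

def prune_by_magnitude(weights, quality=1.0):
    """Zero weights whose magnitude is below quality * std of the surviving
    (nonzero) weights, per the layer-wise thresholding rule in the text."""
    alive = [w for w in weights if w != 0.0]
    threshold = quality * statistics.pstdev(alive)
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def retrain(weights):
    """Stand-in for retraining: nudge surviving weights, keep pruned ones at
    zero. A real implementation would run SGD with the mask applied."""
    return [w + random.gauss(0, 1e-4) if w != 0.0 else 0.0 for w in weights]

random.seed(0)
weights = [random.gauss(0, 0.01) for _ in range(30000)]  # one dense layer

for _ in range(3):  # iterative pruning: prune, then retrain
    weights = prune_by_magnitude(weights, quality=0.5)
    weights = retrain(weights)

density = sum(w != 0.0 for w in weights) / len(weights)
print(f"{density:.1%} of weights remain")
```

Because the threshold is recomputed over the surviving weights each iteration, repeated prune-retrain rounds remove progressively more connections, mirroring the iterative pruning described later in Section 3.4.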
In effect, this training process learns the network connectivity in addition to the weights - much as in the mammalian brain [8][9], where synapses are created in the first few months of a child\u2019s development, followed by gradual pruning of little-used connections as the number of synapses falls to typical adult values.\n\n2 Related Work\n\nNeural networks are typically over-parameterized, and there is significant redundancy in deep learning models [10]. This results in a waste of both computation and memory. There have been various proposals to remove the redundancy: Vanhoucke et al. [11] explored a fixed-point implementation with 8-bit integer (vs 32-bit floating point) activations. Denton et al. [12] exploited the linear structure of the neural network by finding an appropriate low-rank approximation of the parameters and keeping the accuracy within 1% of the original model. With similar accuracy loss, Gong et al. [13] compressed deep convnets using vector quantization. These approximation and quantization techniques are orthogonal to network pruning, and they can be used together to obtain further gains [14].\nThere have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling. The Network in Network architecture [15] and GoogLeNet [16] achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This problem is noted by Szegedy et al. [16] and motivates them to add a linear layer on top of their networks to enable transfer learning.\nNetwork pruning has been used both to reduce network complexity and to reduce over-fitting. An early approach to pruning was biased weight decay [17]. 
Optimal Brain Damage [18] and Optimal Brain Surgeon [19] prune networks to reduce the number of connections based on the Hessian of the loss function and suggest that such pruning is more accurate than magnitude-based pruning such as weight decay. However, computing the second-order derivatives requires additional computation.\nHashedNets [20] is a recent technique to reduce model sizes by using a hash function to randomly group connection weights into hash buckets, so that all connections within the same hash bucket share a single parameter value. This technique may benefit from pruning. As pointed out in Shi et al. [21] and Weinberger et al. [22], sparsity will minimize hash collisions, making feature hashing even more effective. HashedNets may be used together with pruning to give even better parameter savings.\n\nFigure 2: Three-Step Training Pipeline.\n\nFigure 3: Synapses and neurons before and after pruning.\n\n3 Learning Connections in Addition to Weights\n\nOur pruning method employs a three-step process, as illustrated in Figure 2, which begins by learning the connectivity via normal network training. Unlike conventional training, however, we are not learning the final values of the weights, but rather we are learning which connections are important. The second step is to prune the low-weight connections. All connections with weights below a threshold are removed from the network \u2014 converting a dense network into a sparse network, as shown in Figure 3. The final step retrains the network to learn the final weights for the remaining sparse connections. This step is critical. If the pruned network is used without retraining, accuracy is significantly impacted.\n\n3.1 Regularization\n\nChoosing the correct regularization impacts the performance of pruning and retraining. L1 regularization penalizes non-zero parameters, resulting in more parameters near zero. 
This gives better accuracy after pruning, but before retraining. However, the remaining connections are not as good as with L2 regularization, resulting in lower accuracy after retraining. Overall, L2 regularization gives the best pruning results. This is discussed further in the experiment section.\n\n3.2 Dropout Ratio Adjustment\n\nDropout [23] is widely used to prevent over-fitting, and this also applies to retraining. During retraining, however, the dropout ratio must be adjusted to account for the change in model capacity. In dropout, each parameter is probabilistically dropped during training, but will come back during inference. In pruning, parameters are dropped forever after pruning and have no chance to come back during either training or inference. As the parameters get sparse, the classifier will select the most informative predictors and thus have much less prediction variance, which reduces over-fitting. As pruning has already reduced model capacity, the retraining dropout ratio should be smaller.\nQuantitatively, let C_i be the number of connections in layer i, C_io the number in the original network, C_ir the number in the network after retraining, and N_i the number of neurons in layer i. Since dropout works on neurons, and C_i varies quadratically with N_i according to Equation 1, the dropout ratio after pruning the parameters should follow Equation 2, where D_o represents the original dropout rate and D_r the dropout rate during retraining.\n\nC_i = N_i N_{i-1}    (1)\n\nD_r = D_o sqrt(C_ir / C_io)    (2)\n\n3.3 Local Pruning and Parameter Co-adaptation\n\nDuring retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the pruned layers. CNNs contain fragile co-adapted features [24]: gradient descent is able to find a good solution when the network is initially trained, but not after re-initializing some layers and retraining them. 
So when we retrain the pruned layers, we should keep the surviving parameters instead of re-initializing them.\n\nTable 1: Network pruning can save 9\u00d7 to 13\u00d7 parameters with no drop in predictive performance.\n\nNetwork | Top-1 Error | Top-5 Error | Parameters | Compression Rate\nLeNet-300-100 Ref | 1.64% | - | 267K | -\nLeNet-300-100 Pruned | 1.59% | - | 22K | 12\u00d7\nLeNet-5 Ref | 0.80% | - | 431K | -\nLeNet-5 Pruned | 0.77% | - | 36K | 12\u00d7\nAlexNet Ref | 42.78% | 19.73% | 61M | -\nAlexNet Pruned | 42.77% | 19.67% | 6.7M | 9\u00d7\nVGG-16 Ref | 31.50% | 11.32% | 138M | -\nVGG-16 Pruned | 31.34% | 10.88% | 10.3M | 13\u00d7\n\nRetraining the pruned layers starting with the retained weights requires less computation because we don\u2019t have to back-propagate through the entire network. Also, neural networks are prone to the vanishing gradient problem [25] as they get deeper, which makes pruning errors harder to recover from for deep networks. To prevent this, we fix the parameters for CONV layers and only retrain the FC layers after pruning the FC layers, and vice versa.\n\n3.4 Iterative Pruning\n\nLearning the right connections is an iterative process. Pruning followed by retraining is one iteration; after many such iterations, the minimum number of connections can be found. Without loss of accuracy, this method can boost the pruning rate from 5\u00d7 to 9\u00d7 on AlexNet compared with single-step aggressive pruning. Each iteration is a greedy search in that we find the best connections. We also experimented with probabilistically pruning parameters based on their absolute value, but this gave worse results.\n\n3.5 Pruning Neurons\n\nAfter pruning connections, neurons with zero input connections or zero output connections may be safely pruned. 
This pruning is furthered by removing all connections to or from a pruned neuron. The retraining phase automatically arrives at the result that dead neurons have both zero input connections and zero output connections. This occurs due to gradient descent and regularization. A neuron that has zero input connections (or zero output connections) will have no contribution to the final loss, leading the gradient to be zero for its output connections (or input connections), respectively. Only the regularization term will push the weights to zero. Thus, the dead neurons will be automatically removed during retraining.\n\n4 Experiments\n\nWe implemented network pruning in Caffe [26]. Caffe was modified to add, for each weight tensor, a mask that disregards pruned parameters during network operation. The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights. We carried out the experiments on Nvidia TitanX and GTX980 GPUs.\nWe pruned four representative networks: LeNet-300-100 and LeNet-5 on MNIST, together with AlexNet and VGG-16 on ImageNet. The network parameters and accuracy 1 before and after pruning are shown in Table 1.\n\n4.1 LeNet on MNIST\n\nWe first experimented on the MNIST dataset with the LeNet-300-100 and LeNet-5 networks [4]. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each, which achieves a 1.6% error rate on MNIST. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers, which achieves a 0.8% error rate on MNIST. After pruning, the network is retrained with 1/10 of the original learning rate. 
Table 1 shows pruning saves 12\u00d7 parameters on these networks. For each layer of the network the table shows (left to right) the original number of weights, the number of floating point operations to compute that layer\u2019s activations, the average percentage of activations that are non-zero, the percentage of non-zero weights after pruning, and the percentage of actually required floating point operations.\n\n1Reference model is from the Caffe model zoo; accuracy is measured without data augmentation.\n\nTable 2: For LeNet-300-100, pruning reduces the number of weights by 12\u00d7 and computation by 12\u00d7.\n\nLayer | Weights | FLOP | Act% | Weights% | FLOP%\nfc1 | 235K | 470K | 38% | 8% | 8%\nfc2 | 30K | 60K | 65% | 9% | 4%\nfc3 | 1K | 2K | 100% | 26% | 17%\nTotal | 266K | 532K | 46% | 8% | 8%\n\nTable 3: For LeNet-5, pruning reduces the number of weights by 12\u00d7 and computation by 6\u00d7.\n\nLayer | Weights | FLOP | Act% | Weights% | FLOP%\nconv1 | 0.5K | 576K | 82% | 66% | 66%\nconv2 | 25K | 3200K | 72% | 12% | 10%\nfc1 | 400K | 800K | 55% | 8% | 6%\nfc2 | 5K | 10K | 100% | 19% | 10%\nTotal | 431K | 4586K | 77% | 8% | 16%\n\nFigure 4: Visualization of the first FC layer\u2019s sparsity pattern of LeNet-300-100. It has a banded structure repeated 28 times, which corresponds to the un-pruned parameters in the center of the images, since the digits are written in the center.\n\nAn interesting byproduct is that network pruning detects visual attention regions. Figure 4 shows the sparsity pattern of the first fully connected layer of LeNet-300-100; the matrix size is 784 \u00d7 300. It has 28 bands, each of width 28, corresponding to the 28 \u00d7 28 input pixels. The colored regions of the figure, indicating non-zero parameters, correspond to the center of the image. Because digits are written in the center of the image, these are the important parameters. 
The graph is sparse on the left and right, corresponding to the less important regions at the top and bottom of the image. After pruning, the neural network finds the center of the image more important, and the connections to the peripheral regions are more heavily pruned.\n\n4.2 AlexNet on ImageNet\n\nWe further examine the performance of pruning on the ImageNet ILSVRC-2012 dataset, which has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as the reference model, which has 61 million parameters across 5 convolutional layers and 3 fully connected layers. The AlexNet Caffe model achieved a top-1 accuracy of 57.2% and a top-5 accuracy of 80.3%. The original AlexNet took 75 hours to train on an NVIDIA Titan X GPU. After pruning, the whole network is retrained with 1/100 of the original network\u2019s initial learning rate. It took 173 hours to retrain the pruned AlexNet. Pruning is not used when iteratively prototyping the model, but rather for model reduction when the model is ready for deployment. Thus, the retraining time is less of a concern. 
Table 1 shows that AlexNet can be pruned to 1/9 of its original size without impacting accuracy, and the amount of computation can be reduced by 3\u00d7.\n\nTable 4: For AlexNet, pruning reduces the number of weights by 9\u00d7 and computation by 3\u00d7.\n\nLayer | Weights | FLOP | Act% | Weights% | FLOP%\nconv1 | 35K | 211M | 88% | 84% | 84%\nconv2 | 307K | 448M | 52% | 38% | 33%\nconv3 | 885K | 299M | 37% | 35% | 18%\nconv4 | 663K | 224M | 40% | 37% | 14%\nconv5 | 442K | 150M | 34% | 37% | 14%\nfc1 | 38M | 75M | 36% | 9% | 3%\nfc2 | 17M | 34M | 40% | 9% | 3%\nfc3 | 4M | 8M | 100% | 25% | 10%\nTotal | 61M | 1.5B | 54% | 11% | 30%\n\nTable 5: For VGG-16, pruning reduces the number of weights by 12\u00d7 and computation by 5\u00d7.\n\nLayer | Weights | FLOP | Act% | Weights% | FLOP%\nconv1_1 | 2K | 0.2B | 53% | 58% | 58%\nconv1_2 | 37K | 3.7B | 89% | 22% | 12%\nconv2_1 | 74K | 1.8B | 80% | 34% | 30%\nconv2_2 | 148K | 3.7B | 81% | 36% | 29%\nconv3_1 | 295K | 1.8B | 68% | 53% | 43%\nconv3_2 | 590K | 3.7B | 70% | 24% | 16%\nconv3_3 | 590K | 3.7B | 64% | 42% | 29%\nconv4_1 | 1M | 1.8B | 51% | 32% | 21%\nconv4_2 | 2M | 3.7B | 45% | 27% | 14%\nconv4_3 | 2M | 3.7B | 34% | 34% | 15%\nconv5_1 | 2M | 925M | 32% | 35% | 12%\nconv5_2 | 2M | 925M | 29% | 29% | 9%\nconv5_3 | 2M | 925M | 19% | 36% | 11%\nfc6 | 103M | 206M | 38% | 4% | 1%\nfc7 | 17M | 34M | 42% | 4% | 2%\nfc8 | 4M | 8M | 100% | 23% | 9%\nTotal | 138M | 30.9B | 64% | 7.5% | 21%\n\n4.3 VGG-16 on ImageNet\n\nWith promising results on AlexNet, we also looked at a larger, more recent network, VGG-16 [27], on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional layers but still only three fully-connected layers. Following a similar methodology, we aggressively pruned both convolutional and fully-connected layers to realize a significant reduction in the number of weights, shown in Table 5. We used five iterations of pruning and retraining.\nThe VGG-16 results are, like those for AlexNet, very promising. The network as a whole has been reduced to 7.5% of its original size (13\u00d7 smaller). 
In particular, note that the two largest fully-connected layers can each be pruned to less than 4% of their original size. This reduction is critical for real-time image processing, where there is little reuse of fully connected layers across images (unlike batch processing during training).\n\n5 Discussion\n\nThe trade-off curve between accuracy and number of parameters is shown in Figure 5. The more parameters are pruned away, the more the accuracy drops. We experimented with L1 and L2 regularization, with and without retraining, together with iterative pruning, to give five trade-off lines. Comparing solid and dashed lines, the importance of retraining is clear: without retraining, accuracy begins dropping much sooner \u2014 with 1/3 of the original connections, rather than with 1/10 of the original connections. It\u2019s interesting to see that we have the \u201cfree lunch\u201d of reducing the connections by 2\u00d7 without losing accuracy even without retraining, while with retraining we are able to reduce connections by 9\u00d7.\n\nFigure 5: Trade-off curve for parameter reduction and loss in top-5 accuracy. L1 regularization performs better than L2 at learning the connections without retraining, while L2 regularization performs better than L1 at retraining. Iterative pruning gives the best result.\n\nFigure 6: Pruning sensitivity for CONV layers (left) and FC layers (right) of AlexNet.\n\nL1 regularization gives better accuracy than L2 directly after pruning (dotted blue and purple lines) since it pushes more parameters closer to zero. However, comparing the yellow and green lines shows that L2 outperforms L1 after retraining, since there is no benefit to further pushing values towards zero. One extension is to use L1 regularization for pruning and then L2 for retraining, but this did not beat simply using L2 for both phases. 
Parameters from one mode do not adapt well to the other.\nThe biggest gain comes from iterative pruning (solid red line with solid circles). Here we take the\npruned and retrained network (solid green line with circles) and prune and retrain it again. The\nleftmost dot on this curve corresponds to the point on the green line at 80% (5\u00d7 pruning) pruned to\n8\u00d7. There\u2019s no accuracy loss at 9\u00d7. Not until 10\u00d7 does the accuracy begin to drop sharply.\nTwo green points achieve slightly better accuracy than the original model. We believe this accuracy\nimprovement is due to pruning \ufb01nding the right capacity of the network and hence reducing over\ufb01tting.\nBoth CONV and FC layers can be pruned, but with different sensitivity. Figure 6 shows the sensitivity\nof each layer to network pruning. The \ufb01gure shows how accuracy drops as parameters are pruned on\na layer-by-layer basis. The CONV layers (on the left) are more sensitive to pruning than the fully\nconnected layers (on the right). The \ufb01rst convolutional layer, which interacts with the input image\ndirectly, is most sensitive to pruning. We suspect this sensitivity is due to the input layer having only\n3 channels and thus less redundancy than the other convolutional layers. We used the sensitivity\nresults to \ufb01nd each layer\u2019s threshold: for example, the smallest threshold was applied to the most\nsensitive layer, which is the \ufb01rst convolutional layer.\nStoring the pruned layers as sparse matrices has a storage overhead of only 15.6%. Storing relative\nrather than absolute indices reduces the space taken by the FC layer indices to 5 bits. 
Similarly, CONV layer indices can be represented with only 8 bits.\n\nTable 6: Comparison with other model reduction methods on AlexNet. Data-free pruning [28] saved only 1.5\u00d7 parameters with much loss of accuracy. Deep Fried Convnets [29] worked on fully connected layers only and reduced the parameters by less than 4\u00d7. [30] reduced the parameters by 4\u00d7 with inferior accuracy. Naively cutting the layer size saves parameters but suffers from a 4% loss of accuracy. [12] exploited the linear structure of convnets and compressed each layer individually, where model compression on a single layer incurred a 0.9% accuracy penalty with biclustering + SVD.\n\nNetwork | Top-1 Error | Top-5 Error | Parameters | Compression Rate\nBaseline Caffemodel [26] | 42.78% | 19.73% | 61.0M | 1\u00d7\nData-free pruning [28] | 44.40% | - | 39.6M | 1.5\u00d7\nFastfood-32-AD [29] | 41.93% | - | 32.8M | 2\u00d7\nFastfood-16-AD [29] | 42.90% | - | 16.4M | 3.7\u00d7\nCollins & Kohli [30] | 44.40% | - | 15.2M | 4\u00d7\nNaive Cut | 47.18% | 23.23% | 13.8M | 4.4\u00d7\nSVD [12] | 44.02% | 20.56% | 11.9M | 5\u00d7\nNetwork Pruning | 42.77% | 19.67% | 6.7M | 9\u00d7\n\nFigure 7: Weight distribution before and after parameter pruning. The right figure has a 10\u00d7 smaller scale.\n\nAfter pruning, the storage requirements of AlexNet and VGGNet are small enough that all weights can be stored on chip, instead of off-chip DRAM which takes orders of magnitude more energy to access (Table 1). 
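The relative-index storage scheme described above (5-bit gaps for FC layers, 8-bit gaps for CONV layers) can be sketched as follows; this is an illustrative encoder, not the paper's implementation, and it assumes that when a gap overflows the index width, a padding entry carrying a zero weight is emitted so the decoder can skip ahead:

```python
def encode_relative(indices, bits=5):
    """Encode sorted absolute indices of nonzero weights as gaps fitting in
    `bits` bits. Oversized gaps are split with (max_gap, is_padding) entries;
    padding entries are assumed to carry a zero weight."""
    max_gap = (1 << bits) - 1
    out, prev = [], 0
    for idx in indices:
        gap = idx - prev
        while gap > max_gap:          # gap too large for the index width:
            out.append((max_gap, True))  # emit padding entry, advance pointer
            gap -= max_gap
        out.append((gap, False))      # real entry for this nonzero weight
        prev = idx
    return out

def decode_relative(entries):
    """Recover absolute indices, skipping padding entries."""
    pos, indices = 0, []
    for gap, is_padding in entries:
        pos += gap
        if not is_padding:
            indices.append(pos)
    return indices

sparse = [3, 40, 41, 100]
enc = encode_relative(sparse, bits=5)
assert decode_relative(enc) == sparse
```

Storing a 5-bit gap instead of a 32-bit absolute index is where the small storage overhead quoted in the text comes from: the cost is the occasional padding entry when surviving weights are far apart.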
We are targeting our pruning method at fixed-function hardware specialized for sparse DNNs, given the limitations of general-purpose hardware for sparse computation.\nFigure 7 shows histograms of the weight distribution before (left) and after (right) pruning. The weights are from the first fully connected layer of AlexNet. The two panels have different y-axis scales. The original distribution of weights is centered on zero with tails dropping off quickly. Almost all parameters are between [\u22120.015, 0.015]. After pruning, the large center region is removed. The network parameters adjust themselves during the retraining phase. The result is that the parameters form a bimodal distribution and become more spread across the x-axis, between [\u22120.025, 0.025].\n\n6 Conclusion\n\nWe have presented a method to improve the energy efficiency and storage of neural networks without affecting accuracy by finding the right connections. Our method, motivated in part by how learning works in the mammalian brain, operates by learning which connections are important, pruning the unimportant connections, and then retraining the remaining sparse network. We highlight our experiments on AlexNet and VGGNet on ImageNet, showing that both fully connected and convolutional layers can be pruned, reducing the number of connections by 9\u00d7 to 13\u00d7 without loss of accuracy. This leads to smaller memory capacity and bandwidth requirements for real-time image processing, making such networks easier to deploy on mobile systems.\n\nReferences\n[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. 
In Advances in Neural Information Processing Systems, pages 1097\u20131105, 2012.\n\n[2] Alex Graves and J\u00fcrgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602\u2013610, 2005.\n\n[3] Ronan Collobert, Jason Weston, L\u00e9on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493\u20132537, 2011.\n\n[4] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[5] Yaniv Taigman, Ming Yang, Marc\u2019Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701\u20131708. IEEE, 2014.\n\n[6] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with COTS HPC systems. In 30th ICML, pages 1337\u20131345, 2013.\n\n[7] Mark Horowitz. Energy table for 45nm process, Stanford VLSI wiki.\n\n[8] JP Rauschecker. Neuronal mechanisms of developmental plasticity in the cat\u2019s visual system. Human Neurobiology, 3(2):109\u2013114, 1983.\n\n[9] Christopher A Walsh. Peter Huttenlocher (1931\u20132013). Nature, 502(7470):172\u2013172, 2013.\n\n[10] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148\u20132156, 2013.\n\n[11] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on CPUs. In Proc. 
Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.\n\n[12] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure\n\nwithin convolutional networks for ef\ufb01cient evaluation. In NIPS, pages 1269\u20131277, 2014.\n\n[13] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks\n\nusing vector quantization. arXiv preprint arXiv:1412.6115, 2014.\n\n[14] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with\n\npruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.\n\n[15] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.\n[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint\narXiv:1409.4842, 2014.\n\n[17] Stephen Jos\u00b4e Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with\n\nback-propagation. In Advances in neural information processing systems, pages 177\u2013185, 1989.\n\n[18] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information\n\nProcessing Systems, pages 598\u2013605. Morgan Kaufmann, 1990.\n\n[19] Babak Hassibi, David G Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon.\n\nAdvances in neural information processing systems, pages 164\u2013164, 1993.\n\n[20] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural\n\nnetworks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.\n\n[21] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan. Hash\n\nkernels for structured data. 
The Journal of Machine Learning Research, 10:2615\u20132637, 2009.\n\n[22] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing\n\nfor large scale multitask learning. In ICML, pages 1113\u20131120. ACM, 2009.\n\n[23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout:\n\nA simple way to prevent neural networks from over\ufb01tting. JMLR, 15:1929\u20131958, 2014.\n\n[24] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural\n\nnetworks? In Advances in Neural Information Processing Systems, pages 3320\u20133328, 2014.\n\n[25] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient\n\ndescent is dif\ufb01cult. Neural Networks, IEEE Transactions on, 5(2):157\u2013166, 1994.\n\n[26] Yangqing Jia, et al. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint\n\narXiv:1408.5093, 2014.\n\n[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. CoRR, abs/1409.1556, 2014.\n\n[28] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv\n\npreprint arXiv:1507.06149, 2015.\n\n[29] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang.\n\nDeep fried convnets. arXiv preprint arXiv:1412.7149, 2014.\n\n[30] Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. arXiv preprint\n\narXiv:1412.1442, 2014.\n\n9\n\n\f", "award": [], "sourceid": 708, "authors": [{"given_name": "Song", "family_name": "Han", "institution": "Stanford University"}, {"given_name": "Jeff", "family_name": "Pool", "institution": "NVIDIA"}, {"given_name": "John", "family_name": "Tran", "institution": "NVIDIA"}, {"given_name": "William", "family_name": "Dally", "institution": "Stanford University"}]}