Paper ID: 708
Title: Learning both Weights and Connections for Efficient Neural Network
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a method for decreasing the number of parameters in a neural network by pruning connections (and sometimes, in effect, neurons) and results using this method. The method is simple and has probably been tried before, but the experiments conducted are fairly extensive, employing two datasets and three network architectures. The results that networks can be pruned to have 12x or 9x fewer weights without any drop in performance are interesting and may motivate future work. Of particular interest is the necessity of multiple prune/retrain cycles vs. a single prune and retrain.

Quality: good.

Clarity: Section 3.1 is poorly worded and confusing. Fortunately the L1 vs L2 discussion is eventually given a clearer treatment in Fig 5 and Lines 324-328, but until that point the reader may be left bewildered.

Lines 165-167: "To prevent this, we fix the parameters for part of the network and only retrain a shallow network by reusing the surviving parameters, which already co-adapted well with the un-pruned layers during initial training."

What parts were fixed while others were retrained? This is never discussed in the following sections. Some of these details may be hiding behind Line 269, but they are not exposed.

Originality: the algorithm is not very original, but the experiments are well rounded and serve to illustrate how far the basic algorithm may be taken.

Significance: Likely to be interesting to those in the field and motivate future work.

Minor: Line 229: should be "12x and 12x", not "16x and 12x"
Q2: Please summarize your review in 1-2 sentences
This paper presents a method for decreasing the number of parameters in a neural network by pruning connections (and sometimes, in effect, neurons) and results using this method.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)


It is good that the authors study reasonably modern and well-known (AlexNet and VGG) deep nets, so that the compression ratios seem directly relevant to recent CNN work. However, the authors do not seem to compare their work to ANY of the now-large literature on related recent methods for compressing CNNs and other deep nets. They dismiss OBD in Section 2 as unsuitable for "today's large scale neural networks", but that seems unfair, since OBD's saliencies only require the Hessian diagonals, and several modern methods for approximating the required Hessian diagonals could be used as baselines.

But even without using OBD-related methods, there are many other baselines they could have and should have reported for comparison, including methods that reduce weights to fewer bits, methods that approximate the CNN filter matrices using cheap SVD-based compression, or ones that approximate the fully connected layers using randomized projections (e.g. Fastfood methods), and so on.

The authors claim in Section 2 that such work is often "orthogonal to network pruning", but that seems to miss the point: once those methods are used, is there any real advantage to the proposed method?

Importantly, with some of those methods it is in practice much easier to realize the promised speedups (e.g. doing more multiplies in parallel using weights with fewer bits) in the dominant dense computations (i.e. BLAS ops) than with the proposed sparse method (see below for more on this point).

The authors are not very clear on how (and when) their approach is necessarily efficient -- Section 4.2 mentions "900k" and "700k" iterations ... relative to AlexNet "450k".

Presumably this means the authors are claiming their approach requires only 2x as long to train as AlexNet?

Unfortunately this is not clear, because they never report wall-clock time in seconds.

The reason this is especially critical is that they later claim great "speedups" when their nets have 5-9x fewer weights.

However, a reduction in FLOPs does not directly translate into a similar drop in seconds, as is well known and carefully noted in much of the other work on compressing nets.

Specifically, the cache-aware and multicore efficiency of modern BLAS operations (both on CPU and GPU) makes dense matrix operations very competitive with sparse operations until the fraction of non-zeros is much lower than 10%.

So it is not surprising that the authors never report runtime seconds -- they would almost surely NOT have noticed any actual runtime speed improvements with their approach.
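(As a concrete illustration of this point, here is a quick, hedged sketch in NumPy/SciPy, not from the paper, that times dense vs. CSR matrix-vector products at a few densities; the exact crossover point depends on the BLAS library and hardware.)

import time
import numpy as np
import scipy.sparse as sp

n = 4096
x = np.random.randn(n).astype(np.float32)
for density in (0.5, 0.1, 0.01):
    A_csr = sp.random(n, n, density=density, format="csr", dtype=np.float32)  # same matrix, sparse and dense
    A_dense = A_csr.toarray()
    t0 = time.perf_counter()
    for _ in range(100):
        A_dense @ x                    # dense BLAS matrix-vector product
    t_dense = time.perf_counter() - t0
    t0 = time.perf_counter()
    for _ in range(100):
        A_csr @ x                      # CSR sparse matrix-vector product
    t_sparse = time.perf_counter() - t0
    print(f"density {density:.2f}: dense {t_dense:.3f}s  sparse {t_sparse:.3f}s")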

This paper is also unclear on several other issues, including how (and why) they decided to reduce the learning rate (by 1/100 sometimes and by 1/10 for other experiments).

If this requires a lot of experimentation, then that search time needs to be included in the cost of this compression process, making this approach even less competitive.

This paper basically tries the simplest method to sparsify a net: drop small weights and retrain.

It reports impressive-sounding compression ratios (e.g. 5-9x) and thus claims victory.

Finding the simplest ideas that work is in general good and should be encouraged, but then the actual significance of the results, and a careful description of the conditions required to achieve them (including tweaking the learning rates), become critical.

Too much is left out of this paper, including what stopping condition was used for the retraining process (e.g. how did they determine that "900k iterations" was enough, and fair, in Section 4.2?).

Their reporting only FLOPs, instead of seconds as well, makes it very clear that the authors are not very aware of the rich related literature on compressing networks, or of the difficulty of actually turning such compression into practical speedups (due to the efficiency of modern dense-matrix operations over sparse ones).

In short, it is good to see a paper try to get away with the simplest approach, and show an existence proof that more of the weights in some modern CNNs could be zero and still get the same accuracy.

This might have more modest applications, such as storing networks on disk (sparsely) and then running them densely (filling in zeros explicitly for the net in RAM).

But the authors seem to want to claim more significance than that, without comparing to any other relevant compression methods. As it stands this work is too preliminary, as the authors basically admit themselves in Section 4.3 about VGG ("we have not yet had a chance to prune the convolutional layers").

Its limitations and tradeoffs are not discussed, nor explored in relation to any other existing methods.
Q2: Please summarize your review in 1-2 sentences
Proposes a simple (and often revisited) idea for sparsifying a model: train, drop small weights, repeat.

Experiments show that a good (e.g. 5-9x) reduction in weights (without lowering accuracy) on good CNN models is possible, but the paper does not convincingly argue how significant this result is (e.g. it fails to report actual wall-clock speedups and does not compare to other methods, such as low-bit weights, other compression methods, or more modern/scalable approximate-OBD methods).

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
See above for a summary. They apply this technique to AlexNet and VGG net, and are able to reduce the number of parameters by factors of 9 and 6.7, respectively, while retaining the accuracy. This is one of those papers that you wish you had written, because the approach is so simple yet effective. Unlike most NIPS papers, there is very little math here, just a good idea.

The motivation is energy usage on mobile devices.

I found the relation to dropout kind of silly. This is not "hard dropout" - it is simply pruning.
Q2: Please summarize your review in 1-2 sentences
This paper suggests a surprisingly simple yet effective technique for reducing the complexity of deep networks by simply pruning connections below a certain magnitude and retraining the network. The retrained network retains the accuracy of the original network while the weights now have a bimodal distribution around zero.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose and evaluate a simple but effective scheme for reducing the number of non-zero parameters in convolutional neural networks.

They train a network, remove a fraction (e.g., 90%) of the smallest weights, and retrain the network. Their evaluation is on the ImageNet 2012 and MNIST datasets using best-in-class architectures (LeNet, VGG, AlexNet). They try out several different strategies, and the best-performing strategy allows a 6-9x reduction in the number of parameters on ImageNet.
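(For illustration only, a minimal NumPy sketch of this magnitude-based pruning step, with names of our choosing rather than the authors'; during retraining the gradients would be multiplied by the same mask so pruned connections stay at zero.)

import numpy as np

def prune_by_magnitude(W, fraction=0.9):
    # Zero out the smallest `fraction` of weights by absolute value and
    # return the pruned weights together with the binary survival mask.
    threshold = np.percentile(np.abs(W), fraction * 100.0)
    mask = (np.abs(W) >= threshold).astype(W.dtype)
    return W * mask, mask

W = np.random.randn(1024, 4096).astype(np.float32)   # e.g. one fc layer
W_pruned, mask = prune_by_magnitude(W, fraction=0.9)
# retraining step (schematically): W_pruned -= lr * (grad * mask)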

Even though the idea is simple, demonstrating that it can match the state of the art on ImageNet with 9x fewer parameters is significant.
Q2: Please summarize your review in 1-2 sentences
A simple but effective scheme for reducing the size of a trained net 6-9x without losing accuracy. Sufficient detail for re-implementation, along with convincing experimental evidence.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their comments. The paper will be revised to reflect the following responses.

Reviewers 2 and 5 ask for comparison to other compression methods
Table 2 already compares our method to four other methods, including the Fastfood method that Reviewer 2 mentions. These other methods have lower compression rates (2x-4.4x) than our method (9x), both measured on AlexNet. We have also compared to the biclustering and SVD method, as shown in lines 226-227, which compressed the network by 2.4x-13.4x layerwise but had significant accuracy loss: as much as 0.9% even when compressing a single layer. In contrast, our method had no accuracy loss at a similar compression rate. Since those results are on different networks, we report them in lines 226-227 instead of in the table.
Reduced precision is an orthogonal and complementary method that can be combined with our technique. When reduced precision is combined with pruning, AlexNet is compressed by 27x and VGGNet by 31x, both with no loss of accuracy. We will include results combining reduced precision with pruning in the final paper.


Reviewers 2 and 5 question the cost of sparse matrix operations
Our claims about "efficiency" refer to energy efficiency, not speed. We make no claims about speedup, only about the amount of storage necessary to represent the network. As we point out in the introduction, realizing a large DNN in a mobile device requires keeping all parameters in on-chip SRAM. After pruning, the storage requirements of AlexNet and VGGNet are small enough that all weights can be stored on chip, instead of going to off-chip DRAM, which takes orders of magnitude more energy (Table 1). We are targeting our pruning method at fixed-function hardware specialized for sparse DNNs. A hardware engine that efficiently implements sparse matrix-vector products gives better performance than dense matrix computations as long as the matrix has fewer than 50% non-zeros. A complete description of the sparse-matrix hardware accelerator is beyond the scope of this paper, but we will make our target more clear in the final paper.
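(To make the sparse matrix-vector product concrete, here is a plain-Python sketch of a CSR kernel, not a description of the actual accelerator: the work is proportional to the number of surviving weights rather than to the full matrix size.)

def csr_matvec(values, col_idx, row_ptr, x):
    # values/col_idx hold only the non-zero weights and their column indices;
    # row_ptr[i]:row_ptr[i+1] delimits the non-zeros of row i (CSR format).
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]   # one multiply-add per stored weight
        y[i] = acc
    return y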


Reviewers 2 and 5 asked about wall-clock training time.
On an NVIDIA Titan X, training the unpruned AlexNet takes 75 hours and retraining the pruned AlexNet takes 173 hours:
- unpruned AlexNet training took 450K iterations, at 12 s per 20 iterations
- conv-layer retraining took 700K iterations, at 12 s per 20 iterations
- fc-layer retraining took 900K iterations, at 4.5 s per 20 iterations; it is faster because back-propagation is not needed for the conv layers
We will include this data in the final paper.
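(As a sanity check, the quoted per-iteration times reproduce these totals; the snippet below is only arithmetic, not new data.)

baseline_hours = 450_000 * (12 / 20) / 3600                       # unpruned AlexNet training
retrain_hours = (700_000 * (12 / 20) + 900_000 * (4.5 / 20)) / 3600
print(round(baseline_hours), round(retrain_hours))                # -> 75 173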
Pruning is not used when iteratively prototyping the model, but rather used for model reduction when the model is ready for deployment. Thus, the retraining time is not a concern.


Reviewer 6 asked about storage overhead of the sparse matrix
The storage overhead to encode the sparse matrix is 15.6%. Only 5 bits are needed to encode each index in the sparse matrix; this is achieved by encoding the index difference rather than the absolute index. In the rare case when the jump exceeds 31, the largest 5-bit unsigned number, we add a dummy zero. With 5-bit indices, the compression rate drops from 9x to 7.8x for AlexNet.
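(A minimal sketch of this relative indexing on a flattened weight array; the helper names are ours, not from the paper. Note that 5 index bits per surviving 32-bit weight is 5/32 ≈ 15.6%, consistent with the overhead quoted above.)

import numpy as np

def encode_relative(weights, max_jump=31):
    # Store (jump, value) pairs for the non-zeros of a pruned 1-D weight array;
    # jumps larger than 31 are bridged with dummy zero entries so every stored
    # jump fits in 5 bits. Illustrative helper, not the authors' code.
    jumps, values = [], []
    prev = -1
    for idx in np.flatnonzero(weights):
        gap = int(idx) - prev
        while gap > max_jump:            # bridge a long run of pruned weights
            jumps.append(max_jump)
            values.append(0.0)
            prev += max_jump
            gap = int(idx) - prev
        jumps.append(gap)
        values.append(float(weights[idx]))
        prev = int(idx)
    return jumps, values

def decode_relative(jumps, values, length):
    out = np.zeros(length, dtype=np.float32)
    pos = -1
    for jump, v in zip(jumps, values):
        pos += jump
        out[pos] = v                     # dummy entries simply rewrite a zero
    return out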

When we use low-precision weights, the overhead of encoding the sparse matrix becomes larger (53.3%) as the weights are smaller. Taking the overhead into consideration, the compression rate for combined pruning and precision reduction is 27x for AlexNet and 31x for VGGNet, both with no loss of accuracy.
We will include data on sparse matrix overhead with and without low-precision weights in the final paper.


Reviewer 2 asked about VGGNet pruning, where the convolutional layers had not yet been pruned.
We have now successfully pruned all the layers of VGGNet, including both conv layers and fc layers, which improved the compression rate from 6.8x to 13x, again without accuracy loss. The same principle still applied (regularization, dropout ratio, sensitivity analysis). We'll include this improved result in the final paper.


Reviewer 2 questioned why the learning rate is reduced by 1/100 and 1/10 for retraining.
No experimentation is used to set the learning rate. The learning rate is reduced by 1/100 for all large networks on ImageNet and by 1/10 for all small networks on MNIST.


Reviewer 2 asked about the stopping condition
We monitor the learning curve during retraining; when the accuracy reaches a plateau, training is stopped.
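(A hedged sketch of such a plateau criterion, with an assumed patience parameter that is not specified in the paper.)

def should_stop(accuracy_history, patience=5, min_delta=1e-3):
    # Stop once the last `patience` accuracy evaluations bring no improvement
    # larger than `min_delta` over the best accuracy seen before them.
    if len(accuracy_history) <= patience:
        return False
    best_before = max(accuracy_history[:-patience])
    recent_best = max(accuracy_history[-patience:])
    return recent_best < best_before + min_delta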


Reviewer 1 asked about what is fixed and what is retrained
The conv layers are fixed while retraining the fc layers and vice-versa.
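(A minimal PyTorch-style sketch of this scheme, offered as an assumption about one way to implement it rather than as the authors' actual code: set requires_grad on the relevant parameter groups and optimize only the trainable ones.)

import torch
import torchvision

model = torchvision.models.alexnet(num_classes=1000)

# Phase 1: fix the conv layers (model.features) and retrain the fc layers
# (model.classifier); phase 2 swaps the two loops.
for p in model.features.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)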