
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a method for decreasing the number of parameters in a neural network by pruning connections (and sometimes, in effect, neurons) and reports results using this method. The method is simple and has probably been tried before, but the experiments conducted are fairly extensive, employing two datasets and three network architectures. The results that networks can be pruned to have 12x or 9x fewer weights without any drop in performance are interesting and may motivate future work. Of particular interest is the necessity of multiple prune/retrain cycles vs. a single prune and retrain.
Quality: good.
Clarity: Section 3.1 is poorly worded and confusing. Fortunately the L1 vs. L2 discussion is eventually given a clearer treatment in Fig 5 and Lines 324-328, but until that point the reader may be left bewildered.
Lines 165-167: "To prevent this, we fix the parameters for part of the network and only retrain a shallow network by reusing the surviving parameters, which already co-adapted well with the unpruned layers during initial training."
What parts were fixed while others trained? This is never discussed in the following sections. Some of these details may be hiding behind Line 269, but they are not exposed.
Originality: the algorithm is not very original, but the experiments are well rounded and serve to illustrate how far the basic algorithm may be taken.
Significance: Likely to be interesting to those in the field and motivate future work.
Minor: Line 229: should be "12x and 12x", not "16x and 12x"
Q2: Please summarize your review in 1-2 sentences
This paper presents a method for decreasing the number of parameters in a neural network by pruning connections (and sometimes, in effect, neurons) and reports results using this method.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
It is good that the authors study reasonably modern and well-known (AlexNet and VGG) deep nets, so that the compression ratios seem directly relevant to recent CNN work. However, the authors do not seem to compare their work to ANY of the now-large literature on related recent methods for compressing CNNs and other deep nets. They dismiss OBD in Section 2 as unsuitable for "today's large scale neural networks", but that seems unfair since OBD's saliencies only require the Hessian diagonals, and several modern methods to approximate the required Hessian diagonals could be used for baselines.
But even without using OBD-related methods, there are many other baselines they could have and should have reported for comparison, including methods that reduce weights to fewer bits or approximate the CNN filter matrices using cheap SVD-based compression, or ones which approximate the fully connected layers using randomized projection (e.g. FastFood methods), and so on.
The authors claim in Section 2 that such work is often "orthogonal to network pruning", but that seems to miss the point: once those methods are used, is there any real advantage to the proposed method?
Importantly, with some of those methods it is in practice much easier to realize the promised speedups (e.g. doing more multiplies in parallel using weights of fewer bits) in the dominant dense computations (i.e. BLAS ops) than with the proposed sparse method (see below for more on this point).
The authors are not very clear on how (and when) their approach is necessarily efficient: Section 4.2 mentions "900k" and "700k" iterations, relative to AlexNet's "450k".
Presumably this means the authors are claiming their approach requires only 2x as long to train as AlexNet?
Unfortunately they are not clear, because they never report wall-clock time in seconds.
The reason that is especially critical is that they later claim great "speedups" when their nets have 5-9x fewer weights.
However, fewer FLOPS does not directly translate to a similar drop in seconds, as is well-known and carefully mentioned in much of the other work on compressing nets.
Specifically, the cache-aware and multi-core efficiency of modern BLAS operations (both for CPU and GPU) makes dense matrix operations very competitive with sparse operations until sparsity is much lower than 10%.
So it is not surprising that the authors never report runtime seconds: they would almost surely NOT have noticed any actual runtime speed improvements with their approach.
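To make the dense-vs-sparse trade-off concrete, here is a hand-rolled CSR (compressed sparse row) matrix-vector product; the indirect gather through `col_idx` is exactly the irregular memory access that lets dense BLAS stay competitive. This is an illustrative sketch, not code from the paper, and the names are ours:

```python
import numpy as np

def to_csr(dense):
    """Convert a dense matrix to a minimal CSR representation."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product. The gather x[col_idx[s:e]] is the
    irregular, cache-unfriendly access pattern that dense BLAS avoids."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        s, e = row_ptr[i], row_ptr[i + 1]
        y[i] = values[s:e] @ x[col_idx[s:e]]
    return y

# Toy layer at roughly 10% density, the regime discussed above.
rng = np.random.default_rng(1)
w = rng.standard_normal((128, 128))
w[np.abs(w) < 1.6] = 0.0           # keep only large-magnitude weights
x = rng.standard_normal(128)
vals, cols, ptrs = to_csr(w)
y_sparse = csr_matvec(vals, cols, ptrs, x)
y_dense = w @ x                     # the dense BLAS path
```

Both paths compute the same result; whether the sparse path is faster in seconds depends on hardware, which is the reviewer's point.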
This paper is also unclear on several other issues, including how (and why) they decided to reduce the learning rate (by 1/100 sometimes and by 1/10 for other experiments).
If this requires a lot of experimentation, then that search time needs to be included in the cost of this compression process, making this approach even less competitive.
This paper basically tries the simplest method to sparsify a net: drop small weights and retrain.
It reports impressive-sounding compression ratios (e.g. 5-9x) and thus claims victory.
Finding the simplest ideas that work is in general good and should be encouraged, but then the actual significance of the results, and carefully describing the conditions necessary to achieve them (including tweaking the learning rates), becomes critical.
Too much is left out of this paper, including what stopping condition they used for the retraining process (e.g. how did they properly determine that "900k iterations" was enough (and fair) in Section 4.2?).
Their reporting of only FLOPS, instead of seconds as well, makes it very clear that the authors are not very aware of the rich related literature on compressing networks, and of the difficulty of actually translating compression into speedups in practice (due to the efficiency of modern dense-matrix operations over sparse ones).
In short, it is good to see a paper try to get away with the simplest approach, and show an existence proof that more of the weights in some modern CNNs could be zero and still get the same accuracy.
This might have more modest applications, such as storing networks on disk (sparsely) and then running them densely (filling in the zeros explicitly for the net in RAM).
But the authors seem to want to claim more significance than that, without comparing to any other relevant compression methods. As stated this work is too preliminary, as the authors basically admit themselves in Section 4.3 about VGG ("we have not yet had a chance to prune the convolutional layers").
Its limitations and trade-offs are not discussed, nor explored in relation to any other existing methods.
Q2: Please summarize your review in 1-2 sentences
Proposes a simple (and often revisited) idea for sparsifying a model: train, drop small weights, repeat.
Experiments show that a good (e.g. 5-9x) reduction in weights (without lowering accuracy) on good CNN models is possible, but the paper does not convincingly argue how significant this result is (e.g. it fails to report actual wall-clock speedups, and does not compare to other methods, such as using low-bit weights, other compression methods, or more modern/scalable approximate-OBD methods).
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
See above for a summary. They apply this technique to AlexNet and VGG net, and are able to reduce the number of parameters by a factor of 9 and 6.7, respectively, while retaining the accuracy. This is one of those papers that you wish you had written, because the approach is so simple yet effective. Unlike most NIPS papers, there is very little math here, just a good idea.
The motivation is energy usage on mobile devices.
I found the relation to dropout kind of silly. This is not "hard dropout"; it is simply pruning.
Q2: Please summarize your review in 1-2 sentences
This paper suggests a surprisingly simple yet effective technique for reducing the complexity of deep networks by simply pruning connections below a certain magnitude and retraining the network. The retrained network retains the accuracy of the original network while the weights now have a bimodal distribution around zero.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose and evaluate a simple but effective scheme for reducing the number of nonzero parameters in convolutional neural networks.
They train a network, remove a fraction (e.g., 90%) of the smallest weights, and retrain the network. Their evaluation is on the ImageNet 2012 and MNIST datasets using best-of-class architectures (LeNet, VGG, AlexNet). They try out several different strategies, and the best-performing strategy allows a reduction of 6-9 times in the number of parameters on ImageNet.
Even though the idea is simple, demonstrating that it can match the state of the art on ImageNet while using 9x fewer parameters is significant.
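The train/prune/retrain scheme the reviewers describe might look like the following NumPy sketch; the function name and mask handling are our illustration of magnitude pruning, not the paper's implementation:

```python
import numpy as np

def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of smallest-magnitude weights,
    returning the pruned weights and a binary keep-mask."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)          # number of weights to drop
    # Threshold at the k-th smallest magnitude (assumes 0 < fraction < 1).
    threshold = np.partition(flat, k)[k] if k > 0 else 0.0
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Toy example: prune 90% of a random "layer".
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
pruned, mask = prune_by_magnitude(w, 0.9)
sparsity = 1.0 - mask.mean()
# During retraining, gradients for pruned weights would be masked out,
#   w -= lr * (grad * mask)
# and the prune/retrain cycle repeated, per the paper's iterative scheme.
```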
Q2: Please summarize your review in 1-2 sentences
A simple but effective scheme for reducing the size of a trained net 6-9 times without losing accuracy. Sufficient detail for reimplementation, along with convincing experimental evidence.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their comments. The paper will be revised to reflect the following responses.
Reviewers 2 and 5 ask for comparison to other compression methods. Table 2 already compared our method to four other methods, including the Fastfood method that reviewer 2 mentions. These other methods have lower compression rates (2x-4.4x) than our method (9x), both on AlexNet. We have also compared to the bi-clustering and SVD method, as shown in lines 226-227, which compressed the network by 2.4x-13.4x layer-wise, but had significant accuracy loss: as much as 0.9% even when compressing a single layer. In contrast, our method had no accuracy loss at a similar compression rate. Since they use different networks, we put them in lines 226-227 instead of in the table. Reduced precision is an orthogonal and complementary method that can be combined with our technique. When reduced precision is combined with pruning, AlexNet is compressed by 27x and VGGNet by 31x, both with no loss of accuracy. We will include results combining reduced precision with pruning in the final paper.
Reviewers 2 and 5 question the cost of sparse matrix operations. Our claims about "efficiency" refer to energy efficiency, not speed. We make no claims about speedup, only about the amount of storage necessary to represent the network. As we point out in the introduction, realizing a large DNN in a mobile device requires keeping all parameters in on-chip SRAM. After pruning, the storage requirements of AlexNet and VGGNet are small enough that all weights can be stored on chip, instead of going to off-chip DRAM, which takes orders of magnitude more energy (Table 1). We are targeting our pruning method for fixed-function hardware specialized for sparse DNNs. A hardware engine that efficiently implements sparse matrix-vector products gives better performance than dense matrix computations as long as the matrix has less than 50% nonzeros. A complete description of the sparse-matrix hardware accelerator is beyond the scope of this paper, but we will make our target more clear in the final paper.
Reviewers 2 and 5 asked about wall-clock training time. On an NVIDIA Titan X, training the unpruned AlexNet takes 75 hours; retraining the pruned AlexNet takes 173 hours. Unpruned AlexNet training took 450K iterations, each 20 iterations taking 12s; conv layer retraining took 700K iterations, each 20 iterations taking 12s; fc layer retraining took 900K iterations, each 20 iterations taking 4.5s, which is faster because back-propagation is not needed for the conv layers. We will include this data in the final paper. Pruning is not used when iteratively prototyping the model, but rather for model reduction when the model is ready for deployment. Thus, the retraining time is not a concern.
Reviewer 6 asked about the storage overhead of the sparse matrix. The storage overhead to encode the sparse matrix is 15.6%. Only 5 bits are needed to encode each index in the sparse matrix. This is achieved by encoding the index difference rather than the absolute index. In the rare case when the jump exceeds 31, the largest 5-bit unsigned number, we add a dummy zero. With 5-bit indices, the compression rate dropped from 9x to 7.8x for AlexNet.
When we use low-precision weights, the overhead of encoding the sparse matrix becomes larger (53.3%) as the weights are smaller. Taking the overhead into consideration, the compression rate for combined pruning and precision reduction is 27x for AlexNet and 31x for VGGNet, both with no loss of accuracy. We will include data on sparse matrix overhead with and without low-precision weights in the final paper.
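The delta-encoding scheme with dummy zeros might be sketched as follows (function names and example values are ours, for illustration only; the surviving weights are nonzero by construction, which is what lets the decoder treat zero-valued entries as padding):

```python
MAX_JUMP = 31  # largest unsigned value representable in 5 bits

def encode_sparse(positions, weights):
    """Store index gaps instead of absolute indices; bridge gaps
    wider than MAX_JUMP with dummy zero-valued entries."""
    deltas, vals = [], []
    prev = 0
    for pos, w in zip(positions, weights):
        gap = pos - prev
        while gap > MAX_JUMP:
            deltas.append(MAX_JUMP)
            vals.append(0.0)        # dummy zero weight splits the jump
            gap -= MAX_JUMP
        deltas.append(gap)
        vals.append(w)
        prev = pos
    return deltas, vals

def decode_sparse(deltas, vals):
    """Recover absolute positions, skipping the dummy padding entries."""
    positions, weights, cur = [], [], 0
    for d, v in zip(deltas, vals):
        cur += d
        if v != 0.0:
            positions.append(cur)
            weights.append(v)
    return positions, weights

# A gap of 67 (3 -> 70) needs two dummy entries to stay within 5 bits.
deltas, vals = encode_sparse([3, 70, 72], [1.5, -2.0, 0.25])
positions, weights = decode_sparse(deltas, vals)
```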
Reviewer 2 asked about VGGNet pruning, where the convolutional layers had not been pruned. We have now successfully pruned all the layers of VGGNet, including both conv layers and fc layers, which improved the compression rate from 6.8x to 13x, again without accuracy loss. The same principles still apply (regularization, dropout ratio, sensitivity analysis). We'll include this improved result in the final paper.
Reviewer 2 questioned why the learning rate is reduced by 1/100 and 1/10 for retraining. No experimentation is used to set the learning rate: it is reduced by 1/100 for all large networks on ImageNet and by 1/10 for all small networks on MNIST.
Reviewer 2 asked about the stopping condition. We monitored the learning curve during the retraining process; when the accuracy reaches a plateau, training is stopped.
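A plateau check of this kind might be sketched as follows (the `patience` and `min_delta` parameters are our illustrative choices, not values from the paper):

```python
def plateau_reached(accuracies, patience=5, min_delta=1e-3):
    """Stop when the best accuracy seen in the last `patience`
    evaluations has not improved on the earlier best by min_delta."""
    if len(accuracies) <= patience:
        return False
    best_before = max(accuracies[:-patience])
    recent_best = max(accuracies[-patience:])
    return recent_best < best_before + min_delta

# No improvement over the last five evaluations -> stop retraining.
stop = plateau_reached([0.50, 0.79, 0.79, 0.79, 0.79, 0.79, 0.79])
```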
Reviewer 1 asked about what is fixed and what is retrained. The conv layers are fixed while retraining the fc layers, and vice versa.

