Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Originality: The proposed idea is not very different from other dynamic pruning methods. In my opinion the main contribution is the reduced amount of extra computation needed for the pruning, which allows interesting computational gains, together with the GPU-friendly, channel-based way of pruning. The use of channel grouping to avoid a biased selection of the channels is also quite interesting. For dynamic pruning, the authors should also cite [Convolutional Networks with Adaptive Inference Graphs. Andreas Veit, Serge Belongie].

Quality: The proposed contribution makes sense and is justified by interesting experiments on CIFAR-10 and ImageNet.

Clarity: Overall the paper is well written and not difficult to read. Thanks to Fig. 3 and 4 it should not be too difficult to reproduce the model.

Significance: Conditional computation (in this case dynamic pruning), although very interesting in principle, often does not deliver really effective approaches. The theoretical speed-up is usually reduced by the additional computation needed to choose which part of the network to compute. Moreover, the remaining theoretical gain is then difficult to convert into a real speed-up due to the data-layout constraints of GPU architectures. The proposed approach is one of the first to show that conditional computation can lead to real speed-ups, thanks to a very light control architecture and GPU-friendly pruning. This is the most important contribution of this work.

Additional comments:
- Why is knowledge distillation applied only on ImageNet?
- In the speed-up evaluation the authors mention that the algorithm should be fast on GPU, as for Perforated CNN. However, they do not actually evaluate the method's speed-up on GPU.
- I think the method should also be compared to other approaches for speeding up convolutions, even those not based on pruning, such as perforated convolutions.

Final Evaluation: I read the other reviews and the authors' feedback. The authors' answers were clear and I did not find any important issue that I had overlooked. Thus, I confirm my evaluation.
Pruning of CNN models has gained a lot of attention in recent years, and this paper introduces a new dynamic channel pruning technique. The paper presents a simple and effective dynamic channel pruning technique along with accelerators for ASIC hardware, showing actual execution-time speed-ups.

Pros
1. The paper does a good job covering the related work on pruning in CNN models. Static vs. dynamic pruning and channel vs. parameter pruning are well explained. The channel gating layer proposed by the authors is dynamic and more fine-grained than the closest related work. However, references and discussion related to sparsity in the parameters are missing.
2. The channel gating layer introduces a gating mechanism built on the activation function. The channel gating block is simple and effective: it adds little compute on top of the existing layer and only requires computing partial sums, all of which seems friendly to current hardware.
3. The experimental results are significant across various models and datasets. The authors show an improved FLOP reduction while achieving better accuracy than the other pruning methods referenced in the paper. Moreover, the FLOP-reduction gains are furthered by using knowledge distillation.
4. Real execution-time speed-ups on ASIC hardware. The measured speed-up is very close to the theoretical gain, which validates the results further.

Cons
1. Dynamic channel pruning is explored in multiple related works, and the differences between the various techniques seem small.
2. Dynamic pruning does not save storage space. It would be interesting to compare the FLOP reduction of sparse (weight-sparsity) models vs. channel-pruned models to better understand the trade-off between accuracy and FLOP reduction.
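The trade-off between the always-computed channels and the gate's firing rate can be made concrete with a back-of-the-envelope cost model (a reviewer's sketch, not the paper's formulation; the parameter names are illustrative): if a fraction p of input channels forms the always-computed base path and the gate requests the remaining channels at a fraction r of output positions, the layer's expected cost is roughly p + (1 - p) * r of the dense cost.

```python
def expected_flop_fraction(base_fraction, gate_rate):
    """Illustrative cost model for a gated convolution layer.

    base_fraction: fraction of input channels always computed (base path)
    gate_rate: fraction of output positions where the gate requests the rest
    Returns expected FLOPs as a fraction of the dense layer's FLOPs.
    (Ignores the small overhead of the gate itself.)
    """
    return base_fraction + (1.0 - base_fraction) * gate_rate

# e.g. with 1/4 base channels and the gate firing at 30% of positions:
# 0.25 + 0.75 * 0.3 = 0.475, i.e. roughly a 2x theoretical speed-up
```

This also shows why the measured ASIC speed-up can track the theoretical gain: the saved work is whole blocks of channel multiply-accumulates, not scattered individual weights.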
- The dynamic pruning idea is very interesting. Compared with static pruning, dynamic pruning may use more parameters without increasing the runtime during inference.
- The authors realize the idea by using the partial sum over a subset of input channels to generate a decision map. The description of training is clear in general; however, the motivation for using a partial sum to make the decision, and why this helps the results, are not explained. Meanwhile, at inference, given a test example, how are the base channels and the optional channels decided?
- The paper is technically sound, and a fair amount of experimental results is presented to demonstrate the effectiveness of the proposed idea.
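The partial-sum mechanism described above can be sketched in a few lines (a minimal illustration by this reviewer, not the paper's exact formulation; the function name and the hard threshold gate are assumptions): the leading "base" channels are always accumulated, and a gate on that partial sum decides, per output position, whether the remaining channels are computed at all.

```python
# Sketch of channel gating at a single output position: the partial sum
# over the base subset of input channels is computed first, and a simple
# threshold gate decides whether the remaining channels contribute.

def gated_dot(x, w, num_base, threshold):
    """x, w: per-channel input values and weights at one output position.
    num_base: how many leading channels form the always-computed base path.
    threshold: gate fires (computes the rest) when the partial sum exceeds it.
    """
    partial = sum(xi * wi for xi, wi in zip(x[:num_base], w[:num_base]))
    if partial <= threshold:
        # Gate off: skip the remaining channels, saving their multiply-adds.
        return partial
    rest = sum(xi * wi for xi, wi in zip(x[num_base:], w[num_base:]))
    return partial + rest
```

During training the hard threshold would need a smooth approximation so gradients can flow; at inference it reduces to the cheap branch above, and the base/optional split is a fixed partition of the channels, which is exactly the point the review asks the authors to clarify.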