NeurIPS 2020

Neuron-level Structured Pruning using Polarization Regularizer

Review 1

Summary and Contributions: This paper proposes the polarization regularizer, a regulariser that, contrary to the L1 regulariser, which pushes all values towards 0, pushes values to either 0 or a scalar a > 0. Applied to the scaling factors of the BN layers of a neural network, this regulariser produces a bimodal distribution of scaling factors that is better suited for pruning: a fraction of the scaling factors is clustered around zero, while the rest is clustered around a positive value. The neurons whose scaling factor is close to zero can thus be pruned, while the others are kept. The authors compare the proposed method against L1 regularisation and other pruning methods on different architectures (VGG-16, ResNet56 and MobileNetV2) and datasets (CIFAR10/100, ImageNet).

Strengths: The presented regularization method is novel and interesting, and makes a lot of sense for structured pruning. The properties and effect of such regularisation are thoroughly presented. The proposed method also shows good results in the empirical evaluation.

Weaknesses: The main weakness of the proposed method comes from the two additional hyper-parameters lambda and t. In particular:
- If I have a fixed computational budget for my pruned network, I have no particular way of setting lambda and t to match that budget, as explained in Section 3.3 (note that this is also an issue with the L1 regularizer).
- The paper gives little insight into which values practitioners should use for lambda and t, and how they were tuned in the experiments (default values, values used during the grid search, sets used to tune them, etc.). Figure 3 also does not bring much insight; showing histograms of scaling factors for different values of lambda and t might help to understand the effect of both hyper-parameters.

Correctness: The empirical evaluation could be reinforced by showing means and standard deviations of the results (cf. reproducibility checklist). Lines 198-199: "On CIFAR-100 tasks, with fewer compared results available, our method definitely achieves the best results." This is quite a funny sentence that sounds a bit like: "If we don't compare to other methods, our method works best!" That said, I believe each experiment should at least have the L1 norm as a baseline.

Clarity: The paper is clearly written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: About the hyper-parameter t, I would suggest applying it to the second term in Equation (2) rather than to the first: R(g) = || g ||_1 - t * || g - mean(g) ||_1. That way, when setting t to 0, one retrieves the L1 regulariser, and as t increases, one should observe more and more polarisation. This would allow for a better comparison between L1 and polarization (say with equivalent lambda), and we could better appreciate the role of the hyper-parameter t.
After author response: I thank the authors for their response. Algorithm 1 provided in the rebuttal clearly explains the procedure to set lambda and t, and should definitely be placed in the appendix. The new empirical results against DeepHoyer and USC are also compelling.
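The reparameterisation suggested in this feedback can be illustrated with a minimal sketch. It assumes that Equation (2) of the paper has the form R(g) = t * || g ||_1 - || g - mean(g) ||_1, as implied by the suggestion to move t from the first term to the second; the function names and the sample vectors are hypothetical and are not taken from the paper.

```python
def l1_norm(g):
    """Sum of absolute values of the scaling factors."""
    return sum(abs(x) for x in g)


def mean_deviation(g):
    """|| g - mean(g) ||_1: total absolute deviation from the mean."""
    m = sum(g) / len(g)
    return sum(abs(x - m) for x in g)


def polarization_original(g, t):
    # Assumed form of Equation (2): t scales the L1 term.
    return t * l1_norm(g) - mean_deviation(g)


def polarization_suggested(g, t):
    # Form suggested in the review: t scales the deviation term,
    # so t = 0 reduces to the plain L1 regulariser.
    return l1_norm(g) - t * mean_deviation(g)


# Two hypothetical vectors with the same L1 norm.
polarised = [0.0, 0.0, 1.0, 1.0]   # values split between 0 and 1
uniform = [0.5, 0.5, 0.5, 0.5]     # all values equal

l1_recovered = polarization_suggested(polarised, 0.0) == l1_norm(polarised)
penalty_polarised = polarization_suggested(polarised, 1.0)
penalty_uniform = polarization_suggested(uniform, 1.0)
```

With t = 0 the suggested form coincides with the L1 norm, and for t > 0 the polarised vector receives a lower penalty than the uniform one at equal L1 norm, which is the behaviour the suggestion aims to expose.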

Review 2

Summary and Contributions: This paper proposes the polarization regularizer, a regularizer that encourages structured (neuron-level) sparsity. The polarization regularizer, unlike the L1 regularizer, does not have a shrinkage effect on all the weights, but only on some of them (those regarded as unimportant). A theoretical analysis of the regularizer itself is carried out, and the paper shows that it induces a polarized sparsity pattern (when the regularizer is optimized alone). Empirically, the desired property (polarization) still holds when training deep networks with the proposed regularizer. Compared to previous papers, this paper achieves better results (fewer FLOPs and higher accuracy) for three convnets and three datasets.

Strengths:
- Well written and easy to follow.
- A novel regularizer is proposed. I find this regularizer interesting and sensible. Specifically, if we want to have two clusters of weights (pruned and unpruned) by the end of training, we should encourage the optimization to produce them. Existing regularization such as L1 fails to do so and requires a non-trivial threshold to separate the two clusters.
- I like the theoretical analysis of the regularizer, which makes it clear that the regularizer, when optimized alone, induces polarized solutions.
- The method is simple yet effective.
- The empirical results are promising (the numbers achieved seem to be good).

Weaknesses:
- My major concern is that this paper misses important comparisons with the uniform channel scaling baseline. As shown in [1], the final architecture plays a huge role in the effectiveness of channel/filter pruning, so I think pruning methods should be compared to a naive way of model scaling (width-multiplier). More specifically, width-multiplier uniformly increases or decreases the channel counts across all layers by the same proportion (e.g., 50% of the original channel counts). Such a method is very simple (near-zero computational cost) and can match any desired FLOPs requirement. If, after all the pruning effort, one still fails to compare favorably with uniform channel scaling, what is the point? To perform a fair comparison, I'd suggest training such a baseline for the number of epochs used for training plus fine-tuning, which is 248 for ResNet on ImageNet, 512 for MobileNetV2 on ImageNet, and 400 for CIFAR according to this paper's setup.
- I think [7] is also very relevant to this work but is not cited. [7] proposed a regularizer that improves upon L1 by avoiding shrinkage of all parameters, which is the same motivation as in this work. I think it is necessary to compare with this work.
- Another concern is hyperparameter tuning. While this paper makes a good argument regarding the properties of the polarization regularizer compared to the L1 regularizer, it has two hyperparameters, which makes pruning laborious. Most notably, since both hyperparameters affect FLOPs and accuracy, there should be multiple settings that hit the same FLOPs, which makes the tuning process hard because one cannot navigate on a single simple objective (FLOPs). From the current presentation, it is not clear how the grid search is done, and I'd appreciate a detailed discussion of this issue for reproducibility.
- The empirical results only focus on pruning away less than ~50% of the FLOPs. It seems to me that going lower might not favor the proposed method due to the huge regularization imposed in the training loss. This seems to be a problem for the L1 regularizer as well. However, there are many methods for filter pruning and it is not clear why one would want to go for regularization-based methods. It would be great if the paper provided a wider spectrum of FLOPs (this can at least be done by comparing the proposed method to uniform pruning and L1 pruning).
- Since I can find better pruning results (available prior to the submission deadline) [2-6], I do not think it is appropriate to claim state-of-the-art. I agree the results of this paper look good, but without proper comparisons with these works (taking into account the differences in training settings), it seems to me the presented results are "comparable" to the current state-of-the-art. It would also be informative to discuss these competitive methods [2-6] in the paper.
[1] Liu, Zhuang, et al. "Rethinking the value of network pruning." ICLR 2019.
[2] Guo, Shaopeng, et al. "DMCP: Differentiable Markov Channel Pruning for Neural Networks." arXiv preprint arXiv:2005.03354 (2020).
[3] Chin, Ting-Wu, et al. "Towards Efficient Model Compression via Learned Global Ranking." arXiv preprint arXiv:1904.12368 (2019).
[4] You, Zhonghui, et al. "Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks." Advances in Neural Information Processing Systems. 2019.
[5] Dong, Xuanyi, and Yi Yang. "Network pruning via transformable architecture search." Advances in Neural Information Processing Systems. 2019.
[6] Yu, Jiahui, and Thomas Huang. "AutoSlim: Towards One-Shot Architecture Search for Channel Numbers." arXiv preprint arXiv:1903.11728 (2019).
[7] Yang, Huanrui, Wei Wen, and Hai Li. "Deephoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures." ICLR 2020.
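The width-multiplier baseline requested above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the layer widths, the function names, and the FLOPs estimate (a plain stack of 3x3 convolutions with fixed spatial size, no stride, pooling, or bias) are hypothetical and do not correspond to any network in the paper.

```python
def scale_channels(channels, multiplier, minimum=1):
    """Scale every layer's channel count by the same multiplier."""
    return [max(minimum, round(c * multiplier)) for c in channels]


def conv_macs(channels, kernel=3, spatial=32, in_channels=3):
    """Rough multiply-accumulate count for a plain stack of convolutions.

    Simplification: every layer sees the same spatial resolution.
    """
    total = 0
    prev = in_channels
    for c in channels:
        total += prev * c * kernel * kernel * spatial * spatial
        prev = c
    return total


base = [64, 128, 256]              # hypothetical layer widths
half = scale_channels(base, 0.5)   # every layer keeps 50% of its channels

# Cost relative to the unscaled network. Because both the input and the
# output width of most layers shrink, the cost falls roughly quadratically.
ratio = conv_macs(half) / conv_macs(base)
```

Since the multiplier is a single scalar, it can be searched directly to hit a target FLOPs budget, which is why this baseline is cheap to set up.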

Correctness: I have concerns regarding the claim of SOTA. The empirical methodology is largely fine and this paper has demonstrated its effectiveness over L1 regularization. However, it misses an important baseline: uniform pruning (uniformly scaling channel counts across layers), also known as the width-multiplier method.

Clarity: Yes, the paper is well written.

Relation to Prior Work: Yes, it is clear that this paper proposed a novel regularizer that encourages polarization of the weights, which does not happen naturally using previous methods (L1 regularization). However, this work has not compared to a recent work that proposed a scale-invariant regularizer [1]. [1] Yang, Huanrui, Wei Wen, and Hai Li. "Deephoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures." ICLR 2020.

Reproducibility: No

Additional Feedback: =============== Post-rebuttal =====================
I've read the rebuttal and the reviews. Here are my thoughts: First of all, the rebuttal is nicely done in my opinion and it has addressed my concerns:
- The authors have convincingly addressed my concern regarding DeepHoyer and uniform channel scaling. The comparisons in the rebuttal look great for the proposed method.
- The authors have detailed the grid search in the rebuttal and have addressed my concern regarding hyperparameter tuning.
- The authors have compared pruning over a larger pruning range and the results still favor the proposed method.
On the other hand, I agree with R3 that the derivation for the regularizer is not relevant when the goal is to optimize the summed loss, and I agree that the simplification of the explanation makes it hard to attribute the reason for the improvements. Overall, I'm willing to raise my score to 6 and argue in favor of this paper. Specifically:
1. This paper presents a novel yet simple regularization that works well in practice. The simple nature together with the provided code makes it convincing that it is reproducible.
2. The comparisons with baselines are comprehensive.

Review 3

Summary and Contributions: This paper introduces a simple regularization method, the polarization regularizer (PR), to steer channel BN scaling factors towards either 0 or a constant value `a`. The authors argue that the L1 regularizers used by previous work lack a distinction between pruned and preserved neurons, as they push all scaling factors towards 0. Therefore, the authors propose an alternative regularizer that (I believe) is piecewise linear and separately pushes the scaling factors of unimportant neurons to 0 and those of the important ones to the mean value of all scaling factors. They demonstrate that PR can produce smaller models that are more accurate than other state-of-the-art channel-wise sparsity methods under an equal FLOPs budget.

Strengths: The proposed method is simple yet effective. It shows consistent accuracy improvements over L1 regularization. It also achieves better accuracies under an equal FLOPs budget when compared to other state-of-the-art results. In addition, the authors provide an extensive theoretical analysis of PR, and prove that the regularizer has nice properties that associate the hyperparameter `t` with the proportion of pruned neurons in a piecewise-linear fashion (albeit without considering the messy loss landscape of the neural network itself), potentially allowing easy sparsity tuning (again without considering the network loss).

Weaknesses: The authors proved that PR on its own has easy sparsity tuning properties, but these properties could be irrelevant as they do not consider the added effect of the loss function `L` of the network. As shown in Figure 3, Theorem 3.2 does not seem to hold in practice. Intuitively, the method proposed by the authors is in essence a piecewise regularizer. Given a well-trained neural network, the cross-entropy loss approaches 0. Under this assumption, it is reasonable to expect that the regularizer only effectively separates the channel neurons into two parts with the threshold `argmax_{x \in [0, a]} (t|x| - |x - \bar{\gamma}|)`. The authors tried to explain the effectiveness of PR with a comparison of the Hellinger distances of scaling factor distributions in Table 3. They argued that a smaller distance indicates that the scaling factor distribution after pruning and fine-tuning with PR is closer to that of the original baseline model, which would explain a smaller accuracy drop. However, the distances were measured between distributions rather than between individual channels, and did not consider how the scaling factors evolve during training. Such an over-simplification undermines the efficacy of the Hellinger distance as the explanation for a smaller accuracy drop.
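The concern about measuring distances between distributions rather than between individual channels can be made concrete with a small sketch. It uses the standard Hellinger distance between discrete distributions, H(P, Q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2 / 2); the binning scheme, the function names, and the sample scaling factors are hypothetical, and the exact procedure behind Table 3 of the paper is not known here.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalised histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return [c / len(values) for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Hypothetical per-channel scaling factors. The second model swaps which
# channels are large and which are small, so every channel has changed.
baseline = [0.1, 0.9, 0.1, 0.9]
permuted = [0.9, 0.1, 0.9, 0.1]

distribution_distance = hellinger(
    histogram(baseline, bins=2, lo=0.0, hi=1.0),
    histogram(permuted, bins=2, lo=0.0, hi=1.0),
)
per_channel_change = sum(abs(a - b) for a, b in zip(baseline, permuted))
```

In this toy case the two histograms are identical, so the distribution-level distance is zero even though every individual channel has changed, which is the kind of information a per-channel comparison would retain.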

Correctness: It is reasonable to believe that the method and results in this paper are correct. The authors provided sufficient details in the description of the methods and hyperparameters used for reproducing the work. However, the paper could be further improved by repeating some of the experiments to give confidence intervals on the results, thus showing that the method is robust against changes in the initialization.

Clarity: The paper is overall easy to read, with minor grammatical errors.

Relation to Prior Work: The paper clearly differentiates PR from the L1 counterparts.

Reproducibility: Yes

Additional Feedback: It would also be a positive if the authors could explain how PR is designed, for instance, why `\bar{\gamma}` was chosen. The paper would also be improved by additional insights into how PR affects the fine-tuning process. This could be explored by examining the time evolution of the scaling factors being trained.
Updates after rebuttal -----------------
The authors' rebuttal did not answer the reviewer's questions regarding the improvement points mentioned above. The rebuttal did not address the reviewer's concern about whether the Hellinger distance between the distributions of scaling factors can be a valid similarity measurement between the baseline and fine-tuned models. It also did not answer the reviewer's question about the relevance of Theorem 3.2 in practice. Although the authors promise multiple runs for the experiments as suggested by the reviewer, most of the shortcomings raised by the reviewer were not addressed. For this reason, the reviewer will maintain the original overall score for the submission.

Review 4

Summary and Contributions: This paper proposes a novel regularizer (polarization) for structured pruning of networks. Unlike L1 regularization, which pushes all scaling factors towards 0, the polarization regularizer is shown to push a proportion of the scaling factors to 0 and the others to a value larger than 0, which makes it naturally easy to find a reasonable pruning threshold.

Strengths: This work theoretically analyzes the properties of the polarization regularizer and shows empirically that it simultaneously pushes a proportion of the scaling factors to 0 and the others to a value >> 0. The idea is certainly interesting and relatively simple.

Weaknesses: The polarization regularizer is just one part of the loss function, thus there is no theory for quantitatively controlling t to achieve the targeted FLOPs reduction. Given the FLOPs, the method only uses a grid search strategy to determine the hyper-parameters. On both small datasets (CIFAR10/100), experiments were conducted only on specific layers of the network.

Correctness: yes

Clarity: above average

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: Run experiments that prune channels on networks of different layers; this would give me additional confidence in the results. Include a proper description of the pruning algorithm. This would make it easier for the reader to understand how it is used.