NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7607 AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters

### Reviewer 1

Originality: The novelty lies in two aspects: 1) reformulating the pruning problem as a mask learning problem, and 2) modifying the corresponding gradient update rule. This novelty is not sufficient for a NeurIPS paper, for the following reasons. a) The masking formulation has already been adopted by [1] to learn sparse neural networks. In fact, this formulation has long been used by LASSO variants in traditional structural learning; see, e.g., [2] and the references therein. b) The proposed update rule is a straightforward binarization of the weights. Moreover, it is unclear how the partial derivative $\frac{\partial h(m_{ij})}{\partial m_{ij}}$ is calculated.

References:
[1] Louizos C, Welling M, Kingma D P. Learning Sparse Neural Networks through $L_0$ Regularization. arXiv preprint arXiv:1712.01312, 2017.
[2] Frecon J, Salzo S, Pontil M. Bilevel learning of the group lasso structure. In: Advances in Neural Information Processing Systems, 2018: 8301-8311.

Clarity: The motivation and research goal are clearly written, and the organization follows good logic.

Quality: The presentation needs to be improved, in both the mathematics and the writing. Some typical issues are listed below:
1) Items 4) and 5) in the last paragraph of the introduction should not be regarded as major contributions of this paper; they only validate the significance of the proposed method. The authors might want to merge them into 3).
2) It is misleading to call the proposed update direction a gradient, especially in the theoretical discussion in Sec. 3.3. Also, the notation $\frac{\partial \mathcal{L}_2}{\partial m_{ij}}$ should be changed.
3) In Sec. 3.3, the name 'sensitivity consistency' does not make sense to me. Replacing $w_{ij}$ with $\operatorname{sign}(w_{ij})$ makes $m_{ij}$ insensitive to the weight magnitude, rather than correcting the sensitivity.
4) Eq. (3) and Eqs. (4)-(5) are inconsistent; the authors might want to add $\lambda\mathcal{R}(W)$ to Eq. (3).
5) In the second paragraph of Sec. 3.4, the reason why weight decay damages recoverability is not clearly explained.
6) On page 3, line 1, the sentence 'The goal is to find the minimum subset $w \in \mathcal{W}$ that reduces the model accuracy.' should be rephrased: the goal is to preserve the model accuracy under the pruning process.

Significance: The experimental results show significant improvement on a variety of datasets and network architectures.

This paper presents an interesting work on network pruning. However, balancing the pros and cons, I think this paper is not good enough to be accepted as a NeurIPS paper, especially given the originality and quality issues.
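
To make the concern in point 3) under Quality concrete, here is a minimal sketch of the two update directions for a single mask parameter. It is not taken from the paper: the scalar layer form $y = w \cdot h(m) \cdot x$, the straight-through estimate $\partial h / \partial m = 1$, and the names `mask_update_directions` and `upstream` are all assumptions made for illustration only.

```python
def mask_update_directions(w, x, upstream):
    """Return (chain_rule, sign_based) update directions for one mask parameter.

    Hypothetical setup, assumed for illustration:
      y = w * h(m) * x, with a straight-through estimate dh/dm = 1.
    `upstream` stands for dL/dy arriving from the layers above.
    """
    # Plain chain rule: the direction scales linearly with the weight w.
    chain_rule = upstream * x * w
    # Sign-based variant: w is replaced by sign(w), so |w| no longer matters.
    sign = (w > 0) - (w < 0)
    sign_based = upstream * x * sign
    return chain_rule, sign_based


# Two weights of the same sign whose magnitudes differ by a factor of 1000.
small = mask_update_directions(w=0.01, x=2.0, upstream=1.0)
large = mask_update_directions(w=10.0, x=2.0, upstream=1.0)
```

Under these assumptions, the chain-rule directions in `small[0]` and `large[0]` differ by the same factor as the weights, while the sign-based directions in `small[1]` and `large[1]` are identical. This is the sense in which the mask update becomes insensitive to the weight magnitude.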