NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3438
Title: Global Sparse Momentum SGD for Pruning Very Deep Neural Networks

Reviewer 1

Update: The authors justified the choice of competitor in the empirical evaluation (though it would be better to add this to the body of the paper in the camera-ready version, if accepted). I find the technique interesting. Although I think the results are exploratory and somewhat preliminary, I think it is important for the NeurIPS community to become familiar with them.
--------------
The authors suggest a new gradient flow for pruning large DNN models. They identify and address major issues of current approaches, such as 1) pruning followed by fine-tuning to recover accuracy and 2) pruning by custom learning (mostly custom regularizers). The authors introduce GSM, a new approach that does not require fine-tuning afterwards and can be solved by means of vanilla SGD. GSM keeps only the top Q values of the gradient, ranked by the suggested metric (first-order Taylor), |dL/dw * w|. This way, weight decay gradually zeroes out all the redundant parameters. The authors provide an experimental study of their method.

Strengths of the paper:
- A nice overview of the problem and its motivation (drawbacks of existing methods).
- A simple and straightforward idea behind the algorithm.

Weaknesses of the paper:
- No theoretical guarantees for convergence or pruning.
- Although the experiments on the small networks (LeNet300 and LeNet5) are very promising (similar to DNS [16] on LeNet300, significantly better than DNS [16] on LeNet5), the ultimate goal of pruning is to reduce the compute needed for large networks.
- On the large models the authors only compare GSM to L-OBS, and no motivation is given for the choice of the competing algorithm. Based on the smaller experiments it should be DNS [16], the closest competitor, rather than L-OBS, which performed quite poorly compared to the others.
- The authors state that GSM can be used for automated estimation of pruning sensitivity. 1) While the graphs (Fig 2) show that GSM indeed correlates with layer sensitivity, it was not shown how to actually predict sensitivity, i.e. there is no algorithm that takes a model as input, runs GSM, processes the GSM result and outputs a sensitivity for each layer. 2) The authors don't explain in detail how the ground truth for sensitivity is obtained. Lines 238-239 just say "we first estimate a layer's sensitivity by pruning ...", with no details on how the actual pruning was done.

Comments:
1) Table 1, Table 2, Table 3: the columns "origin/remain params | compression ratio | non-zero ratio" all duplicate the same information; only one of them is enough.
2) Figure 1, plots 3 and 4: the two lines are indistinguishable (I am not even sure there are two, this is just a guess). It would be better to plot the relative error of the approximation rather than the actual values. Also, why are plots 3 and 4 shown for only one value of beta while plots 1 and 2 are shown for three values?
3) All figures are unreadable in black and white.
4) Pruning mostly matters for large networks, which are usually trained in distributed settings. The authors do not mention the potential need to find the global top Q values of the metric over the average of the gradients. This could break a large portion of acceleration techniques, such as quantization and sparsification.
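To make the update rule summarized in this review concrete, here is a minimal NumPy sketch of one step. It is only an illustration of the description above, not the authors' implementation: the function name `gsm_step`, the hyperparameter names `lr`, `beta`, `weight_decay`, the flat parameter vector, and the exact form of the momentum buffer are all assumptions and have not been checked against the paper's equations.

```python
import numpy as np

def gsm_step(w, grad, z, q, lr=0.1, beta=0.9, weight_decay=1e-2):
    """One hypothetical GSM-style update on a flat parameter vector.

    w    : parameters (all layers flattened together, so the top-Q is global)
    grad : gradient of the objective dL/dw, same shape as w
    z    : momentum buffer, same shape as w
    q    : number of parameters that keep their objective gradient (0 <= q <= w.size)
    """
    # First-order Taylor saliency described in the review: |dL/dw * w|.
    saliency = np.abs(grad * w)

    # Binary mask that is 1 for the q most salient parameters, 0 elsewhere.
    mask = np.zeros_like(w)
    if q > 0:
        top = np.argpartition(saliency, -q)[-q:]
        mask[top] = 1.0

    # Every parameter receives weight decay; only the top-q also receive
    # the objective gradient. The rest are pulled steadily toward zero.
    z = beta * z + weight_decay * w + mask * grad
    w = w - lr * z
    return w, z, mask
```

With `q` equal to the number of parameters this reduces to ordinary momentum SGD with weight decay. Note that the mask is computed over the whole flattened vector, which is why the distributed-training concern in comment 4 arises: each worker would need the same global ranking.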

Reviewer 2

Originality: This paper has two major drawbacks in terms of originality: 1) the field of NN pruning is quite busy, with many related papers, and 2) it does not compare against the following very similar paper: "Faster gaze prediction with dense networks and Fisher pruning" by Theis et al., 2018. That paper uses the Fisher information to prune features during gradient descent, subject to user-preset computational constraints.

Quality: The paper is technically interesting, but makes one leap which is unclear: the authors claim to be model-agnostic and to put all their assumptions into the SGD method instead. However, the curvature calculations (via Taylor approximations) are model-dependent and actually exploit model structure to determine whether a weight should be pruned. It would be great to relate this to the Hessian and the Fisher information (see Fisher pruning) to clarify the relationship to the model. Apart from that, another drawback of the paper is the need to specify the compression ratio, which is quite an unnatural quantity to set by hand and is not really what a user wants to control. Constraints typically exist on speed or memory, not on compression ratios. The experiments are pretty well executed. I particularly enjoyed the study of feature re-activation, which examines a specific property of this model.

Clarity: The paper is well written and concise.

Significance: This paper manages to avoid complex criteria or multi-stage models in achieving its goal of sparsification. In the long term, this can be an important property for making pruning a pragmatic modeling tool enshrined in software.

Reviewer 3

(1) This paper is well written.
(2) To my knowledge, most of the preceding methods only prune relatively shallow models like AlexNet and VGG, where it is possible to set the layer-wise pruning rates manually by trial and error. The proposed method requires no pre-defined layer-wise pruning rates, which is especially valuable for very deep models.
(3) The proposed method (GSM) achieves lossless pruning. The classic L1/L2-based pruning method [Han et al. Learning both ...] uses L1/L2 regularization to reduce the magnitude of parameters (at the cost of compromised accuracy) and then prunes the parameters (reducing accuracy again). By contrast, the model encounters no accuracy drop when pruned after GSM training.
(4) The proposed method is intuitive and easy to understand. The method utilizes momentum in a natural and creative way: to accelerate the movement of a parameter in a constant direction.
(5) The main reasons for me to vote for accepting the paper are the novelty and potential insights. The idea of directly modifying the gradients to accomplish a certain task is intriguing. We usually customize the loss function to modify the gradients indirectly and so control the direction of training, but we rarely transform the gradients directly.
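The point in (4) about momentum accelerating movement in a constant direction can be illustrated with a toy calculation. The sketch below is not taken from the paper: the function `steps_to_shrink` and all hyperparameter values are hypothetical. It tracks a single weight that receives only the weight-decay gradient and counts how many updates it takes to shrink below a tolerance, with and without momentum.

```python
def steps_to_shrink(beta, lr=0.1, weight_decay=1e-2, w0=1.0,
                    tol=1e-3, max_steps=100_000):
    """Count updates until |w| < tol when the only gradient is weight decay.

    beta = 0.0 gives plain SGD; beta > 0 gives heavy-ball momentum.
    Returns max_steps if the tolerance is never reached.
    """
    w, z = w0, 0.0
    for step in range(1, max_steps + 1):
        z = beta * z + weight_decay * w   # momentum buffer accumulates decay
        w = w - lr * z                    # parameter update
        if abs(w) < tol:
            return step
    return max_steps

plain = steps_to_shrink(beta=0.0)
with_momentum = steps_to_shrink(beta=0.9)
print(plain, with_momentum)
```

Because the decay gradient keeps pointing toward zero, the momentum buffer builds up and the weight shrinks in far fewer steps than under plain SGD; for a slowly varying gradient the speed-up is roughly a factor of 1/(1 - beta).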