Summary and Contributions: The paper proposes an idea to compensate for pruned neurons by merging their contributions (approximately) into the retained neurons (specifically, into the nearest retained neuron according to cosine similarity) for ReLU networks. This ensures that pruning does not lose too much accuracy, and the pruned network retains most of the performance of the original network without fine-tuning. Experiments conducted on the Fashion-MNIST and CIFAR datasets show the benefits of the method.
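For concreteness, the compensation step as I understand it can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the function name, the length-ratio definition of the merging scale, and the accumulation into the next layer's weights are my own rendering of the idea.

```python
import numpy as np

def merge_pruned_neurons(W1, W2, keep_idx):
    """Hypothetical sketch of the compensation idea.

    W1: (n, d) weights producing layer-l activations (rows = neurons)
    W2: (m, n) weights consuming layer-l activations (columns = neurons)
    keep_idx:  indices of the retained neurons
    """
    pruned_idx = [i for i in range(W1.shape[0]) if i not in set(keep_idx)]
    W1_kept = W1[keep_idx]
    W2_new = W2[:, keep_idx].copy()
    # Unit-normalised retained rows, for cosine similarity.
    norms = np.linalg.norm(W1_kept, axis=1, keepdims=True)
    unit = W1_kept / norms
    for p in pruned_idx:
        w = W1[p]
        sims = unit @ (w / np.linalg.norm(w))    # cosine similarity to each retained neuron
        j = int(np.argmax(sims))                 # most similar retained neuron
        scale = np.linalg.norm(w) / norms[j, 0]  # length ratio as the merging scale (my choice)
        # Accumulate, so that several pruned neurons may merge into the same j.
        W2_new[:, j] += scale * W2[:, p]
    return W1_kept, W2_new
```

When a pruned neuron is exactly a positive multiple of its nearest retained neuron, this merge preserves the network function through the ReLU; otherwise it is an approximation.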
Strengths: 1) The idea of compensating for pruned neurons is interesting and useful in practice. The common drawback of pruning is that it reduces the learned accuracy, so one needs to retrain the network; this approach partly alleviates the need for retraining in ReLU networks. 2) Overall, the paper is well written and the method is clearly explained.
Weaknesses: Some of the weaknesses/comments in my opinion are as follows:
1) The idea works only with ReLU; please mention this upfront in the abstract and introduction. It is not clear how it can be extended to other activation functions. Please comment.
2) In Alg. 1, the "scale" is simply assigned to z without aggregating into the existing value of z. So it is not clear how the situation is handled when two pruned neurons are found to be closest to the same retained neuron. Please clarify this.
3) Please consider discussing "pruning-at-initialization" methods [a,b], which would have the same training time and complexity as the proposed merging method, since both approaches train the network only once. It would be interesting to see how this approach compares against those.
4) The benefits of this method could be deemed limited, since it performs similarly to vanilla pruning (which does no merging) once fine-tuning is allowed. But I agree that the method focuses on pruning without fine-tuning.
[a] Lee, N., Ajanthan, T. and Torr, P.H., 2019. SNIP: Single-shot network pruning based on connection sensitivity. ICLR.
[b] Frankle, J. and Carbin, M., 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ICLR.
Post-rebuttal update: I was confused about the Alg. 1 update to z, which the authors clarified. I recommend that the authors also improve the clarity of this point in the updated version, in addition to the other promised discussions. I was positive and I retain my rating.
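To make point 2) concrete, here is a toy illustration of the situation I was asking about. The names z and "scale" follow the notation of Alg. 1; the indices and numbers are made up for illustration only.

```python
import numpy as np

# Suppose two pruned neurons both turn out to be most similar to the same
# retained neuron j = 2, with scales 0.5 and 0.25 respectively.
merges = [(0, 2, 0.5), (1, 2, 0.25)]   # (pruned neuron, chosen j, scale)

z_assign = np.zeros(4)
for _, j, scale in merges:
    z_assign[j] = scale        # plain assignment: the first scale is overwritten

z_accum = np.zeros(4)
for _, j, scale in merges:
    z_accum[j] += scale        # accumulation: both contributions are kept

print(z_assign[2], z_accum[2])  # 0.25 0.75
```

The pseudocode as written suggests the first behaviour, which is why I asked how collisions are handled.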
Correctness: The method, claims and experimental setup are correct.
Clarity: The paper is clearly written.
Relation to Prior Work: Most of the related work is adequately discussed. Please consider discussing pruning-at-initialization methods, as mentioned in the weaknesses.
Additional Feedback: Please consider releasing the code for reproducibility.
Summary and Contributions: This paper proposes neuron merging, which aims to compress two FC layers or conv layers across a ReLU layer. The basic idea is that the FC layer before the ReLU can be decomposed into two matrices, one of which can be merged into the FC layer after the ReLU if that matrix satisfies certain conditions. Similar results are obtained for conv layers. The authors provide algorithms for this specific decomposition. The method is verified on a set of network architectures on the MNIST and CIFAR-10/100 datasets.
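The condition in question, as I read it, is that the factor Z must be nonnegative with at most one nonzero entry per row, so that it commutes with the ReLU and can be folded into the next layer. A small sketch of that property (my paraphrase, with made-up weights):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0)

# Z: nonnegative, at most one nonzero entry per row (a scaled selection),
# so relu(Z @ h) == Z @ relu(h) for any pre-activation h.
Z = np.array([[2.0, 0.0],
              [0.0, 0.5],
              [3.0, 0.0]])
W1_kept = np.array([[1.0, -1.0],
                    [0.5,  2.0]])    # retained first-layer weights
W1 = Z @ W1_kept                     # the decomposed first layer
W2 = np.array([[1.0, -2.0, 0.5]])    # second layer

x = np.array([0.3, -1.2])
original = W2 @ relu(W1 @ x)
merged = (W2 @ Z) @ relu(W1_kept @ x)   # Z folded into the next layer
assert np.allclose(original, merged)
```

Because Z commutes with the ReLU, the merged two-layer network computes exactly the same function with fewer intermediate neurons.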
Strengths: + The idea looks very interesting, and the derivation sounds correct.
Weaknesses: - If a convolution layer can be decomposed into a lightweight conv layer plus a 1*1 conv layer, Corollary 1.1 is easier to prove, following Theorem 1. Interestingly, this property was proved in another paper, "Network Decoupling: From Regular to Depthwise Separable Convolutions", in BMVC 2018. - Most modern CNNs have just one FC layer; hence, the merging of conv layers is more interesting and practical. - Another special case thus arises: how would the method handle networks with depthwise separable layers such as MobileNet (v1/v2/v3)? - The set of networks tested is very limited, and the datasets are also small. How about other network architectures such as DenseNet and modern NAS-searched networks? Are there any results on ImageNet?
Correctness: Some special cases like depthwise convolution are not discussed in the paper.
Clarity: It is generally easy to follow, and the illustration is very clear.
Relation to Prior Work: See my weakness points.
Additional Feedback: I give the rating based on the current status. However, I basically love the idea. I would seriously consider upgrading my rating if the authors could address my concerns in the rebuttal. Update after rebuttal: I appreciate the authors providing additional experiments to address my concerns. If this paper is accepted, the authors are strongly encouraged to consider the generalization capability of this work.
Summary and Contributions: This paper proposes an approach to merge the factor that is generated from pruning the previous layer into the next layer. It proves the condition for which this pruning is exact. Experiment results show that this simple procedure can improve the results of multiple state-of-the-art network pruning algorithms that do not involve fine-tuning.
Strengths: The methodology is simple and sound, and it significantly improves the results of previous algorithms.
Weaknesses: This is not necessarily a weakness, but I was wondering what would happen if a proper matrix decomposition algorithm, such as non-negative matrix factorization, were applied as the decomposition algorithm in Alg. 1 instead of the current "MostSim" heuristic. It is unclear whether the MostSim heuristic is optimal, and the authors did not discuss that aspect. Given that the improvement is already significant, I think it is OK to take the paper as-is, but the paper would be strengthened if NMF-type algorithms were compared against the MostSim heuristic.
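To illustrate the suggestion: since weight matrices have negative entries, plain NMF does not apply directly, but a semi-NMF-style relaxation does. The sketch below (entirely my own, not from the paper) expresses each neuron as a nonnegative combination of all retained neurons via nonnegative least squares, instead of matching each pruned neuron to a single most-similar one.

```python
import numpy as np
from scipy.optimize import nnls

def nnls_coefficients(W1, keep_idx):
    """Hypothetical alternative to the MostSim heuristic: solve, for each
    neuron i, min ||basis.T @ z - W1[i]|| subject to z >= 0, where the basis
    is the set of retained neurons (a semi-NMF-style relaxation).

    Caveat: the resulting rows of Z may have several nonzeros, so the exact
    ReLU-commuting condition (one nonzero per row) no longer holds and the
    merge becomes approximate rather than exact.
    """
    basis = W1[keep_idx]                          # (k, d) retained neurons
    Z = np.zeros((W1.shape[0], len(keep_idx)))
    for i in range(W1.shape[0]):
        Z[i], _ = nnls(basis.T, W1[i])            # nonnegative least squares
    return Z
```

A comparison of this kind of decomposition against MostSim would directly address my question about optimality.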
Correctness: I haven't checked line-by-line, but I think the proofs are correct.
Clarity: It's well-written and easy to read.
Relation to Prior Work: Prior work was addressed properly as far as I know. I do not work in the area of pruning, hence I may have missed recent work.