NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 2171
Title: MetaQuant: Learning to Quantize by Learning to Penetrate Non-differentiable Quantization

Reviewer 1

Most existing neural network quantization methods use the straight-through estimator (STE) for extremely low-bit quantization tasks such as binary and ternary quantization: for ease of implementation, they assume the full-precision weights and the quantized weights share the same loss gradient. This paper proposes MetaQuant, a novel method for computing more accurate gradients from the training process itself. MetaQuant feeds the gradient Gq w.r.t. the quantized weights and the residual r of the full-precision weights into a meta network, which outputs the gradient Gr and is trained jointly, end-to-end, with the base classification network to be quantized. Three designs of the meta quantizer are provided and validated on DoReFa-Net and BWN across different image classification datasets and settings. Overall, the paper is very well written, covering the motivation, theoretical analysis, proposed design, practical implementation, and experimental settings and results. I have some questions:

(1) In the current design, the meta quantizer is shared across all layers and each weight parameter is processed independently. Since weight parameters within kernels/filters are correlated, have the authors tried other designs that harness these relations? Encoding weight correlations via better weight-sharing designs for the meta quantizer might yield improved accuracy.

(2) The meta quantizer introduces extra memory cost, as partially described in the supplementary material; what is its impact on training time (not the number of iterations)? I suggest the authors move these experiments into the main paper.

(3) I would like to see a more comprehensive comparison on ImageNet, e.g., including more state-of-the-art results on binary/ternary networks.

Post-response comment: My questions are well addressed by the author responses. I think this paper is a decent submission, and I retain my score of 7.
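For context, the STE baseline the review contrasts against, and the learned replacement it describes, can be sketched roughly as follows. This is a minimal sketch, assuming binary (sign) quantization and a tiny element-wise MLP as the shared meta quantizer; the MLP parameters and the exact form of the residual r are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def binarize(w):
    # Binary quantization (sign function); its gradient is zero almost everywhere.
    return np.sign(w)

def ste_grad(g_q, w, clip=1.0):
    # Straight-through estimator: reuse the gradient g_q w.r.t. the quantized
    # weights as the gradient w.r.t. the full-precision weights w,
    # zeroed where |w| exceeds the clipping range.
    return g_q * (np.abs(w) <= clip)

def meta_grad(g_q, r, w1, b1, w2, b2):
    # Hypothetical shared meta quantizer M_phi: a tiny MLP applied to each
    # (Gq, r) pair independently, producing Gr. The parameters (w1, b1, w2, b2)
    # play the role of phi and would be trained jointly with the base network;
    # their shapes here are illustrative only.
    x = np.stack([g_q.ravel(), r.ravel()], axis=1)  # one (Gq, r) pair per weight
    h = np.maximum(x @ w1 + b1, 0.0)                # ReLU hidden layer
    g_r = h @ w2 + b2
    return g_r.reshape(g_q.shape)
```

With STE, every weight inside the clip range simply copies g_q; the meta quantizer instead lets the mapping from (Gq, r) to Gr be learned. Because each weight is processed independently here, this sketch also illustrates why question (1) above suggests encoding weight correlations.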

Reviewer 2

The main issues of the paper lie in the training of the meta quantizer (mostly discussed in Section 4.2).

- In Eq. (8), the term \partial \tilde{W} / \partial \phi is needed in the backward phase. However, from the solution setup described by the authors, \tilde{W} occurs before \phi in the computation graph. I fail to see how \partial \tilde{W} / \partial \phi can thus be computed, or even what it represents; a clarification from the authors would be appreciated. Note that auto-differentiation may not crash when this gradient is requested (which could explain why the method runs), but that is not sufficient evidence: a deeper explanation of what the term means and how it is computed is required.

- Still in Eq. (8), there seems to be a chicken-and-egg problem. The term \partial L / \partial \tilde{W} is replaced by M(...), the output of the meta quantizer. This suppresses the dependence on the loss function L, so the updates based on the gradients computed in Eq. (8) do not in fact operate to minimize L. The authors should clarify how the meta quantizer is linked to the loss function.

- This brings me to my next point: should the loss function of the base network be used for training the meta quantizer at all? This seems not to be well thought through. The two networks have different tasks, and the meta quantizer is a regressor. A convincing discussion is needed to address this issue, which I consider the main weakness of the paper as it stands.

- Finally, I would like to point out that there are many writing imprecisions that severely harm the quality of the paper. Some symbols are used without being defined/introduced (e.g., the boldface 1). The notation is inconsistent: for instance, L denotes both the number of layers and the loss function. There are typographic mistakes, e.g., 'outperforms' does not require a hyphen, and the same applies to 'fully connected', etc.
Post-response comments: I thank the authors for their response. The feedback has helped me better articulate my issue with the proposed method, which I believe is very serious. In the rebuttal, the authors show at line 22 how the computation occurs: "phi -> delta W -> W tilde -> W hat -> L". Clearly, from Figure 1 of the rebuttal document, the link W tilde -> W hat is a quantization operation, which is non-differentiable. Back-propagating gradients from the output of the main network to the meta quantizer therefore suffers from the same non-differentiability problem the authors so vehemently claim to have solved. Further, the rebuttal directly contradicts a key claim in the main paper at line 140: "Therefore, M_phi is connected to the final quantization training loss, which receives gradient update on phi backpropagated from the final loss... MetaQuant not only avoids the non-differentiability issue for the parameters in the model, but also...". This backpropagation goes through a non-differentiable step along the way (W tilde <- W hat), so the problem is not solved; it is simply delegated from the main network to the meta quantizer, making the contribution void. I have therefore decided to decrease my score by one point (from 5 to 4).

Reviewer 3

Originality: The approach in the paper is novel and one that I haven't seen before. It provides an end-to-end training platform. I see it as essentially using a neural network to model the residual dynamics that get thrown away under the STE model.

Quality: The paper is thorough. It motivates the problem well, does a good job of explaining its approach, and lays out experiments on a number of networks and baselines.

Clarity: The paper is clear in its explanation of the problem, its approach, and the results.

Significance: Limited, since results are shown only on CIFAR benchmarks. It would be interesting to see results on ImageNet.

Post-response comment: Thanks to the authors for pointing out their benchmark on ImageNet, which I had overlooked during the original review. I have raised my score to 8.

Reviewer 4

In a weight-quantized network, the quantization function is usually non-differentiable, yet many methods need full-precision weights for updates. Previous methods usually rely on heuristics to transform the gradient w.r.t. the quantized weights into a gradient w.r.t. the full-precision weights. This paper proposes to learn a meta network to predict this transform, and the authors propose three ways to parameterize the meta network. The paper is overall easy to follow. My main concerns are with the experimental settings and results.

1. In line 203, the authors say that they report the "best test accuracy". This is not fair.

2. It is not clear under what conditions the proposed MetaQuant works. From the experimental results in Appendix A of the supplementary material, the proposed MetaQuant-FC sometimes performs very poorly. On the other hand, the previous STE methods, though they may sometimes perform worse, achieve relatively stable performance across all reported tasks.

3. In Figure 3(b), when Adam is used, the STE method shows similar or even better convergence behavior than the proposed MetaQuant in the early stage of training. However, it is run for fewer iterations than MetaQuant, so it is hard to draw a comparison between the two kinds of methods. For a fair comparison, all competing methods should be run for the same number of iterations. It is also unclear whether the numbers reported in Tables 1-4 are obtained with the same number of iterations.