NeurIPS 2020

### Review 1

Summary and Contributions: The paper introduces a new differentiable method for searching discrete low-bit weights in quantized networks. A weight searching algorithm is used to avoid the non-differentiability of quantization functions. To further reduce the performance gap after searching, a temperature factor and state batch normalization are proposed to keep training and testing consistent. Experiments on both high-level and low-level tasks demonstrate the effectiveness of the proposed methods.

Strengths: The proposed weight searching method can replace conventional methods that use approximate gradients to optimize latent weights. The searching approach is novel and interesting. The formulation of how to optimize the distribution is clear. A strong point is that the discrete weights could also be learned together with the distribution, which may motivate follow-up work. The experimental results are convincing and reach the state of the art. The experiments cover various networks and tasks, and show promising results under different bit-width settings.

Weaknesses: This paper introduces a novel weight searching algorithm and shows its effectiveness, but I have several questions and suggestions. There are several typos to fix, e.g. Alg. 1, 5, '13' should be 'Eq. 13'. The state batch normalization is motivated by handling the quantization gap based on a threshold, and the authors assume the distribution will not change. Is the distribution irreversible? Please discuss the relationship between the proposed method and more recent works, such as Latent Weights Do Not Exist: Rethinking Binarized Neural (NeurIPS 2019).

Correctness: yes

Clarity: yes, well done

Relation to Prior Work: yes, related work is well done too.

Reproducibility: Yes

### Review 2

Summary and Contributions: In this manuscript, authors consider the problem that non-differentiable quantization methods increase the optimization difficulty of quantized networks. By representing quantized model weights as the expectation of a parameterized probability distribution, end-to-end gradient-based optimization can be adopted. The quantization gap, between floating-point and quantized model weights, is reduced by dynamically adjusting the softmax temperature, and state batch normalization is proposed to account for the inconsistent statistics between FP32 and discrete model weights.
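The mechanism summarized above can be illustrated with a minimal sketch. It assumes a softmax over per-weight logits divided by a temperature; the candidate set `values`, the `logits`, and all function names are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_weight(logits, values, temperature):
    """Training-time weight: expectation of the candidate values."""
    probs = softmax(logits, temperature)
    return sum(p * v for p, v in zip(probs, values))

def hard_weight(logits, values):
    """Inference-time weight: the single most probable candidate."""
    best = max(range(len(values)), key=lambda i: logits[i])
    return values[best]

# Hypothetical 2-bit candidate set and logits for one weight.
values = [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0]
logits = [0.2, 1.0, 0.5, -0.3]

# The gap between the expectation and the selected value shrinks
# as the temperature is lowered.
for t in (1.0, 0.1, 0.01):
    gap = abs(soft_weight(logits, values, t) - hard_weight(logits, values))
    print(f"temperature={t}: quantization gap={gap:.6f}")
```

The expectation is differentiable with respect to the logits, while the argmax selection is what a quantized model would use at inference; lowering the temperature brings the two together.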

Strengths: 1. The idea of representing discrete weights as the expectation of a parameterized probability distribution is novel and reasonable, and indeed resolves the non-differentiability of discrete weights. The quantization gap is gradually reduced by adjusting the temperature factor in the softmax function. 2. By maintaining two groups of statistics for full-precision and quantized weights during training, the state batch normalization method explicitly considers the difference between a sharp softmax distribution and a one-hot distribution, which improves the final inference accuracy with quantized weights.
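The idea of keeping two groups of statistics can be sketched as follows. This is a simplified scalar batch normalization without affine parameters; the class name `TwoStateBatchNorm` and the state labels are hypothetical, and the paper's actual layer may differ.

```python
class TwoStateBatchNorm:
    """Hypothetical scalar batch norm with separate running statistics per state."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # One [running_mean, running_var] pair for each weight state.
        self.stats = {"continuous": [0.0, 1.0], "quantized": [0.0, 1.0]}

    def forward(self, batch, state, training=True):
        mean, var = self.stats[state]
        if training:
            # Normalize with batch statistics and update only this state's running values.
            batch_mean = sum(batch) / len(batch)
            batch_var = sum((x - batch_mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.stats[state] = [(1 - m) * mean + m * batch_mean,
                                 (1 - m) * var + m * batch_var]
            mean, var = batch_mean, batch_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]

bn = TwoStateBatchNorm()
# Hypothetical pre-activations produced by the soft and the one-hot weights.
bn.forward([0.9, 1.1, 1.0], state="continuous")
bn.forward([0.5, 1.5, 1.3], state="quantized")
print(bn.stats)
```

At inference only the `"quantized"` statistics would be read, so activations are normalized with statistics gathered under the same weights that are deployed.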

Weaknesses:

1. The decreasing temperature strategy. This plays an important role in balancing optimization efficiency against reducing the quantization gap. Is it possible to derive some theoretical guidelines for designing the decreasing scheduler? The three schedulers (“exp”, “cos”, and “linear”) used in the experiments seem quite heuristic, and the best one, “exp”, only outperforms the other two after 200 epochs and ends up with a much better solution.
2. If the major dilemma is to avoid local minima while reducing the quantization gap, is it possible to adopt an SGDR-like [1] schedule for adjusting the temperature factor? The quantization gap should be quite small at the end of each period, and local minima can be avoided with each warm restart.
3. The state batch normalization method requires maintaining two groups of statistics during the whole training process. For each mini-batch, this requires two forward passes, one with FP32 weights and the other with quantized weights. This introduces extra computation overhead. Is it possible to compute statistics for the quantized weights only after the training process is completed?
4. Do you have SLB w/o SBN results on CIFAR-10? It would be easier to analyze which part contributes more to the final improvement if such results were available.
5. For the CIFAR-10 dataset, 500 epochs are used to train the quantized network. This is significantly more than in some of the baseline methods. For instance, LQ-Net only uses 200 epochs. In this case, the comparison may not be quite fair. Does the proposed method really require so many epochs (maybe due to the optimization efficiency), or can it also be trained well using far fewer epochs?

[1] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017.
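The scheduler question in points 1 and 2 can be made concrete with a rough sketch. The decay formulas below are generic textbook forms and an assumption on my part, not the definitions used in the paper; `t_start`, `t_end`, and `period` are hypothetical parameters, and the restart rule simply reuses the SGDR idea of resetting the schedule every `period` epochs.

```python
import math

def linear_decay(s, t_start, t_end):
    """Straight-line interpolation from t_start (s=0) to t_end (s=1)."""
    return t_start + (t_end - t_start) * s

def cosine_decay(s, t_start, t_end):
    """Half-cosine interpolation from t_start (s=0) to t_end (s=1)."""
    return t_end + 0.5 * (t_start - t_end) * (1.0 + math.cos(math.pi * s))

def exp_decay(s, t_start, t_end):
    """Geometric interpolation from t_start (s=0) to t_end (s=1)."""
    return t_start * (t_end / t_start) ** s

def temperature(epoch, total_epochs, schedule, t_start=1.0, t_end=0.01, period=None):
    """Temperature at `epoch`; a non-None `period` enables warm restarts."""
    if period is None:
        s = epoch / max(total_epochs - 1, 1)
    else:
        # Progress within the current period; jumps back to 0 at each restart.
        s = (epoch % period) / max(period - 1, 1)
    return schedule(s, t_start, t_end)

for name, fn in (("linear", linear_decay), ("cos", cosine_decay), ("exp", exp_decay)):
    samples = [round(temperature(e, 500, fn), 4) for e in (0, 125, 250, 375, 499)]
    print(name, samples)

# Warm restarts every 100 epochs: the temperature returns to t_start at epoch 100.
print([round(temperature(e, 500, exp_decay, period=100), 4) for e in (0, 99, 100, 199)])
```

With restarts, the temperature reaches `t_end` at the end of every period, so the quantization gap is small there, and then jumps back to `t_start`, which is the behavior the reviewer suggests could help escape local minima.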

Correctness: Yes, the proposed method is reasonable and empirical evaluation results have verified its effectiveness on several tasks and benchmark datasets.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Please address the issues listed in the “Weaknesses” section. ---- The authors have resolved most of my previous concerns in their rebuttal, especially regarding the temperature decaying strategy and calibrating BN statistics. It would be nice to include results achieved with the SGDR-based temperature scheduler, as mentioned in the rebuttal. I would like to raise my score to 6 and recommend acceptance.

### Review 3

Summary and Contributions: This paper proposes to train a low-precision quantized network by searching over the possible values for the network parameters. The proposed method achieves state-of-the-art results on the CIFAR-10 and ImageNet datasets.

Strengths: This paper proposes a novel way to find the quantized weights' discrete value, which addresses the imprecise gradient problem in the STE-based methods. Moreover, the authors propose a temperature scheduling method to minimize the difference between the weights during training and inference.

Weaknesses: The proposed method consumes more device memory because it requires storing the probability distribution of each weight. In the case of 4-bit weights, the proposed method requires 16 times more device memory to store the weights than STE-based approaches. As the size of the weights becomes larger, the total training time is presumably longer as well. Moreover, as stated in the CIFAR-10 experiment, the proposed method takes 500 epochs. It is not clear whether the proposed method would require more epochs to converge than STE-based methods.
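The 16x figure follows from a simple count, sketched below. It assumes one stored parameter per candidate value per weight and ignores optimizer state and activations; the function names and the layer size are hypothetical.

```python
def candidate_count(bits):
    """Number of discrete values representable with `bits` bits."""
    return 2 ** bits

def training_param_count(num_weights, bits):
    """Parameters stored during search: one per candidate value per weight."""
    return num_weights * candidate_count(bits)

def overhead_ratio(bits):
    """Storage relative to one latent value per weight (as in STE-based training)."""
    return candidate_count(bits)

num_weights = 1_000_000  # hypothetical layer size
for bits in (1, 2, 4):
    stored = training_param_count(num_weights, bits)
    print(f"{bits}-bit: {stored} parameters, {overhead_ratio(bits)}x one latent value per weight")
```

Under this counting the overhead grows exponentially with the bit-width, which is why the concern is most relevant for the higher-precision settings.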

Correctness: The proposed method is sound and the empirical methodology is correct. However, some details about the empirical study are missing.

Clarity: The paper is clear and well written.

Relation to Prior Work: The related works are discussed clearly.

Reproducibility: No

Additional Feedback: Can you please provide the rationale for setting the low-bit set V to be in the range of -1 to 1? When the bit-width is higher than 1, how does this choice affect the accuracy of the quantized network? The reported results on the CIFAR-10 dataset use 500 epochs, whereas other methods could take far fewer epochs. For example, LQ-Net uses 200 epochs for the CIFAR-10 dataset. Moreover, the training hyperparameters are missing for the ImageNet experiments. For both the CIFAR-10 and ImageNet experiments, can you comment on the proposed method's total training time? For the CIFAR-10 experiments, can you please show the error bars of the experiments? Besides, the proposed method is superior to the STE-based methods when the precision of the weights is low. Can you also comment on when the proposed method is significantly better than the existing STE-based alternatives?

------------ after rebuttal -------------

The authors have responded to most of my questions in their rebuttal. Although the improvement over the STE-based method for quantization with multiple bits isn't very impressive, the paper does provide an interesting solution for weight quantization. I would like to keep my score at 6 and recommend acceptance.

### Review 4

Summary and Contributions: This paper introduces a method that uses an auxiliary trainable network to build a quantized neural network. A quantization loss is proposed to minimize the gap between the two networks. Two datasets are used to validate the proposed method. A similar framework was proposed at IJCAI 2019.

Strengths: The idea is interesting and the topic is very important for resource-efficient computation.

Weaknesses:

1) A similar idea of learning an auxiliary differentiable network has also been introduced in the following paper. The main difference between this paper and that reference is that multiple bits are learned for each code here, while binary weights and representations will undoubtedly be more cost-efficient. More importantly, the authors did not discuss this similar reference. [1] Binarized Neural Networks for Resource-Efficient Hashing with Minimizing Quantization Loss. IJCAI, 2019
2) I am very confused by Eq. (6), which is the most important equation in this paper. According to Eq. (1), the values $v$ are discrete numbers while $p$ is the probability that the elements in $W$ belong to the $i$-th discrete value. As an example, if we assume $v \in \{1, 2, 3, 4\}$ and $p = \{0.4, 0.2, 0.2, 0.2\}$, the result would be 2.2, which is closer to 2. So the authors should carefully check and correct this equation or add more explanation.
3) The experimental setting is not clear. An ablation study is missing, so it is not clear which part is more important.

------------- after reading the rebuttal

I have read the rebuttal and the review comments from others. Although the clarity of this paper should be carefully improved, the novelty issue, particularly the differences between this paper and the reference I mentioned, has been well addressed by the authors. After carefully re-reading the paper, I found that the non-differentiable sign function is not used in this paper, so the subsequent optimization will be much easier than for methods that use the sign function, and the paper seems to be more distinctive than I originally thought. Moreover, considering that such a weight search method can obtain SOTA performance on ImageNet, I would like to increase my score to 5. I suggest the authors include more recent related works in the final version and highlight the novelty accordingly.
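The arithmetic in point 2 of the weaknesses can be written out explicitly. This only restates the reviewer's example; Eq. (6) of the paper is not reproduced here, and the symbols $\bar{v}$ and $i^{*}$ are introduced purely for illustration.

```latex
\bar{v} = \sum_{i=1}^{4} p_i v_i
        = 0.4 \cdot 1 + 0.2 \cdot 2 + 0.2 \cdot 3 + 0.2 \cdot 4
        = 2.2,
\qquad
i^{*} = \arg\max_i p_i = 1 \;\Rightarrow\; v_{i^{*}} = 1 .
```

With these numbers the expectation (2.2), its nearest candidate (2), and the most probable candidate (1) all differ, which appears to be the mismatch the reviewer is pointing at.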

Correctness: Some equations are hard to understand. The experiment is not clear.

Clarity: Equations need to be explained in detail. Some of them (Eq. 6) need to be corrected.

Relation to Prior Work: The main drawback is that this paper does not discuss the highly similar prior work.

Reproducibility: No