NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 2468 Scalable methods for 8-bit training of neural networks

### Reviewer 1

#### Summary of the paper

The goal of this paper is to train and quantize a model into 8 bits. This is interesting given that most existing works are based on 16 bits and people have had difficulties training 8-bit models. The paper identifies that the training difficulty comes from batch norm and proposes a variant called Range Batch-Norm, which alleviates the numerical instability that the original batch norm exhibits with quantized models. With this simple modification, the paper shows that an 8-bit model can be trained easily using GEMMLOWP, an existing framework. The paper also analyzes the proposed approach theoretically. The experiments support the paper's argument well.

#### General comments

I am on the positive side because the paper has a clear goal (training an 8-bit model), identifies the right problem (batch norm), and proposes a solution to address it (Range Batch-Norm). The paper is technically sound, and I appreciate the authors' effort to understand the problem, which lays a foundation for the proposed solution.

#### Quality

The proposed method is technically sound, simple, and effective. I roughly checked all the equations, which look good in general.

1. Given that this is a model-quantization paper, I would be interested in an evaluation of and comparison on model size and speed.
2. The analysis in Section 3 is good. However, the assumption that x^{(d)} is Gaussian distributed is probably not true in real scenarios. The input data may be Gaussian, but the inputs to subsequent layers often are not. I do not think this is a severe problem for this paper, given that properly analyzing neural networks is still a challenging theoretical problem.
3. Section 5 derives a lower bound on the expectation of the cosine distance. What about the variance of the cosine? The variance could also be an important metric for better understanding this performance guarantee.

#### Clarity

The paper is well written and easy to follow. A few comments:

1. Appendix E is an important technical detail and should be included in the main body (Section 4) of the paper. If the paper becomes too long, I suggest shortening Section 5 a little; e.g., Figure 1-right does not seem to add information while taking a lot of space.
2. Fix typos, e.g., in Figure 1-left the x-label "treshold" -> "threshold"; line 233 "Res50" -> "ResNet-50". Please be consistent with terminology and short forms. In the caption of Figure 2, "with respect the" -> "with respect to the".
3. All equations should be properly punctuated.

#### Originality

I believe Range Batch-Norm and a systematic method to quantize models to 8 bits are novel.

#### Significance

The results presented in this paper could interest researchers working on theory and quantization. Quantizing a model to 8 bits is interesting and might inspire more work in this area.
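For concreteness, below is a minimal numerical sketch of the range-based normalization idea as I read it. The per-channel handling and the scale constant C(n) = 1/sqrt(2 ln n) are assumptions about the authors' formulation, not details restated here; the appeal for low-precision training is that max/min statistics avoid the sum of squares that makes the variance estimate numerically fragile at 8 bits.

```python
import numpy as np

def range_batch_norm(x, gamma, beta, eps=1e-5):
    """Sketch of a range-based batch norm for a (batch, channels) input.

    Instead of dividing by the standard deviation, divide by the
    per-channel range (max - min) scaled by C(n) = 1 / sqrt(2 * ln n).
    This is one reading of the Range BN idea; the exact constant and
    per-channel handling in the paper may differ.
    """
    n = x.shape[0]                                  # batch size
    mu = x.mean(axis=0)                             # per-channel mean
    centered = x - mu
    value_range = centered.max(axis=0) - centered.min(axis=0)
    c_n = 1.0 / np.sqrt(2.0 * np.log(n))            # scale adjustment
    sigma_hat = c_n * value_range                   # range-based scale
    return gamma * centered / (sigma_hat + eps) + beta

# Toy usage: for roughly Gaussian inputs the range-based scale is
# proportional to the per-channel std (constant factors are absorbed by gamma).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
out = range_batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(2), out.std(axis=0).round(2))
```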

### Reviewer 2

The paper focuses on the very important problem of DNN quantization. The authors propose a method for quantizing gradients, activations, and weights to 8 bits without a drop in test accuracy. The authors noticed that layer gradients do not follow a Gaussian distribution and connected this observation to the poor performance of low-precision training. Based on this observation, they suggest replacing several 8-bit matrix multiplications with 16-bit operations during the backward pass; notably, the 16-bit operations are applied only to multiplications that are not a performance bottleneck. Another important component behind the good performance is the Range Batch-Normalization (Range BN) operator; in other words, the authors introduce a more robust version of the BN layer.

Overall, this is a very interesting and well-written paper, and the results are fairly strong. However, more experimental results on low-precision training without a drop in accuracy are needed, since this is the main contribution of the paper. The authors show that their method matches the accuracy of a full-precision model only for ResNet-18 on ImageNet. The supplementary material contains more experiments with more aggressive and, as a result, lossy quantization.

The theoretical results in Section 5 also contain several shortcomings:

1. In subsection 5.4, the authors' reasoning relies on the vectors W and eps being independent. However, the components of eps are drawn from a uniform distribution whose parameters depend on max_i |W_i|.
2. In (13), the inequality should be replaced with an approximate inequality, since the authors use an approximation. In (12), the equality should be replaced with an approximate equality for the same reason.
3. To prove (9), the authors use Jensen's inequality. The classic Jensen's inequality has the form f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)], where f is convex. In this work, the authors apply it to obtain an inequality (9) of the form \mathbb{E}[Y]\,f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)\,Y], where X and Y are not independent and f is convex. Could you please elaborate on how exactly Jensen's inequality is applied in (9)? Under the expectation in (9) there are two dependent variables (X = \|w\|_2 and Y = \|w\|_1), while f(x) = 1/x takes only one of them as its argument (f(X) = 1/\|w\|_2).

Update: I would like to thank the authors for their feedback. Since the authors provided new experimental results, I will change my score. However, I think the theoretical part should be improved. The authors write: 'We note that unlike [1] which established that this angle converges to 37 degrees only at the limit when the dimension of the vector goes to infinity, our proof shows that this is a fundamental property valid also for the more practical case of finite dimensions.' In their feedback, the authors state that Jensen's inequality is a good approximation. However, it is a good approximation only when the dimensionality is large enough (in other words, as it goes to infinity), so this statement closely resembles previous results. Moreover, since the authors use an approximation, they should not use the equality sign, because it confuses readers.
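To make the question in point 3 concrete, the contrast between the classical statement and the form that inequality (9) appears to rely on can be written out as follows; this is a restatement of the concern in the notation above, not the authors' derivation.

```latex
% Classical Jensen's inequality for a convex f:
\[
  f\big(\mathbb{E}[X]\big) \;\le\; \mathbb{E}\big[f(X)\big],
  \qquad f \text{ convex}.
\]
% The form apparently needed for (9), with X = \|w\|_2, Y = \|w\|_1,
% and f(x) = 1/x (convex on x > 0), where X and Y are dependent:
\[
  \mathbb{E}[Y]\, f\big(\mathbb{E}[X]\big)
  \;\le\; \mathbb{E}\big[f(X)\,Y\big]
  \quad\Longleftrightarrow\quad
  \frac{\mathbb{E}\big[\|w\|_1\big]}{\mathbb{E}\big[\|w\|_2\big]}
  \;\le\;
  \mathbb{E}\!\left[\frac{\|w\|_1}{\|w\|_2}\right],
\]
% which does not follow from the classical inequality alone when
% X and Y are dependent.
```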