NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 2965 Training Quantized Nets: A Deeper Understanding

### Reviewer 1

This paper investigates, theoretically and numerically, why the recent BinaryConnect (BC) works better than more traditional rounding schemes such as Stochastic Rounding (SR). It proves that for convex functions, (the continuous weights in) BC can converge to the global minimum, while SR methods fare less well. It is also proven that, below a certain value, learning in SR is unaffected by decreasing the learning rate, except that the learning process is slowed down.

The paper is, to the best of my understanding:
1) Clear, modulo the issues below.
2) Technically correct, except for some typos.
3) Dealing with a significant topic that has practical implications: understanding how to train neural networks with low precision.
4) Novel and original, especially considering that most papers on this subject do not contain much theory.
5) Presenting interesting results. Specifically, I think it helps clarify why it is so hard to train with SR as opposed to BC (it would be extremely useful if one could use SR, since then there would be no need to store the full-precision weights during training).

Some issues:
1) It is confusing that w_r and w_b are both denoted w in section 4. For example, since the BC bounds are on w_r, it should be clarified that F(w_b) behaves differently (e.g., it should have an "accuracy floor"), and what the implications are (e.g., it seems somewhat unfair to compare this bound with the SR bound on w_b).
2) It is not clear how tight these bounds are, especially the accuracy floor. The paper would have been stronger if it had a lower bound on the error in SR. I would also suggest a (simple) simulation on convex/strongly convex functions to check the tightness of these results.
3) The percentage-of-weight-change graphs do not look very convincing. In figure 3 the linear layers actually decrease in SR in comparison to BC.
Also, in figure 4(b) both batch sizes arrive at almost the same value: the different convergence speeds could be related to the fact that with smaller batch sizes we do more iterations per epoch.

Minor issues:
* The attribution of the various advantages to the references in lines 26-27 seems wrong. For example, [1] actually accelerated inference throughput (using an xnor kernel), while [3,4] only discussed this.
* Line 34: "on" -> "to".
* The description of the methods (R, SR, BC) should mention that the weights are typically restricted to a finite domain.
* Lines 100-101: I guess the authors refer to the binary weights here, not w_r, since w_r was restricted to [-1,1], not {-1,1}.
* Lines 102-103: it is not clear if this is true. See the follow-up of [1]: "Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations".
* Lemma 1: I don't think it was previously defined that d is the dimension of w. Also, in the proof, the authors should consider keeping the L1 norm on the gradients instead of \sqrt{d} times the L2 norm, which can give a much higher "accuracy floor".
* Line 122: confusing sentence: when alpha -> 0 the rounding actually becomes more destructive, as shown in section 5. The authors should consider replacing "rounding effect" with "rounding error per step", or removing the sentence.
* First equation in the Supplementary Material (SM): change +1 to -1.
* Line 334 in the SM: (11) -> (10).

%% After author feedback %% The authors have addressed my concerns.
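The learning-rate observation above (that shrinking the learning rate below a certain value only slows SR down rather than improving it) can be made concrete with a small numerical sketch: for an unbiased stochastic rounding onto a grid of resolution delta, an update of size lr*|g| < delta leaves the current grid point with probability lr*|g|/delta, so halving the learning rate halves the per-step movement probability without changing where the iterates can land. A minimal illustration (the function names, grid resolution, and gradient value are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, delta):
    """Unbiased rounding of x onto the grid delta*Z: round up with
    probability equal to the fractional position, so E[Q(x)] = x."""
    lo = np.floor(x / delta) * delta
    frac = (x - lo) / delta
    return np.where(rng.random(np.shape(x)) < frac, lo + delta, lo)

def sr_move_fraction(lr, grad=1.0, delta=0.5, n=200_000):
    """Fraction of SR steps that leave the current grid point when the
    update lr*grad is smaller than the grid resolution delta."""
    w = np.zeros(n)                                  # start on a grid point
    w_next = stochastic_round(w - lr * grad, delta)  # one SR update
    return np.mean(w_next != w)
```

With grad=1 and delta=0.5, `sr_move_fraction(0.05)` is close to 0.1 and `sr_move_fraction(0.025)` close to 0.05: the movement probability scales linearly with the learning rate, consistent with the reviewed claim that a smaller learning rate only slows the same random walk down.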

### Reviewer 2

This paper presents a theoretical analysis for understanding quantized neural networks. In particular, convergence analysis is performed for two types of quantizers, stochastic rounding (SR) and BinaryConnect (BC), under different assumptions. Empirical evaluations and results are also provided to support the theoretical findings.

Pros:
1. In general, this paper is well written. Quantized nets have many potential applications, especially for low-power embedded devices. Although many quantized nets have shown promising performance in practice, a rigorous analysis of these models is essential. This paper presents some interesting theoretical results in this direction.
2. The paper shows convergence results in both convex and non-convex settings, although certain assumptions are imposed.

Cons:
1. Experimental results from Table 1 suggest that BC-ADAM outperforms SR-ADAM and R-ADAM in every case. An additional comparison of runtime behavior would be very helpful in evaluating the efficiency of these methods in practice.
2. The authors claim that a smaller batch size leads to a lower error for BC-ADAM, while a larger batch size is preferred for SR-ADAM. However, only two batch sizes (i.e., 128 and 2014) are used in the evaluation. More batch sizes, such as 256 and 512, should be tried to verify whether this conclusion holds consistently across multiple sizes.
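For concreteness, the two quantizers being analyzed can be sketched as one-step updates in a simplified 1-bit setting (function names are hypothetical and this is only a sketch; the paper's formulation covers general quantization grids):

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_stochastic(w):
    """Unbiased rounding of w in [-1, 1] onto {-1, +1}:
    picks +1 with probability (w + 1) / 2, so E[Q(w)] = w."""
    return np.where(rng.random(np.shape(w)) < (w + 1.0) / 2.0, 1.0, -1.0)

def sr_step(w_b, grad_fn, lr):
    """Stochastic Rounding (SR): the updated weights are immediately
    re-quantized, so only binary weights persist between iterations."""
    return binarize_stochastic(np.clip(w_b - lr * grad_fn(w_b), -1.0, 1.0))

def bc_step(w_r, grad_fn, lr):
    """BinaryConnect (BC): full-precision weights w_r are kept, and the
    gradient is evaluated at the binarized copy w_b = sign(w_r)."""
    w_b = np.where(w_r >= 0.0, 1.0, -1.0)
    return np.clip(w_r - lr * grad_fn(w_b), -1.0, 1.0)
```

The sketch makes the structural difference visible: `sr_step` returns weights in {-1, +1} only, while `bc_step` returns continuous weights in [-1, 1], which is why the paper's BC bounds concern w_r while the SR analysis concerns the quantized iterates directly.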