NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6257 Dimension-Free Bounds for Low-Precision Training

### Reviewer 1

Originality: The idea of quantization and the analysis of quantized SGD is not new. The authors provide a novel dimension independent bound which does seem novel. The work seems complete and the authors seem to cite properly. Quality: The quality of the is overall fine. It is a reasonably solid paper. The result is not amazing but it seems correct and there are also simulations. Overall, the authors have a nice result. Clarity: The paper is clear. Significance: This is where the authors could have done a better job. I am not convinced of the significance of this paper. My main issue is that the dimension dependent bound is tight so why does the re-parameterization that removes the dimension dependence help. A clear explanation of when the authors results are better and when they do not improve previous results would have been really useful. Also, a clear statement of how the authors helped improve SGD algorithms would also have helped.

### Reviewer 2

Instead of of the existing work where the error bounds depend on the model dimensionality in analyzing low-precision training problem, the paper proposes a new analysis that is dimension free. This removes the undesired reliance on d in analysis where practical d value could be as large as 10^8. The idea is interesting and potentially it could have good impact. Meanwhile there also a few issues. *Presentation. **It could be good to present the main results early as in Table.1 at page two. It however also seems quite out of the place as the notations are used without being defined or introduced at that time. In fact, parameters of Table 1 such as sigma_1 (and T) are otherwise first defined at page 4 (and page 5). **Some parameters are mentioned without setting down their values. For example values of the parameters kappa and kappa_1 mentioned in Theorem 1 are not defined (at least from what I can see after multiple read). There also lack explanations on why some parameters are necessary in Theorem 1, 2, and 3. For example, sigma_0, sigma_1 and sigma are mentioned in Theorem 1&2 but it is not clear what specific role each plays and why we need these parameters here. There are in fact many parameters introduced in the paper, while there are quite a number of them are either not properly defined or explained. *Technical development **The paper is entirely about analyzing the LP-SGD algorithm developed in [17]. It is unclear how the proposed analysis could be suitable for other existing or new low-precision training algorithms, such as [4-7], [15,16, 18, 20, 22, 23]. *Experiments **Many internal parameters are introduced in the paper. Empirically it is unclear how sensible these parameters are with respect to the possibly broad range of problem settings. for example, In the experiments mentioned in page 8, the parameters L, L1, sigma, and sigma1 are set to 37.41, 685.27, 2.38, 29.11, respectively. Why these specific values here, and what if a different set of values are used here? It would be helpful to discuss and clarify. Post rebuttal: Thanks for the response and I tend to keep my current rating.

### Reviewer 3

Low-precision training is an important setting to practitioners, which has been shown to be efficient when training neural network. The submission studies the convergence properties of low-precision training, and presents dimension-free’’ bounds for both linear quantization and non-linear quantization. However, the term dimension-free’’ is very misleading and may not convey the right message. Indeed the dimension d, though not explicitly shown in the proposed bounds, is hidden in the parameters. The trick the authors use is utilizing L_1 norm as the upper bound of the Hessian while keeping the L_2 norm as the lower of the Hessian. As the authors mentioned, there can be such a gap of O(sort{d}) between L_1 norm and L_2 norm, and actually the O(sort{d}) gap happens in most of the practical examples, so it is improper to state the bound is dimension-free. It is true that L_1 norm can be naturally used to define continuity, for example as the natural norm for greedy coordinate descent, but using two different norms to define the upper bound and the lower bound is a bit cheating especially when focusing on the dimension. Moreover, the \|w_0-w^*\|^2 in the right hand of Theorem 1 itself contains the dimension d, hence the term dimension-free’’ is not well defined. It is true that one can specially design some examples (like those in the experiments section) without such gap between the two norms, but those come from unrealistic models/settings. Usually in machine learning models, the data vector x is not sparse, and even if it is sparse, the sparsity should grow (linearly) as the dimension of the problem grows. I am afraid the result of this example cannot be generalized to any real-world machine learning setting, hence it is not convincing to me. In my opinion, the key proof technique is Lemma 3, which connects the quantization to l1 norm. This is an interesting observation, and potentially shows that l1 norm is the suitable norm for the low-precision setting. A deeper understanding of such connection may lead to an interesting work. The non-linear quantization part is also interesting. It would be better if there are more discussions on the advantages of non-linear quantization. Here are some more comments: 1. Write more terms in the definition of dom(\delta, b), otherwise it is a bit confusing about whether you only have the terms \delta*(2^i-1). 2. Having more discussions on what non-linear quantization gain compared with the linear counterpart can be helpful. 3. More numerical examples (for example ordinary least square for which each entry of X is iid Gaussian) can be convincing. And also try decreasing learning rate instead of constant learning rate. 4. The submission seems to contain too many settings (linear quantization, non-linear quantization, convex setting, non-convex setting), so that it is hard to have deeper discussions in a 8-page paper.