NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Reviewer 1
Post rebuttal: I feel this is a strong paper and will maintain my score.
- Figure 3 is a very nice visualisation; I hadn't thought of plotting the corruption of the objective landscape under model quantisation before (see the sketch after this review).
- This approach also seems to outperform some of the recent mixed-precision results; you may want to state this directly in one of the comparison tables.
- I would like to see time comparisons for training and inference.
- The baseline comparisons should be stronger (although, since this method seems to match normal training, there is very little margin for it to be outperformed; the comparisons should instead be used to emphasize that more complex methods actually end up underperforming the proposed method).
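A minimal sketch (not the authors' code) of the kind of plot the reviewer is referring to: compare the loss along a one-dimensional slice of weight space for full-precision weights versus weights rounded to a simulated 1-4-3 format. The toy least-squares model, the `fp8_round` helper, and its IEEE-style exponent bias are illustrative assumptions; the paper's actual format choices and figure may differ.

```python
import numpy as np
import matplotlib.pyplot as plt

def fp8_round(x, exp_bits=4, man_bits=3):
    """Round values to the nearest number representable with the given
    exponent/mantissa split (simulated in float64, IEEE-style bias assumed)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp, min_exp = 2 ** exp_bits - 2 - bias, 1 - bias
    mag, sign = np.abs(x), np.sign(x)
    exp = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp))), min_exp, max_exp)
    step = 2.0 ** (exp - man_bits)            # spacing of representable values
    q = np.minimum(np.round(mag / step) * step,
                   (2 - 2.0 ** -man_bits) * 2.0 ** max_exp)  # saturate at max
    return sign * q

# Toy least-squares "network": loss(w) = mean((X w - y)^2)
rng = np.random.default_rng(0)
X, w_star = rng.normal(size=(256, 32)), rng.normal(size=32)
y = X @ w_star
loss = lambda w: np.mean((X @ w - y) ** 2)

# Walk along a random direction from the solution and plot both curves.
direction = rng.normal(size=32)
alphas = np.linspace(-1.0, 1.0, 101)
full = [loss(w_star + a * direction) for a in alphas]
quant = [loss(fp8_round(w_star + a * direction)) for a in alphas]

plt.plot(alphas, full, label="FP32 weights")
plt.plot(alphas, quant, label="simulated 1-4-3 weights")
plt.xlabel("step along random direction"); plt.ylabel("loss"); plt.legend()
plt.show()
```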
Reviewer 2
After reviewing the authors' feedback and the other reviews, I have decided to keep my high score. This is a strong paper.

1. Originality:
   1.a. The idea of using the 1-4-3 FP8 format is an evolution of prior work on using the 1-5-2 FP8 format for training neural networks.
   1.b. The idea and analysis of re-tuning batch normalization statistics to match 1-4-3 data are novel (see the sketch after this review).
   1.c. The idea of doing distributed training with FP16 reduce-scatter and HFP8 allgather is novel and is supported by an excellent analysis.
2. Quality: the paper provides a detailed analysis of the hybrid FP8 training approach and highlights the benefits of having FP8 multiplication units that support both the 1-4-3 and 1-5-2 formats.
3. Clarity: the paper is clearly written and, together with the appendix, provides enough information to reproduce the results.
4. Significance: the paper explores a practical technique to speed up training and inference of neural networks, at the cost of minimal model quality loss and the introduction of new hardware primitives. While this paper builds upon previous work on FP8 training and inference, it provides significant improvements useful to the research community and industry.
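A minimal sketch, in PyTorch, of the batch-norm re-tuning idea in point 1.b: freeze the learned parameters and re-estimate only the BN running statistics on calibration data that has passed through a quantized forward path. The tiny model, the random calibration batches, and the `fake_quant` stand-in (which only rounds the mantissa) are hypothetical placeholders, not the authors' procedure.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())

def fake_quant(x, man_bits=3):
    # Crude stand-in for 1-4-3 rounding: keep `man_bits` mantissa bits by
    # scaling to each value's binade, rounding, and scaling back.
    exp = torch.floor(torch.log2(x.abs().clamp_min(1e-12)))
    step = torch.pow(2.0, exp - man_bits)
    return torch.round(x / step) * step

# Reset the running statistics, then refresh them in train mode with
# gradients disabled so only the BN buffers (not the weights) change.
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.reset_running_stats()

model.train()
with torch.no_grad():
    for _ in range(20):                       # hypothetical calibration loop
        batch = torch.randn(16, 3, 32, 32)    # stand-in calibration batch
        model(fake_quant(batch))              # forward in "quantized" mode
model.eval()                                  # inference now uses the refreshed stats
```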
Reviewer 3
Originality: this basically amounts to using two different floating-point formats, one for the forward pass and one for the backward pass. Another way to think about it is that we are allowing more freedom in how the mantissa/exponent split is chosen. That is a good observation to have, theoretically, but how would a framework implement this, practically? For example (maybe I missed it), I do not see how you convert between the 1-4-3 and 1-5-2 formats when preparing for backpropagation if this were to be productized (see the sketch after this review). Do the frameworks now have to support two more data types? Is the user even aware of the data types?

Quality: the authors did a sufficiently thorough study, including looking at the impact on batch-norm layers, thinking about the weight update, etc. I also appreciate the diversity of networks addressed; that is very convincing.

Clarity: the paper is clearly written. How do you get away with FP16 master weights when most others need FP32?

Significance: is the intent to convince hardware vendors to provide this? Or is this for a custom chip? How does a reader take advantage of this?

After author feedback: the authors did not quite address my question about frameworks using this. I was referring more to annoying things like plumbing a new data type through every layer implementation; that is an often overlooked part that takes real work. However, despite that, I think this paper still passes the bar, so I will keep my score. My other issues were addressed.
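A minimal sketch of one possible framework-level answer to the conversion question (an assumption, not the authors' implementation): if both FP8 variants are fake-quantized inside a wider container such as float32, "converting" between them is just re-rounding under the other exponent/mantissa split, so a framework could expose a single quantization op parameterised by the format instead of plumbing two new dtypes through every layer. The `fp8_round` helper and its IEEE-style bias are illustrative assumptions.

```python
import numpy as np

def fp8_round(x, exp_bits, man_bits):
    """Round float values to the nearest value representable with the given
    exponent/mantissa split (IEEE-style bias assumed; simulated in float64)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp, min_exp = 2 ** exp_bits - 2 - bias, 1 - bias
    mag, sign = np.abs(x), np.sign(x)
    exp = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp))), min_exp, max_exp)
    step = 2.0 ** (exp - man_bits)
    q = np.minimum(np.round(mag / step) * step,
                   (2 - 2.0 ** -man_bits) * 2.0 ** max_exp)
    return sign * q

x = np.random.default_rng(0).normal(scale=0.05, size=5)
fwd = fp8_round(x, exp_bits=4, man_bits=3)    # forward-pass side: 1-4-3
bwd = fp8_round(fwd, exp_bits=5, man_bits=2)  # re-round for the 1-5-2 side
print(np.c_[x, fwd, bwd])                     # original vs. the two FP8 views
```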