NeurIPS 2019
Sun Dec 8 – Sat Dec 14, 2019, Vancouver Convention Center
Reviewer 1
This paper proposes a deep learning training strategy using natural-gradient variational inference, and claims that this preserves the benefits of Bayesian principles. However, the experimental results of the proposed method are not very impressive compared with the other baselines, despite the more complicated training process. In addition, I think it would be better if the authors could discuss in more detail how easily the proposed method can be generalized to different deep models.
Reviewer 2
Originality: Rather low. The main technical novelty lies in applying tricks from the deep learning literature to VOGN. The experiments are fairly standard.

Quality: High. That being said, the experiments seem to be carefully executed and described in detail, and the overall method is technically sound. While not overly ambitious in terms of technical novelty, I think this is a well-executed piece of work.

Clarity: High. The paper is well written and easy to follow.

Significance: Mixed. I find that the paper does itself a bit of a disservice by putting so much focus on technicalities. I believe this is an attempt to appeal to readers with an interest in deep learning rather than Bayesian inference; however, I don't find the empirical part of the paper to make a particularly strong case for using Bayesian methods in deep learning. My main takeaway from the experiments would be that "being Bayesian" does not matter too much on a large dataset like ImageNet (or even CIFAR-10), and the small calibration improvements as in Figure 1 are probably not worth the extra headache. If the authors indeed wish to make a case for Bayesian deep learning to a larger audience, I think the paper would be much stronger if it had some online-learning or fine-tuning experiments that use the approximate posterior as a prior on a much smaller dataset, where ignoring parameter uncertainty would most likely lead to dramatically worse performance. The numbers in Table 1 are too close/inconsistent to be really convincing in an empirical paper, and for the out-of-distribution uncertainty in Figure 5 it is unclear whether it is a good metric, since we do not know the uncertainty of the true posterior. Alternatively, this could also be a much more relevant contribution to the Bayesian deep learning subfield if the paper made an attempt to gain insight into why VOGN works better than, e.g., BBB. The paragraph in lines 91 to 97 does not make much sense to me, since (unless I misunderstood something) both methods optimize the same objective with an approximate posterior from the same parametric family; the difference is that VOGN is a natural-gradient method. So the failure of BBB cannot be attributed to the ELBO if VOGN works. But if the argument is that natural gradient is necessary, I find it surprising that Noisy K-FAC is apparently difficult to tune. Digging a bit deeper here would probably lead to interesting insights.
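For concreteness, a minimal sketch of the shared objective and of the distinction being drawn, in generic notation rather than the paper's own (q is a mean-field Gaussian with mean \mu and variances \sigma^2):

% Both BBB and VOGN maximize the same evidence lower bound (ELBO) over a
% mean-field Gaussian posterior q_{\mu,\sigma}(\theta) = \mathcal{N}(\theta \mid \mu, \mathrm{diag}(\sigma^2)):
\mathcal{L}(\mu, \sigma)
  = \mathbb{E}_{q_{\mu,\sigma}(\theta)}\!\big[ \log p(\mathcal{D} \mid \theta) \big]
  - \mathrm{KL}\!\big( q_{\mu,\sigma}(\theta) \,\|\, p(\theta) \big).
% BBB follows ordinary (reparameterized) gradients of \mathcal{L} with respect to (\mu, \sigma),
% whereas VOGN follows natural gradients, i.e. gradients preconditioned by the Fisher
% information of q; the objective and the variational family are identical in both cases.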
Reviewer 3
This paper proposes a perspective on training Bayesian neural networks (BNNs) that motivates how to best incorporate different tricks (such as batch normalization and momentum) into BNN training. The resulting algorithm scales to large inference problems, such as fitting a BNN to ImageNet, and achieves well-calibrated predictions. The starting point is an existing approach (VOGN) for fitting BNNs with natural gradients. The authors observe that the update equations of VOGN are similar to the update equations of popular SGD methods with adaptive learning rates. From this perspective, they can derive by analogy how to best incorporate different tricks for practical deep learning (batch normalization, data augmentation, distributed training). The extensive experimental study supports the authors' claims. Topic-wise, this work is a good fit for the NeurIPS community. There seem to be no 'new ideas' in this paper (VOGN comes from ref [22], and batch normalization, data augmentation, etc. come from the deep learning literature), so I would rate it lower on originality. Yet I find it an important contribution to bridging the gap between Bayesian neural networks and practical deep learning. The ideas and how they are connected are described clearly. This work is an interesting step in the direction of finding the right trade-off between computational efficiency and well-calibrated predictions in Bayesian deep learning.
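To make the analogy concrete, a minimal side-by-side sketch of the two update rules (simplified diagonal forms with momentum omitted; the step size \alpha, decay rate \beta_2, prior precision \tilde{\delta}, dataset size N, and minibatch \mathcal{M}_t follow common conventions and are not taken verbatim from the paper):

% Adam-style adaptive update: precondition the minibatch gradient \hat{g}(\theta_t)
% by a running average of its elementwise squares:
v_{t+1} = \beta_2\, v_t + (1-\beta_2)\, \hat{g}(\theta_t)^2, \qquad
\theta_{t+1} = \theta_t - \alpha\, \frac{\hat{g}(\theta_t)}{\sqrt{v_{t+1}} + \epsilon}.

% VOGN-style update: the same structure, but the gradient is evaluated at a weight
% sample \theta_t \sim \mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2)), the scale s_t is
% built from per-example squared gradients g_i (a Gauss--Newton/Fisher approximation),
% and the prior precision \tilde{\delta} enters both the step and the posterior variance:
s_{t+1} = (1-\beta_2)\, s_t + \beta_2\, \tfrac{1}{M}\sum_{i \in \mathcal{M}_t} g_i(\theta_t)^2, \qquad
\mu_{t+1} = \mu_t - \alpha\, \frac{\hat{g}(\theta_t) + \tilde{\delta}\,\mu_t}{s_{t+1} + \tilde{\delta}}, \qquad
\sigma_{t+1}^2 = \frac{1}{N\,(s_{t+1} + \tilde{\delta})}.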