Paper ID: 7203

Title: A Simple Baseline for Bayesian Uncertainty in Deep Learning

The method is almost trivially simple, scalable and easy to implement, yet the empirical evaluation shows that it performs competitively with, and often better than, all alternatives. This is the best kind of paper! The task of representing uncertainty over model weights is highly significant -- it is debatably *the* core problem in Bayesian deep learning, with (as the authors point out) applications to calibrated decision making, out-of-sample detection, adversarial robustness, transfer learning, and more. I expect this baseline to be widely used by researchers in the field, and likely implemented by practitioners as well.

The paper is well written and easy to follow. The method is clearly motivated and cleanly presented. The experimental results are extensive and compelling, and include comparisons to the major alternative approaches from recent literature. I appreciate that the experiments include some 'real' tasks (e.g., Imagenet models) as opposed to the toy problems often used in Bayesian deep learning papers.

One omission I found fairly glaring was the lack of any discussion of the seemingly even simpler baseline of iterate ensembles. If you've got a bunch of SGD iterates lying around, why bother fitting a Gaussian and ensembling its samples, when you could have just ensembled the iterates directly? I'd expect that imposing the Gaussian distributional assumption increases bias and reduces variance, and I could be convinced that in practical situations the bias-variance tradeoff is always in favor of fitting the Gaussian, but I want to see that comparison! Since the main innovation of the paper is to fit a Gaussian to SGD iterates, the question of whether doing so *actually helps* seems quite foundational. Why not include iterate ensembling as a baseline in Figs. 2 and 3?

Update: thanks to the authors for your response and the new results, which are encouraging.
I'm still not sure I have good intuition for when (and why) SWAG will outperform iterate ensembles -- I would appreciate some discussion on this point in the paper -- but it's good to see evidence that it often does.

I also appreciated Reviewer 3's points regarding convergence to an isotropic Gaussian -- having not read Mandt et al. closely, I didn't realize that under non-crazy assumptions (i.e., that the data are generated from the model) the scale of the SGD iterate distribution is independent of the shape of the true posterior. Just reading this submission (e.g., section 3), it's easy to believe otherwise; readers would be better served if the paper clarified the gap between the theory and the empirical results. R3 also makes a very strong point that the paper should discuss how the SWAG learning rates are chosen, since Mandt et al. (in the settings of both eqn (13) and section 6.2) indicate that the learning rate is crucial to the scale of posterior uncertainty.

Overall I still favor accepting the paper. Whether or not we consider SWAG a principled Bayesian approximation, it's significant that a simple method performs well at the tasks used to benchmark Bayesian deep learning algorithms, and asking new methods to do better is a reasonable challenge. To the extent that SWAG is a broken approximation, it should be possible to beat it, but in the meantime simple baselines are important. I think this paper will help move the field forward. That said, if it's accepted I encourage the authors to consider softening the Bayesian framing; positioning SWAG as 'merely' a useful approach to quantifying uncertainty seems like a much stronger case.
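For concreteness, the iterate-ensembling baseline the review asks about amounts to something like the following sketch (the function name and the toy linear model are my own hypothetical illustrations, not anything from the paper):

```python
# Sketch of the iterate-ensemble baseline: instead of fitting a Gaussian
# to saved SGD iterates and sampling from it, average predictions over
# the saved iterates directly.
def iterate_ensemble_predict(predict_fn, iterates, x):
    """Average predict_fn(weights, x) over a list of saved SGD iterates."""
    preds = [predict_fn(w, x) for w in iterates]
    return sum(preds) / len(preds)
```

With a toy linear model `lambda w, x: w * x` and saved iterates `[1.0, 3.0]`, the ensemble prediction at `x = 2.0` is the average of `2.0` and `6.0`, i.e. `4.0`.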

The authors propose a very simple and practical method for evaluating Bayesian model uncertainty by running SGD with a fixed step size and fitting a low-rank Gaussian to the resulting iterates. Surprisingly, this method works better than many more involved methods for approximating Bayesian inference. It is currently one of the most practical ways of evaluating model/parameter uncertainty when training deep neural networks and could be very useful to many people.

The paper is written very clearly. There is useful theory supporting the method, with relevant references to Mandt et al. about the approximately Gaussian distribution of SGD iterates and how this relates to the Bayesian posterior. There is a useful experimental evaluation of the posterior landscape. The method is shown to work at Imagenet scale. There is a thorough comparison against competing methods like Laplace approximations, SGLD and dropout.
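The "fit a Gaussian to the iterates" procedure described above can be sketched in a few lines. This is a hypothetical diagonal-covariance illustration only (function names are mine, and the actual method also maintains a low-rank covariance component), not the authors' implementation:

```python
import random

def fit_diag_gaussian(iterates):
    """Fit a per-coordinate mean and variance to a list of SGD iterates
    (each iterate is a flat list of weights)."""
    n = len(iterates)
    d = len(iterates[0])
    mean = [sum(w[j] for w in iterates) / n for j in range(d)]
    sq_mean = [sum(w[j] ** 2 for w in iterates) / n for j in range(d)]
    # Clamp at zero to guard against tiny negative values from round-off.
    var = [max(sq - m ** 2, 0.0) for m, sq in zip(mean, sq_mean)]
    return mean, var

def sample_weights(mean, var, rng=random):
    """Draw one approximate-posterior sample w ~ N(mean, diag(var));
    predictions are then averaged over several such samples."""
    return [m + rng.gauss(0.0, v ** 0.5) for m, v in zip(mean, var)]
```

Test-time cost is then one forward pass per drawn sample, which is the basis for the cost comparisons against ensembling made in the reviews.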

The paper is well written, and the empirical evaluation is thorough. However, I am concerned by the interpretation the authors propose, which I think will add confusion to the literature. I feel that the paper should not be accepted without significant restructuring.

The authors describe their method as a baseline for Bayesian uncertainty; however, it is straightforward to show in simple cases that the stationary distribution of SGD does not approximate the Bayesian posterior. Indeed, Mandt et al. ('SGD as approximate Bayesian inference') showed that, under Bernstein-von Mises-type identifiability assumptions, SGD will converge to an isotropic Gaussian distribution near the minimum as the dataset size grows. This Gaussian distribution is independent of the Hessian of the posterior, depending only on the batch size and the learning rate. In order to get Bayesian samples from SGD, one must replace the learning rate by a learned pre-conditioning matrix.

On a more minor note, I would have appreciated a comparison between SWAG and conventional model ensembling. While ensembling requires k times more computation at training time, it requires the same computation as SWAG at test time, and I would expect it to perform substantially better on the test set.

Edit: I thank the authors for their response; however, I remain concerned about the framing of the paper as a Bayesian method. In my view, it is not appropriate to call any distribution over parameters Bayesian. One should show that the distribution is close to a valid posterior under some reasonable (if often unrealistic) assumptions. By contrast, it is possible to prove that, under reasonable assumptions, the stationary distribution of SGD is independent of the shape of the posterior. Furthermore, the authors themselves demonstrate that Mandt et al.'s method for predicting the learning rate fails in practice.
I presume that this learning rate, which is what sets the uncertainty scale, must be tuned on the validation set, although the authors claim otherwise in the paper. I do agree with the other reviewers that the proposed method provides a strong baseline for handling uncertainty in deep learning. However, I believe that the paper should be re-worked for a future conference, emphasising that the method is primarily empirical rather than a Bayesian approximation.

I am particularly concerned that this paper furthers existing misunderstandings about the Mandt paper in the community. With this in mind, the authors should compare against other non-Bayesian baselines for uncertainty. To my knowledge, the "deep ensembles" method already outperforms all the Bayesian techniques considered in this work, and additionally I remain unconvinced that SWAG would outperform ensembling in settings where the bottleneck is the compute cost at test time.

Finally, I note that SGLD was run at a temperature below 1 in order to prevent the iterates from diverging. This may indicate that the posterior is improper, in which case any accurate Bayesian method would be likely to produce poor results.
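The stationary-distribution argument this review alludes to can be sketched as follows (notation loosely following Mandt et al.; this is my gloss on their result, not a derivation from the submission):

```latex
% Near a minimum, constant-step-size SGD behaves like an
% Ornstein--Uhlenbeck process whose stationary covariance $\Sigma$
% solves a Lyapunov equation
\[
  \Sigma H + H \Sigma = \frac{\eta}{S}\, C ,
\]
% where $H$ is the Hessian at the minimum, $C$ the gradient-noise
% covariance, $\eta$ the learning rate, and $S$ the minibatch size.
% Under Bernstein--von Mises-type assumptions (data generated from the
% model), $C \approx H$ near the optimum, so
\[
  \Sigma \approx \frac{\eta}{2S}\, I ,
\]
% which depends only on $\eta$ and $S$, not on the shape of the
% posterior, matching the isotropy claim above.
```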