NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:7203
Title:A Simple Baseline for Bayesian Uncertainty in Deep Learning

This paper presents SWAG, a method that uses the iterates of stochastic gradient descent with Polyak-style averaging to approximate the posterior distribution of a neural network. It is presented as a simple baseline for uncertainty in large deep neural networks, and the authors demonstrate its effectiveness on a variety of large-scale tasks, including residual networks on CIFAR and ImageNet.

The strengths of this paper are:
- It is indeed a simple baseline for a promising area of research that is really lacking good baselines.
- The experiments are thorough and use benchmarks that are large and interesting to the wider deep learning community.
- The authors empirically evaluate the quality of their approximation and provide some analysis.

The main criticism of this paper is that it is not really Bayesian from a purist perspective. R3 is correct to point out that the presented approximation cannot actually capture the true posterior, as shown by Mandt et al. (Stochastic Gradient Descent as Approximate Bayesian Inference). The language of the paper at times implies otherwise, and R3 is right to point this out (e.g. L192 "our procedure... corresponds to fully Bayesian inference"). The method is also rather close to Mandt et al. in methodology. The major difference appears to be the application to deep neural networks, the scale of which justifies the approximations presented here.

The authors' treatment of Mandt et al. in related work is not entirely fair, and R3 is right to point this out. That paper explores iterate averaging, and Algorithm 1 details a version that doesn't involve a full covariance matrix. The reason the authors use a full covariance later in the paper is that they show mathematically that one cannot capture the posterior using SGD iterates without doing so.

The recommendation is accept, because the empirical work is thorough, this area does indeed lack reasonable baselines, and the authors demonstrate empirically that their method gives a reasonable approximation.
However, we request that the authors make clear that this is an approximation and, in particular, give proper attribution to Mandt et al. in the camera-ready version.
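
For readers unfamiliar with the idea summarized above, the following is a minimal sketch of fitting a Gaussian to SGD iterates and sampling weights from it. It is not the authors' implementation: it keeps only a diagonal covariance, uses a toy linear regression problem instead of a deep network, and the function names (`sgd_iterate_gaussian`, `sample_weights`) and all hyperparameters are hypothetical choices made for illustration. As the review notes, the resulting Gaussian is an approximation and should not be read as the true posterior.

```python
import numpy as np


def sgd_iterate_gaussian(X, y, lr=0.05, steps=2000, burn_in=1000, batch=8, seed=0):
    """Run SGD on a least-squares loss and track running moments of the iterates.

    Returns the mean of the post-burn-in iterates and a diagonal variance
    estimate (second moment minus squared mean). Illustrative only.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    mean = np.zeros(d)      # running first moment of the iterates
    sq_mean = np.zeros(d)   # running second moment of the iterates
    k = 0
    for t in range(steps):
        idx = rng.integers(0, n, size=batch)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w = w - lr * grad
        if t >= burn_in:
            k += 1
            mean += (w - mean) / k
            sq_mean += (w ** 2 - sq_mean) / k
    # Clip to guard against tiny negative values from floating-point error.
    var = np.clip(sq_mean - mean ** 2, 0.0, None)
    return mean, var


def sample_weights(mean, var, num_samples, seed=1):
    """Draw weight vectors from the diagonal Gaussian N(mean, diag(var))."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((num_samples, mean.shape[0]))
    return mean + np.sqrt(var) * noise


# Toy data: a linear model with known weights (assumed setup, not from the paper).
rng = np.random.default_rng(42)
true_w = np.array([1.5, -2.0, 0.5])
X = rng.standard_normal((200, 3))
y = X @ true_w + 0.1 * rng.standard_normal(200)

mean, var = sgd_iterate_gaussian(X, y)
samples = sample_weights(mean, var, num_samples=20)
preds = samples @ X[0]  # spread of predictions for one input across sampled weights
```

The spread of `preds` gives a rough uncertainty estimate for a single input. Because only a diagonal covariance is kept here, correlations between weights are ignored, which is one concrete way in which such a scheme falls short of the true posterior.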