{"title": "Practical Deep Learning with Bayesian Principles", "book": "Advances in Neural Information Processing Systems", "page_first": 4287, "page_last": 4299, "abstract": "Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation is available as a plug-and-play optimiser.", "full_text": "Practical Deep Learning with Bayesian Principles\n\nKazuki Osawa,1 Siddharth Swaroop,2,\u21e4 Anirudh Jain,3,\u21e4,\u2020 Runa Eschenhagen,4,\u2020\n\nRichard E. Turner,2 Rio Yokota,1 Mohammad Emtiyaz Khan5,\u2021.\n\n1 Tokyo Institute of Technology, Tokyo, Japan\n2 University of Cambridge, Cambridge, UK\n\n3 Indian Institute of Technology (ISM), Dhanbad, India\n\n4 University of Osnabr\u00fcck, Osnabr\u00fcck, Germany\n\n5 RIKEN Center for AI Project, Tokyo, Japan\n\nAbstract\n\nBayesian methods promise to \ufb01x many shortcomings of deep learning, but they\nare impractical and rarely match the performance of standard methods, let alone\nimprove them. In this paper, we demonstrate practical training of deep networks\nwith natural-gradient variational inference. 
By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated, uncertainties on out-of-distribution data are improved, and continual-learning performance is boosted. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation1 is available as a plug-and-play optimiser.\n\n1 Introduction\n\nDeep learning has been extremely successful in many fields such as computer vision [29], speech processing [17], and natural-language processing [39], but it is also plagued with several issues that make its application difficult in many other fields. For example, it requires a large amount of high-quality data and it can overfit when dataset size is small. Similarly, sequential learning can cause forgetting of past knowledge [27], and a lack of reliable confidence estimates and other robustness issues can make it vulnerable to adversarial attacks [6]. Ultimately, due to such issues, application of deep learning remains challenging, especially for applications where human lives are at risk.\nBayesian principles have the potential to address such issues. For example, we can represent uncertainty using the posterior distribution, enable sequential learning using Bayes' rule, and reduce overfitting with Bayesian model averaging [19]. The use of such Bayesian principles for neural networks has been advocated from very early on. Methods for Bayesian inference in neural networks were proposed in the 90s, e.g., using MCMC methods [41], Laplace's method [35], and variational inference (VI) [18, 2, 49, 1]. Benefits of Bayesian principles are even discussed in machine-learning textbooks [36, 3]. 
Despite this, they are rarely employed in practice. This is mainly due to computational concerns which, unfortunately, overshadow their theoretical advantages.\nThe difficulty lies in the computation of the posterior distribution, which is especially challenging for deep learning. Even approximation methods, such as VI and MCMC, have historically been difficult to scale to large datasets such as ImageNet [47].\n\n* These two authors contributed equally.\n† This work was conducted during an internship at RIKEN Center for AI Project.\n‡ Corresponding author: emtiyaz.khan@riken.jp\n1 The code is available at https://github.com/team-approx-bayes/dl-with-bayes.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Comparing VOGN [24], a natural-gradient VI method, to Adam and SGD, training ResNet-18 on ImageNet. The two left plots show that VOGN and Adam have similar convergence behaviour and achieve similar performance in about the same number of epochs. VOGN achieves 67.38% on validation compared to 66.39% by Adam and 67.79% by SGD. Run-time of VOGN is 76 seconds per epoch compared to 44 seconds for Adam and SGD. The rightmost figure shows the calibration curve. VOGN gives calibrated predictive probabilities (the diagonal represents perfect calibration).\n\nDue to this, it is common to use less principled approximations, such as MC-dropout [9], even though they are not ideal when it comes to fixing the issues of deep learning. For example, MC-dropout is unsuitable for continual learning [27] since its posterior approximation does not have mass over the whole weight space. It is also found to perform poorly for sequential decision making [45]. 
The form of the approximation used by such methods is usually rigid and cannot be easily improved, e.g., to other forms such as a mixture of Gaussians. The goal of this paper is to make more principled Bayesian methods, such as VI, practical for deep learning, thereby helping researchers tackle its key limitations.\nWe demonstrate practical training of deep networks by using recently proposed natural-gradient VI methods. These methods resemble the Adam optimiser, enabling us to leverage existing techniques for initialisation, momentum, batch normalisation, data augmentation, and distributed training. As a result, we obtain similar performance in about the same number of epochs as Adam when training many popular deep networks (e.g., LeNet, AlexNet, ResNet) on datasets such as CIFAR-10 and ImageNet (see Fig. 1). The results show that, despite using an approximate posterior, the training methods preserve the benefits coming from Bayesian principles. Compared to standard deep-learning methods, the predictive probabilities are well-calibrated, uncertainties on out-of-distribution inputs are improved, and performance for continual-learning tasks is boosted. Our work shows that practical deep learning is possible with Bayesian methods and aims to support further research in this area.\nRelated work. Previous VI methods, notably by Graves [14] and Blundell et al. [4], require significant implementation and tuning effort to perform well, e.g., on convolutional neural networks (CNNs). Slow convergence is found to be especially problematic for sequential problems [45]. There appear to be no reported results with complex networks on large problems, such as ImageNet. Our work solves these issues by applying deep-learning techniques to natural-gradient VI [24, 56].\nIn their paper, Zhang et al. 
[56] also employed data augmentation and batch normalisation for a natural-gradient method called Noisy K-FAC (see Appendix A) and showed results for VGG on CIFAR-10. However, a mean-field method called Noisy Adam was found to be unstable with batch normalisation. In contrast, we show that a similar method, called Variational Online Gauss-Newton (VOGN), proposed by Khan et al. [24], works well with such techniques. We show results for distributed training with Noisy K-FAC on ImageNet, but do not provide extensive comparisons since tuning it is time-consuming. Many of our techniques can speed up Noisy K-FAC, which is promising.\nMany other approaches have recently been proposed to compute posterior approximations by training deterministic networks [46, 37, 38]. Similarly to MC-dropout, their posterior approximations are not flexible, making it difficult to improve the accuracy of their approximations. On the other hand, VI offers a much more flexible alternative for applying Bayesian principles to deep learning.\n\n2 Deep Learning with Bayesian Principles and Its Challenges\n\nThe success of deep learning is partly due to the availability of scalable and practical methods for training deep neural networks (DNNs). Network training is formulated as an optimisation problem where a loss between the data and the DNN's predictions is minimised. For example, in a supervised learning task with a dataset D of N inputs x_i and corresponding outputs y_i of length K, we minimise a loss of the following form: \bar{\ell}(w) + \delta w^\top w, where \bar{\ell}(w) := (1/N) \sum_i \ell(y_i, f_w(x_i)), f_w(x) \in R^K denotes the DNN outputs with weights w, \ell(y, f) denotes a differentiable loss function between an output y and the function f, and \delta > 0 is the L2 regulariser.^2 Deep learning relies on stochastic-gradient (SG) methods to minimise such loss functions. 
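For illustration only, here is a minimal numpy sketch of this regularised loss, with a hypothetical linear model f_w(x) = x·w and squared error standing in for the DNN and its loss function:

```python
import numpy as np

def regularised_loss(w, X, y, delta):
    """L2-regularised loss  bar_ell(w) + delta * w^T w  (sketch).

    Hypothetical stand-in for a DNN: f_w(x) = x @ w with squared error.
    X: (N, d) inputs, y: (N,) outputs, delta: L2 regulariser strength.
    """
    preds = X @ w                          # f_w(x_i) for every example
    per_example = 0.5 * (preds - y) ** 2   # ell(y_i, f_w(x_i))
    data_term = per_example.mean()         # bar_ell(w) = (1/N) sum_i ell_i
    return data_term + delta * (w @ w)     # add the L2 regulariser
```

Any differentiable per-example loss could replace the squared error here; only the averaging over N and the additive δ wᵀw term matter for what follows.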
The most commonly used optimisers, such as stochastic-gradient descent (SGD), RMSprop [53], and Adam [25], take the following form^3 (all operations below are element-wise):\n\nw_{t+1} \leftarrow w_t - \alpha_t (\hat{g}(w_t) + \delta w_t) / (\sqrt{s_{t+1}} + \epsilon),   s_{t+1} \leftarrow (1 - \beta_t) s_t + \beta_t (\hat{g}(w_t) + \delta w_t)^2,   (1)\n\nwhere t is the iteration, \alpha_t > 0 and 0 < \beta_t < 1 are learning rates, \epsilon > 0 is a small scalar, and \hat{g}(w) is the stochastic gradient at w, defined as \hat{g}(w) := (1/M) \sum_{i \in M_t} \nabla_w \ell(y_i, f_w(x_i)) using a minibatch M_t of M data examples. This simple update scales extremely well and can be applied to very large problems. With techniques such as initialisation protocols, momentum, weight-decay, batch normalisation, and data augmentation, it also achieves good performance for many problems.\nIn contrast, the full Bayesian approach to deep learning is computationally very expensive. The posterior distribution can be obtained using Bayes' rule: p(w|D) = \exp(-N \bar{\ell}(w)/\tau) p(w) / p(D), where 0 < \tau \le 1.^4 This is costly due to the computation of the marginal likelihood p(D), a high-dimensional integral that is difficult to compute for large networks. Variational inference (VI) is a principled approach to more scalably estimate an approximation to p(w|D). The main idea is to employ a parametric approximation, e.g., a Gaussian q(w) := N(w|\mu, \Sigma) with mean \mu and covariance \Sigma. The parameters \mu and \Sigma can then be estimated by maximising the evidence lower bound (ELBO):\n\nELBO: L(\mu, \Sigma) := -N E_q[\bar{\ell}(w)] - \tau D_{KL}[q(w) || p(w)],   (2)\n\nwhere D_{KL}[·] denotes the Kullback-Leibler divergence. By using more complex approximations, we can further reduce the approximation error, but at a computational cost. 
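To make the ELBO concrete, here is a minimal Monte-Carlo sketch for a diagonal Gaussian q(w) = N(w|μ, diag(σ²)) and prior p(w) = N(0, I/δ), using the closed-form KL divergence between Gaussians; the function and its arguments are illustrative, not the paper's implementation:

```python
import numpy as np

def elbo_estimate(mu, sigma2, avg_loss_fn, N, tau, delta, n_samples=8, seed=0):
    """Monte-Carlo estimate of the ELBO (2), sketch.

    q(w) = N(mu, diag(sigma2)), prior p(w) = N(0, I/delta);
    avg_loss_fn(w) returns bar_ell(w) for a sampled weight vector w.
    """
    rng = np.random.default_rng(seed)
    d = mu.shape[0]
    # E_q[bar_ell(w)], approximated by sampling w ~ q
    exp_loss = np.mean([avg_loss_fn(mu + np.sqrt(sigma2) * rng.standard_normal(d))
                        for _ in range(n_samples)])
    # KL[q || p] in closed form for diagonal Gaussians
    kl = 0.5 * np.sum(delta * (sigma2 + mu**2) - 1.0 - np.log(delta * sigma2))
    return -N * exp_loss - tau * kl
```

Only the expectation term needs sampling; the KL term is exact for this Gaussian family, which is one reason diagonal Gaussians are a convenient choice for q.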
By formulating Bayesian inference as an optimisation problem, VI enables a practical application of Bayesian principles.\nDespite this, VI has remained impractical for training large deep networks on large datasets. Existing methods, such as Graves [14] and Blundell et al. [4], directly apply popular SG methods to optimise the variational parameters in the ELBO, yet they fail to get reasonable performance on large problems, usually converging very slowly. The failure of such direct applications of deep-learning methods to VI is not surprising: techniques used in one field may not directly lead to improvements in the other. But it would be useful if they did, e.g., if we could optimise the ELBO in a way that allows us to exploit the tricks and techniques of deep learning and boost the performance of VI. The goal of this work is to do just that. We now describe our methods in detail.\n\n3 Practical Deep Learning with Natural-Gradient Variational Inference\n\nIn this paper, we propose natural-gradient VI methods for practical deep learning with Bayesian principles. The natural-gradient update takes a simple form when estimating exponential-family approximations [23, 22]. When p(w) := N(w|0, I/\delta), the update of the natural parameters \lambda is performed by using the stochastic gradient of the expected regularised loss:\n\n\lambda_{t+1} = (1 - \tau\rho) \lambda_t - \rho \nabla_\mu E_q[\bar{\ell}(w) + (1/2) \tau\delta w^\top w],   (3)\n\n^2 This regulariser is sometimes set to 0 or a very small value.\n^3 Alternate versions with weight-decay and momentum differ from this update [34]. We present a form useful to establish the connection between SG methods and natural-gradient VI.\n^4 This is a tempered posterior [54] setup where \tau is set \neq 1 when we expect model misspecification and/or adversarial examples [10]. 
Setting \tau = 1 recovers standard Bayesian inference.\n\nwhere \rho > 0 is the learning rate, and we note that the stochastic gradients are computed with respect to \mu, the expectation parameters of q. The moving average above helps to deal with the stochasticity of the gradient estimates, and is very similar to the moving average used in deep learning (see (1)). When \tau is set to 0, the update essentially minimises the regularised loss (see Section 5 in Khan et al. [24]). These properties of natural-gradient VI make it an ideal candidate for deep learning.\nRecent work by Khan et al. [24] and Zhang et al. [56] further shows that, when q is Gaussian, the update (3) assumes a form that is strikingly similar to the update (1). For example, the Variational Online Gauss-Newton (VOGN) method of Khan et al. [24] estimates a Gaussian with mean \mu_t and a diagonal covariance matrix \Sigma_t using the following update:\n\n\mu_{t+1} \leftarrow \mu_t - \alpha_t (\hat{g}(w_t) + \tilde{\delta} \mu_t) / (s_{t+1} + \tilde{\delta}),   s_{t+1} \leftarrow (1 - \tau\beta_t) s_t + \beta_t (1/M) \sum_{i \in M_t} (g_i(w_t))^2,   (4)\n\nwhere g_i(w_t) := \nabla_w \ell(y_i, f_{w_t}(x_i)), w_t \sim N(w|\mu_t, \Sigma_t) with \Sigma_t := diag(1/(N(s_t + \tilde{\delta}))), \tilde{\delta} := \tau\delta/N, and \alpha_t, \beta_t > 0 are learning rates. Operations are performed element-wise. Similarly to (1), the vector s_t adapts the learning rate and is updated using a moving average.\nA major difference in VOGN is that the update of s_t is now based on a Gauss-Newton approximation [14] which uses (1/M) \sum_{i \in M_t} (g_i(w_t))^2. This is fundamentally different from the SG update in (1), which instead uses the gradient-magnitude ((1/M) \sum_{i \in M_t} g_i(w_t) + \delta w_t)^2 [5]. The first approach uses the sum outside the square while the second approach uses it inside. VOGN is therefore a second-order method and, similarly to Newton's method, does not need a square-root over s_t. 
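A minimal numpy sketch of one VOGN-style step following (4) may help (single weight sample, momentum omitted, tempering τ fixed to 1; the names are illustrative, not the paper's code):

```python
import numpy as np

def vogn_step(mu, s, per_example_grads, alpha, beta, delta_tilde):
    """One momentum-free VOGN-style step of update (4), with tau = 1.

    per_example_grads: (M, d) array of g_i(w_t) for a sampled weight w_t.
    Key difference from Adam: square each g_i FIRST, then average
    (sum outside the square), and no square root over s.
    """
    g_hat = per_example_grads.mean(axis=0)         # (1/M) sum_i g_i
    h_hat = (per_example_grads ** 2).mean(axis=0)  # (1/M) sum_i g_i**2
    s_new = (1.0 - beta) * s + beta * h_hat        # Gauss-Newton moving average
    mu_new = mu - alpha * (g_hat + delta_tilde * mu) / (s_new + delta_tilde)
    return mu_new, s_new
```

Note that an Adam-style step would instead form `(g_hat + delta * w) ** 2` (sum inside the square) and divide by `sqrt(s_new) + eps`.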
Implementation of this step requires an additional calculation (see Appendix B), which makes VOGN a bit slower than Adam, but VOGN is expected to give better variance estimates (see Theorem 1 in Khan et al. [24]).\nThe main contribution of this paper is to demonstrate practical training of deep networks using VOGN. Since VOGN takes a similar form to SG methods, we can easily borrow existing deep-learning techniques to improve performance. We will now describe these techniques in detail. Pseudo-code for VOGN is shown in Algorithm 1.\nBatch normalisation: Batch normalisation [20] has been found to significantly speed up and stabilise training of neural networks, and is widely used in deep learning. BatchNorm layers are inserted between neural network layers. They help stabilise each layer's input distribution by normalising the running average of the inputs' mean and variance. In our VOGN implementation, we simply use the existing implementation with default hyperparameter settings. Like Goyal et al. [13], we do not apply L2 regularisation and weight decay to the BatchNorm parameters, nor do we maintain uncertainty over them. This straightforward application of batch normalisation works for VOGN.\nData augmentation: When training on image datasets, data augmentation (DA) techniques can improve performance drastically [13]. We consider two common real-time data augmentation techniques: random cropping and horizontal flipping. After randomly selecting a minibatch at each iteration, we use a randomly selected cropped version of all images. Each image in the minibatch has a 50% chance of being horizontally flipped.\nWe find that directly applying DA gives slightly worse performance than expected, and also affects the calibration of the resulting uncertainty. However, DA increases the effective sample size. We therefore modify the dataset size to be \rho N where \rho \ge 1, improving performance (see step 2 in Algorithm 1). 
The reason for this performance boost might be due to the complex relationship between the regularisation \delta and N. For the regularised loss \bar{\ell}(w) + \delta w^\top w, the two are unidentifiable, i.e., we can multiply \delta by a constant and reduce N by the same constant without changing the minimum. However, in a Bayesian setting (like in (2)), the two quantities are separate, and therefore changing the data might also change the optimal prior variance hyperparameter in a complicated way. This needs further theoretical investigation, but our simple fix of scaling N seems to work well in the experiments.\nWe set \rho by considering the specific DA techniques used. When training on CIFAR-10, the random-cropping DA step involves first padding the 32x32 images to become of size 40x40, and then taking randomly selected 28x28 cropped images. We consider this as effectively increasing the dataset size by a factor of 5 (4 images for each corner, and one central image). The horizontal-flipping DA step doubles the dataset size (one dataset of unflipped images, one of flipped images). Combined, this gives \rho = 10. Similar arguments for ImageNet DA techniques give \rho = 5. Even though \rho is another hyperparameter to set, we find that its precise value does not matter much. Typically, after setting an estimate for \rho, tuning a little seems to work well (see Appendix E).\n\nAlgorithm 1: Variational Online Gauss-Newton (VOGN)\n1: Initialise \mu_0, s_0, m_0.\n2: N \leftarrow \rho N, \tilde{\delta} \leftarrow \tau\delta/N.\n3: repeat\n4: Sample a minibatch M of size M.\n5: Split M into each GPU (local minibatch M_local).\n6: for each GPU in parallel do\n7: for k = 1, 2, ..., K do\n8: Sample \epsilon \sim N(0, I).\n9: w^(k) \leftarrow \mu + \sigma\epsilon with \sigma \leftarrow (1/(N(s + \tilde{\delta} + \gamma)))^{1/2}.\n10: Compute g_i^(k) \leftarrow \nabla_w \ell(y_i, f_{w^(k)}(x_i)), \forall i \in M_local, using the method described in Appendix B.\n11: \hat{g}_k \leftarrow (1/M) \sum_{i \in M_local} g_i^(k).\n12: \hat{h}_k \leftarrow (1/M) \sum_{i \in M_local} (g_i^(k))^2.\n13: end for\n14: \hat{g} \leftarrow (1/K) \sum_{k=1}^K \hat{g}_k and \hat{h} \leftarrow (1/K) \sum_{k=1}^K \hat{h}_k.\n15: end for\n16: AllReduce \hat{g}, \hat{h}.\n17: m \leftarrow \beta_1 m + (\hat{g} + \tilde{\delta}\mu).\n18: s \leftarrow (1 - \tau\beta_2) s + \beta_2 \hat{h}.\n19: \mu \leftarrow \mu - \alpha m/(s + \tilde{\delta} + \gamma).\n20: until stopping criterion is met\n\nAlgorithmic hyperparameters: learning rate \alpha; momentum rate \beta_1; exp. moving-average rate \beta_2; prior precision \delta; external damping factor \gamma; tempering parameter \tau; number of MC samples for training K; data augmentation factor \rho.\n\nFigure 2: A pseudo-code for our distributed VOGN algorithm is shown in Algorithm 1, and the distributed scheme is shown in the right figure. The computation in line 10 requires an extra calculation (see Appendix B), making VOGN slower than Adam. The bottom table gives a list of algorithmic hyperparameters needed for VOGN.\n\nMomentum and initialisation: It is well known that both momentum and good initialisation can improve the speed of convergence for SG methods in deep learning [51]. Since VOGN is similar to Adam, we can implement momentum in a similar way. This is shown in step 17 of Algorithm 1, where \beta_1 is the momentum rate. We initialise the mean \mu in the same way the weights are initialised in Adam (we use init.xavier_normal in PyTorch [11]). For the momentum term m, we use the same initialisation as Adam (initialised to 0). VOGN requires an additional initialisation for the variance \sigma^2. For this, we first run a forward pass through the first minibatch, calculate the average of the squared gradients, and initialise the scale s_0 with it (see step 1 in Algorithm 1). This implies that the variance is initialised to \sigma_0^2 = \tau/(N(s_0 + \tilde{\delta})). For the tempering parameter \tau, we use a schedule where it is increased from a small value (e.g., 0.1) to 1. With these initialisation protocols, VOGN is able to mimic the convergence behaviour of Adam in the beginning.\nLearning rate scheduling: A common approach to quickly achieve high validation accuracies is to use a specific learning-rate schedule [13]. The learning rate (denoted by \alpha in Algorithm 1) is regularly decayed by a factor (typically a factor of 10). The frequency and timings of this decay are usually pre-specified. 
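Such a pre-specified step decay can be sketched as follows (the milestone epochs and decay factor here are placeholders, not the paper's settings):

```python
def step_decay(alpha0, epoch, milestones=(30, 60, 80), factor=0.1):
    """Return the learning rate for `epoch` under a step-decay schedule.

    The rate is multiplied by `factor` once for every milestone epoch
    already passed. Milestones are hypothetical placeholders.
    """
    drops = sum(epoch >= m for m in milestones)  # number of decays so far
    return alpha0 * factor ** drops
```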
In VOGN, we use the same schedule as for Adam, which works well.
Distributed training: We also employ distributed training for VOGN to perform large experiments quickly. We can parallelise computation both over data and over Monte-Carlo (MC) samples. Data parallelism is useful to split up large minibatch sizes; the averaging over multiple MC samples and their losses then happens on each GPU. MC-sample parallelism is useful when the minibatch size is small: we can copy the entire minibatch and process it on a single GPU. Algorithm 1 and Figure 2 illustrate our distributed scheme. We use a combination of these two parallelism techniques, with different MC samples for different inputs. This theoretically reduces the variance during training (see Equation 5 in Kingma et al. [26]), but sometimes requires averaging over multiple MC samples to get a sufficiently low variance in the early iterations. Overall, we find that this type of distributed training is essential for fast training on large problems such as ImageNet.
Implementation of the Gauss-Newton update in VOGN: As discussed earlier, VOGN uses the Gauss-Newton approximation, which is fundamentally different from Adam. In this approximation, the gradients on individual data examples are first squared and then averaged (see step 12 in Algorithm 1, which implements the update for s_t shown in (4)). We need extra computation to get access to individual gradients, due to which VOGN is slower than Adam or SGD (e.g., in Fig. 1). However, this is not a theoretical limitation, and it can be improved if a framework enables easy computation of individual gradients. Details of our implementation are described in Appendix B. This implementation is much more efficient than a naive one in which per-example gradients are stored and the sum of their squares is computed sequentially.
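The key difference from Adam, and the variance initialisation it enables, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the actual PyTorch implementation; the function names, shapes, and default values are assumptions.

```python
import numpy as np

def init_scale_and_variance(per_example_grads, N, tau=0.1, delta_tilde=1e-3):
    """Sketch of VOGN's Gauss-Newton-style initialisation.

    per_example_grads: (M, P) array holding the gradient of the loss for
    each of the M examples in the first minibatch w.r.t. the P parameters.
    Unlike Adam, each example's gradient is squared *before* averaging.
    """
    s0 = np.mean(per_example_grads ** 2, axis=0)   # scale s_0, shape (P,)
    sigma0_sq = tau / (N * (s0 + delta_tilde))     # initial variance
    return s0, sigma0_sq

def adam_style_moment(per_example_grads):
    """Adam-style second moment: square of the *averaged* gradient."""
    return np.mean(per_example_grads, axis=0) ** 2
```

On a minibatch whose individual gradients cancel (say +1 and -1 for one parameter), the Gauss-Newton estimate stays large while the Adam-style one vanishes; this per-example squaring is exactly why VOGN needs access to individual gradients.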
Our implementation usually brings the running time of VOGN to within 2-5 times that of Adam.
Tuning VOGN: Currently, there is no common recipe for tuning the algorithmic hyperparameters for VI, especially for large-scale tasks like ImageNet classification. One key idea we use in our experiments is to start with Adam hyperparameters and then make sure that VOGN training closely follows an Adam-like trajectory in the beginning of training. To achieve this, we divide the tuning into an optimisation part and a regularisation part. In the optimisation part, we first tune the hyperparameters of a deterministic version of VOGN, called the online Gauss-Newton (OGN) method. This method, described in Appendix C, is more stable than VOGN since it does not require MC sampling, and can be used as a stepping stone when moving from Adam/SGD to VOGN. After reaching performance competitive with Adam/SGD using OGN, we move to the regularisation part, where we tune the prior precision, the tempering parameter τ, and the number of MC samples K for VOGN. We initialise our search by setting the prior precision using the L2-regularisation parameter used for OGN, as well as the dataset size N. Another technique is to warm up the parameter τ towards τ = 1 (see also the "momentum and initialisation" part). Setting τ to smaller values usually stabilises the training, and increasing it slowly also helps during tuning. We also add an external damping factor greater than zero to the moving average s_t. This increases the lower bound of the eigenvalues of the diagonal covariance Σ_t and prevents the noise and the step size from becoming too large. We find that a mix of these techniques works well for the problems we considered.

4 Experiments

In this section, we present experiments on fitting several deep networks on CIFAR-10 and ImageNet.
Our experiments demonstrate practical training using VOGN on these benchmarks and show performance that is competitive with Adam and SGD. We also assess the quality of the posterior approximation, finding that the benefits of Bayesian principles are preserved.
CIFAR-10 [28] contains 10 classes, with 50,000 images for training and 10,000 images for validation. For ImageNet, we train with 1.28 million training examples and validate on 50,000 examples, classifying between 1,000 classes. We use a large minibatch size of M = 4,096 and parallelise across 128 GPUs (NVIDIA Tesla P100). We compare the following methods on CIFAR-10: Adam and MC-dropout [9]. For ImageNet, we also compare to SGD, K-FAC, and Noisy K-FAC. We do not consider Noisy K-FAC for the other comparisons since its tuning is difficult. We compare three architectures: LeNet-5, AlexNet, and ResNet-18. We only compare to Bayes by Backprop (BBB) [4] for CIFAR-10 with LeNet-5, since it is very slow to converge in larger-scale experiments. We carefully set the hyperparameters of all methods, following the best practice of large distributed training [13] as the initial point of our hyperparameter tuning. The full set of hyperparameters is in Appendix D.

4.1 Performance on CIFAR-10 and ImageNet

We start by showing the effectiveness of momentum and batch normalisation for boosting the performance of VOGN. Figure 3a shows that these techniques significantly speed up convergence and improve performance (in terms of both accuracy and log likelihood).
Figures 1 and 4 compare the convergence of VOGN to Adam (for all experiments), SGD (on ImageNet), and MC-dropout (on the rest). VOGN shows similar convergence, and its performance is competitive with these methods. We also try BBB on LeNet-5, where it converges prohibitively slowly, performing very poorly. We are not able to successfully train other architectures using this approach.
We found it far simpler to tune VOGN because we can borrow all the techniques used for Adam. Figure 4 also shows the importance of DA (data augmentation) in improving performance.
Table 1 gives a final comparison of train/validation accuracies, negative log likelihoods, epochs required for convergence, and run-time per epoch. We can see that the accuracies, log likelihoods, and numbers of epochs are comparable. VOGN is 2-5 times slower than Adam and SGD. This is mainly due to the computation of individual gradients required in VOGN (see the discussion in Section 3). We clearly see that by applying deep-learning techniques to VOGN, we can perform practical deep learning. This is not possible with methods such as BBB.

Figure 3: Figure (a) shows that momentum and batch normalisation improve the performance of VOGN. The results are for training ResNet-18 on CIFAR-10. Figure (b) shows a comparison for a continual-learning task on the Permuted MNIST dataset. VOGN performs at least as well (average accuracy) as VCL over 10 tasks. We also find that, for each task, VOGN converges much faster, taking only 100 epochs per task as opposed to the 800 epochs taken by VCL (plots not shown).

Figure 4: Validation accuracy for various architectures trained on CIFAR-10 (DA: Data Augmentation). VOGN's convergence and validation accuracies are comparable to Adam and MC-dropout.

Due to the Bayesian nature of VOGN, there are some trade-offs to consider. Reducing the prior precision (see Algorithm 1) results in higher validation accuracy, but also in a larger train-test gap (more overfitting). This is shown in Appendix E for VOGN with ResNet-18 on ImageNet. As expected, when the prior precision is small, performance is similar to non-Bayesian methods.
We also show the effect of changing the effective dataset size ρ in Appendix E: note that, since we are going to tune the prior variance anyway, it is sufficient to set ρ to the correct order of magnitude. Another trade-off concerns the number of Monte-Carlo (MC) samples, shown in Appendix F. Increasing the number of training MC samples (up to a limit) improves VOGN's convergence rate and stability, but also increases the computation. Increasing the number of MC samples during testing improves generalisation, as expected due to averaging.
Finally, a few comments on the performance of the other methods. Adam regularly overfits the training set in most settings, with large train-test differences in both validation accuracy and log likelihood. One exception is LeNet-5, most likely because the small architecture results in underfitting (this is consistent with the low validation accuracies obtained). In contrast to Adam, MC-dropout has a small train-test gap, usually smaller than VOGN's. However, we will see in Section 4.2 that this is because of underfitting. Moreover, the performance of MC-dropout is highly sensitive to the dropout rate (see Appendix G for a comparison of different dropout rates). On ImageNet, Noisy K-FAC also performs well. It is slower than VOGN per epoch but takes fewer epochs; overall, its wall-clock time is about the same as VOGN's.

4.2 Quality of the Predictive Probabilities

In this section, we compare the quality of the predictive probabilities for various methods. For Bayesian methods, we compute these probabilities by averaging over the samples from the posterior approximations (see Appendix H for details).
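This averaging can be sketched as follows, assuming a diagonal Gaussian posterior over the weights. The interface (`logits_fn`, `mu`, `sigma`) is illustrative, not the paper's actual code.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_probs(logits_fn, mu, sigma, x, K=10, rng=None):
    """Average softmax outputs over K weight samples from N(mu, sigma^2).

    logits_fn(x, w) returns the class logits for input x under weights w;
    mu and sigma parameterise the diagonal Gaussian posterior approximation.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    probs = np.zeros_like(softmax(logits_fn(x, mu)))
    for _ in range(K):
        w = mu + sigma * rng.standard_normal(mu.shape)  # one posterior sample
        probs += softmax(logits_fn(x, w))
    return probs / K
```

Averaging the probabilities (rather than the logits) is what yields the calibrated, higher-entropy predictions discussed below.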
For non-Bayesian methods, these are obtained using the point estimate of the weights. We compare the probabilities using the following metrics: validation negative log-likelihood (NLL), area under the ROC curve (AUROC), and expected calibration error (ECE) [40, 15]. For the first and third metrics, a lower number is better, while for the second, a higher number is better. See Appendix H for an explanation of these metrics. Results are summarised in Table 1. VOGN's uncertainty performance is more consistent and marginally better than that of the other methods, as expected from a more principled Bayesian method. Out of the 15 metrics (NLL, ECE, and AUROC on 5 dataset/architecture combinations), VOGN performs the best or tied best on 10, and is second-best on the other 5. In contrast, both MC-dropout's and Adam's performance varies significantly, sometimes poor and sometimes decent. MC-dropout is best on 4, and Adam is best on 1 (on LeNet-5; as argued earlier, the small architecture may result in underfitting). We also show calibration curves [7] in Figures 1 and 14. Adam is consistently over-confident, with its calibration curve below the diagonal. Conversely, MC-dropout is usually under-confident. On ImageNet, MC-dropout performs well on ECE (all methods are very similar on AUROC), but this required an excessively tuned dropout rate (see Appendix G).

Dataset/Architecture       Optimiser    Train/Val Acc (%)  Val NLL  Epochs  Time/epoch (s)  ECE    AUROC
CIFAR-10/LeNet-5 (no DA)   Adam         71.98 / 67.67      0.937    210     6.96            0.021  0.794
                           BBB          66.84 / 64.61      1.018    800     11.43†          0.045  0.784
                           MC-dropout   68.41 / 67.65      0.99     210     6.95            0.087  0.797
                           VOGN         70.79 / 67.32      0.938    210     18.33           0.046  0.8
CIFAR-10/AlexNet (no DA)   Adam         100.0 / 67.94      2.83     161     3.12            0.262  0.793
                           MC-dropout   97.56 / 72.20      1.077    160     3.25            0.140  0.818
                           VOGN         79.07 / 69.03      0.93     160     9.98            0.024  0.796
CIFAR-10/AlexNet           Adam         97.92 / 73.59      1.480    161     3.08            0.262  0.793
                           MC-dropout   80.65 / 77.04      0.667    160     3.20            0.114  0.828
                           VOGN         81.15 / 75.48      0.703    160     10.02           0.016  0.832
CIFAR-10/ResNet-18         Adam         97.74 / 86.00      0.55     160     11.97           0.082  0.877
                           MC-dropout   88.23 / 82.85      0.51     161     12.51           0.166  0.768
                           VOGN         91.62 / 84.27      0.477    161     53.14           0.040  0.876
ImageNet/ResNet-18         SGD          82.63 / 67.79      1.38     90      44.13           0.067  0.856
                           Adam         80.96 / 66.39      1.44     90      44.40           0.064  0.855
                           MC-dropout   72.96 / 65.64      1.43     90      45.86           0.012  0.856
                           OGN          85.33 / 65.76      1.60     90      63.13           0.128  0.854
                           VOGN         73.87 / 67.38      1.37     90      76.04           0.029  0.854
                           K-FAC        83.73 / 66.58      1.493    60      133.69          0.158  0.842
                           Noisy K-FAC  72.28 / 66.44      1.44     60      179.27          0.080  0.852

Table 1: Performance comparisons on different dataset/architecture combinations. Out of the 15 metrics (NLL, ECE, and AUROC on 5 dataset/architecture combinations), VOGN performs the best or tied best on 10, and is second-best on the other 5. Here DA means 'Data Augmentation', NLL refers to 'Negative Log Likelihood' (lower is better), ECE refers to 'Expected Calibration Error' (lower is better), and AUROC refers to 'Area Under ROC curve' (higher is better). BBB is the Bayes By Backprop method. For ImageNet, the reported accuracy and negative log likelihood are the median values from the final 5 epochs. All hyperparameter settings are in Appendix D. See Table 3 for standard deviations. † BBB is not parallelised (other methods have 4 processes), with 1 MC sample used for the convolutional layers (VOGN uses 6 samples per process).

We also compare performance on out-of-distribution datasets. When testing on datasets that are different from the training datasets, predictions should be more uncertain.
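The predictive entropy used in these out-of-distribution comparisons is simply the Shannon entropy of the averaged predictive distribution. A minimal sketch (the function name and clipping constant are assumptions):

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy (in nats) of a predictive class distribution.

    probs: (..., num_classes) array of averaged class probabilities.
    High entropy means an uncertain prediction; ideally this is high for
    out-of-distribution inputs and low for in-distribution inputs.
    """
    p = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=-1)
```

For 10-way classification, a uniform prediction gives the maximum entropy log(10) ≈ 2.30 nats, while a confident one-hot prediction gives an entropy near 0; histograms of this quantity over a test set are what Figure 5 reports.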
We use the experimental protocol from the literature [16, 31, 8, 32] to compare VOGN, Adam, and MC-dropout on CIFAR-10. We also borrow metrics from other works [16, 30], showing predictive-entropy histograms and reporting AUROC and FPR at 95% TPR. See Appendix I for further details on the datasets and metrics. Ideally, we want the predictive entropy to be high on out-of-distribution data and low on in-distribution data. Our results are summarised in Figure 5 and Appendix I. On ResNet-18 and AlexNet, VOGN's predictive-entropy histograms show the desired behaviour: a spread of entropies for the in-distribution data, and high entropies for out-of-distribution data. Adam has many predictive entropies at zero, indicating that Adam tends to classify out-of-distribution data too confidently. Conversely, MC-dropout's predictive entropies are generally high (particularly in-distribution), indicating that MC-dropout has too much noise. On LeNet-5, we observe the same result as before: Adam and MC-dropout both perform well. The metrics (AUROC and FPR at 95% TPR) do not provide a clear story across architectures.

Figure 5: Histograms of predictive entropy for out-of-distribution tests for ResNet-18 trained on CIFAR-10. Going from left to right, the inputs are: the in-distribution dataset (CIFAR-10), followed by out-of-distribution data: SVHN, LSUN (crop), and LSUN (resize). Also shown are the FPR at 95% TPR metric (lower is better) and the AUROC metric (higher is better), averaged over 3 runs. We clearly see that VOGN's predictive entropy is generally low for in-distribution and high for out-of-distribution data, but this is not the case for the other methods. Solid vertical lines indicate the mean predictive entropy. The standard deviations are small and therefore not reported.

4.2.1 Performance on a Continual-learning task
The goal of continual learning is to avoid forgetting old tasks while sequentially observing new tasks.
The past tasks are never visited again, making it difficult to remember them. The field of continual learning has grown recently, with many approaches proposed to tackle this problem [27, 33, 43, 48, 50]. Most approaches consider a simple setting where the tasks (such as classifying a subset of classes) arrive sequentially, and all the data from the current task is available. We consider the same setup in our experiments.
We compare to Elastic Weight Consolidation (EWC) [27] and a VI-based approach called Variational Continual Learning (VCL) [43]. VCL employs BBB for each task, and we expect to boost its performance by replacing BBB with VOGN. Figure 3b shows results on a common benchmark called Permuted MNIST. We use the same experimental setup as in Swaroop et al. [52]. In Permuted MNIST, each task consists of the entire MNIST dataset (10-way classification) with a different fixed random permutation applied to the input images' pixels. We run each method 20 times, with different random seeds for both the benchmark's permutations and model training. See Appendix D.2 for hyperparameter settings and further details. We see that VOGN performs at least as well as VCL, and far better than EWC [27]. Additionally, as found in the batch-learning setting, VOGN is much quicker than BBB: we run VOGN for only 100 epochs per task, whereas VCL requires 800 epochs per task to achieve its best results [52].

5 Conclusions

We successfully train deep networks with a natural-gradient variational inference method, VOGN, on a variety of architectures and datasets, even scaling up to ImageNet. This is made possible by the similarity of VOGN to Adam, which enables us to boost performance by borrowing deep-learning techniques. Our accuracies and convergence rates are comparable to those of SGD and Adam.
Unlike them, however, VOGN retains the benefits of Bayesian principles, with well-calibrated uncertainty and good performance on out-of-distribution data. Better uncertainty estimates open up a whole range of potential future experiments, for example small-data experiments, active learning, adversarial experiments, and sequential decision making. Our results on a continual-learning task confirm this. Another potential avenue for research is to consider structured covariance approximations.

Acknowledgements
We would like to thank Hikaru Nakata (Tokyo Institute of Technology) and Ikuro Sato (Denso IT Laboratory, Inc.) for their help with the PyTorch implementation. We are also thankful for the RAIDEN computing system and its support team at the RIKEN Center for AI Project, which we used extensively for our experiments. This research used computational resources of the HPCI system provided by Tokyo Institute of Technology (TSUBAME3.0) through the HPCI System Research Project (Project ID: hp190122). K. O. is a Research Fellow of JSPS and is supported by JSPS KAKENHI Grant Number JP19J13477.

References
[1] James R. Anderson and Carsten Peterson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987.
[2] David Barber and Christopher M. Bishop. Ensemble learning in Bayesian neural networks. Generalization in Neural Networks and Machine Learning, 168:215-238, 1998.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.
[4] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In International Conference on Machine Learning, pages 1613-1622, 2015.
[5] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.
arXiv preprint arXiv:1606.04838, 2016.
[6] John Bradshaw, Alexander G. de G. Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.
[7] Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. The Statistician: Journal of the Institute of Statisticians, 32:12-22, 1983.
[8] Terrance DeVries and Graham W. Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059, 2016.
[10] S. Ghosal and A. Van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.
[11] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[12] Ian Goodfellow. Efficient Per-Example Gradient Computations. ArXiv e-prints, October 2015.
[13] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
[14] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348-2356, 2011.
[15] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321-1330. JMLR.org, 2017.
[16] Dan Hendrycks and Kevin Gimpel.
A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
[17] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.
[18] Geoffrey E. Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Annual Conference on Computational Learning Theory, pages 5-13, 1993.
[19] Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, pages 382-401, 1999.
[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
[21] Mohammad Khan. Variational Learning for Latent Gaussian Model of Discrete Data. PhD thesis, University of British Columbia, 2012.
[22] Mohammad Emtiyaz Khan and Wu Lin. Conjugate-computation variational inference: converting variational inference in non-conjugate models to inferences in conjugate models. In International Conference on Artificial Intelligence and Statistics, pages 878-887, 2017.
[23] Mohammad Emtiyaz Khan and Didrik Nielsen. Fast yet simple natural-gradient descent for variational inference in complex models. In 2018 International Symposium on Information Theory and Its Applications (ISITA), pages 31-35. IEEE, 2018.
[24] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam.
In International Conference on Machine Learning, pages 2616-2625, 2018.
[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[26] Diederik P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575-2583, 2015.
[27] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.
[28] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[30] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, pages 6402-6413. Curran Associates, Inc., 2017.
[31] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations, 2018.
[32] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
[33] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In NIPS, 2017.
[34] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.
In International Conference on Learning Representations, 2019.
[35] David MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1991.
[36] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[37] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476, 2019.
[38] Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18:1-35, 2017.
[39] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[40] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2901-2907. AAAI Press, 2015.
[41] Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
[42] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
[43] Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
[44] Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Second-order optimization method for large mini-batch: Training ResNet-50 on ImageNet in 35 epochs. CoRR, abs/1811.12019, 2018.
[45] Carlos Riquelme, George Tucker, and Jasper Snoek.
Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
[46] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
[47] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[48] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[49] Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.
[50] Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, 2018.
[51] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139-1147, 2013.
[52] Siddharth Swaroop, Cuong V. Nguyen, Thang D. Bui, and Richard E. Turner. Improving and understanding variational continual learning. arXiv preprint arXiv:1905.02099, 2019.
[53] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2012.
[54] V. G. Vovk. Aggregating strategies.
In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT '90, pages 371-386, San Francisco, CA, USA, 1990. Morgan Kaufmann Publishers Inc. ISBN 1-55860-146-5.
[55] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015.
[56] Guodong Zhang, Shengyang Sun, David K. Duvenaud, and Roger B. Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2018.