{"title": "Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting", "book": "Advances in Neural Information Processing Systems", "page_first": 3738, "page_last": 3748, "abstract": "We introduce the Kronecker factored online Laplace approximation for overcoming catastrophic forgetting in neural networks. The method is grounded in a Bayesian online learning framework, where we recursively approximate the posterior after every task with a Gaussian, leading to a quadratic penalty on changes to the weights. The Laplace approximation requires calculating the Hessian around a mode, which is typically intractable for modern architectures. In order to make our method scalable, we leverage recent block-diagonal Kronecker factored approximations to the curvature. Our algorithm achieves over 90% test accuracy across a sequence of 50 instantiations of the permuted MNIST dataset, substantially outperforming related methods for overcoming catastrophic forgetting.", "full_text": "Online Structured Laplace Approximations for\n\nOvercoming Catastrophic Forgetting\n\nHippolyt Ritter1\u2217\n1University College London\n\n2Alan Turing Institute\n\n3reinfer.io\n\nAleksandar Botev1\n\nDavid Barber1,2,3\n\nAbstract\n\nWe introduce the Kronecker factored online Laplace approximation for overcoming\ncatastrophic forgetting in neural networks. The method is grounded in a Bayesian\nonline learning framework, where we recursively approximate the posterior after\nevery task with a Gaussian, leading to a quadratic penalty on changes to the weights.\nThe Laplace approximation requires calculating the Hessian around a mode, which\nis typically intractable for modern architectures. In order to make our method\nscalable, we leverage recent block-diagonal Kronecker factored approximations to\nthe curvature. 
Our algorithm achieves over 90% test accuracy across a sequence\nof 50 instantiations of the permuted MNIST dataset, substantially outperforming\nrelated methods for overcoming catastrophic forgetting.\n\n1\n\nIntroduction\n\nCreating an agent that performs well across multiple tasks and continuously incorporates new\nknowledge has been a longstanding goal of research on arti\ufb01cial intelligence. When training on a\nsequence of tasks, however, the performance of many machine learning algorithms, including neural\nnetworks, decreases on older tasks when learning new ones. This phenomenon has been termed\n\u2018catastrophic forgetting\u2019 [6, 26, 33] and has recently received attention in the context of deep learning\n[8, 16]. Catastrophic forgetting cannot be overcome by simply initializing the parameters for a\nnew task with optimal ones from the old task and hoping that stochastic gradient descent will stay\nsuf\ufb01ciently close to the original values to maintain good performance on previous datasets [8].\nBayesian learning provides an elegant solution to this problem. It combines the current data with\nprior information to \ufb01nd an optimal trade-off in our belief about the parameters. In the sequential\nsetting, such information is readily available: the posterior over the parameters given all previous\ndatasets. It follows from Bayes\u2019 rule that we can use the posterior over the parameters after training\non one task as our prior for the next one. As the posterior over the weights of a neural network is\ntypically intractable, we need to approximate it. This type of Bayesian online learning has been\nstudied extensively in the literature [31, 7, 13].\nIn this work, we combine Bayesian online learning [31] with the Kronecker factored Laplace\napproximation [34] to update a quadratic penalty for every new task. 
The block-diagonal Kronecker factored approximation of the Hessian [23, 2] allows for an expressive scalable posterior that takes interactions between weights within the same layer into account. In our experiments we show that this principled approximation of the posterior leads to substantial gains in performance over simpler diagonal methods, in particular for long sequences of tasks.

*Corresponding author: j.ritter@cs.ucl.ac.uk

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Bayesian online learning for neural networks

We are interested in optimizing the parameters θ of a single neural network to perform well across multiple tasks D1, . . . , DT, specifically finding a MAP estimate θ* = arg maxθ p(θ|D1, . . . , DT). However, the datasets arrive sequentially and we can only train on one of them at a time.
In the following, we first discuss how Bayesian online learning solves this problem and introduce an approximate procedure for neural networks. We then review recent Kronecker factored approximations to the curvature of neural networks and how to use them to obtain a better fit to the posterior. Finally, we introduce a hyperparameter that acts as a regularizer on the approximation to the posterior.

2.1 Bayesian online learning

Bayesian online learning [31], or Assumed Density Filtering [25], is a framework for updating an approximate posterior when data arrive sequentially. Using Bayes' rule we would like to simply incorporate the most recent dataset Dt+1 into the posterior as:

p(θ|D1:t+1) = p(Dt+1|θ) p(θ|D1:t) / ∫ dθ′ p(Dt+1|θ′) p(θ′|D1:t)    (1)

where we use the posterior p(θ|D1:t) from the previously observed tasks as the prior over the parameters for the most recent task. 
As the posterior given the previous datasets is typically intractable, Bayesian online learning formulates a parametric approximate posterior q with parameters φt, which it iteratively updates in two steps:

Update step In the update step, the approximate posterior q with parameters φt from the previous task is used as a prior to find the new posterior given the most recent data:

p(θ|Dt+1, φt) = p(Dt+1|θ) q(θ|φt) / ∫ dθ′ p(Dt+1|θ′) q(θ′|φt)    (2)

Projection step The projection step finds the distribution within the parametric family of the approximation that most closely resembles this posterior, i.e. sets φt+1 such that:

q(θ|φt+1) ≈ p(θ|Dt+1, φt)    (3)

Opper and Winther [31] suggest minimizing the KL-divergence between the approximate and the true posterior; however, this is mostly appropriate for models where the update-step posterior and a solution to the KL-divergence are available in closed form. In the following, we therefore propose using a Laplace approximation to make Bayesian online learning tractable for neural networks.

2.2 The online Laplace approximation

Neural networks have found widespread success and adoption by performing simple MAP inference, i.e. finding a mode of the posterior:

θ* = arg maxθ log p(θ|D) = arg maxθ [log p(D|θ) + log p(θ)]    (4)

where p(D|θ) is the likelihood of the data and p(θ) the prior. Most commonly used loss functions and regularizers fit into this framework, e.g. using a categorical cross-entropy with L2-regularization corresponds to modeling the data with a categorical distribution and placing a zero-mean Gaussian prior on the network parameters. 
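To make the correspondence concrete, here is a minimal sketch (our own toy construction, not the paper's implementation): for a conjugate model where the exact posterior mode is known in closed form, minimizing the equivalent L2-regularized loss by gradient descent recovers the MAP estimate. The model, data, and step size below are illustrative assumptions.

```python
import numpy as np

# Toy model where the MAP estimate is available in closed form:
# likelihood y_i ~ N(theta, 1), prior theta ~ N(0, 1/tau).
# The negative log posterior is exactly an L2-regularized squared loss.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=100)
tau = 5.0  # prior precision, i.e. the L2 regularization strength

def neg_log_posterior_grad(theta):
    # d/dtheta [ 0.5 * sum((y - theta)^2) + 0.5 * tau * theta^2 ]
    return -(y - theta).sum() + tau * theta

theta = 0.0
for _ in range(1000):          # plain gradient descent
    theta -= 1e-3 * neg_log_posterior_grad(theta)

closed_form_map = y.sum() / (len(y) + tau)   # exact posterior mode
assert abs(theta - closed_form_map) < 1e-6
```

The same equivalence is what makes weight decay on a neural network interpretable as a Gaussian prior, even though there the mode must be found numerically.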
A local mode of this objective function can easily be found using standard gradient-based optimizers.
Around a mode, the posterior can be locally approximated using a second-order Taylor expansion, resulting in a Normal distribution with the MAP parameters as the mean and the Hessian of the negative log posterior around them as the precision. Using a Laplace approximation for neural networks was pioneered by MacKay [22].

We therefore proceed in two iterative steps similar to Bayesian online learning, using a Gaussian approximate posterior for q, such that φt = {µt, Λt} consists of a mean µ and a precision matrix Λ:

Update step As the posterior of a neural network is intractable for all but the simplest architectures, we will work with the unnormalized posterior. The normalization constant is not needed for finding a mode or calculating the Hessian. The Gaussian approximate posterior results in a quadratic penalty encouraging the parameters to stay close to the mean of the previous approximate posterior:

log p(θ|Dt+1, φt) ∝ log p(Dt+1|θ) + log q(θ|φt)
                 ∝ log p(Dt+1|θ) − (1/2)(θ − µt)ᵀΛt(θ − µt)    (5)

Projection step In the projection step we approximate the posterior with a Gaussian. We first update the mean of the approximation to a mode of the new posterior:

µt+1 = arg maxθ [log p(Dt+1|θ) + log q(θ|φt)]    (6)

and then perform a quadratic approximation around it, which requires calculating the Hessian of the negative objective. 
This leads to a recursive update to the precision with the Hessian of the most recent log likelihood, as the Hessian of the negative log approximate posterior is its precision:

Λt+1 = Ht+1(µt+1) + Λt    (7)

where Ht+1(µt+1) = −∂² log p(Dt+1|θ) / ∂θ∂θ, evaluated at θ = µt+1, is the Hessian of the newest negative log likelihood around the mode. The precision of a Gaussian is required to be positive semi-definite, which is the case for the Hessian at a mode. In order to numerically guarantee this in practice, we use the Fisher Information as an approximation [24] that is positive semi-definite by construction.
The recursion is initialized with the Hessian of the log prior, which is typically constant. For a zero-mean isotropic Gaussian prior, corresponding to an L2-regularizer, it is simply the identity matrix times the prior precision.2
A desirable property of the Laplace approximation is that the approximate posterior becomes peaked around its current mode as we observe more data. This becomes particularly clear if we think of the precision matrix as the product of the number of data points and the average precision. By becoming increasingly peaked, the approximate posterior will naturally allow the parameters to change less for later tasks. 
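The recursion in Eqs. (5)–(7) can be checked on a toy linear-Gaussian model (our own illustrative construction) where the Laplace approximation is exact, so the online posterior after all tasks must coincide with the batch posterior. The model y_i ~ N(θ, 1) with prior N(0, 1/λ0), and the task sizes, are assumptions for the sketch.

```python
import numpy as np

# Toy linear-Gaussian setting in which the Laplace approximation is exact,
# so the online recursion (Eqs. 5-7) can be checked against the batch posterior.
# Model: y_i ~ N(theta, 1), prior theta ~ N(0, 1/lam0).
rng = np.random.default_rng(1)
tasks = [rng.normal(loc=1.5, size=50) for _ in range(5)]

lam0 = 1.0                      # prior precision (Hessian of the log prior)
mu, lam = 0.0, lam0
for y in tasks:
    # Update step: maximize log lik + log q; the mode is closed form here.
    mu = (y.sum() + lam * mu) / (len(y) + lam)
    # Projection step: add the Hessian of the new negative log likelihood,
    # which is simply n for the unit-variance Gaussian likelihood.
    lam = len(y) + lam

# Batch posterior over all data for comparison
all_y = np.concatenate(tasks)
mu_batch = all_y.sum() / (len(all_y) + lam0)
lam_batch = len(all_y) + lam0
assert np.isclose(mu, mu_batch) and np.isclose(lam, lam_batch)
```

For neural networks neither step is exact, but the same two-step structure carries over, with gradient descent replacing the closed-form mode and a curvature approximation replacing the exact Hessian.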
At the same time, even though the Laplace method is a local approximation, we would expect it to leave sufficient flexibility for the parameters to adapt to new tasks, as the Hessian of neural networks has been observed to be flat in most directions [36].
We will also compare to fitting the true posterior with a new Gaussian at every task, for which we compute the Hessian of all tasks around the most recent MAP estimate:

Λt+1 = Hprior + Σ_{i=1}^{t+1} Hi(µt+1)    (8)

This procedure differs from the online Laplace approximation only in evaluating all Hessians at the most recent MAP parameters instead of the respective task's ones. Technically, this is not a valid Laplace approximation, as we only optimize an approximation to the posterior. Hence the optimal parameters for the approximate objective will not exactly correspond to a mode of the true posterior. However, as we will use a positive semi-definite approximation to the Hessian, this will only introduce a small additional approximation error.
Calculating the Hessian across all datasets requires relaxing the sequential learning setting to allow access to previous data 'offline', i.e. between tasks. We use this baseline to check if there is any loss of information in using estimates of the curvature at previous parameter values.

2Huszár [14] recently discussed a similar recursive Laplace approximation for online learning, however with limited experimental results and in the context of using a diagonal approximation to the Hessian.

2.3 Kronecker factored approximation of the Hessian

Modern networks typically have millions of parameters, so the size of the Hessian is several terabytes. An approximation that is simple to implement with automatic differentiation frameworks is the diagonal of the Fisher matrix, i.e. 
the expected square of the gradients, where the expectation is over the datapoints and the conditional distribution defined by the model. While this approximation has been used successfully [16], it ignores interactions between the parameters.
Recent works on second-order optimization [23, 2] have developed block-diagonal approximations to the Hessian. They exploit that, for a single data point, the diagonal blocks of the Hessian of a feedforward network — corresponding to the weights of a single layer — are Kronecker factored, i.e. a product of two relatively small matrices.
We denote a neural network as taking an input a0 = x and producing an output hL. The input is passed through layers 1, . . . , L as the linear pre-activations hl = Wl al−1 and the activations al = fl(hl), where fl is a non-linear elementwise function. The outputs then parameterize the log likelihood of the data, and, using the chain rule, we can write the Hessian w.r.t. the weights of a single layer as:

Hl = ∂² log p(D|hL) / (∂ vec(Wl) ∂ vec(Wl)) = Ql ⊗ ℋl    (9)

where vec(Wl) is the weight matrix of layer l stacked into a vector and we define Ql = al−1 al−1ᵀ as the covariance of the inputs to the layer. ℋl = ∂² log p(D|θ) / (∂hl ∂hl) is the pre-activation Hessian, i.e. the second derivative w.r.t. the pre-activations hl of the layer. We provide the basic derivation of Eq. (9) and the recursive formula for calculating ℋl in Appendix A. To maintain the Kronecker factorization in expectation, i.e. for an entire dataset, [23] and [2] assume the two factors to be independent and approximate the expected Kronecker product by the Kronecker product of the expected factors.
The block-diagonal approximation splits the Hessian-vector product in the quadratic penalty across the layers. Due to the Kronecker factored approximation, it can be calculated efficiently for each layer using the following well-known identity:

(Ql ⊗ ℋl) vec(Wl − Wl*) = vec(ℋl (Wl − Wl*) Ql)    (10)

where vec stacks the columns of a matrix into a vector and we use that ℋl is symmetric.
The block-diagonal Kronecker factored approximation corresponds to assuming independence between the layers and factorizing the covariance between the weights of a layer into the covariance of the columns and rows, resulting in a matrix normal distribution [11]. The same approximation has been used recently to sample from the predictive posterior [34, 9]. While it still makes some independence assumptions about the weights, the most important interactions — the ones within the same layer — are accounted for. In order to guarantee that the curvature is positive semi-definite, we approximate the Hessian with the Fisher Information as in [23] throughout our experiments.

2.4 Regularizing the approximate posterior

Kirkpatrick et al. [16], who develop a similar method inspired by the Laplace approximation, suggest using a multiplier λ on the quadratic penalty in Eq. (5). This hyperparameter provides a way of trading off retaining performance on previous tasks against having sufficient flexibility for learning a new one. As modifying the objective would propagate into the recursion for the precision matrix, we instead place the multiplier on the Hessian of each log likelihood and update the precision as:

Λt+1 = λ Ht+1(µt+1) + Λt    (11)

The multiplier affects the width of the approximate posterior and thus the location of the next MAP estimate. 
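The per-layer identity behind the efficient penalty computation is easy to verify numerically. The sketch below (shapes and names are illustrative assumptions; Q is the input-covariance factor and H the pre-activation curvature factor) checks that the Kronecker-structured matrix-vector product reduces to two small matrix-matrix products, so the full square matrix never needs to be formed.

```python
import numpy as np

# Numeric check of the identity used to apply the quadratic penalty per layer:
# (Q ⊗ H) vec(W - W*) = vec(H (W - W*) Q), with vec stacking columns and
# H, Q symmetric. Q is d_in x d_in, H is d_out x d_out, W is d_out x d_in.
rng = np.random.default_rng(2)
d_in, d_out = 4, 3
Q = rng.normal(size=(d_in, d_in)); Q = Q @ Q.T      # symmetric PSD factor
H = rng.normal(size=(d_out, d_out)); H = H @ H.T
dW = rng.normal(size=(d_out, d_in))                  # W - W*

vec = lambda M: M.reshape(-1, order="F")             # column stacking
lhs = np.kron(Q, H) @ vec(dW)                        # naive, large product
rhs = vec(H @ dW @ Q)                                # two small products
assert np.allclose(lhs, rhs)

# The penalty itself, 0.5 * vec(dW)^T (Q ⊗ H) vec(dW), computed without ever
# forming the (d_in*d_out)-square matrix:
penalty = 0.5 * vec(dW) @ vec(H @ dW @ Q)
assert np.isclose(penalty, 0.5 * vec(dW) @ (np.kron(Q, H) @ vec(dW)))
```

This is the source of the O(Ld³) cost quoted for the penalty: two d × d matrix multiplications per layer rather than one multiplication with a d² × d² matrix.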
As it acts directly on the parameter of a probability distribution, its optimal value can inform us about the quality of our approximation: if it strongly deviates from its natural value of 1, our approximation is a poor one and over- or underestimates the uncertainty about the parameters. We visualize the effect of λ in Fig. 5 in Appendix B.

2.5 Computational complexity

Our method requires calculating the expectations of the two Kronecker factors from Eq. (9) over the data of the most recent task after training on it, as well as calculating the quadratic penalty in Eq. (5) using the identity in Eq. (10) for every parameter update.
Calculating the Kronecker factors can efficiently be mini-batched and requires the same calculations as a forward and backward pass through the network plus two additional matrix-matrix products. The overall cost is thus effectively equivalent to that of an extra training epoch. See [23] for more details.
The computational complexity of calculating the quadratic penalty is dominated by the two matrix-matrix products in Eq. (10). Assuming that all L layers of the network as well as the inputs and outputs are of dimensionality d, all weight matrices as well as the Kronecker factors will be of dimensionality d × d.3 The complexity of calculating the penalty for all layers is then O(Ld³).
Finally, we note that sums of Kronecker products do not add up pairwise, i.e. A ⊗ B + C ⊗ D ≠ (A + C) ⊗ (B + D), so the corresponding Kronecker factors of different tasks do not simply add. In our implementation, we keep an approximate Hessian for every task in memory, similar to how EWC [16] keeps the MAP parameters for each task. If constant scaling in the number of tasks is required, one can make a further approximation by adding up the Kronecker factors separately. 
This\nwould be comparable to the independence assumption between the factors within the same task.\n\n3 Related work\n\nOur method is closely related to Bayesian online learning [31] and to Laplace propagation [4]. In\ncontrast to Bayesian online learning, as we cannot update the posterior over the weights in closed\nform, we use gradient-based methods to \ufb01nd a mode and perform a quadratic approximation around\nit, resulting in a Gaussian approximation. Laplace propagation, similar to expectation propagation\n[27], maintains a factor for every task, but approximates each of them with a Gaussian. It performs\nmultiple updates, whereas we use each dataset only once to update the approximation to the posterior.\nThe most similar method to ours for overcoming catastrophic forgetting is Elastic Weight Consolida-\ntion (EWC) [16]. EWC approximates the posterior after the \ufb01rst task with a Gaussian. However, it\ncontinues to add a penalty for every new task [17]. This is more closely related to Laplace propagation,\nbut may be overcounting early tasks [14] and does not approximate the posterior. Furthermore, EWC\nuses a simple diagonal approximation to the Hessian. Lee et al. [20] approximate the posterior around\nthe mode for each dataset with a diagonal Gaussian in addition to a similar approximation of the\noverall posterior. They update this approximation to the posterior as the Gaussian that minimizes the\nKL divergence with the individual posterior approximations. Nguyen et al. [30] implement online\nvariational learning [7, 13], which \ufb01ts an approximation to the posterior through the variational\nlower bound and then uses this approximation as the prior on the next task. 
Their Gaussian is fully factorized, hence they do not take weight interactions into account either.
[34] and [9] have independently proposed the use of block-diagonal Kronecker factored curvature approximations [23, 2] to sample from an approximate Gaussian posterior over the weights of a neural network. They find that this requires adding a multiple of the identity to their curvature factors as an ad-hoc regularizer, which is not necessary for our method. In our work, we use an approximate posterior with the same Kronecker factored covariance structure as a prior for subsequent tasks. We iteratively update this approximation for every new dataset. The curvature factors that we accumulate throughout training could be used on top of our method to approximate the predictive posterior similar to [34, 9]. However, both the curvature factors and the mode that our method finds will be different to performing a Laplace approximation in batch mode. Our work links the Kronecker factored Laplace approximation [34] to Bayesian online learning [31] similar to how Variational Continual Learning [30] connects Online Variational Learning [7, 13] to Bayes-by-Backprop [1].
We discuss additional related methods without a Bayesian motivation in Appendix C.

3In general, the size of the first factor is square in the dimensionality of the input to a layer and that of the second factor square in the number of units, i.e. din × din and dout × dout for a dout × din weight matrix.

Figure 1: Mean test accuracy on a sequence of permuted MNIST datasets. We categorize SI as a diagonal method, as it does not account for parameter interactions. The dotted black line shows the performance of a single network trained on all observed data at each task.

Figure 2: Effect of λ for different curvature approximations for permuted MNIST ((a) Kronecker factored, (b) Diagonal). 
Each plot\nshows the mean, minimum and maximum across\nthe tasks observed so far, as well as the accuracy\non the \ufb01rst and most recent task.\n\n4 Experiments\n\nIn our experiments we compare our online Laplace approximation to the approximate Laplace\napproximation of Eq. (8) as well as EWC [16] and Synaptic Intelligence (SI) [41], both of which also\nadd quadratic regularizers to the objective. Further, we investigate the effect of using a block-diagonal\nKronecker factored approximation to the curvature over a diagonal one. We also run EWC with a\nKronecker factored approximation, even though the original method is based on a diagonal one. We\nimplement our experiments using Theano [39] and Lasagne [3] software libraries.4\n\n4.1 Permuted MNIST\n\nAs a \ufb01rst experiment, we test on a sequence of permutations of the MNIST dataset [19]. Each\ninstantiation consists of the 28\u00d728 grey-scale images and labels from the original dataset with a \ufb01xed\nrandom permutation of the pixels. This makes the individual data distributions mostly independent of\neach other, testing the ability of each method to fully utilize the model\u2019s capacity.\nWe train a feed-forward network with two hidden layers of 100 units and ReLU nonlinearities on a\nsequence of 50 versions of permuted MNIST. Every one of these datasets is equally dif\ufb01cult for a\nfully connected network due to its permutation invariance to the input. We stress that our network is\nsmaller than in previous works as the limited capacity of the network makes the task more challenging.\nFurther, we train on a longer sequence of datasets. Optimization details are in Appendix D.\nFig. 1 shows the mean test accuracy as new datasets are observed for the optimal hyperparameters\nof each method. 
We refer to the online Laplace approximation as \u2018Online Laplace\u2019, to the Laplace\napproximation around an approximate mode as \u2018Approximate Laplace\u2019 and to adding a quadratic\npenalty for every set of MAP parameters as in [16] as \u2018Per-task Laplace\u2019. The per-task Laplace\nmethod with a diagonal approximation to the Hessian corresponds to EWC.\nWe \ufb01nd our online Laplace approximation to maintain higher test accuracy throughout training than\nplacing a quadratic penalty around the MAP parameters of every task, in particular when using a\nsimple diagonal approximation to the Hessian. However, the main difference between the methods\nlies in using a Kronecker factored approximation of the curvature over a diagonal one.5 Using this\napproximation, we achieve over 90% average test accuracy across 50 tasks, almost matching the\nperformance of a network trained jointly on all observed data. Recalculating the curvature for each\ntask instead of retaining previous estimates does not signi\ufb01cantly affect performance.\n\n4Our fork with code to calculate the Kronecker factors is available at: www.github.com/BB-UCL/Lasagne\n5In earlier work, e.g. [16, 41], diagonal approximations were reported to be effective for a smaller number of\n\ntasks and with substantially larger networks than in our experiments.\n\n6\n\n\fBeyond simple average performance, we investigate different values of the hyperparameter \u03bb on the\npermuted MNIST sequence of datasets for our online Laplace approximation. The goal is to visualize\nhow it affects the trade-off between remembering previous tasks and being able to learn new ones\nfor the two approximations of the curvature that we consider. Fig. 
2 shows various statistics of the\naccuracy on the test set for the smallest and largest value of the hyperparameter on the quadratic\npenalty that we tested, as well as the one that optimizes the validation error.\nWe are particularly interested in the performance on the \ufb01rst dataset and the most recent one, as a\nmeasure for memory and \ufb02exibility respectively. For all displayed values of the hyperparameter, the\nKronecker factored approximation (Fig. 2a) has higher test accuracy than the diagonal approximation\n(Fig. 2b) on both the most recent and the \ufb01rst task, as well as on average. For the natural choice of\n\u03bb = 1 (leftmost sub\ufb01gure respectively), the network\u2019s performance decays for the \ufb01rst task for both\ncurvature approximations, yet it is able to learn the most recent task well. The performance on the\n\ufb01rst task decays more slowly, however, for the more expressive Kronecker factored approximation\nof the curvature. Increasing the hyperparameter, corresponding to making the prior more narrow\nas discussed in Section 2.4, leads to the network remembering the \ufb01rst task much better at the cost\nof not being able to achieve optimal performance on the most recently added task. Using \u03bb = 3\n(central sub\ufb01gure), the value that achieves optimal validation error in our experiments, the Kronecker\nfactored approximation leads to the network performing similarly on the most recent and \ufb01rst tasks.\nThis coincides with optimal average test accuracy. We are not able to \ufb01nd such an ideal trade-off for\nthe diagonal Hessian approximation, resulting in worse average performance and suggesting that the\nposterior cannot be matched well without accounting for interactions between the weights. 
Using\na large value of \u03bb = 100 (rightmost sub\ufb01gure) reverts the order of performance between the most\nrecent and the \ufb01rst task for both approximations: while for small \u03bb the \ufb01rst task is \u2018forgotten\u2019, the\nnetwork\u2019s performance now stays at a high level \u2014 for the Kronecker factored approximation it\nremembers it perfectly \u2014 which comes at the cost of being unable to learn new tasks well.\nWe conclude from our results that the online Laplace approximation overestimates the uncertainty in\nthe approximate posterior about the parameters for the permuted MNIST task, in particular with a\ndiagonal approximation to the Hessian. Overestimating the uncertainty leads to a need for regulariza-\ntion in the form of reducing the width of the approximate posterior, as the value that optimizes the\nvalidation error is \u03bb = 3. Only when regularizing too strongly the approximate posterior underesti-\nmates the uncertainty about the weights, leading to reduced performance on new tasks for large values\nof \u03bb. Using a better approximation to the posterior leads to a drastic increase in performance and a\nreduced need for regularization in the subsequent experiments. We note that some regularization is\nstill necessary, suggesting that even the Kronecker factored approximation overestimates the variance\nin the posterior, and a better approximation could lead to further improvements. 
However, it is also\npossible that the Laplace approximation as such requires a large amount of data to estimate the\ninteraction between the parameters suf\ufb01ciently well; hence it might be best suited for settings where\nplenty of data are available.\n\n4.2 Disjoint MNIST\n\nWe further experiment with the disjoint MNIST task, which\nsplits the MNIST dataset into one part containing the digits \u20180\u2019\nto \u20184\u2019, and a second part containing \u20185\u2019 to \u20189\u2019 and training a\nten-way classi\ufb01er on each set separately. Previous work [20]\nhas found this problem to be challenging for EWC, as during\nthe \ufb01rst half of training the network is encouraged to set the\nbias terms for the second set of labels to highly negative values.\nThis setup makes it dif\ufb01cult to balance out the biases for the\ntwo sets of classes after the \ufb01rst task without overcorrecting and\nsetting the biases for the \ufb01rst set of classes to highly negative\nvalues. Lee et al. [20] report just over 50% test accuracy for\nEWC, which corresponds to either completely forgetting the\n\ufb01rst task or being unable to learn the second one, as each task\nindividually can be solved with around 99% accuracy.\nWe use an identical network architecture to the previous section\nand found stronger regularization of the approximate posterior\n\n7\n\nFigure 3: Disjoint MNIST test ac-\ncuracy for the Laplace approxima-\ntion (hyperparameter: \u03bb) and SI\n(hyperparameter: c).\n\u2018Kronecker\nfactored\u2019 and \u2018Diagonal\u2019 refer to\nthe respective curvature approxima-\ntion for the Laplace method.\n\n\fFigure 4: Test accuracy of a convolutional network on a sequence of vision datasets. We train on\nthe datasets separately in the order displayed from top to bottom and show the network\u2019s accuracy\non each dataset once training on it has started. 
The dotted black line indicates the performance of a network with the same architecture trained separately on the task. The diagonal and Kronecker factored approximation to the Hessian both use our online Laplace method to prevent forgetting.

to be necessary. For the Laplace methods, we tested values of λ ∈ {1, 3, 10, . . . , 3×10^5, 10^6}, and c ∈ {0.1, 0.3, 1, . . . , 3×10^4, 10^5} for SI. We train using Nesterov momentum with a learning rate of 0.1 and momentum of 0.9 and decay the learning rate by a factor of 10 every 1000 parameter updates using a batch size of 250. We decay the initial learning rate for the second task depending on the hyperparameter to prevent the objective from diverging. We test various decay factors for each hyperparameter, but as a rule of thumb found λ/10 to perform well for the Kronecker factored, and λ/1000 for the diagonal approximation. The results are averaged across ten independent runs.
Fig. 3 shows the test accuracy for various hyperparameter values for a Kronecker factored and a diagonal approximation of the curvature as well as SI. As there are only two datasets, the three Laplace-based methods are identical; we therefore focus on the impact of the curvature approximation. Approximating the Hessian with a diagonal corresponds to EWC. While we do not match the performance of the method developed in [20], we find the Laplace approximation to work significantly better than reported by the authors. The Kronecker factored approximation gives a small improvement over the diagonal one and requires weaker regularization, which further suggests that it better fits the true posterior. It also outperforms SI.

4.3 Vision datasets

As a final experiment, we test our method on a suite of related vision datasets. Specifically, we train and test on MNIST [19], notMNIST6, Fashion MNIST [40], SVHN [29] and CIFAR10 [18] in this order. 
All five datasets contain around 50,000 training images from 10 different classes. MNIST contains hand-written digits from '0' to '9', notMNIST the letters 'A' to 'J' in different computer fonts, Fashion MNIST different categories of clothing, SVHN the digits '0' to '9' on street signs and CIFAR10 ten different categories of natural images. We zero-pad the images of the MNIST-like datasets to be of size 32×32 and replicate their intensity values over three channels, such that all images have the same format.

6 Originally published at www.yaroslavvb.blogspot.co.uk/2011/09/notmnist-dataset.html and downloaded from www.github.com/davidflanagan/notMNIST-to-MNIST

We train a LeNet-like architecture [19] with two convolutional layers with 5×5 convolutions with 20 and 50 channels respectively and a fully connected hidden layer with 500 units. We use ReLU nonlinearities and perform a 2×2 max-pooling operation with stride 2 after each convolutional layer. An extension of the Kronecker factored curvature approximations to convolutional neural networks is presented in [10]. As the meaning of the classes in each dataset is different, we keep the weights of the final layer separate for each task. We optimize the networks as in the permuted MNIST experiment and compare to five baseline networks with the same architecture trained on each task separately.

Overall, the online Laplace approximation in conjunction with a Kronecker factored approximation of the curvature achieves the highest test accuracy across all five tasks (see Appendix E for the numerical results). However, the difference between the three Laplace-based methods is small in comparison to the improvement stemming from the better approximation to the Hessian. We therefore plot the test accuracy curves through training only for the online Laplace approximation in the main text in Fig. 4, to show the difference to SI and between the two curvature approximations. The corresponding figures for having a separate quadratic penalty for each task and the approximate Laplace approximation are in Appendix F.

Using a diagonal Hessian approximation for the Laplace approximation, the network mostly remembers the first three tasks, but has difficulties learning the fifth one. SI, in contrast, shows decaying performance on the initial tasks, but learns the fifth task almost as well as our method with a Kronecker factored approximation of the Hessian. However, using the Kronecker factored approximation, the network achieves good performance relative to the individual networks across all five tasks. In particular, it remembers the easier early tasks almost perfectly while being sufficiently flexible to learn the more difficult later tasks better than the diagonal methods, which suffer from forgetting.

5 Conclusion

We proposed the online Laplace approximation, a Bayesian online learning method for overcoming catastrophic forgetting in neural networks. By formulating a principled approximation to the posterior, we were able to substantially improve over EWC [16] and SI [41], two recent methods that also add a quadratic regularizer to the objective for new tasks. By further taking interactions between the parameters into account, we achieved considerable increases in test accuracy on the problems that we investigated, in particular for long sequences of datasets. Our results demonstrate the importance of going beyond diagonal approximation methods, which only measure the sensitivity of individual parameters. Dealing with the complex interaction and correlation between parameters is necessary in moving towards a more complete response to the challenge of continual learning.

Acknowledgements

We thank Raza Habib, Harshil Shah and the anonymous reviewers for their feedback.
This work was supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1.

References

[1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. In International Conference on Machine Learning, pages 1613–1622, 2015.

[2] A. Botev, H. Ritter, and D. Barber. Practical Gauss-Newton Optimisation for Deep Learning. In International Conference on Machine Learning, pages 557–565, 2017.

[3] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, et al. Lasagne: First release, August 2015.

[4] E. Eskin, A. J. Smola, and S. Vishwanathan. Laplace Propagation. In Advances in Neural Information Processing Systems, pages 441–448, 2004.

[5] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution Channels Gradient Descent in Super Neural Networks. arXiv preprint arXiv:1701.08734, 2017.

[6] R. M. French. Catastrophic Forgetting in Connectionist Networks. Trends in Cognitive Sciences, 3:128–135, 1999.

[7] Z. Ghahramani. Online Variational Bayesian Learning. 2000. Slides from talk presented at NIPS 2000 workshop on Online Learning.

[8] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An Empirical Investigation of Catastrophic Forgetting in Gradient-based Neural Networks. arXiv preprint arXiv:1312.6211, 2013.

[9] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting Gradient-Based Meta-Learning as Hierarchical Bayes. In International Conference on Learning Representations, 2018.

[10] R. Grosse and J. Martens. A Kronecker-factored Approximate Fisher Matrix for Convolution Layers. In International Conference on Machine Learning, pages 573–582, 2016.

[11] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions, volume 104. CRC Press, 1999.

[12] X. He and H. Jaeger. Overcoming Catastrophic Interference using Conceptor-Aided Backpropagation. In International Conference on Learning Representations, 2018.

[13] A. Honkela and H. Valpola. On-line Variational Bayesian Learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808, 2003.

[14] F. Huszár. Note on the Quadratic Penalties in Elastic Weight Consolidation. Proceedings of the National Academy of Sciences, 2018.

[15] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences, pages 3521–3526, 2017.

[17] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Reply to Huszár: The Elastic Weight Consolidation Penalty is Empirically Valid. Proceedings of the National Academy of Sciences, 2018.

[18] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images. 2009.

[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[20] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang. Overcoming Catastrophic Forgetting by Incremental Moment Matching. In Advances in Neural Information Processing Systems, pages 4655–4665, 2017.

[21] D. Lopez-Paz and M. Ranzato. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems, pages 6470–6479, 2017.

[22] D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4:448–472, 1992.

[23] J. Martens and R. Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

[24] J. Martens. New Insights and Perspectives on the Natural Gradient Method. arXiv preprint arXiv:1412.1193, 2014.

[25] P. Maybeck. Stochastic Models, Estimation and Control, chapter 12.7. Academic Press, 1982.

[26] M. McCloskey and N. J. Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation - Advances in Research and Theory, 24:109–165, 1989.

[27] T. P. Minka. Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369, 2001.

[28] Y. Nesterov. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k^2). Soviet Mathematics Doklady, 27:372–376, 1983.

[29] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, page 5, 2011.

[30] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational Continual Learning. In International Conference on Learning Representations, 2018.

[31] M. Opper and O. Winther. A Bayesian Approach to On-line Learning. In On-line Learning in Neural Networks, ed. D. Saad, pages 363–378, 1998.

[32] B. T. Polyak. Some Methods of Speeding up the Convergence of Iteration Methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964.

[33] R. Ratcliff. Connectionist Models of Recognition Memory: Constraints Imposed by Learning and Forgetting Functions. Psychological Review, 97:285–308, 1990.

[34] H. Ritter, A. Botev, and D. Barber. A Scalable Laplace Approximation for Neural Networks. In International Conference on Learning Representations, 2018.

[35] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive Neural Networks. arXiv preprint arXiv:1606.04671, 2016.

[36] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv preprint arXiv:1706.04454, 2017.

[37] J. Serrà, D. Surís, M. Miron, and A. Karatzoglou. Overcoming Catastrophic Forgetting with Hard Attention to the Task. arXiv preprint arXiv:1801.01423, 2018.

[38] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual Learning with Deep Generative Replay. arXiv preprint arXiv:1705.08690, 2017.

[39] Theano Development Team. Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv e-prints, abs/1605.02688, May 2016.

[40] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.

[41] F. Zenke, B. Poole, and S. Ganguli. Continual Learning through Synaptic Intelligence. In International Conference on Machine Learning, pages 3987–3995, 2017.