{"title": "Superposition of many models into one", "book": "Advances in Neural Information Processing Systems", "page_first": 10868, "page_last": 10877, "abstract": "We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.", "full_text": "Superposition of many models into one\n\nBrian Cheung\n\nRedwood Center, BAIR\n\nUC Berkeley\n\nbcheung@berkeley.edu\n\nAlex Terekhov\nRedwood Center\n\nUC Berkeley\n\naterekhov@berkeley.edu\n\nYubei Chen\n\nRedwood Center, BAIR\n\nUC Berkeley\n\nyubeic@berkeley.edu\n\nPulkit Agrawal\n\nBAIR\n\nUC Berkeley\n\npulkitag@berkeley.edu\n\nBruno Olshausen\n\nRedwood Center, BAIR\n\nUC Berkeley\n\nbaolshausen@berkeley.edu\n\nAbstract\n\nWe present a method for storing multiple models within a single set of parame-\nters. Models can coexist in superposition and still be retrieved individually. In\nexperiments with neural networks, we show that a surprisingly large number of\nmodels can be effectively stored within a single parameter instance. Furthermore,\neach of these models can undergo thousands of training steps without signi\ufb01cantly\ninterfering with other models within the superposition. 
This approach may be\nviewed as the online complement of compression: rather than reducing the size\nof a network after training, we make use of the unrealized capacity of a network\nduring training.\n\n1 Introduction\n\nWhile connectionist models have enjoyed a resurgence of interest in the artificial intelligence community,\nit is well known that deep neural networks are over-parameterized and a majority of the weights\ncan be pruned after training [7, 20, 3, 8, 9, 1]. Such pruned neural networks achieve accuracies similar\nto the original network but with far fewer parameters. However, it has not been possible to exploit\nthis redundancy to train a neural network with fewer parameters from scratch to achieve accuracies\nsimilar to its over-parameterized counterpart. In this work we show that it is possible to partially\nexploit the excess capacity present in neural network models during training by learning multiple\ntasks. Suppose that a neural network with L parameters achieves desirable accuracy at a single\ntask. We outline a method for training a single neural network with L parameters to simultaneously\nperform K different tasks, thereby effectively requiring $\\approx O(L/K)$ parameters per task.\nWhile we learn a separate set of parameters $(W_k; k \\in [1, K])$ for each of the K tasks, these\nparameters are stored in superposition with each other, thus requiring approximately the same\nnumber of parameters as a model for a single task. The task-specific models can be accessed using\ntask-specific \u201ccontext\u201d information $C_k$ that dynamically \u201croutes\u201d an input towards a specific model\nretrieved from this superposition. The model parameters $W$ can therefore be thought of as a \u201cmemory\u201d\nand the context $C_k$ as \u201ckeys\u201d that are used to access the specific parameters $W_k$ required for a task.\nSuch an interpretation is inspired by Kanerva\u2019s work on hetero-associative memory [4].\nBecause the parameters for different tasks exist in superposition with each other and are constantly\nchanging during training, it is possible that these individual parameters interfere with each other and\nthereby result in loss in performance on individual tasks. We show that under the mild assumption that\nthe input data is intrinsically low-dimensional relative to its ambient space (e.g. natural images\nlie on a much lower dimensional subspace as compared to their representation of individual pixels\nwith RGB values), it is possible to choose context that minimizes such interference. The proposed\nmethod has wide-ranging applications such as training neural networks in memory-constrained\nenvironments, online learning of multiple tasks and overcoming catastrophic forgetting.\n\nFigure 1: Left: Parameters for different models w(1), w(2) and w(3) for different tasks are stored\nin superposition with each other in w. Right: To prevent interference between (A) similar sets of\nparameter vectors $w(s)$, $s \\in \\{1, 2, 3\\}$, we (B) store these parameters after rotating the weights\ninto nearly orthogonal parts of the space using task-dependent context information ($C^{-1}(s)$). An\nappropriate choice of $C(s)$ ensures that we can (C) retrieve $\\hat{w}(k)$ by the operation $w C(k)$ in a manner\nthat $w(s)$, for $s \\neq k$, will remain nearly orthogonal, reducing interference during learning.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nApplication to Catastrophic Forgetting: Online learning and sequential training of multiple tasks\nhas traditionally posed a challenge for neural networks. If the distribution of inputs (e.g. 
changes in\nappearance from day to night) or the distribution output labels changes over time (e.g. changes in the\ntask) then training on the most recent data leads to poor performance on data encountered earlier. This\nproblem is known as catastrophic forgetting [12, 15, 2]. One way to deal with this issue is to maintain\na memory of all the data and train using batches that are constructed by uniformly and randomly\nsampling data from this memory (replay buffers [14]). However in memory constrained settings this\nsolution is not viable. Some works train a separate network (or sub-parts of network) for separate\ntask [17, 19, 11]. The other strategy is to selectively update weights that do not play a critical role on\nprevious tasks using variety of criterion such as: Fisher information between tasks [5], learning an\nattention mask to decide which weights to change [10, 18] and other criterion [22]. However, these\nmethods prevent re-use of weights in the future and therefore intrinsically limit the capacity of the\nnetwork to learn future tasks and increase computational cost. Furthermore, for every new task, one\nadditional variable per weight parameter indicating whether this weight can be modi\ufb01ed in the future\nor not (i.e. L new parameters per task) needs to be stored.\nWe propose a radically different way of using the same set of parameters in a neural network to\nperform multiple tasks. We store the weights for different tasks in superposition with each other\nand do not explicitly constrain how any speci\ufb01c weight parameter changes within the superposition.\nFurthermore, we need to store substantially less additional variables per new task (1 additional\nvariable per task for one variant of our method; Section 2.1). 
We demonstrate the efficacy of our\napproach of learning via parameter superposition on two separate online image-classification settings:\n(a) time-varying input data distribution and (b) time-varying output label distribution. With parameter\nsuperposition, it is possible to overcome catastrophic forgetting on the permuting MNIST [5] task,\ncontinuously changing input distribution on rotating MNIST and fashion MNIST tasks and when the\noutput labels are changing on the incremental CIFAR dataset [16].\n\n2 Parameter Superposition\n\nThe intuition behind Parameter Superposition (PSP) as a method to store many models simultaneously\nin one set of parameters stems from analyzing the fundamental operation performed in all neural\nnetworks \u2013 multiplying the inputs $(x \\in \\mathbb{R}^N)$ by a weight matrix $(W \\in \\mathbb{R}^{M \\times N})$ to compute features\n$(y = W x)$. Over-parameterization of a network essentially implies that only a small subspace\nspanned by the rows of $W$ in $\\mathbb{R}^N$ is relevant for the task.\nLet $W_1, W_2, \\ldots, W_K$ be the set of parameters required for each of the K tasks. If only a small\nsubspace in $\\mathbb{R}^N$ is required by each $W_k$, it should be possible to transform each $W_k$ using a task-specific\nlinear transformation $C_k^{-1}$ (which we call the context), such that the rows of each $W_k C_k^{-1}$ occupy\nmutually orthogonal subspaces in $\\mathbb{R}^N$ (see Figure 1). Because each $W_k C_k^{-1}$ occupies a different\nsubspace, these parameters can be summed together without interfering when stored in superposition:\n\n$$W = \\sum_{i=1}^{K} W_i C_i^{-1} \\qquad (1)$$\n\nThis is similar to the superposition principle in Fourier analysis where a signal is represented as a\nsuperposition of sinusoids. Each sinusoid can be considered as the \u201ccontext\u201d. The parameters for an\nindividual task can be retrieved using the context $C_k$; we refer to them as $\\hat{W}_k$:\n\n$$\\hat{W}_k = W C_k = \\sum_{i=1}^{K} W_i \\left(C_i^{-1} C_k\\right) \\qquad (2)$$\n\nBecause the weights are stored in superposition, the retrieved weights ($\\hat{W}_k$) are likely to be a noisy\nestimate of $W_k$. Noisy retrieval will not affect the overall performance if $\\hat{W}_k x = W_k x + \\epsilon$, where $\\epsilon$\nstays small. A detailed analysis of $\\epsilon$ for some choices of context vectors described in Section 2.1 can\nbe found in Appendix A.\nIn the special case of $C_k^{-1} = C_k^T$, each $C_k$ would be an orthogonal matrix representing a rotation.\nAs matrix multiplication is associative, $y_k = (W C_k) x$ can be rewritten as $y_k = W (C_k x)$. The PSP\nmodel for computing outputs for the kth task is therefore,\n\n$$y_k = W (C_k x) \\qquad (3)$$\n\nIn this form PSP can be thought of as learning a single set of parameters $W$ for multiple tasks,\nafter rotating the inputs ($x$) into orthogonal subspaces of $\\mathbb{R}^N$. It is possible to construct such\northogonal rotations of the input when x itself is over-parameterized (i.e. it lies on a low-dimensional\nmanifold). The assumption that x occupies a low-dimensional manifold is a mild one and it is well\nknown that natural signals such as images and speech do indeed have this property.\n\nTable 1: Parameter count for superposition of a linear transformation of size L = M \u00d7 N. \u2018+1 model\u2019\nrefers to the number of additional parameters required to add a new model.\n\nMethod | # parameters | +1 model\nStandard | MN | MN\nRotational | M(N + M) | M^2\nBinary | M(N + 1) | M\nComplex | 2M(N + 0.5) | M\nOnePower | 2M(N + 0.5) | 1\n\n2.1 Choosing the Context\n\nRotational Superposition The most general way to choose the context is to sample rotations\nuniformly from the orthogonal group O(M) (Haar distribution)1. 
We refer to this formulation as\nrotational superposition. In this case, $C_k \\in \\mathbb{R}^{M \\times M}$ and therefore training for a new task would\nrequire $M^2$ more parameters. Thus, training for K tasks would require $MN + (K-1)M^2$ parameters.\nIn many scenarios $M \\sim N$ and therefore learning by such a mechanism would require approximately\nas many parameters as training a separate neural network for each task. Therefore, rotational\nsuperposition in its most general form is not memory efficient.\n\nFootnote 1: We use scipy.stats.ortho_group.\n\nIt is possible to reduce the memory requirements of rotational superposition by restricting the context\nto a subset of the orthogonal group, e.g. random permutation matrices, block diagonal matrices\nor diagonal matrices. As a special case, we choose $C_k = \\mathrm{diag}(c_k)$ to be a diagonal matrix with\nthe diagonal entries given by the vector $c_k$. With such a choice, only M additional parameters are\nrequired per task (see Table 1). In the case of a diagonal context, PSP in Equation 3 reduces to an\nelement-wise multiplication (symbol $\\odot$) between $c_k$ and $x$, and can be written as:\n\n$$y = W (c(k) \\odot x) \\qquad (4)$$\n\nThere are many choices of $c_k$ that lead to construction of orthogonal matrices:\n\nComplex Superposition In Equation 4, we can choose $c_k$ to be a vector of complex numbers, where\neach component $c_k^j$ is given by,\n\n$$c_k^j = e^{i \\phi_j(k)} \\qquad (5)$$\n\nEach of the $c_k^j$ lies on the complex unit circle. The phase $\\phi_j(k) \\in [-\\pi, \\pi]$ for all j is sampled with\nuniform probability density $p(\\phi) = \\frac{1}{2\\pi}$. It can be seen that such a choice of $c_k$ results in a diagonal\northogonal matrix.\n\nPowers of a single context The memory footprint of complex superposition can be reduced to a\nsingle parameter per task, by choosing context vectors that are integer powers of one context vector:\n\n$$c_k^j = e^{i \\phi_j k} \\qquad (6)$$\n\nBinary Superposition Constraining the phase to two possible values $\\phi_j(k) \\in \\{0, \\pi\\}$ is a special\ncase of complex superposition. The context vectors become $c(k)_j \\in \\{-1, 1\\}$. We refer to this\nformulation as binary superposition. The low precision of the context vectors in this form of\nsuperposition has both computational and memory advantages. Furthermore, binary superposition is\ndirectly compatible with both real-valued and low-precision linear transformations.\n\n3 Neural Network Superposition\n\nWe can extend these formulations to entire neural network models by applying superposition (Equation\n3) to the linear transformation of all layers l of a neural network:\n\n$$x^{(l+1)} = g\\left(W^{(l)} \\left(c(k)^{(l)} \\odot x^{(l)}\\right)\\right) \\qquad (7)$$\n\nwhere g() is a non-linearity (e.g. ReLU).\n\nExtension to Convolutional Networks For neural networks applied to vision tasks, convolution is\ncurrently the dominant operation in a majority of layers. Since the dimensionality of convolution\nparameters is usually much smaller than the input image, it makes more sense computationally to\napply context to the weights rather than the input. 
By associativity of multiplication, we are able to\nreduce computation by applying a context tensor $c(k) \\in \\mathbb{C}^{M \\times H_w \\times W_w}$ to the convolution kernel\n$w \\in \\mathbb{C}^{N \\times M \\times H_w \\times W_w}$ instead of the input image $x \\in \\mathbb{C}^{M \\times H_x \\times W_x}$:\n\n$$y_n = (w_n \\odot c(k)) * x \\qquad (8)$$\n\nwhere $*$ is the convolution operator, M is the input channel dimension, and N is the output channel\ndimension.\n\n4 Experiments\n\nThere are two distinct ways in which the data distribution can change over time: (a) change in the input\ndata distribution and (b) change in the output labels over time. Neural networks are high-capacity\nlearners and can even learn tasks with random labelling [23]. Despite shifts in data distribution, if\ndata from different tasks are pooled together and uniformly sampled to construct training batches,\nneural networks are expected to perform well. However, a practical scenario of interest is when it is\nnot possible to access all the data at once, and online training is necessary. In such scenarios, training\nthe neural network on the most recent task leads to loss in performance on earlier tasks. This problem\nis known as catastrophic forgetting \u2013 learning on one task interferes with performance on another task.\nWe evaluate the performance of the proposed PSP method on mitigating the interference in learning due\nto changes in input and output distributions.\n\n4.1 Input Interference\n\nA common scenario in online learning is when the input data distribution changes over time (e.g.\nvisual input from day to night). The permuting MNIST dataset [2] is a variant of the MNIST dataset [7]\nwhere the image pixels are permuted randomly to create new tasks over time. Each permutation of\npixels corresponds to a new task. The output labels are left unchanged. 
Permuting MNIST has been\nused by many previous works [2, 22, 17, 5] to study the problem of catastrophic forgetting. In our\nsetup, a new task is created after every 1000 mini-batches (steps) of training by permuting the image\npixels. To adapt to the new task, all layers of the neural network are finetuned. Figure 2 shows that a\nstandard neural network suffers from catastrophic forgetting and the performance on the first task\ndegrades after training on newer tasks.\n\nFigure 2: Comparing the accuracy of the binary superposition model (blue) with the baseline model\n(orange) for fully connected networks with differing numbers of units (128\nto 2048) on the permuting MNIST challenge. On this challenge, the inputs are permuted after every\n1000 iterations, and each permutation corresponds to a new task. 50K iterations therefore correspond\nto 50 different tasks presented in sequence. Dotted red line indicates completion of 10 tasks. It is\nto be expected that larger networks can fit more data and be more robust to catastrophic forgetting.\nWhile indeed this is true and the baseline model does better with more units, the PSP model is far\nsuperior and the effect of catastrophic forgetting is negligible in larger networks.\n\nSeparate context parameters are chosen for every task. Each choice of context can be thought of as\ncreating a new model within the same neural network that can be used to learn a new task. In case\nof binary superposition (Section 2.1), a random binary vector is chosen for each task, for complex\nsuperposition (Section 2.1), a random complex number (constant magnitude, random phase) is chosen\nand for rotational superposition (Section 2.1) a random orthogonal matrix is chosen. Note that use\nof task identity information to overcome catastrophic forgetting is not special to our method, but\nhas been used by all previous methods [22, 5, 17]. 
We investigated the ef\ufb01cacy of PSP in mitigating\nforgetting with changes in network size and the methods of superposition.\n\n4.1.1 Effect of network size on catastrophic forgetting\n\nBigger networks have more parameters and can thus be expected to be more robust to catastrophic\nforgetting as they can \ufb01t to larger amounts of data. We trained fully-connected networks with two\nhidden layers on \ufb01fty permuting MNIST tasks presented sequentially. The size of hidden layers was\nvaried from 128 to 2048 units. Results in Figure 2 show marginal improvements in performance of\nthe standard neural network with its size. The PSP method with binary superposition (pspBinary) is\nsigni\ufb01cantly more robust to catastrophic forgetting as compared to the standard baseline. Because\nhigher number of parameters create space to pack a larger number of models in super-position, the\nperformance of pspBinary also improves with network size and with hidden layer of size 2048, the\nperformance on the initial task is virtually unchanged even after training for 49 other tasks with very\ndifferent input data distribution.\n\n4.1.2 Effect of types of superposition on catastrophic forgetting\n\nDifferent methods of storing models in superposition use a different number of additional parameters\nper task. While pspBinary and pspComplex require M (where M is the size of the input to each\nlayer for a fully-connected network) additional parameters; pspRotation requires M 2 additional\nparameters (see Table 1). Larger number of parameters implies that a set of more general orthogonal\ntransformation that span larger number of rotations can be constructed. More rotations means\nthat inputs can be rotated in more ways and thereby more models can be packed with the same\nnumber of parameters. Results shown in Figure 3 left con\ufb01rm this intuition for networks of 256\nunits. 
Better performance of pspComplex as compared to pspBinary is not surprising because binary\nsuperposition is a special case of complex superposition (see Section 2.1). In the appendix, we show\nthese differences become negligible for larger networks.\nWhile the performance of pspRotation is the best among all superposition methods, this method\nis impractical because it amounts to adding the same number of additional parameters as required\nfor training a separate network for each task. pspComplex requires extension of neural networks to\ncomplex numbers and pspBinary is easiest to implement. To demonstrate the efficacy of our method,\nin the remainder of the paper we present most results with pspBinary with an understanding that\npspComplex can further improve performance.\n\nMethod | Avg. Accuracy (%)\nEWC [5]\u2217 | 97.0\nSI [22]\u2217 | 97.2\nStandard | 61.8\nBinary | 97.6\nComplex | 97.4\nOnePower | 97.2\n\nFigure 3: Left: Comparing the accuracy of various methods for PSP on the first task of the permuting\nMNIST challenge over training steps. After every 1000 steps, the input pixels are randomly permuted\n(i.e. new task) and therefore training on this newer task can lead to loss in performance on the\ninitial task due to catastrophic forgetting. The PSP method is robust to catastrophic forgetting with\npspRotation performing slightly better than pspComplex which in turn is better than pspBinary. This\nis expected as the number of additional parameters required per task in pspRotation > pspComplex >\npspBinary (see Table 1). Right: The average accuracy over the last 10 tasks on the permuting MNIST\nchallenge shows that the proposed PSP method outperforms previously published methods. \u2217Results\nfrom Figure 4 in Zenke et al. [22].\n\nComparison to previous methods: The table in Figure 3 compares the performance of our method\nwith two previous methods: EWC [5] and SI [22]. Following the metric used in these works, we\nreport the average accuracy on the last ten permuted MNIST tasks after the end of training on 50\ntasks. PSP outperforms previous methods.\n\n4.1.3 Continuous Domain Shift\n\nWhile permuting MNIST has been used by previous work for addressing catastrophic forgetting, the\nbig and discrete shift in input distribution between tasks is somewhat artificial. In the natural world,\ndistribution usually shifts slowly \u2013 for example day gradually becomes night and summer gradually\nbecomes winter. To simulate real-world-like continuous domain shift, we propose rotating-MNIST\nand rotating-FashionMNIST that are variants of the original MNIST and FashionMNIST [21] datasets.\nAt every time step, the input images are rotated in-plane by a small amount in the counter-clockwise\ndirection. A rotation of 360\u00b0 is completed after 1000 steps and the input distribution becomes similar\nto the initial distribution. Every 1000 steps one complete cycle of rotation is completed. Sample\nimages from the rotating datasets are shown in Figure 4a.\n\n(a) Rotating (MNIST and fashionMNIST) datasets. (b) Accuracy on fashionMNIST.\nFigure 4: (a) Samples of rotating-MNIST (top) and rotating-FashionMNIST (bottom) datasets. To\nmodel a continuously and smoothly changing data stream, at every training step (i.e. mini-batch\nshown by green box), the images are rotated by a small counter-clockwise rotation. Images rotate\nby 360 degrees over 1000 steps. (b) Test accuracy for 0 degrees rotation as a function of number\nof training steps. A regular neural network suffers from catastrophic forgetting. High accuracy is\nachieved after training on 0\u00b0 rotation and then the performance degrades. The proposed PSP method\nis robust to slow changes in data distribution when provided with the appropriate context.\n\nIt is to be expected that very small rotations will not lead to interference in learning. Therefore,\ninstead of choosing a separate context for every time step, we change the context after every 100 steps.\nThe 10 different context vectors used in the first cycle (1000 steps) are re-used in subsequent\ncycles of rotations. Figure 4b plots accuracy on a test data set of fashion MNIST with 0\u00b0 rotation\nover time. The oscillations in performance of the standard network correspond to 1000 training steps,\nwhich is the time required to complete one cycle of rotation. As the rotation of input images shifts\naway from 0\u00b0, due to catastrophic forgetting, the performance worsens and it improves as the cycle\nnears completion. The proposed PSP models are robust to changes in input distribution and closely\nfollow the same trends as on permuting MNIST. These results show that the efficacy of the PSP approach is\nnot specific to a particular dataset or the nature of change in the input distribution.\n\nFigure 5: Left: Closer comparison of each form of parameter superposition on the rotating-FashionMNIST\ntask at angle 0\u00b0. Right: Different context selection functions on the rotating-FashionMNIST\ntask at angle 0\u00b0.\n\nChoosing context parameters: Instead of using task identity to choose the context, it would be ideal\nif the context could be automatically chosen without this information. While a general solution to this\nproblem is beyond the scope of this paper, we investigate the effect of using looser information about\ntask identity on catastrophic forgetting. 
For this we constructed, pspFast a variant of pspComplex\nwhere the context is randomly changed at every time step for 1000 steps corresponding to one cycle\nof rotations. In the next cycle these contexts are re-used. In this scenario, instead of using detailed\ninformation about the task identity only coarse information about when the set of tasks repeat is used.\nAbsence of task identity requires storage of 1000 models in superposition, which is 100x times the\nnumber of models stored in previous scenarios. Figure 5 right shows that while pspFast is better than\nstandard model, it is worse in performance when more detailed task identity information was used.\nPotential reasons for worse performance are that each model in pspFast is trained with lesser amount\nof data (same data, but 100x models) and increased interference between models stored.\nAnother area of investigation is the scenario when detailed task information is not available, but\nsome properties about changes in data distribution are known. For example, in the rotating fashion\nMNIST task it is known that the distribution is changing slowly. In contrast to existing methods, one\nof the strengths of the PSP method is that it is possible to incorporate such knowledge in constructing\ncontext vectors. To demonstrate this, we constructed pspFastLocalMix, a variant of pspFast, where\nat every step we de\ufb01ne a context vector as a mixture of the phases of adjacent timepoints. Figure 5\nshows that pspFastLocalMix leads to better performance than pspFast. This provides evidence that it\nis indeed possible to incorporate coarse information about non-stationarity of input distribution.\n\n4.2 Output Interference\n\nLearning in neural networks can be adversely affected by changes in the output (e.g. label) distribution\nof the training data. For example, this occurs when transitioning from one classi\ufb01cation task to\nanother. 
The incremental CIFAR (iCIFAR) dataset [16, 22] (see Figure 6a) is a variant of the CIFAR\ndataset [6] where the first task is the standard CIFAR-10 dataset and subsequent tasks are formed by\ntaking disjoint subsets of 10 classes from the CIFAR-100 dataset.\n\n(a) iCIFAR task. (b) Performance comparison on iCIFAR.\nFigure 6: (a) Samples from the iCIFAR dataset. (b) Accuracy of ResNet-18 model on CIFAR-10\ntest dataset after training for 20K steps first on CIFAR-10 dataset, and then sequentially finetuning\non four disjoint sets of 10 classes from CIFAR-100 for 20K iterations each. The baseline standard\nand multihead models are critically affected by changes in output labels, whereas the PSP model with\nbinary superposition is virtually unaffected. These results show the efficacy of PSP in dealing with\ncatastrophic forgetting and easy scaling to state-of-the-art neural networks.\n\nTo show that our PSP method can be used with state-of-the-art neural networks, we used ResNet-18\nto first train on the CIFAR-10 dataset for 20K steps. Next, we trained the network for 20K steps on each of\nfour subsequent and disjoint sets of 10 classes chosen from the CIFAR-100 dataset. We report the\nperformance on the test set of the CIFAR-10 dataset. Unsurprisingly, the standard ResNet-18 suffers a\nbig loss in performance on CIFAR-10 after training on classes from CIFAR-100 (see standard-ResNet18\nin Figure 6b). This forms a rather weak baseline, because the output targets also change\nfor each task and thus reading out predictions for CIFAR-10 after training on other tasks is expected\nto be at chance performance. A much stronger baseline is when a new output layer is trained for\nevery task (but the rest of the network is re-used). 
This is because, it might be expected that for\nthese different tasks the same features might be adept but a different output readout is required.\nPerformance of a network trained in this fashion, multihead-ResNet18, in Figure 6b is signi\ufb01cantly\nbetter than standard-ResNet18.\nTo demonstrate the robustness of our approach, we train ResNet-18 with binary superposition on\niCIFAR using only a single output layer and avoiding the need for a network with multiple output\nheads in the process. The PSP network suffers surprisingly little degradation in accuracy despite\nsigni\ufb01cant output interference.\n\n5 Discussion\n\nWe have presented a fundamentally different way of diminishing catastrophic forgetting via storing\nmultiple parameters for multiple tasks in the same neural network via superposition. Our framework\ntreats neural network parameters as memory, from which task-speci\ufb01c model is retrieved using a\ncontext vector that depends on the task-identity. Our method works with both fully-connected nets\nand convolutional nets. It can be easily scaled to state-of-the-art neural networks like ResNet. Our\nmethod is robust to catastrophic forgetting caused due to both input and output interference and\noutperforms existing methods. An added advantage of our framework is that it can easily incorporate\ncoarse information about changes in task distribution and does not completely rely on task identity\n(see Section 4.1.3). Finally, we proposed the rotating MNIST and rotating fashion MNIST tasks to\nmimic slowly changing task distribution that is re\ufb02ective of the real world.\nWhile in this work we have demonstrated the utility of PSP method, a thorough analysis of how many\ndifferent models can be stored in superposition with each other will be very useful. This answer\nis likely to depend on the neural network architecture and the speci\ufb01c family of tasks. 
Another\nvery interesting avenue of investigation is to automatically and dynamically determine the context\nvector instead of relying on task-specific information. One fruitful direction is to make the context\ndifferentiable instead of using a fixed context.\n\nReferences\n[1] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Training pruned neural\n\nnetworks. arXiv preprint arXiv:1803.03635, 2018.\n\n[2] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical\ninvestigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint\narXiv:1312.6211, 2013.\n\n[3] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks\nwith pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149,\n2015.\n\n[4] Pentti Kanerva. Hyperdimensional computing: An introduction to computing in distributed\nrepresentation with high-dimensional random vectors. Cognitive Computation, 1(2):139\u2013159,\n2009.\n\n[5] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins,\nAndrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al.\nOvercoming catastrophic forgetting in neural networks. Proceedings of the national academy of\nsciences, page 201611835, 2017.\n\n[6] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.\n\nTechnical report, Citeseer, 2009.\n\n[7] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The mnist database of handwritten\n\ndigits. 1998.\n\n[8] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic\n\ndimension of objective landscapes. 
arXiv preprint arXiv:1804.08838, 2018.\n\n[9] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value\nof network pruning. arXiv preprint arXiv:1810.05270, 2018.\n\n[10] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to\nmultiple tasks by learning to mask weights. In Proceedings of the European Conference on\nComputer Vision (ECCV), pages 67\u201382, 2018.\n\n[11] Nicolas Y Masse, Gregory D Grant, and David J Freedman. Alleviating catastrophic forgetting\nusing context-dependent gating and synaptic stabilization. Proceedings of the National Academy\nof Sciences, 115(44):E10467\u2013E10475, 2018.\n\n[12] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks:\nThe sequential learning problem. In Psychology of learning and motivation, volume 24, pages\n109\u2013165. Elsevier, 1989.\n\n[13] Francesco Mezzadri. How to generate random matrices from the classical compact groups.\narXiv preprint math-ph/0609050, 2006.\n\n[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan\nWierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint\narXiv:1312.5602, 2013.\n\n[15] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning\nand forgetting functions. Psychological Review, 97(2):285, 1990.\n\n[16] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL:\nIncremental classifier and representation learning. In 2017 IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), pages 5533\u20135542. IEEE, 2017.\n\n[17] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick,\nKoray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. 
arXiv\npreprint arXiv:1606.04671, 2016.\n\n[18] Joan Serr\u00e0, D\u00eddac Sur\u00eds, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic\nforgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.\n\n[19] Alexander V. Terekhov, Guglielmo Montone, and J. Kevin O\u2019Regan. Knowledge transfer in\ndeep block-modular neural networks. In Living Machines 2015: Biomimetic and Biohybrid\nSystems, pages 268\u2013279, 2015.\n\n[20] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of\nneural networks using dropconnect. In International Conference on Machine Learning, pages\n1058\u20131066, 2013.\n\n[21] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for\nbenchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.\n\n[22] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic\nintelligence. arXiv preprint arXiv:1703.04200, 2017.\n\n[23] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\ndeep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.", "award": [], "sourceid": 5810, "authors": [{"given_name": "Brian", "family_name": "Cheung", "institution": "UC Berkeley"}, {"given_name": "Alexander", "family_name": "Terekhov", "institution": "Awecom, Inc"}, {"given_name": "Yubei", "family_name": "Chen", "institution": "Berkeley AI Research UC Berkeley"}, {"given_name": "Pulkit", "family_name": "Agrawal", "institution": "UC Berkeley"}, {"given_name": "Bruno", "family_name": "Olshausen", "institution": "Redwood Center/UC Berkeley"}]}