{"title": "Reinforced Continual Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 899, "page_last": 908, "abstract": "Most artificial intelligence models are limited in their ability to solve new tasks faster, without forgetting previously acquired knowledge. The recently emerging paradigm of continual learning aims to solve this issue, in which the model learns various tasks in a sequential fashion. In this work, a novel approach for continual learning is proposed, which searches for the best neural architecture for each coming task via sophisticatedly designed reinforcement learning strategies. We name it as Reinforced Continual Learning. Our method not only has good performance on preventing catastrophic forgetting but also fits new tasks well. The experiments on sequential classification tasks for variants of MNIST and CIFAR-100 datasets demonstrate that the proposed approach outperforms existing continual learning alternatives for deep networks.", "full_text": "Reinforced Continual Learning\n\nCenter for Data Science, Peking University\n\nJu Xu\n\nBeijing, China\n\nxuju@pku.edu.cn\n\nZhanxing Zhu (cid:3)\n\nCenter for Data Science, Peking University &\nBeijing Institute of Big Data Research (BIBDR)\n\nBeijing, China\n\nzhanxing.zhu@pku.edu.cn\n\nAbstract\n\nMost arti\ufb01cial intelligence models are limited in their ability to solve new tasks\nfaster, without forgetting previously acquired knowledge. The recently emerging\nparadigm of continual learning aims to solve this issue, in which the model learns\nvarious tasks in a sequential fashion. In this work, a novel approach for continual\nlearning is proposed, which searches for the best neural architecture for each com-\ning task via sophisticatedly designed reinforcement learning strategies. We name\nit as Reinforced Continual Learning. Our method not only has good performance\non preventing catastrophic forgetting but also \ufb01ts new tasks well. 
The experiments on sequential classification tasks for variants of the MNIST and CIFAR-100 datasets demonstrate that the proposed approach outperforms existing continual learning alternatives for deep networks.

1 Introduction

Continual learning, or lifelong learning [15], the ability to learn consecutive tasks without forgetting how to perform previously trained tasks, is an important topic for developing artificial intelligence. The primary goal of continual learning is to overcome the forgetting of learned tasks and to leverage the earlier knowledge to obtain better performance or faster convergence/training speed on newly coming tasks.

In the deep learning community, two groups of strategies have been developed to alleviate the problem of forgetting previously trained tasks, distinguished by whether the network architecture changes during learning.

The first category of approaches maintains a fixed network architecture with large capacity. When training the network for consecutive tasks, a regularization term is enforced to prevent the model parameters from deviating too much from the previously learned parameters according to their significance to old tasks [4, 19]. In [6], the authors proposed to incrementally match the moments of the posterior distributions of the networks trained on the first and the second task, respectively. Alternatively, an episodic memory [7] with a fixed budget stores subsets of the previous datasets, which are then used in training together with the new task. FearNet [3] mitigates catastrophic forgetting by consolidating recent memories into long-term storage using pseudorehearsal [10], which employs a generative autoencoder to generate previously learned examples that are replayed alongside novel information during consolidation. Fernando et al.
[2] proposed PathNet, in which a neural network has ten or twenty modules in each layer, and three or four modules per layer are picked for each task by an evolutionary approach. However, these methods typically require unnecessarily large-capacity networks, particularly when the number of tasks is large, since the network architecture is never dynamically adjusted during training.

* Corresponding author.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The other group of methods overcomes catastrophic forgetting by dynamically expanding the network to accommodate the newly coming task while keeping the parameters of the previous architecture unchanged. Progressive networks [11] expand the architecture by a fixed number of nodes or layers, leading to an extremely large network structure, particularly when faced with a large number of sequential tasks. The resulting complex architecture may be expensive to store and even unnecessary due to its high redundancy. Dynamically Expandable Network (DEN) [17] alleviates this issue slightly by introducing group sparsity regularization when adding new parameters to the original network; unfortunately, DEN involves many hyperparameters, including various regularization and thresholding ones, which need to be tuned carefully because the model performance is highly sensitive to them.

In this work, in order to better facilitate knowledge transfer and avoid catastrophic forgetting, we propose a novel framework to adaptively expand the network. Faced with a new task, deciding the optimal number of nodes/filters to add for each layer is posed as a combinatorial optimization problem. We provide a carefully designed reinforcement learning method to solve this problem, and hence name our approach Reinforced Continual Learning (RCL).
In RCL, a controller implemented as a recurrent neural network is adopted to determine the best architectural hyperparameters of the network for each task. We train the controller with an actor-critic strategy, guided by a reward signal derived from both the validation accuracy and the network complexity. This maintains the prediction accuracy on older tasks as much as possible while reducing the overall model complexity. To the best of our knowledge, this is the first attempt to employ reinforcement learning for solving continual learning problems.

RCL differs from adding a fixed number of units to the old network for solving a new task [11], which might be suboptimal and computationally expensive, and also from [17], which performs group sparsity regularization on the added parameters. We validate the effectiveness of RCL on various sequential tasks, and the results show that RCL obtains better performance than existing methods while adding far fewer units.

The rest of this paper is organized as follows. In Section 2, we introduce preliminary knowledge on reinforcement learning. In Section 3, we propose the new method RCL, a model that learns a sequence of tasks dynamically based on reinforcement learning. In Section 4, we conduct various experiments to demonstrate the superiority of RCL over other state-of-the-art methods. Finally, we conclude in Section 5 and provide some directions for future research.

2 Preliminaries of Reinforcement Learning

Reinforcement learning [13] deals with learning a policy for an agent interacting with an unknown environment. It has been applied successfully to various problems, such as games [8, 12], natural language processing [18], neural architecture/optimizer search [20, 1], and so on.
At each step, an agent observes the current state $s_t$ of the environment, decides on an action $a_t$ according to a policy $\pi(a_t|s_t)$, and observes a reward signal $r_{t+1}$. The goal of the agent is to find a policy that maximizes the expected sum of discounted rewards $R_t = \sum_{t'=t+1}^{\infty} \gamma^{t'-t-1} r_{t'}$, where $\gamma \in (0, 1]$ is a discount factor that determines the importance of future rewards. The value function of a policy $\pi$ is defined as the expected return $V_\pi(s) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s]$, and its action-value function as $Q_\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a]$.

Policy gradient methods address the problem of finding a good policy by performing stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies $\pi_\theta(a|s)$ with parameters $\theta$. The policy gradient theorem [14] provides expressions for the gradient of the average-reward and discounted-reward objectives with respect to $\theta$. In the discounted setting, the objective is defined with respect to a designated start state (or distribution) $s_0$: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0]$. The policy gradient theorem shows that

$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s|s_0) \sum_a \frac{\partial \pi_\theta(a|s)}{\partial \theta} Q_{\pi_\theta}(s, a), \qquad (1)$$

where $\mu_{\pi_\theta}(s|s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$.

3 Our Proposal: Reinforced Continual Learning

In this section, we elaborate on the new framework for continual learning, Reinforced Continual Learning (RCL). RCL consists of three networks: the controller, the value network, and the task network.
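The controller and value network introduced below are trained with the policy-gradient machinery of Section 2. As a minimal, illustrative sketch in plain Python (not the authors' implementation), the discounted return $R_t$ can be computed from a reward sequence by a backward recursion:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_{t'=t+1}^{T} gamma^(t'-t-1) * r_{t'} for every t,
    given the rewards r_1..r_T (rewards[k] holds r_{k+1})."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the end: R_t = r_{t+1} + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

In RCL each task effectively yields a one-step episode (one architecture decision followed by one reward), so the discount plays a minor role; the sketch is only meant to fix the notation.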
The controller is implemented as a Long Short-Term Memory (LSTM) network for generating policies, i.e., for determining how many filters or nodes to add for each task. We design the value network as a fully connected network that approximates the value of the state. The task network can be any network of interest for solving a particular task, such as image classification or object detection. In this paper we use a convolutional neural network (CNN) as the task network to demonstrate how RCL adaptively expands it to prevent forgetting, though our method applies not only to convolutional networks but also to fully connected networks.

3.1 The Controller

Figure 1(a) visually shows how RCL expands the network when a new task arrives. After the learning process of task $t-1$ finishes and task $t$ arrives, we use a controller to decide how many filters or nodes should be added to each layer. In order to prevent semantic drift, we withhold modification of the network weights for previous tasks and only train the newly added filters. After we have trained the model for task $t$, we timestamp each newly added filter by recording the shape of every layer. At inference time, task $t$ only employs the parameters introduced at stage $t$, ignoring the filters added for later tasks, which prevents the semantic drift they would otherwise cause.

Suppose the task network has $m$ layers. When faced with a newly coming task, for each layer $i$ we specify the number of filters to add in the range between $0$ and $n_i - 1$. A straightforward idea for obtaining the optimal configuration of added filters for the $m$ layers is to traverse all possible combinations of actions.
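To get a feel for the size of this brute-force search, consider a toy count (the layer sizes here are hypothetical, not the paper's settings):

```python
from math import prod

# Action space: layer i can add 0 .. n_i - 1 filters, so a full
# enumeration visits prod(n_i) candidate architectures.
action_sizes = [10, 10, 10, 10, 10]  # hypothetical n_i for a 5-layer network
print(prod(action_sizes))  # 100000 configurations
```

Each configuration would require training and evaluating a candidate child network, so exhaustive enumeration is infeasible even at this modest depth.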
However, for an $m$-layer network, the cost of finding the best action combination by exhaustive search is $O(\prod_{i=1}^{m} n_i)$, which grows exponentially with the depth $m$ and is unacceptable for very deep architectures such as VGG and ResNet.

To deal with this issue, we treat a series of actions as a fixed-length string, and use a controller to generate such a string representing how many filters should be added in each layer. Since there is a recurrent relationship between consecutive layers, the controller is naturally designed as an LSTM network. At the first step, the controller network receives an empty embedding as input (i.e., the state $s$) for the current task, which is kept fixed during training. For each task $t$, we equip the network with a softmax output $p_{t,i} \in \mathbb{R}^{n_i}$ representing the probabilities of sampling each action for layer $i$, i.e., each possible number of filters to be added. We design the LSTM in an autoregressive manner: as Figure 1(b) shows, the probability $p_{t,i}$ of the previous step is fed as input into the next step. This process is repeated until we obtain the actions and probabilities for all $m$ layers, and the policy probability of the sequence of actions $a_{1:m}$ follows the product rule,

$$\pi(a_{1:m} \mid s; \theta_c) = \prod_{i=1}^{m} p_{t,i,a_i}, \qquad (2)$$

where $\theta_c$ denotes the parameters of the controller network.

3.2 The Task Network

We deal with $T$ tasks arriving in a sequential manner, with training dataset $D_t = \{x_i, y_i\}_{i=1}^{N_t}$, validation dataset $V_t = \{x_i, y_i\}_{i=1}^{M_t}$, and test dataset $T_t = \{x_i, y_i\}_{i=1}^{K_t}$ at time $t$. For the first task, we train a basic task network that performs well enough by solving a standard supervised learning problem,

$$\min_{W_1} L_1(W_1; D_1). \qquad (3)$$

We define the well-trained parameters as $W^a_t$ for task $t$. When the $t$-th task arrives, we already know the best parameters $W^a_{t-1}$ for task $t-1$. Now we use the controller to decide how many filters should be added to each layer, and then obtain an expanded child network whose parameters to be learned are denoted as $W_t$ (including $W^a_{t-1}$). The training procedure for the new task keeps $W^a_{t-1}$ fixed and only back-propagates through the newly added parameters $W_t \setminus W^a_{t-1}$. Thus, the optimization problem for the new task is

$$\min_{W_t \setminus W^a_{t-1}} L_t(W_t; D_t). \qquad (4)$$

Figure 1: (a) RCL adaptively expands each layer of the network when the $t$-th task arrives. (b) The controller, implemented as an RNN, determines how many filters to add for the new task.

We use stochastic gradient descent with learning rate $\eta$ to learn the newly added filters,

$$W_t \setminus W^a_{t-1} \leftarrow W_t \setminus W^a_{t-1} - \eta \nabla_{W_t \setminus W^a_{t-1}} L_t. \qquad (5)$$

The expanded child network is trained until the required number of epochs or convergence is reached. We then evaluate the child network on the validation dataset $V_t$, which returns the corresponding accuracy $A_t$. The parameters of the expanded network achieving the maximal reward (described in Section 3.3) become the optimal ones for task $t$, and we store them for later tasks.

3.3 Reward Design

In order to help the controller generate better actions over time, we need to design a reward function that reflects the quality of its actions. Considering both the validation accuracy and the complexity of the expanded network, we design the reward for task $t$ as the combination of the two terms,

$$R_t = A_t(S_t, a_{1:m}) + \alpha C_t(S_t, a_{1:m}), \qquad (6)$$

where $A_t$ represents the validation accuracy on $V_t$, the network complexity is $C_t = -\sum_{i=1}^{m} k_i$ with $k_i$ the number of filters added in layer $i$, and $\alpha$ is a parameter balancing prediction performance against model complexity. Since $R_t$ is non-differentiable, we use policy gradient to update the controller, as described in the following section.

3.4 Training Procedures

The controller's prediction can be viewed as a list of actions $a_{1:m}$, i.e., the numbers of filters added in the $m$ layers, which designs a new architecture for a child network that is then trained on the new task. At convergence, this child network achieves an accuracy $A_t$ on the validation dataset and a model complexity $C_t$, from which we obtain the reward $R_t$ defined in Eq. (6). We can use this reward and reinforcement learning to train the controller.

To find the optimal incremental architecture for the new task $t$, the controller aims to maximize its expected reward,

$$J(\theta_c) = V_{\theta_c}(s_t), \qquad (7)$$

where $V_{\theta_c}$ is the true value function. In order to accelerate policy gradient training over $\theta_c$, we use actor-critic methods with a value network parameterized by $\theta_v$ to approximate the state value $V(s_t; \theta_v)$.
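As a concrete illustration of the reward in Eq. (6), consider the following small sketch (the accuracy values and the choice $\alpha = 10^{-4}$ are made up for the example, not taken from the paper's experiments):

```python
def reward(val_accuracy, filters_added, alpha=1e-4):
    """R_t = A_t + alpha * C_t with C_t = -sum_i k_i (Eq. 6).
    Larger expansions are penalized; alpha trades accuracy against size."""
    complexity = -sum(filters_added)
    return val_accuracy + alpha * complexity

# Two candidate expansions reaching the same validation accuracy:
print(reward(0.95, [8, 16, 32]))  # ~0.9444
print(reward(0.95, [2, 4, 8]))    # ~0.9486 -> smaller expansion wins
```

At equal accuracy the reward prefers the smaller expansion, which is exactly the pressure that keeps the overall model compact.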
The REINFORCE algorithm [16] can be used to learn $\theta_c$:

$$\nabla_{\theta_c} J(\theta_c) = \mathbb{E}\left[\sum_{a_{1:m}} \pi(a_{1:m} \mid s_t; \theta_c)\, \frac{\nabla_{\theta_c} \pi(a_{1:m} \mid s_t; \theta_c)}{\pi(a_{1:m} \mid s_t; \theta_c)} \big(R(s_t, a_{1:m}) - V(s_t; \theta_v)\big)\right]. \qquad (8)$$

Algorithm 1 RCL for Continual Learning
1: Input: a sequence of datasets $D = \{D_1, D_2, \ldots, D_T\}$
2: Output: $W^a_T$
3: for $t = 1, \ldots, T$ do
4:   if $t = 1$ then
5:     Train the base network using Eq. (3) on the first dataset $D_1$ and obtain $W^a_1$.
6:   else
7:     Expand the network by Algorithm 2 and obtain the trained $W^a_t$.
8:   end if
9: end for

Algorithm 2 Routine for Network Expansion
1: Input: current dataset $D_t$; previous parameters $W^a_{t-1}$; the size of the action space for each layer, $n_i$, $i = 1, \ldots, m$; number of epochs for training the controller and value network, $T_e$.
2: Output: network parameters $W^a_t$
3: for $i = 1, \ldots, T_e$ do
4:   Generate actions $a_{1:m}$ by the controller's policy;
5:   Generate $W^{(i)}_t$ by expanding the parameters $W^a_{t-1}$ according to $a_{1:m}$;
6:   Train the expanded network using Eq. (5) to obtain $W^{(i)}_t$;
7:   Evaluate the gradients of the controller and value network by Eq. (9) and Eq. (10),
     $\theta_c \leftarrow \theta_c + \eta_c \nabla_{\theta_c} J(\theta_c)$, $\quad \theta_v \leftarrow \theta_v - \eta_v \nabla_{\theta_v} L_v(\theta_v)$.
8: end for
9: Return the best network parameter configuration, $W^a_t = \operatorname{argmax}_{W^{(i)}_t} R_t(W^{(i)}_t)$.

A Monte Carlo approximation of the above quantity is

$$\nabla_{\theta_c} J(\theta_c) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta_c} \log \pi(a^{(i)}_{1:m} \mid s_t; \theta_c) \big(R(s_t, a^{(i)}_{1:m}) - V(s_t; \theta_v)\big), \qquad (9)$$

where $N$ is the batch size.
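Equation (9) is straightforward to implement for a softmax controller. Below is a minimal NumPy sketch (shapes and names are ours, not the paper's code): for a categorical distribution with logits $z$, $\nabla_z \log \pi(a) = \mathrm{onehot}(a) - \mathrm{softmax}(z)$, so the estimate averages advantage-weighted score vectors over $N$ sampled action strings.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(logits, actions, rewards, baseline):
    """Monte Carlo estimate of Eq. (9) w.r.t. per-layer logits.

    logits:   (m, n) array, one softmax distribution per layer
    actions:  (N, m) sampled filter counts a^{(i)}_{1:m}
    rewards:  (N,)   rewards R(s_t, a^{(i)}_{1:m})
    baseline: scalar value-network output V(s_t; theta_v)
    Returns an (m, n) gradient estimate (ascent direction).
    """
    N = len(actions)
    probs = np.apply_along_axis(softmax, 1, logits)  # (m, n)
    grad = np.zeros_like(logits)
    for i in range(N):
        adv = rewards[i] - baseline
        for layer, a in enumerate(actions[i]):
            score = -probs[layer]        # grad of log-prob: onehot(a) - softmax
            score[a] += 1.0
            grad[layer] += adv * score
    return grad / N

rng = np.random.default_rng(0)
logits = np.zeros((3, 4))                  # m=3 layers, n=4 actions per layer
actions = rng.integers(0, 4, size=(8, 3))  # N=8 sampled action strings
rewards = rng.random(8)
g = reinforce_grad(logits, actions, rewards, rewards.mean())
print(g.shape)  # (3, 4)
```

With uniform logits, an ascent step along this estimate shifts probability mass toward the actions whose reward exceeded the baseline, which is the mechanism by which the controller improves over the $T_e$ iterations of Algorithm 2.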
For the value network, we use a gradient-based method to update $\theta_v$ by minimizing the squared error between the predicted value and the observed rewards,

$$L_v = \frac{1}{N} \sum_{i=1}^{N} \big(V(s_t; \theta_v) - R(s_t, a^{(i)}_{1:m})\big)^2, \qquad \nabla_{\theta_v} L_v = \frac{2}{N} \sum_{i=1}^{N} \big(V(s_t; \theta_v) - R(s_t, a^{(i)}_{1:m})\big) \frac{\partial V(s_t; \theta_v)}{\partial \theta_v}. \qquad (10)$$

Finally, we summarize our RCL approach for continual learning in Algorithm 1; the subroutine for network expansion is described in Algorithm 2.

3.5 Comparison with Other Approaches

As a new framework for network expansion to achieve continual learning, RCL is distinguished from progressive networks [11] and DEN [17] in the following aspects.

- Compared with DEN, instead of performing selective retraining and network splitting, RCL keeps the learned parameters for previous tasks fixed and only updates the added parameters. Through this training strategy, RCL completely prevents catastrophic forgetting, since the parameters for the corresponding tasks are frozen.

- Progressive neural networks expand the architecture by a fixed number of units or filters. To obtain a satisfying model accuracy when the number of sequential tasks is large, the final complexity of progressive nets has to be extremely high. This directly leads to a high computational burden in both training and inference, and even makes storing the entire model difficult. To handle this issue, both RCL and DEN dynamically adjust the network to reach a more economical architecture.

- While DEN achieves an expandable network through sparse regularization, RCL adaptively expands the network by reinforcement learning. The performance of DEN is quite sensitive to various hyperparameters, including regularization parameters and thresholding coefficients.
RCL largely reduces the number of hyperparameters, boiling down to balancing only the validation accuracy and the model complexity in the designed reward function. Through the experiments in Section 4, we demonstrate that RCL achieves more stable results, and that better model performance can be obtained simultaneously with far fewer neurons than DEN.

4 Experiments

We perform a variety of experiments to assess the performance of RCL in continual learning. We report the accuracy, the model complexity, and the training time for RCL and the state-of-the-art baselines. All experiments were implemented in the TensorFlow framework on a Tesla K80 GPU.

Datasets. (1) MNIST Permutations [4]: ten variants of the MNIST data, where each task is transformed by a fixed permutation of pixels; the samples from different tasks are therefore not independent and identically distributed. (2) MNIST Mix: five MNIST permutations $(P_1, \ldots, P_5)$ and five rotated variants of the MNIST dataset $(R_1, \ldots, R_5)$, each of the latter containing digits rotated by a fixed angle between 0 and 180 degrees; the tasks are arranged in the order $P_1, R_1, P_2, \ldots, P_5, R_5$. (3) Incremental CIFAR-100 [9]: different from the original CIFAR-100, each task introduces a new set of classes; for a total number of tasks $T$, each new task contains a disjoint subset of $100/T$ classes. In this dataset, the distribution of the input is similar across tasks, but the distribution of the output differs.

For all of the above datasets, we set the number of tasks to $T = 10$. For the MNIST datasets, each task contains 60,000 training examples and 10,000 test examples from 10 different classes. For the CIFAR-100 dataset, each task contains 5,000 training examples and 1,000 test examples from 10 different classes. The model observes the tasks one by one, and once a task has been
The model observes the tasks one by one, and once the task had been\nobserved, the task will not be observed later during the training.\n\nBaselines\n(1) SN, a single network trained across all tasks; (2) EWC, deep network trained with\nelastic weight consolidation [4] for regularization; (3) GEM, gradient episodic memory [7]; (4)\nPGN, progressive neural network proposed in [11]; (5) DEN, dynamically expandable network [17].\n\nBase network settings\n(1) Fully connected networks for MNIST Permutations and MNIST Mix\ndatasets. We use a three-layer network with 784-312-128-10 neurons with RELU activations; (2)\nLeNet is used for Incremental CIFAR-100. LeNet has two convolutional layers and three fully-\nconnected layers, the detailed structure of LeNet can be found in [5].\n\n4.1 Results\n\nWe evaluate each compared approach by considering average test accuracy on all the tasks, model\ncomplexity and training time. Model complexity is measured via the number of model parameters\nafter training all the tasks. We \ufb01rst report the test accuracy and model complexity of baselines and\nour proposed RCL for the three datasets in Figure 2.\n\nComparison between \ufb01xed-size and expandable networks. From Figure 2, we can easily ob-\nserve that the approaches with \ufb01xed-size network architectures, such as IN, EWC and GEM, own\nlow model complexity, but their prediction accuracy is much worse than those methods with expand-\nable networks, including PGN, DEN and RCL. This shows that dynamically expanding networks\ncan indeed contribute to the model performance by a large margin.\n\nComparison between PGN, DEN and RCL. Regarding to the expandable networks, RCL out-\nperforms PGN and DEN on both test accuracy and model complexity. Particularly, RCL achieves\n\n6\n\n\fFigure 2: Top: Average test accuracy for all the datasets. Bottom: The number of parameters for\ndifferent methods.\n\nFigure 3: Average test accuracy v.s. 
model complexity for RCL, DEN and PGN.\n\nsigni\ufb01cant reduction on the number of parameters compared with PGN and DEN, e.g. for incremen-\ntal Cifar100 data, 42% and 53% parameter reduction, respectively.\nTo further see the difference of the three methods, we vary the hyperparameters settings and train the\nnetworks accordingly, and obtain how test accuracy changes with respect to the number of parame-\nters, as shown in Figure 3. We can clearly observe that RCL can achieve signi\ufb01cant model reduction\nwith the same test accuracy as that of PGN and DEN, and accuracy improvement with same size of\nnetworks. This demonstrates the bene\ufb01ts of employing reinforcement learning to adaptively control\nthe complexity of the entire model architecture.\n\nComparison between RCL and Random Search. We compare our policy gradient controller and\nrandom search controller on different datasets. In every experiment setup, hyper-parameters are the\nsame except the controller (random search controller v.s. policy gradient controller). We run each\nexperiment for four times. We found that random search achieves more than 0.1% less accuracy and\nalmost the same number of parameters on these three datasets compared with policy gradient. 
We\n\nFigure 4: Test accuracy on the \ufb01rst task as more tasks are learned.\n\n7\n\n$\u001f\u001c\u0003\u001a\u0002\u001c\u001e!\u0002\u001f\u001b\u001c\u001f#\u001a\u00020.00.20.40.60.81.01.29089\u0003,..:7,.\u00050.4800.4220.1570.8160.6720.3210.9200.8840.4980.9600.9630.5640.9660.9660.5810.9660.9660.599\u0018;07,\u00040\u00039089\u0003,..:7,.\u0005\u0003,.7488\u0003,\u0004\u0004\u00039,8\u00048\u001e\u001f\u0002$%\u00035072:9,9\u0004438\u001e\u001f\u0002$%\u00032\u0004\u0005\u001a\u0002\u001d\u0018#\n\u000e\r\r$\u001f\u001c\u0003\u001a\u0002\u001c\u001e!\u0002\u001f\u001b\u001c\u001f#\u001a\u0002012345675,7,2090781e5\u001f:2-07\u000341\u00035,7,209078\u001e\u001f\u0002$%\u00035072:9,9\u0004438\u001e\u001f\u0002$%\u00032\u0004\u0005\u001a\u0002\u001d\u0018#\n\u000e\r\r345675,7,2090781e50.8000.8250.8500.8750.9000.9250.9500.9759089\u0003,..:7,.\u0005\u001e\u001f\u0002$%\u00035072:9,9\u0004438#\u001a\u0002\u001b\u001c\u001f!\u0002\u001f345675,7,2090781e50.700.750.800.850.900.959089\u0003,..:7,.\u0005\u001e\u001f\u0002$%\u00032\u0004\u0005#\u001a\u0002\u001b\u001c\u001f!\u0002\u001f1.01.52.02.53.03.55,7,2090781e50.500.520.540.560.589089\u0003,..:7,.\u0005\u001a\u0002\u001d\u0018#\n\u000e\r\r#\u001a\u0002\u001b\u001c\u001f!\u0002\u001f2468100.20.40.60.81.09089\u0003,..:7,.\u0005\u001e\u001f\u0002$%\u00035072:9,9\u00044382468100.20.40.60.81.09089\u0003,..:7,.\u0005\u001e\u001f\u0002$%\u00032\u0004\u0005\u0002\u001c\u001e\u001c\u0003\u001a$\u001f!\u0002\u001f\u001b\u001c\u001f#\u001a\u000202468100.10.20.30.40.50.69089\u0003,..:7,.\u0005\u001a\u0002\u001d\u0018#\n\u000e\r\r\fnote that random search performs surprisingly well, which we attribute to the representation power\nof our reward design. This demonstrates that our well-constructed reward strikes a balance between\naccuracy and model complexity very effectively.\n\nEvaluating the forgetting behavior. Figure 4 shows the evolution of the test accuracy on the \ufb01rst\ntask as more tasks are learned. 
RCL and PGN exhibit no forgetting, while the approaches that do not expand the network suffer from catastrophic forgetting. Moreover, DEN cannot completely prevent forgetting, since it retrains the previous parameters when learning new tasks.

Training time. We report the wall-clock training time for each compared method in Table 1. Since RCL is based on reinforcement learning, a large number of trials is typically required, which leads to more training time than the other methods. Improving the training efficiency of reinforcement learning is still an open problem, and we leave it as future work.

Table 1: Training time (in seconds) of the experiments for all methods.

Methods              SN    EWC   GEM    DEN    PGN   RCL
MNIST permutations   173   1319  1628   21686  451   34583
MNIST mix            170   1342  19690  7550   452   23626
CIFAR-100            149   1661  508    1428   167   3936

Balance between test accuracy and model complexity. We control the trade-off between model performance and complexity through the coefficient $\alpha$ in the reward function (6). Figure 5 shows how varying $\alpha$ affects the test accuracy and the number of model parameters. As expected, with increasing $\alpha$ the model complexity drops significantly while the model performance also deteriorates gradually. Interestingly, when $\alpha$ is small, the accuracy drops much more slowly than the number of parameters decreases. This observation can help to choose a suitable $\alpha$ such that a medium-sized network still achieves relatively good performance.

Figure 5: Experiments on the influence of the parameter $\alpha$ in the reward design.

5 Conclusion

We propose a novel framework for continual learning, Reinforced Continual Learning. Our method searches for the best neural architecture for each coming task via reinforcement learning, which increases the capacity of the network when necessary and effectively prevents semantic drift.
We implement both fully connected and convolutional neural networks as task networks and validate them on different datasets. The experiments demonstrate that our proposal significantly outperforms the existing baselines on both prediction accuracy and model complexity.

As future work, two directions are worth considering. First, we will develop new strategies for RCL to facilitate backward transfer, i.e., improving the performance on previous tasks by learning new tasks. Second, reducing the training time of RCL is particularly important for large networks with more layers.

Acknowledgments

Supported by the National Natural Science Foundation of China (Grant No. 61806009) and the Beijing Natural Science Foundation (Grant No. 4184090).

References

[1] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning (ICML), pages 459–468, 2017.

[2] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

[3] Ronald Kemker and Christopher Kanan. FearNet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563, 2017.

[4] James Kirkpatrick, Razvan Pascanu, Neil C.
Rabinowitz, Joel Veness, Guillaume Desjardins,\nAndrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska,\nDemis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas-\ntrophic forgetting in neural networks. Proceedings of the National Academy of Sciences,\n114(13):3521\u20133526, 2017.\n\n[5] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\n\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[6] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcom-\ning catastrophic forgetting by incremental moment matching. In Advances in Neural Informa-\ntion Processing Systems, pages 4655\u20134665, 2017.\n\n[7] David Lopez-Paz and Marc\u2019Aurelio Ranzato. Gradient episodic memory for continual learning.\n\nIn NIPS, 2017.\n\n[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.\nBellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Pe-\ntersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran,\nDaan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep rein-\nforcement learning. Nature, 518(7540):529\u2013533, 2015.\n\n[9] Sylvestre-Alvise Rebuf\ufb01, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl:\nIncremental classi\ufb01er and representation learning. In CVPR, pages 5533\u20135542. IEEE Com-\nputer Society, 2017.\n\n[10] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science,\n\n7(2):123\u2013146, 1995.\n\n[11] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick,\nKoray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. 
arXiv\npreprint arXiv:1606.04671, 2016.\n\n[12] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur\nGuez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game\nof go without human knowledge. Nature, 550(7676):354, 2017.\n\n[13] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Cam-\n\nbridge: MIT press, 1998.\n\n[14] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gra-\ndient methods for reinforcement learning with function approximation. In Advances in Neural\nInformation Processing Systems, pages 1057\u20131063, 1999.\n\n9\n\n\f[15] Sebastian Thrun. A lifelong learning perspective for mobile robot control. In International\n\nConference on Intelligent Robots and Systems, 1995.\n\n[16] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist rein-\n\nforcement learning. Machine Learning, 8:229\u2013256, 1992.\n\n[17] J. Yoon and E. Yang. Lifelong learning with dynamically expandable networks. arXiv preprint\n\narXiv:1708.01547, 2017.\n\n[18] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial\n\nnets with policy gradient. In AAAI, pages 2852\u20132858, 2017.\n\n[19] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intel-\n\nligence. In International Conference on Machine Learning (ICML), 2017.\n\n[20] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv\n\npreprint arXiv:1611.01578, 2016.\n\n10\n\n\f", "award": [], "sourceid": 500, "authors": [{"given_name": "Ju", "family_name": "Xu", "institution": "Peking University"}, {"given_name": "Zhanxing", "family_name": "Zhu", "institution": "Peking University"}]}