{"title": "Transfer Learning with Neural AutoML", "book": "Advances in Neural Information Processing Systems", "page_first": 8356, "page_last": 8365, "abstract": "We reduce the computational cost of Neural AutoML with transfer learning. AutoML relieves human effort by automating the design of ML algorithms. Neural AutoML has become popular for the design of deep learning architectures, however, this method has a high computation cost. To address this we propose Transfer Neural AutoML that uses knowledge from prior tasks to speed up network design. We extend RL-based architecture search methods to support parallel training on multiple tasks and then transfer the search strategy to new tasks.\nOn language and image classification data, Transfer Neural AutoML reduces convergence time over single-task training by over an order of magnitude on many tasks.", "full_text": "Transfer Learning with Neural AutoML\n\nCatherine Wong\n\nMIT\n\ncatwong@mit.edu\n\nNeil Houlsby\nGoogle Brain\n\nneilhoulsby@google.com\n\nYifeng Lu\nGoogle Brain\n\nyifenglu@google.com\n\nAbstract\n\nAndrea Gesmundo\n\nGoogle Brain\n\nagesmundo@google.com\n\nWe reduce the computational cost of Neural AutoML with transfer learning. Au-\ntoML relieves human effort by automating the design of ML algorithms. Neural\nAutoML has become popular for the design of deep learning architectures, however,\nthis method has a high computation cost. To address this we propose Transfer\nNeural AutoML that uses knowledge from prior tasks to speed up network design.\nWe extend RL-based architecture search methods to support parallel training on\nmultiple tasks and then transfer the search strategy to new tasks. 
On language and image classification tasks, Transfer Neural AutoML reduces convergence time over single-task training by over an order of magnitude on many tasks.

1 Introduction

Automatic Machine Learning (AutoML) aims to find the best performing learning algorithms with minimal human intervention. Many AutoML methods exist, including random search [1], performance modelling [2, 3], Bayesian optimization [4], genetic algorithms [5, 6] and RL [7, 8]. We focus on neural AutoML, which uses deep RL to optimize architectures. These methods have shown promising results. For example, Neural Architecture Search has discovered novel networks that rival the best human-designed architectures on challenging image classification tasks [9, 10].

However, neural AutoML is expensive because it requires training many networks. This may require vast computational resources; Zoph and Le [8] report 800 concurrent GPUs to train on Cifar-10. Further, training needs to be repeated for every new task. Some methods have been proposed to address this cost, such as using a progressive search space [11] or sharing weights among generated networks [12, 13]. We propose a complementary solution, applicable when one has multiple ML tasks to solve. Humans can tune networks based on knowledge gained from prior tasks. We aim to leverage the same information using transfer learning.

We exploit the fact that deep RL-based AutoML algorithms learn an explicit parameterization of the distribution over performant models. We present Transfer Neural AutoML, a method to accelerate network design on new tasks based on priors learned on previous tasks. To do this we design a network that performs neural AutoML on multiple tasks simultaneously. Our method for multitask neural AutoML learns both hyperparameter choices common to multiple tasks and specific choices for individual tasks.
We then transfer this controller to new tasks and leverage the learned priors over performant models. We reduce the time to converge in both text and image domains by over an order of magnitude in most tasks. In our experiments we save tens of CPU hours for every task that we transfer to.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Methods

2.1 Neural Architecture Search

Transfer Neural AutoML is based on Neural Architecture Search (NAS) [8]. NAS uses deep RL to generate models that maximize performance on a given task. The framework consists of two components: a controller model and child models.

The controller is an RNN that generates a sequence of discrete actions. Each action specifies a design choice; for example, if the child models are CNNs, these choices could include the filter heights, widths, and strides. The controller is an autoregressive model, like a language model: the action taken at each time step is fed into the RNN as input for the next time step. The recurrent state of the RNN maintains a history of the design choices taken so far. The use of an RNN allows dependencies between the design choices to be learned. The sequence of design choices defines a child model that is trained and evaluated on the ML task at hand. The performance of the child network on the validation set is used as a reward to update the controller via a policy gradient algorithm.

2.2 Multitask Training

We propose Multitask Neural AutoML, which searches for models on multiple tasks simultaneously. It requires defining a generic search space that is shared across tasks. Many deep learning models require the same common design decisions, such as choice of network depth, learning rate, and number of training iterations.
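A generic search space of this kind can be sketched as a mapping from shared design decisions to candidate values. The decision names and value lists below are illustrative, not the paper's exact specification:

```python
# Hypothetical generic search space shared across tasks. Each entry is a
# design decision the controller selects one value for; the concrete choices
# in the paper differ (see its Appendix).
SEARCH_SPACE = {
    "num_hidden_layers": [1, 2, 3],
    "hidden_layer_size": [64, 128, 256],
    "activation": ["relu", "swish", "tanh"],
    "dropout_rate": [0.0, 0.2, 0.3, 0.5],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "training_steps": [1000, 5000, 10000],
}

def num_configurations(space):
    """Total number of child-model configurations in the search space."""
    total = 1
    for values in space.values():
        total *= len(values)
    return total
```

The controller picks one value per decision, so the number of candidate child models is the product of the per-decision choice counts; with enough decisions this product easily exceeds any feasible number of trials.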
By defining a generic search space that contains common architecture and hyperparameter choices, the controller can generate a wide range of models applicable to many common problems. Multitask training allows the controller to learn a broadly applicable prior over the search space by observing shared behaviour across tasks. The proposed multitask controller has two key features: learned task representations, and advantage normalization.

Learned task representations  The multitask AutoML controller characterizes the tasks by learning a unique embedding vector for each task. This task embedding allows the controller to condition model generation on the task ID. The task embeddings are analogous to the word embeddings commonly used in NLP, where each word is associated with a trainable vector [14].

Figure 1 (left) shows the architecture of the multitask controller at each time step. The task embedding is fed into the RNN at every time step. In standard single-task training of NAS, only the embedding of the previous action is fed into the RNN. In multitask training, the task embedding is concatenated with the action embedding. We also add a skip connection across the RNN cell to ease the learning of action marginal distributions. The task embeddings are the only task-specific parameters. One embedding is assigned to each task; these are randomly initialized and trained jointly with the controller.

At each iteration of multitask training, a task is sampled at random. This task's embedding is fed to the controller, which generates a sequence of actions conditioned on this embedding. The child model defined by these actions is trained and evaluated on the task, and the reward is used to update the task-agnostic parameters and the corresponding task embedding.

Task-specific advantage normalization  We train the controller using policy gradient. Each task defines a different performance metric, which we use as the reward.
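One timestep of the multitask controller described above can be sketched as follows. A single vanilla RNN cell stands in for the paper's 2-layer LSTM, the skip connection across the cell is omitted for brevity, and all weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the paper uses a 2-layer LSTM with 50 units and
# 25-dimensional action/task embeddings.
EMB, HIDDEN, NUM_ACTIONS = 25, 50, 4

task_embeddings = rng.normal(size=(8, EMB))            # one row per task
action_embeddings = rng.normal(size=(NUM_ACTIONS, EMB))
W_in = rng.normal(size=(2 * EMB, HIDDEN)) * 0.1        # input -> hidden
W_h = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1          # recurrent weights
W_out = rng.normal(size=(HIDDEN, NUM_ACTIONS)) * 0.1   # hidden -> logits

def controller_step(task_id, prev_action, state):
    """One timestep: concatenate the task embedding with the previous
    action's embedding, update the recurrent state, sample the next action."""
    x = np.concatenate([task_embeddings[task_id],
                        action_embeddings[prev_action]])
    state = np.tanh(x @ W_in + state @ W_h)
    logits = state @ W_out
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    action = rng.choice(NUM_ACTIONS, p=probs)
    return action, state

# Sample a sequence of design choices conditioned on task 3's embedding.
state, action = np.zeros(HIDDEN), 0
choices = []
for _ in range(6):
    action, state = controller_step(3, action, state)
    choices.append(int(action))
```

Only `task_embeddings` is task-specific; every other parameter is shared across tasks, which is what lets the transferred controller reuse its prior on a new task.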
The reward affects the amplitude of the gradients applied to update the controller's policy, π. To maintain balanced gradient updates across tasks, we ensure that the distribution of each task's rewards is scaled to have the same mean and variance.

The mean of each task's reward distribution is centered on zero by subtracting the expected reward for the given task. The centered reward, or advantage, Aτ(m), of a model, m, applied to a task, τ, is defined as the difference between the reward obtained by the model, Rτ(m), and the expected reward for the given task, bτ = Em∼π[Rτ(m)]: Aτ(m) = Rτ(m) − bτ. Subtracting such a baseline is a standard technique in policy gradient algorithms, used to reduce the variance of the parameter updates [15].

The variance of each task's reward distribution is normalized by dividing the advantage by the standard deviation of the reward: A′τ(m) = (Rτ(m) − bτ)/στ, where στ = √(Em∼π[(Rτ(m) − bτ)²]). We refer to A′ as the normalized advantage. The gradient update to the parameters of the policy, θ, is the product of the normalized advantage and the expected derivative of the log probability of sampling an action: A′τ(m) Eπ[∇θ log πθ(m)]. Thus, normalizing the advantage may also be seen as adapting the learning rate for each task.

Figure 1: Left: A single time step of the recurrent multitask AutoML controller, in which a single action is taken. The task embedding is concatenated with the embedding of the action sampled at the previous timestep and passed into the controller RNN. All parameters, other than the task embeddings, are shared across tasks. Right: Cosine similarity between the task embeddings learned by the multitask neural AutoML model.

In practice, we compute bτ and στ using exponential moving averages over the sequence of rewards: bτᵗ = (1 − α)bτᵗ⁻¹ + αRτ(m) and στ²,ᵗ = (1 − α)στ²,ᵗ⁻¹ + α(Rτ(m) − bτᵗ)², where t indexes the trial and α = 0.01 is the decay factor.

2.3 Transfer Learning

The multitask controller is pretrained on a set of tasks and learns a prior over generic architectural and parameter choices, along with task-specific decisions encoded in the task embeddings. Given a new task, we can perform transfer of the controller by: 1) reloading the parameters of the pretrained multitask controller, 2) adding a new randomly initialized task embedding for the new task. Then, architecture search is resumed, and the controller's parameters are updated jointly with the new task embedding. By learning an embedding for the new task, the controller learns a representation that biases towards actions that performed well on similar tasks.

3 Related Work

A variety of optimization methods have been proposed to search over architectures, hyperparameters, and learning algorithms. These include random search [1], parameter modeling [3], meta-learned hyperparameter initialization [16], deep-learning based tree searches over a predefined model-specification language [17], and learning of gradient descent optimizers [18, 19]. An emerging body of neuro-evolution research has adapted genetic algorithms to these complex optimization problems [20], including setting the parameters of existing deep networks [21], evolving image classifiers [5], and evolving generic deep neural networks [6].

Our work relates closest to NAS [8].
NAS was applied to construct CNNs for CIFAR-10 image classification and RNNs for Penn Treebank language modelling. Subsequent work reduces the computational cost for more challenging tasks [10]. To engineer an architecture for ImageNet classification, Zoph et al. [10] train the NAS controller on the simpler CIFAR-10 task and then transfer the child architecture to ImageNet by stacking it. However, they did not transfer the controller model itself, relying instead on the intuition that additional depth is necessary for the more challenging task.

Other works apply RL to automate architecture generation and also reduce the computation cost. MetaQNN sequentially chooses CNN layers using Q-learning [22]. MetaQNN uses aggressive exploration to reduce search time, though it can cause the resulting architectures to underperform.

Cai et al. [23] transform existing architectures incrementally to avoid generating entire networks from scratch. Liu et al. [11] reduce search time by progressively increasing architecture complexity, and [12] propose child-model weight sharing to reduce child training time.

Transfer learning has achieved excellent results as an initialization method for deep networks, including for models trained using RL [24, 25, 26]. Recent meta-learning research has broadened this concept to learn generalizable representations across classes of tasks [27, 28]. Simultaneous multitask training can facilitate learning between tasks with a common structure, though retaining knowledge effectively across tasks is still an active area of research [29, 30].
There is also prior research on transfer of optimizers for Neural AutoML: Sequential Model-based Optimizers have been transferred across tasks to improve hyperparameter tuning [31, 32]; we propose a parallel solution for neural methods.

4 Experiments

Child models  Constructing the search space needs human input, so we choose wide parameter ranges to minimize injected domain expertise. Our search space for child models contains two-tower feedforward neural networks (FFNN), similar to the wide and deep models in Cheng et al. [33]. One tower is a deep FFNN, containing an input embedding module, fully connected layers and a softmax classification layer. This tower is regularized with an L2 loss. The other is a wide, shallow layer that directly connects the one-hot token encodings to the softmax classification layer with a linear projection. This tower is regularized with a sparse L1 loss. The wide layer allows the model to learn task-specific biases for each token directly. The deep FFNN's embedding modules are pretrained¹. This results in child models with higher quality and faster convergence.

The single search space for all tasks is defined by the following sequence of choices: 1) Pretrained embedding module. 2) Whether to fine-tune the embedding module. 3) Number of hidden layers (HL). 4) HL size. 5) HL activation function. 6) HL normalization scheme to use. 7) HL dropout rate. 8) Deep column learning rate. 9) Deep column regularization weight. 10) Wide layer learning rate. 11) Wide layer regularization weight. 12) Training steps. The Appendix contains the exact specification. The search space is much larger than the number of possible trials, containing 1.1B configurations. All models are trained using Proximal Adagrad with batch size 100. Notice that this search space aims to optimize jointly the architecture and hyperparameters.
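A minimal forward pass of such a two-tower child model might look as follows. The layer sizes are illustrative, a random matrix stands in for the pretrained embedding module, and the L1/L2 regularization terms only enter at training time, so they are omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HIDDEN, CLASSES = 1000, 64, 128, 2

# Deep tower: embedding module -> fully connected layer -> class logits.
# (A random matrix stands in for the pretrained embedding module.)
embedding = rng.normal(size=(VOCAB, EMB)) * 0.1
W1 = rng.normal(size=(EMB, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, CLASSES)) * 0.1

# Wide tower: linear projection straight from one-hot token counts to the
# logits, letting the model learn per-token biases directly.
W_wide = rng.normal(size=(VOCAB, CLASSES)) * 0.01

def forward(token_ids):
    counts = np.bincount(token_ids, minlength=VOCAB).astype(float)
    deep = np.maximum(embedding[token_ids].mean(axis=0) @ W1, 0.0) @ W2
    wide = counts @ W_wide
    logits = deep + wide  # both towers feed the shared softmax layer
    return np.exp(logits) / np.exp(logits).sum()

probs = forward(np.array([5, 17, 5, 901]))  # hypothetical token ids
```

Averaging the token embeddings is one simple choice of input pooling for the deep tower; the essential point is that the two towers are summed before the softmax.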
Standard NAS search spaces, by contrast, are defined strictly over architectural parameters.

Controller models  The controller is a 2-layer LSTM with 50 units. The action and task embeddings have size 25. The controller and embedding weights are initialized uniformly at random, yielding an approximately uniform initial distribution over actions. The learning rate is set to 10⁻⁴, and the controller receives a gradient update after every child model completes. We tried four variants of policy gradient to train the controller: REINFORCE [34], TRPO [35], UREX [36] and PPO [37]. In preliminary experiments on four NLP tasks, we found REINFORCE and TRPO to perform best and selected REINFORCE for the following experiments.

We evaluate three controllers. First, Transfer Neural AutoML, our neural controller that transfers from multitask pre-training. Second, Single-task AutoML, which is trained from scratch on each task. Finally, a baseline, Random Search (RS), that selects actions uniformly at random.

Metrics  To measure the ability of the different AutoML controllers to find good models, we compute the average accuracy of the topN (accuracy-topN) child models generated during the search. We select the best topN models according to accuracy on the validation set. We then report the validation and test performance of these models.

We assess convergence rates with two metrics: 1) accuracy-topN achieved with a fixed budget of trials, 2) the number of trials required to attain a certain reward.
The latter can only be used with validation accuracy-topN, since test accuracy-topN does not necessarily increase monotonically with the number of trials.

¹The pretrained modules are distributed via TensorFlow Hub: https://www.tensorflow.org/hub.

[Left]
Dataset            RS    NAML  T-NAML
20 Newsgroups      2470  1870  435
Brown Corpus       245   235   10
SMS Spam           4815  3390  70
Corp Messaging     3850  1510  80
Disasters          4970  2730  25
Emotion            4995  1645  195
Global Warming     4985  1935  90
Prog Opinion       4200  3620  60
Customer Reviews   4895  925   15
MPQA Opinion       4965  1510  15
Sentiment Cine     4520  3225  535
Sentiment IMDB     4760  630   690
Subj Movie         4745  1600  105

[Right]
Dataset            RS    NAML  T-NAML
20 Newsgroups      87.5  87.4  88.1±0.4
Brown Corpus       37.0  38.2  53.4±3.3
SMS Spam           97.9  97.8  98.1±0.1
Corp Messaging     90.0  90.2  90.2±0.3
Disasters          81.7  81.5  82.1±0.3
Emotion            33.9  33.7  35.3±0.3
Global Warming     82.4  82.8  82.9±0.3
Prog Opinion       68.9  66.3  70.3±0.9
Customer Reviews   77.8  79.0  81.4±0.5
MPQA Opinion       87.9  87.9  88.6±0.3
Sentiment Cine     73.2  76.3  75.4±0.4
Sentiment IMDB     85.8  87.3  88.1±0.1
Subj Movie         92.6  93.2  93.4±0.2

Table 1: Performance of Random Search (RS), single-task Neural AutoML (NAML) and Transfer Neural AutoML (T-NAML). Bolding indicates the best controller, or within ±2 s.e.m. Left: Number of trials needed to attain a validation accuracy-top10 equal to the best achieved by Random Search with 5000 trials (250/2500 for Brown and 20 Newsgroups, respectively). Right: Test accuracy-top10 at a fixed budget B of 500 trials (B = 250 for Brown). Error bars show ±2 s.e.m. computed across the top 10 models. Similar s.e.m. values are observed for all methods.

4.1 Natural Language Processing

Data  We evaluate using 21 text classification tasks with varied statistics. The dataset sizes range from 500 to 420k datapoints.
The number of classes ranges from 2 to 157, and the mean length of the texts, in characters, ranges from 19 to 20k. The Appendix contains full statistics and references.

Each child model is trained on the training set. The accuracy on the validation set is used as the reward for the controller. The topN child models, selected on the validation set, are evaluated on the test set. Datasets without a pre-defined train/validation/test split are split randomly 80/10/10.

The multitask controller is pretrained on 8 randomly sampled tasks: Airline, Complaints, Economic News, News Aggregator, Political Message, Primary Emotion, Sentiment SST, US Economy. We then transfer from this controller to each of the remaining 13 tasks.

Results  To assess the controllers' ability to optimize the reward (validation set accuracy), we compute the speed-up versus the baseline, RS. We first compute accuracy-top10 on the validation set for RS given a fixed budget of B trials. We use B = 5000, except for the Brown Corpus and 20 Newsgroups, where we can only use B = 500 and B = 3500, respectively, because these datasets were slower to train. We then report the number of trials required by AutoML and T-AutoML to achieve the same validation accuracy-top10 as RS with B trials. Table 1 (left) shows the results. Note that RS may exhibit fewer than B = 5000 trials if it converged earlier. These results show that T-AutoML is effective at optimizing validation accuracy, offering a large reduction in time to attain a fixed reward. In 12 of the 13 datasets T-AutoML achieves the desired reward fastest, and in 9 cases achieves an order of magnitude speed-up.

Next, we assess the quality of the models on the test set. Table 1 (right) shows test accuracy-top10 with a budget of 500 trials (250 for Brown Corpus). Within this budget, T-AutoML performs best on all but one dataset. T-AutoML outperforms single-task AutoML on 10 out of the 13 datasets, ties on one, and loses on two.
On the datasets where T-AutoML does not produce the best final model at 500 trials, it often produces better models at earlier iterations. Figure 2 shows the full learning curves of test set accuracy-top10 versus number of trials. Figure 2 shows that in most cases the controller with transfer starts with a much better prior over good models. On some datasets the quality is improved with further training, e.g. Emotion and Corp Messaging, but on others the initial configurations learned from the multitask model are not improved.

Figure 2: Learning curves for Random Search (RS), single-task Neural AutoML (NAML), and Transfer (T-NAML). Panels: 20 Newsgroups, Brown Corpus, Corp Messaging, Customer Reviews, Disasters, Emotion, Global Warming, MPQA Opinion, Prog Opinion, Sentiment Cine, Sentiment IMDB, SMS Spam. x-axis: Number of trials (child model evaluations). y-axis: Average test set accuracy of the 10 models with best validation accuracy (test accuracy-top10) found up to each trial.

For reference, we put the learning curves for the initial multitask training phase in the Appendix. We also ran RS and single-task AutoML on these datasets. Slightly disappointingly, multitask training did not in itself yield substantial improvements over single-task; it attains a higher accuracy on two datasets, and is similar on the other six.

We aim to attain good performance with the fewest possible trials. We do not seek to beat the state-of-the-art on all datasets because, first, although our search space is large, it does not contain all performant model components (e.g. convolutions). Second, we use embedding modules pretrained on large datasets, which makes the results incomparable to those that only use in-domain training data. However, to confirm that Neural AutoML generates good models, we compare to some previously published results where available.
Overall, we find that Transfer AutoML with the search space described above yields models competitive with the state-of-the-art. For example, Almeida et al. [38] use classical ML classifiers (Logistic Regression, SVMs, etc.) on SMS Spam and report a best accuracy of 97.59%; Transfer AutoML gets an accuracy-top10 of 98.1%. Le and Mikolov [39] report 92.58% accuracy on Sentiment IMDB with more complex architectures; Transfer AutoML is a little behind, with an accuracy-top10 of 88.1%. Li et al. [40] report 86.8% accuracy using an ensemble of weighted neural BOWs on MPQA; Transfer AutoML achieves an accuracy-top10 of 88.6%. Li et al. [40] also evaluate their ensemble of weighted neural BOW models on Customer Reviews and achieve 82.5% best accuracy, though the best accuracy of any single model is 81.1%; comparably, T-AutoML gets an accuracy-top10 of 81.4%. Barnes et al. [41] compare many algorithms and report a best accuracy on Sentiment-SST of 83.1% using LSTMs; Multitask AutoML gets an accuracy-top10 of 83.4%. The best performance achieved with a more complex architecture that is not in our search space is 87.8% [39]. Maas et al. [42] report 88.1% on Movie Subj; Transfer AutoML gets an accuracy-top10 of 93.4%.

Computational Cost and Savings  The median cost to perform a single trial across all 21 datasets in our experiments is T = 268s. If we run B trials with a speedup factor of S, we save BT(1 − S⁻¹)/3600 CPU-h per task to attain a fixed reward (validation accuracy-top10). Estimating the speedup factors from Table 1 (left) for transfer over single-task, we attain a median computational saving of 30 CPU-h per task when performing B = 500 trials. The mean is 89 CPU-h, but this is heavily influenced by the slow Brown Corpus.
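Plugging the quoted median values into this formula gives the following worked example. Note this single-point estimate differs from the paper's reported median saving of 30 CPU-h, which is the median taken over per-task savings:

```python
T = 268.0   # median seconds per trial
B = 500     # trial budget
S = 22.0    # median speedup factor of transfer over single-task

# CPU-hours saved per task: B*T*(1 - 1/S) seconds, converted to hours.
savings_cpu_h = B * T * (1 - 1 / S) / 3600

# Number of new tasks (as a multiple of the M pretraining tasks) needed to
# amortize the cost of pretraining the multitask controller.
amortization_factor = 1 / (1 - 1 / S)

print(round(savings_cpu_h, 1), round(amortization_factor, 2))
```

With S = 22 the amortization factor is 22/21 ≈ 1.05, matching the ">1.05M new tasks" figure quoted in the next paragraph of the text.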
The time to train the multitask controller is 15h on 100 CPUs. If we do not need the M models for the tasks used to train the multitask controller, then we must run > (1 − 1/S)⁻¹M new tasks to amortize this cost. For the median speedup in our experiments, S = 22, that is > 1.05M new tasks.

Figure 3: Comparison on an image classification task, Cifar-10. Mean test accuracy of the top 10 models chosen on the validation set.

4.2 Image classification

To validate the generality of our approach, we evaluate on an image classification task: Cifar-10. We compare the same three controllers: RS, AutoML trained from scratch, and Transfer AutoML pretrained on MNIST and Flowers². Figure 3 shows the mean accuracy-top10 on the test set. The transferred controller attains an accuracy-top10 of 96.5%, similar to the other methods, but converges much faster, as in the NLP tasks. The best models embed images with a finetuned Inception v3 network, pretrained on ImageNet. Relu activations are preferred over Swish [43], and the dropout rate converges to 0.3.

4.3 Analysis

Meta overfitting  The controller is trained on the tasks' validation sets. Overfitting of AutoML to the validation set is not often addressed. This type of overfitting may seem unlikely because each trial is expensive, and many trials may be required to overfit. However, we observe it in some cases. Figure 4 (left, center) shows the accuracy-top10 on the validation and test sets for the Prog Opinion dataset.
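The effect comes from selecting models by noisy validation scores. A tiny simulation with hypothetical numbers reproduces it: even when every model has the same true accuracy, the validation score of the apparent top-10 is inflated by lucky evaluations, while an independent test evaluation is not:

```python
import random

random.seed(0)

# Every candidate model has the same true accuracy; each evaluation of a
# model adds independent noise (all quantities hypothetical).
TRUE_ACC, NOISE, TRIALS = 0.70, 0.05, 500

val_scores = [TRUE_ACC + random.gauss(0, NOISE) for _ in range(TRIALS)]
top10 = sorted(range(TRIALS), key=lambda i: val_scores[i], reverse=True)[:10]

# Validation accuracy-top10 is inflated by the lucky evaluations...
val_top10 = sum(val_scores[i] for i in top10) / 10
# ...while an independent test-set evaluation of the same models is not.
test_top10 = sum(TRUE_ACC + random.gauss(0, NOISE) for _ in top10) / 10

print(round(val_top10, 3), round(test_top10, 3))
```

The gap between `val_top10` and `test_top10` is exactly the kind of artificial validation improvement described here; smaller validation sets correspond to larger `NOISE` and a larger gap.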
Transfer Neural AutoML attains good solutions in the first few trials, but afterwards its validation performance grows while test performance does not. The generalization gap between the validation and test accuracy increases over time. This is the most extreme case we observed, but other datasets also exhibit a generalization gap (see Appendix for all validation curves). This effect is largest on Prog Opinion because the validation set is tiny, with only 116 examples.

Overfitting arises from bias due to selecting the best models on the validation set. Child evaluation contains randomness due to the stochastic training procedure. Therefore, over time we see an improved validation score, even after convergence, due to lucky evaluations. However, those apparent improvements are not reflected on the test set. Transfer AutoML exhibits more overfitting than single-task because it converges earlier. We confirmed this effect: if we 'cheat' and select models by their test-set performance, we observe the same artificial improvement in the test score as in the validation score. Other than entropy regularization, we do not combat overfitting extensively. Here, we simply emphasize that because our Transfer Neural AutoML model observes many trials in total, meta-overfitting becomes a bigger issue. We leave combatting this effect to future research.

Distant transfer: across languages  The more distant the tasks, the harder it is to perform transfer learning. The Sentiment Cine task is an outlier because it is the only Spanish task. Figure 2 and Table 1 show poorer performance of transfer on this task.

The most language-sensitive parameters are the pretrained word embeddings. The controller selects from eight pretrained embeddings (see Appendix), six of which are English and two Spanish.
In the first 1500 iterations, the transferred controller chooses English embeddings, limiting the performance. However, after further training, the controller switches to the Spanish tables at around the 2000th trial (Figure 4, right). At trial 2000, T-AutoML attains a test accuracy-top10 of 79.8%, approximately equal to that of random search with 79.4%, and greater than single-task with 78.1%. This indicates that although transfer works best on similar tasks, the controller is still able to adapt to outliers given sufficient training time.

²goo.gl/tpzfR1

Figure 4: Left, Center: Learning curves on the validation (left) and test (center) sets for the Prog Opinion dataset. Right: Evolution of the choice of pretrained embedding module (nnlm-en-dim128, nnlm-es-dim50, nnlm-es-dim128) for transfer to the Spanish Corpus-Cine task. The y-axis indicates the probability of sampling each table; this probability is estimated from the samples using a sliding window of width 100.

Task representations and learned models  We inspect the learned task similarities via the embeddings. Figure 1 (right) shows the cosine similarity between the task embeddings learned during multitask training. The model assigns most tasks to two clusters. It is hard to guess a priori which tasks require similar models; the dataset sizes, numbers of classes and text lengths differ greatly. However, the controller assigns the same model to tasks within the same cluster. At convergence, the cluster {Complaints, News Agg, Airline, Primary Emotion} is assigned (with high probability) a 1-layer network with 256 units, Swish activation function, wide-layer learning rate 0.01, and dropout rate 0.2.
The cluster {Economic News, Political Message, Sentiment SST} is assigned 2-layer networks with 64 units, Relu activation, wide-layer learning rate 0.003, and dropout rate 0.3.

Other choices follow similar distributions for each cluster. For example, the same 128D word embeddings, trained using a Neural Language Model, are chosen. The controller also always chooses to fine-tune these embeddings. The controller may remove either the deep or wide tower by setting the regularization very high, but in all cases it chooses to keep both active.

Ablation  We consider two ablations of T-NAML. First, we remove the task embeddings. For this, we train a task-agnostic multitask controller without task embeddings, then transfer this controller as for T-NAML. Second, we transfer a single architecture rather than the controller parameters. For this, we train the task-agnostic multitask controller to convergence and select the final child model. We then re-train this single architecture on each new task. Omitting task embeddings performs well on some tasks, but poorly on those that require a model different from the modal one. Overall, according to accuracy-top10 at 500 trials, T-NAML outperforms the version without task embeddings on 8 tasks, loses on 4, and draws on 1. The mean performance drop when ablating task embeddings is 1.8%. Using just a single model performs very poorly on many tasks: T-NAML wins 8 cases, loses 2, and draws 3, with a mean performance increase of 4.8%.

5 Conclusion

Neural AutoML, whilst becoming popular, comes with a high computational cost. To address this we propose transfer learning of the controller and show large reductions in convergence time across many datasets. Extensions to this work include: Broadening the search space to contain more model classes. Attempting transfer across modalities; some priors over hyperparameter combinations learned on NLP tasks may be useful for images or other domains.
Making the controller more robust to evaluation noise, and addressing the potential to meta-overfit on small datasets.

Acknowledgments

We are very grateful to Quentin de Laroussilhe, Andrey Khorlin, Quoc Le, Sylvain Gelly, the TensorFlow Hub team and the Google Brain team Zurich for developing software frameworks and many useful discussions.

References

[1] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, 2012.

[2] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NIPS, 2011.

[3] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML, 2013.

[4] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.

[5] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.

[6] Risto Miikkulainen, Jason Zhi Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. CoRR, abs/1703.00548, 2017.

[7] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.

[8] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.

[9] Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning.
In AAAI,\n\n2018.\n\n[10] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for\n\nscalable image recognition. CoRR, abs/1707.07012, 2017.\n\n[11] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang,\n\nand Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.\n\n[12] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Ef\ufb01cient neural architecture search\n\nvia parameter sharing. arXiv preprint arXiv:1802.03268, 2018.\n\n[13] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint\n\narXiv:1806.09055, 2018.\n\n[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of\n\nwords and phrases and their compositionality. In NIPS, pages 3111\u20133119, 2013.\n\n[15] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient\n\nestimates in reinforcement learning. JMLR, 2004.\n\n[16] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter.\n\noptimization via meta-learning. In AAAI, 2015.\n\nInitializing bayesian hyperparameter\n\n[17] Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architec-\n\ntures. arXiv preprint arXiv:1704.08792, 2017.\n\n[18] Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha\nDenil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. arXiv\npreprint arXiv:1703.04813, 2017.\n\n[19] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement\n\nlearning. In ICML, 2017.\n\n[20] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O Stanley, and Jeff\nClune. 
Improving exploration in evolution strategies for deep reinforcement learning via a population of\nnovelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.\n\n[21] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune.\nDeep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks\nfor reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.\n\n[22] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures\n\nusing reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.\n\n9\n\n\f[23] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Reinforcement learning for architecture\n\nsearch by network transformation. arXiv preprint arXiv:1707.04873, 2017.\n\n[24] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural\n\nnetworks? In NIPS, 2014.\n\n[25] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf:\n\nan astounding baseline for recognition. In CVPR workshops, 2014.\n\n[26] Yusen Zhan and Matthew E Taylor. Online transfer learning in reinforcement learning domains. arXiv\n\npreprint arXiv:1507.00436, 2015.\n\n[27] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep\n\nnetworks. ICML, 2017.\n\n[28] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner.\n\nIn NIPS 2017 Workshop on Meta-Learning, 2017.\n\n[29] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu,\nKieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic\nforgetting in neural networks. PNAS, 2017.\n\n[30] Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell,\nNicolas Heess, and Razvan Pascanu. 
Distral: Robust multitask reinforcement learning. arXiv preprint\narXiv:1707.04175, 2017.\n\n[31] R\u00e9mi Bardenet, M\u00e1ty\u00e1s Brendel, Bal\u00e1zs K\u00e9gl, and Michele Sebag. Collaborative hyperparameter tuning.\n\nIn ICML, pages 199\u2013207, 2013.\n\n[32] Dani Yogatama and Gideon Mann. Ef\ufb01cient transfer learning method for automatic hyperparameter tuning.\n\nIn AISTATS, pages 1077\u20131085, 2014.\n\n[33] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen\nAnderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain,\nXiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. CoRR, abs/1606.07792,\n2016.\n\n[34] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement\n\nlearning. In Reinforcement Learning. Springer, 1992.\n\n[35] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy\n\noptimization. In ICML, 2015.\n\n[36] O\ufb01r Nachum, Mohammad Norouzi, and Dale Schuurmans.\n\nunder-appreciated rewards. In ICLR, 2017.\n\nImproving policy gradient by exploring\n\n[37] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy\n\noptimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[38] Tiago Almeida, Jos\u00e9 Mar\u00eda G\u00f3mez Hidalgo, and Tiago Pasqualini Silva. Towards sms spam \ufb01ltering:\n\nResults under a new dataset. International Journal of Information Security Science, 2013.\n\n[39] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, ICML\u201914,\n\n2014.\n\n[40] Bofang Li, Zhe Zhao, Tao Liu, Puwei Wang, and Xiaoyong Du. Weighted neural bag-of-n-grams model:\n\nNew baselines for text classi\ufb01cation. In COLING, 2016.\n\n[41] Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. 
Assessing state-of-the-art sentiment models\non state-of-the-art sentiment datasets. In Proceedings of the 8th Workshop on Computational Approaches\nto Subjectivity, Sentiment and Social Media Analysis. ACL, 2017.\n\n[42] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts.\n\nLearning word vectors for sentiment analysis. In ACL: Human Language Technologies. ACL, 2011.\n\n[43] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv preprint\n\narXiv:1710.05941, 2017.\n\n10\n\n\f", "award": [], "sourceid": 5066, "authors": [{"given_name": "Catherine", "family_name": "Wong", "institution": "MIT"}, {"given_name": "Neil", "family_name": "Houlsby", "institution": "Google"}, {"given_name": "Yifeng", "family_name": "Lu", "institution": null}, {"given_name": "Andrea", "family_name": "Gesmundo", "institution": "Google"}]}