{"title": "Meta Architecture Search", "book": "Advances in Neural Information Processing Systems", "page_first": 11227, "page_last": 11237, "abstract": "Neural Architecture Search (NAS) has been quite successful in constructing state-of-the-art models on a variety of tasks. Unfortunately, the computational cost can make it difficult to scale. In this paper, we make the first attempt to study Meta Architecture Search which aims at learning a task-agnostic representation that can be used to speed up the process of architecture search on a large number of tasks. We propose the Bayesian Meta Architecture SEarch (BASE) framework which takes advantage of a Bayesian formulation of the architecture search problem to learn over an entire set of tasks simultaneously. We show that on Imagenet classification, we can find a model that achieves 25.7% top-1 error and 8.1% top-5 error by adapting the architecture in less than an hour from an 8 GPU days pretrained meta-network. By learning a good prior for NAS, our method dramatically decreases the required computation cost while achieving comparable performance to current state-of-the-art methods - even finding competitive models for unseen datasets with very quick adaptation. We believe our framework will open up new possibilities for efficient and massively scalable architecture search research across multiple tasks.", "full_text": "Meta Architecture Search\n\nAlbert Shaw1\u2217 Wei Wei2 Weiyang Liu1 Le Song1,3 Bo Dai1,2\n1Georgia Institute of Technology\n2Google Research 3Ant Financial\n\nAbstract\n\nNeural Architecture Search (NAS) has been quite successful in constructing state-\nof-the-art models on a variety of tasks. Unfortunately, the computational cost can\nmake it dif\ufb01cult to scale. In this paper, we make the \ufb01rst attempt to study Meta\nArchitecture Search which aims at learning a task-agnostic representation that\ncan be used to speed up the process of architecture search on a large number of\ntasks. 
We propose the Bayesian Meta Architecture SEarch (BASE) framework, which takes advantage of a Bayesian formulation of the architecture search problem to learn over an entire set of tasks simultaneously. We show that on Imagenet classification, we can find a model that achieves 25.7% top-1 error and 8.1% top-5 error by adapting the architecture in less than an hour from a meta-network pretrained for 8 GPU days. By learning a good prior for NAS, our method dramatically decreases the required computation cost while achieving comparable performance to current state-of-the-art methods, even finding competitive models for unseen datasets with very quick adaptation. We believe our framework will open up new possibilities for efficient and massively scalable architecture search research across multiple tasks.

1 Introduction

For deep neural networks, the particular structure often plays a vital role in achieving state-of-the-art performance in many practical applications, and there has been much work [16, 11, 13, 41, 23, 22, 21, 32, 31, 36] exploring the space of neural network designs. Due to the combinatorial nature of the design space, hand-designing architectures is time-consuming and inevitably sub-optimal. Automated Neural Architecture Search (NAS) has had great success in finding high-performance architectures. However, people may need optimal architectures for several similar tasks at once, such as solving different classification tasks or even optimizing task networks for both high accuracy and efficient inference on multiple hardware platforms [35].
Although there has been success in transferring architectures across tasks [43], recent work has increasingly shown that the optimal architectures can vary between even similar tasks; to achieve the best results, NAS would need to be repeatedly run for each task [5], which can be quite costly.

In this work, we present a first effort towards Meta Architecture Search, which aims at learning a task-agnostic representation that can be used to search over multiple tasks efficiently. The overall graphical illustration of the model can be found in Figure 1, where the meta-network represents the collective knowledge of architecture search across tasks. Meta Architecture Search takes advantage of the similarities among tasks and the corresponding similarities in their optimal networks, reducing the overall training time significantly and allowing fast adaptation to new tasks. We formulate the Meta Architecture Search problem from a Bayesian perspective and propose Bayesian Meta Architecture SEarch (BASE), a novel framework to derive a variational inference method to learn optimal weights and architectures for a task distribution. To parameterize the architecture search space, we use a stochastic neural network which contains all the possible architectures within our architecture space as specific paths within the network. By using the Gumbel-softmax [14] distribution in the parameterization of the path distributions, this network containing an entire architecture space can be optimized in a fully differentiable manner.

*Corresponding author: ashaw596@gatech.edu
The code repository is available at https://github.com/ashaw596/meta_architecture_search.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Illustrations of Meta Architecture Search. We train a shared distribution for the meta-network, and a sample from the distribution will quickly adapt to a new task.
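As a concrete illustration of the Gumbel-softmax relaxation described above (our own sketch, not the released code; the function name and the numpy setting are ours), sampling relaxed operation-selection weights looks like this:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=np.random.default_rng(0)):
    """Draw a relaxed one-hot sample over operation choices.

    Adding Gumbel noise to the logits and taking a temperature-scaled
    softmax gives a differentiable approximation of sampling from
    Categorical(softmax(logits)); as tau -> 0 the sample approaches one-hot.
    """
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = y - y.max()              # subtract max for numerical stability
    expy = np.exp(y)
    return expy / expy.sum()

# Relaxed selection among J = 4 candidate operations on one cell edge.
weights = gumbel_softmax(np.array([2.0, 0.5, 0.1, -1.0]), tau=0.5)
```

As the temperature tau shrinks, the sampled weights approach a one-hot vector, recovering a discrete architecture choice while keeping the selection differentiable during training.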
To account for the task distribution in the posterior distribution of the neural network architecture and weights, we exploit the optimization embedding [6] technique to design the parameterization of the posterior. This allows us to train it as a meta-network optimized over a task distribution.

To train our meta-network over a wide distribution of tasks with different image sizes, we define a new space of classification tasks by randomly selecting 10 Imagenet [7] classes and downsampling the images to 32x32, 64x64, or 224x224 image sizes. By training on these datasets, we can learn good distributions of architectures optimized for different image sizes. With a meta-network trained for 8 GPU days, we then show that we can achieve very competitive results on full Imagenet by deriving optimal task-specific architectures from the meta-network, obtaining 25.7% top-1 error on ImageNet using an adaptation time of less than one hour. Our method achieves significantly lower computational costs compared to current state-of-the-art NAS approaches. By adapting the multi-task meta-network to the unseen CIFAR10 dataset for less than one hour, we found a model that achieves 2.83% top-1 error. Additionally, we also apply this method to tackle neural architecture search for few-shot learning, demonstrating the flexibility of our framework.

Our research opens up new possibilities for applying Meta Architecture Search across massive numbers of tasks.
The nature of the Bayesian formulation makes it possible to learn over an entire collection of tasks simultaneously, bringing additional benefits such as computational efficiency and privacy when performing neural architecture search.

2 Related Work

Neural Architecture Search Several evolutionary and reinforcement learning based algorithms have been quite successful in achieving state-of-the-art performance on many tasks [42, 43, 30, 12]. However, these methods are computationally costly and require tremendous amounts of computing resources. While previous work has achieved good results by sharing architectures across tasks [43], [35] and [5] show that task-specific and even platform-specific architecture search is required in order to achieve the best performance. Several methods [20, 27, 4, 3, 17] have been proposed to reduce the search time, and both FBNet [35] and SNAS [37] utilize the Gumbel-Softmax [14] distribution similarly to our meta-network design to allow gradient-based architecture optimization. [2] and [40] both propose methods that, like our meta-network, can generate optimal weights for any given architecture on one task. Their methods, however, do not allow optimization of the architectures and are trained on only a single task, making them inefficient in optimizing over multiple tasks. Similarly to our work, [34] recently proposed methods to accelerate search utilizing knowledge from previous searches and predicting posterior distributions of the optimal architecture.
Our approach, however, achieves much better computational efficiency by not limiting ourselves to transferring knowledge from only the performance of discrete architectures on the validation datasets, but instead sharing knowledge for both optimal weights and architecture parameters and implicitly characterizing the entire dataset utilizing optimization embedding.

Meta Learning Meta-learning methods allow networks to be quickly trained on new data and new tasks [8, 29]. While previous works have not applied these methods to Neural Architecture Search, our derived Bayesian optimization method bears some similarities to Neural Processes [9, 10, 15]. Both can derive a neural network specialized for a dataset by conditioning the model on some samples from the dataset. The use of neural networks allows both to be optimized by gradient descent. However, Neural Processes use specially structured encoder and aggregator networks to build a context embedding from the samples. We use the optimization embedding technique [6] to condition our neural network using gradient descent in an inner loop, which allows us to avoid explicitly summarizing the datasets with a separate network. This inner-outer loop dynamic shares some similarities with second-order MAML [8]. Both algorithms unroll the stochastic gradient descent step. Due to this, we are also able to establish a connection between the heuristic MAML algorithm and Bayesian inference.

3 A Bayesian Inference View of Architecture Search

In this section, we propose a Bayesian inference view for neural architecture search which naturally introduces the hierarchical structures across different tasks.
Such a view inspires an efficient algorithm which can provide a task-specific neural network with adapted weights and architecture using only a few learning steps.

We first formulate neural architecture search as an operation selection problem. Specifically, we consider the neural network as a composition of L layers of cells, where the cells share the same architecture but have different parameters. In the l-th layer, the cell consists of a K-layer sub-network with bypass connections. Specifically, we denote x^l_k as the output of the k-th layer of the l-th cell:

x^l_k = \sum_{i=1}^{k-1} z_{i,k}^\top \big( A_i(\theta^l_{i,k}) \circ x^l_i \big) := \sum_{i=1}^{k-1} \sum_{j=1}^{J} z_{ij,k}\, \phi_j(x^l_i; \theta^l_{ij,k}),    (1)

where A_i(\theta^l_{i,k}) := [\phi_j(\cdot; \theta^l_{ij,k})]_{j=1}^{J} denotes a group of J different operations from R^d \to R^p which depend on parameters \theta^l_{ij,k}, e.g., different nonlinear neurons, convolution kernels with different sizes, or other architecture choices. The z_{i,k} are binary variables which are shared across the L layers. They indicate which layers from levels 1 to k-1 in the l-th cell should be selected as inputs to the k-th layer. Therefore, with different instantiations of z, the cell will select different operations to form the output. Figure 1 has an illustration of this structure.

We assume the probabilistic model

\theta^l_{ij,k} \sim N\big(\mu^l_{i,k}, (\sigma^l_{i,k})^2\big),  z_{i,k} \sim \text{Categorical}(\alpha_{i,k}),  k = 1, \dots, K,
y \sim p(y \mid x; \theta, z) \propto \exp\big(-\ell(f(x; \theta, z), y)\big),    (2)

with \theta = \{[\theta^l_k]_{l=1}^{L}\}_{k=1}^{K}, \theta^l_k := [\theta^l_{ij,k}]_{i,j=1}^{k-1,J}, z = \{[z_{i,k}]_{i=1}^{k-1}\}_{k=1}^{K}, and \alpha^l_{i,k} \ge 0, \sum_{l=1}^{L} \alpha^l_{i,k} = 1. With this probabilistic model, the selection of z, i.e., neural network architecture search, is reduced to finding a distribution defined by \alpha, and neural network learning is reduced to finding \theta, both of which are parameters of the probabilistic model.

The most natural choice for probabilistic model estimation is maximum log-likelihood estimation (MLE), i.e.,

\max_{W := (\mu, \sigma, \alpha)} \hat{E}_{x,y}\Big[\log \int p(y \mid x; \theta, z)\, p(z; \alpha)\, p(\theta; \mu, \sigma)\, dz\, d\theta\Big].    (3)

However, the MLE is intractable due to the integral over the latent variable z. We apply the classic variational Bayesian inference trick, which leads to the evidence lower bound (ELBO), i.e.,

\max_{W} \max_{q(z), q(\theta)} \hat{E}_{x,y} E_{z \sim q(z), \theta \sim q(\theta)}\big[-\ell(f(x; \theta, z), y)\big] - KL\big(q(z) q(\theta) \,\|\, p(z, \theta)\big),    (4)

where p(z) = \prod_{k=1}^{K} \prod_{i=1}^{k-1} \text{Categorical}(z_{i,k}) = \prod_{k=1}^{K} \prod_{i=1}^{k-1} \prod_{l=1}^{L} (\alpha^l_{i,k})^{z^l_{i,k}}. As shown in [39], the optimal solution of (4) over all possible distributions is the posterior. With such a model, architecture learning can be recast as Bayesian inference.

3.1 Bayesian Meta Architecture Learning

Based on the Bayesian view of architecture search, we can easily extend it to the meta-learning setting, where we have many tasks, i.e., D_t = \{x^t_i, y^t_i\}_{i=1}^{n}. We are required to learn the neural network architectures and the corresponding parameters jointly while taking the task dependencies of the neural network structure into account.

We generalize the model (2) to handle multiple tasks as follows. For the t-th task, we design the model following (2).
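As a minimal sketch of the mixed-operation cell in (1) (our own illustration; toy linear-plus-nonlinearity operations stand in for real convolutions of different sizes):

```python
import numpy as np

def cell_layer_output(xs, z, thetas):
    """Compute x_k = sum_i sum_j z[i, j] * phi_j(x_i; theta[i][j]), as in (1).

    xs     : list of the k-1 earlier layer outputs, each a length-d vector
    z      : (k-1, J) selection weights (binary, or relaxed Gumbel-softmax rows)
    thetas : (k-1, J, d, d) operation parameters; each candidate op phi_j is a
             linear map followed by a fixed nonlinearity (a toy stand-in for
             convolution kernels with different sizes, etc.)
    """
    ops = [np.tanh, lambda v: np.maximum(v, 0.0), lambda v: v]  # J = 3 toy ops
    out = np.zeros_like(xs[0])
    for i, x in enumerate(xs):
        for j, phi in enumerate(ops):
            out += z[i, j] * phi(thetas[i, j] @ x)
    return out

rng = np.random.default_rng(0)
d, J = 5, 3
xs = [rng.normal(size=d), rng.normal(size=d)]     # outputs of layers 1..k-1
z = np.array([[1.0, 0.0, 0.0],                    # hard selection: tanh op on xs[0]
              [0.0, 1.0, 0.0]])                   # and relu op on xs[1]
thetas = rng.normal(size=(2, J, d, d))
x_k = cell_layer_output(xs, z, thetas)
```

With hard one-hot z the cell reduces to a discrete architecture; with relaxed z the same code computes the differentiable mixture used during search.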
Meanwhile, the hyperparameters, i.e., (\mu, \sigma, \alpha), are shared across all the tasks. In other words, the layers and architecture priors are shared between tasks. Then we have the MLE:

\max_{W} \hat{E}_{D_t} \hat{E}_{(x,y) \sim D_t}\Big[\log \int p(y \mid x; \theta, z)\, p(z; \alpha)\, p(\theta; \mu, \sigma)\, dz\, d\theta\Big].    (5)

Similarly, we exploit the ELBO. Due to the structures induced by sharing across the tasks, the posteriors for (z, \theta) have special dependencies, i.e.,

\max_{W} \max_{q(z|D), q(\theta|D)} \hat{E}_{D_t}\Big( \hat{E}_{(x,y) \sim D_t} E_{z \sim q(z|D), \theta \sim q(\theta|D)}\big[-\ell(f(x; \theta, z), y)\big] - KL(q \,\|\, p) \Big).    (6)

With the variational posterior distributions q(z|D) and q(\theta|D) introduced into the model, we can directly generate the architecture and its corresponding weights based on the posterior. In a sense, the posterior can be understood as the neural network predictive model.

4 Variational Inference by Optimization Embedding

The design of the parameterization of the posteriors q(z|D) and q(\theta|D) is extremely important, especially in our case where we need to model the dependence of (z, \theta) on the task distribution D and the loss information. Fortunately, we can bypass this problem by applying parameterized Coupled Variational Bayes (CVB), which generates the parameterization automatically through optimization embedding [6].

Specifically, we assume that q(\theta|D) is Gaussian and that q(z|D) is a product of categorical distributions. We approximate the categorical z with the Gumbel-Softmax distribution [14, 25], which leads to a valid gradient so that the model is fully differentiable. Therefore, we have

q_\psi(\theta|D) = N(\psi_\mu, \psi_\sigma),   q_\phi(z_{i,k}|D) = \Gamma(L)\, \tau^{L-1} \Big( \sum_{l=1}^{L} \pi^l_{D,\phi_{i,k}} / (z^l_{i,k})^{\tau} \Big)^{-L} \prod_{l=1}^{L} \Big( \pi^l_{D,\phi_{i,k}} / (z^l_{i,k})^{\tau+1} \Big).    (7)

Then, we can sample (\theta, z) as follows:

\theta_D(\epsilon, \psi) = \psi_{D,\mu} + \epsilon\, \psi_{D,\sigma},   \epsilon \sim N(0, 1),
z^l_{i,k,D}(\xi, \phi) = \exp\big((\phi^l_{D,i,k} + \xi^l)/\tau\big) \Big/ \sum_{l'=1}^{L} \exp\big((\phi^{l'}_{D,i,k} + \xi^{l'})/\tau\big),   \xi^l \sim G(0, 1),   l \in \{1, \dots, L\},    (8)

with \pi_{x,\phi,i} = \exp(\phi_{x,i}) / \sum_{i'=1}^{p} \exp(\phi_{x,i'}), where G(0, 1) denotes the Gumbel distribution. We emphasize that we do not yet have any explicit form of the parameters \phi_D and \psi_D; they will be derived by optimization embedding automatically.

Plugging this formulation into the ELBO (6), we arrive at the objective

\max_{\phi_D, \psi_D} \hat{E}_D \underbrace{\hat{E}_{x,y} E_{\xi,\epsilon}\Big[ -\ell\big(f(x; \theta_D(\epsilon, \psi), z_D(\xi, \phi)), y\big) - \log \frac{q_\phi(z|D)}{p(z; \alpha)} - \log \frac{q_\psi(\theta|D)}{p(\theta; \mu, \sigma)} \Big]}_{L(\phi_D, \psi_D; W)}.    (9)

With the ultimate objective (9), we follow the parameterized CVB derivation [6] for embedding the optimization procedure for (\phi, \psi).
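The sampling step in (8) is a standard reparameterization; here is a minimal sketch under our own simplified shapes (one weight vector, one architecture variable with L = 3 choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta(psi_mu, psi_sigma):
    # Gaussian reparameterization: theta = mu + eps * sigma with eps ~ N(0, 1),
    # so gradients can flow to (psi_mu, psi_sigma) through the sample.
    eps = rng.normal(size=psi_mu.shape)
    return psi_mu + eps * psi_sigma

def sample_z(phi, tau=0.5):
    # Gumbel-softmax reparameterization of the categorical architecture
    # variable: xi ~ Gumbel(0, 1), z_l proportional to exp((phi_l + xi_l) / tau).
    xi = -np.log(-np.log(rng.uniform(size=phi.shape)))
    y = (phi + xi) / tau
    y -= y.max()                 # numerical stability
    expy = np.exp(y)
    return expy / expy.sum()

theta = sample_theta(np.zeros(10), 0.1 * np.ones(10))  # weights for one operation
z = sample_z(np.array([1.5, 0.2, -0.3]))               # soft selection over L = 3 choices
```

Because both samples are deterministic functions of the noise and the parameters, the whole objective stays differentiable in (phi, psi), which is what allows the optimization-embedding unrolling that follows.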
Denote \hat{g}_{\phi_D,\psi_D}(D, W) = \partial \hat{L} / \partial(\phi_D, \psi_D), where \hat{L} is the stochastic approximation of L(\phi_D, \psi_D; W). Then stochastic gradient descent (SGD) iteratively updates

[\phi^t_D, \psi^t_D] = \eta_t\, \hat{g}_{\phi_D,\psi_D}(D, W) + [\phi^{t-1}_D, \psi^{t-1}_D].    (10)

We can initialize (\phi^0, \psi^0) = W, which is shared across all the tasks. Alternative choices are also possible, e.g., with one more neural network, (\phi^0, \psi^0) = h_V(D). We unfold T steps of the iteration to form a neural network with output (\phi^T_D, \psi^T_D). Plugging the obtained (\phi^T_D, \psi^T_D) into (8), we have the parameters and architecture as (\theta^T_D(\epsilon, \psi^T_D), z_D(\xi, \phi^T_D)). In other words, we derive the concrete parameterization of q(\theta|D) and q(z|D) automatically by unfolding the optimization steps. Replacing the parameterization of q(z|D) and q(\theta|D) in L(\phi_D, \psi_D; W), we have

\max_{W} \hat{E}_D \hat{E}_{x,y} E_{\xi,\epsilon} \underbrace{\Big[ -\ell\big(f(x; \theta^T_D(\epsilon, \psi), z^T_D(\xi, \phi)), y\big) - \log \frac{q_{\phi^T_D}(z|D)}{p(z; \alpha)} - \log \frac{q_{\psi^T_D}(\theta|D)}{p(\theta; \mu, \sigma)} \Big]}_{\hat{L}(x, y, \epsilon, \xi; W)}.    (11)

Algorithm 1 Bayesian meta Architecture SEarch (BASE)
1: Initialize meta-network parameters W_0.
2: for e = 1, ..., E do
3:   Sample C tasks {D_c}_{c=1}^C ~ D.
4:   for D_c in D do
5:     Sample {x_t, y_t}_{t=1}^T ~ D_c.
6:     Let (\phi^0_c, \psi^0_c) = W_{e-1}.
7:     for t = 1, ..., T do
8:       Sample \xi ~ G(0, 1).
9:       Update [\phi^t_c, \psi^t_c] = [\phi^{t-1}_c, \psi^{t-1}_c] - \eta \nabla_{\phi^{t-1}_c, \psi^{t-1}_c} \hat{L}(f(x_t; \phi^{t-1}_c, \psi^{t-1}_c, \xi), y_t).
10:  Update W_e = W_{e-1} + \lambda (1/C) \sum_{c=1}^C ([\phi^T_c, \psi^T_c] - W_{e-1}).

If we apply stochastic gradient ascent in the optimization (11) for updating W, the instantiated algorithm from optimization embedding shares some similarities with the second-order MAML [8] and DARTS [20] algorithms. Both of these algorithms unroll the stochastic gradient step. However, with the introduction of the Bayesian view, we can exploit the rich literature on approximating distributions over discrete variables. More importantly, we can easily share both the architecture and weights across many tasks. Finally, this establishes the connection between the heuristic MAML algorithm and Bayesian inference, which may be of independent interest.

Practical algorithm: In the method derivation, for simplicity of exposition, we assumed there is only one cell shared across all the layers in every task, which may be overly restrictive. Following [43], we design two types of cells, named a normal cell with \phi_n and a reduction cell with \phi_r, which appear alternately in the neural network. Please refer to Appendix B.3 for an illustration.

In practice, the multistep unrolling of the gradient computation is expensive and memory-inefficient. We can exploit a finite-difference approximation of the gradient. This is similar to the iMAML [28] and REPTILE [26] approximations of MAML. Moreover, we can further accelerate learning by exploiting parallel computation. Specifically, for each task, we start from a local copy of the current W and apply stochastic gradient ascent based on the task-specific samples. Then, the shared W can be updated by summarizing the task-specific parameters and architectures. The pseudo-code for the concrete algorithm for Bayesian meta Architecture SEarch (BASE) can be found in Algorithm 1.

With a meta-network trained with BASE over a series of tasks, for a new task D we can adapt an architecture by sampling from the posterior distribution of z_D through (7), with [\phi^T_D, \psi^T_D] calculated by (10) for the new task; this is then used to define the full-sized network. Illustrations of the network motifs used for the search network and the full networks can be found in Appendix A.2. More details about the architecture space can be found in Appendix A.

5 Experiments and Results

5.1 Experiment Setups

Downsampled Multi-task Datasets To help the meta-network generalize to inputs with different sizes, we create three new multi-task datasets: Imagenet32 (Imagenet downsampled to 32x32), Imagenet64 (Imagenet downsampled to 64x64), and Imagenet224 (Imagenet downsampled to 224x224). Imagenet224 uses the most commonly used inference size for the full Imagenet dataset in the mobile setting. Our tasks are defined by sampling 10 random classes from one of the resized Imagenet datasets, similar to the Mini-Imagenet dataset [33] in few-shot learning.
This allows us to sample tasks from a space of C(1000, 10) x 3 ≈ 7.9 x 10^23 tasks.

Table 1: Classification Accuracies on CIFAR10

Architecture | Top-1 Test Error | Parameters (M) | Search Time (GPU Days)
NASNet-A + cutout [43] | 2.65 | 3.3 | 1800
AmoebaNet-A + cutout [30] | 3.34 ± 0.06 | 3.2 | 3150
AmoebaNet-B + cutout [30] | 2.55 ± 0.05 | 2.8 | 3150
Hierarchical Evo [19] | 3.75 ± 0.12 | 15.7 | 300
PNAS [18] | 3.41 ± 0.09 | 3.2 | 225
DARTS (1st order bi-level) + cutout [20] | 3.00 ± 0.14 | 3.3 | 1.5
DARTS (2nd order bi-level) + cutout [20] | 2.76 ± 0.09 | 3.3 | 4
SNAS (single-level) + cutout [37] | 2.85 ± 0.02 | 2.8 | 1.5
SMASH [2] | 4.03 | 16 | 1.5
ENAS + cutout [27] | 2.89 | 4.6 | 0.5
BASE (Multi-task Prior) | 3.18 | 3.2 | 8 Meta
BASE (Imagenet32 Tuned) | 3.00 | 3.3 | 0.04 Adap / 8 Meta
BASE (CIFAR10 Tuned) | 2.83 | 3.1 | 0.05 Adap / 8 Meta

Featurization Layers To conduct architecture search on these multi-sized, multi-task datasets, the meta-network uses separate initial featurization layers (heads) for each image size. The use of non-shared weights for the initial image featurization both allows the meta-network to learn a better prior and enables the use of different striding in the heads to compensate for the significant difference in image sizes. The Imagenet224 head strides the output to 1/8th of the original input size, while the 32x32 and 64x64 heads both stride to 1/2 of the original input size.

5.2 Search Performance

We validated our meta-network by transferring the results of architectures optimized for CIFAR10, SVHN, and Imagenet224 to full-sized networks. Details of how we trained the full networks can be found in Appendix A.1.
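The striding arithmetic of the featurization heads described above can be sanity-checked in a few lines (our own sketch; the actual head architectures are given in the paper's appendix):

```python
def head_output_size(image_size):
    """Spatial size after the per-resolution featurization head.

    The 224x224 head strides its input down by 8x, while the 32x32 and
    64x64 heads stride down by 2x, so all three resolutions reach
    comparable feature-map sizes before entering the shared cells.
    """
    if image_size == 224:
        return image_size // 8   # 224 -> 28
    if image_size in (32, 64):
        return image_size // 2   # 32 -> 16, 64 -> 32
    raise ValueError("unsupported input size for this sketch")

sizes = {s: head_output_size(s) for s in (32, 64, 224)}
```

Note that this brings Imagenet224 (28x28) and Imagenet64 (32x32) to nearly the same feature resolution, which matters for the posterior-similarity analysis in Section 6.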
To derive the full-sized Imagenet architectures, we select a high-probability architecture from the posterior distribution of architectures given random 10-class Imagenet224 datasets by averaging the sampled architecture distributions over 8 random datasets. To derive the CIFAR10 and SVHN architectures, we adapted the network on the unseen datasets and selected the architecture with the highest probability of being chosen. The meta-network was trained for 130 epochs. At each epoch, we sampled and trained on a total of 24 tasks, sampling 8 10-class discrimination tasks each from Imagenet32, Imagenet64, and Imagenet224. All experiments were conducted with Nvidia 1080 Ti GPUs.

Performance on CIFAR10 Dataset The results of our Meta Architecture Search on CIFAR10 can be found in Table 1. We compared a few variants of our method. BASE (Multi-task Prior) is the architecture derived from training on the multi-task Imagenet datasets only, without further fine-tuning. This model did not have access to any information on the CIFAR10 dataset and is used as a baseline comparison.

BASE (Imagenet32 Tuned) is the network derived from the multi-task prior fine-tuned on Imagenet32. We chose Imagenet32 since it has the same image dimension as CIFAR10. It does slightly better than BASE (Multi-task Prior) on CIFAR10. We compare these networks to BASE (CIFAR10 Tuned), which is the network derived from the meta-network prior fine-tuned on CIFAR10. Not surprisingly, this network performs the best, as it has access to both the multi-task prior and the target dataset. One thing to note is that for BASE (Imagenet32 Tuned) and BASE (CIFAR10 Tuned), we only fine-tuned the meta-networks for 0.04 GPU days and 0.05 GPU days respectively. The adaptation time required is significantly less than that required for the initial training of the multi-task prior, as well as the search time required by the rest of the baseline NAS algorithms.
With respect to the number of parameters, our models are comparable in size to the baseline models. Using adaptation from our meta-network prior, we can find high-performing models while using significantly less compute.

Table 2: Classification Accuracies on SVHN

Architecture | Top-1 Test Error | Parameters (M) | Search Time (GPU Days)
WideResnet [38] | 1.30 ± 0.03 | 11.7 | -
MetaQNN [1] | 2.24 | 9.8 | 100
DARTS (CIFAR10 Searched) | 2.09 | 3.3 | 4
BASE (Multi-task Prior) | 2.13 | 3.2 | 8 Meta
BASE (Imagenet32 Tuned) | 2.07 | 3.3 | 0.04 Adap / 8 Meta
BASE (SVHN Tuned) | 2.01 | 3.2 | 0.04 Adap / 8 Meta

Table 3: Classification Accuracies on Imagenet

Architecture | Top-1 Err | Top-5 Err | Params (M) | MACs (M) | Search Time (GPU Days)
NASNet-A [43] | 26.0 | 8.4 | 5.3 | 564 | 1800
NASNet-B [43] | 27.2 | 8.7 | 5.3 | 488 | 1800
NASNet-C [43] | 27.5 | 9.0 | 4.9 | 558 | 1800
AmoebaNet-A [30] | 25.5 | 8.0 | 5.1 | 555 | 3150
AmoebaNet-B [30] | 26.0 | 8.5 | 5.3 | 555 | 3150
AmoebaNet-C [30] | 24.3 | 7.6 | 6.4 | 570 | 3150
PNAS [18] | 25.8 | 8.1 | 5.1 | 588 | 225
DARTS [20] | 26.9 | 9.0 | 4.9 | 595 | 4
SNAS [37] | 27.3 | 9.2 | 4.3 | 522 | 1.5
BASE (Multi-task Prior) | 26.1 | 8.5 | 4.6 | 544 | 8 Meta
BASE (Imagenet Tuned) | 25.7 | 8.1 | 4.9 | 559 | 0.04 Adap / 8 Meta

Performance on SVHN Dataset The results of our Meta Architecture Search on SVHN are shown in Table 2. We used the same multi-task prior previously trained on the multi-scale Imagenet datasets and quickly adapted the meta-network to SVHN in less than an hour. We also trained the CIFAR10-specialized architecture found by DARTS [20].
The adapted network architecture achieves the best performance in our experiments and has performance comparable to other work at this model size. This also validates the importance of task-specific specialization, since it significantly improved the network performance over both our multi-task prior and Imagenet32 tuned baselines.

Performance on ImageNet Dataset The results of our Meta Architecture Search on Imagenet can be found in Table 3. We compare BASE (Multi-task Prior) with BASE (Imagenet Tuned), which is the multi-task prior tuned on 224x224 Imagenet. The performance of our Imagenet Tuned model actually exceeds that of the existing differentiable NAS approaches DARTS [20] and SNAS [37] on both top-1 error and top-5 error. In terms of the number of parameters and Multiply-Accumulates (MACs), our found models are comparable to state-of-the-art networks. Considering running time, while the multi-task pretraining took 8 GPU days, we only needed 0.04 GPU days to adapt to full-sized Imagenet. In Figure 2, we compare our models with other NAS approaches with respect to top-1 error and search time.
For fairness, we include the time required to learn the architecture prior, and we still achieve significant accuracy gains for our computational cost.

Figure 2: Top-1 Imagenet Accuracy vs Search Time in GPU Days of different NAS methods on Imagenet.

(a) PCA of weights  (b) PCA of architecture

Figure 3: Visualization of the PCA for (\theta, z), i.e., weight and architecture, sampled from the posterior distribution of the meta-network.

6 Empirical Analysis

In this section, we analyze the task-dependent parameter distributions derived from meta-network adaptation and demonstrate the abilities of the proposed method for fast adaptation as well as architecture search for few-shot learning.

6.1 Visualization of Posterior Distributions

Figure 3 shows the PCA visualization of the posterior distributions of the convolutional weights \psi^t_D and architecture parameters \phi^t_D. The CIFAR10-optimized distributions were derived by quickly adapting the pretrained meta-network to the CIFAR10 dataset, while the other distributions were adapted for tasks sampled from the corresponding multi-task datasets.
We see that the distribution of weights is more concentrated for CIFAR10 than for the other datasets, likely because it corresponds to a single task instead of a task distribution. It also seems that the Imagenet224 and Imagenet64 posterior weight and architecture distributions are close to each other. This is likely because they are the closest to each other in feature resolution after being strided down by the feature heads, to 28 x 28 and 32 x 32 respectively. Considering the visualization of the architecture parameter distributions, it is notable that while the closeness of clusters seems to indicate a similarity between Imagenet32 and CIFAR10, CIFAR10 still has a clearly distinct cluster. This supports the observation that even though the meta-network prior was never trained on CIFAR10, an optimized architecture posterior distribution can be quickly derived for CIFAR10.

6.2 Fast Adaptations

In this section, we explore the direct transfer of both architecture and convolutional weights from the meta-network by comparing the test accuracy obtained on CIFAR10 with meta-networks adapted for six epochs. The results are shown in Figure 4. We compare against the baseline accuracy of the DARTS [20] super-network trained from scratch on CIFAR10. Our meta-network, adapted normally from the multi-task prior, achieves an accuracy of around 0.75 after only one epoch.
We also experimented with freezing the architecture parameters, which greatly degraded the performance. This shows the importance of co-optimizing both the weight and architecture parameters.

Figure 4: Graph showing the fast adaptation properties of pretrained meta-networks when adapting to CIFAR10 in a few epochs.

Table 4: Comparison of few-shot learning baselines against MAML [8] using the architectures found by our BASE algorithm on few-shot learning on the Mini-Imagenet dataset.

Architecture         Few-shot Algorithm   5-shot Test Accuracy   Params (M)
MAML [8]             MAML                 63.11 ± 0.92%          0.1
REPTILE [26]         REPTILE              65.99 ± 0.58%          0.1
DARTS Architecture   MAML                 63.95 ± 1.1%           1.6
BASE (Softmax)       MAML                 65.4 ± 0.74%           1.2
BASE (Gumbel)        MAML                 66.2 ± 0.7%            1.2

6.3 Few-Shot Learning

In order to show the generalizability of our algorithm, we used it to conduct an architecture search over the few-shot learning problem.
Since few-shot learning targets adaptation from very few samples, we can avoid using the finite-difference approximation and directly use the optimization-embedding technique in these experiments. These experiments were run on a commonly used benchmark for few-shot learning, the Mini-Imagenet dataset as proposed in [33], specifically on the 5-way classification 5-shot learning problem. The full-sized network is trained on the few-shot learning problem using second-order MAML [8]. Search and full training were run twice for each method. A variation of our algorithm was also run using a simple softmax approximation of the Categorical distribution as proposed in [20] to test the effect of the Gumbel-Softmax architecture parameterization. The full results are shown in Table 4: our searched architectures achieved significantly better average testing accuracies than our baselines on five-shot learning on the Mini-Imagenet dataset in the same architecture space. The CIFAR10-optimized DARTS architecture also achieved results that were significantly better than those of the original MAML baseline [8], showing some transferability between CIFAR10 and meta-learning on Mini-Imagenet. That architecture, however, also had considerably more parameters than our found architectures and trained significantly slower. The Gumbel-Softmax meta-network parameterization also found better architectures than the simple softmax parameterization.

7 Conclusion

In this work, we present a Bayesian Meta-Architecture Search (BASE) algorithm that can learn the optimal neural network architectures for an entire task distribution simultaneously. The algorithm, derived from a novel Bayesian view of architecture search, utilizes the optimization-embedding technique [6] to automatically incorporate the task information into the parameterization of the posterior.
We demonstrate the algorithm by training a meta-network simultaneously on a distribution of 2.634 × 10^23 tasks derived from Imagenet, and we achieve state-of-the-art results given our search time on CIFAR10, SVHN, and Imagenet with quickly adapted task-specific architectures. This work paves the way for future extensions of Meta Architecture Search, such as direct fast adaptation to derive both optimal task-specific architectures and optimal weights, and demonstrates the great efficiency gains possible by conducting architecture search over task distributions.

Acknowledgments

We would like to thank the anonymous reviewers for their comments and suggestions. Part of this work was done while Bo Dai and Albert Shaw were at Georgia Tech. Le Song was supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, SaTC-1704701, and CAREER IIS-1350983.

References

[1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2017.

[2] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.

[3] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI Conference on Artificial Intelligence, 2018.

[4] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. In Proceedings of the 35th International Conference on Machine Learning, pages 678–687, 2018.

[5] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.
[6] Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao, and Le Song. Coupled variational bayes via optimization embedding. In NeurIPS, pages 9713–9723, 2018.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In The IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

[9] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S. M. Ali Eslami. Conditional neural processes. In Proceedings of the 35th International Conference on Machine Learning, pages 1704–1713, 2018.

[10] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural processes. CoRR, abs/1807.01622, 2018.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[12] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. CoRR, abs/1905.02244, 2019.

[13] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.

[14] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.

[15] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh.
Attentive neural processes. In International Conference on Learning Representations, 2019.

[16] Yann LeCun and Yoshua Bengio. The handbook of brain theory and neural networks, chapter Convolutional Networks for Images, Speech, and Time Series, pages 255–258. MIT Press, Cambridge, MA, USA, 1998.

[17] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.

[18] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In European Conference on Computer Vision, September 2018.

[19] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. In International Conference on Learning Representations, 2018.

[20] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.

[21] Weiyang Liu, Zhen Liu, James Rehg, and Le Song. Neural similarity learning. In NeurIPS, 2019.

[22] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M. Rehg, and Le Song. Decoupled networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[23] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In NIPS, 2017.

[24] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. In International Conference on Learning Representations, 2017.

[25] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh.
The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.

[26] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.

[27] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

[28] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. arXiv e-prints, arXiv:1909.04630, September 2019.

[29] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.

[30] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, 2019.

[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[32] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[33] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.

[34] Martin Wistuba and Tejaswini Pedapati. Inductive transfer for neural architecture optimization. CoRR, abs/1903.03536, 2019.

[35] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search.
In The IEEE Conference on Computer Vision and Pattern Recognition, June 2019.

[36] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.

[37] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic neural architecture search. In International Conference on Learning Representations, 2019.

[38] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.

[39] Arnold Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4), November 1988.

[40] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2019.

[41] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.

[42] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.

[43] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition.
In The IEEE Conference on Computer Vision and Pattern Recognition, 2018.