{"title": "Efficient Forward Architecture Search", "book": "Advances in Neural Information Processing Systems", "page_first": 10122, "page_last": 10131, "abstract": "We propose a neural architecture search (NAS) algorithm, Petridish, to iteratively \nadd shortcut connections to existing network layers. The added shortcut connections \neffectively perform gradient boosting on the augmented layers.\nThe proposed algorithm is motivated by the feature selection algorithm \nforward stage-wise linear regression, since we consider NAS as a generalization \nof feature selection for regression, where NAS selects shortcuts among layers \ninstead of selecting features. \nIn order to reduce the number of trials of possible connection combinations, we train\njointly all possible connections at each stage of growth while leveraging\nfeature selection techniques to choose a subset of them. \nWe experimentally show this process to be an efficient forward \narchitecture search algorithm that can find competitive \nmodels using few GPU days in both the search space of repeatable \nnetwork modules (cell-search) and the space of general networks (macro-search). \nPetridish is particularly well-suited for warm-starting from existing models \ncrucial for lifelong-learning scenarios.", "full_text": "Ef\ufb01cient Forward Architecture Search\n\nHanzhang Hu,1 John Langford,2 Rich Caruana,2\n\nSaurajit Mukherjee,2 Eric Horvitz,2 Debadeepta Dey2\n\n1Carnegie Mellon University, 2Microsoft Research\n\nhanzhang@cs.cmu.edu, {jcl,rcaruana,saurajim,horvitz,dedey}@microsoft.com\n\nAbstract\n\nWe propose a neural architecture search (NAS) algorithm, Petridish, to iteratively\nadd shortcut connections to existing network layers. The added shortcut connec-\ntions effectively perform gradient boosting on the augmented layers. 
The proposed\nalgorithm is motivated by the feature selection algorithm forward stage-wise linear\nregression, since we consider NAS as a generalization of feature selection for\nregression, where NAS selects shortcuts among layers instead of selecting features.\nIn order to reduce the number of trials of possible connection combinations, we\ntrain jointly all possible connections at each stage of growth while leveraging\nfeature selection techniques to choose a subset of them. We experimentally show\nthis process to be an ef\ufb01cient forward architecture search algorithm that can \ufb01nd\ncompetitive models using few GPU days in both the search space of repeatable\nnetwork modules (cell-search) and the space of general networks (macro-search).\nPetridish is particularly well-suited for warm-starting from existing models crucial\nfor lifelong-learning scenarios.\n\n1\n\nIntroduction\n\nNeural networks have achieved state-of-the-art performance on large scale supervised learning tasks\nacross domains like computer vision, natural language processing, audio and speech-related tasks\nusing architectures manually designed by skilled practitioners, often via trial and error. Neural\narchitecture search (NAS) (Zoph & Le, 2017; Zoph et al., 2018; Real et al., 2018; Pham et al.,\n2018; Liu et al., 2019; Han Cai, 2019) algorithms attempt to automatically \ufb01nd good architectures\ngiven data-sets. In this work, we view NAS as a bi-level combinatorial optimization problem (Liu\net al., 2019), where we seek both the optimal architecture and its associated optimal parameters.\nInterestingly, this formulation generalizes the well-studied problem of feature selection for linear\nregression (Tibshirani, 1994; Efron et al., 2004; Das & Kempe, 2011). 
This observation permits us to\ndraw and leverage parallels between NAS algorithms and feature selection algorithms.\nA plethora of NAS works have leveraged sampling methods including reinforcement learning (Zoph &\nLe, 2017; Zoph et al., 2018; Liu et al., 2018), evolutionary algorithms (Real et al., 2017, 2018; Elsken\net al., 2018a), and Bayesian optimization (Kandasamy et al., 2018) to enumerate architectures that\nare then independently trained. Interestingly, these approaches are uncommon for feature selection.\nIndeed, sample-based NAS often takes hundreds of GPU-days to find good architectures, and can be\nbarely better than random search (Elsken et al., 2018b).\nAnother common NAS approach is analogous to sparse optimization (Tibshirani, 1994) or backward\nelimination for feature selection, e.g., (Liu et al., 2019; Pham et al., 2018; Han Cai, 2019; Xie et al.,\n2019). The approach starts with a super-graph that is the union of all possible architectures, and learns\nto down-weight the unnecessary edges gradually via gradient descent or reinforcement learning. Such\napproaches drastically cut down the search time of NAS. However, these methods require domain\nknowledge to create the initial super-graphs, and typically need to reboot the search if the domain\nknowledge is updated.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this work, we instead take an approach that is analogous to a forward feature selection algorithm\nand iteratively grow existing networks. Although forward methods such as Orthogonal Matching\nPursuit (Pati et al., 1993) and Least-angle Regression (Efron et al., 2004) are common in feature\nselection and can often result in performance guarantees, there are only a few similar NAS approaches (Liu\net al., 2017). Such forward algorithms are attractive when one wants to expand existing models as\nextra computation becomes viable. 
Forward methods can utilize such extra computational resources\nwithout rebooting the training as in backward methods and sparse optimization. Furthermore, forward\nmethods naturally result in a spectrum of models of various complexities to suitably choose from.\nCrucially, unlike backward approaches, forward methods do not need to specify a \ufb01nite search space\nup front making them more general and easier to use when warm-starting from prior available models\nand for lifelong learning.\nSpeci\ufb01cally, inspired by early neural network growth work (Fahlman & Lebiere, 1990), we propose\na method (Petridish) of growing networks from small to large, where we opportunistically add\nshortcut connections in a fashion that is analogous to applying gradient boosting (Friedman, 2002)\nto the intermediate feature layers. To select from the possible shortcut connections, we also exploit\nsparsity-inducing regularization (Tibshirani, 1994) during the training of the eligible shortcuts.\nWe experiment with Petridish for both the cell-search (Zoph et al., 2018), where we seek a shortcut\nconnection pattern and repeat it using a manually designed skeleton network to form an architecture,\nand the less common but more general macro-search, where shortcut connections can be freely\nformed. Experimental results show Petridish macro-search to be better than previous macro-search\nNAS approaches on vision tasks, and brings macro-search performance up to par with cell-search\ncounter to beliefs from other NAS works (Zoph & Le, 2017; Pham et al., 2018) that macro-search\nis inferior to cell-search. Petridish cell-search also \ufb01nds models that are more cost-ef\ufb01cient than\nthose from (Liu et al., 2019), while using similar training computation. 
This indicates that forward\nselection methods for NAS are effective and useful.\nWe summarize our contribution as follows.\n\n\u2022 We propose a forward neural architecture search algorithm that is analogous to gradient\nboosting on intermediate layers, allowing models to grow in complexity during training and\nwarm-start from existing architectures and weights.\n\u2022 On CIFAR10 and PTB, the proposed method finds competitive models in few GPU-days\nwith both cell-search and macro-search.\n\u2022 The ablation studies of the hyper-parameters highlight the importance of starting conditions\nto algorithm performance.\n\n2 Background and Related Work\n\nSample-based. Zoph & Le (2017) leveraged policy gradients (Williams, 1992) to learn to sample\nnetworks, and established the now-common framework of sampling networks and evaluating them\nafter a few epochs of training. The policy-gradient sampler has been replaced with evolutionary\nalgorithms (Schaffer et al., 1990; Real et al., 2018; Elsken et al., 2018a), Bayesian optimization (Kandasamy\net al., 2018), and Monte Carlo tree search (Negrinho & Gordon, 2017). Multiple search\nspaces (Elsken et al., 2018b) are also studied under this framework. Zoph et al. (2018) introduce\nthe idea of cell-search, where we learn a connection pattern, called a cell, and stack cells to form\nnetworks. Liu et al. (2018) further learn how to stack cells with hierarchical cells. Cai et al. (2018)\nevolve networks starting from competitive existing models via net-morphism (Wei et al., 2016).\nWeight-sharing. The sample-based framework of (Zoph & Le, 2017) spends most of its training\ncomputation in evaluating the sampled networks independently, and can cost hundreds of GPU-days\nduring search. This framework is revolutionized by Pham et al. (2018), who share the weights of\nthe possible networks and train all possible networks jointly. Liu et al. 
(2019) formalize NAS with\nweight-sharing as a bi-level optimization (Colson et al., 2007), where the architecture and the model\nparameters are jointly learned. Xie et al. (2019) leverage policy gradient to update the architecture in\norder to update the whole bi-level optimization with gradient descent.\nForward NAS. Forward NAS originates from one of the earliest NAS works by Fahlman & Lebiere\n(1990) termed \u201cCascade-Correlation\u201d, in which neurons are added to networks iteratively. Each new\nneuron takes input from existing neurons, and maximizes the correlation between its activation and\nthe residual in network prediction. Then the new neuron is frozen and is used to improve the final\nprediction. This idea of iterative growth has been recently studied in (Cortes et al., 2017; Huang et al.,\n2018) via gradient boosting (Friedman, 2002). While Petridish is similar to gradient boosting, it is\napplicable to any layer, instead of only the final layer. Furthermore, Petridish initializes weak learners\nwithout freezing or affecting the current model, unlike in gradient boosting, which freezes previous\nmodels. Elsken et al. (2018a); Cai et al. (2018) have explored forward search via iterative model\nchanges called net-morphisms (Wei et al., 2016), and control the iterative change via reinforcement\nlearning and evolutionary algorithms. Liu et al. (2017) select models by predicting their performances\nbased on those of some sampled models.\n\n3 Preliminaries\nGradient Boosting: Let H be a space of weak learners. Each step of gradient boosting seeks a weak\nlearner h\u2217 \u2208 H that is the most similar to the negative functional gradient, \u2212\u2207\u0177L, of the loss L with\nrespect to the prediction \u0177. 
The similarity is measured by their Frobenius inner product.\n\n$$h^* = \\arg\\min_{h \\in \\mathcal{H}} \\langle \\nabla_{\\hat{y}} \\mathcal{L}, h \\rangle. \\tag{1}$$\n\nThen we update the predictor to be \u0177 \u2190 \u0177 + \u03b7h\u2217, where \u03b7 is the learning rate.\nNAS Optimization: Given data sample x with label y from a distribution D, a neural network\narchitecture \u03b1 with parameters w produces a prediction \u0177(x; \u03b1, w) and suffers a prediction loss\n\u2113(\u0177(x; \u03b1, w), y). The expected loss is then\n\n$$\\mathcal{L}(\\alpha, w) = \\mathbb{E}_{x,y \\sim \\mathcal{D}}[\\ell(\\hat{y}(x; \\alpha, w), y)] \\approx \\frac{1}{|\\mathcal{D}_{train}|} \\sum_{(x,y) \\in \\mathcal{D}_{train}} \\ell(\\hat{y}(x; \\alpha, w), y). \\tag{2}$$\n\nIn practice, the loss L is estimated on the empirical training data Dtrain. Following (Liu et al., 2019),\nthe problem of neural architecture search can be formulated as a bi-level optimization (Colson et al.,\n2007) of the network architecture \u03b1 and the model parameters w under the loss L as follows.\n\n$$\\min_{\\alpha} \\mathcal{L}(\\alpha, w(\\alpha)), \\quad \\text{s.t.} \\quad w(\\alpha) = \\arg\\min_{w} \\mathcal{L}(\\alpha, w) \\quad \\text{and} \\quad c(\\alpha) \\le K, \\tag{3}$$\n\nwhere c(\u03b1) is the test-time computational cost of the architecture, and K is some constant. Formally,\nlet x1, x2, ... be intermediate layers in a feed-forward network. Then a shortcut from layer xi to xj\n(j > i) using operation op is represented by (xi, xj, op), where the operation op is a unary operation\nsuch as 3x3 conv. We merge multiple shortcuts to the same xj with summation, unless specified\notherwise using ablation studies. Hence, the architecture \u03b1 is a collection of shortcut connections.\nFeature Selection Analogy: We note that Eq. 3 generalizes feature selection for linear prediction\n(Tibshirani, 1994; Pati et al., 1993; Das & Kempe, 2011), where \u03b1 selects feature subsets, w is\nthe prediction coefficient, and the loss is expected square error. 
Hence, we can understand a NAS\nalgorithm by considering its application to feature selection, as discussed in the introduction and\nrelated works. This work draws a parallel to the feature selection algorithm Forward-Stagewise\nLinear Regression (FSLR) (Efron et al., 2004) with small step sizes, which is an approximation to\nLeast-angle Regression (Efron et al., 2004). In FSLR, we iteratively update with small step sizes\nthe weight of the feature that correlates the most with the prediction residual. Viewing candidate\nfeatures as weak learners, the residuals become the gradient of the square loss with respect to the\nlinear prediction. Hence, FSLR is also understood as gradient boosting (Friedman, 2002).\nCell-search vs. Macro-search: In this work, we consider both cell-search, a popular NAS search\nspace where a network is a prede\ufb01ned sequence of some learned connection patterns (Zoph et al.,\n2018; Real et al., 2018; Pham et al., 2018; Liu et al., 2019), called cells, and macro-search, a more\ngeneral NAS where no repeatable patterns are required. For a fair comparison between the two, we\nset both macro and cell searches to start with the same seed model, which consists of a sequence\nof simple cells. Both searches also choose from the same set of shortcuts. The only difference is\ncell-search cells changing uniformly and macro-search cells changing independently.\n\n4 Methodology: Ef\ufb01cient Forward Architecture Search (Petridish)\n\nFollowing gradient boosting strictly would limit the model growth to be only at the prediction layer\nof the network, \u02c6y. 
Instead, this work seeks to jointly expand the expressiveness of the network at\nintermediate layers, x1, x2, .... Specifically, we consider adding a weak learner hk \u2208 Hk at each xk,\nwhere Hk (specified next) is the space of weak learners for layer xk. hk helps reduce the gradient of\nthe loss L with respect to xk, \u2207xkL = Ex,y\u223cD[\u2207xk \u2113(\u0177(x; \u03b1, w), y)], i.e., we choose hk with\n\n$$h_k = \\arg\\min_{h \\in \\mathcal{H}_k} \\langle h, \\nabla_{x_k} \\mathcal{L}(\\alpha, w) \\rangle = \\arg\\min_{h \\in \\mathcal{H}_k} \\langle h, \\mathbb{E}_{x,y \\sim \\mathcal{D}}[\\nabla_{x_k} \\ell(\\hat{y}(x; \\alpha, w), y)] \\rangle. \\tag{4}$$\n\nAlgorithm 1 Petridish.initialize_candidates\n1: Input: (1) Lx, the list of layers in the current model (macro-search) or current cell (cell-search) in topological order; (2) is_out(x), whether we are to expand at x; (3) \u03bb, hyper-parameter for selecting shortcut connections.\n2: Output: (1) L\u2032x, the modified Lx with weak learners xc; (2) Lc, the list of xc created; (3) \u2113extra, the additional training loss.\n3: L\u2032x \u2190 Lx; Lc \u2190 empty list; \u2113extra \u2190 0\n4: for xk in enumerate(Lx) do\n5:   if not is_out(xk) then continue end if\n6:   Compute the eligible inputs In(xk), and index them as z1, ..., zI.\n7:   $x_c \\leftarrow \\sum_{i=1}^{I} \\sum_{j=1}^{J} \\alpha^k_{i,j} \\, op_j(sg(z_i))$.\n8:   Insert the layer xc right before xk in L\u2032x.\n9:   $\\ell_{extra} \\leftarrow \\ell_{extra} + \\lambda \\sum_{i=1}^{I} \\sum_{j=1}^{J} |\\alpha^k_{i,j}|$.\n10:   Append xc to Lc.\n11:   Modify xk in L\u2032x so that xk \u2190 xk + sf(xc).\n12: end for\n\nThen we expand the model by adding hk to xk. In other words, we replace each xk with xk + \u03b7hk in\nthe original network, where \u03b7 is a scalar variable initialized to 0. The modified model then can be\ntrained with backpropagation. 
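
As a small numerical illustration of the expansion step described above (replacing xk with xk + η hk, with η starting at 0), the following sketch uses a toy squared-error loss on plain Python lists. The loss, the vectors, and all function names are assumptions made for illustration only and are not taken from the paper. It checks two things: the expanded layer is identical to the original one when η = 0, and the derivative of the loss with respect to η at η = 0 equals the inner product between hk and the gradient of the loss with respect to xk, which is the quantity minimized in Eq. 4.

```python
# Illustrative sketch only: toy quadratic loss and hand-picked vectors, not the paper's setup.

def loss(x, y):
    # Squared-error loss 0.5 * ||x - y||^2 on plain Python lists.
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

def grad_loss(x, y):
    # Gradient of the loss above with respect to x.
    return [a - b for a, b in zip(x, y)]

def expand(x_k, h_k, eta):
    # Replace x_k by x_k + eta * h_k.
    return [a + eta * b for a, b in zip(x_k, h_k)]

def inner(u, v):
    # Plain inner product of two equally long lists.
    return sum(a * b for a, b in zip(u, v))

x_k = [1.0, -2.0, 0.5]     # hypothetical layer activation
target = [0.0, 1.0, 0.5]   # hypothetical target
h_k = [-1.0, 3.0, 0.0]     # hypothetical weak learner output (here the negative gradient)

# With eta = 0 the expanded layer equals the original one.
unchanged = expand(x_k, h_k, 0.0) == x_k

# Derivative of the loss with respect to eta at eta = 0, by central differences.
eps = 1e-6
d_eta = (loss(expand(x_k, h_k, eps), target)
         - loss(expand(x_k, h_k, -eps), target)) / (2 * eps)

# Inner product between the weak learner and the gradient with respect to x_k.
alignment = inner(h_k, grad_loss(x_k, target))
```

In this toy case the weak learner is chosen as the negative gradient, so the inner product is negative and a small positive η lowers the loss, while the prediction at η = 0 is left untouched.
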
We next specify the weak learner space, and how they are learned.\nWeak Learner Space: The weak learner space Hk for a layer xk is formally\n\n$$\\mathcal{H}_k = \\{ op_{merge}(op_1(z_1), \\dots, op_{I_{max}}(z_{I_{max}})) : z_1, \\dots, z_{I_{max}} \\in In(x_k),\\ op_1, \\dots, op_{I_{max}} \\in Op \\}, \\tag{5}$$\n\nwhere Op is the set of eligible unary operations, In(xk) is the set of allowed input layers, Imax is the\nnumber of shortcuts to merge together in a weak learner, and opmerge is a merge operation to combine\nthe shortcuts into a tensor of the same shape as xk. On vision tasks, following (Liu et al., 2019), we\nset Op to contain separable conv 3x3 and 5x5, dilated conv 3x3 and 5x5, max and average pooling\n3x3, and identity. The separable conv is applied twice as per (Liu et al., 2019). Following (Zoph\net al., 2018; Liu et al., 2019), we set In(xk) to be layers that are topologically earlier than xk, and are\neither in the same cell as xk or the outputs of the previous two cells. We choose Imax = 3 through an\nablation study from amongst 2, 3 or 4 in Sec. B.5, and we set opmerge to be a concatenation followed\nby a projection with conv 1x1 through an ablation study in Sec. B.3 against weighted sum.\nWeak Learning with Weight Sharing: In gradient boosting, one typically optimizes Eq. 4 by\nminimizing \u27e8h, \u2207xkL\u27e9 for multiple h, and selecting the best h afterwards. However, as there are\n$\\binom{IJ}{I_{max}}$ possible weak learners in the space of Eq. 5, where I = |In(xk)| and J = |Op|, it may be\ncostly to enumerate all possibilities. Inspired by the parameter sharing works in NAS (Pham et al.,\n2018; Liu et al., 2019) and model compression in neural networks (Huang et al., 2017a), we propose\nto jointly train the union of all weak learners, while learning to select the shortcut connections. This\nprocess also only costs a constant factor more than training one weak learner. 
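
To make the size argument above concrete, the following sketch compares the number of distinct weak learners, IJ choose Imax, with the IJ shortcuts that the joint weak learner trains once. J = 7 is counted from the operation set listed in the text and Imax = 3 is the value chosen there; the number of eligible inputs I = 4 is a purely hypothetical value, and the function names are illustrative.

```python
import math

J = 7       # |Op|: sep conv 3x3/5x5, dilated conv 3x3/5x5, max pool, avg pool, identity
I_MAX = 3   # shortcuts merged per weak learner, as chosen in the text
I = 4       # hypothetical number of eligible inputs |In(x_k)|; not from the paper

def num_weak_learners(num_inputs, num_ops, i_max):
    # Ways to pick i_max distinct (input, op) shortcuts out of num_inputs * num_ops.
    return math.comb(num_inputs * num_ops, i_max)

def num_joint_shortcuts(num_inputs, num_ops):
    # The joint weak learner trains every (input, op) shortcut exactly once.
    return num_inputs * num_ops

separate = num_weak_learners(I, J, I_MAX)
joint = num_joint_shortcuts(I, J)
```

Under these assumed values, enumerating weak learners separately would mean 3276 candidates, whereas joint training covers the same shortcuts with 28 weighted operations.
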
Specifically, we fit the\nfollowing joint weak learner xc for a layer xk in order to minimize \u27e8xc, \u2207xkL\u27e9:\n\n$$x_c = \\sum_{i=1}^{I} \\sum_{j=1}^{J} \\alpha_{i,j} \\, op_j(z_i), \\tag{6}$$\n\nwhere opj \u2208 Op and zi \u2208 In(xk) enumerate all possible operations and inputs, and \u03b1i,j \u2208 R\nis the weight of the shortcut opj(zi). Each opj(zi) is normalized with batch-normalization to\nhave approximately zero mean and unit variance in expectation, so \u03b1i,j reflects the importance\nof the operation. To select the most important operations, we minimize \u27e8xc, \u2207xkL\u27e9 with an L1-regularization\non the weight vector $\\vec{\\alpha}$, i.e.,\n\n$$\\lambda \\|\\vec{\\alpha}\\|_1 = \\lambda \\sum_{i=1}^{I} \\sum_{j=1}^{J} |\\alpha_{i,j}|, \\tag{7}$$\n\nwhere \u03bb is a hyper-parameter which we choose in the appendix B.6. L1-regularization, known as\nLasso (Tibshirani, 1994), induces sparsity in the parameter and is widely used for feature selection.\n\nFigure 1: (a) Blue boxes are in the parent model, and red boxes are for weak learning. Operations are\njoined together in a weighted sum to form xc, in order to match \u2212\u2207xkL. (b) The top Imax operations\nare selected and merged with a concatenation, followed by a projection.\n\nAlgorithm 2 Petridish.finalize_candidates\n1: Inputs: (1) L\u2032x, the list of layers of the model in topological order; (2) Lc, list of selection modules in L\u2032x; (3) $\\alpha^k_{i,j}$, the learned operation weights of xc for layer xk.\n2: Output: A modified L\u2032x, which is to be trained with backpropagation for a few epochs.\n3: for xc in Lc do\n4:   Let $A = \\{\\alpha^k_{i,j} : i = 1, \\dots, I,\\ j = 1, \\dots, J\\}$ be the weights of operations in xc.\n5:   Sort {|\u03b1| : \u03b1 \u2208 A}, and let op1, ..., opImax be operations with the largest associated |\u03b1|.\n6:   Replace xc with proj(concat(op1, ..., opImax)) in L\u2032x. proj is to the same shape as xk.\n7: end for\n8: Remove all sg(\u00b7). Replace each sf(x) with \u03b7x, where \u03b7 is a scalar variable initialized to 0.\n\nWeak Learning Implementation: A na\u00efve implementation of joint weak learning needs to compute\n\u2207xkL and freeze the existing model during weak learner training. Here we provide a modification\nto avoid these two costly requirements. Algorithm 1 describes the proposed implementation and\nFig. 1a illustrates the weak learning computation graph. We leverage a custom operation called\nstop-gradient, sg, which has the property that for any x, sg(x) = x and \u2207xsg(x) = 0. Similarly,\nwe define the complementary operation stop-forward, sf(x) = x \u2212 sg(x), i.e., sf(x) = 0 and\n\u2207xsf(x) = Id, the identity function. Specifically, on line 7, we apply sg to inputs of weak learners,\nso that $x_c = \\sum_{i=1}^{I} \\sum_{j=1}^{J} \\alpha_{i,j} \\, op_j(sg(z_i))$ does not affect the gradient of the existing model. Next,\non line 11, we replace the layer xk with xk + sf(xc), so that the prediction of the model is unaffected\nby weak learning. Finally, the gradient of the loss with respect to any weak learner parameter \u03b8 is:\n\n$$\\nabla_{\\theta} \\mathcal{L} = \\nabla_{x_k + sf(x_c)} \\mathcal{L} \\, \\nabla_{x_c} sf(x_c) \\, \\nabla_{\\theta} x_c = \\nabla_{x_k} \\mathcal{L} \\, \\nabla_{\\theta} x_c = \\nabla_{\\theta} \\langle \\nabla_{x_k} \\mathcal{L}, x_c \\rangle. \\tag{8}$$\n\nThis means that sf and sg not only prevent the weak learning from affecting the training of the existing\nmodel, but also enable us to minimize \u27e8\u2207xkL, xc\u27e9 via backpropagation on the whole network. Thus,\nwe no longer need to explicitly compute \u2207xkL nor freeze the existing model weights during weak\nlearning. Furthermore, since weak learners of different layers do not interact during weak learning,\nwe grow the network at all xk that are ends of cells at the same time.\nFinalize Weak Learners: In Algorithm 2 and Fig. 1b, we finalize the weak learners. We select\nin each xc the top Imax shortcuts according to the absolute value of \u03b1i,j, and merge them with a\nconcatenation followed by a projection to the shape of xk. We note that the weighted sum during weak\nlearning is a special case of concatenation-projection, and we use an ablation study in appendix B.3\nto validate this replacement. We also note that most NAS works (Zoph et al., 2018; Real et al.,\n2018; Pham et al., 2018; Liu et al., 2019; Xie et al., 2019; Han Cai, 2019) have similar set-ups of\nconcatenating intermediate layers in cells and projecting the results. We train the finalized models for\na few epochs, warm-starting from the parameters in weak learning.\nRemarks: A key design concept of Petridish is amortization, where we require the computational\ncosts of weak learning and model training to be a constant factor of each other. We further design\nPetridish to do both at the same time. Following these principles, it only costs a constant factor of\nadditional computation to augment models with Petridish while training the model concurrently.\nWe also note that since Petridish only grows models, noise in weak learning and model training\ncan result in sub-optimal short-cut selections. To mitigate this potential problem and to reduce the\nsearch variance, we utilize multiple parallel workers of Petridish, each of which can warm-start from\nintermediate models of each other. 
We defer this implementation detail to the appendix.\n\n5 Experiments\n\nWe report the search results on CIFAR-10 (Krizhevsky, 2009) and the transfer result on\nImageNet (Russakovsky et al., 2015). Ablation studies for choosing the hyper-parameters are deferred to\nappendix B, which also demonstrates the importance of blocking the influence of weak learners on\nthe existing models during weak learning via sf and sg. We also search on Penn Treebank (Marcus\net al., 1993), and show that it is not an interesting data-set for evaluating NAS algorithms.\n\n5.1 Search Results on CIFAR10\n\nSet-up: Following (Zoph et al., 2018; Liu et al., 2019), we search on shallow and slim networks,\nwhich have N = 3 normal cells in each of the three feature map resolutions, one transition cell\nbetween each pair of adjacent resolutions, and F = 16 initial filter size. Then we scale up the found\nmodel to have N = 6 and F = 32 for a final training from scratch. During search, we use the last\n5000 training images as a validation set. The starting seed model is a modified ResNet (He et al.,\n2016), where the output of a cell is the sum of the input and the result of applying two 3x3 separable\nconv to the input. This is one of the simplest seeds in the search space popularized by (Zoph et al.,\n2018; Pham et al., 2018; Liu et al., 2019). The seed model is trained for 200 epochs, with a batch\nsize of 32 and a learning rate that decays from 0.025 to 0 in cosine decay (Loshchilov & Hutter,\n2017). We apply drop-path (Larsson et al., 2017) with probability 0.6 and the standard CIFAR-10\ncut-out (DeVries & Taylor, 2017). Weak learner selection and finalization are trained for 80 epochs\neach, using the same parameters. 
The final model training is from scratch for 600 epochs on all\ntraining images with the same parameters.\nSearch Results: Table 1 depicts the test-errors, model parameters, and search computation of the\nproposed methods along with many state-of-the-art methods. We mainly compare against models of\nfewer than 3.5M parameters, since these models can be easily transferred to ILSVRC (Russakovsky\net al., 2015) mobile setting via a standard procedure (Zoph et al., 2018). The final training of Petridish\nmodels is repeated five times. Petridish cell search finds a model with 2.87 \u00b1 0.13% error rate with\n2.5M parameters, in 5 GPU-days using GTX 1080. Increasing filters to F = 37, the model has\n2.75 \u00b1 0.21% error rate with 3.2M parameters. This is one of the better models among models that\nhave fewer than 3.5M parameters, and is in particular better than DARTS (Liu et al., 2019).\nPetridish macro search finds a model that achieves 2.85 \u00b1 0.12% error rate using 2.2M parameters\nin the same search computation. This is significantly better than previous macro search results, and\nshowcases that macro search can find cost-effective architectures that are previously only found\nthrough cell search. This is important, because the NAS literature has been moving away from macro\narchitecture search, as early works (Zoph et al., 2018; Pham et al., 2018; Real et al., 2018) have\nshown that cell search results tend to be superior to those from macro search. However, this result\nmay be explained by the superior initial models of cell search: the initial model of Petridish is one of\nthe simplest models that any of the listed cell search methods proposes and evaluates, and it already\nachieves 4.6% error rate using only 0.4M parameters, a result already on-par or better than any other\nmacro search result.\n\nMethod\t# params (mil.)\tSearch (GPU-Days)\tTest Error (%)\nMacro-search:\nZoph & Le (2017)\u2020\t7.1\t1680+\t4.47\nZoph & Le (2017) + more filters\u2020\t37.4\t1680+\t3.65\nReal et al. (2017)\u2020\t5.4\t2500\t5.4\nENAS macro (Pham et al., 2018)\u2020\t21.3\t0.32\t4.23\nENAS macro + more filters\u2020\t38\t0.32\t3.87\nLemonade I (Elsken et al., 2018a)\t8.9\t56\t3.37\nPetridish initial model (N = 6, F = 32)\t0.4\t\u2013\t4.6\nPetridish initial model (N = 12, F = 64)\t3.1\t\u2013\t3.06 \u00b1 0.12\nPetridish macro\t2.2\t5\t2.83 | 2.85 \u00b1 0.12\nCell-search:\nNasNet-A (Zoph et al., 2018)\t3.3\t1800\t2.65\nAmoebaNet-B (Real et al., 2018)\t2.8\t3150\t2.55 \u00b1 0.05\nPNAS (Liu et al., 2017)\u2020\t3.2\t225\t3.41 \u00b1 0.09\nENAS cell (Pham et al., 2018)\t4.6\t0.45\t2.89\nLemonade II (Elsken et al., 2018a)\t3.98\t56\t3.50\nDARTS (Liu et al., 2019)\t3.4\t4\t2.76 \u00b1 0.09\nSNAS (Xie et al., 2019)\t2.8\t1.5\t2.85 \u00b1 0.02\nLuo et al. (2018)\u2020\t3.3\t0.4\t3.53\nPARSEC (Casale et al., 2019)\t3.7\t1\t2.81 \u00b1 0.03\nDARTS random (Liu et al., 2019)\t3.1\t\u2013\t3.29 \u00b1 0.15\n16 Random Models in Petridish space\t2.27 \u00b1 0.15\t\u2013\t3.32 \u00b1 0.15\nPetridish cell w/o feature selection\t2.50 \u00b1 0.28\t\u2013\t3.26 \u00b1 0.10\nPetridish cell\t2.5\t5\t2.61 | 2.87 \u00b1 0.13\nPetridish cell more filters (F=37)\t3.2\t5\t2.51 | 2.75 \u00b1 0.21\n\nTable 1: Comparison against state-of-the-art recognition results on CIFAR-10. Results marked with\n\u2020 are not trained with cutout. The first block represents approaches for macro-search. The second\nblock represents approaches for cell-search. We report Petridish results in the format of \u201cbest | mean\n\u00b1 standard deviation\u201d among five repetitions of the final training.\n\nMethod\t# params (mil.)\t# multi-add (mil.)\tSearch (GPU-Days)\ttop-1 Test Error (%)\nHuman-designed:\nInception-v1 (Szegedy et al., 2015)\t6.6\t1448\t\u2013\t30.2\nMobileNetV2 (Sandler et al., 2018)\t6.9\t585\t\u2013\t28.0\nNAS:\nNASNet-A (Zoph et al., 2017)\t5.3\t564\t1800\t26.0\nAmoebaNet-A (Real et al., 2018)\t5.1\t555\t3150\t25.5\nPNAS (Liu et al., 2017a)\t5.1\t588\t225\t25.8\nDARTS (Liu et al., 2019)\t4.9\t595\t4\t26.9\nSNAS (Xie et al., 2019)\t4.3\t522\t1.6\t27.3\nProxyless (Han Cai, 2019)\u2020\t7.1\t465\t8.3\t24.9\nPath-level (Cai et al., 2018)\u2020\t\u2013\t588\t8.3\t25.5\nPARSEC (Casale et al., 2019)\t5.6\t\u2013\t1\t26.0\nPetridish macro (N=6,F=44)\t4.3\t511\t5\t28.5 | 28.7 \u00b1 0.15\nPetridish cell (N=6,F=44)\t4.8\t598\t5\t26.0 | 26.3 \u00b1 0.20\n\nTable 2: The performance of the best CIFAR model transferred to ILSVRC. Variance is from multiple\ntraining of the same model from scratch. \u2020 These searches start from PyramidNet (Han et al., 2017).\n\nWe also run multiple instances of Petridish cell-search, and Table 3 reports performance of the\nbest model of each search run. We observe that the models from the separate runs have similar\nperformances. On average, the search time is 10.5 GPU-days and the model takes 2.8M parameters to\nachieve 2.88% average mean error rate. In addition, we experiment with replacing feature selection\nwith random choice and leaving all other parts intact, i.e., we keep initialization and finalization of\nweak learners with parallel workers. The average of mean error rate of the final-trained models is\n3.26 \u00b1 0.04%, close to random models, shown near the bottom of Table 1.\nTransfer to ImageNet: We focus on the mobile setting for the model transfer results on\nILSVRC (Russakovsky et al., 2015), which means we limit the number of multi-add per image\nto be within 600M. 
We transfer the final models on CIFAR-10 to ILSVRC by adding an initial 3x3\nconv of stride of 2, followed by two transition cells, to down-sample the 224x224 input images to\n28x28 with F filters. In macro-search, where no transition cells are specifically learned, we again\nuse the modified ResNet cells from the initial seed model as the replacement. After this initial\ndown-sampling, the architecture is the same as in CIFAR-10 final models. Following (Liu et al.,\n2019), we train these models for 250 epochs with batch size 128, weight decay $3 \\times 10^{-5}$, and initial\nSGD learning rate of 0.1 (decayed by a factor of 0.97 per epoch).\n\n# params (mil.)\tSearch (GPU-Days)\tTest Error (%)\n3.32\t7.5\t2.80 \u00b1 0.10\n2.5\t5\t2.87 \u00b1 0.13\n2.2\t12\t2.88 \u00b1 0.15\n2.61\t18\t2.90 \u00b1 0.12\n3.38\t10\t2.95 \u00b1 0.09\n\nTable 3: Performances of the best models from\nmultiple instances of Petridish cell-search.\n\nFigure 2: Petridish naturally finds a collection\nof models of different complexity and accuracy.\nModels outside of the lower convex hull are removed for clarity.\n\nTable 2 depicts performance of the transferred models. The Petridish cell-search model achieves\n26.3 \u00b1 0.2% error rate using 4.8M parameters and 598M multiply-adds, which is on par with state-of-the-art\nresults listed in the second block of Table 2. By utilizing feature selection techniques to\nevaluate multiple model expansions at the same time, Petridish is able to find models faster by one or\ntwo orders of magnitude than early methods that train models independently, such as NASNet (Zoph\net al., 2018), AmoebaNet (Real et al., 2018), and PNAS (Liu et al., 2017). 
In comparison to super-\ngraph methods such as DARTS (Liu et al., 2019), Petridish cell-search takes similar search time to\n\ufb01nd a more accurate model.\nThe Petridish macro-search model achieves 28.7\u00b10.15% error rate using 4.3M parameters and 511M\nmultiply-adds, a comparable result to the human-designed models in the \ufb01rst block of Table 2.\nThough this is one of the \ufb01rst successful transfers of macro-search result on CIFAR to ImageNet,\nthe relative performance gap between cell-search and macro-search widens after the transfer. This\nmay be because the default transition cell is not adequate for transfer to more complex data-sets. As\nPetridish gradually expands existing models, we naturally receive a gallery of models of various\ncomputational costs and accuracy. Figure 2 showcases the found models.\n\n5.2 Search Results on Penn Treebank\nPetridish when used to grow the cell of a recurrent neural network achieves a best test perplexity of\n55.85 and average test perplexity of 56.39 \u00b1 0.38 across 8 search runs with different random seeds\non PTB. This is competitive with the best search result of (Li & Talwalkar, 2019) of 55.5 via random\nsearch with weight sharing. In spite of good performance we don\u2019t put much signi\ufb01cance on this\nparticular language-modeling task with this data set because no NAS algorithm appears to perform\nbetter than random search (Li & Talwalkar, 2019), as detailed in appendix C.\n\n6 Conclusion\n\nWe formulate NAS as a bi-level optimization problem, which generalizes feature selection for linear\nregression. We propose an ef\ufb01cient forward selection algorithm that applies gradient boosting to\nintermediate layers, and generalizes the feature selection algorithm LARS (Efron et al., 2004). We\nalso speed weak learning via weight sharing, training the union of weak learners and selecting a subet\nfrom the union via L1-regularization. 
We demonstrate experimentally that forward model growth can\nfind accurate models in a few GPU-days via cell and macro searches.\n\nAcknowledgements\nWe thank J. Andrew Bagnell and Martial Hebert for their support and helpful discussions.\n\nReferences\nCai, Han, Yang, Jiacheng, Zhang, Weinan, Han, Song, and Yu, Yong. Path-level network transformation for efficient architecture search. In ICML, 2018.\n\nCasale, Francesco Paolo, Gordon, Jonathan, and Fusi, Nicolo. Probabilistic neural architecture search. In arxiv.org/abs/1902.05116, 2019.\n\nColson, Beno\u00eet, Marcotte, Patrice, and Savard, Gilles. An overview of bilevel optimization. In Annals of operations research, 2007.\n\nCortes, Corinna, Gonzalvo, Xavier, Kuznetsov, Vitaly, Mohri, Mehryar, and Yang, Scott. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.\n\nDas, A. and Kempe, D. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, 2011.\n\nDeVries, Terrance and Taylor, Graham. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017.\n\nEfron, Bradley, Hastie, Trevor, Johnstone, Iain, and Tibshirani, Robert. Least angle regression. Annals of Statistics, 32:407\u2013499, 2004.\n\nElsken, Thomas, Metzen, Jan Hendrik, and Hutter, Frank. Efficient multi-objective neural architecture search via Lamarckian evolution. 2018a.\n\nElsken, Thomas, Metzen, Jan Hendrik, and Hutter, Frank. Neural architecture search: A survey. CoRR, abs/1808.05377, 2018b.\n\nFahlman, Scott E. and Lebiere, Christian. The cascade-correlation learning architecture. In NIPS, 1990.\n\nFriedman, J.H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 2002.\n\nHan, Dongyoon, Kim, Jiwhan, and Kim, Junmo. Deep pyramidal residual networks. In CVPR, 2017.\n\nHan Cai, Ligeng Zhu, Song Han.
ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.\n\nHe, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.\n\nHuang, Furong, Ash, Jordan, Langford, John, and Schapire, Robert. Learning deep ResNet blocks sequentially using boosting theory. In ICML, 2018.\n\nHuang, G., Liu, S., van der Maaten, L., and Weinberger, K. CondenseNet: An efficient DenseNet using learned group convolutions. arXiv preprint arXiv:1711.09224, 2017a.\n\nHuang, Gao, Liu, Zhuang, van der Maaten, Laurens, and Weinberger, Kilian Q. Densely connected convolutional networks. In CVPR, 2017b.\n\nKandasamy, Kirthevasan, Neiswanger, Willie, Schneider, Jeff, Poczos, Barnabas, and Xing, Eric. Neural architecture search with Bayesian optimisation and optimal transport. In NIPS, 2018.\n\nKrizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, 2009.\n\nLarsson, Gustav, Maire, Michael, and Shakhnarovich, Gregory. FractalNet: Ultra-deep neural networks without residuals. In ICLR, 2017.\n\nLi, Liam and Talwalkar, Ameet. Random search and reproducibility for neural architecture search. CoRR, abs/1902.07638, 2019. URL http://arxiv.org/abs/1902.07638.\n\nLiu, Chenxi, Zoph, Barret, Shlens, Jonathon, Hua, Wei, Li, Li-Jia, Fei-Fei, Li, Yuille, Alan L., Huang, Jonathan, and Murphy, Kevin. Progressive neural architecture search. CoRR, abs/1712.00559, 2017.\n\nLiu, Hanxiao, Simonyan, Karen, Vinyals, Oriol, Fernando, Chrisantha, and Kavukcuoglu, Koray. Hierarchical representations for efficient architecture search. In ICLR, 2018.\n\nLiu, Hanxiao, Simonyan, Karen, and Yang, Yiming. DARTS: Differentiable architecture search. 2019.\n\nLoshchilov, Ilya and Hutter, Frank. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.\n\nLuo, Renqian, Tian, Fei, Qin, Tao, Chen, Enhong, and Liu, Tie-Yan.
Neural architecture optimization. In NIPS, 2018.\n\nMarcus, Mitchell, Santorini, Beatrice, and Marcinkiewicz, Mary Ann. Building a large annotated corpus of English: The Penn Treebank. 1993.\n\nNegrinho, Renato and Gordon, Geoffrey J. DeepArchitect: Automatically designing and training deep architectures. CoRR, abs/1704.08792, 2017.\n\nPati, Y., Rezaiifar, R., and Krishnaprasad, P. Orthogonal matching pursuit: recursive function approximation with application to wavelet decomposition. In Signals, Systems and Computation, 1993.\n\nPham, Hieu, Guan, Melody Y., Zoph, Barret, Le, Quoc V., and Dean, Jeff. Efficient neural architecture search via parameter sharing. In ICML, 2018.\n\nReal, Esteban, Moore, Sherry, Selle, Andrew, Saxena, Saurabh, Suematsu, Yutaka Leon, Tan, Jie, Le, Quoc, and Kurakin, Alex. Large-scale evolution of image classifiers. CoRR, abs/1703.01041, 2017.\n\nReal, Esteban, Aggarwal, Alok, Huang, Yanping, and Le, Quoc V. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.\n\nRussakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.\n\nSchaffer, J David, Caruana, Richard A, and Eshelman, Larry J. Using genetic search to exploit the emergent behavior of neural networks. Physica D: Nonlinear Phenomena, 42(1-3):244\u2013248, 1990.\n\nTibshirani, Robert. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267\u2013288, 1994.\n\nWei, Tao, Wang, Changhu, Rui, Yong, and Chen, Chang Wen. Network morphism. In ICML, 2016.\n\nWilliams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, 1992.\n\nXie, Sirui, Zheng, Hehui, Liu, Chunxiao, and Lin, Liang.
SNAS: Stochastic neural architecture search. In ICLR, 2019.\n\nYang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, and Cohen, William W. Breaking the softmax bottleneck: A high-rank RNN language model. ICML, 2018.\n\nYing, Chris, Klein, Aaron, Real, Esteban, Christiansen, Eric, Murphy, Kevin, and Hutter, Frank. NAS-Bench-101: Towards reproducible neural architecture search. In arxiv.org/abs/1902.09635, 2019.\n\nZoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. In ICLR, 2017.\n\nZoph, Barret, Vasudevan, Vijay, Shlens, Jonathon, and Le, Quoc V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.\n", "award": [], "sourceid": 5349, "authors": [{"given_name": "Hanzhang", "family_name": "Hu", "institution": "Carnegie Mellon University"}, {"given_name": "John", "family_name": "Langford", "institution": "Microsoft Research New York"}, {"given_name": "Rich", "family_name": "Caruana", "institution": "Microsoft"}, {"given_name": "Saurajit", "family_name": "Mukherjee", "institution": "microsoft"}, {"given_name": "Eric", "family_name": "Horvitz", "institution": "Microsoft Research"}, {"given_name": "Debadeepta", "family_name": "Dey", "institution": "Microsoft Research AI"}]}