{"title": "Towards modular and programmable architecture search", "book": "Advances in Neural Information Processing Systems", "page_first": 13715, "page_last": 13725, "abstract": "Neural architecture search methods are able to find high performance deep learning architectures with minimal effort from an expert. However, current systems focus on specific use-cases (e.g. convolutional image classifiers and recurrent language models), making them unsuitable for general use-cases that an expert might wish to write. Hyperparameter optimization systems are general-purpose but lack the constructs needed for easy application to architecture search. In this work, we propose a formal language for encoding search spaces over general computational graphs. The language constructs allow us to write modular, composable, and reusable search space encodings and to reason about search space design. We use our language to encode search spaces from the architecture search literature. The language allows us to decouple the implementations of the search space and the search algorithm, allowing us to expose search spaces to search algorithms through a consistent interface. Our experiments show the ease with which we can experiment with different combinations of search spaces and search algorithms without having to implement each combination from scratch. We release an implementation of our language with this paper.", "full_text": "Towards modular and programmable architecture\n\nsearch\n\nRenato Negrinho1 \u2217\n\nDarshan Patil1\n\nNghia Le1\n\nDaniel Ferreira2\n\nMatthew R. Gormley1\n\nGeoffrey Gordon1,3\n\nCarnegie Mellon University1, TU Wien2, Microsoft Research Montreal3\n\nAbstract\n\nNeural architecture search methods are able to \ufb01nd high performance deep learning\narchitectures with minimal effort from an expert [1]. However, current systems\nfocus on speci\ufb01c use-cases (e.g. 
convolutional image classi\ufb01ers and recurrent\nlanguage models), making them unsuitable for general use-cases that an expert\nmight wish to write. Hyperparameter optimization systems [2, 3, 4] are general-\npurpose but lack the constructs needed for easy application to architecture search.\nIn this work, we propose a formal language for encoding search spaces over\ngeneral computational graphs. The language constructs allow us to write modular,\ncomposable, and reusable search space encodings and to reason about search space\ndesign. We use our language to encode search spaces from the architecture search\nliterature. The language allows us to decouple the implementations of the search\nspace and the search algorithm, allowing us to expose search spaces to search\nalgorithms through a consistent interface. Our experiments show the ease with\nwhich we can experiment with different combinations of search spaces and search\nalgorithms without having to implement each combination from scratch. We release\nan implementation of our language with this paper2.\n\n1\n\nIntroduction\n\nArchitecture search has the potential to transform machine learning work\ufb02ows. High performance\ndeep learning architectures are often manually designed through a trial-and-error process that amounts\nto trying slight variations of known high performance architectures. Recently, architecture search\ntechniques have shown tremendous potential by improving on handcrafted architectures, both by\nimproving state-of-the-art performance and by \ufb01nding better tradeoffs between computation and\nperformance. 
Unfortunately, current systems fall short of providing strong support for general\narchitecture search use-cases.\nHyperparameter optimization systems [2, 3, 4, 5] are not designed speci\ufb01cally for architecture\nsearch use-cases and therefore do not introduce constructs that allow experts to implement these\nuse-cases ef\ufb01ciently, e.g., easily writing new search spaces over architectures. Using hyperparameter\noptimization systems for an architecture search use-case requires the expert to write the encoding for\nthe search space over architectures as a conditional hyperparameter space and to write the mapping\nfrom hyperparameter values to the architecture to be evaluated. Hyperparameter optimization systems\nare completely agnostic that their hyperparameter spaces encode search spaces over architectures.\nBy contrast, architecture search systems [1] are in their infancy, being tied to speci\ufb01c use-cases\n(e.g., either reproducing results reported in a paper or concrete systems, e.g., for searching over\nScikit-Learn pipelines [6]) and therefore lack support for general architecture search work\ufb02ows. For\n\n\u2217Part of this work was done while the \ufb01rst author was a research scientist at Petuum.\n2Visit https://github.com/negrinho/deep_architect for code and documentation.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fexample, current implementations of architecture search methods rely on ad-hoc encodings for search\nspaces, providing limited extensibility and programmability for new work to build on. For example,\nimplementations of the search space and search algorithm are often intertwined, requiring substantial\ncoding effort to try new search spaces or search algorithms.\n\nContributions We describe a modular language for encoding search spaces over general computa-\ntional graphs. We aim to improve the programmability, modularity, and reusability of architecture\nsearch systems. 
We are able to use the language constructs to encode search spaces in the literature.\nFurthermore, these constructs allow the expert to create new search spaces and modify existing ones\nin structured ways. Search spaces expressed in the language are exposed to search algorithms under a\nconsistent interface, decoupling the implementations of search spaces and search algorithms. We\nshowcase these functionalities by easily comparing search spaces and search algorithms from the\narchitecture search literature. These properties will enable better architecture search research by\nmaking it easier to benchmark and reuse search algorithms and search spaces.\n\n2 Related work\n\nHyperparameter optimization Algorithms for hyperparameter optimization often focus on small\nor simple hyperparameter spaces (e.g., closed subsets of Euclidean space in low dimensions). Hy-\nperparameters might be categorical (e.g., choice of regularizer) or continuous (e.g., learning rate\nand regularization constant). Gaussian process Bayesian optimization [7] and sequential model\nbased optimization [8] are two popular approaches. Random search has been found to be com-\npetitive for hyperparameter optimization [9, 10]. Conditional hyperparameter spaces (i.e., where\nsome hyperparameters may be available only for speci\ufb01c values of other hyperparameters) have also\nbeen considered [11, 12]. Hyperparameter optimization systems (e.g. Hyperopt [2], Spearmint [3],\nSMAC [5, 8] and BOHB [4]) are general-purpose and domain-independent. Yet, they rely on the\nexpert to distill the problem into an hyperparameter space and write the mapping from hyperparameter\nvalues to implementations.\n\nArchitecture search Contributions to architecture search often come in the form of search algo-\nrithms, evaluation strategies, and search spaces. 
Researchers have considered a variety of search\nalgorithms, including reinforcement learning [13], evolutionary algorithms [14, 15], MCTS [16],\nSMBO [16, 17], and Bayesian optimization [18]. Most search spaces have been proposed for\nrecurrent or convolutional architectures [13, 14, 15] focusing on image classi\ufb01cation (CIFAR-10)\nand language modeling (PTB). Architecture search encodes much of the architecture design in the\nsearch space (e.g., the connectivity structure of the computational graph, how many operations to\nuse, their type, and values for specifying each operation chosen). However, the literature has yet\nto provide a consistent method for designing and encoding such search spaces. Systems such as\nAuto-Sklearn [19], TPOT [20], and Auto-Keras [21] have been developed for speci\ufb01c use-cases (e.g.,\nAuto-Sklearn and TPOT focus on classi\ufb01cation and regression of featurized vector data, Auto-Keras\nfocus on image classi\ufb01cation) and therefore support relatively rigid work\ufb02ows. The lack of focus\non extensibility and programmability makes these systems unsuitable as frameworks for general\narchitecture search research.\n\n3 Proposed approach: modular and programmable search spaces\n\nTo maximize the impact of architecture search research, it is fundamental to improve the programma-\nbility of architecture search tools3. We move towards this goal by designing a language to write\nsearch spaces over computational graphs. We identify the following advantages for our language\nand search spaces encoded in it:\n\n\u2022 Similarity to computational graphs: Writing a search space in our language is similar to\nwriting a \ufb01xed computational graph in an existing deep learning framework. 
The main difference\nis that nodes in the graph may be search spaces rather than \ufb01xed operations (e.g., see Figure 5).\nA search space maps to a single computational graph once all its hyperparameters have been\nassigned values (e.g., in frame d in Figure 5).\n\n3cf. the effect of highly programmable deep learning frameworks on deep learning research and practice.\n\n2\n\n\f\u2022 Modularity and reusability: The building blocks of our search spaces are modules and hyper-\nparameters. Search spaces are created through the composition of modules and their interactions.\nImplementing a new module only requires dealing with aspects local to the module. Modules\nand hyperparameters can be reused across search spaces, and new search spaces can be written\nby combining existing search spaces. Furthermore, our language supports search spaces in\ngeneral domains (e.g., deep learning architectures or Scikit-Learn [22] pipelines).\n\n\u2022 Laziness: A substitution module delays the creation of a subsearch space until all hyperpa-\nrameters of the substitution module are assigned values. Experts can use substitution modules\nto encode natural and complex conditional constructions by concerning themselves only with\nthe conditional branch that is chosen. This is simpler than the support for conditional hyper-\nparameter spaces provided by hyperparameter optimization tools, e.g., in Hyperopt [2], where\nall conditional branches need to be written down explicitly. Our language allows conditional\nconstructs to be expressed implicitly through composition of language constructs (e.g., nesting\nsubstitution modules). 
Laziness also allows us to encode search spaces that can expand infinitely, which is not possible with current hyperparameter optimization tools (see Appendix D.1).\n\n\u2022 Automatic compilation to runnable computational graphs: Once all choices in the search space are made, the single architecture corresponding to the terminal search space can be mapped to a runnable computational graph (see Algorithm 4). By contrast, for general hyperparameter optimization tools this mapping has to be written manually by the expert.\n\n4 Components of the search space specification language\n\nA search space is a graph (see Figure 5) consisting of hyperparameters (either of type independent or dependent) and modules (either of type basic or substitution). This section describes our language components and shows encodings of simple search spaces in our Python implementation. Figure 5 and the corresponding search space encoding in Figure 4 are used as running examples. Appendix A and Appendix B provide additional details and examples, e.g., the recurrent cell search space of [23].\n\nIndependent hyperparameters The value of an independent hyperparameter is chosen from its set of possible values. An independent hyperparameter is created with a set of possible values, but without a value assigned to it. Exposing search spaces to search algorithms relies mainly on iteration over and value assignment to independent hyperparameters. In our implementation, an independent hyperparameter is instantiated as, for example, D([1, 2, 4, 8]). In Figure 5, IH-1 has the set of possible values {64, 128} and is eventually assigned the value 64 (shown in frame d).\n\nDependent hyperparameters The value of a dependent hyperparameter is computed as a function of the values of the hyperparameters it depends on (see line 7 of Algorithm 1). 
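The two value-assignment semantics just described can be modeled in a few lines of plain Python. This is an illustrative sketch only; the class names `Independent` and `Dependent` are hypothetical stand-ins, not the released API:

```python
# Minimal model of the two hyperparameter types. Illustrative only:
# these class names are not the paper's implementation.

class Independent:
    def __init__(self, values):
        self.values = values      # set of possible values, as in D([1, 2, 4, 8])
        self.value = None         # unassigned until the search algorithm picks one

    def assign(self, v):
        assert self.value is None and v in self.values
        self.value = v

class Dependent:
    def __init__(self, fn, deps):
        self.fn = fn              # computes the value from the parents' values
        self.deps = deps          # dict: local name -> hyperparameter depended on
        self.value = None

    def maybe_compute(self):
        # Resolves only once every parent has a value (cf. line 7 of Algorithm 1).
        if self.value is None and all(h.value is not None for h in self.deps.values()):
            self.value = self.fn({k: h.value for k, h in self.deps.items()})
        return self.value

h_units = Independent([1, 2, 4])
h_dep = Dependent(lambda dh: 2 * dh["x"], {"x": h_units})

h_units.assign(1)        # mirrors the assignment of IH-3 in Figure 5
h_dep.maybe_compute()    # the dependent value becomes 2, via fn: 2*x
```

The two `assign`/`maybe_compute` calls mirror the frame a to frame b step of Figure 5, where assigning IH-3 triggers the computation of DH-1.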
Dependent hyperparameters are useful to encode relations between hyperparameters, e.g., in a convolutional network search space, we may want the number of filters to increase after each spatial reduction. In our implementation, a dependent hyperparameter is instantiated as, for example, h = DependentHyperparameter(lambda dh: 2*dh[\"units\"], {\"units\": h_units}). In Figure 5, in the transition from frame a to frame b, IH-3 is assigned value 1, triggering the value assignment of DH-1 according to its function fn:2*x.\n\nFigure 1: Search space over feedforward networks with dropout rate of 0.25 or 0.5, ReLU activations, and one hidden layer with 100, 200, or 300 units.\n\ndef one_layer_net():\n    a_in, a_out = dropout(D([0.25, 0.5]))\n    b_in, b_out = dense(D([100, 200, 300]))\n    c_in, c_out = relu()\n    a_out[\"out\"].connect(b_in[\"in\"])\n    b_out[\"out\"].connect(c_in[\"in\"])\n    return a_in, c_out\n\nBasic modules A basic module implements computation that depends on the values of its properties. Search spaces involving only basic modules and hyperparameters do not create new modules or hyperparameters, and therefore are fixed computational graphs (e.g., see frames c and d in Figure 5). Upon compilation, a basic module consumes the values of its inputs, performs computation, and publishes the results to its outputs (see Algorithm 4). Deep learning layers can be wrapped as basic modules, e.g., a fully connected layer can be wrapped as a single-input single-output basic module with one hyperparameter for the number of units. In the search space in Figure 1, dropout, dense, and relu are basic modules. In Figure 5, both frames c and d are search spaces with only basic modules and hyperparameters. 
In the search space of frame d, all hyperparameters have been assigned values, and\ntherefore the single architecture can be mapped to its implementation (e.g., in Tensor\ufb02ow).\n\n]) , h_repeat )\n\ndef multi_layer_net ():\n\nlambda : siso_sequential ([\n\ndense (D ([300])) ,\nsiso_or ([ relu , tanh ], h_or )\n\nh_or = D ([0 , 1])\nh_repeat = D ([1 , 2, 4])\nreturn siso_repeat (\n\nSubstitution modules Substitution modules\nencode structural transformations of the com-\nputational graph that are delayed4 until their\nhyperparameters are assigned values. Similarly\nto a basic module, a substitution module has\nhyperparameters, inputs, and outputs. Contrary\nto a basic module, a substitution module does\nnot implement computation\u2014it is substituted\nby a subsearch space (which depends on the val-\nues of its hyperparameters and may contain new\nsubstitution modules). Substitution is triggered\nonce all its hyperparameters have been assigned\nvalues. Upon substitution, the module is removed from the search space and its connections are\nrerouted to the corresponding inputs and outputs of the generated subsearch space (see Algorithm 1\nfor how substitutions are resolved). For example, in the transition from frame b to frame c of\nFigure 5, IH-2 was assigned the value 1 and Dropout-1 and IH-7 were created by the substitution\nof Optional-1. The connections of Optional-1 were rerouted to Dropout-1. If IH-2 had been\nassigned the value 0, Optional-1 would have been substituted by an identity basic module and no\nnew hyperparameters would have been created. Figure 2 shows a search space using two substitution\nmodules: siso_or chooses between relu and tanh; siso_repeat chooses how many layers to\ninclude. 
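The laziness of substitution modules can be illustrated with a small stand-alone sketch. The helper `optional` below is hypothetical (loosely modeled on `siso_optional`); the point is that neither branch is constructed before the controlling hyperparameter receives a value, mirroring how Optional-1 becomes Dropout-1 only in the transition from frame b to frame c of Figure 5:

```python
# Sketch of substitution semantics: a module is replaced by a subsearch
# space only once its hyperparameter is assigned. All names here are
# illustrative, not the paper's implementation.

def optional(sub_fn, identity_fn):
    """Returns a thunk: called with the hyperparameter's value, it builds
    either the wrapped subsearch space or an identity module."""
    def substitute(opt_value):
        # Lazy: neither branch has been constructed before this call.
        return sub_fn() if opt_value == 1 else identity_fn()
    return substitute

made = []
sub = optional(lambda: made.append("dropout") or "dropout",
               lambda: made.append("identity") or "identity")

assert made == []    # nothing built yet: construction is delayed
result = sub(1)      # assigning the value 1 triggers the substitution
```

Only the branch that is actually chosen is ever built, which is what lets the expert ignore the branches that are not taken.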
siso_sequential is used to avoid multiple calls to connect as in Figure 1.\n\nFigure 2: Search space over feedforward networks\nwith 1, 2, or 4 hidden layers and ReLU or tanh\nactivations.\n\ndef rnn_cell ( hidden_fn , output_fn ):\n\nAuxiliary functions Auxiliary functions,\nwhile not components per se, help create com-\nplex search spaces. Auxiliary functions might\ntake functions that create search spaces and put\nthem together into a larger search space. For\nexample, the search space in Figure 3 de\ufb01nes an\nauxiliary RNN cell that captures the high-level\nfunctional dependency: ht = qh(xt, ht\u22121)\nand yt = qy(ht). We can instantiate a\nspeci\ufb01c search space as rnn_cell(lambda:\nsiso_sequential([concat(2), one_layer_net()]), multi_layer_net).\n\nh_inputs , h_outputs = hidden_fn ()\ny_inputs , y_outputs = output_fn ()\nh_outputs [\" out \" ]. connect ( y_inputs [\" in \" ])\nreturn h_inputs , y_outputs\n\nFigure 3: Auxiliary function to create the search\nspace for the recurrent cell given functions that\ncreate the subsearch spaces.\n\n5 Example search space\n\nWe ground discussion textually, through code\nexamples (Figure 4), and visually (Figure 5)\nthrough an example search space. There is\na convolutional layer followed, optionally, by\ndropout with rate 0.25 or 0.5. After the optional\ndropout layer, there are two parallel chains of\nconvolutional layers. The \ufb01rst chain has length\n1, 2, or 4, and the second chain has double the\nlength of the \ufb01rst. Finally, the outputs of both\nchains are concatenated. Each convolutional\nlayer has 64 or 128 \ufb01lters (chosen separately).\nThis search space has 25008 distinct models.\nFigure 5 shows a sequence of graph transitions\nfor this search space. IH and DH denote type\nidenti\ufb01ers for independent and dependent hyper-\nparameters, respectively. 
Modules and hyperpa-\n\ndef search_space ():\n\nh_n = D ([1 , 2, 4])\nh_ndep = DependentHyperparameter (\n\nlambda dh : 2 * dh [\"x\"], {\"x\": h_n })\n\nc_inputs , c_outputs = conv2d (D ([64 , 128]))\no_inputs , o_outputs = siso_optional (\n\nlambda : dropout (D ([0.25 , 0.5])) , D ([0 , 1]))\n\nfn = lambda : conv2d (D ([64 , 128]))\nr1_inputs , r1_outputs = siso_repeat (fn , h_n )\nr2_inputs , r2_outputs = siso_repeat (fn , h_ndep )\ncc_inputs , cc_outputs = concat (2)\n\no_inputs [\" in \" ]. connect ( c_outputs [\" out \" ])\nr1_inputs [\" in \" ]. connect ( o_outputs [\" out \" ])\nr2_inputs [\" in \" ]. connect ( o_outputs [\" out \" ])\ncc_inputs [\" in0 \" ]. connect ( r1_outputs [\" out \" ])\ncc_inputs [\" in1 \" ]. connect ( r2_outputs [\" out \" ])\nreturn c_inputs , cc_outputs\n\nFigure 4: Simple search space showcasing all lan-\nguage components. See also Figure 5.\n\n4Substitution modules are inspired by delayed evaluation in programming languages.\n\n4\n\n1\n2\n3\n4\n5\n6\n7\n8\n\n1\n2\n3\n4\n5\n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n\n\fFigure 5: Search space transitions for the search space in Figure 4 (frame a) leading to a single\narchitecture (frame d). Modules and hyperparameters created since the previous frame are highlighted\nin green. Hyperparameters assigned values since the previous frame are highlighted in red.\n\nrameters types are suf\ufb01xed with a number to generate unique identi\ufb01ers. Modules are represented by\nrectangles that contain inputs, outputs, and properties. Hyperparameters are represented by ellipses\n(outside of modules) and are associated to module properties (e.g., in frame a, IH-1 is associated to\nfilters of Conv2D-1). To the right of an independent hyperparameter we show, before assignment,\nits set of possible values and, after assignment, its value (e.g., IH-1 in frame a and in frame d,\nrespectively). 
Similarly, for a dependent hyperparameter we show, before assignment, the function\nthat computes its value and, after assignment, its value (e.g., DH-1 in frame a and in frame b, respec-\ntively). Frame a shows the initial search space encoded in Figure 4. From frame a to frame b, IH-3\nis assigned a value, triggering the value assignment for DH-1 and the substitutions for Repeat-1\nand Repeat-2. From frame b to frame c, IH-2 is assigned value 1, creating Dropout-1 and IH-7\n(its dropout rate hyperparameter). Finally, from frame c to frame d, the \ufb01ve remaining independent\nhyperparameters are assigned values. The search space in frame d has a single architecture that can\nbe mapped to an implementation in a deep learning framework.\n\n6 Semantics and mechanics of the search space speci\ufb01cation language\n\nIn this section, we formally describe the semantics and mechanics of our language and show how\nthey can be used to implement search algorithms for arbitrary search spaces.\n\n6.1 Semantics\n\nSearch space components A search space G has hyperparameters H(G) and modules M (G).\nWe distinguish between independent and dependent hyperparameters as Hi(G) and Hd(G), where\n\n5\n\nConv2D-1inout\ufb01ltersOptional-1inoutoptRepeat-1inoutkRepeat-2inoutkConcat-1in0in1outIH-2[0,1]DH-1fn:2*xxIH-3[1,2,4]IH-1[64,128]aConv2D-1inout\ufb01ltersOptional-1inoutoptConv2D-2inout\ufb01ltersConv2D-3inout\ufb01ltersConv2D-4inout\ufb01ltersConcat-1in0in1outIH-2[0,1]IH-1[64,128]IH-4[64,128]IH-5[64,128]IH-6[64,128]DH-12IH-31bConv2D-1inout\ufb01ltersDropout-1inoutprobConv2D-2inout\ufb01ltersConv2D-3inout\ufb01ltersConv2D-4inout\ufb01ltersConcat-1in0in1outIH-1[64,128]IH-4[64,128]IH-5[64,128]IH-6[64,128]IH-7[0.25,0.5]DH-12IH-31IH-21cConv2D-1inout\ufb01ltersDropout-1inoutprobConv2D-2inout\ufb01ltersConv2D-3inout\ufb01ltersConv2D-4inout\ufb01ltersConcat-1in0in1outIH-164IH-4128IH-5128IH-664IH-70.5DH-12IH-31IH-21d\fH(G) = Hi(G) \u222a Hd(G) and Hd(G) \u2229 Hi(G) = \u2205, and basic 
modules and substitution modules as\nMb(G) and Ms(G), where M (G) = Mb(G) \u222a Ms(G) and Mb(G) \u2229 Ms(G) = \u2205.\n\nHyperparameters We distinguish between hyperparameters that have been assigned a value and\nthose that have not as Ha(G) and Hu(G). We have H(G) = Hu(G)\u222aHa(G) and Hu(G)\u2229Ha(G) =\n\u2205. We denote the value assigned to an hyperparameter h \u2208 Ha(G) as v(G),(h) \u2208 X(h), where h \u2208\nHa(G) and X(h) is the set of possible values for h. Independent and dependent hyperparameters are\nassigned values differently. For h \u2208 Hi(G), its value is assigned directly from X(h). For h \u2208 Hd(G),\nits value is computed by evaluating a function f(h) for the values of H(h), where H(h) is the set of\nhyperparameters that h depends on. For example, in frame a of Figure 5, for h = DH-1, H(h) =\n{IH-3}. In frame b, Ha(G) = {IH-3, DH-1} and Hu(G) = {IH-1, IH-4, IH-5, IH-6, IH-2}.\nModules A module m \u2208 M (G) has inputs I(m), outputs O(m), and hyperparameters H(m) \u2286\nH(G) along with mappings assigning names local to the module to inputs, outputs, and hyperparam-\neters, respectively, \u03c3(m),i : S(m),i \u2192 I(m), \u03c3(m),o : S(m),o \u2192 O(m), \u03c3(m),h : S(m),h \u2192 H(m),\nwhere S(m),i \u2282 \u03a3\u2217, S(m),o \u2282 \u03a3\u2217, and S(m),h \u2282 \u03a3\u2217, where \u03a3\u2217 is the set of all strings of alphabet \u03a3.\nS(m),i, S(m),o, and S(m),h are, respectively, the local names for the inputs, outputs, and hyperparam-\n(m),i : I(m) \u2192 Sm,i\neters of m. Both \u03c3(m),i and \u03c3(m),o are bijective, and therefore, the inverses \u03c3\u22121\n(m),o : O(m) \u2192 S(m),o exist and assign an input and output to its local name. Each input\nand \u03c3\u22121\nand output belongs to a single module. \u03c3(m),h might not be injective, i.e., |S(m),h| \u2265 |H(m)|. 
A\nname s \u2208 S(m),h captures the local semantics of \u03c3(m),h(s) in m \u2208 M (G) (e.g., for a convolutional\nbasic module, the number of \ufb01lters or the kernel size). Given an input i \u2208 I(M (G)), m(i) recovers\nthe module that i belongs to (analogously for outputs). For m (cid:54)= m(cid:48), we have I(m) \u2229 I(m(cid:48)) = \u2205\nand O(m) \u2229 O(m(cid:48)) = \u2205, but there might exist m, m(cid:48) \u2208 M (G) for which H(m) \u2229 H(m(cid:48)) (cid:54)= \u2205,\ni.e., two different modules might share hyperparameters but inputs and outputs belong to a single\nmodule. We use shorthands I(G) for I(M (G)) and O(G) for O(M (G)). For example, in frame a of\nFigure 5, for m = Conv2D-1 we have: I(m) = {Conv2D-1.in}, O(m) = {Conv2D-1.out}, and\nH(m) = {IH-1}; S(m),i = {in} and \u03c3(m),i(in) = Conv2D-1.in (\u03c3(m),o and \u03c3(m),h are similar);\nm(Conv2D-1.in) = Conv2D-1. Output and inputs are identi\ufb01ed by the global name of their module\nand their local name within their module joined by a dot, e.g.. Conv2D-1.in\n\nConnections between modules Connections between modules in G are represented through the set\nof directed edges E(G) \u2286 O(G)\u00d7I(G) between outputs and inputs of modules in M (G). We denote\nthe subset of edges involving inputs of a module m \u2208 M (G) as Ei(m), i.e., Ei(m) = {(o, i) \u2208\nE(G) | i \u2208 I(m)}. Similarly, for outputs, Eo(m) = {(o, i) \u2208 E(G) | o \u2208 O(m)}. We denote the\nset of edges involving inputs or outputs of m as E(m) = Ei(m) \u222a Eo(m). In frame a of Figure 5,\nFor example, in frame a of Figure 5, Ei(Optional-1) = {(Conv2D-1.out, Optional-1.in)} and\nEo(Optional-1) = {(Optional-1.out, Repeat-1.in), (Optional-1.out, Repeat-2.in)}.\nSearch spaces We denote the set of all possible search spaces as G. For a search space G \u2208\nG, we de\ufb01ne R(G) = {G(cid:48) \u2208 G | G1, . . . 
, Gm \u2208 Gm, Gk+1 = Transition(Gk, h, v), h \u2208\nHi(Gk) \u2229 Hu(Gk), v \u2208 X(h),\u2200k \u2208 [m], G1 = G, Gm = G(cid:48)}, i.e., the set of reachable search\nspaces through a sequence of value assignments to independent hyperparameters (see Algorithm 1\nfor the description of Transition). We denote the set of terminal search spaces as T \u2282 G, i.e.\nT = {G \u2208 G | Hi(G)\u2229 Hu(G) = \u2205}. We denote the set of terminal search spaces that are reachable\nfrom G \u2208 G as T (G) = R(G) \u2229 T . In Figure 5, if we let G and G(cid:48) be the search spaces in frame a\nand d, respectively, we have G(cid:48) \u2208 T (G).\n\n6.2 Mechanics\nSearch space transitions A search space G \u2208 G encodes a set of architectures (i.e., those in\nT (G)). Different architectures are obtained through different sequences of value assignments\nleading to search spaces in T (G). Graph transitions result from value assignments to independent\nhyperparameters. Algorithm 1 shows how the search space G(cid:48) = Transition(G, h, v) is computed,\nwhere h \u2208 Hi(G) \u2229 Hu(G) and v \u2208 X(h). Each transition leads to progressively smaller search\nspaces (i.e., for all G \u2208 G, G(cid:48) = Transition(G, h, v) for h \u2208 Hi(G) \u2229 Hu(G) and v \u2208 X(h),\nthen R(G(cid:48)) \u2286 R(G)). A search space G(cid:48) \u2208 T (G) is reached once there are no independent\n\n6\n\n\fAlgorithm 1: Transition\nInput: G, h \u2208 Hi(G) \u2229 Hu(G), v \u2208 X(h)\n1 v(G),(h) \u2190 v\n2 do\n3\n\n\u02dcHd(G) = {h \u2208 Hd(G) \u2229 Hu(G) | Hu(h) = \u2205}\nfor h \u2208 \u02dcHd(G) do\n\nn \u2190 |S(h)|\nLet S(h) = {s1, . . . , sn} with s1 < . . . < sn\nv(G),(h) \u2190 f(h)(vG,\u03c3(h)(s1), . . . , vG,\u03c3(h)(sn))\n\n4\n5\n6\n7\n\n8\n\n9\n10\n11\n12\n\n\u02dcMs(G) = {m \u2208 Ms(G) | Hu(m) = \u2205}\nfor m \u2208 \u02dcMs(G) do\nn \u2190 |S(m),h|\nLet S(m),h = {s1, . . . , sn} with s1 < . . . < sn\n(Gm, \u03c3i, \u03c3o) = f(m)(vG,\u03c3(m),h(s1), . . . 
, vG,\u03c3(m),h(sn))\nEi = {(o, i(cid:48)) | (o, i) \u2208 Ei(m), i(cid:48) = \u03c3i(\u03c3\n(m),i(i))}\n\u22121\nEo = {(o(cid:48), i) | (o, i) \u2208 Eo(m), o(cid:48) = \u03c3o(\u03c3\n(m),o(o))}\n\u22121\nE(G) \u2190 (E(G) \\ E(m)) \u222a (Ei \u222a Eo)\nM (G) \u2190 (M (G) \\ {m}) \u222a M (Gm)\nH(G) \u2190 H(G) \u222a H(Gm)\n18 while \u02dcHd(G) (cid:54)= \u2205 or \u02dcMs(G) (cid:54)= \u2205;\n19 return G\n\n15\n16\n17\n\n13\n\n14\n\nAlgorithm 2: OrderedHyperps\nInput: G, \u03c3o : So \u2192 Ou(G)\n1 Mq \u2190 OrderedModules(G, \u03c3o)\n2 Hq \u2190 [ ]\n3 for m \u2208 Mq do\nn = |S(m),h|\nLet S(m),h = {s1, . . . , sn}\nwith s1 < . . . < sn.\nfor j \u2208 [n] do\nh \u2190 \u03c3(m),h(sj)\nif h /\u2208 Hq then\n\n4\n5\n\nHq \u2190 Hq + [h]\n\n10 for h \u2208 Hq do\n\nif h \u2208 Hd(G) then\n\nn \u2190 |S(h)|\nLet S(h) = {s1, . . . , sn}\nwith s1 < . . . < sn\nfor j \u2208 [n] do\nh(cid:48) \u2190 \u03c3(h)(sj)\nif h(cid:48) /\u2208 Hq then\n\nHq \u2190 Hq + [h(cid:48)]\n\n6\n7\n8\n9\n\n11\n12\n13\n\n14\n15\n16\n17\n\n18 return Hq\n\nFigure 6: Left: Transition assigns a value to an independent hyperparameter and resolves assign-\nments to dependent hyperparameters (line 3 to 7) and substitutions (line 8 to 17) until none are left\n(line 18). Right: OrderedHyperps returns H(G) sorted according to a unique order. Adds the\nhyperparameters that are immediately reachable from modules (line 1 to 9), and then traverses the\ndependencies of the dependent hyperparameters to \ufb01nd additional hyperparameters (line 10 to 17).\n\nhyperparameters left to assign values to, i.e., Hi(G) \u2229 Hu(G) = \u2205. For G(cid:48) \u2208 T (G), Ms(G(cid:48)) = \u2205,\ni.e., there are only basic modules left. For search spaces G \u2208 G for which Ms(G) = \u2205, we\nhave M (G(cid:48)) = M (G) (i.e., Mb(G(cid:48)) = Mb(G)) and H(G(cid:48)) = H(G) for all G(cid:48) \u2208 R(G), i.e.,\nno new modules and hyperparameters are created as a result of graph transitions. 
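The fixed-point structure of Transition can be sketched over a toy representation of a search space (plain dictionaries standing in for the graph; all names are illustrative, not the paper's implementation): assigning a value repeatedly resolves any dependent hyperparameters and substitutions that become ready, until none are left.

```python
# Sketch of the Transition fixed-point loop of Algorithm 1 over a toy
# dictionary representation. Illustrative only.

def transition(space, h, v):
    space["values"][h] = v
    changed = True
    while changed:
        changed = False
        # Resolve dependent hyperparameters whose parents all have values.
        for name, (fn, deps) in list(space["dependent"].items()):
            if all(d in space["values"] for d in deps):
                space["values"][name] = fn(*[space["values"][d] for d in deps])
                del space["dependent"][name]
                changed = True
        # Resolve substitution modules whose hyperparameters all have values.
        for name, (fn, hyps) in list(space["substitutions"].items()):
            if all(h2 in space["values"] for h2 in hyps):
                fn(*[space["values"][h2] for h2 in hyps])  # builds the subsearch space
                del space["substitutions"][name]
                changed = True
    return space

# Toy space mirroring Figure 5: IH-3 drives DH-1 (fn: 2*x) and two repeats.
built = []
space = {
    "values": {},
    "dependent": {"DH-1": (lambda x: 2 * x, ["IH-3"])},
    "substitutions": {
        "Repeat-1": (lambda k: built.extend(["conv2d"] * k), ["IH-3"]),
        "Repeat-2": (lambda k: built.extend(["conv2d"] * k), ["DH-1"]),
    },
}
transition(space, "IH-3", 1)
```

Running this mirrors the frame a to frame b step of Figure 5: DH-1 is computed as 2, and both repeats are substituted by chains of conv2d modules (three in total).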
Algorithm 1\ncan be implemented ef\ufb01ciently by checking whether assigning a value to h \u2208 Hi(G) \u2229 Hu(G)\ntriggered substitutions of neighboring modules or value assignments to neighboring hyperparameters.\nFor example, for the search space G of frame d of Figure 5, Ms(G) = \u2205. Search spaces G,\nG(cid:48), and G(cid:48)(cid:48) for frames a, b, and c, respectively, are related as G(cid:48) = Transition(G, IH-3, 1)\nand G(cid:48)(cid:48) = Transition(G(cid:48), IH-2, 1). For the substitution resolved from frame b to frame c, for\nm = Optional-1, we have \u03c3i(in) = Dropout-1.in and \u03c3o(out) = Dropout-1.out (see line\n12 in Algorithm 1).\n\nTraversals over modules and hyperparameters Search space traversal is fundamental to provide\nthe interface to search spaces that search algorithms rely on (e.g., see Algorithm 3) and to auto-\nmatically map terminal search spaces to their runnable computational graphs (see Algorithm 4 in\nAppendix C). For G \u2208 G, this iterator is implemented by using Algorithm 2 and keeping only the\nhyperparameters in Hu(G) \u2229 Hi(G). The role of the search algorithm (e.g., see Algorithm 3) is to\nrecursively assign values to hyperparameters in Hu(G) \u2229 Hi(G) until a search space G(cid:48) \u2208 T (G) is\nreached. Uniquely ordered traversal of H(G) relies on uniquely ordered traversal of M (G). (We\ndefer discussion of the module traversal to Appendix C, see Algorithm 5.)\n\nArchitecture instantiation A search space G \u2208 T can be mapped to a domain implementation (e.g.\ncomputational graph in Tensor\ufb02ow [24] or PyTorch [25]). Only fully-speci\ufb01ed basic modules are left\nin a terminal search space G (i.e., Hu(G) = \u2205 and Ms(G) = \u2205). 
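The instantiation step can be sketched as a traversal over a toy representation of a terminal graph, where single-input basic modules are plain functions and a module fires once all of its predecessors have produced values. The representation is illustrative, not the paper's, and assumes an acyclic graph:

```python
# Sketch of mapping a terminal search space (only fully specified basic
# modules left) to a runnable function. Illustrative representation only;
# assumes a DAG of single-input modules.

def compile_graph(modules, edges, output_module):
    """modules: name -> unary function; edges: (src, dst) pairs."""
    preds = {m: [s for s, d in edges if d == m] for m in modules}
    def run(x):
        values, pending = {}, list(modules)
        while pending:
            for m in list(pending):
                if all(p in values for p in preds[m]):     # all inputs ready
                    arg = values[preds[m][0]] if preds[m] else x
                    values[m] = modules[m](arg)            # fire the module
                    pending.remove(m)
        return values[output_module]
    return run

# A terminal chain like frame d, with stand-in computations.
modules = {"conv": lambda x: x + 1, "dropout": lambda x: x * 2}
run = compile_graph(modules, [("conv", "dropout")], "dropout")
# run(3) == 8: conv maps 3 -> 4, dropout maps 4 -> 8
```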
The mapping from a terminal search\nspace to its implementation relies on graph traversal of the modules according to the topological or-\ndering of their dependencies (i.e., if m(cid:48) connects to an output of m, then m(cid:48) should be visited after m).\n\n7\n\n\fAppendix C details this graph propagation pro-\ncess (see Algorithm 4). For example, it is sim-\nple to see how the search space of frame d of\nFigure 5 can be mapped to an implementation.\n\n6.3 Supporting search algorithms\n\nSearch algorithms interface with search spaces\nthrough ordered iteration over unassigned in-\ndependent hyperparameters (implemented with\nthe help of Algorithm 2) and value assignments\nto these hyperparameters (which are resolved\nwith Algorithm 1). Algorithms are run for a\n\ufb01xed number of evaluations k \u2208 N, and return\nthe best architecture found. The iteration func-\ntionality in Algorithm 2 is independent of the\nsearch space and therefore can be used to expose\nsearch spaces to search algorithms. 
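Against this interface, a search algorithm reduces to iterating over unassigned independent hyperparameters and assigning them values. A minimal random-search sketch over a flat stand-in space follows (a conditional space would instead re-query the iterator after each Transition; all names here are illustrative):

```python
# Sketch of random search against the hyperparameter interface: assign a
# value uniformly at random to each independent hyperparameter, evaluate,
# and keep the best. Illustrative stand-ins, not the paper's API.
import random

def random_search(sample_space, evaluate, k, seed=0):
    rng = random.Random(seed)
    best_r, best_cfg = float("-inf"), None
    for _ in range(k):
        cfg = {name: rng.choice(values)       # uniform value assignment
               for name, values in sample_space.items()}
        r = evaluate(cfg)
        if r > best_r:
            best_r, best_cfg = r, cfg
    return best_cfg, best_r

# Toy "search space": two independent hyperparameters.
space = {"filters": [64, 128], "repeats": [1, 2, 4]}
cfg, r = random_search(space, lambda c: c["filters"] + c["repeats"], k=16)
```

Because the evaluation function is passed in, the same loop runs unchanged over any search space exposed through this interface, which is the decoupling exploited in the experiments.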
We use this decoupling to mix and match search spaces and search algorithms without implementing each pair from scratch (see Section 7).

7 Experiments

Algorithm 3: Random search.
Input: G, σo : So → Ou(G), k
 1  rbest ← −∞
 2  for j ∈ [k] do
 3      G′ ← G
 4      while G′ ∉ T do
 5          Hq ← OrderedHyperps(G′, σo)
 6          for h ∈ Hq do
 7              if h ∈ Hu(G′) ∩ Hi(G′) then
 8                  v ∼ Uniform(X(h))
 9                  G′ ← Transition(G′, h, v)
10      r ← Evaluate(G′)
11      if r > rbest then
12          rbest ← r
13          Gbest ← G′
14  return Gbest

Figure 7: Assigns a value uniformly at random (line 8) for each independent hyperparameter (line 7) in the search space until a terminal search space is reached (line 4).

We showcase the modularity and programmability of our language by running experiments that rely on the decoupling of search spaces and search algorithms. The interface to search spaces provided by the language makes it possible to reuse implementations of search spaces and search algorithms.

7.1 Search space experiments

Table 1: Test results for search space experiments.

We vary the search space and fix the search algorithm and the evaluation method. We refer to the search spaces we consider as Nasbench [27], Nasnet [28], Flat [15], and Genetic [26]. For the search phase, we randomly sample 128 architectures from each search space and train them for 25 epochs with Adam with a learning rate of 0.001. The test results for the fully trained architecture with the best validation accuracy are reported in Table 1. These experiments provide a simple characterization of the search spaces in terms of the number of parameters, training times, and validation performances at 25 epochs of the architectures in each search space (see Figure 8).
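Stripped of the graph machinery, the loop of Algorithm 3 can be sketched over a flat dictionary of hyperparameter domains. The sketch below assumes every hyperparameter is known upfront; in the full language, assigning a value may expose new hyperparameters through Transition, so the inner loop would repeat until a terminal search space is reached. All names here are illustrative, not the library's API:

```python
import random

def random_search(domains, evaluate, k, seed=0):
    """Algorithm 3 sketch: draw k architectures uniformly, return the best.

    domains: dict mapping hyperparameter name -> list of allowed values
    evaluate: maps a complete assignment (dict) -> validation score
    """
    rng = random.Random(seed)
    best_score, best_arch = float("-inf"), None
    for _ in range(k):
        # Mirror OrderedHyperps: visit hyperparameters in a fixed order and
        # assign each one a value drawn uniformly from its domain (line 8).
        arch = {h: rng.choice(values) for h, values in sorted(domains.items())}
        score = evaluate(arch)
        if score > best_score:
            best_score, best_arch = score, arch
    return best_arch, best_score
```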
Our language makes these characterizations easy due to better modularity (the implementations of the search space and search algorithm are decoupled) and programmability (new search spaces can be encoded and new search algorithms can be developed).

Search Space     Test Accuracy
Genetic [26]     90.07
Flat [15]        93.58
Nasbench [27]    94.59
Nasnet [28]      93.77

7.2 Search algorithm experiments

Table 2: Test results for search algorithm experiments.

We evaluate search algorithms by running them on the same search space. We use the Genetic search space [26] for these experiments as Figure 8 shows its architectures train quickly and have substantially different validation accuracies. We examined the performance of four search algorithms: random search, regularized evolution, sequential model-based optimization (SMBO), and Monte Carlo tree search (MCTS). Random search uniformly samples values for independent hyperparameters (see Algorithm 3). Regularized evolution [14] is an evolutionary algorithm that mutates the best performing member of the population and discards the oldest. We use population size 100 and sample size 25.

Search algorithm   Test Accuracy
Random             91.61 ± 0.67
MCTS [29]          91.45 ± 0.11
SMBO [16]          91.93 ± 1.03
Evolution [14]     91.32 ± 0.50

Figure 8: Results for the architectures sampled in the search space experiments. Left: Relation between number of parameters and validation accuracy at 25 epochs. Right: Relation between time to complete 25 epochs of training and validation accuracy.

Figure 9: Results for search algorithm experiments. Left: Relation between the performance of the best architecture found and the number of architectures sampled. Right: Histogram of validation accuracies for the architectures encountered by each search algorithm.
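The regularized evolution update described above (mutate the best of a random sample from the population, then discard the oldest member) can be sketched as follows. As before, the flat dictionary encoding of architectures is an illustrative simplification of our graph-based search spaces, and the function names are hypothetical:

```python
import random
from collections import deque

def regularized_evolution(domains, evaluate, cycles,
                          pop_size=100, sample_size=25, seed=0):
    """Regularized evolution sketch [14] over a flat hyperparameter space."""
    rng = random.Random(seed)
    population = deque()  # the oldest member sits on the left
    history = []
    # Initialize the population with uniformly random architectures.
    while len(population) < pop_size:
        arch = {h: rng.choice(values) for h, values in sorted(domains.items())}
        population.append((evaluate(arch), arch))
        history.append(population[-1])
    for _ in range(cycles):
        # Mutate the best member of a random sample of the population.
        sample = rng.sample(list(population), sample_size)
        _, parent = max(sample, key=lambda pair: pair[0])
        child = dict(parent)
        h = rng.choice(sorted(domains))  # mutate one hyperparameter
        child[h] = rng.choice(domains[h])
        population.append((evaluate(child), child))
        # Regularization: discard the oldest member, not the worst one.
        population.popleft()
        history.append(population[-1])
    return max(history, key=lambda pair: pair[0])
```

The defaults match the population size 100 and sample size 25 used in our runs; discarding by age rather than by fitness is what distinguishes regularized evolution from plain tournament selection.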
For SMBO [16], we use a linear surrogate function to predict the validation accuracy of an architecture from its features (hashed module sequences and hyperparameter values). For each architecture requested from this search algorithm, with probability 0.1 a randomly specified architecture is returned; otherwise, it evaluates 512 random architectures with the surrogate model and returns the one with the best predicted validation accuracy. MCTS [29, 16] uses the Upper Confidence Bound for Trees (UCT) algorithm with an exploration term of 0.33. Each run of the search algorithm samples 256 architectures that are trained for 25 epochs with Adam with a learning rate of 0.001. We ran three trials for each search algorithm. See Figure 9 and Table 2 for the results. Comparing Table 1 and Table 2, we see that the choice of search space had a much larger impact on the observed test accuracies than the choice of search algorithm. See Appendix F for more details.

8 Conclusions

We design a language to encode search spaces over architectures to improve the programmability and modularity of architecture search research and practice. Our language allows us to decouple the implementations of search spaces and search algorithms. This decoupling enables us to mix and match search spaces and search algorithms without having to write each pair from scratch.
We reimplement search spaces and search algorithms from the literature and compare them under the same conditions. We hope that decomposing architecture search experiments through the lens of our language will lead to more reusable and comparable architecture search research.

9 Acknowledgements

We thank the anonymous reviewers for helpful comments and suggestions. We thank Graham Neubig, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, Xue Liu, Carolyn Rose, Zhiting Hu, Willie Neiswanger, Christoph Dann, Kirielle Singajarah, and Zejie Ai for helpful discussions. We thank Google for generous TPU and GCP grants. This work was funded in part by NSF grant IIS 1822831.

References

[1] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. JMLR, 2019.

[2] James Bergstra, Dan Yamins, and David Cox. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. Citeseer, 2013.

[3] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.

[4] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In ICML, 2018.

[5] Marius Lindauer, Katharina Eggensperger, Matthias Feurer, Stefan Falkner, André Biedenkapp, and Frank Hutter. SMACv3: Algorithm configuration in Python.
https://github.com/automl/SMAC3, 2017.

[6] Randal Olson and Jason Moore. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Workshop on Automatic Machine Learning, 2016.

[7] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2016.

[8] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, 2011.

[9] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, 2012.

[10] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 2017.

[11] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NeurIPS, 2011.

[12] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. JMLR, 2013.

[13] Barret Zoph and Quoc Le. Neural architecture search with reinforcement learning. ICLR, 2017.

[14] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc Le. Regularized evolution for image classifier architecture search. AAAI, 2019.

[15] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. ICLR, 2018.

[16] Renato Negrinho and Geoff Gordon. DeepArchitect: Automatically designing and training deep architectures. arXiv:1704.08792, 2017.

[17] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search.
In ECCV, 2018.

[18] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with Bayesian optimisation and optimal transport. NeurIPS, 2018.

[19] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In NeurIPS, 2015.

[20] Randal Olson, Ryan Urbanowicz, Peter Andrews, Nicole Lavender, Jason Moore, et al. Automating biomedical data science through tree-based pipeline optimization. In European Conference on the Applications of Evolutionary Computation, 2016.

[21] Haifeng Jin, Qingquan Song, and Xia Hu. Efficient neural architecture search with network morphism. arXiv:1806.10282, 2018.

[22] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. JMLR, 2011.

[23] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. ICML, 2018.

[24] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.

[25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[26] Lingxi Xie and Alan Yuille. Genetic CNN. In ICCV, 2017.

[27] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards reproducible neural architecture search. In ICML, 2019.

[28] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc Le.
Learning transferable architectures for scalable image recognition. In CVPR, 2018.

[29] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012.