{"title": "XNAS: Neural Architecture Search with Expert Advice", "book": "Advances in Neural Information Processing Systems", "page_first": 1977, "page_last": 1987, "abstract": "This paper introduces a novel optimization method for differential neural architecture search, based on the theory of prediction with expert advice. Its optimization criterion is well fitted for an architecture-selection, i.e., it minimizes the regret incurred by a sub-optimal selection of operations. \nUnlike previous search relaxations, that require hard pruning of architectures, our method is designed to dynamically wipe out inferior architectures and enhance superior ones.\nIt achieves an optimal worst-case regret bound and suggests the use of multiple learning-rates, based on the amount of information carried by the backward gradients. \nExperiments show that our algorithm achieves a strong performance over several image classification datasets.\nSpecifically, it obtains an error rate of 1.6% for CIFAR-10, 23.9% for ImageNet under mobile settings, and achieves state-of-the-art results on three additional datasets.", "full_text": "XNAS: Neural Architecture Search\n\nwith Expert Advice\n\nNiv Nayman\u2217, Asaf Noy\u2217, Tal Ridnik\u2217, Itamar Friedman, Rong Jin, Lihi Zelnik-Manor\n\n{niv.nayman,asaf.noy,tal.ridnik,itamar.friedman,jinrong.jr,lihi.zelnik}\n\nMachine Intelligence Technology, Alibaba Group\n\n@alibaba-inc.com\n\nAbstract\n\nThis paper introduces a novel optimization method for differential neural architec-\nture search, based on the theory of prediction with expert advice. Its optimization\ncriterion is well \ufb01tted for an architecture-selection, i.e., it minimizes the regret in-\ncurred by a sub-optimal selection of operations. Unlike previous search relaxations,\nthat require hard pruning of architectures, our method is designed to dynamically\nwipe out inferior architectures and enhance superior ones. 
It achieves an optimal worst-case regret bound and suggests the use of multiple learning rates, based on the amount of information carried by the backward gradients. Experiments show that our algorithm achieves strong performance on several image classification datasets. Specifically, it obtains an error rate of 1.6% on CIFAR-10 and 23.9% on ImageNet under mobile settings, and achieves state-of-the-art results on three additional datasets.\n\n1 Introduction\n\nIn recent years tremendous effort has been put into the manual design of high-performance neural networks [22, 16, 40, 39]. An emerging alternative approach replaces the manual design with automated Neural Architecture Search (NAS). NAS excels at finding architectures that yield state-of-the-art results. Earlier NAS works were based on reinforcement learning [55, 56], sequential optimization [24], and evolutionary algorithms [33], and required immense computational resources, sometimes demanding years of GPU compute time to output an architecture. More recent NAS methods reduce the search time significantly, e.g. via weight-sharing [30] or a continuous relaxation of the search space [25], making the search affordable and applicable to real problems.\n\nWhile current NAS methods provide encouraging results, they still suffer from several shortcomings: a large number of hyper-parameters that are not easy to tune, hard pruning decisions performed sub-optimally all at once at the end of the search, and a weak theoretical understanding. This cultivates skepticism and criticism of the utility of NAS in general.
Some recent works even suggest that current search methods are only slightly better than random search, and further imply that some selection methods are not well principled and are essentially random [23, 35].\n\nTo provide a more principled method, we view NAS as an online selection task and rely on Prediction with Expert Advice (PEA) theory [4] for the selection. Our key contribution is the introduction of XNAS (eXperts Neural Architecture Search), an optimization method (section 2.2) that is well suited for optimizing inner architecture weights over a differentiable architecture search space (section 2.1). We propose a setup in which the experts represent inner neural operations and connections, whose dominance is specified by architecture weights.\n\n*These authors contributed equally.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOur proposed method addresses the mentioned shortcomings of current NAS methods. To mitigate hard pruning, we leverage the Exponentiated-Gradient (EG) algorithm [21], which favors sparse weight vectors to begin with, and enhance it with a wipeout mechanism for dynamically pruning inferior experts during the search. In addition, the algorithm requires fewer hyper-parameters to be tuned (section 3.2.2), and the theory behind it provides guidance for the choice of learning rates. In particular, the algorithm avoids the decay of architecture weights [12], which is shown to promote the selection of arbitrary architectures.\nAdditionally, XNAS features several desirable properties, such as achieving an optimal worst-case regret bound (section 3.1) and suggesting the assignment of different learning rates to different groups of experts.
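As a minimal illustration of this general scheme (a sketch, not the paper's exact update rule), an exponentiated-gradient step over N experts combined with a wipeout threshold can be written as follows; the parameter names `eta` and `threshold` are hypothetical:

```python
import numpy as np

def eg_step(weights, rewards, eta):
    """Exponentiated-gradient step: experts with higher reward gain mass multiplicatively."""
    new_w = weights * np.exp(eta * rewards)
    return new_w / new_w.sum()  # renormalize onto the probability simplex

def wipeout(weights, threshold):
    """Dynamically remove experts whose weight fell below the threshold."""
    weights = np.where(weights < threshold, 0.0, weights)
    return weights / weights.sum()

# Toy run: the expert with the consistently highest reward comes to dominate,
# while the others are gradually wiped out rather than hard-pruned at the end.
w = np.full(4, 0.25)
rewards = np.array([0.1, 0.2, 0.9, 0.05])
for _ in range(20):
    w = wipeout(eg_step(w, rewards, eta=1.0), threshold=1e-3)
print(w.argmax())  # expert 2, the one with the highest reward
```

Note that the multiplicative form of the update is what yields sparse weight vectors: ratios between expert weights grow exponentially with accumulated reward differences.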
Considering an appropriate reward term, the algorithm is more robust to the initialization of the architecture weights and inherently enables the recovery of 'late bloomers', i.e., experts which may become effective only after a warm-up period (section 3.2.1). The wipeout mechanism allows the recovery of any expert that still has a chance of being selected at the end of the process.\n\nWe compare XNAS to previous methods and demonstrate its properties and effectiveness in statistical and deterministic setups, as well as on 7 public datasets (section 4). It achieves state-of-the-art performance on 3 datasets and top-NAS results on the rest, with significant improvements. For example, XNAS reaches a 1.60% error rate on CIFAR-10, a more than 20% improvement over existing NAS methods.\n\n2 Proposed Approach\n\nTo lay out our approach, we first reformulate the differentiable architecture search space of DARTS [25] in a way that enables direct optimization over the architecture weights. We then propose a novel optimizer that views NAS as an online selection task and relies on PEA theory for the selection.\n\n2.1 Neural Architecture Space\n\nWe start with a brief review of the PEA settings and then describe our view of the search space as separable PEA sub-spaces. This enables us to leverage PEA theory for NAS.\n\nPEA Settings. PEA [4] refers to a sequential decision-making framework, dealing with a decision maker, i.e. a forecaster, whose goal is to predict an unknown outcome sequence {y_t}_{t=1}^T ∈ Y while having access to the advice, i.e. predictions, of a set of N experts. Denote the experts' predictions at time t by f_{t,1}, . . . , f_{t,N} ∈ D, where D is the decision space, which we assume to be a convex subset of a vector space. Denote the forecaster's predictions by {p_t}_{t=1}^T ∈ D, and let ℓ : D × Y → R be a non-negative loss function. At each time step t = 1, . . . , T, the forecaster observes f_{t,1}, . . .
, f_{t,N} and predicts p_t. The forecaster and the experts suffer losses of ℓ_t(p_t) := ℓ(p_t, y_t) and ℓ_t(f_{t,i}) := ℓ(f_{t,i}, y_t), respectively.\n\nThe Search Space Viewed as Separable PEA Sub-spaces. We view the search space suggested by DARTS [25] as multiple separable sub-spaces of experts, as illustrated in Figure 1 and described next. An architecture is built from replications of normal and reduction cells, each represented as a directed acyclic graph. Every node x^{(j)} in this super-graph represents a feature map, and each directed edge (j, k) is associated with a forecaster that predicts a feature map p^{(j,k)} := p^{(j,k)}(x^{(j)}) given the input x^{(j)}. Intermediate nodes are computed based on all of their predecessors: x^{(k)} = Σ_j