{"title": "DATA: Differentiable ArchiTecture Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 876, "page_last": 886, "abstract": "Neural architecture search (NAS) is inherently subject to the gap of architectures during searching and validating. To bridge this gap, we develop Differentiable ArchiTecture Approximation (DATA) with an Ensemble Gumbel-Softmax (EGS) estimator to automatically approximate architectures during searching and validating in a differentiable manner. Technically, the EGS estimator consists of a group of Gumbel-Softmax estimators, which is capable of converting probability vectors to binary codes and passing gradients from binary codes to probability vectors. Benefiting from such modeling, in searching, architecture parameters and network weights in the NAS model can be jointly optimized with the standard back-propagation, yielding an end-to-end learning mechanism for searching deep models in a large enough search space. Conclusively, during validating, a high-performance architecture that approaches to the learned one during searching is readily built. 
Extensive experiments on a variety of popular datasets provide strong evidence that our method is capable of discovering high-performance architectures for image classification, language modeling and semantic segmentation, while guaranteeing the requisite efficiency during searching.", "full_text": "DATA: Differentiable ArchiTecture Approximation\n\nJianlong Chang1,2,3\n\nXinbang Zhang1,2\n\nShiming Xiang1,2\n\nYiwen Guo4,5\nChunhong Pan1\n\nGaofeng Meng1\n\n1NLPR, Institute of Automation, Chinese Academy of Sciences\n2School of Artificial Intelligence, University of Chinese Academy of Sciences\n3Samsung Research China - Beijing, 4Intel Labs China, 5Bytedance AI Lab\n{jianlong.chang, xinbang.zhang, gfmeng, smxiang, chpan}@nlpr.ia.ac.cn\nguoyiwen.ai@bytedance.com\n\nAbstract\n\nNeural architecture search (NAS) is inherently subject to the gap between architectures during searching and validating. To bridge this gap, we develop Differentiable ArchiTecture Approximation (DATA) with an Ensemble Gumbel-Softmax (EGS) estimator to automatically approximate architectures during searching and validating in a differentiable manner. Technically, the EGS estimator consists of a group of Gumbel-Softmax estimators, which is capable of converting probability vectors to binary codes and passing gradients from binary codes back to probability vectors. Benefiting from such modeling, during searching, the architecture parameters and network weights in the NAS model can be jointly optimized with standard back-propagation, yielding an end-to-end learning mechanism for searching deep models in a sufficiently large search space. Consequently, during validating, a high-performance architecture that closely approaches the one learned during searching is readily built. 
Extensive experiments on a variety of popular datasets provide strong evidence that our method is capable of discovering high-performance architectures for image classification, language modeling and semantic segmentation, while guaranteeing the requisite efficiency during searching.\n\n1 Introduction\n\nIn the era of deep learning, how to design proper network architectures for specific problems is a crucial but challenging task: designing architectures with state-of-the-art performance typically requires substantial effort from human experts. To eliminate such exhausting engineering, many neural architecture search (NAS) methods have been devoted to accomplishing the task automatically [14, 27, 55], i.e., evolution-based NAS [13, 18, 26, 41, 43, 44, 45, 47], reinforcement learning-based NAS [2, 3, 21, 42, 56, 59, 60], and gradient-based NAS [11, 34, 35, 46, 53], which have achieved significant successes in a multitude of fields, including image classification [4, 12, 21, 30, 31, 34, 44, 53, 60], semantic segmentation [8, 32] and object detection [9, 15, 50, 52, 60].\n\nAlthough the achievements in the literature are brilliant, these methods still struggle to effectively bridge the gap between architectures during searching and validating. That is, feasible paths in a learned architecture are dependent on each other and become deeply coupled during searching. In validating, however, the architectures inherited from searching crudely decouple these dependent paths; for instance, DARTS [34] and SNAS [53] keep only one path in validating. As a result, the effectiveness of the searched architectures is unclear, although they may surpass random ones.\nTo eliminate this limitation, Differentiable ArchiTecture Approximation (DATA) is proposed to elegantly minimize the gap between architectures during searching and validating. 
For this purpose, we develop the Ensemble Gumbel-Softmax (EGS) estimator, an ensemble of a group of Gumbel-Softmax estimators, which can sample an architecture that approaches the one found during searching as closely as possible, while maintaining the differentiability of the NAS pipeline for the requisite efficiency. That is, our EGS estimator not only decouples the relationship between different paths in learned architectures but also passes gradients seamlessly, yielding an end-to-end mechanism for searching deep models in a sufficiently large search space.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTo sum up, the main contributions of this work are:\n\n\u2022 By generalizing the Gumbel-Softmax estimator, we develop the EGS estimator, which makes structural decisions as policy gradient does in reinforcement learning, but with higher efficiency.\n\n\u2022 With the EGS estimator, the DATA model can seamlessly bridge the gap between architectures in searching and validating, and can be learned with standard back-propagation, yielding an end-to-end mechanism for searching deep models in a sufficiently large search space.\n\n\u2022 Extensive experiments strongly demonstrate that our DATA model consistently outperforms current NAS models in searching high-performance convolutional and recurrent architectures for image classification, semantic segmentation, and language modeling.\n\n2 Differentiable architecture search\n\nBefore introducing our approach, we first briefly review NAS. Without loss of generality, the architecture search space A can be naturally represented by directed acyclic graphs (DAGs), each consisting of an ordered sequence of nodes. A specific architecture corresponds to a graph \u03b1 \u2208 A, represented as N(\u03b1, w) with network weights w. 
Intrinsically, the goal in NAS is to find a graph \u03b1\u2217 \u2208 A that minimizes the validation loss Lval(N(\u03b1\u2217, w\u2217)), where the network weights w\u2217 associated with the architecture \u03b1\u2217 are obtained by minimizing the training loss, i.e.,\n\nmin_{\u03b1\u2208A} Lval(N(\u03b1, w\u2217)), s.t. w\u2217 = arg min_w Ltrain(N(\u03b1\u2217, w)), (1)\n\nThis implies that the essence of NAS is to solve a bi-level optimization problem, which is hard to optimize because of the nested relationship between architecture parameters \u03b1 and network weights w. To handle this issue, we parameterize architectures with binary codes, and jointly learn architectures and network weights in a differentiable way.\n\n2.1 Parameterizing architectures with binary codes\n\nFor simplicity, we denote all DAGs with n ordered nodes as A = {e(i,j) | 1 \u2264 i < j \u2264 n}, where e(i,j) indicates a directed edge from the i-th node to the j-th node. Corresponding to each directed edge e(i,j), there is a set of candidate primitive operations O = {o1, \u00b7\u00b7\u00b7 , oK}, such as convolution, pooling, identity, and zero. With these operations, the output at the j-th node can be formulated as\n\nx(j) = \u03a3_{i<j} o(i,j)(x(i)), (2)
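Although this excerpt gives no pseudocode, the sampling step behind an Ensemble Gumbel-Softmax estimator can be sketched concretely. The numpy sketch below is a minimal illustration under stated assumptions: each member of the ensemble draws a hard one-hot Gumbel-Softmax sample over the K candidate operations of one edge, and the draws are aggregated by element-wise OR into a binary code, so several operations can stay active on an edge. The function names and the OR aggregation are illustrative assumptions, not the paper's exact formulation, and the straight-through gradient path (soft probabilities backward, binary codes forward) is omitted since numpy has no autodiff.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """One relaxed Gumbel-Softmax draw: softmax((logits + Gumbel noise) / tau)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via -log(-log(U)), U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    e = np.exp(y - y.max())  # numerically stable softmax
    return e / e.sum()

def ensemble_gumbel_softmax(logits, n_samples=3, tau=1.0, rng=None):
    """Illustrative EGS-style sampler (assumed aggregation): OR together
    hard one-hot draws from a group of Gumbel-Softmax estimators, yielding
    a binary code over the candidate operations of one edge."""
    rng = np.random.default_rng() if rng is None else rng
    code = np.zeros_like(logits)
    for _ in range(n_samples):
        soft = gumbel_softmax_sample(logits, tau, rng)
        hard = np.zeros_like(soft)
        hard[soft.argmax()] = 1.0      # discretize to a one-hot code
        code = np.maximum(code, hard)  # element-wise OR into the ensemble
    return code

# Edge with K = 3 candidate operations and architecture parameters as logits
logits = np.log(np.array([0.7, 0.2, 0.1]))
code = ensemble_gumbel_softmax(logits, n_samples=3, tau=0.5,
                               rng=np.random.default_rng(0))
```

In a differentiable NAS pipeline the hard code would be used in the forward pass while gradients flow through the soft probabilities, which is what lets architecture parameters and network weights be optimized jointly by back-propagation.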