{"title": "DetNAS: Backbone Search for Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 6642, "page_last": 6652, "abstract": "Object detectors are usually equipped with backbone networks designed for image classification.  It might be sub-optimal because of the gap between the tasks of image classification and object detection. In this work, we present DetNAS to use Neural Architecture Search (NAS) for the design of better backbones for object detection.  It is non-trivial because detection training typically needs ImageNetpre-training while NAS systems require accuracies on the target detection task as supervisory signals. Based on the technique of one-shot supernet, which contains all possible networks in the search space, we propose a framework for backbone search on object detection. We train the supernet under the typical detector training schedule: ImageNet pre-training and detection fine-tuning. Then, the architecture search is performed on the trained supernet, using the detection task as the guidance. This framework makes NAS on backbones very efficient. In experiments, we show the effectiveness of DetNAS on various detectors, for instance, one-stage RetinaNetand the two-stage FPN. We empirically find that networks searched on object detection shows consistent superiority compared to those searched on ImageNet classification. The resulting architecture achieves superior performance than hand-crafted networks on COCO with much less FLOPs complexity.", "full_text": "DetNAS: Backbone Search for Object Detection\n\nYukang Chen1\u2020\u21e4, Tong Yang2\u2020, Xiangyu Zhang2\u2021, Gaofeng Meng1, Xinyu Xiao1, Jian Sun2\n1National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences\n\n{yukang.chen, gfmeng, xinyu.xiao}@nlpr.ia.ac.cn {yangtong, zhangxiangyu, sunjian}@megvii.com\n\n2Megvii Technology\n\nAbstract\n\nObject detectors are usually equipped with backbone networks designed for image\nclassi\ufb01cation. It might be sub-optimal because of the gap between the tasks of\nimage classi\ufb01cation and object detection. In this work, we present DetNAS to use\nNeural Architecture Search (NAS) for the design of better backbones for object\ndetection. It is non-trivial because detection training typically needs ImageNet\npre-training while NAS systems require accuracies on the target detection task as\nsupervisory signals. Based on the technique of one-shot supernet, which contains\nall possible networks in the search space, we propose a framework for backbone\nsearch on object detection. We train the supernet under the typical detector training\nschedule: ImageNet pre-training and detection \ufb01ne-tuning. Then, the architecture\nsearch is performed on the trained supernet, using the detection task as the guidance.\nThis framework makes NAS on backbones very ef\ufb01cient. In experiments, we show\nthe effectiveness of DetNAS on various detectors, for instance, one-stage RetinaNet\nand the two-stage FPN. We empirically \ufb01nd that networks searched on object detec-\ntion shows consistent superiority compared to those searched on ImageNet classi\ufb01-\ncation. The resulting architecture achieves superior performance than hand-crafted\nnetworks on COCO with much less FLOPs complexity. Code and models have\nbeen made available at: https://github.com/megvii-model/DetNAS.\n\n1\n\nIntroduction\n\nBackbones play an important role in object detectors. 
The performance of object detectors relies heavily on the features extracted by backbones. For example, a large accuracy increase can be obtained by simply replacing a ResNet-50 [8] backbone with a stronger network, e.g., ResNet-101 or ResNet-152. The importance of backbones is also demonstrated in DetNet [12], Deformable ConvNets v2 [30], ThunderNet [22] in real-time object detection, and HRNet [25] in keypoint detection.

However, many object detectors directly use networks designed for image classification as backbones. This might be sub-optimal because image classification focuses on what the main object of an image is, while object detection aims at finding where each object instance is and what it is. For instance, the recent hand-crafted network, DetNet [12], has demonstrated this point. ResNet-101 performs better than DetNet-59 [12] on ImageNet classification, but is inferior to DetNet-59 [12] on object detection. However, the handcrafting process heavily relies on expert knowledge and tedious trials.

NAS has achieved great progress in recent years. On image classification [31, 32, 23], searched networks reach or even surpass the performance of hand-crafted networks. However, NAS for backbones in object detectors is still challenging. It is infeasible to simply employ previous NAS methods for backbone search in object detectors. The typical detector training schedule requires backbone networks to be pre-trained on ImageNet. This results in two difficulties for searching backbones in object detectors: 1) hard to optimize: NAS systems require accuracies on target tasks as reward signals, and pre-training accuracy does not qualify for this requirement; 2) inefficiency: in order to obtain the precise performance, each candidate architecture during search has to be first pre-trained (e.g., on ImageNet) and then fine-tuned on the detection dataset, which is very costly. Even though training from scratch is an alternative [10], it requires more training iterations to compensate for the lack of pre-training. Moreover, training from scratch breaks down on small datasets, e.g., PASCAL VOC.

†Equal contribution. *Work done during an internship at Megvii Technology. ‡Corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[Figure 1: pipeline diagram (supernet pre-training on ImageNet, supernet fine-tuning on detection data, evolutionary search on the trained supernet); the drawing itself is not recoverable from the extracted text.]
Figure 1: The pipeline of DetNAS that searches for backbones in object detectors. 
There are three steps: supernet pre-training on ImageNet, supernet fine-tuning on the detection training set, e.g., COCO, and architecture search on the trained supernet with the evolution algorithm. The validation set is actually split from COCO trainval35k and consists of 5k images.

In this work, we present the first effort on searching for backbones in object detectors. Recently, a NAS work on object detection, NAS-FPN [5], was proposed. It searches for feature pyramid networks (FPN) [14] rather than backbones. It can therefore work with a pre-trained backbone network and search with a previous NAS algorithm [31]. Thus, the difficulty of backbone search remains unsolved. Inspired by one-shot NAS methods [7, 1, 2], we solve this issue by decoupling the weight training and the architecture search. Most previous NAS methods optimize weights and architectures in a nested manner. Only by decoupling them into two stages can the pre-training step be incorporated economically. This framework avoids the inefficiency issue caused by pre-training and makes backbone search feasible.

The framework of DetNAS consists of three steps: (1) pre-training the one-shot supernet on ImageNet, (2) fine-tuning the one-shot supernet on detection datasets, (3) architecture search on the trained supernet with an evolutionary algorithm (EA). In experiments, the main result backbone network, DetNASNet, with much fewer FLOPs, achieves 2.9% better mmAP than ResNet-50 on COCO with the FPN detector. Its enlarged version, DetNASNet (3.8), is superior to ResNet-101 by 2.0% on COCO with the FPN detector. In addition, we validate the effectiveness of DetNAS on different detectors (the two-stage FPN [14] and the one-stage RetinaNet [15]) and various datasets (COCO and VOC). The networks searched by DetNAS are consistently better than the network searched on ImageNet classification, by more than 3% on VOC and 1% on COCO, whether on FPN or RetinaNet.

Our main contributions are summarized as below:

• We present DetNAS, a framework that enables backbone search for object detection. To our best knowledge, this is the first work on this challenging task.
• We introduce a powerful search space. It helps the searched networks obtain inspiring accuracies with limited FLOPs complexity.
• Our result networks, DetNASNet and DetNASNet (3.8), outperform the hand-crafted networks by a large margin. Without the effect of search space, we show the effectiveness of DetNAS on different detectors (two-stage FPN and one-stage RetinaNet) and various datasets (COCO and VOC). The searched networks have consistently better performance and meaningful structural patterns.

2 Related Work

2.1 Object Detection

Object detection aims to locate each object instance and assign a class to it in an image. 
With the\nrapid progress of deep convolutional networks, object detectors, such as FPN [14] and RetinaNet [15],\nhave achieved great improvements in accuracy. In general, an object detector can be divided into\ntwo parts, a backbone network, and a \"head\". In the past few years, many advances in object\ndetection come from the study of \"head\", such as architecture [14], loss [15, 24], and anchor [29, 26].\nFPN [14] develops a top-down architecture with lateral connections to integrate features at all scales\nas an effective feature extractor. The focal loss [15] is proposed in RetinaNet to solve the problem\nof class imbalance, which leads to the instability in early training. MetaAnchor [29] proposes a\ndynamic anchor mechanism to boost the performance for anchor-based object detectors. However,\nfor the backbone network, almost all object detectors adopt networks for image classi\ufb01cation, which\nmight be sub-optimal. Because object detection cares about not only \"what\" object is, which image\nclassi\ufb01cation only focuses, but also \"where\" it is. Similar to our work, DetNet [12] also exploits the\narchitecture of the backbone that specially designed for object detection manually. Inspired by NAS,\nwe present DetNAS to \ufb01nd the optimal backbone automatically for object detection in this work.\n\n2.2 Neural Architecture Search\n\nNAS on image classi\ufb01cation Techniques to design networks automatically have attracted increasing\nresearch interests. NAS [31] and NASNet [32] use reinforcement learning (RL) to determine neural\narchitectures sequentially. In addition to these RL-based methods, the evolution algorithm (EA)\nalso shows its potential. AmeobaNet [23] proves that the basic evolutionary algorithm without\nany controller can also achieve comparable results and even surpass RL-base methods. To save\ncomputational resources, some works propose to use weight sharing or one-shot methods, e.g.,\nENAS [21] and DARTS [17]. Many following works, including SNAS [28], Proxyless [3] and\nFBNet [27] and others [4], also belong to one-shot NAS to some extent.\nNAS on other tasks In addition to NAS works on image classi\ufb01cation, some recent works attempt\nto develop NAS to other tasks, especially semantic segmentation. [19] proposes to search auxiliary\ncells as the segmentation decoders. Auto-DeepLab [16] applies the gradient-based method to search\nbackbones for segmentation models. To our best knowledge, no works have attempted to search\nneural architectures for backbones in object detectors. One main reason might come from the costly\nImageNet pre-training for object detectors. Training from scratch scheme [10], as a substitute, proves\nto bring no computational savings and tends to break down in small datasets. In this work, we\novercome this obstacle with a one-shot supernet and the evolutionary search algorithm.\n\n3 Detection Backbone Search\n\nOur goal is to extend NAS to search for backbones in object detectors. In general, object detector\ntraining typically requires ImageNet pre-training. Meanwhile, NAS systems require supervisory\nsignals from target tasks. For each network candidate, it needs ImageNet pre-training, which is\ncomputationally expensive. Additionally, training from scratch is an alternative method while it\nrequires more iterations to optimize and breaks down in small datasets. Inspired by the one-shot\nNAS [7, 2, 1], we decouple the one-shot supernet training and architecture optimization to overcome\nthis obstacle. 
In this section, we first clarify the motivation of our methodology.

3.1 Motivation

Without loss of generality, the architecture search space A can be denoted by a directed acyclic graph (DAG). Any path in the graph corresponds to a specific architecture, a ∈ A. For a specific architecture, its corresponding network can be represented as N(a, w) with the network weights w. NAS aims to find the optimal architecture a* ∈ A that minimizes the validation loss L_val(N(a*, w*)), where w* denotes the optimal network weights of the architecture a*, obtained by minimizing the training loss. We can formulate the NAS process as a nested optimization problem:

$$\min_{a \in \mathcal{A}} \; \mathcal{L}_{\mathrm{val}}\big(N(a, w^{*}(a))\big) \qquad (1)$$
$$\text{s.t.}\quad w^{*}(a) = \operatorname*{arg\,min}_{w} \; \mathcal{L}_{\mathrm{train}}(w, a) \qquad (2)$$

The above formulation can represent NAS on tasks that work without pre-training, e.g., image classification. But for object detection, which needs a pre-training and fine-tuning schedule, Eq. (2) needs to be reformulated as follows:

$$w^{*}(a) = \operatorname*{arg\,min}_{w \leftarrow w_{p}^{*}(a)} \; \mathcal{L}^{\mathrm{det}}_{\mathrm{train}}(w, a) \quad \text{s.t.}\quad w_{p}^{*}(a) = \operatorname*{arg\,min}_{w_{p}} \; \mathcal{L}^{\mathrm{cls}}_{\mathrm{train}}(w_{p}, a) \qquad (3)$$

where w ← w_p*(a) means that w is optimized with w_p*(a) as the initialization. The pre-trained weights w_p*(a) cannot directly serve Eq. (1), but they are necessary for w*(a). Thus, we cannot skip the ImageNet pre-training in DetNAS. However, ImageNet pre-training usually costs several GPU days just for a single network. It is unaffordable to train all candidate networks individually. In one-shot NAS methods [7, 2, 1], the search space is encoded in a supernet which consists of all candidate architectures; they share the weights in their common nodes. In this way, Eq. (1) and Eq. (2) become:

$$\min_{a \in \mathcal{A}} \; \mathcal{L}_{\mathrm{val}}\big(N(a, W^{*}_{\mathcal{A}}(a))\big) \quad \text{s.t.}\quad W^{*}_{\mathcal{A}} = \operatorname*{arg\,min}_{W} \; \mathcal{L}_{\mathrm{train}}\big(N(\mathcal{A}, W)\big) \qquad (4)$$

where all individual network weights w(a) are inherited from the one-shot supernet weights W_A. The supernet training, i.e., the optimization of W_A, is decoupled from the optimization of the architecture a. Based on this point, we go one step further to incorporate the pre-training step. This enables NAS on a more complicated task, backbone search in object detection:

$$\min_{a \in \mathcal{A}} \; \mathcal{L}^{\mathrm{det}}_{\mathrm{val}}\big(N(a, W^{*}_{\mathcal{A}}(a))\big) \qquad (5)$$
$$\text{s.t.}\quad W^{*}_{\mathcal{A}} = \operatorname*{arg\,min}_{W \leftarrow W^{*}_{p\mathcal{A}}} \; \mathcal{L}^{\mathrm{det}}_{\mathrm{train}}\big(N(\mathcal{A}, W)\big) \qquad (6)$$
$$W^{*}_{p\mathcal{A}} = \operatorname*{arg\,min}_{W_{p}} \; \mathcal{L}^{\mathrm{cls}}_{\mathrm{train}}\big(N(\mathcal{A}, W_{p})\big) \qquad (7)$$

3.2 Our NAS Pipeline

As in Fig. 1, DetNAS consists of three steps: supernet pre-training on ImageNet, supernet fine-tuning on detection datasets, and architecture search on the trained supernet.

Step 1: Supernet pre-training. ImageNet pre-training is the fundamental step of the fine-tuning schedule. Some one-shot methods [17, 27] relax the actually discrete search space into a continuous one, which makes the weights of individual networks deeply coupled. In contrast, in supernet pre-training, we adopt a path-wise [7] manner to ensure that the trained supernet can reflect the relative performance of candidate networks. Specifically, in each iteration, only one single path is sampled for feedforward and backward propagation. No gradient or weight update acts on other paths or nodes in the supernet graph.

Step 2: Supernet fine-tuning. The supernet fine-tuning is also path-wise, but equipped with the detection head, metrics, and datasets. The other necessary detail to mention is about batch normalization (BN). BN is a popular normalization method that helps optimization. Typically, the parameters of BN are fixed to the pre-training batch statistics during fine-tuning. However, freezing BN is infeasible in DetNAS, because the features to normalize are not equal on different paths. On the other hand, object detectors are trained with high-resolution images, unlike image classification. This results in small batch sizes as constrained by memory, which severely degrades the accuracy of BN. To this end, we replace the conventional BN with Synchronized Batch Normalization (SyncBN) [20] during supernet training. It computes batch statistics across multiple GPUs and increases the effective batch size. We formulate the supernet training process in Algorithm 1 in the supplementary material.
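To make the path-wise rule concrete, the following is a minimal PyTorch-style sketch of one supernet training iteration (used for both the pre-training and fine-tuning steps); it is an illustration under our reading of Algorithm 1, not the authors' released code, and the supernet(images, path) interface, the head module, and loss_fn are hypothetical names.

```python
import random

NUM_BLOCKS, NUM_CHOICES = 40, 4   # large search space: 40 searchable blocks, 4 candidate ops each

def sample_path():
    """Uniformly sample one candidate architecture, i.e., a single path in the supernet."""
    return [random.randrange(NUM_CHOICES) for _ in range(NUM_BLOCKS)]

def train_step(supernet, head, optimizer, images, targets, loss_fn):
    """One path-wise update: only the sampled path is executed, so only its weights
    (plus shared modules such as the stem and the head) receive gradients."""
    path = sample_path()
    optimizer.zero_grad()
    features = supernet(images, path)        # run only the chosen op in each block
    loss = loss_fn(head(features), targets)  # cross-entropy in step 1, detection losses in step 2
    loss.backward()
    optimizer.step()
    return loss.item()
```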
Step 3: Search on the supernet with EA. The third step is to conduct the architecture search on the trained supernet. Paths in the supernet are picked and evaluated under the direction of the evolutionary controller. For the evolutionary search, please refer to Section 3.4 for details. The necessary detail in this step is also about BN. During search, different child networks are sampled path-wise in the supernet. The issue is that the batch statistics on one path should be independent of the others. Therefore, we need to recompute the batch statistics for each single path (child network) before each evaluation. This detail is indispensable in DetNAS. We extract a small subset of the training set (500 images) to recompute the batch statistics for the single path to be evaluated. This step accumulates reasonable running mean and running variance values for BN; it involves no gradient backpropagation.

3.3 Search Space Design

The details of the search space are described in Table 1. Our search space is based on the ShuffleNetv2 block, an efficient and lightweight convolutional architecture that involves channel split and shuffle operations [18]. We design two search spaces of different sizes, the large one for the main results and the small one for ablation studies.

Table 1: Search space of DetNAS.
| Stage | Block | Large (40 blocks): n1 / c1 | Small (20 blocks): n2 / c2 |
| 0 | Conv3×3-BN-ReLU | 1 / 48 | 1 / 16 |
| 1 | ShuffleNetv2 block (search) | 8 / 96 | 4 / 64 |
| 2 | ShuffleNetv2 block (search) | 8 / 240 | 4 / 160 |
| 3 | ShuffleNetv2 block (search) | 16 / 480 | 8 / 320 |
| 4 | ShuffleNetv2 block (search) | 8 / 960 | 4 / 640 |
* The ShuffleNetv2 block has 4 choices for search: 3×3, 5×5, 7×7, and Xception 3×3.

Large (40 blocks). This search space is a large one, designed for the main results to compare with hand-crafted backbone networks. The channels and blocks in each stage are specified by c1 and n1. In each stage, the first block has stride 2 for downsampling. Except for the first stage, there are 4 stages that contain 8 + 8 + 16 + 8 = 40 blocks for search. For each block to search, there are 4 choices developed from the original ShuffleNetv2 block: changing the kernel size within {3×3, 5×5, 7×7} or replacing the right branch with an Xception block (three repeated depthwise separable 3×3 convolutions). It is easy to count that this search space includes 4^40 ≈ 1.2 × 10^24 candidate architectures. Most networks in this search space have more than 1G FLOPs. We construct this large search space for comparisons with hand-crafted large networks. For example, ResNet-50 and ResNet-101 have 3.8G and 7.6G FLOPs respectively.
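As a rough illustration of how such a search space can be encoded (not the authors' code), the sketch below maps a path, i.e., one choice index per searchable block, to per-stage block configurations of the large space; ShuffleV2-specific construction is abstracted away, and BlockChoice and decode are hypothetical names.

```python
from dataclasses import dataclass

# Per-stage configuration of the large search space (Table 1, stages 1-4):
# (number of searchable blocks, output channels)
LARGE_STAGES = [(8, 96), (8, 240), (16, 480), (8, 960)]

@dataclass
class BlockChoice:
    kernel_size: int        # 3, 5, or 7 for the plain ShuffleNetv2 branch
    xception: bool = False  # True: right branch = three stacked depthwise separable 3x3 convs

# The 4 candidate operations available to every searchable block.
CHOICES = [BlockChoice(3), BlockChoice(5), BlockChoice(7), BlockChoice(3, xception=True)]

def decode(path):
    """Map a path (one choice index per block, length 40) to per-stage block configs."""
    arch, i = [], 0
    for num_blocks, channels in LARGE_STAGES:
        stage = []
        for b in range(num_blocks):
            stride = 2 if b == 0 else 1   # the first block of each stage downsamples
            stage.append((CHOICES[path[i]], channels, stride))
            i += 1
        arch.append(stage)
    return arch
```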
Small (20 blocks). This search space is smaller and designed for ablation studies. The channels and blocks in each stage are specified by c2 and n2. The block numbers n1 are twice n2, and the channel numbers c1 are 1.5 times c2 in all searched stages. This search space includes 4^20 ≈ 1.0 × 10^12 possible architectures, which is still a large number and sufficient for ablation studies. Most networks in this search space have around 300M FLOPs. We conduct all ablation comparisons in this search space, including various object detectors (FPN or RetinaNet), different datasets (COCO or VOC), and different schemes (training from scratch or with pre-training).

3.4 Search Algorithm

The architecture search step is based on the evolution algorithm. At first, a population of networks is initialized randomly. Each individual consists of its architecture and its fitness. Any architecture that violates the constraint (e.g., a FLOPs budget) is removed and a substitute is picked. After initialization, we evaluate each individual architecture to obtain its fitness on the detection validation set. Among these evaluated networks, we select the top performers as parents to generate child networks. The next generation of networks is generated half by mutation and half by crossover, under the same constraint. By repeating this process over iterations, we can find the single path with the best validation accuracy, or fitness, f_best. We formulate this process as Algorithm 2 in the supplementary material. The hyper-parameters of this step are introduced in Section 4.

Compared to RL-based [32, 31, 21] and gradient-based NAS methods [17, 27, 3], the evolutionary search can stably meet hard constraints, e.g., FLOPs or inference speed. To optimize FLOPs or inference speed, RL-based methods need a carefully tuned reward function while gradient-based methods require a wisely designed loss function, and their outputs are still hard to guarantee to meet the required constraints. To this end, DetNAS chooses the evolutionary search algorithm.
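The loop below is a simplified sketch of this evolutionary search (Algorithm 2 in the supplementary material), using the hyper-parameters given in Section 4; the fitness and flops callables are hypothetical stand-ins (fitness would recalibrate the BN statistics of the path on ~500 training images and then return its detection mAP on the search validation set), and the mutation/crossover details are illustrative only.

```python
import random

POPULATION, NUM_PARENTS, GENERATIONS = 50, 10, 20   # Section 4: 20 iterations x 50 paths = 1000 evaluations
NUM_BLOCKS, NUM_CHOICES = 40, 4

def random_path():
    return [random.randrange(NUM_CHOICES) for _ in range(NUM_BLOCKS)]

def mutate(path, prob=0.1):
    return [random.randrange(NUM_CHOICES) if random.random() < prob else op for op in path]

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

def evolutionary_search(fitness, flops, budget):
    """fitness(path) -> mAP of the path evaluated on the trained supernet (after BN recalibration);
    flops(path) -> FLOPs of the decoded architecture; budget -> hard FLOPs constraint."""
    def sample(generate):
        while True:                       # reject candidates that violate the hard constraint
            path = generate()
            if flops(path) <= budget:
                return path

    population = [sample(random_path) for _ in range(POPULATION)]
    best = (float("-inf"), None)
    for _ in range(GENERATIONS):
        scored = sorted(((fitness(p), p) for p in population), key=lambda x: x[0], reverse=True)
        if scored[0][0] > best[0]:
            best = scored[0]
        parents = [p for _, p in scored[:NUM_PARENTS]]
        # next generation: half mutation, half crossover, all within the constraint
        population = (
            [sample(lambda: mutate(random.choice(parents))) for _ in range(POPULATION // 2)]
            + [sample(lambda: crossover(*random.sample(parents, 2)))
               for _ in range(POPULATION - POPULATION // 2)]
        )
    return best   # (best fitness, best path)
```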
Table 2: Main result comparisons.
| Backbone | FLOPs | ImageNet top-1 (%) | mAP | AP50 | AP75 | APs | APm | APl |
| ResNet-50 | 3.8G | 76.15 | 37.3 | 58.2 | 40.8 | 21.0 | 40.2 | 49.4 |
| ResNet-101 | 7.6G | 77.37 | 40.0 | 61.4 | 43.7 | 23.8 | 43.1 | 52.2 |
| ShuffleNetv2-40 | 1.3G | 77.18 | 39.2 | 60.8 | 42.4 | 23.6 | 42.3 | 52.2 |
| ShuffleNetv2-40 (3.8) | 3.8G | 78.47 | 40.8 | 62.1 | 44.8 | 23.4 | 44.2 | 54.2 |
| DetNASNet | 1.3G | 77.20 | 40.2 | 61.5 | 43.6 | 23.3 | 42.5 | 53.8 |
| DetNASNet (3.8) | 3.8G | 78.44 | 42.0 | 63.9 | 45.8 | 24.9 | 45.1 | 56.8 |
Detection columns report object detection with FPN on COCO; the top-1 column reports ImageNet classification accuracy.
* These are trained with the same "1x" settings in Section 4. The "2x" results are in the supplementary material.

4 Experimental Settings

Supernet pre-training. For the ImageNet classification dataset, we use the commonly used 1.28M training images for supernet pre-training. To train the one-shot supernet backbone on ImageNet, we use a batch size of 1024 on 8 GPUs for 300k iterations. We set the initial learning rate to 0.5 and decrease it linearly to 0. The momentum is 0.9 and the weight decay is 4 × 10^-5.

Supernet fine-tuning. We validate our method with two different detectors. The training images are resized such that the shorter side is 800 pixels. We train on 8 GPUs with a total of 16 images per minibatch, for 90k iterations on COCO and 22.5k iterations on VOC. The initial learning rate is 0.02, divided by 10 at {60k, 80k} iterations on COCO and {15k, 20k} iterations on VOC. We use a weight decay of 1 × 10^-4 and a momentum of 0.9. For the head of FPN, we replace the two fully-connected layers (2fc) with 4 convolutions and 1 fully-connected layer (4conv1fc), which is also used in all baselines in this work, e.g., ResNet-50, ResNet-101, ShuffleNetv2-20 and ShuffleNetv2-40. For RetinaNet, the training settings are similar to FPN, except that the initial learning rate is 0.01.

Search on the trained supernet. We split the detection datasets into a training set for supernet fine-tuning, a validation set for architecture search, and a test set for final evaluation. For VOC, the validation set contains 5k images randomly selected from trainval2007 + trainval2012, and the remainder is used for supernet fine-tuning. For COCO, the validation set contains 5k images randomly selected from trainval35k [13], and the remainder is used for supernet fine-tuning. For the evolutionary search, the evolution process is repeated for 20 iterations. The population size is 50 and the number of parents is 10. Thus, 1000 networks are evaluated in one search.

Final architecture evaluation. The selected architectures are retrained with the pre-training and fine-tuning schedule. The training configuration is the same as that of the supernet. For COCO, the test set is minival. For VOC, the test set is test2007. Results are mainly evaluated with the COCO standard metric (mmAP) and the VOC metric (IoU=.5). All networks listed in the paper are trained with the "1x" training setting used in Detectron [6] to keep consistency with the supernet fine-tuning.

5 Experimental Results

5.1 Main Results

Our main result architecture, DetNASNet, is searched on FPN in the large search space. The architecture of DetNASNet is depicted in the supplementary material. We search on FPN because it is a mainstream two-stage detector that has been used in other vision tasks, e.g., instance segmentation and skeleton detection [9]. Table 2 shows the main results. We list three hand-crafted networks for comparison, including ResNet-50, ResNet-101 and ShuffleNetv2-40. DetNASNet achieves 40.2% mmAP with only 1.3G FLOPs. It is superior to ResNet-50 and comparable to ResNet-101.

To eliminate the effect of the search space, we compare with the hand-crafted ShuffleNetv2-40, a baseline in this search space. It has 40 blocks and is scaled to 1.3G FLOPs, identical to DetNASNet. ShuffleNetv2-40 is inferior to ResNet-101 and DetNASNet by 0.8% and 1.0% mmAP, respectively, on COCO. 
This shows the effectiveness of DetNAS without the effect of the search space.

Table 3: Ablation studies.
| Backbone | ImageNet top-1 (%) | COCO mmAP: FPN | COCO mmAP: RetinaNet | VOC mAP: FPN | VOC mAP: RetinaNet |
| ShuffleNetv2-20 | 73.1 | 34.8 | 32.1 | 80.6 | 79.4 |
| ClsNASNet | 74.3 | 35.1 | 31.2 | 78.5 | 76.5 |
| DetNAS-scratch | 73.8 - 74.3 | 35.9 | 32.8 | 81.1 | 79.9 |
| DetNAS | 73.9 - 74.1 | 36.6 | 33.3 | 81.5 | 80.1 |
* DetNAS and DetNAS-scratch each comprise 4 specific networks, one per case (COCO/VOC, FPN/RetinaNet). Their ImageNet classification accuracies are reported as ranges from the minimum to the maximum.

Table 4: Computation cost of each step on COCO.
| Step | Supernet pre-training | Supernet fine-tuning | Search on the supernet |
| Cost | 3 × 10^5 iterations, 8 GPUs for 1.5 days | 9 × 10^4 iterations, 8 GPUs for 1.5 days | 20 × 50 models, 20 GPUs for 1 day |
* For the small search space, the GPUs are GTX 1080Ti. For the large search space, the GPUs are Tesla V100.

After that, we include the effect of the search space for consideration and scale the channels of DetNASNet by 1.8× to 3.8G FLOPs, yielding DetNASNet (3.8). Its FLOPs are identical to those of ResNet-50. It achieves 42.0% mmAP, which surpasses ResNet-50 by 4.7% and ResNet-101 by 2.0%.

5.2 Ablation Studies

The ablation studies are conducted in the small search space introduced in Table 1. This search space is much smaller than the large one, but it is efficient and sufficient for ablation studies. As in Table 3, we validate the effectiveness of DetNAS with various detectors (FPN and RetinaNet) and datasets (COCO and VOC). All models in Table 3 and Table 5 are trained with the same settings described in Section 4. Their FLOPs are all similar and under 300M.

Comparisons to the hand-crafted network.
ShuffleNetv2-20 is a baseline network constructed with 20 blocks and scaled to 300M FLOPs. It has the same number of blocks as the architectures searched in the search space. As in Table 3, DetNAS shows a consistent superiority to the hand-crafted ShuffleNetv2-20. DetNAS outperforms ShuffleNetv2-20 by more than 1% mmAP on COCO with both the FPN and RetinaNet detectors. This shows that NAS on object detection can also achieve better performance than the hand-crafted network.

Comparisons to the network searched for ImageNet classification.
Some NAS works tend to search on small proxy tasks and then transfer to other tasks or datasets. For example, NASNet is searched on CIFAR-10 image classification and directly applied to object detection [32]. We empirically show that this manner is sub-optimal. ClsNASNet is the best architecture searched on ImageNet classification; the search method and search space follow DetNAS. We use it as the backbone of object detectors. ClsNASNet is the best on ImageNet classification in Table 3, while its performance on object detection is disappointing: it is the worst in all cases, except for being slightly better than ShuffleNetv2-20 on COCO-FPN. This shows that NAS on target tasks can perform better than NAS on proxy tasks.

Comparisons to the from-scratch counterpart.
DetNAS-scratch is a from-scratch baseline to ablate the effect of pre-training. In this case, the supernet is trained from scratch on detection datasets without being pre-trained on ImageNet. To compensate for the lack of pre-training, its training iterations on detection datasets are twice those of DetNAS, that is, 180k on COCO and 45k on VOC. 
In this way, the computation costs of DetNAS and DetNAS-scratch are similar. All other settings are the same as DetNAS. Both DetNAS and DetNAS-scratch show consistent improvements over ClsNASNet and ShuffleNetv2-20. This shows that searching directly on object detection is a better choice, no matter from scratch or with ImageNet pre-training. In addition, DetNAS also performs better than DetNAS-scratch in all cases, which reflects the importance of pre-training.

Table 5: Comparisons to the random baseline.
| Backbone | ImageNet top-1 (%) | COCO mmAP: FPN | COCO mmAP: RetinaNet | VOC mAP: FPN | VOC mAP: RetinaNet |
| Random | 73.9 ± 0.2 | 35.6 ± 0.6 | 32.5 ± 0.4 | 80.9 ± 0.2 | 79.0 ± 0.7 |
| DetNAS | 73.9 - 74.1 | 36.6 | 33.3 | 81.5 | 80.1 |

[Figure 2: Curve of EA and Random during search. Figure 3: Random models on COCO-FPN. The plots themselves are not recoverable from the extracted text.]

Comparisons to the random baseline.
As stated in many NAS works [11, 17], the random baseline is also competitive. In this work, we also include the random baseline for comparison, as in Table 5. In Figure 2, the mmAP curves during the supernet search are depicted to compare EA with Random; at each iteration, the top 50 models found so far are plotted. EA demonstrates clearly better sampling efficiency than Random. In addition, we randomly pick 20 networks in the search space and train them with the same settings as the other result models. On ImageNet classification, the random baseline is comparable to DetNAS, but on the object detection tasks, DetNAS performs better than Random. In Figure 3, we depict the scatter of random models, their average line, and the line of DetNAS. DetNAS in the small search space reaches 36.6% mmAP while Random reaches 35.6 ± 0.6%. From these points of view, DetNAS performs better than Random not only during the search but also in the output models.

5.3 DetNAS Architecture and Discussions

Our architectures searched for object detectors show meaningful patterns that are distinct from architectures searched for image classification. Figure 4 illustrates three neural architectures searched in the 20-block search space. The architecture at the top of Figure 4 is ClsNASNet; the other two are searched with the FPN and RetinaNet detectors respectively. These architectures are depicted block-wise. The yellow and orange blocks are 5×5 and 7×7 ShuffleNetv2 blocks. The blue blocks have kernel size 3×3; the larger blue blocks are Xception ShuffleNetv2 blocks, which are deeper than the small 3×3 ShuffleNetv2 blocks. Figure 5 illustrates the architecture of DetNASNet. It has 40 blocks in total and {8, 8, 16, 8} blocks in each stage.

In contrast to ClsNASNet, the architectures of DetNAS have large-kernel blocks in low-level layers and deep blocks in high-level layers. In DetNAS, blocks with large kernels (5×5, 7×7) mostly gather in the low-level layers, Stage 1 and Stage 2. In contrast, ClsNASNet has all its 7×7 blocks in Stage 3 and Stage 4. This pattern also conforms to the architectures of ProxylessNAS [3], which is likewise searched on ImageNet image classification. On the other hand, the blue blocks have 3×3 kernel size. As in the middle of Figure 4, blue blocks are mostly grouped in Stage 3 and Stage 4. Among these 8 blue blocks, 6 are Xception ShuffleNetv2 blocks that are deeper than common 3×3 ShuffleNetv2 blocks. 
In ClsNASNet, only one Xception ShuffleNetv2 block exists in the high-level layers. In addition, DetNASNet also shows the meaningful pattern that most high-level blocks have 3×3 kernel size. Based on these observations, we find that the networks suitable for object detection are visibly different from the networks for classification. Therefore, these distinctions further confirm the necessity of directly searching on the target tasks, instead of proxy tasks.

[Figure 4: block-wise architecture diagrams with panels (a) ClsNASNet, (b) DetNAS-P (FPN-COCO), (c) DetNAS-P (RetinaNet-COCO), each annotated by Stage 1-4; the drawings themselves are not recoverable from the extracted text.]
Figure 4: The searched architecture pattern comparison in the small (20 blocks) search space. From top to bottom, they are ClsNASNet, DetNAS (COCO-FPN) and DetNAS (COCO-RetinaNet).

[Figure 5: block-wise architecture diagram of DetNASNet over Stage 1-4; the drawing itself is not recoverable from the extracted text.]
Figure 5: DetNASNet architecture.

6 Conclusion

We present DetNAS, the first attempt to search for backbones in object detectors without any proxy. Our method consists of three steps: supernet pre-training on ImageNet, supernet fine-tuning on detection datasets, and searching on the trained supernet with EA. Table 4 shows the computation cost of each step. The computation cost of DetNAS, 44 GPU days on COCO, is just twice that of training a common object detector. In experiments, the main result of DetNAS achieves better performance than ResNet-101 on COCO with the FPN detector, with much lower FLOPs complexity. We also make comprehensive comparisons to show the effectiveness of DetNAS. We test DetNAS on various object detectors (FPN and RetinaNet) and different datasets (COCO and VOC). For further discussion, we spotlight the architecture-level gap between image classification and object detection: ClsNASNet and DetNAS have different and meaningful architecture-level patterns. This might, in return, provide some insights for hand-crafted architecture design.

Acknowledgement

This work is supported by Major Project for New Generation of AI Grant (No. 2018AAA0100402), National Key R&D Program of China (No. 2017YFA0700800), and the National Natural Science Foundation of China under Grants 61976208, 91646207, 61573352, and 61773377. This work is also supported by Beijing Academy of Artificial Intelligence (BAAI).

References

[1] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Understanding and simplifying one-shot architecture search. In ICML, pages 549-558, 2018.

[2] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: one-shot model architecture search through hypernetworks. CoRR, abs/1708.05344, 2017.

[3] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. ICLR, abs/1812.00332, 2019.

[4] Jianlong Chang, Xinbang Zhang, Yiwen Guo, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Differentiable architecture search with ensemble gumbel-softmax. abs/1905.01786, 2019.

[5] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V. Le. NAS-FPN: learning scalable feature pyramid architecture for object detection. CoRR, abs/1904.07392, 2019.

[6] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron, 2018.

[7] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. abs/1904.00420, 2019.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.

[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961-2969, 2017.

[10] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. CoRR, abs/1811.08883, 2019.

[11] Liam Li and Ameet Talwalkar. 
Random search and reproducibility for neural architecture search.\n\nCoRR, abs/1902.07638, 2019.\n\n[12] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet:\n\nDesign backbone for object detection. In ECCV, pages 339\u2013354.\n\n[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,\nPiotr Doll\u00e1r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV,\npages 740\u2013755, 2014.\n\n[14] Tsung-Yi Lin, Piotr Doll\u00e1r, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J.\n\nBelongie. Feature pyramid networks for object detection. In CVPR, pages 936\u2013944, 2017.\n\n[15] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Doll\u00e1r. Focal loss for\n\ndense object detection. In ICCV, pages 2999\u20133007, 2017.\n\n[16] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille,\nand Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image\nsegmentation. CoRR, abs/1901.02985, 2019.\n\n[17] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search.\n\nICLR, abs/1806.09055, 2019.\n\n[18] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shuf\ufb02enet V2: practical guidelines\n\nfor ef\ufb01cient CNN architecture design. 2018.\n\n[19] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian D. Reid. Fast neural architecture search\nof compact semantic segmentation models via auxiliary cells. CoRR, abs/1810.10804, 2018.\n[20] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian\n\nSun. Megdet: A large mini-batch object detector. In CVPR, pages 6181\u20136189, 2018.\n\n[21] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Ef\ufb01cient neural\n\narchitecture search via parameter sharing. In ICML, pages 4092\u20134101, 2018.\n\n10\n\n\f[22] Zheng Qin, Zeming Li, Zhaoning Zhang, Yiping Bao, Gang Yu, Yuxing Peng, and Jian Sun.\n\nThundernet: Towards real-time generic object detection. CoRR, abs/1903.11752.\n\n[23] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for\n\nimage classi\ufb01er architecture search. CoRR, abs/1802.01548, 2018.\n\n[24] Hamid Rezato\ufb01ghi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio\nSavarese. Generalized intersection over union: A metric and a loss for bounding box regression.\nCoRR, abs/1902.09630, 2019.\n\n[25] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning\n\nfor human pose estimation. CoRR, abs/1902.09212, 2019.\n\n[26] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by\n\nguided anchoring. CoRR, abs/1901.03278, 2019.\n\n[27] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong\nTian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware ef\ufb01cient convnet\ndesign via differentiable neural architecture search. CVPR, abs/1812.03443, 2019.\n\n[28] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture\n\nsearch. ICLR, abs/1812.09926, 2019.\n\n[29] Tong Yang, Xiangyu Zhang, Zeming Li, Wenqiang Zhang, and Jian Sun. Metaanchor: Learning\n\nto detect objects with customized anchors. In NIPS, pages 318\u2013328, 2018.\n\n[30] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable,\n\nbetter results. CoRR, abs/1811.11168, 2018.\n\n[31] Barret Zoph and Quoc V. Le. 
Neural architecture search with reinforcement learning. CoRR,\n\nabs/1611.01578, 2016.\n\n[32] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable\n\narchitectures for scalable image recognition. pages 8697\u20138710, 2018.\n\n11\n\n\f", "award": [], "sourceid": 3602, "authors": [{"given_name": "Yukang", "family_name": "Chen", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Tong", "family_name": "Yang", "institution": "Megvii Inc."}, {"given_name": "Xiangyu", "family_name": "Zhang", "institution": "MEGVII Technology"}, {"given_name": "GAOFENG", "family_name": "MENG", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Xinyu", "family_name": "Xiao", "institution": "National Laboratory of Pattern recognition (NLPR),  Institute of Automation of Chinese Academy of Sciences (CASIA)"}, {"given_name": "Jian", "family_name": "Sun", "institution": "Megvii, Face++"}]}