NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Paper ID: 383 NAT: Neural Architecture Transformer for Accurate and Compact Architectures

### Reviewer 1

This paper proposes to optimize and simplify network architectures over a fixed topological structure, i.e., a directed acyclic graph. The search space is the space of operation combinations, and pruning actions are added to the search space to simplify the structure. The problem is then formulated as a reinforcement learning task and solved with a policy learned via graph convolutional networks. This paper is not well written or well organized, and there are typos and grammatical issues. My main concerns are listed below:

1. Regarding line 43: a larger search space means a higher probability of finding a better network architecture. Why is a large search space a limitation of NAO?
2. Given the simplification actions of this paper (Figure 2(c)), how are skip connections obtained for linear network architectures such as VGG?
3. Why do the generated networks have two inputs? In other words, what do "-1" and "-2" mean in Figures 3 and 4?
4. I do not think the optimization problem formulation (Eq. 2) is reasonable. The purpose of NAS is to find the best operation setting for the best structure, but the formulation instead finds the best expected operation setting over all structures.
5. Can you provide more details on how the three challenges raised in lines 131-136 are addressed?
6. On the experiments, can you report the training cost of NAT, i.e., the extra time cost after the NAS phase?
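The search space described above — each operation in the fixed DAG may be kept, replaced by a skip connection, or pruned away — can be made concrete with a small sketch. Function and operation names here are illustrative, not taken from the paper:

```python
from itertools import product

def transform_candidates(architecture):
    """Enumerate every architecture reachable by independently applying
    one of three actions to each edge of a fixed DAG topology: keep the
    original operation, replace it with "skip_connect", or prune it to
    "null". `architecture` is a list of operation names, one per edge,
    so 3^n candidates are returned for n edges (duplicates can arise if
    an edge already holds "skip_connect").
    """
    per_edge = [[op, "skip_connect", "null"] for op in architecture]
    return [list(combo) for combo in product(*per_edge)]

# A toy cell with two edges yields 3^2 = 9 candidate transformations.
cell = ["conv3x3", "conv5x5"]
candidates = transform_candidates(cell)
```

This exhaustive enumeration is only feasible because the per-edge action set is so small, which is exactly the simplicity the review contrasts with NAO's larger space.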

### Reviewer 2

The paper proposes a new method, "Neural Architecture Transformer" (NAT), for adapting existing neural architectures so that they achieve a more favourable resource-consumption/performance trade-off (by replacing existing operations with skip connections, removing them, or adding skip connections).

* Originality: The paper proposes a new search space and a novel search method; both contributions are original. However, the paper misses closely related work such as Cao et al., "Learnable Embedding Space for Efficient Neural Architecture Compression".
* Quality: The empirical results show that the proposed combination of search space and search method improves existing architectures such as VGG16, ResNet20, ENASNet, or DartsNet in terms of the resource-efficiency/performance trade-off. However, the evaluation does not disentangle the effects of the search space and the search method; in particular, it remains unclear whether the proposed and relatively complex search method (policy gradient + graph convolutional networks) would outperform simpler baselines such as random search on the same search space. Moreover, the method is only applied to architectures that were not optimized for resource efficiency; it remains unclear whether NAT would also improve architectures such as the MobileNet family or MnasNet.
* Clarity: The paper is well written and structured. It also contains sufficient detail to replicate the results.
* Significance: The paper's three main contributions are potentially significant, in particular for practitioners. The proposed method can be seen as a post-processing step for NAS, which increases a method's performance with the same or even fewer resources (since the search space is very simple, a search method is more likely to find an optimal configuration in it than in the significantly larger typical NAS search spaces). However, to really demonstrate the significance of the proposed method, the paper would have to show that it outperforms simpler baselines such as random search and that it is also applicable to cells that were already optimized for resource efficiency.

In summary, the paper proposes a promising method for "post-processing" neural architectures. However, because of the shortcomings listed above, I think it is too premature for acceptance in its current form.
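The random-search baseline requested above is simple to state. Below is a minimal, hypothetical sketch over the same three-action space; `evaluate` and `cost` are stand-ins for the real train-and-validate pipeline and the paper's resource measure c(alpha) with budget kappa:

```python
import random

ACTIONS = ("keep", "skip_connect", "null")

def random_search(base_arch, evaluate, cost, kappa, n_samples=100, seed=0):
    """Random-search baseline over per-edge transformations.

    `base_arch` is a list of operation names, one per edge. `evaluate`
    returns a validation score for a candidate and `cost` its resource
    consumption; only candidates satisfying cost(alpha) <= kappa count.
    """
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_samples):
        cand = []
        for op in base_arch:
            action = rng.choice(ACTIONS)
            # "keep" retains the original operation; the other two replace it.
            cand.append(op if action == "keep" else action)
        if cost(cand) > kappa:
            continue  # violates the resource budget
        score = evaluate(cand)
        if score > best_score:
            best_arch, best_score = cand, score
    return best_arch, best_score
```

Because each candidate is drawn independently and uniformly, any gap between NAT and this baseline on the same search space would isolate the contribution of the learned policy from that of the search space itself.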

### Reviewer 3

**Summary**

This paper presents a new approach to NAS called Neural Architecture Transformer (briefly, NAT), which can refine a network into a better one. NAT leverages a GCN to encode a network as a representation. A policy network takes the current network as input and outputs a probability distribution over the possible operations, indicating which operation should be applied between two nodes. Three types of operation are allowed: keeping the original operation, replacing it with "skip_connect", and removing it. Experiments on CIFAR-10 and ImageNet are conducted to verify the proposed algorithm.

**Originality**

This paper presents a lightweight approach to NAS by refining an input neural network. The idea is interesting to me. A policy network is introduced to determine whether an operation should be kept, removed, or replaced with "skip_connect". Another point is that, unlike NAO, NAT uses a GCN to represent a network.

**Clarity**

This paper is easy to follow and well written. However, some places are confusing:

1. Line 191: can you provide a specific form of $\mathcal{L}(\varpi, w)$?
2. In Eqn. (2), how is the constraint "s.t. $c(\alpha) \le \kappa$" enforced in your optimization process?
3. Will each computation architecture (i.e., the architecture in the dashed box in Figure 1) in the network be pruned in the same way or in different ways?
4. Is Cutout used in your experiments? This is crucial for evaluating the performance.

**Significance**

1. In the current scheme, NAT can only replace the original computation module with a skip connection or null (i.e., the "none", "same", and "skip_connect" operations in "genotypes.py"). As a result, the improvement over NAO is not very significant: as shown in Tables 1 and 2 and Figures 9 and 10 of the supplementary document, the improvement compared to NAO is marginal.
2. It seems that when the baseline is stronger (like ENAS or DARTS), NAT cannot significantly boost the baseline, and there is little reduction in parameters (see Table 2, CIFAR-10). Then why do we need NAT when we have ENAS?
3. Did you try further tuning the output of NAONet using NAT? That is, Table 3 lacks a column "NAO".
4. Do you use Cutout when training the networks?

**Reference**

The following work, which also evolves from a good model, should be discussed:

[1] The Evolved Transformer, ICML'19

**Post rebuttal**

Thanks for the response. (1) Reply to Q12: please express clearly how you address this problem in the next version. (2) Reply to Q15: I understand that NAT can further boost the performance of the input model, but I am not sure whether such improvement could also be achieved by simple parameter-tuning tricks, such as using different random seeds, different dropout rates, etc. I hope the authors can provide more discussion in the future. (3) Reply to Q16: Thanks for the detailed results. I hope that in the next version you can add the following result: given a network searched by NAO + cutout (instead of NAO-WS), what accuracy can your approach reach? Overall, I think this is an interesting paper for the NAS literature. I hope the authors can provide better approaches in the future, since currently the improvement over strong models is not very significant. I maintain my score of "6".
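For context on the search method discussed across the reviews, the policy-gradient mechanics can be sketched with a bare-bones REINFORCE update for independent per-edge categorical policies over {keep, skip_connect, null}. This replaces the paper's GCN-conditioned policy with plain per-edge logits and a toy reward, so it illustrates only the update rule, not the actual method:

```python
import math
import random

ACTIONS = ("keep", "skip_connect", "null")

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, rng, lr=0.1):
    """One REINFORCE update on independent per-edge categorical policies.

    `logits` holds one 3-way logit list per edge and is updated in place.
    `reward_fn` maps a list of sampled action names to a scalar reward and
    stands in for training and evaluating the transformed architecture.
    """
    sampled, grads = [], []
    for edge_logits in logits:
        probs = softmax(edge_logits)
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        sampled.append(a)
        # Gradient of log pi(a) w.r.t. the logits is one_hot(a) - probs.
        grads.append([(1.0 if i == a else 0.0) - p
                      for i, p in enumerate(probs)])
    reward = reward_fn([ACTIONS[a] for a in sampled])
    for edge_logits, g in zip(logits, grads):
        for i in range(len(ACTIONS)):
            edge_logits[i] += lr * reward * g[i]
    return reward
```

With a toy reward that favours pruning (e.g., the fraction of edges set to "null"), repeated steps drive each per-edge policy toward the pruning action, which is the mechanism by which such a policy trades accuracy against compactness.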