NeurIPS 2020

Differentiable Neural Architecture Search in Equivalent Space with Exploration Enhancement

Review 1

Summary and Contributions: This paper introduces E^2NAS, a differentiable neural architecture search algorithm combined with exploration enhancement and architecture complementation. The authors apply a variational graph autoencoder to map architectures onto a manifold and define the novelty of an architecture for exploration. To alleviate the rich-get-richer problem, they treat differentiable NAS as an online multi-task learning problem and define a complementation loss function. The experimental results show the efficiency of the proposed method.

Strengths: 1. Although the autoencoder framework has already been used in NAO, the authors consider the probability density function on the manifold, which is novel for the NAS problem. 2. They consider one-shot NAS as a multi-model task and convert two architectures together. 3. The experimental results are sufficient.

Weaknesses: 1. For exploration, there are other traditional methods such as Bayesian Optimization (BO). How does your approach differ from BO? 2. You add many hyperparameters to the framework. Are these hyperparameters difficult to tune? 3. Some SOTA one-shot NAS methods [1, 2] are missing. Compared with their reported results, your results are not the best. For example, [1] reaches 46.34 +/- 0 on ImageNet in NAS-Bench-201, while you only reach 45.77. Your results on the DARTS search space are also not SOTA. 4. There are some typos, for example “and and” in line 279.

Correctness: The claims and method are correct and the empirical methodology is correct.

Clarity: The paper is well written.

Relation to Prior Work: It is clearly discussed how this work differs from previous contributions.

Reproducibility: Yes

Additional Feedback: [Post rebuttal] I keep my original positive score since the rebuttal convinces me to some degree.

Review 2

Summary and Contributions: The paper proposes an exploration-enhancing neural architecture search with architecture complementation (E2NAS) to address several limitations of existing differentiable NAS approaches. More specifically, the paper:
- Improves the theoretical foundation for performing optimization in the continuous latent space equivalently to optimization in the discrete space.
- Tackles the rich-get-richer problem using a probabilistic exploration enhancement method that encourages intelligent exploration during the search.
- Proposes an architecture-complementation-based continual learning method for supernet training, in order to force the supernet to keep the memory of previously visited architectures.
- Presents a thorough evaluation on NAS-Bench.

Strengths: The paper is well motivated and explains the limitations in existing NAS approaches.

Weaknesses: The paper is not very novel or significant in its contribution. It combines two regularization methods to mitigate two long-standing problems in differentiable NAS; however, the proposed methods are not very novel. NAS-Bench is not a well-established benchmark, and not many people are familiar with it. It is not fair to compare with existing work on NAS-Bench, as most of that work was not optimized for NAS-Bench. For instance, DARTS may work equally well with proper hyperparameter tuning and regularization. With the existing DARTS hyperparameters, search on NAS-Bench converges to networks with only identity/skip operations. In addition, NAS-Bench has its limitations, in that it might not reflect realistic settings. The reviewer would like to see a stronger ImageNet result based on either transfer learning or a direct search on ImageNet. Catastrophic forgetting can be a more significant issue for a multi-task learning problem. Here, however, even though different network architectures are sampled sequentially, it should not be treated as a multi-task learning problem. The reviewer agrees that it is a multi-model optimization problem, not a multi-task problem. The proposed methods could be summarized in more concise and easy-to-understand language. The first method, which enhances exploration, is a diversity regularization method, while the second method, which mitigates forgetting, is a soft regularization method with a complementation loss function. In the second method, the selection of three models seems very arbitrary and not intuitive.

Correctness: Partially correct. The paper directly reports numbers for related work (the DARTS results, for example) copied from a previous paper, which does not seem right. The DARTS results were not obtained with proper hyperparameter tuning. The characterization of catastrophic forgetting is not accurate here.

Clarity: Yes. The writing quality is ok. The related work section is detailed and thorough.

Relation to Prior Work: The method mitigating catastrophic forgetting is not new and a similar approach was proposed by the EWC work [1]. [1] Overcoming catastrophic forgetting in neural networks,
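
For readers unfamiliar with the EWC work cited above, its core idea can be sketched as a quadratic penalty that anchors parameters to the values learned on an earlier task, weighted by a per-parameter importance estimate. The sketch below is only an illustration under stated assumptions: it assumes a diagonal Fisher information estimate is already available, and the function and argument names (`ewc_penalty`, `anchor_params`, `fisher_diag`, `strength`) are hypothetical, not taken from the paper under review or from [1].

```python
def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """Quadratic EWC-style penalty (illustrative sketch).

    Computes (strength / 2) * sum_i F_i * (theta_i - theta_star_i)^2, where
    theta_star are the parameters learned on the earlier task and F_i is an
    assumed, precomputed diagonal Fisher information estimate.
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    weighted_sq = sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )
    return 0.5 * strength * weighted_sq


def ewc_total_loss(new_task_loss, params, anchor_params, fisher_diag, strength):
    """Loss on the new task plus the penalty for drifting from old parameters."""
    return new_task_loss + ewc_penalty(params, anchor_params, fisher_diag, strength)
```

Parameters with a large importance weight are held close to their earlier values, while unimportant ones remain free to change; whether the paper's complementation loss behaves comparably is the question raised above.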

Reproducibility: Yes

Additional Feedback: Post-rebuttal: The rebuttal addressed some of the reviewer's concerns. The reviewer agrees that the paper tackles some fundamental limitations of differentiable search and that the contribution itself can be significant. However, the reviewer still recommends evaluating the work on a more impactful dataset such as ImageNet. Evaluating on a real workload is critical for this paper because it builds on an existing approach and, in the reviewer's opinion, the two regularization techniques it introduces are incremental. The reviewer personally enjoyed reading the DARTS paper and would consider DARTS much more innovative and groundbreaking. Also, a lot of great work was optimized for real workloads such as CIFAR-10 and ImageNet; that work was not optimized in any way for NAS-Bench (which is only recently gaining traction). Comparison with great work in an entirely different setting is unfair and should not be encouraged. The reviewer will raise the score to acknowledge the paper's contribution and the detailed rebuttal.

Review 3

Summary and Contributions: This paper addresses the deteriorated validation performance of subnets with shared weights by using a variational graph autoencoder to injectively transform discrete architectures into a continuous space, and then tackles the catastrophic forgetting problem by means of exploration enhancement in the differentiable space through architecture complementation. The authors provide some theoretical understanding of why optimizing the architecture search in the latent continuous space is equivalent to optimizing in the discrete space, and an extensive empirical ablation study is provided. Overall, I consider this paper to be of good technical quality.

Strengths: This paper first analyzes why traditional GD-based one-shot search fails, arguing that optimizing architectures in the discrete space is not very efficient and easily leads to local minima. It then transforms the architecture space into a continuous latent space using a VGAE and devises an exploration enhancement method with theoretical analysis. To overcome the catastrophic forgetting problem, the authors further propose architecture complementation as an effective regularization during the search process. The analysis of architecture complementation in Figure 1 looks technically reasonable to me. Overall, I think this work advocates a new direction: by encoding architectures in a low-dimensional latent space, architectures with similar structures can be grouped together, and optimizing on such a smoothly changing performance surface makes exploitation and exploration much easier.

Weaknesses: The experiments could be further enriched by verifying the effectiveness of the proposed method on more one-shot search spaces such as NAS-Bench-1Shot1. There is a recent, concurrent work studying the network structure of neural architectures. It shares a similar idea: by transforming neural networks into low-dimensional spaces using GNNs, it can provide better predictive performance of the latent representations and further improve the sampling efficiency across different tasks and datasets. I would suggest the authors take a look. [1] Graph Structure of Neural Networks. ICML 2020.

Correctness: The claims and method sound correct to me. The empirical evaluations are correct.

Clarity: Yes. It is well written and I enjoyed reading the work.

Relation to Prior Work: There is not much work in this direction. NAO is one of the earliest works to optimize neural architectures in a continuous latent space, but it uses an LSTM as the predictor and is optimized in a supervised manner. There is some concurrent work evaluating the effectiveness of neural architecture encodings and representation learning in discrete or continuous spaces. I would suggest the authors read it. [2] A Study on Encodings for Neural Architecture Search. arXiv:2007.04965 [3] Does Unsupervised Architecture Representation Learning Help Neural Architecture Search? arXiv:2006.06936 [4] Are Labels Necessary for Neural Architecture Search? ECCV 2020.

Reproducibility: Yes

Additional Feedback: Take a look at the NAS-Bench-1Shot1 paper and [1, 2, 3, 4], which discuss a lot of useful information that coincides with the observations in your paper.

Review 4

Summary and Contributions: This paper addresses three important problems in one-shot gradient-based NAS, namely 1) the incongruence between the continuous search space and the discrete result architecture, 2) the rich-get-richer problem where the initially good connections will be selected and reinforced during the search, and 3) the multi-model catastrophic forgetting problem when the weights are shared and trained in a one-shot manner. The authors tackle the first problem by introducing a one-to-one mapping from the latent representation to an architecture with a graph autoencoder, while introducing a replay buffer with novel losses (i.e., the loss to encourage selecting novel architectures and the loss considering the complementary architecture) to address the second and the third problems. The experiments on NAS-Bench-201, CIFAR-10/100, ImageNet-16-120 demonstrate the promising performance of the proposed method.

Strengths: 1. Three important problems in one-shot gradient-based NAS are addressed. 2. Detailed experiments and ablations. 3. This paper is well-written and easy to follow.

Weaknesses: 1. In Table 2, why does gamma = 0.2 perform the worst and introduce huge variance? Shouldn't the performance change smoothly with gamma? 2. What is the definition of the complementary/orthogonal architecture? Should the union of alpha_{i-1} and alpha_i^c be the whole search space, or is it enough that the union of alpha_{i-1} and alpha_i^c includes alpha_i? 3. Please consider redrawing Fig. 1, as the current version is unnaturally skewed. 4. The sign in Eqs. (2) and (3) should be a minus for gradient descent.
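
To make Point 4 concrete: a standard gradient descent step subtracts the scaled gradient from the current parameters. The symbols below (w for shared weights, alpha for architecture parameters, eta for step sizes, and the train/validation losses) are generic placeholders assumed for illustration and may not match the paper's exact notation in Eqs. (2) and (3).

```latex
w_{t+1} = w_t - \eta_w \, \nabla_w \mathcal{L}_{\mathrm{train}}(w_t, \alpha_t),
\qquad
\alpha_{t+1} = \alpha_t - \eta_\alpha \, \nabla_\alpha \mathcal{L}_{\mathrm{val}}(w_{t+1}, \alpha_t)
```

With a plus sign in place of the minus, the same update would perform gradient ascent, i.e., it would increase the loss rather than decrease it.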

Correctness: Yes, except for a typo in Eqs. (2) and (3) (please see Point 4 in the Weaknesses section).

Clarity: Yes.

Relation to Prior Work: Two recent methods deal with the search/result space incongruence problem in other ways. Please consider citing and discussing them: [1] Li et al., SGAS: Sequential Greedy Architecture Search. arXiv:1912.00195, 2019. [2] Gao et al., MTL-NAS: Task-Agnostic Neural Architecture Search towards General-Purpose Multi-Task Learning. arXiv:2003.14058, 2020.

Reproducibility: Yes

Additional Feedback: I have read the authors' rebuttal and maintain my initial recommendation.