NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6001
Title: Meta Architecture Search

Reviewer 1

Originality: The work is original in its modelling of meta architecture search. The approach builds on existing methodologies in variational inference (coupled VB) and neural architecture search (a Bayesian formalism of the DARTS search space). It is perhaps also similar to [1] in its use of the Gumbel-Softmax trick.

Quality: The work is well presented. Relevant baselines are cited in the comparative study. An empirical analysis of different aspects of the proposed model is also provided.

Clarity: The paper clearly motivates the problem statement, provides a solution, and supports it with an empirical study of the learnt posterior distributions for different tasks, the effect of the prior, and the flexibility of the model in the few-shot learning setting.

Significance: As NAS methods are often expensive, it is natural to look for meta-modelling approaches that can help share certain aspects of NAS across tasks. This work is a useful step in this direction.

Weaknesses: The tasks considered consist of different resolutions of the same dataset. It is well known that models learned on ImageNet transfer to different resolutions. Evaluating the approach on a more diverse set of tasks would be useful.

General comments: There is a LaTeX typo on page 5: \texttt{CIFAR-10}.

[1] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, Kurt Keutzer: FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. CoRR abs/1812.03443 (2018)

Reviewer 2

The authors propose Bayesian Meta Architecture Search (BASE), a method for meta-learning neural network architectures and their weights across tasks. The paper frames this as a Bayesian inference problem and employs Gumbel-Softmax, reparametrization, and optimization embedding (a variational inference method) to optimize a distribution over neural network architectures and their weights across different tasks.

Originality: Meta-learning neural network architectures is a very natural next step for NAS research that has not been done so far (at least I’m not aware of any such work). It is not only natural but also important, as it makes NAS more scalable and more practically relevant. The Bayesian view, however, is not really novel, but rather an obvious extension of [1]. [1] should have been discussed more thoroughly, and the authors should point out the differences (if any). In general, the related work section is very short and does not provide a proper summary of the current state of the art in this field of research.

Quality: BASE is well motivated and derived. However, it is unclear to me what role the optimization embedding actually plays. Is it actually used? Which parts of the equations and pseudocode are not already covered by ELBO + Gumbel-Softmax + reparametrization? The empirical results seem convincing (in terms of search time) at first glance, but I have the following concerns:

(i) While the authors claim that BASE significantly speeds up NAS by meta-learning (which I tend to agree with), they do not properly comment on how expensive the training of the full model actually is. It seems that this final training is more expensive than the search process and should therefore be taken into account. Are the meta-learned weights actually used for the full-sized model? If so, how?

(ii) Missing ablation studies and unfair comparisons. A proper ablation study on the meta-learning of the architecture is missing. E.g., in Section 6.2, one could run BASE without architecture adaptation (i.e., only meta-learning the weights) during the meta-learning phase. How would DARTS perform if one simply pre-trained it on the meta-train tasks rather than running it from scratch? This would be a fairer comparison. In Section 6.3, to the best of my knowledge, MAML and REPTILE use the same architecture, so why are they both listed in the table with different results? Also, this architecture has approx. 100k parameters, which is not stated in the table. REPTILE also seems to be on par with BASE (Gumbel) while likely requiring significantly less compute.

(iii) The error numbers should be taken with a grain of salt, as it increasingly seems that the cell-based search space, in combination with scaling up the found models, is less interesting and appropriate than one would think [2,3]. In terms of error rates, I do not see significant improvements over prior work (which is of course not the focus of this work). However, the authors tend to overclaim their results (e.g., in the abstract: “This result beats the state-of-the-art methods such as DARTS and SNAS significantly in terms of both performance and computational costs”).

Clarity: The paper is well structured, even though it would be better to have the related work section at the beginning. As mentioned before, the related work section is rather short. It would be helpful for the reader to have some information on optimization embedding as well. The paper contains various typos and grammatical errors, which make it hard to read. The paper is partially overloaded with notation, and sometimes the notation is not explained, not re-used, or not consistent with the figures. E.g., in Figure 1, one can only assume that “NN” means neural network. What are \phi and \psi? This is unclear at that point and only explained later in the paper. In the right part of Figure 1, what is 1,2,..J? What is M? This notation is not used in the text (as far as I can see). Equation (1): why introduce A? Equation (2): what are f and l? Figure 2 is barely readable.

Significance: The authors address an important problem, namely extending NAS methods to the meta-learning setting, so that one does not need to re-run NAS from scratch for a new task. However, it is not clear whether this paper achieves that goal, given the concerns about the empirical evaluation mentioned above. The methodological novelty is currently vague to me, as (i) there seems to be a large overlap with [1] and (ii) the role and impact of optimization embedding are not fully clear.

In summary, the authors present an interesting approach to using architecture search in a meta-learning setting. The empirical results need to be taken with a grain of salt. The presentation of the paper is borderline.

[1] SNAS: Stochastic Neural Architecture Search, Xie et al., ICLR 2019
[2] Evaluating the Search Phase of Neural Architecture Search, Sciuto et al.
[3] Random Search and Reproducibility for Neural Architecture Search, Li et al.

------------------------------------------- Update after author feedback -------------------------------------------

Thank you for commenting on my review. I acknowledge that this work does indeed give a better Bayesian picture of NAS and extends it to the meta-learning setting, which is an improvement over SNAS. However, I still have one major concern that has not been addressed in the rebuttal: the authors only compare to NAS methods run from scratch. There is no baseline that employs the meta-train set. E.g., one important baseline that should have been included is pre-running some NAS method, e.g. DARTS, on the meta-train set, followed by adaptation on the meta-test set (for the same compute time as BASE). Therefore, the benefit of the proposed method is still unclear to me, and I decided not to increase my rating.
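For reference, the Gumbel-Softmax relaxation that Reviewers 1 and 2 mention can be sketched roughly as follows. This is a generic NumPy illustration, not the authors' implementation: the function name `gumbel_softmax_sample`, the example logits, and the temperature value are all assumptions chosen for illustration.

```python
import numpy as np


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw relaxed one-hot samples over candidate operations (last axis)."""
    # Gumbel(0, 1) noise via the inverse transform of uniform samples;
    # clipping keeps both logarithms finite.
    u = np.clip(rng.random(logits.shape), 1e-10, 1.0 - 1e-10)
    gumbel_noise = -np.log(-np.log(u))
    # Perturb the logits, then apply a temperature-scaled softmax.
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)


# Hypothetical scores for three candidate operations on one edge.
rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])
weights = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
```

The sample is a deterministic, differentiable function of the logits once the noise is fixed, which is what allows reparametrized gradients through a discrete architecture choice. Lower temperatures push `weights` closer to a one-hot vector.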

Reviewer 3

The idea seems novel to me. I specifically like the skip-connection part, which allows one to compose networks from layers with different functions. The only limitations are that the number of such elements is hard-limited to K and that it is not permutation-invariant. However, I think the paper still stands without them. The experimental results are presented in detail. It is also very interesting to look at the PCA results for the weights and the architectures and to see how the different datasets are organised with respect to one another. CIFAR-10 and ImageNet32 are close in the architecture space but different in the weight space, which makes sense as the image sizes are close. I am not a deep learning person, so I am unable to judge the limitations from a real-application point of view. But looking at the meta-problem and the Bayesian methodology, I would love to accept this paper.

Minor comments:
1. The paper needs a good proofreading to fix many typesetting issues, e.g. an extra 'a' in line 19, an extra 'then' in line 53, part of line 66 could be better written as "as a composition of L layers of cells...", and an extra textit in line 198.
2. The authors have not put any effort into formatting the references properly: sometimes conference names are given in long form, sometimes in short form. There are even duplicate references, [24] and [25]. Please fix these in the later version.