NeurIPS 2020

AutoBSS: An Efficient Algorithm for Block Stacking Style Search

Review 1

Summary and Contributions:
- The authors study a method to find the optimal way to stack network "blocks" (Block Stacking Style, BSS) for a variety of different networks, varying the number of blocks and the channels of each block.
- The authors use Bayesian optimization as their search method for finding the optimal block stacking style, after applying a clustering method to make the search more efficient.
- They try this on a variety of architectures on ImageNet and obtain good improvements. They also try it on a variety of other tasks, such as object detection, instance segmentation, and model compression.

Strengths:
- The authors did a nice job of running random baselines against their method, showing that their search method is necessary.
- Overall, the paper addresses an important and often overlooked topic in neural network architecture design. It is quite practical and is additive to architectures people have already discovered.
- The method works on many tasks and models, some of which are strong baselines.
- The authors do a nice job of running ablations for their algorithm (Table 2).
- Thanks to the authors' method, the computational complexity is not high.
- The algorithm the authors use is novel and makes sense given their analysis in the paper.

Weaknesses:
- In Table 1, how do the latencies change on devices such as GPUs? FLOPs can be misleading as a proxy for runtime performance.
- Overall, the improvements are not that large on most architectures, and in some setups the baselines are not that strong (Table 4, both RetinaNet and Mask R-CNN).
- In Table 1, it would be nice to compare to prior BSS search methods such as POP [23].
- It would be nice to have a figure in the main text detailing an example BSSC, to make it more intuitive to the reader.
- More details could be given for some of the plots, as they are hard to understand from the captions alone (e.g., in 3(b), explain more thoroughly what is going on; in 3(a), what is the x-axis?).

Correctness: Yes, the claims, methods, and empirical methodology are correct.

Clarity:
- The paper is written moderately well, but there are a fair number of typos scattered throughout the text and certain parts are unclear.
- I found Section 3.1 unclear; it took a while to understand. Examples:
  - Line 108: "Benefit from this hypothesis" -> "To benefit from this hypothesis"
  - Line 128: "It can be proved in random sampling" — what can be proved "in" random sampling?

Relation to Prior Work: Yes, the paper does a good job of framing its work relative to prior work.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper proposes a novel neural architecture search method based on the optimization of the configuration of predefined blocks. The main contribution of the paper is the definition of a search space, an encoding, which results in strong performance with very few evaluations. Other contributions include a first optimization step called refining, and a filtering of candidate points based on clustering, which are then fed into a classical Bayesian optimization method. Some improvements are demonstrated on state-of-the-art models; results also suggest that optimal block stacking might be narrower in the first stages and larger in the later stages. Edit post rebuttal: I think the authors did a reasonable job in addressing my questions. My biggest complaint was the poor justification of the BSSC, but with the additional explanation and experiment I feel confident it is a beneficial addition. I amended my score.

Strengths: The strengths of this paper are methodological and experimental. The proposed method seems to improve upon the state of the art. Given the relative maturity of block-based classification models, gains are often a few percentage points or lower, but I would argue this is still significant. Some authors have studied block stacking approaches, but their computational budgets seem to be significantly higher (no direct comparison is provided, however).

Weaknesses: Even though I find the proposed approach interesting and think it can have impact in the field, there are significant issues in the justification and perhaps the soundness of the method. Some steps of the method are not justified in a convincing manner; further experiments might help with this.

Some key elements are missing for a clear understanding of the method (this ties in to clarity). For instance, the authors state (page 4, lines 133-137): "Specifically, we first randomly select one dimension i of BSSC, then BSSC_i will be increased by a predefined step size if this doesn’t result in larger FLOPs or latency than the threshold. This process will be looped until no dimension can be increased." This seems to assume that increasing the parameters / number of blocks (at this point it is still not clear what the parameters refer to) will necessarily improve the performance of the resulting network. Is that a verified assumption?

More importantly, I could not understand exactly what the purpose of BSSC refining is -- it should be better explained and justified. The proposed one-layer neural network takes as input a BSSC and outputs a new BSSC that has a lower Euclidean distance if accuracies are similar. How does this modification of the BSSC make sense in the parameter space? Why does lowering the Euclidean distance of points result in better candidate suggestions? How is the one-layer network initialized when starting the optimization? Is it pre-learned on other datasets? Why did you do an ablation study for the clusters but not for the refining?

Some detail is also lacking in order to be able to reproduce this work. For the Bayesian optimization part, how is the acquisition function optimized? What is the set of candidates used for the optimization -- the set of k-means cluster centers (not made explicit anywhere in the paper)? Some points mentioned above are also important with regards to reproducibility.
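For concreteness, my reading of the quoted loop (page 4, lines 133-137) can be sketched as follows. This is only a reviewer's sketch, not the authors' code: the names `cost`, `step`, and `threshold` are hypothetical, and it assumes `cost` (FLOPs or latency) is monotonically increasing in each BSSC dimension, which is exactly the kind of assumption I would like the authors to verify:

```python
import random

def refine_bssc(bssc, step, threshold, cost):
    """Greedily grow a BSSC until the budget is saturated.

    bssc:      list of ints (e.g., blocks / channel multipliers per stage)
    step:      predefined increment per dimension
    threshold: FLOPs (or latency) budget
    cost:      function mapping a BSSC to its FLOPs / latency
    """
    growable = set(range(len(bssc)))
    while growable:
        # Randomly select one dimension i of the BSSC.
        i = random.choice(sorted(growable))
        candidate = list(bssc)
        candidate[i] += step
        if cost(candidate) <= threshold:
            # Accept the increase; dimension i may grow again later.
            bssc = candidate
        else:
            # Under a monotone cost, this dimension can never grow again.
            growable.discard(i)
    return bssc
```

Note that this interpretation only guarantees the budget is saturated, not that accuracy improves with each accepted increase, which is why the assumption above matters.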

Correctness: The method may or may not be correct, depending on how the points above are addressed.

Clarity: The first 4-5 pages of the paper are hard to digest and could benefit from proofreading; there are numerous grammar mistakes. Find some comments on clarity below, but the paper probably needs to be rewritten.

The names "block-wise structure" and "block stacking style" are confusing (at least to this reviewer); they sound like they refer to the same thing. I suggest replacing "block-wise structure" with something else: it does not imply that the structure of the block is modified, but rather the structure with regard to the blocks, i.e., the layout of blocks. This is not a major point, but I think it would help with adoption of your method.

Line 118: what is R? Are all x_i's in the same set of values?
Line 119: the sentence "|·| denotes the number of elements for a set" is unneeded in this paragraph.
Section 3.2: what is the definition of accuracy discrepancy in Figure 2? Discrepancy can be defined in many ways, so it seems important to include the equation used, or at the very least a reference.
Table 2 header: "clustring" -> "clustering".
Figure 3(b) should also reproduce the original relationship between accuracy and unrefined BSSC for easier comparison.

Relation to Prior Work: To my knowledge the authors correctly differentiate their contributions from that of previous works.

Reproducibility: Yes

Additional Feedback: I think the authors should add an ablation study on the refining process. It would also help to gain a better understanding to study the way in which BSSCs are influenced by the refining step, as well as more information on how it is trained.

Review 3

Summary and Contributions: The authors propose a novel NAS method that searches the Block Stacking Style (BSS). Different from most NAS methods, which search the operations and topology of a block, this paper focuses on searching the computation allocation of a network, i.e., it searches the width and depth of the network for a given block. I think this paper gives a new perspective to the NAS community: how the building blocks of a network are stacked has a big impact on performance. Overall, this paper is well written and the idea is novel; experimental results on classification, detection, segmentation, and model compression demonstrate the effectiveness and generalization of the proposed method.

Strengths: 1. The main idea of studying computation allocation by searching the Block Stacking Style is novel, and may give a new perspective to the NAS community. The proposed method is also complementary to most block-based NAS methods. 2. The proposed Bayesian search algorithm is sample-efficient, and does not involve biased tricks commonly used in NAS methods, such as early stopping or weight sharing.

Weaknesses: 1. Some existing NAS methods can also search the width and depth of a network, such as Single Path One-Shot and DARTS; more discussion of these methods should be added. 2. The results are obtained under FLOPs constraints; I wonder whether the proposed AutoBSS can also be used under latency constraints.

Correctness: correct

Clarity: Clearly written

Relation to Prior Work: Since this paper focuses on searching width and depth, it would be better to specifically discuss the connection between the proposed method and existing NAS methods that can search the width and depth of networks, such as Single Path One-Shot, DARTS, and EfficientNet.

Reproducibility: Yes

Additional Feedback: After reading the rebuttal and the comments from the other reviewers, I would like to keep my original rating. Different from most NAS methods, which search the operations and topology of a block, this paper focuses on searching the computation allocation of a network, which is meaningful for improving hardware utilization on AI chips.

Review 4

Summary and Contributions: This paper proposes a Bayesian optimization based search method. It can find an optimal block stacking style within tens of trials. The experiments demonstrate the effectiveness of the proposed method.

Strengths: The idea of designing the Block Stacking Style is interesting. The experiments on image classification, object detection, instance segmentation, and model compression demonstrate the generalizability of the proposed method. The paper is well written and easy to follow in general.

Weaknesses: 1. From the Detectron2 model zoo, I find that RetinaNet-R50 with the 3x learning schedule achieves 38.7. In Table 4, the result is 37.02. I would like to know why there is a big gap between the model zoo results and your results. The same applies to Mask R-CNN. 2. In Table 1, I observe that you conduct experiments on EfficientNet-B0 and EfficientNet-B1. Why do you not conduct experiments on EfficientNet-B7 or some stronger network?

Correctness: The method and empirical methodology are correct.

Clarity: This paper is well written.

Relation to Prior Work: The difference between this paper and previous works is stated clearly.

Reproducibility: Yes

Additional Feedback: k-means is adopted in your method. Did you try other clustering methods? How long is the training time, and what is the number of GPUs? Do you have any plan to make your source code and models public? I would be happy to see results on pixel-level applications such as semantic segmentation.