NeurIPS 2020

Revisiting Parameter Sharing for Automatic Neural Channel Number Search

Review 1

Summary and Contributions: POST REBUTTAL: After reading the authors' response and discussing with other reviewers, I decided to keep my score. The authors propose a general weight-sharing scheme to efficiently search for the number of channels at each layer of a predefined ConvNet. The general scheme, called Affine Parameter Sharing (APS), bridges two extremes of parameter sharing for efficient architecture search: (1) “ordinal selection”, which makes the weights of different convolutional kernels completely entangled, and (2) “independent selection”, which makes all the weights independent. After providing a middle ground between these two extremes, the authors propose a quantitative measure of how one update using parameter sharing on a particular architecture would affect other parameters. The measure is defined via the Frobenius norm of the covariance of APS. Based on this measure, the authors optimize APS for an optimal parameter sharing strategy, where optimality is defined by the authors’ intuition (Equation 4, Section 3.4). The resulting strategy is called Transitionary APS (APS-T). Finally, the authors perform RL search using the APS-T strategy and claim that their search leads to stronger empirical results. I like the presentation and the motivation, and I strongly appreciate the novelty of these contributions. I am not convinced that the empirical results are strong. Quite the opposite, I think the empirical results of this paper are relatively weak. That said, I suspect better results could be obtained by different optimizations of APS. In particular, if the authors do not optimize APS for the objective in Equation (4), the resulting parameter sharing scheme would be different from APS-T and could lead to better results. This is a potential future direction, but I would recommend accepting this paper for laying the groundwork to quantitatively measure the effects of parameter sharing.

Strengths: [Good motivation and good presentation] The authors do a good job in presenting their intuitions about parameter sharing. This is perhaps one of the first attempts to quantify the positive or negative effects of parameter sharing in architecture search. Not only does the quantitative measurement of the effects of parameter sharing foster a better understanding of the practice, it also leads to a tangible, optimizable objective. The authors show that such an objective can be optimized and leads to empirical gains. These are good contributions. [Novelty] The formulation of APS and the derivations of APS-T are all novel.

Weaknesses: [Not particularly strong empirical results] This is evident in Table 1, where the improvements on CIFAR-10 are relatively small, probably even smaller than the standard deviations of some baselines. The same holds for Table 2’s results on ImageNet.

Correctness: I could trust the correctness of the paper’s results.

Clarity: The paper is not clearly written. Affine Parameter Sharing (APS), which is an important contribution, is poorly explained. For example, in Section 3.1: What is A at Line 100, as in P^A and Q^A? Also, just below that, Equation (2) is technically wrong, because matrix multiplication is not defined for 4-dimensional tensors. The authors should strive to be more accurate when using maths. While Section 3.1 coupled with Figure 1 makes the method understandable, each of them alone obfuscates the presentation. The authors should strive to improve their presentation.

Relation to Prior Work: I think the paper appropriately cites related works and discusses their relationship to the APS method. Sure, some AutoML citations are missing, but that literature is too vast. I think this paper does a good job using the related work to motivate its APS method as a middle ground between ordinal selection and independent selection.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: Parameter sharing is often used in neural channel number search. It is generally believed that, with this strategy, an update to one network structure also affects other subnetworks and thereby improves search efficiency. However, it also couples the optimization of different networks, making good architectures harder to discriminate. This has long been a difficult problem in the NAS field. To address it, this paper proposes a new strategy, APS, to balance training efficiency against architecture discrimination. How to inherit parameters under a given network structure is an interesting problem, and one that has not received dedicated attention before.

Strengths: -- Affine parameter sharing (APS) is proposed to describe the strategy of parameter sharing. -- This paper proposes a transitional strategy, namely transitionary APS. Firstly, the high sharing strategy is selected, and then the simulated annealing method is used to reduce the sharing degree.

Weaknesses: -- In many one-shot NAS models, there is a conflict between search efficiency and model coupling. A paragraph should be added explaining how the proposed method can be used to alleviate it. -- The description of the RL-based controller and the parameter update algorithm in the appendix is not detailed enough. -- Where the sharing level "\Phi" is introduced, I hope the authors give an example illustrating how it is computed, to make it easier to understand. -- In the experiments, the structure under a given pruning rate is used to verify whether the strategy is effective. The CIFAR-10 dataset alone needs 600 epochs. In actual neural architecture search (NAS), the structure space is very large, so will the computational cost explode? What, then, is the practicality and significance of this method?

Correctness: -- Figure 1 and Figure 2 do not explain the method clearly. The meaning of many terms is not given. -- How to measure the discrimination of architecture?

Clarity: -- The article is hard to read, and the symbols and equations are confusing. For example, in Equation (2), what is the meaning of \times_2 and \times_1?
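For readers with the same question, one plausible reading is that \times_1 and \times_2 denote mode-n tensor products, which contract one axis of the 4-dimensional kernel with a matrix. The sketch below uses the standard definition, (W \times_n U)[..., j, ...] = \sum_i U[j, i] W[..., i, ...]; whether Equation (2) uses exactly this convention, and which axis is mode 1, are assumptions, and the function and variable names are illustrative only.

```python
def mode_product(tensor, matrix, mode):
    """Mode-n product of a 4-D kernel (nested lists) with a matrix, for mode 1 or 2.

    tensor[o][i] is a k_h x k_w spatial patch (output channel o, input channel i).
    matrix has shape (J, I_n): J new channels, I_n channels along the chosen mode.
    """
    c_out, c_in = len(tensor), len(tensor[0])
    k_h, k_w = len(tensor[0][0]), len(tensor[0][0][0])

    def combine(patches, weights):
        # Weighted sum of spatial patches, entry by entry.
        return [[sum(w * p[r][c] for w, p in zip(weights, patches))
                 for c in range(k_w)] for r in range(k_h)]

    if mode == 1:  # contract the output-channel axis
        return [[combine([tensor[o][i] for o in range(c_out)], row)
                 for i in range(c_in)] for row in matrix]
    if mode == 2:  # contract the input-channel axis
        return [[combine(tensor[o], row) for row in matrix]
                for o in range(c_out)]
    raise ValueError("mode must be 1 or 2")


# Toy super-kernel: 3 output channels, 2 input channels, 1x1 spatial size.
W = [[[[10 * o + i]] for i in range(2)] for o in range(3)]

# Hypothetical selection matrices: keep the first 2 output and first 1 input channels.
P = [[1, 0, 0], [0, 1, 0]]  # shape (2, 3), acts on mode 1
Q = [[1, 0]]                # shape (1, 2), acts on mode 2

sub_kernel = mode_product(mode_product(W, P, 1), Q, 2)
```

With these one-hot matrices the result is simply the top-left slice of the super-kernel, which is the "ordinal selection" special case; dense P and Q would instead mix channels.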

Relation to Prior Work: Yes, clearly.

Reproducibility: Yes

Additional Feedback: Check the problems mentioned above. ------- Thank you for your rebuttal, which clarifies some of my questions. However, there are still doubts about the practicability of the method.

Review 3

Summary and Contributions: The paper mainly aims to deal with a common dilemma in channel number search: “to share or not to share” among channels. The weight-sharing scheme (such as the ordinal sharing strategy) can significantly speed up training; however, the discrimination ability may be harmed by the correlation between different channel configurations. The independent scheme has the opposite trade-off. Based on these observations, the paper introduces a new affine parameter sharing strategy that tries to bridge the gap between the two basic schemes. Moreover, a new metric, “level of affine parameter sharing”, is proposed to measure the trade-off between sharing and independence. Combined with an RL-based search framework, the proposed method is demonstrated to outperform the ordinal sharing and independent selection strategies.

Strengths: 1. The proposed method seems technically sound and makes sense. 2. The analysis of the search dynamics (Line 142) is very interesting and may help to understand how parameter sharing works in neural architecture search. 3. The paper is clearly written. Experiments and ablation studies are well performed and provide many details for reproduction.

Weaknesses: 1. I think the major drawback of the paper is the lack of detailed discussion of and comparison with MetaPruning [15], which also focuses on the direction of “semi-sharing” for channel search. Rather than semi-orthogonal projections from a shared meta-weight, [15] proposes to share a meta-network among all channel configurations, which sounds more general and elegant to me. Also, [15] suggests it is possible to train the meta-network end-to-end with the task loss, without additional optimization (like Eq 4) or learning rate tuning (Line 166). So I am curious about the result of directly optimizing P and Q with the task loss instead of Eq 4. In addition, although the proposed APS-T outperforms MetaPruning in Table 2, the gap is quite minor, and the training methods do not seem to be aligned. I think a fair ablation study may be required to compare the two sharing strategies. 2. In Line 245 the authors claim that the method may benefit from larger search spaces. However, since P and Q seem to be independent parameters for different configurations (Line 100), I doubt the efficiency of scaling up, whereas existing counterparts like MetaPruning do not suffer from this issue.

Correctness: The proposed method seems conceptually correct. I did not check the detailed formulations and derivations.

Clarity: The paper is very well written.

Relation to Prior Work: See the weaknesses part.

Reproducibility: Yes

Additional Feedback: --------------------------- Comments after rebuttal: Thanks for the rebuttal. As for the comparison with MetaPruning [15], I agree with the point that “the sharing scheme of [15] is less interpretable and controllable” mentioned in the rebuttal. However, I still feel that the proposed method shares a similar idea with MetaPruning while being more complex to optimize, and no obvious benefits are obtained according to the experiments. In particular, when generalized to larger search spaces, the parameters grow linearly with the number of candidates (Line 43 in the rebuttal), while MetaPruning does not suffer from this issue. So I downgrade the rating to borderline, leaning toward rejection.

Review 4

Summary and Contributions: This paper first proposes affine parameter sharing as a general formulation, which unifies existing channel search methods. Authors find that with parameter sharing, weight updates of one architecture can simultaneously benefit other candidates. Finally, the authors propose a new parameter sharing strategy. Experiments are conducted to support the proposed method.

Strengths: This paper provides a unified understanding of existing channel search algorithms. The topic is of great importance and may attract certain interest from the NeurIPS community.

Weaknesses: One issue is the soundness of the proposed transitionary strategy. Controlling the level of parameter sharing Phi makes sense as a way to realize a trade-off between training efficiency and discriminative ability. However, note that Phi depends on both the kernel weights and P and Q. The authors simply optimize P and Q to decrease Phi, which might not work, since the weights are also optimized during training and Phi is therefore not guaranteed to decrease over training epochs. Most importantly, the framing of this paper might not be entirely appropriate. As illustrated in Eq. (2) of the paper, the new weights are simply a linear transformation of a predefined super-kernel. In this sense, all weights are actually fully shared with this super-kernel, and they are derived by a special linear layer, P and Q, which is also trainable according to Section 3.4. The defined sharing level Phi is then actually the linear correlation between different kernel weights. Thus I wonder whether the claim about the level of parameter sharing is strictly proper, since the weights are already fully shared with the super-kernel through P and Q. I think it is more natural to say that the linear correlation between different kernel weights matters for channel number search.
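The last point can be made concrete with a small sketch. When two candidate kernels are both linear projections of one super-kernel, how strongly an update to one moves the other is governed by how their projection matrices overlap. The code below uses the Frobenius norm of P_a P_b^T as a stand-in for that coupling; this proxy is an assumption for illustration only and is not the paper's definition of Phi (which also depends on the kernel weights). All function names are hypothetical.

```python
import math


def cross_gram_norm(p_a, p_b):
    """Frobenius norm of P_a P_b^T for two row-major projection matrices."""
    total = 0.0
    for row_a in p_a:
        for row_b in p_b:
            dot = sum(x * y for x, y in zip(row_a, row_b))
            total += dot * dot
    return math.sqrt(total)


def select_rows(indices, width):
    """Selection matrix whose rows are one-hot over `width` super-kernel channels."""
    return [[1.0 if col == idx else 0.0 for col in range(width)]
            for idx in indices]


# Ordinal-style selection: a 2-channel and a 3-channel candidate both start at channel 0.
ordinal_a = select_rows([0, 1], 4)
ordinal_b = select_rows([0, 1, 2], 4)

# Disjoint selection: the two candidates read different super-kernel channels.
disjoint_a = select_rows([0, 1], 4)
disjoint_b = select_rows([2, 3], 4)

coupled = cross_gram_norm(ordinal_a, ordinal_b)      # two shared channels
decoupled = cross_gram_norm(disjoint_a, disjoint_b)  # no shared channels
```

Under this proxy the ordinal pair is coupled (norm \sqrt{2}, one unit per shared channel) and the disjoint pair is not (norm 0), even though every candidate is derived from the same super-kernel. This matches the reviewer's reading that what varies is the linear correlation between derived kernels rather than whether they share a source.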

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: After rebuttal: I have read the responses and the other reviewers' comments. The authors have addressed most concerns; the remaining one is the unsatisfactory experimental comparison with similar methods such as MetaPruning. So I keep the original score.