Review for NeurIPS paper: Improving Auto-Augment via Augmentation-Wise Weight Sharing

NeurIPS 2020

Review 1

Summary and Contributions: POST REBUTTAL: After reading the authors' response and discussing with the other reviewers, I decided to keep my score. In this paper, the authors propose Augmentation-Wise Weight Sharing (AWS), a weight-sharing strategy for efficiently searching for data augmentation operations in image classification. AWS is based on a simple observation: data augmentation is more effective later in the training process than earlier. Thus, AWS trains a single model for a while and then performs the data augmentation search only for the last few epochs, with every candidate starting from the trained shared weights. I think this is a simple and elegant observation. The authors report strong empirical results on CIFAR-10, okay-ish generalization to CIFAR-100, and misreport the results for ImageNet. Please see my comments in Weaknesses for more details. Overall, I think the paper contributes something nice to the field. Spotting similar patterns, i.e., where and when an AutoML process matters the most, and sharing the parts that don’t matter, is a very simple and elegant takeaway. However, the authors do need to include a fairer comparison against existing search approaches and update their claims accordingly. If this is not done, I would be very uncomfortable seeing this paper accepted, even though I really like the method and trust that it works.
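The schedule summarized above can be sketched in a few lines. This is only a toy illustration of the control flow as described in this review, not the authors' implementation: `train`, `evaluate`, `TOY_GAIN`, the policy names, and the `shared_policy` argument are all hypothetical stand-ins, and the numbers are made up. The point it shows is the cost accounting: the early epochs are trained once and reused by every candidate.

```python
EPOCHS_TRAINED = 0  # global counter so the total search cost can be inspected

# Hypothetical per-epoch benefit of each policy; made-up numbers for this toy only.
TOY_GAIN = {"none": 0.0, "flip": 0.2, "cutout": 0.5, "rotate": 0.3}

def train(weights, policy, epochs):
    """Stand-in for real training: 'weights' is just the tuple of policies used so far."""
    global EPOCHS_TRAINED
    EPOCHS_TRAINED += epochs
    return (weights or ()) + (policy,) * epochs

def evaluate(weights):
    """Stand-in for validation accuracy: total toy gain over the training history."""
    return sum(TOY_GAIN[p] for p in weights)

def search_with_shared_weights(candidates, total_epochs, late_epochs, shared_policy="none"):
    # Early stage: trained once, shared by all candidates (assumed policy-agnostic here).
    shared = train(None, shared_policy, total_epochs - late_epochs)
    scores = {}
    for policy in candidates:
        # Late stage: every candidate fine-tunes from the same shared weights.
        finetuned = train(shared, policy, late_epochs)
        scores[policy] = evaluate(finetuned)
    best = max(scores, key=scores.get)
    return best, scores

best, scores = search_with_shared_weights(
    ["flip", "cutout", "rotate"], total_epochs=100, late_epochs=10
)
shared_cost = EPOCHS_TRAINED  # 90 shared epochs + 3 candidates * 10 late epochs
naive_cost = 3 * 100          # training every candidate from scratch instead
```

Under these toy settings the shared schedule trains 120 epochs in total, versus 300 if each of the three candidates were trained from scratch.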

Strengths: [Elegant, well-motivated, and well-presented method] The whole AWS scheme is very elegant. The empirical gains on CIFAR-10 are impressive. While the gains do not carry over to CIFAR-100 and to ImageNet, I believe that the authors simply don’t have the resources to tune extensively on these datasets.

Weaknesses: [Missing empirical results from previous work] In Table 2, why did the authors omit AdvAA, which they present in Table 1? I checked the AdvAA paper, and they obtained 20.6% test error on ImageNet with ResNet-50, i.e., the same performance as AWS. The authors should add this comparison. Sure, it makes the results of AWS look weaker, but it wouldn’t devalue AWS.

Correctness: I could trust the correctness of the paper’s results.

Clarity: The paper is clearly written and the method is easy to understand.

Relation to Prior Work: I think the paper appropriately cites related work.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The authors propose to improve AutoAugment, a prior method for acquiring augmentation policies automatically, with the Augmentation-Wise Weight Sharing scheme.

Strengths: 1. Good performance. 2. The finding in Figure 2 is good. The motivation for the Augmentation-Wise Weight Sharing scheme is clear. 3. The method is easy to follow and makes sense.

Weaknesses: 1. Fig. 1 is used to validate the claim "Specifically, the compromised evaluation process would distort the ranking for augmentation strategies since the model trained with too few iterations are unstable." However, the figure cannot support the claim: a change in rank does not indicate that the model is unstable. It just shows that the improvements caused by the three methods are different. A method's rank may decrease while its accuracy still improves. Also, what it means for a model to be stable or not needs more clarification. 2. Overall, the proposed method is an incremental improvement on the existing AutoAugment, and the performance improvements are not very significant when compared to recent methods. On ImageNet, ResNet-50 is only 0.3 better than [18]. Also, the std is very large, which leads to doubts about generalizability. 3. For Table 1, it is said that "We report Mean STD (standard deviation) of the test error rates wherever available." But no std is reported in the table.
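To make point 1 concrete, here is a small sketch with made-up numbers (the strategy names and accuracies are hypothetical, not taken from the paper). It shows that a strategy's rank can drop between a short and a long training run even though its own accuracy improves, so a rank change by itself says nothing about instability.

```python
def ranks(acc):
    """Map each strategy to its rank (1 = best) by descending accuracy."""
    ordered = sorted(acc, key=acc.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

# Made-up accuracies (%) for three hypothetical strategies at two training lengths.
short_run = {"A": 80.0, "B": 79.0, "C": 78.0}
long_run = {"A": 93.0, "B": 94.5, "C": 92.0}

rank_short = ranks(short_run)
rank_long = ranks(long_run)

# Strategy A falls from rank 1 to rank 2, yet its accuracy rose by 13 points.
a_rank_change = (rank_short["A"], rank_long["A"])
a_accuracy_gain = long_run["A"] - short_run["A"]
```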

Correctness: correct

Clarity: Well written

Relation to Prior Work: Clearly discussed

Reproducibility: Yes

Additional Feedback: I've read other reviewers' reviews and the rebuttal. Most of my concerns have been addressed. I increased my score.

Review 3

Summary and Contributions: This work observes that augmentation is better applied in the late stage of training than in the early stage. Based on this observation, it proposes to share the early-stage weights and fine-tune them during the augmentation search. This design significantly speeds up the search and optimization, and also reaches SOTA performance.

Strengths: The key observation in this paper is interesting, and is empirically evaluated. The weight sharing idea is not new, but is naturally combined with the observation in the proposed pipeline. On CIFAR it achieves significant improvement. The supplementary materials include detailed information on experiment setup.

Weaknesses: - The ResNet-50 gain on ImageNet looks marginal: 20.73+0.17 = 20.9, which is close to 21.07 from OHL. It would be better to also compare with the Enlarge Batch trick. - This paper makes its observation on the image classification task, with training from scratch. It would also be important to evaluate whether transfer learning with a pre-trained backbone still exhibits this phenomenon. Update after Rebuttal: I read the opinions from the other reviewers and the feedback from the authors. The feedback is valid, but not strong enough to improve my rating, so I'd like to keep my score.

Correctness: The proof, claims, and empirical settings appear correct to me.

Clarity: The paper is easy to follow.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: - The Enlarge Batch trick is said to be unaffordable, which is fair. Still, there is a workaround to approximate large-batch training: during training, let the model perform forward and backward passes over several batches while accumulating the gradients, and apply the accumulated gradient to the weights only once the accumulated number of samples is at least 16,384. - Evaluate the key observation on other tasks involving transfer learning.
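The gradient-accumulation workaround suggested above can be sketched as follows. This is a minimal pure-Python illustration on a hypothetical 1-D linear model with synthetic data, not the paper's training setup; the function names, learning rate, and sizes are assumptions chosen for the example. It shows that summing gradients over micro-batches and applying one update gives the same step as a single large batch.

```python
import random

def grad_mse(w, batch):
    """Gradient of the squared error for the model y = w * x, summed over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def large_batch_step(w, data, lr):
    """One SGD step computed on the whole (large) batch at once."""
    return w - lr * grad_mse(w, data) / len(data)

def accumulated_step(w, data, lr, micro_batch_size):
    """The same update, computed one micro-batch at a time."""
    grad_sum = 0.0
    seen = 0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        grad_sum += grad_mse(w, micro)  # forward + backward only, no weight update yet
        seen += len(micro)
    return w - lr * grad_sum / seen     # single update after all samples are seen

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in xs]  # synthetic data, true slope 3

w_large = large_batch_step(0.0, data, lr=0.1)
w_accum = accumulated_step(0.0, data, lr=0.1, micro_batch_size=8)
```

The two updates agree up to floating-point rounding. One caveat: in a real network with batch normalization, the normalization statistics are still computed per micro-batch, so accumulation only approximates true large-batch training.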