NeurIPS 2020

On Warm-Starting Neural Network Training

Review 1

Summary and Contributions: This paper addresses warm-starting: the loss of performance observed when a model is retrained using an existing converged model as initialization. In particular, the paper focuses on cases where the amount of data used for training the second model is larger than for the first. After a comprehensive analysis, the paper proposes shrinking and perturbing the weights as a way to improve retraining with new data.

Strengths: - Very relevant problem that does not seem to be discussed much in the literature. If solved, it could have a large impact for the community. - A simple method, easy to implement and use in different scenarios

Weaknesses: - Experimental paper with little to no theoretical contribution - The take-home message seems not very convincing (see below). For instance, it seems this is particularly useful when the new model has limited data (which is far from the motivation in the introduction).

Correctness: - The paper is mostly based on empirical evidence and is conducted in a methodical manner. - There is no clear take-home message. It is not clear to me if, in the end, the original concern is solved: is the proposed method able to reduce the training time of a network when additional data is added to the training set?

Clarity: The paper is clear and easy to read and follow.

Relation to Prior Work: This part seems correct to me.

Reproducibility: Yes

Additional Feedback: - The images do need some improvement. In the printed version, it is difficult to tell what is doing better or worse. For instance, in Figure 9 the proposed method is said to perform at least as well as the best performing method, but I do not think I can see that from the figures.

Review 2

Summary and Contributions: The authors address the problem of training from a "warm start" model and propose an approach to solve it. Extensive empirical evaluations demonstrate that training from a "warm start" model hurts generalization. To solve this problem, the authors propose a shrink-perturb method that efficiently closes the generalization gap between training from a "warm start" model and from a random initialization.

Strengths: 1. Extensive empirical evaluation. They empirically demonstrate the problem of worse generalization when learning from a warm-started model, as well as the effectiveness of the shrink-perturb method in solving it. 2. Worse generalization from a warm-started model is a common problem, but few works have addressed it formally. The authors empirically evaluate the problem with specifically designed experiments.

Weaknesses: 1. I think the present work lacks sufficient analysis, even though there are many empirical validation results. It would be better if the authors could reason more about what lies behind the extensive empirical evaluations. 2. It would be better to evaluate on a large-scale dataset such as ImageNet.

Correctness: 1. A fixed small learning rate of 0.001 could lead to poor generalization, which affects the results of the comparison. Training might arrive at a sharp local minimum when using a small learning rate. For SGD, a learning rate schedule should be involved. 2. In Section 3.1, why choose the Pearson correlation to show the difference between the optimized solution and its initialization?

Clarity: The paper is mostly well written. One suggestion is to shrink the font size of the figure captions.

Relation to Prior Work: I believe the authors clearly present related works and their differences.

Reproducibility: Yes

Additional Feedback: As seen in Figure 4, there is a point in the first few epochs where the model achieves both high test accuracy and less difference. This indicates that overfitting leads to poor generalization when training from an over-fitted warm-start model. It would be interesting to investigate more along this direction.

Review 3

Summary and Contributions: The authors of this article have made an extensive study of the phenomenon of overfitting when a neural network (NN) has been pre-trained: pre-training a neural network with 50% of the available data, then training it with 100% of the available data, leads to poorer performance than training it directly with 100% of the available data. They compare these two setups with different learning rates (LR), batch sizes, pre-training epochs, and regularization factors. Moreover, they demonstrate that altering the SGD update by shrinking the updated weights and adding noise to them prevents such overfitting.
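The shrink-and-perturb update summarized above can be sketched in a few lines. This is a minimal NumPy sketch, not the authors' implementation; the shrink factor `lam` and noise scale `sigma` below are illustrative placeholders, not the paper's actual hyperparameters.

```python
import numpy as np

def shrink_perturb(weights, lam=0.6, sigma=0.01, rng=None):
    """Return shrink-and-perturbed copies of converged weight arrays:
    each array is scaled toward zero by lam, then fresh Gaussian noise
    of scale sigma is added, before training resumes on the full data.
    lam and sigma are illustrative values, not the paper's settings."""
    rng = np.random.default_rng(0) if rng is None else rng
    return [lam * w + sigma * rng.standard_normal(w.shape) for w in weights]

# Toy usage on two "layers" of converged weights; sigma=0 isolates the shrink.
layers = [np.ones((3, 3)), np.full(4, 2.0)]
shrunk = shrink_perturb(layers, lam=0.5, sigma=0.0)
```

In a real training loop the same transformation would be applied to every parameter tensor of the converged model before continuing SGD on the enlarged dataset.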

Strengths: The authors have tested the main hyperparameters we usually tune when training a NN. Moreover, they have excluded from their study some tricks that might alter the fairness of their comparison, such as LR schedules or data augmentation. Notably, they have shown experimentally that pre-training causes overfitting for a wide range of hyperparameters.

Weaknesses: The problem studied by the authors is not a major one. The authors do not explain their findings about the effect of pre-training, either experimentally or theoretically. It would be valuable to understand *why* such overfitting is observed when training a pre-trained NN. I was expecting an explanation more than a solution to the problem. EDIT: after reading the other reviews and the rebuttal, I think there is a lack of either theoretical grounding or extensive experiments that would have helped to understand precisely this "warm-starting problem". The authors have run experiments with RNNs, which is helpful, but I still think that their observations should be validated on a wider range of tasks (e.g., regression...) and NN models (e.g., VGG or other CNNs). Even a negative result in one case would be interesting.

Correctness: The authors have tested enough sets of hyperparameters to validate their claim about the effect of pre-training (LR, weight decay, number of pre-training epochs, batch size). The proposed technique ("shrink and perturb", Section 4) seems to be tested only with ResNet-like architectures.

Clarity: There is no major writing issue.

Relation to Prior Work: The authors have cited the main papers about weight initialization, as well as a paper about the link between overfitting and the distance of the weights from their initialization. The latter corroborates the experimental results of the paper. Apparently, the studied problem has not been addressed before.

Reproducibility: No

Additional Feedback:

Review 4

Summary and Contributions: The paper addresses the problem of coming up with tricks to warm-start network training. The authors demonstrate that warm-starting / fine-tuning from existing weights causes a drop in generalization performance. To mitigate this problem they propose "shrink and perturb", which scales the weights and adds noise to them.

Strengths: The paper sheds light on the phenomenon where training from scratch performs better than warm-starting / fine-tuning on CIFAR10. Although this does not seem to be a previously noted phenomenon, if this behavior actually occurs commonly in continual learning, it could be of value to the community and open up future research directions. The paper shows that this phenomenon is robust to regularization, holds across a couple of models, and provides various empirical studies. The paper proposes a method called "shrink and perturb" that allows one to train models with warm starts.

Weaknesses: The paper is limited to evaluating on CIFAR/SVHN, and I worry that this phenomenon may not extend to other methods and tasks. Warm-starting, in the context of the authors' problem setup, seems to be basically the same thing as fine-tuning with more data. This phenomenon does not seem to occur on more sophisticated computer-vision tasks, and fine-tuning from datasets like ImageNet leads to similar or better performance with much faster convergence. Although the label space is different in many fine-tuning setups, one can imagine extending the existing setup to cover common and more realistic problems.

The paper is written to motivate the idea of re-using weights for the continual/online learning setting, but splitting the dataset into 2 sets (training with 1 and fine-tuning with both) seems to me a little toyish and an unconventional continual learning setting. In online / continual learning there is a distribution shift as the data arrives, but here the dataset seems to be randomly split, meaning that in expectation the distributions of these 2 sets should be the same. It would have been interesting to see the current setup on a distribution-shifting dataset. Furthermore, due to the size of the dataset used in the experiments, I wonder if the model is just overfitting to the training set (50% of the dataset), which makes it harder for the model to recover. What happens if you do not wait until 100% convergence?

There does not seem to be enough theory behind why the drop in generalization happens. Although I am not against the empirical findings on the phenomenon, the motivation for why shrink and perturb solves this seems a little lackluster. Either a stronger theory, or empirical results on a wide range of realistic and challenging tasks, would have made this paper stronger.

Correctness: Yes, the methods are clear.

Clarity: The paper is well written and easy to read.

Relation to Prior Work: To my knowledge, the paper proposes to solve a problem that has not been tackled by prior works.

Reproducibility: Yes

Additional Feedback: