Review for NeurIPS paper: What is being transferred in transfer learning?

NeurIPS 2020

What is being transferred in transfer learning?

Review 1

Summary and Contributions: Through a series of experiments, this paper explores what enables a successful transfer and which parts of the network are responsible for it. Specifically, it examines the role of feature reuse, the mistakes made by the networks, feature similarity, distance in feature space, performance barriers and basins in the loss landscape, module criticality, the spectrum of different modules, and which pre-trained checkpoints are useful. As a result, several interesting observations and insightful discussions are provided.

Strengths: + The paper is well written and enjoyable to read. + The motivations are clear, some well-designed experiments are conducted, and interesting observations are provided. + Some of the analyses are insightful.

Weaknesses: Important issues: - Line 137-138: More evidence should be provided; this conclusion (higher-order statistics of the data that are not ruined in the shuffling lead to the significant benefits of transfer learning, especially on optimization speed) is not well supported. - Line 179-188: Is there any evidence to support the effectiveness of the proposed measurement? In my opinion, linear interpolation of the two sets of weights is somewhat unreasonable: it can only reflect the linear relationship between the two networks, which is not a reasonable measurement for nonlinear networks. For further improving the paper: - Line 78-79: "[9] show that transfer (even between similar tasks) does not necessarily result in performance improvements." This sentence is misleading: transfer does not necessarily result in performance improvements in some different downstream tasks, while in recognition this is not the case. - Line 108-109: "four target domains with decreasing visual similarities from natural images". Is there any evidence/measurement to confirm this view? - Line 150: PT --> P-T - Line 197-198: an interesting observation, but there is no analysis of it.
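To make the interpolation measurement discussed above concrete, the following is a minimal sketch of evaluating a loss along the straight line between two weight vectors and reading off a barrier. It is not the paper's code: `toy_loss` is a hypothetical non-convex function standing in for a network's loss, and `interpolate`, `loss_along_path`, and `barrier` are illustrative names.

```python
def interpolate(theta_a, theta_b, lam):
    """Return (1 - lam) * theta_a + lam * theta_b, element-wise."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(theta_a, theta_b)]


def toy_loss(theta):
    """Hypothetical non-convex loss with minima at x = -1 and x = +1 per coordinate."""
    return sum((x * x - 1.0) ** 2 for x in theta)


def loss_along_path(theta_a, theta_b, loss_fn, steps=11):
    """Evaluate loss_fn at evenly spaced interpolation coefficients in [0, 1]."""
    lams = [i / (steps - 1) for i in range(steps)]
    return [(lam, loss_fn(interpolate(theta_a, theta_b, lam))) for lam in lams]


def barrier(path):
    """Highest loss on the path minus the higher of the two endpoint losses."""
    losses = [loss for _, loss in path]
    return max(losses) - max(losses[0], losses[-1])


# Two nearby solutions versus two solutions in different minima of the toy loss.
same_basin = loss_along_path([1.0, 1.0], [0.9, 1.1], toy_loss)
diff_basin = loss_along_path([1.0, 1.0], [-1.0, -1.0], toy_loss)
print(barrier(same_basin), barrier(diff_basin))
```

The sketch only probes the loss along one line segment, so whatever it reports depends on where the two endpoints lie; it says nothing about directions off that segment.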

Correctness: I did not find any technical error in the paper.

Clarity: Motivations, ideas and experimental results are well expressed in this paper.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: After reading the rebuttal, I find that my main concerns are addressed and my score is 6 now.

Review 2

Summary and Contributions: This work presents a set of tools and analysis of transfer learning algorithms. Key observations are made from the study of feature-reuse, loss landscape, module criticality and training convergence. The results are verified on transfer learning tasks across datasets of varying levels of visual similarity (ImageNet, CheXpert and DomainNet). Overall, the paper provides new insights for transfer learning.

Strengths: - Transfer learning is studied from various perspectives on real-world datasets. - The presented results are insightful. - The submission of code is helpful.

Weaknesses: The authors should address the following points to improve the submission. 1. There are several references to the appendix, including major results such as the discussion in Sec. 3.2. Thus the reader has to refer to the appendix back and forth. The authors could improve the readability of the paper by incorporating certain results from the appendix into the main paper (see the additional feedback section for suggestions to improve space usage). 2. Certain claims require further clarification: a) L151-152: The following phrase is unclear: “since P-T has strong prior, it finds it harder to adapt to the target domain”. From Fig. 2 it is clear that P-T achieves a better performance than RI-T, and as described in L132, the optimization speed on P-T is more stable than RI-T for smaller block sizes. Likewise, the performance of P-T is better under class imbalance (Appendix L445-446). This is opposite to the claim in L151-152. b) L191-192: “pre-trained weights guide the optimization to a flat basin”. The information in the figures such as Fig. 3 is insufficient to argue about the flatness of the basin. Note that the x-axis in Fig. 3 is the interpolation coefficient, and thus the characteristics of the basin are not guaranteed to be observed along the direction of interpolation. For example, it could be that two finetuneT models are close to each other in the parameter space (due to similar initializations), and thus there is not much variation in the loss landscape along the interpolation, whereas a pair of randinitT models could be far apart (due to different random initializations), and therefore a significant barrier is observed along the interpolation. However, since both plots are overlaid on the same graph (with the x-axis being the interpolation coefficient), it provides a false notion of the “flatness” of the basin. This clarification should be incorporated unless justified otherwise. c) Sec. 3.7 Spectrum: This section requires a brief introduction of the method [26] used to obtain the spectrum. More importantly, there are several examples that run counter to the discussion in Sec. 3.7. For instance, in spectrum-plots-clipart.pdf: Fig. 17, 20, 23, 26, 29, 36, 39 (right) show more concentration towards smaller singular values after the model is trained (blue) when compared with random initialization (orange). This would imply a higher uncertainty after training (L254), which is counterintuitive. Likewise, several CheXpert plots (Fig. 3, 4, 6, 10, 11, 13, 14, 16, 17 (right)) show a stronger result. It is important to discuss this phenomenon. Why is the model driven towards smaller singular values during training (note that this is true for fine-tune plots as well)? Perhaps the random initialization itself possesses larger singular values, while the training along with regularizers such as weight decay (Appendix L410) drives the model towards lower singular values. In such a scenario, the spectrum may not reflect model uncertainty (L245) and the plots for P-T and RI-T are not directly comparable. 3. Fig. 5: This figure is difficult to interpret - please label the x-axis, y-axis and the color scale. What does the x-axis refer to here? (The x-axis contains fractions for Fig. 5a,c and integers for Fig. 5b,d.)

Correctness: Certain claims need further justification (see weaknesses).

Clarity: The paper is well written for the most part. However, one has to refer to the appendix files back and forth to completely understand the proposed ideas. A reorganization of certain sections is required (see additional comments).

Relation to Prior Work: The paper presents a set of analyses to study the effect of transfer learning. While some techniques are based on prior works, the authors sufficiently justify the choice of tools used for analysis.

Reproducibility: Yes

Additional Feedback: Please reorganize the sections with more discussion at appropriate places. For example, better space adjustment can be employed in Fig. 5 & 6; certain sentences can be paraphrased to conserve space (save lines such as L262, L75, L167 and the last line of the Fig. 7 caption). Similarly, the space pertaining to Table 1 can host, for instance, Table B.2 after some space adjustment (using abbreviations etc.). In Fig. 5, the sub-figures for the “train” statistics can be moved to the supplementary. L74: “depends of” -> “depends on”; L119: “disrupt” -> “disrupts”; L247: Please add a brief description of the algorithm in [26]; L292: “he” -> “the”. Post Rebuttal: The manuscript seems to have been improved by incorporating most of the suggestions received in the reviews. Further, my major concerns have been addressed. Therefore, I increase my score. I would recommend that the authors incorporate the clarification regarding the "connectivity" between models, and the "concentration of spectrum", in the main text.

Review 3

Summary and Contributions: The paper presents an empirical study of transfer learning using ImageNet-pretrained weights for a number of downstream classification problems. The paper makes the following contributions: 1) An analysis of feature reuse for transferring representations, considering various pretrain and target domain distribution shifts. 2) Statistical comparisons between finetuning the pretrained model and training from scratch. 3) For transfer learning, a quantitative measure of layer criticality and an investigation into the interim checkpoints for transfer performance.

Strengths: * This work proposes a series of tools to analyze transfer learning, such as the block shuffle to disentangle feature reuse from high-level statistics, module criticality, and feature similarity. * The discoveries and findings in the paper are interesting. For example, both low-level feature reuse and high-level semantics are important for transfer learning, and interim checkpoints do not significantly affect transfer performance.
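As an illustration of the block shuffle mentioned above, here is a minimal sketch. It assumes the operation means partitioning a square image into equal tiles and permuting the tiles, which may differ in detail from what the paper does; `block_shuffle` and its arguments are hypothetical names.

```python
import random


def block_shuffle(image, block, rng):
    """Split a square 2-D image (list of rows) into block x block tiles and permute the tiles."""
    n = len(image)
    assert n % block == 0 and all(len(row) == n for row in image)
    k = n // block  # number of tiles per side
    positions = [(r, c) for r in range(k) for c in range(k)]
    shuffled = positions[:]
    rng.shuffle(shuffled)
    out = [[None] * n for _ in range(n)]
    # Copy the tile at source position (sr, sc) into destination position (dr, dc).
    for (dr, dc), (sr, sc) in zip(positions, shuffled):
        for i in range(block):
            for j in range(block):
                out[dr * block + i][dc * block + j] = image[sr * block + i][sc * block + j]
    return out


image = [[4 * r + c for c in range(4)] for r in range(4)]
shuffled_image = block_shuffle(image, block=2, rng=random.Random(0))
for row in shuffled_image:
    print(row)
```

Pixel values inside each tile are left untouched, so per-pixel statistics are preserved while the spatial arrangement across tiles is disrupted; a block size equal to the image size leaves the image unchanged.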

Weaknesses: * For all the downstream tasks investigated, it seems that every task benefits from pretraining. Is there a task that would suffer from pretraining? * While these analyses are interesting, they seem to be of little practical use. For example, there is no straightforward way to improve transfer learning based on the understanding provided. * The amount of labeled data in the target domain is not considered. Since the paper compares finetuning against training from scratch, it would be systematically analyze

Correctness: I find no significant errors in the paper's claims.

Clarity: Yes, the paper is easy to follow.

Relation to Prior Work: Related works in transfer learning are properly cited in the related work section. No specific discussions are made to any work.

Reproducibility: No

Additional Feedback: While I am satisfied with the submission, the rebuttal failed to address my concerns. The practical usage is not experimentally validated in any scenario. I downgrade my score to 6.

Review 4

Summary and Contributions: The paper targets two questions: what enables a successful transfer, and which parts of the network are responsible for it? To answer these, the authors experiment with various methods, comparing pretrained models, models fine-tuned from pretrained weights, and models trained from scratch using random initialization.

Strengths: The provided observations are conducted on a variety of datasets.

Weaknesses: While the paper targets an important question, it lacks novelty and does not do justice to the targeted problem. Several of the findings are either well known or expected. E.g., the paper claims that lower layers are in charge of feature reuse. However, this finding is not fully supported by the evidence.

Correctness: Most of the claims are supported with evidence. I have provided feedback in detailed comments.

Clarity: The paper lacks clarity. Since the paper is based on already published methods, it should include a few lines describing each of those techniques and the motivation for why the authors prefer a particular approach; these are missing from the paper.

Relation to Prior Work: Most of the work from vision is covered. However, there are a number of attempts in NLP that analyze pre-trained models before and after fine-tuning. Authors should acknowledge the work in NLP.

Reproducibility: Yes

Additional Feedback: - Role of feature reuse: by comparing fine-tuned models on real data and on data from clip art and quick draw, the authors claim that feature reuse plays an important role in transfer learning. However, it is not clear which features are reused. Secondly, it is an expected outcome that finetuning on related data helps more than finetuning on unrelated data. The finetuning on unrelated data can be improved given the large size of the finetuning data. Overall, I am not sure how performance can be connected to feature reuse as a measure of it. - Again, looking at the faster convergence of P-T compared to RI-T, this is quite obvious since P-T starts from pretrained weights which are already optimized, and the fine-tuning step adapts the weights towards the target domain. On the other hand, RI-T starts from a random place and would require a lot more time to optimize. - Results on randomly shuffling the blocks in an image: again, it is understandable that shuffling the image would drop the performance, and it is likely to drop more for RI than for P since P starts from a well-optimized point. - The authors mention (line 131) that in the case of quick draw, some other factors are helping the downstream tasks. Do you have any intuition about what those features are? Which part of the network do they belong to? - It is interesting to observe that two P-Ts trained on the same data but with different initialization make similar mistakes. I am wondering what the reason could be. Is it because you are looking at the models from the last epoch? Models may start from very different points in the space, but with the large size of the data and a good number of training epochs, they finally converge to approximately the same point. I recall that ensembling two different checkpoints of the same model gives a performance improvement. This might be happening because both checkpoints were far from each other in the space and may be making different mistakes. 
I think a checkpoint-wise comparison is important for understanding whether two P-Ts make the same mistakes and share a similar flat basin because they use identical large data plus a large number of epochs. - The above comment is also valid for the case of feature similarity (3.3). Looking at Table 1, I am wondering if RI-T being very different from RI-T and others is because of the small size of the training data. Is the data too small for the model to generalize well, so that it may get stuck in some local optimum? - One interesting observation in Table 1 is that higher layers are more different than lower layers. I understand that lower layers are similar because they are closer to the input and represent input features, while higher layers learn abstract features plus features more optimized towards the objective function. Since two P-Ts are using identical data and an identical base model, it would be interesting to know why they differ in the higher layers while the loss landscape shows that these models belong to the same flat basin. - Loss landscape: did the authors experiment with the layer-wise weight difference between the two models? I suspect - The paper should justify the choice of techniques. Why CKA, or singular value analysis using Sedghi et al.? I would expect a small description of each method used in the paper instead of just saying that the method in Sedghi et al. was used for the analysis. Minor comments: - Why introduce RI? It is confusing since this is not a trained model but just random weights, and you are not using it in the paper. Typos: line 269: 29, 30, 30 -> 29, 30, 31; line 292: he -> the
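For readers unfamiliar with the CKA similarity questioned in the feedback above, the following is a minimal sketch of the linear variant, computed as ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered activations. It assumes the linear form; whether the paper uses this or a kernel variant is not stated in this review. The function names and the small activation matrices are illustrative only.

```python
import math


def center_columns(m):
    """Subtract each column's mean; rows are examples, columns are features."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(x, y):
    """Squared Frobenius norm of x^T y for row-aligned matrices x (n x p) and y (n x q)."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices computed on the same n examples."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return numerator / denominator


# Hypothetical activations: 4 examples, 2 features each.
acts = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
rotated = [[-b, a] for a, b in acts]          # 90-degree rotation of the features
other = [[1.0], [-1.0], [1.0], [-1.0]]        # a different one-dimensional representation
print(linear_cka(acts, rotated), linear_cka(acts, other))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why the rotated copy scores 1 even though the raw activations differ.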