NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2442
Title:Learning New Tricks From Old Dogs: Multi-Source Transfer Learning From Pre-Trained Networks

Reviewer 1

This paper presents a novel approach to multi-source transfer learning based on maximal correlation. The method is evaluated on three datasets. However, the settings for the three datasets are quite similar: each dataset is divided into several parts, one of which is chosen as the target and the rest as sources. The target task and the source tasks come from the same domain in all experiments. The classifiers are also limited to binary or 5-way classification; no experiments use more classes. The experiments show that the proposed method yields a significant gain in k-shot learning when k <= 10 and that it can benefit from multiple sources. Fig. 2 also shows that the maximal correlation is a good indicator of how much a new task benefits from a given old task. The paper is well written and easy to follow.

Reviewer 2

## Summary

In the proposed approach, maximal correlations and correlation functions are first obtained from the feature functions and target samples. The maximal correlations and correlation functions are then used to predict the class of a target sample. The evaluation is done on 3 datasets (CIFAR100, Stanford Dogs, and Tiny Imagenet). The proposed MCW method is compared with an SVM trained on the output of the penultimate layer. On all datasets, the Multi-Source MCW shows a significant advantage, especially when there are few samples.

## Originality

The formulation of the prediction method (MCW) using the maximal correlation and correlation functions is novel. The appeal of the method is that no control over the training of the source networks is needed.

## Quality

The problem formulation and the method look sound. Source code is not provided with the submission; it would have helped in understanding the method better. The evaluation is a bit weak. There is no comparison with other few-shot learning algorithms. Even though these algorithms might be fundamentally different, ablation studies would have helped in highlighting specific features of the methods. For all three datasets, only one randomly selected set of target classes is chosen. Ideally, the experiments should be carried out on multiple randomly chosen subsets of target classes.

## Clarity

The problem motivation and introduction are presented well. The problem formulation is also clear. The main section on MCW is a somewhat dry read. A short introduction to HGR maximal correlation (and correlation functions) and an intuitive explanation of the method would be helpful. The algorithm listings, however, are quite helpful.

## Significance

The proposed method is simple and elegant. It doesn’t require joint training. It has the potential to open up new avenues for reusing pre-trained networks.

## Minor

Line 110: “We also denote the correlation *of of* the ith …”

## Post rebuttal comments

Thank you for the response. Your results on the randomly chosen source and target classes are noted.
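The pipeline described in the summary can be sketched in a few lines of NumPy. This is only an illustrative reading of the summary, not the authors' code (none was provided with the submission). The function names `fit_mcw` and `predict_mcw`, the standardization of features to zero mean and unit variance, the choice of class-conditional means as correlation functions, and the decision rule `P_Y(y) * (1 + sum_i sigma_i * f_i(x) * g_i(y))` are all assumptions made for illustration.

```python
import numpy as np

def fit_mcw(features, labels, num_classes):
    """Hypothetical helper: estimate correlation functions and weights.

    features: (n_samples, n_features) outputs of frozen source networks
              evaluated on the target samples (assumed input).
    labels:   (n_samples,) integer class labels in [0, num_classes).
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8
    f = (features - mean) / std  # zero mean, unit variance per feature

    # Empirical class prior P_Y(y).
    prior = np.bincount(labels, minlength=num_classes) / len(labels)

    # g_i(y): empirical estimate of E[f_i(X) | Y = y].
    g = np.zeros((num_classes, f.shape[1]))
    for y in range(num_classes):
        mask = labels == y
        if mask.any():
            g[y] = f[mask].mean(axis=0)

    # Rescale g_i to unit variance under the empirical prior
    # (its mean is already zero because f is standardized).
    g_var = (prior[:, None] * g ** 2).sum(axis=0)
    g = g / np.sqrt(g_var + 1e-8)

    # sigma_i: empirical correlation E[f_i(X) g_i(Y)], used as the weight.
    sigma = (f * g[labels]).mean(axis=0)
    return {"mean": mean, "std": std, "prior": prior, "g": g, "sigma": sigma}

def predict_mcw(model, features):
    """Hypothetical helper: score each class and return the argmax."""
    f = (features - model["mean"]) / model["std"]
    # (n_samples, n_classes): sum_i sigma_i * f_i(x) * g_i(y)
    interaction = (f * model["sigma"]) @ model["g"].T
    scores = model["prior"] * (1.0 + interaction)
    return scores.argmax(axis=1)
```

In this sketch a feature that carries no information about the label receives a weight near zero, so it contributes little to the score; no gradient-based training of the source networks is involved.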

Reviewer 3

originality: The paper presents a new method to perform domain adaptation. It makes original use of a statistical tool to design a dedicated algorithm.

quality and clarity: The method is shown to outperform some alternative methods for the task. However, the description of the method is lacking in both notation and exposition (see the specific comments below): it remains unclear why the method performs well, which properties of the data are leveraged, and what makes the method specifically suitable for few-shot learning.

significance: The setting addressed is important and seems to be becoming prevalent.

Specific comments:
From the description in section 3, it is unclear why the method is specifically applicable to low-shot settings.
The pseudo-code in Algorithm 1 as presented uses an expectation over the underlying P^t_{X|Y} rather than an estimate.
line 37: maybe also mention differential privacy (and state-of-the-art related research and results), as related to the setting.
In the related works section, why isn't the method mentioned in lines 91-93 applicable to the paper's setting? The authors should elaborate (including how it compares conceptually and performance-wise).
Provide references for the works mentioned in line 101 (Yosinski) and line 102 (Bao).
What is the orthogonality requirement mentioned in lines 115-116?
line 118: It is stated that the prior P_Y(y) can be estimated from the data. Wouldn't this estimate be very noisy in few-shot settings?
Similarly, in line 129 the conditional expectation is stated to be easily 'computed'; the correct term is 'estimated', and again this estimate would be very noisy in few-shot settings.
In formula (2), the dependence of g on f should be made explicit, e.g., g_f(y) = ..

Some typos:
line 110: "of of"
line 111: sigma_{i}. Also, shouldn't f and g be starred in that definition of sigma_{i}?
formula (3): P_{Y|X}
formula (9): P_{Y|X}
formula (10): P_{Y|X}

---------------

Given the detailed answer by the authors and their commitment to address the typos and editorial comments (although a couple of mine remain unanswered), I increase the score to 6.
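The concern about noisy estimates in few-shot settings can be illustrated with a small simulation. This is a hedged sketch, not an analysis of the paper's data: it assumes a single feature that is Gaussian with unit variance within a class, and the function name `conditional_mean_rmse` and its parameters are hypothetical. Under that assumption, the error of an empirical conditional mean computed from k samples shrinks roughly as 1/sqrt(k).

```python
import numpy as np

def conditional_mean_rmse(k, true_mean=0.5, trials=20000, seed=0):
    """Root-mean-square error of a k-sample estimate of E[f(X) | Y = y].

    Assumes (for illustration only) that f(X) given Y = y is Gaussian
    with mean `true_mean` and unit variance.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=true_mean, scale=1.0, size=(trials, k))
    estimates = samples.mean(axis=1)  # one k-shot estimate per trial
    return float(np.sqrt(np.mean((estimates - true_mean) ** 2)))

# Error of the estimate for several shot counts.
rmse_by_k = {k: conditional_mean_rmse(k) for k in (1, 5, 10, 100)}
for k, rmse in rmse_by_k.items():
    print(f"k={k:>3}  rmse={rmse:.3f}  1/sqrt(k)={1 / np.sqrt(k):.3f}")
```

Under these assumptions, the error at k = 1 is about ten times the error at k = 100, which is the regime the comments on lines 118 and 129 refer to.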