{"title": "Learning New Tricks From Old Dogs: Multi-Source Transfer Learning From Pre-Trained Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4370, "page_last": 4380, "abstract": "The advent of deep learning algorithms for mobile devices and sensors has led to a dramatic expansion in the availability and number of systems trained on a wide range of machine learning tasks, creating a host of opportunities and challenges in the realm of transfer learning. Currently, most transfer learning methods require some kind of control over the systems learned, either by enforcing constraints during the source training, or through the use of a joint optimization objective between tasks that requires all data be co-located for training. However, for practical, privacy, or other reasons, in a variety of applications we may have no control over the individual source task training, nor access to source training samples. Instead we only have access to features pre-trained on such data as the output of \"black-boxes.'' For such scenarios, we consider the multi-source learning problem of training a classifier using an ensemble of pre-trained neural networks for a set of classes that have not been observed by any of the source networks, and for which we have very few training samples. We show that by using these distributed networks as feature extractors, we can train an effective classifier in a computationally-efficient manner using tools from (nonlinear) maximal correlation analysis. In particular, we develop a method we refer to as maximal correlation weighting (MCW) to build the required target classifier from an appropriate weighting of the feature functions from the source networks. 
We illustrate the effectiveness of the resulting classifier on datasets derived from the CIFAR-100, Stanford Dogs, and Tiny ImageNet datasets, and, in addition, use the methodology to characterize the relative value of different source tasks in learning a target task.", "full_text": "Learning New Tricks From Old Dogs: Multi-Source\n\nTransfer Learning From Pre-Trained Networks\n\nJoshua Ka-Wing Lee\n\nDept. EECS, MIT\njk_lee@mit.edu\n\nPrasanna Sattigeri\n\nMIT-IBM Watson AI Lab, IBM Research\n\npsattig@us.ibm.com\n\nGregory W. Wornell\n\nDept. EECS, MIT\n\ngww@mit.edu\n\nAbstract\n\nThe advent of deep learning algorithms for mobile devices and sensors has led to\na dramatic expansion in the availability and number of systems trained on a wide\nrange of machine learning tasks, creating a host of opportunities and challenges in\nthe realm of transfer learning. Currently, most transfer learning methods require\nsome kind of control over the systems learned, either by enforcing constraints dur-\ning the source training, or through the use of a joint optimization objective between\ntasks that requires all data be co-located for training. However, for practical, pri-\nvacy, or other reasons, in a variety of applications we may have no control over the\nindividual source task training, nor access to source training samples. Instead we\nonly have access to features pre-trained on such data as the output of \u201cblack-boxes.\u201d\nFor such scenarios, we consider the multi-source learning problem of training a\nclassi\ufb01er using an ensemble of pre-trained neural networks for a set of classes that\nhave not been observed by any of the source networks, and for which we have\nvery few training samples. We show that by using these distributed networks as\nfeature extractors, we can train an effective classi\ufb01er in a computationally-ef\ufb01cient\nmanner using tools from (nonlinear) maximal correlation analysis. 
In particular,\nwe develop a method we refer to as maximal correlation weighting (MCW) to build\nthe required target classi\ufb01er from an appropriate weighting of the feature functions\nfrom the source networks. We illustrate the effectiveness of the resulting classi-\n\ufb01er on datasets derived from the CIFAR-100, Stanford Dogs, and Tiny ImageNet\ndatasets, and, in addition, use the methodology to characterize the relative value of\ndifferent source tasks in learning a target task.\n\n1\n\nIntroduction\n\nRecently, the development of ef\ufb01cient algorithms for training deep neural networks on diverse\nplatforms with limited interaction has created both opportunities and challenges for deep learning.\nAn emerging example involves training networks on mobile devices [8, 23, 14]. In such cases, while\neach user\u2019s device may be training on a different set of data with a different classi\ufb01cation objective,\nmulti-task learning techniques can be used to leverage these separate datasets in order to transfer to\nnew tasks for which we observe few samples.\nHowever, most existing methods require some aspect of control over the training on the source\ndatasets. Either all the datasets must be located on the same device for training based on some joint\noptimization criterion, or the overall architecture requires some level of control over the training\nfor each individual source dataset. In the case of, e.g., object classi\ufb01cation in images collected by\nusers, sending this data to a central location for processing may be impractical, or even a violation of\nprivacy rights. Alternatively, it is possible that one might wish to use older, pre-trained classi\ufb01ers for\nwhich the original training data is no longer available, and to transfer them for use in a new task. 
In either case, it could be acceptable to transmit the neural network features learned by the device in an anonymized fashion, and to then combine the networks learned by multiple users in order to classify novel images.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThis would be an example of a multi-task learning problem in which we have not only multiple source datasets, but access to only pre-trained networks (whose learning objective we cannot control) from those datasets, not the underlying training data used, and we wish to train a classifier for some new target label set given only a few target samples.\nFine-tuning methods can be used when the source network is frozen to transfer to a target domain, but these methods tend not to work very well in a few-shot setting when there are multiple networks, due to the number of parameters necessary for fine-tuning, especially in an environment where features cannot be learned with the intention of transfer [4].\nIn this paper, we apply the methodology of (nonlinear) maximal correlation analysis that originated with Hirschfeld [9] to this problem. In particular, we exploit a useful and convenient interpretation of the features in a neural network as maximal correlation functions, as described in, e.g., [10]. The result is a method we refer to as maximal correlation weighting (MCW) for combining multiple pre-trained neural networks to carry out few-shot learning of a classifier to distinguish a set of never-before-seen classes. Attractively, this method allows for the computation of combining weights on individual feature functions in a completely decoupled fashion.\nThis paper is organized as follows.
In Section 2, we describe the problem formulation and related work. In Section 3, we introduce the relevant aspects of maximal correlation analysis for combining neural networks, and develop the MCW methodology and processing used to train our classifier. Section 4 describes experimental results on the CIFAR-100, Stanford Dogs, and Tiny ImageNet datasets, and Section 5 contains concluding remarks.\n\n2 Background and Problem Description\n\n2.1 Problem Formulation and Notation\n\nConsider a multi-task learning setup in which we have N different source classification tasks T_{s_1}, …, T_{s_N}, for which we have labeled data {(x^{s_n}_1, y^{s_n}_1), …, (x^{s_n}_{k_n}, y^{s_n}_{k_n})} for task T_{s_n}, n ∈ {1, …, N}. We also have a single target task T_t, with associated labeled data {(x^t_1, y^t_1), …, (x^t_k, y^t_k)}.\nFor this problem we assume that x^{s_n}_i ∈ X for all n and i, and x^t_i ∈ X for all i. That is, the data for the target and each source task are drawn from some common alphabet (e.g., all data are natural images). We do not assume any overlap between labels for any pair of datasets (i.e., y^{s_n}_i ∈ Y_{s_n} for all n and i, and y^t_i ∈ Y_t for all i, where Y_t ≠ Y_{s_1} ≠ ⋯ ≠ Y_{s_N}).\nFor each source task T_{s_n}, we have access to a pre-trained neural network which we assume to have been trained to classify y^{s_n} from x^{s_n}. We assume that the network has some number of layers corresponding to the extraction of features from x^{s_n}, followed by a final classification layer which maps the features to a predicted class label ŷ^{s_n}. We denote the output of the penultimate layer as f^{s_n} : X → R^{l_n}, of which the ith feature is f^{s_n}_i : X → R, where l_n is the number of features output by this layer. We denote the final layer as h^{s_n} : R^{l_n} → Y_{s_n}, so that the entire neural network classifier can be written as ŷ = (h^{s_n} ∘ f^{s_n})(x).\nWe seek to train a classifier on the target task given training samples {(x^t_1, y^t_1), …, (x^t_k, y^t_k)}, with access to h^{s_n} and f^{s_n} for each source dataset, but without any access to the underlying source training samples {(x^{s_n}_1, y^{s_n}_1), …, (x^{s_n}_{k_n}, y^{s_n}_{k_n})}.\nAs an example context, this reflects a situation in which there are many devices collecting and analyzing data, but where the target learner is not allowed to access the data, either because the devices have limited bandwidth and cannot transmit everything they have detected, the data is personal (e.g., pictures taken by users of a mobile app) and cannot be transmitted for privacy purposes, or the original data is otherwise lost (if the data was collected a long time ago). However, in these cases, it may still be possible to query the classifier trained on each device to get their intermediate features, which would require less information to be transmitted.\n\n2.2 Related Work\n\nMulti-task learning is a well-studied problem, with several variations and formulations. One standard approach is to learn a common feature function f(·) across all tasks which optimizes some joint objective, followed by a final classification layer for each task [19, 24]. This technique has some theoretical guarantees, as given by Ben-David et al. [2]. While effective, this method requires joint training, which our formulation precludes.\nGupta and Ratinov [7] propose a method of combining the outputs of multiple pre-trained classifiers by training on their raw predictions, but this method is designed for pre-trained classifiers specially selected to work well in combination with the target task, with an emphasis on cases where the number of possible class labels (i.e.
the value of each |Y_{s_n}|) is large, which we do not assume in our problem formulation.\nOther methods involve some kind of sequential learning [27] or shared memory unit [18], which could decentralize data storage, but which still require joint control over the training [17].\nMeta-learning algorithms have also gained popularity in recent years [21, 22]. These algorithms attempt to learn a suitably general learning rule or model from a set of source tasks which can be fine-tuned with data from a target task [4]. While these methods allow for the combining of multiple source datasets, they are still bound by the need for centralized training.\nFinally, the notion of transferring from a single pre-trained network onto a new target task has also been studied before. Yosinski et al. explore the transferability of different layers of a neural net to other tasks in the context of learning general features [28], while Bao et al. propose a score for measuring transferability of features across tasks [1].\n\n3 Multi-Source Transfer Learning via Maximal Correlations\n\n3.1 Maximal Correlation Analysis\n\nOur methodology is based on the use of maximal correlation analysis, which originated with the work of Hirschfeld [9], and has been further developed in a wide range of subsequent work, including by Gebelein and Rényi [6, 26]; as a result, it is sometimes referred to as Hirschfeld-Gebelein-Rényi (HGR) maximal correlation analysis. (For a more detailed summary of this literature, see, e.g., the references and discussion in [10].)\nGiven 1 ≤ k ≤ K − 1 with K = min{|X|, |Y|}, the maximal correlation problem for random variables X ∈ X and Y ∈ Y is\n\n(f*, g*) ≜ arg max_{f, g} E[f^T(X) g(Y)],   (1)\n\nwhere the maximization is over all f : X → R^k and g : Y → R^k satisfying E[f(X)] = E[g(Y)] = 0 and E[f(X) f^T(X)] = E[g(Y) g^T(Y)] = I, and where expectations are with respect to the joint distribution P_{X,Y}. We refer to f* and g* as the maximal correlation functions. With f* = (f*_1, …, f*_k)^T and g* = (g*_1, …, g*_k)^T, we further define the associated maximal correlations σ_i = E[f*_i(X) g*_i(Y)] for i = 1, …, k. In turn, the optimizing functions satisfy\n\nE_{P_{X|Y}(·|y)}[f*_i(X)] = σ_i g*_i(y)   and   E_{P_{Y|X}(·|x)}[g*_i(Y)] = σ_i f*_i(x),\n\nwhich underlies the alternating conditional expectations (ACE) algorithm of Breiman and Friedman [3] for computing these functions. Indeed, for a given f, the correlation-maximizing g has components\n\nĝ_i(y) ∝ E_{P_{X|Y}(·|y)}[f_i(X)],   i = 1, …, k.   (2)\n\nAs described in [11, 10], the maximal correlation problem is a variational form of a modal decomposition (i.e., generalized SVD) of joint distributions of the form\n\nP_{X,Y}(x, y) = P_X(x) P_Y(y) [1 + Σ_{i=1}^{K−1} σ_i f*_i(x) g*_i(y)],   (3)\n\nvia which predictions are made according to\n\nP_{Y|X}(y|x) = P_Y(y) (1 + Σ_{i=1}^{k} σ_i f*_i(x) g*_i(y)),   (4)\n\nwhere suitable estimates of P_Y are obtained from the data or domain knowledge about label distributions.\nMoreover, the maximal correlation features arise naturally in a local version of softmax regression [10], and thus have a direct interpretation in the context of neural networks. In particular, given (normalized) features f, [10] shows that such regression produces E_{P_{X|Y}(·|y)}[f(X)] as combining weights. Moreover, [10] establishes that optimizing over the choice of features yields the maximal correlation ones, i.e., f*, and that as a result the corresponding combining weights correspond to g* (weighted by σ_1, …, σ_k).
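For finite alphabets, the modal decomposition (3) gives a concrete way to compute these quantities: the maximal correlations σ_i are the singular values, after the leading trivial value of 1, of the matrix B(x, y) = P_{X,Y}(x, y)/√(P_X(x) P_Y(y)), with f* and g* read off from the corresponding singular vectors. The following NumPy sketch is our own illustration of this standard fact; the name hgr_decomposition is ours, not from the paper.

```python
import numpy as np

def hgr_decomposition(P_xy):
    '''Modal (SVD) decomposition of a discrete joint distribution, as in (3):
    P(x, y) = Px(x) Py(y) (1 + sum_i sigma_i f_i(x) g_i(y)).

    Returns the maximal correlations sigma_i and the functions f*_i, g*_i
    (zero-mean, unit-variance under the marginals) as matrix columns.
    '''
    Px = P_xy.sum(axis=1)
    Py = P_xy.sum(axis=0)
    # B(x, y) = P(x, y) / sqrt(Px(x) Py(y)); its SVD carries the decomposition.
    B = P_xy / np.sqrt(np.outer(Px, Py))
    U, s, Vt = np.linalg.svd(B)
    # The leading singular value is always 1 (constant functions); drop it.
    sigma = s[1:]
    f = U[:, 1:] / np.sqrt(Px)[:, None]     # columns are f*_i(x)
    g = Vt[1:, :].T / np.sqrt(Py)[:, None]  # columns are g*_i(y)
    return sigma, f, g

# A simple correlated binary pair: X = Y with probability 0.9.
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])
sigma, f, g = hgr_decomposition(P)
```

For this symmetric example the nontrivial singular value is 0.8 and the recovered f*, g* take values ±1, consistent with the zero-mean, unit-variance constraints in (1).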
(And as such, it also highlights the connection between the ACE algorithm and the use of traditional neural network training.)\n\n3.2 Combining Maximal Correlation Functions\n\nThe preceding relationships motivate our approach to the multi-task learning problem. Given a fixed set of feature functions {f^{s_1}, …, f^{s_N}}, we seek to maximize the total maximal correlation\n\nL = E_{P̂^t_{X,Y}}[f^T(X) g(Y)]   (5)\n\nwith respect to g, where f = (f^{s_1}, …, f^{s_N})^T and g = (g^{s_1}, …, g^{s_N})^T, and where the optimization is over all valid (zero-mean and unit-variance with respect to the empirical distribution of the target class labels) g for fixed f. Here P̂^t_{X,Y} denotes the empirical joint target distribution of X and Y.\nExpanding (5) as\n\nL = Σ_{i,n} E_{P̂^t_{X,Y}}[f^{s_n}_i(X) g^{s_n}_i(Y)],   (6)\n\nwe can then maximize each term separately, yielding\n\ng^{s_n}_i(y) = arg max_{g̃^{s_n}_i} L = arg max_{g̃^{s_n}_i} E_{P̂^t_{X,Y}}[f^{s_n}_i(X) g̃^{s_n}_i(Y)].   (7)\n\nThen, for each g^{s_n}_i(y), for a fixed f^{s_n}_i, we have from (2) that the optimal g^{s_n}_i is given by the conditional expectation\n\ng^{s_n}_i(y) = E_{P̂^t_{X|Y}(·|y)}[f^{s_n}_i(X)],   (8)\n\nwhich can easily be computed from the target samples. In turn, we compute the corresponding maximized correlation for each pair of functions f^{s_n}_i and g^{s_n}_i via\n\nσ_{n,i} = E_{P̂^t_{X,Y}}[f^{s_n}_i(X) g^{s_n}_i(Y)].   (9)\n\n3.3 The Maximal Correlation Weighting (MCW) Algorithm\n\nUsing the combining weights thus derived, a predictor for the target labels is formed in accordance with (4); specifically,\n\nP̂_{Y|X}(y|x) = P̂^t_Y(y) (1 + Σ_{n,i} σ_{n,i} f^{s_n}_i(x) g^{s_n}_i(y)),   (10)\n\nfrom which our classification ŷ for a given test sample x is\n\nŷ = arg max_y P̂_{Y|X}(y|x) = arg max_y P̂^t_Y(y) (1 + Σ_{n,i} σ_{n,i} f^{s_n}_i(x) g^{s_n}_i(y)),   (11)\n\nwhere P̂^t_Y is an estimate of the target label distribution.\n\nAlgorithm 1 Extracting maximal correlation parameters\nData: zero-mean, unit-variance feature functions {f^{s_n}_i} from the source tasks, and target task samples {(x^t_1, y^t_1), …, (x^t_k, y^t_k)}\nResult: associated maximal correlations {σ_{n,i}} and correlation functions {g^{s_n}_i}\nfor n = 1, …, N do // Iterate over all source tasks\n  for i = 1, …, l_n do // Iterate over features in each network\n    for y ∈ Y_t do // Iterate over all target class labels\n      g^{s_n}_i(y) ← E_{P̂^t_{X|Y}(·|y)}[f^{s_n}_i(X)] // Compute feature- and label-specific weight\n    end\n    σ_{n,i} ← E_{P̂^t_{X,Y}}[f^{s_n}_i(X) g^{s_n}_i(Y)] // Compute feature-specific weight\n  end\nend\nreturn {g^{s_n}_i}, {σ_{n,i}}\n\nAlgorithm 2 Prediction with the maximal correlation weighting method\nData: maximal correlation functions {f^{s_n}_i} and {g^{s_n}_i} with associated correlations {σ_{n,i}}, empirical class label distribution P̂^t_Y, and target task sample x^t\nResult: class label prediction ŷ^t given x^t\nInitialize P̂^t_{Y|X}(y|x^t) = P̂^t_Y(y) ∀ y ∈ Y_t\nfor n = 1, …, N do // Iterate over all source tasks\n  for i = 1, …, l_n do // Iterate over features in each network\n    for y ∈ Y_t do // Iterate over all target class labels\n      P̂^t_{Y|X}(y|x^t) ← P̂^t_{Y|X}(y|x^t) + P̂^t_Y(y) σ_{n,i} f^{s_n}_i(x^t) g^{s_n}_i(y) // Apply Equation (10)\n    end\n  end\nend\nreturn arg max_y P̂^t_{Y|X}(y|x^t)\n\nThe resulting algorithms for learning the MCW parameters and computing the MCW predictions are summarized in Algorithm 1 and Algorithm 2.\nComputing the empirical conditional expected value requires a single pass through the data, and so has linear time complexity in the number of target samples. We also need to compute one conditional expectation for each feature function. Thus, the time complexity of the fine-tuning is O(C + NKk), where C is the time needed to extract features from all the pre-trained networks, N is the number of networks, K is the maximum number of features per network, and k is the number of target training samples. The number of parameters grows as O(NK|Y_t|), which is the number of entries needed to store all the g functions; here |Y_t| is the number of target class labels.\nTo compute a prediction from one target test sample, the time complexity is O(C + NK|Y_t|). This arises from the fact that we must compute the quantity Σ_{n,i} σ_{n,i} f^{s_n}_i(x) g^{s_n}_i(y) for each possible class label.\n\nFigure 1: Example images from the (a) CIFAR-100, (b) Stanford Dogs, and (c) Tiny ImageNet datasets.\n\n4 Experimental Results\n\n4.1 General Experimental Setup\n\nIn order to illustrate the effectiveness of the MCW method, we perform experiments on three different image classification datasets: CIFAR-100, Stanford Dogs, and Tiny ImageNet. Example images from each dataset can be found in Figure 1.\nFor each dataset, we divide the classes into a set of mutually exclusive subsets, select one subset as our target task, and several others as the source datasets.
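In implementation terms, Algorithms 1 and 2 reduce to a handful of vectorized empirical averages over the target samples. The following NumPy sketch is our own illustration, not code from the paper: names such as mcw_fit are ours, and it assumes the source features have already been extracted and normalized to zero mean and unit variance on the target samples.

```python
import numpy as np

def mcw_fit(F, y, n_labels):
    '''Algorithm 1: estimate g(y) and sigma for each feature.

    F: (k, d) matrix of pre-trained source features evaluated on the k
       target samples (already zero-mean, unit-variance over these samples).
    y: (k,) integer target labels in {0, ..., n_labels - 1}.
    Returns g of shape (n_labels, d) and sigma of shape (d,).
    '''
    g = np.zeros((n_labels, F.shape[1]))
    for c in range(n_labels):
        # g_i(c) = empirical E[f_i(X) | Y = c], as in (8)
        g[c] = F[y == c].mean(axis=0)
    # sigma_i = empirical E[f_i(X) g_i(Y)], as in (9)
    sigma = (F * g[y]).mean(axis=0)
    return g, sigma

def mcw_predict(F_test, g, sigma, prior):
    '''Algorithm 2: score labels via P(y) (1 + sum_i sigma_i f_i(x) g_i(y)).

    The scores need not be proper probabilities; only the argmax is used.
    '''
    scores = prior * (1.0 + F_test @ (sigma * g).T)  # shape (m, n_labels)
    return scores.argmax(axis=1)

# Toy usage: six target samples, two labels, two pre-normalized features;
# the first feature identifies the label, the second is a noisy copy of it.
F = np.array([[-1., -1.],
              [-1., -1.],
              [-1.,  1.],
              [ 1.,  1.],
              [ 1.,  1.],
              [ 1., -1.]])
y = np.array([0, 0, 0, 1, 1, 1])
g, sigma = mcw_fit(F, y, n_labels=2)
prior = np.bincount(y, minlength=2) / len(y)
pred = mcw_predict(F, g, sigma, prior)
```

Note how the perfectly label-aligned feature earns the maximal weight sigma = 1, while the noisy feature receives a much smaller weight, mirroring the decoupled per-feature weighting described above.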
We use the LeNet architecture [15] as our\nneural network for each source dataset, and train a different network for each source dataset. We\nimplemented the network in PyTorch [25], and trained it with learning rate=0.001, momentum=0.9,\nand number of epochs = 100.\nWe remove the means and normalize to unit variance all of the feature functions with respect to the\ntarget samples, and then compute the maximal correlations and associated functions for each output\nin the penultimate layer using the target data according to Algorithm 1. We then use them to compute\npredictions on the test set for the target task according to Algorithm 2.\nWe compare the classi\ufb01cation accuracies on the test set with that of a Support Vector Machine (SVM)\ntrained on the penultimate layers with the same target training data (similar to the setup in [7]), as\nwell as the best results from the MCW method and SVM method using only one source dataset/neural\nnetwork. We also include the \"upper bound\" baseline performance on the dataset by a LeNet neural\nnetwork trained on a number of target training samples equal to the number of training samples\nprovided for each source dataset. The reported results are over 20 runs using the same set of tasks for\neach run.1\n\n4.2 CIFAR-100 Dataset\n\nThe CIFAR-100 dataset2 [13] is a collection of color images of size 32x32 drawn from 100 different\ncategories of real-world subjects. Because of the low resolution of the images, CIFAR-100 is\ngenerally seen as a dif\ufb01cult classi\ufb01cation problem. For our experiment, we construct a series of\nbinary classi\ufb01cation tasks from the classes. We randomly selected \"apple\" vs. \"\ufb01sh\" as our target\nbinary classi\ufb01cation task, and randomly selected 10 other pairs of non-overlapping categories for the\nsource tasks. 
For each source task, we extracted 500 samples per class for training, and we used 1, 5, 10, and 20 samples per class to compute the maximal correlation functions in the target task. We used the training/test splits included with the dataset, and report results over all test samples with the target labels.\nTable 1 shows the test accuracies of our algorithm as applied to the CIFAR-100 dataset. We can see that the MCW method performs significantly better than an SVM when there are few samples, likely due to its ability to work with fewer target data points in learning, but that this performance gap closes as more target training samples are added, likely due to the fact that the models which require joint training over the features begin to have enough target samples to properly learn their parameters. In addition, we can see that combining multiple networks provides performance that is better than any one network can achieve with the same methods, once again suggesting that our algorithm is taking in contributions from multiple sources instead of just one.\n\n1Code for the experiments can be found at http://allegro.mit.edu/~gww/multitransfer\n2https://www.cs.toronto.edu/~kriz/cifar.html\n\nTable 1: Experimental results for the CIFAR-100 dataset. Accuracies are reported with 95% confidence intervals.\n\nMethod | 1-Shot Acc. | 5-Shot Acc. | 10-Shot Acc. | 20-Shot Acc.\nBest Single Source SVM | 56.9 ± 2.5 | 67.0 ± 3.0 | 70.4 ± 1.9 | 70.9 ± 1.2\nBest Single Source MCW | 59.2 ± 2.1 | 69.0 ± 3.0 | 67.0 ± 2.4 | 70.4 ± 1.5\nMulti-Source SVM | 64.7 ± 3.0 | 72.8 ± 2.7 | 76.2 ± 1.8 | 81.5 ± 0.6\nMulti-Source MCW | 69.0 ± 3.0 | 78.1 ± 0.8 | 80.1 ± 0.8 | 81.7 ± 0.6\nBaseline (All Target Samples) | 90.7 ± 0.1\n\nFigure 2: Average values of Σ_i σ_{n,i} for each source task s_n for the 5-shot transfer learning task on the CIFAR-100 dataset, with the target task of \"apple vs. fish.\" Points are plotted with 95% confidence intervals.\n\nIn order to investigate the functioning of the MCW method, we plot the sum of correlations for each of the 10 tasks for the 5-shot case in Figure 2. We can see a significant variation among tasks, which provides a clear indication of which tasks are being preferred and which do not contribute as much to the overall performance. To verify this, we run two additional experiments in which we first remove the source task with the lowest total correlation (\"camel\" vs. \"can\") and see how well the MCW method performs with the remaining 9 source datasets, and then remove the task with the highest total correlation (\"dolphin\" vs. \"elephant\") while keeping the other 9 sources in and run the same test.\nWithout the least-favoured task, the classification accuracy drops to 76.8 ± 1.0, which is not a significant difference from using all 10 source tasks. However, when we remove the most-favoured task, the accuracy plummets to 73.0 ± 1.3, which indicates that \"dolphin\" vs. \"elephant\" had a significant impact on the quality of the classifier, but that the MCW method still takes the input of the other tasks into account in order to construct a good classifier on the target set.\n\nTable 2: Experimental results for the Stanford Dogs dataset.
Accuracies are reported with 95% confidence intervals.\n\nMethod | 5-Shot Accuracy\nBest Single Source SVM | 35.8 ± 0.8\nBest Single Source MCW | 38.2 ± 0.6\nMulti-Source SVM | 38.9 ± 0.3\nMulti-Source MCW | 41.6 ± 0.5\nBaseline (All Target Samples) | 55.2 ± 0.1\n\nTable 3: Experimental results for the Tiny ImageNet dataset. Accuracies are reported with 95% confidence intervals.\n\nMethod | 5-Shot Accuracy\nBest Single Source SVM | 31.4 ± 0.9\nBest Single Source MCW | 33.9 ± 1.0\nMulti-Source SVM | 42.5 ± 1.4\nMulti-Source MCW | 47.4 ± 1.1\nBaseline (All Target Samples) | 53.8 ± 0.1\n\n4.3 Stanford Dogs Dataset\n\nThe Stanford Dogs dataset3 [12] is a subset of the ImageNet dataset designed for fine-grained image classification. It consists of 22,000 images of varying sizes covering 120 classes of dog breeds. For this task we construct a random 5-way target classification task (differentiating between \"Chihuahua\", \"Japanese Spaniel\", \"Maltese Dog\", \"Pekinese\", and \"Shih-Tzu\") and 10 other random 5-way source classification tasks with no overlapping classes. For the target set, we take 5 samples per class for training and use the rest for testing. For the source sets, we take 100 samples per class for training. All images were resized to size 144x144.\nTable 2 shows the test accuracies of our algorithm as applied to the Stanford Dogs dataset. This time, we observe a loose hierarchy whereby the MCW method outperforms the SVM, which in turn outperforms any single source transfer.
We can thus conclude that the MCW method is effective in\nthe case of m-way learning for m > 2, and that we can still leverage multiple networks to get a gain\nin cases where the classes are very similar.\n\n4.4 Tiny ImageNet Dataset\n\nThe Tiny ImageNet dataset4 [16] is another subset of the ImageNet dataset, consisting of images of\nsize 64x64 drawn from 200 categories, with 500 images provided for each category. The categories\ncover a much wider range than the Stanford Dogs dataset, including animals, natural and man-made\nobjects, and even abstract concepts (e.g. \"elongation\"). As with the Stanford Dogs dataset, we\nconstructed 11 random 5-way classi\ufb01cation tasks, and selected one as the target task (\"Lighthouse\"\nvs. \"Rocking Chair\" vs. \"Bannister\" vs. \"Jelly\ufb01sh\" vs. \"Chain\") and the others as source tasks. We\nused 5 training samples per class for the target task (with 250 samples per class for testing) and all\n500 samples per class for the source training samples. 
For the baseline, we only trained with the 250 samples per class in the target dataset that were not in the test split.\nTable 3 shows the test accuracies of the MCW method as applied to the Tiny ImageNet dataset. Compared to the Stanford Dogs dataset, we see a larger gain from leveraging multiple sources compared to a single source, which suggests that if the source classes are much more dissimilar than the target classes, then integrating more networks (and thus leveraging a wider range of features) will have a greater effect on target task accuracy, likely due to the ability of different source tasks to \"cover\" the feature set needed for the target task, as opposed to the Dogs setup where the classes were highly similar.\n\n3http://vision.stanford.edu/aditya86/ImageNetDogs/\n4https://tiny-imagenet.herokuapp.com/\n\n5 Concluding Remarks\n\nWe presented a new multi-task learning problem inspired by advances in the modern Deep Learning ecosystem, in which a target task learner has access to only a few target task samples and to the neural networks already trained by the sources, but not the underlying data. By leveraging the Hirschfeld-Gebelein-Rényi maximal correlation, we were able to develop a fast, easily-computed method for combining the features extracted by these neural networks to build a classifier for the target task.\nWe showed that this method was effective for binary and 5-way classification on image data, and that combining multiple nets was more effective when there were no similar classes in the source datasets to those in the target dataset.\nIt is possible that the maximal correlation can also be a tool to measure how important each neural network is relative to training the target task, as we showed in our experiments with the CIFAR-100 dataset.
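Given the per-feature correlations {σ_{n,i}} produced by Algorithm 1, this relevance measure is simply the total correlation Σ_i σ_{n,i} captured by each source network (the quantity plotted in Figure 2). A minimal sketch of such a ranking, with hypothetical correlation values and a function name (rank_sources) of our own choosing:

```python
import numpy as np

def rank_sources(sigmas):
    '''Rank source networks by total captured correlation sum_i sigma_{n,i}.

    sigmas: list of per-network arrays, sigmas[n][i] = sigma_{n,i}
            estimated on the target samples (Algorithm 1).
    Returns (order, totals): source indices from most to least relevant,
    and the total correlation per source.
    '''
    totals = np.array([np.sum(s) for s in sigmas])
    order = np.argsort(totals)[::-1]  # descending by total correlation
    return order, totals

# Hypothetical correlations for three source networks.
sigmas = [np.array([0.9, 0.7]),   # highly relevant source
          np.array([0.2, 0.1]),   # weakly relevant source
          np.array([0.5, 0.5])]
order, totals = rank_sources(sigmas)
```
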
In an online setting, this could encourage a procedure whereby more-relevant networks are\nqueried more often compared to less-relevant networks if data transfer is limited, since it is more\nimportant that the more-relevant networks are \"correct\" (i.e. trained with more training data).\nIn addition, the privacy implications of our setup could be considered, as it is possible to reconstruct\ntraining data from the learned features [5], which means that our method as-is does not erase all\nprivacy concerns. These methods can be countered with differential privacy measures [20], such as\nadding noise to the feature functions, but their effect on transfer quality is as-of-yet unknown.\nIndeed, with the advent of mass small-scale Deep Learning, many opportunities and challenges will\narise, allowing us to leverage the power of crowdsourcing for learning in a novel application of the\nprinciple of the Wisdom of the Crowd.\n\nAcknowledgments\n\nThis work was supported in part by the MIT-IBM Watson AI Lab, and by NSF under Grant No. CCF-\n1717610.\n\nReferences\n[1] Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Amir R. Zamir, and Leonidas J. Guibas. An\ninformation-theoretic metric of transferability for task transfer learning. https://openreview.\nnet/forum?id=BkxAUjRqY7, 2019. [Online; accessed 13-May-2019].\n\n[2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations\nfor domain adaptation. In Advances in Neural Information Processing Systems, pages 137\u2013144,\n2007.\n\n[3] Leo Breiman and Jerome H. Friedman. Estimating optimal transformations for multiple\n\nregression and correlation. J. Am. Stat. Assoc., 80(391):580\u2013598, September 1985.\n\n[4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adap-\ntation of deep networks. In Proc. Int. Conf. Machine Learning (ICML), volume 70, pages\n1126\u20131135, 2017.\n\n[5] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 
Model inversion attacks that exploit confidence information and basic countermeasures. In Proc. ACM SIGSAC Conf. Computer, Communications Security, pages 1322–1333, 2015.
[6] Hans Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. Z. Angewandte Math., Mech., 21(6):364–379, 1941.
[7] Rakesh Gupta and Lev-Arie Ratinov. Text categorization with knowledge transfer from heterogeneous data sources. In Proc. AAAI Conf. Artificial Intelligence, volume 2, pages 842–847, 2008.
[8] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proc. Int. Conf. Machine Learning (ICML), pages 1737–1746, 2015.
[9] Hermann O. Hirschfeld. A connection between correlation and contingency. Proc. Cambridge Phil. Soc., 31:520–524, 1935.
[10] Shao-Lun Huang, Anuran Makur, Gregory W. Wornell, and Lizhong Zheng. On universal features for high-dimensional learning and inference. Preprint, October 2019. http://allegro.mit.edu/~gww/unifeatures.
[11] Shao-Lun Huang, Anuran Makur, Lizhong Zheng, and Gregory W. Wornell. An information-theoretic approach to universal feature selection in high-dimensional inference. In Proc. Int. Symp. Inform. Theory (ISIT), pages 1336–1340, 2017.
[12] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), 2011.
[13] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Canada, 2009.
[14] Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. 
DeepX: A software accelerator for low-power deep learning inference on mobile devices. In Proc. Int. Conf. Information Processing in Sensor Networks, page 23, 2016.
[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[16] Fei-Fei Li, Andrej Karpathy, and Justin Johnson. Tiny imagenet visual recognition challenge. https://tiny-imagenet.herokuapp.com/, 2015. [Online; accessed 13-May-2019].
[17] Chee Peng Lim and Robert F. Harrison. Online pattern classification with multiple neural network systems: an experimental study. IEEE Trans. Systems, Man, and Cybernetics, Part C (Applications and Reviews), 33(2):235–247, 2003.
[18] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Deep multi-task learning with shared memory. CoRR, abs/1609.07222, 2016. http://arxiv.org/abs/1609.07222.
[19] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. CoRR, abs/1502.02791, 2015. http://arxiv.org/abs/1502.02791.
[20] Giuseppe Manco and Giuseppe Pirrò. Differential privacy and neural networks: A preliminary analysis. In Proc. Int. Workshop Personal Analytics, Privacy, pages 23–35, 2017.
[21] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. Meta-learning with temporal convolutions. CoRR, abs/1707.03141, 2017. http://arxiv.org/abs/1707.03141.
[22] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018. http://arxiv.org/abs/1803.02999.
[23] Kaoru Ota, Minh Son Dao, Vasileios Mezaris, and Francesco G. B. De Natale. Deep learning for mobile multimedia: A survey. ACM Trans. Multimedia Computing, Communications, and Applications, 13(3s):34, 2017.
[24] Sinno Jialin Pan, James T. Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In Proc. 
AAAI Conf. Artificial Intelligence, volume 8, pages 677–682, 2008.
[25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Autodiff Workshop, Conf. Neural Information Processing Systems, Long Beach, CA, 2017.
[26] Alfréd Rényi. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica, 10(3–4):441–451, September 1959.
[27] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016. http://arxiv.org/abs/1606.04671.
[28] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.