{"title": "Adapted Deep Embeddings: A Synthesis of Methods for k-Shot Inductive Transfer Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 76, "page_last": 85, "abstract": "The focus in machine learning has branched beyond training classifiers on a single task to investigating how previously acquired knowledge in a source domain can be leveraged to facilitate learning in a related target domain, known as inductive transfer learning. Three active lines of research have independently explored transfer learning using neural networks. In weight transfer, a model trained on the source domain is used as an initialization point for a network to be trained on the target domain. In deep metric learning, the source domain is used to construct an embedding that captures class structure in both the source and target domains. In few-shot learning, the focus is on generalizing well in the target domain based on a limited number of labeled examples. We compare state-of-the-art methods from these three paradigms and also explore hybrid adapted-embedding methods that use limited target-domain data to fine-tune embeddings constructed from source-domain data. We conduct a systematic comparison of methods in a variety of domains, varying the number of labeled instances available in the target domain (k), as well as the number of target-domain classes. We reach three principal conclusions: (1) Deep embeddings are far superior, compared to weight transfer, as a starting point for inter-domain transfer or model re-use. (2) Our hybrid methods robustly outperform every few-shot learning and every deep metric learning method previously proposed, with a mean error reduction of 34% over state-of-the-art. (3) Among loss functions for discovering embeddings, the histogram loss (Ustinova & Lempitsky, 2016) is most robust. 
We hope our results will motivate a unification of research in weight transfer, deep metric learning, and few-shot learning.", "full_text": "Adapted Deep Embeddings: A Synthesis of Methods\n\nfor k-Shot Inductive Transfer Learning\n\nTyler R. Scott, Karl Ridgeway, Michael C. Mozer\n\nDepartment of Computer Science\nUniversity of Colorado, Boulder\n\n{tysc7237,karl.ridgeway,mozer}@colorado.edu\n\nAbstract\n\nThe focus in machine learning has branched beyond training classi\ufb01ers on a single\ntask to investigating how previously acquired knowledge in a source domain can\nbe leveraged to facilitate learning in a related target domain, known as inductive\ntransfer learning. Three active lines of research have independently explored\ntransfer learning using neural networks. In weight transfer, a model trained on the\nsource domain is used as an initialization point for a network to be trained on the\ntarget domain. In deep metric learning, the source domain is used to construct an\nembedding that captures class structure in both the source and target domains. In\nfew-shot learning, the focus is on generalizing well in the target domain based on a\nlimited number of labeled examples. We compare state-of-the-art methods from\nthese three paradigms and also explore hybrid adapted-embedding methods that\nuse limited target-domain data to \ufb01ne tune embeddings constructed from source-\ndomain data. We conduct a systematic comparison of methods in a variety of\ndomains, varying the number of labeled instances available in the target domain\n(k), as well as the number of target-domain classes. We reach three principal\nconclusions: (1) Deep embeddings are far superior, compared to weight transfer, as\na starting point for inter-domain transfer or model re-use (2) Our hybrid methods\nrobustly outperform every few-shot learning and every deep metric learning method\npreviously proposed, with a mean error reduction of 34% over state-of-the-art. 
(3)\nAmong loss functions for discovering embeddings, the histogram loss (Ustinova &\nLempitsky, 2016) is most robust. We hope our results will motivate a uni\ufb01cation of\nresearch in weight transfer, deep metric learning, and few-shot learning.\n\n1\n\nIntroduction\n\nSince the introduction of backpropagation, researchers in neural networks have investigated inductive\ntransfer learning [3, 24]. Inductive transfer learning refers to the use of labeled data from a source\ndomain to improve generalization accuracy on a related target domain with limited labeled data\n[23]. The notion of \u2018related\u2019 is not formally de\ufb01ned, though the existence of shared features across\ndomains is presumed. With the deep learning movement, there has been a resurgence of interest in\ninductive transfer learning (ITL) for classi\ufb01cation, which we will refer to as k-ITL, where k denotes\nthe number of labeled examples available for each class in the target domain. Of particular interest\nhas been the case with small k, due to the fact that deep learning is typically data hungry, in contrast\nto human learners who often generalize well from a single example [16].\nThree independent lines of research have tackled the k-ITL problem, either explicitly or implicitly.\nFirst, the deep metric learning literature [2, 4, 18, 20, 21, 26, 28, 32, 34, 35, 38] uses the source\ndomain to construct a nonlinear embedding in which instances of the same class are clustered together\nand well separated from instances of different classes. The quality of an embedding is evaluated by\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fexamining inter-class separation in the target domain. Because the target domain is just a means\nof evaluation, deep-metric learning is agnostic as to k. Second, the few-shot learning literature\n[8, 10, 12, 14, 25, 29, 31, 33] addresses the case when k is small, typically k \u2264 20. 
Many of these\nmethods construct embeddings, just as in the metric-learning literature, though other methods have\nbeen explored, e.g., meta-learning. Third, there has long been an intuitive appeal to the weight\ntransfer framework [1, 19, 24, 36, 37], which involves using the hidden representations obtained by\ntraining on the source domain as an initialization point for a second network to be trained on the\ntarget domain. In weight transfer experiments, large k (\u2265 100 or \u2265 1000) are typically chosen.\nDespite distinctive foci on k, all three lines of research utilize essentially the same architectures. They\ndiffer in two aspects of training: (1) the proposed loss function, and (2) whether weights are \ufb01ne tuned\non the target domain (which we refer to as adaptation). In this work, we compare state-of-the-art\nmethods from each paradigm on a range of data sets, varying both the number of examples provided\nfor each class in the target domain, k, and the number of classes in the target domain, n. We also\nformulate hybrid methods combining ideas across paradigms. We reach three strong conclusions:\n\u2022 Weight transfer is the least effective method for k-ITL. For small k, the other methods\nyield vastly superior results; for large k, transferring weights from source to target domains\nyields little or no improvement over training from scratch on the target domain. This result\nhas strong implications for the \ufb01eld: many researchers use weight transfer as a means of\nbootstrapping training in a novel domain, e.g., by starting with a state-of-the-art model such\nas VGG or AlexNet. Indeed, the TensorFlow development team has released a library of\npretrained models, called TensorFlow Hub [30], speci\ufb01cally for this purpose. 
Our results\nindicate that this hub would better serve the community by providing pretrained embeddings.\n\u2022 Across existing methods in few-shot learning and deep metric learning that discover embed-\ndings, one speci\ufb01c loss function is most effective for small-k ITL, the histogram loss [32].\nThis loss comes from the deep metric learning literature, and it has never previously been\ncompared to losses from the few-shot learning literature.\n\n\u2022 We propose a hybrid approach, adapted embeddings, that combines loss functions for deep\nembeddings with weight adaptation in the target domain. This hybrid approach robustly\noutperforms every few-shot learning and every deep metric learning method previously\nproposed on k-ITL. The performance differences are not in tiny percentage error reductions\nthat distinguish contemporary methods, but are systematic and meaningful: a mean error\nreduction of 34% over state-of-the-art. To our knowledge, the only previous work to explore\nsuch a hybrid approach did so in a cursory manner and the results were ambiguous [33].\n\nIn the next section, we survey the three paradigms for k-ITL and identify a state-of-the-art method\nwithin each. Where multiple methods are roughly comparable in performance, we select based\non simplicity of the method. We then describe an experimental methodology for systematically\ncomparing methods, which includes the hybrid we propose, on a range of common data sets.\n\n2 Paradigms for k-Shot Inductive Transfer Learning\n\n2.1 Deep Metric Learning\n\nAn embedding is a distributed representation that captures class structure via metric properties of the\nembedding space. 
In deep metric learning, a neural network is trained to map from the input to the\nembedding space.1 Various objective functions have been proposed for deep metric learning, all of\nwhich aim to ensure that instances of the same class are near one another in the embedding space\nand instances of different classes are far apart [2, 4, 18, 20, 21, 26, 28, 32, 34, 35, 38]. The objective\nfunctions differ in how they quantify 'near' and 'far'. Because classes are separated in the embedding,\nmetric learning supports categorization of an unlabeled instance by projecting it to the embedding\nspace and considering its proximity to labeled instances. Given a pretrained deep embedding, one\ncan perform k-shot learning by embedding the k instances of each novel class and then classifying\nunlabeled instances by their proximity to the labeled data.\n\n1Deep metric learning methods are often initialized with a pretrained classification model such as AlexNet or\nVGG. One can decapitate its output layer and continue training with a metric-learning loss on the penultimate\nlayer (e.g., [21, 32]).\n\nDeep metric learning methods are evaluated using a variation of k-shot learning in which a support\nset of k examples of n classes is embedded, and a mean Recall@r score is obtained for a query set,\nheld-out examples of this domain. Recall@1 is simply nearest-neighbor classification and this single\nbest guess is typically how k-ITL is scored. Although the entire range of r is swept in evaluation,\nranking of the methods is fairly consistent across r. Since there is not an emphasis on learning from\nfew examples, k typically varies in magnitude and is generally not directly specified.\nThe histogram loss [32], hereafter HISTLOSS, is a state-of-the-art method that we chose to represent\nthe deep metric learning paradigm. 
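The evaluation procedure just described, embed the support set and label each query by its nearest labeled neighbor (Recall@1), can be sketched as follows. This is an illustrative sketch rather than the authors' code; the `embed` argument stands in for a trained embedding network fφ:

```python
import numpy as np

def knn_shot_classify(embed, support_x, support_y, query_x):
    """Recall@1-style k-shot classification: each query receives the
    label of its nearest support instance in the embedding space."""
    s = np.stack([embed(x) for x in support_x])   # (k*n, d) support embeddings
    q = np.stack([embed(x) for x in query_x])     # (m, d) query embeddings
    # Euclidean distance from every query to every support embedding
    dists = np.linalg.norm(q[:, None, :] - s[None, :, :], axis=-1)
    return [support_y[i] for i in dists.argmin(axis=1)]
```

Accuracy of these single best guesses over a held-out query set is the Recall@1 score by which embedding methods are compared.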
Its Recall@1 performance is equivalent to or slightly better than\ncontemporaneous methods [26, 34, 35], and HISTLOSS has only one hyperparameter, the number of\nhistogram bins, and results are robust to the setting of the hyperparameter.2 HISTLOSS constructs two\nsets of similarities, S+ = {s(fφ(xi), fφ(xj)) | yi = yj} and S− = {s(fφ(xi), fφ(xj)) | yi ≠ yj},\nwhere fφ(xi) is the neural network embedding of input i with class label yi and s(·, ·) is a similarity\nmetric. A loss, Lφ = E_{s∼p−}[ ∫_{−∞}^{s} p+(z) dz ], is defined on the similarity distributions of positive\npairs and negative pairs, p+(s) and p−(s) respectively. The distributions are each estimated as a\nhistogram, and the empirical loss is efficiently computed using the histogram bins to identify all\n(s+ ∈ S+, s− ∈ S−) similarity pairs for which s− ≥ s+. The loss is minimized via stochastic\ngradient descent in weights φ.\n\n2.2 Few-Shot Learning\n\nThe few-shot learning literature is explicitly directed at the k-ITL problem with an emphasis on\nsmall k, typically k ≤ 20. Embeddings form the basis of some methods [8, 12, 14, 29, 31, 33].\nMeta-learning [10, 25] is another innovative approach involving training a recurrent network on a\nsequence of small classification tasks, so that it learns more efficiently on a subsequent task. We\nchose the prototypical network [29], hereafter PROTONET, as our representative of few-shot learning\nmethods. It is simple and elegant, in addition to being state-of-the-art.3\nPROTONET is a deep network that embeds input xi, and for each class c, a prototype µc is constructed\nfrom the k instances in the support set: µc = (1/k) Σ_{i | yi = c} fφ(xi). A query q is classified according\nto its distance to the prototypes: p(yq = c | xq) ∝ exp(−d(fφ(xq), µc)). The network parameters,\nφ, are trained to maximize the conditional likelihood, i.e., Lφ = Σ_i ln p(yi | xi).\n\n2.3 Weight Transfer\n\nWeight transfer in neural networks [1, 19, 24, 36, 37] is an instance of a more general framework in\nwhich parameters of a machine-learning model trained on a source domain are applied to a target\ndomain. In some situations, the source and target are trained simultaneously [3, 27]. Some of the\nliterature on weight transfer appears under the heading of domain adaptation [22, 27], which is often\ntreated as a synonym for transfer learning, though formally domain adaptation involves changing\ninput distributions instead of output labels [6].\nThe most systematic and thorough analysis of weight transfer is the work of Yosinski et al. [36]. In\nthis work, the source and target domains share a common layered feedforward architecture, which\nmaps input xi to internal state fφ(xi), which is then mapped to domain-specific class probabilities\nvia a softmax, p(y | xi) ∝ exp(ω fφ(xi)), where ω is a set of domain-specific weights. Yosinski\net al. transferred various portions of φ, from only the first layer of weights to all layers, up to and\nincluding the penultimate layer. In addition, the copied weights were either clamped after transfer\nor were further adapted on the target task. Training on the source task and adaptation on the target\naimed to maximize the conditional likelihood, Lφ,ω = Σ_i ln p(yi | xi). Yosinski et al. found that the\nbest classification accuracy on the target domain is obtained when all network weights up to the\npenultimate layer are transferred and then adapted. We will refer to this state-of-the-art scheme as\nweight adaptation, or WEIGHTADAPT for short.\n\n2To rank deep metric learning algorithms, we used comparisons directly reported in articles as well as\nperformance on the same data sets and evaluation methodology. 
We obtain the partial ranking [32] ≥ [26, 34, 35] > [4, 18, 21, 28, 38].\n\n3Our partial ranking of few-shot learning methods based on target-domain accuracy is: [29] > [10, 31] > [25] > [8] > [12] > [33] > [14].\n\nYosinski et al. mainly focused on large k, k > 1000, and observed only a modest improvement\nin accuracy over the baseline condition of ignoring the source-domain data and training only on\nthe target-domain data. Nonetheless, the notion of weight adaptation is extremely popular in deep\nlearning because it can provide a large time savings over training models from scratch, and it may\nprevent overfitting when the target domain is data constrained [24, 36].\n\n2.4 Adapted Embeddings\n\nWe have summarized two representative, state-of-the-art embedding methods: HISTLOSS and PROTONET. For both methods, model parameters are determined solely based on the source-domain data.\nThe target-domain support set, the k instances of each of the n classes in the target domain, is\nused merely for comparison to query (to-be-classified) instances. In contrast, weight adaptation\ndetermines model parameters using both source and target domain data. We explore a straightforward\nhybrid, adapted embeddings, which unifies embedding methods and weight adaptation by using\nthe target-domain support set for model-parameter adaptation. To the best of our knowledge this\nseemingly obvious idea has been incorporated into only one few-shot learning paradigm, matching\nnets [33], where it is referred to as fine tuning, and it is beneficial in one domain and harmful in another.4 Perhaps the\nassumption in the few-shot literature has been that little value will be obtained from adaptation with\nsmall k; indeed, for most algorithms, the data are insufficient to permit adaptation with k = 1. 
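As a concrete illustration of the adaptation step, the sketch below fine-tunes a linear embedding on a k-shot support set with a toy push-pull pair loss. Both the loss and the linear map are stand-ins chosen for brevity (the paper's methods apply HISTLOSS or the PROTONET loss to a deep network); only the overall recipe, initialize from source-trained weights and take gradient steps on the support set, reflects the text:

```python
import numpy as np

def pair_loss_and_grad(W, X, y):
    """Toy push-pull embedding loss (a stand-in for the paper's metric
    losses): mean squared distance of same-class pairs minus that of
    different-class pairs, with its analytic gradient in W."""
    pos, neg = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            (pos if y[i] == y[j] else neg).append((i, j))
    loss, grad = 0.0, np.zeros_like(W)
    for pairs, sign in ((pos, 1.0), (neg, -1.0)):
        for i, j in pairs:
            u = X[i] - X[j]        # input-space difference
            Wu = W @ u             # embedding-space difference
            loss += sign * (Wu @ Wu) / len(pairs)
            grad += sign * 2.0 * np.outer(Wu, u) / len(pairs)
    return loss, grad

def adapt_embedding(W_source, X_support, y_support, lr=0.01, steps=50):
    """Adapted embeddings: start from source-trained weights and take a
    few gradient steps on the k-shot target support set."""
    W = W_source.copy()
    for _ in range(steps):
        _, g = pair_loss_and_grad(W, X_support, y_support)
        W -= lr * g
    return W
```

With k = 1 no same-class pairs exist, which is why, as noted above, these losses permit no adaptation in the one-shot case.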
In the\ndeep metric learning literature, the target domain is considered as a means of evaluating embeddings,\nand thus optimizing performance in the target domain is not a focus of interest.\n\n3 Methodology\n\nWe tested six methods: WEIGHTADAPT, HISTLOSS, PROTONET, ADAPTHISTLOSS, ADAPTPROTONET,\nand a non-transfer BASELINE that ignores the source domain and trains a classi\ufb01er solely on the limited\nlabeled data in the target domain. We systematically explored how methods perform as a function of\nk and n on four popular data sets: MNIST [17], Isolet [5], tinyImageNet [9], and Omniglot [16].\nPrevious research on deep-metric and few-shot learning has addressed problems in which the number\nof available classes in the source domain, Nsrc, is much larger than the number of classes to be\ndiscriminated in the target domain, n. Consequently, training on the source is divided into a series\nof episodes where n classes are sampled from the Nsrc.5 In contrast, weight transfer has chosen\nproblems in which Nsrc = n and the same n classes are used across training episodes, for a relatively\nlarge n. We had hoped to independently vary Nsrc and n in our exploration, but combined with search\nover k, the space becomes too large. We therefore assumed Nsrc = n. This constraint helps balance\ntask dif\ufb01culty across n: increasing n makes the target task harder but also provides more data for\ntraining in the source domain. As a result, our simulations do not reach ceiling performance, which\ncan be a concern in few-shot learning. Another rationale for this decision is that many real-world\nk-ITL tasks provide a limited supply of source data, as well as target data. For example, in medical\nradiology, one might hope to use labeled wrist x-rays to support the classi\ufb01cation of ankle x-rays. 
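The episodic source training described above (sample n classes per episode, then a k-shot support set from each) can be sketched with a hypothetical helper; `data_by_class`, an assumed mapping from class label to instances, is not a structure from the paper's released code:

```python
import random

def sample_episode(data_by_class, n, k, rng=random):
    """Sample one n-way training episode: choose n classes, then k
    support instances per class; remaining instances become queries."""
    classes = rng.sample(sorted(data_by_class), n)
    support, query = [], []
    for c in classes:
        items = list(data_by_class[c])
        rng.shuffle(items)
        support += [(x, c) for x in items[:k]]
        query += [(x, c) for x in items[k:]]
    return support, query
```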
To\nobtain robust and generalizable results, we evaluated models over n ranging from 5 to 1000.\nAlso in the interest of robustness, we opted for another difference in methodology from most previous\nresearch on deep-metric and few-shot learning. Previous research has typically trained a single source\nmodel and evaluated over many episodes of the target domain. Statistical inference from these data\nallow one to predict the ranking of methods for new samples of the target domain, but not for new\nsamples of the source domain. Consequently, we ran multiple replications of each method for a given\nk and n, and on each replication we drew a single sample of n classes from both the source and target\ndomains.6 This approach is computation intensive, but if method X consistently outranks method\nY, it should do so for a new (related) target domain, as well as a new (related) source domain. We\nexpected to need many dozens of replications to obtain reliable estimates of mean performance, but\nto our surprise, we found that 10 replications was more than adequate to discern among methods.\n\n4The authors of [33] provide no details of how they fine tuned. They present results for fine tuning with\nk = 1, which cannot do much more than move all instances further apart.\n\n5PROTONET [29] found advantages from sampling more than n classes for source training episodes. 60-class\nepisodes were constructed for source training on Omniglot. For miniImageNet source training, 30-class episodes\nwere used when k = 1 and 20-class episodes were used when k = 5.\n\n6In the supplementary materials, we show results from testing embeddings on the Omniglot data set using\nthe methodology from previous few-shot learning studies.\n\nFigure 1: Data pipeline. Data set D is divided into source\nS and target T domains. S is further split into τ training\nand ν validation instances (see Table 1). From T, k support instances per class are selected and the rest become\nquery instances. Tsupport is further split into support and\nquery subsets for ADAPTPROTONET adaptation.\n\nTable 1: Splits and sizes for each data set used in the k-ITL experiments. The source data set doesn't\nuse a test split and the target data set doesn't use a validation split. The train size for the target data\nset, Tsupport, is k × n for all data sets.\n\n| Data Set | n | Source Train Size (τ) | Source Valid Size (ν) | Target k | Target Test Size |\n| MNIST | 5 | 1600n | 600n | {1, 5, 10, 50, 100, 500, 1000} | 10000 |\n| Omniglot | {5, 10, 100, 1000} | 15n | 5n | {1, 5, 10} | n(20 − k) |\n| Isolet | {5, 10} | 250n | 50n | {1, 10, 50, 100, 200} | n(297 − k) |\n| tinyImageNet | {5, 10, 50} | 350n | 200n | {1, 10, 50, 100, 300} | n(550 − k) |\n\nAll simulations were thus replicated 10 times. Each replication involved a random selection of classes\nand split of instances, as sketched in the data pipeline of Figure 1. To reduce variability, the same\nclass and instance splits were used across methods, as were the contents of each minibatch of training\ndata. Weights were initialized randomly for each replication.7 For the source domain, a validation set\nwas used to stop training. For target domain adaptation, training continued until performance reached\nasymptote. Given the small k available for target domain adaptation, a validation set would have had\nhigh variance and the transfer of weights from the source should impose a strong inductive bias.\nTable 1 contains details on the sizes and splits of each data set. The supplementary materials contain\ndetails on the network architectures used for each data set. 
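The Figure 1 pipeline can be sketched as follows; the helper and its argument names are hypothetical, and the per-split instance counts would come from Table 1:

```python
import random

def build_splits(data_by_class, n, k, n_valid, rng):
    """Hypothetical sketch of the Figure 1 pipeline: disjoint source and
    target class sets (with N_src = n), a train/validation split of the
    source instances, and a k-shot support/query split of the target."""
    chosen = rng.sample(sorted(data_by_class), 2 * n)
    src_classes, tgt_classes = chosen[:n], chosen[n:]
    source, target = {}, {}
    for c in src_classes:
        items = rng.sample(data_by_class[c], len(data_by_class[c]))  # shuffled copy
        source[c] = {"valid": items[:n_valid], "train": items[n_valid:]}
    for c in tgt_classes:
        items = rng.sample(data_by_class[c], len(data_by_class[c]))
        target[c] = {"support": items[:k], "query": items[k:]}
    return source, target
```

For an Omniglot-style data set with 20 instances per class, each target class ends up with k support and 20 - k query instances, matching Table 1's n(20 - k) test size.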
For each data set, all six methods used\nthe same underlying network architecture with two exceptions: (1) the BASELINE and WEIGHTADAPT\narchitectures had an additional class-output layer which was not transferred from source to target; and\n(2) for training HISTLOSS and ADAPTHISTLOSS, the embeddings were L2 normalized, allowing for the\nuse of the (bounded) cosine distance function with a 200-bin histogram. The embedding dimension\nwas 128 for MNIST, Omniglot, and tinyImageNet, and 64 for Isolet. Because training the parameters of\nPROTONET requires a data split between support and query sets, we chose to further divide Strain into\nSsupport and Squery as noted in Figure 1. All models were trained with the Adam [13] optimizer.\n\n4 Results\nMNIST. This data set consists of 28 × 28 gray-scale images of handprinted digits [17]. MNIST was\nsplit into a source domain, with the digit classes 0–4, and a target domain, with 5–9. For this and\nfollowing data sets, details of training parameters (learning rates and k′; see Figure 1) are included\nin the supplementary materials. Figure 2 plots accuracy on the test set, Tquery, for each of the six\nmethods as a function of k, with n = 5 held constant. Each point is the average over ten replications.\nError bands of ±1 standard error of the mean are shown, though they may be difficult to discern\nexcept when k is small. The pattern of results here mirrors the results that we will present for the\nother data sets. Notably,\n\n• WEIGHTADAPT shows modest improvements over BASELINE, but the benefit of the source\ndomain diminishes as k → 1000.\n• For k > 1, ADAPTPROTONET improves on PROTONET, and ADAPTHISTLOSS improves on\nHISTLOSS. 
For k = 1, there are insufficient instances of each class to perform any adaptation,\nand thus the adapted algorithms are identical to their non-adapted counterparts.\n\n7In [32], HISTLOSS was initialized with a pretrained classification model whose output layer had been\ndecapitated, and training proceeded with the metric-learning loss. For the sake of comparison, we trained\nHISTLOSS from scratch.\n\nFigure 2: MNIST k-ITL results. Each point is the average test accuracy over 10 replications. Error\nbands indicate ±1 standard error of the mean.\n\nFigure 3: Isolet k-ITL results. Each point is the average test accuracy over 10 replications. Error\nbands indicate ±1 standard error of the mean.\n\n• ADAPTHISTLOSS consistently outperforms ADAPTPROTONET.\n• PROTONET appears not to benefit from k > 50, as one would expect for a method with\nhigh inductive bias which is designed for the small k regime. However, ADAPTPROTONET\ncontinues to improve as more data are available because it can also use the data for adaptation.\n• Across the range of k tested, WEIGHTADAPT is inferior to the adapted embeddings,\nADAPTHISTLOSS and ADAPTPROTONET.\n\nIsolet. This data set, from the UCI repository, is a spoken letter (A-Z) data set with 26 classes and\napproximately 297 examples per class [5]. The input is coded as 617 attributes which specify spectral\ncoefficients, contour features, sonorant features, pre-sonorant features, and post-sonorant features.\nThe left and right panels of Figure 3 show test accuracy for n = 5 and n = 10, respectively. The\nresults are qualitatively identical for the two values of n. 
The Isolet results eerily mirror those from\nMNIST (Figure 2), all the more surprising considering that the domains\u2014vision and speech\u2014and\narchitectures\u2014convolutional and fully-connected\u2014are quite different.\ntinyImageNet. This data set is a subset of ImageNet [7] containing 200 classes with 550 examples\nper class [9]. Each image is 64 \u00d7 64 with 3 channels for RGB. The few-shot literature typically uses\nminiImageNet for evaluation. We chose tinyImageNet because it has a greater diversity of classes\n(200 vs. 100). The three panels of Figure 4 show test accuracy for 5, 10, and 50-way classi\ufb01cation\nproblems. The take-away is similar to the previous two simulations, although WEIGHTADAPT does\nnot seem to show as consistent a bene\ufb01t over BASELINE as it did in the previous simulations. Once\nagain, ADAPTHISTLOSS is consistently the best performer over all (k, n) combinations.\nOmniglot. This data set contains images of labeled, handwritten characters from diverse alphabets\n[16]. In the few-shot literature, Omniglot is the standard model-comparison data set. However,\nthe literature relies on a speci\ufb01c split of the data on which state-of-the-art methods are now close\nto achieving ceiling performance. To avoid ceiling effects and obtain greater generality, we chose\nrandom splits. Omniglot has 1623 different characters, each with 20 instances; following previous\nresearch [29, 31, 33], we augment the data set with all 90\u25e6 rotations, resulting in 6492 classes. Each\ngrayscale image is resized to 28 \u00d7 28. The three panels in Figure 5 show test accuracy for 1-, 5-, and\n10-shot learning. In each panel, n is varied from 5 to 1000. Note that WEIGHTADAPT and BASELINE\ndo not achieve performance much above chance for large n, and WEIGHTADAPT is reliably better\nthan BASELINE for only n = 5. 
As in the previous simulations, ADAPTHISTLOSS is robustly the best performer, and the adapted\nembedding methods (ADAPTHISTLOSS, ADAPTPROTONET) reliably outperform the traditional\nembedding methods (HISTLOSS, PROTONET). (Remember that k = 1 does not provide sufficient\ndata to permit adaptation.)\n\nFigure 4: tinyImageNet k-ITL results. Each point is the average test accuracy over 10 replications.\nError bands indicate ±1 standard error of the mean.\n\nFigure 5: Omniglot k-ITL results. Each point is the average test accuracy over 10 replications. Error\nbands indicate ±1 standard error of the mean.\n\n5 Discussion and Conclusions\nThe results from our k-ITL simulations are remarkably consistent across data sets and offer unambiguous prescriptions for significantly improving current practice in inductive transfer learning. The\nmain messages are as follows.\nAdapted embeddings are the method of choice for k-ITL. We proposed adapted-embedding\nmethods, ADAPTHISTLOSS and ADAPTPROTONET, that combine deep embedding losses for training\non the source domain with weight adaptation on the target domain. These methods are strictly\nsuperior to non-adapted (HISTLOSS, PROTONET) and non-embedding (WEIGHTADAPT, BASELINE)\nmethods. 
Figure 6a summarizes 34 {data set, k, n} conditions by comparing the proportion reduction\nin classification error obtained by the best adapted embedding method (i.e., ADAPTPROTONET and\nADAPTHISTLOSS) over the best of all alternative methods.8 Figures 6b,c break the results down by\ncomparing separately to non-adapted embeddings and adapted non-embedding methods, respectively.\nThe adapted embeddings achieve an error reduction of 33.7% over the best of other methods, with\na range from 2.2% to 73.9%. In every condition, adapted embeddings outperform non-adapted\nembeddings (mean 37.0%) and adapted non-embedding methods (mean 54.9%). Of the adapted\nembeddings, there is a clear ranking: ADAPTHISTLOSS is superior to ADAPTPROTONET.\nTo our knowledge, Vinyals et al. [33] is the only previous work to explore adapted embedding\nmethods, in the context of matching networks. Few details were provided about the effort and the\nresults were ambiguous. Several possibilities might explain why we see consistent and impressive\nbenefits of adaptation but Vinyals et al. did not. First, some algorithms appear to benefit more than\nothers: for k ∈ {5, 10}, adapting HISTLOSS yields a greater benefit than adapting PROTONET. It's\npossible that matching nets overfit when adapting, whereas HISTLOSS, which has a natural stopping\ncriterion, does not. Second, the evaluation of matching nets focused on k = 1 and k = 5. For k = 1,\nadaptation provides no information about intraclass structure; it can only separate classes. (And for\nthe embedding losses we studied, we cannot do that with k = 1.)\n\n8We exclude k = 1 conditions: one labeled example is insufficient to adapt either HISTLOSS or PROTONET.\n\nFigure 6: Histogram of percent reduction in classification error obtained by best adapted embedding\nmethod (ADAPTPROTONET, ADAPTHISTLOSS) versus the best of (a) all other methods, (b) non-adapted\nembeddings (PROTONET, HISTLOSS), and (c) adapted non-embedding methods (BASELINE,\nWEIGHTADAPT). Each histogram includes all of the 34 {data set, k, n} conditions tested with k > 1.\n\nTo construct models that can be repurposed, use deep embeddings. WEIGHTADAPT is a common\nmethod of bootstrapping classifier training in a new domain. WEIGHTADAPT fails to match the adapted\nembeddings or even the non-adapted embeddings on k-ITL. WEIGHTADAPT does beat BASELINE\nfor small k, but for our data sets, any advantage of WEIGHTADAPT seems to vanish for k ≥ 100,\nin contrast to the adapted embedding methods that still benefit from increasing k. Our results are\nconsistent with those of Yosinski et al. [36]. TensorFlow Hub and other libraries have been released\nto enable the reusability of large state-of-the-art models, in order to transfer and adapt their weights to\nnovel target domains. Our results suggest that models trained on embedding losses would be far more\naccurate in transfer than models trained on an explicit classification loss, and should still achieve\ncomparable training speed-ups, one goal of model re-purposing.\nWEIGHTADAPT decapitates a classification network and treats the penultimate layer as an embedding.\nSo why does this embedding fail to be as useful for k-ITL as the embeddings discovered by PROTONET\nand HISTLOSS? 
The hidden layers of a classification network aim to discard information unrelated to class discrimination, and if successful, the penultimate layer will also orthogonalize the classes, i.e., discard most information about how one class relates to another. This inter-class structure is critical to projecting novel classes into an embedding space [26]. We thus argue that, fundamentally, the classification objective, and the corresponding one-hot output representation, is inferior for obtaining representations that will transfer to novel domains.

Methods should not be segregated based on their focus on k. Weight transfer, few-shot learning, and deep metric learning all perform a variant of k-ITL, yet these three lines of research have been mostly disconnected from one another. (For example, when submitting to NIPS, there are distinct subject areas for transfer learning, few-shot learning, and metric learning.) We suspect the lack of interaction is due to the fact that each paradigm has a distinctive focus on k. Although weight transfer may typically be used with larger k, our experiments show that it beats BASELINE for small k. Few-shot learning is aimed at small k, but seems to work surprisingly well for large k. Metric learning is neutral as to k, and the representative method we chose, HISTLOSS, works well over a range of k. If the preferred method depended on k, it might be sensible to treat these as independent topics, but one method, the hybrid ADAPTHISTLOSS, is superior for all k and over a range of n.

The primary contribution of our work is the systematic comparison of methods across complementary lines of research. The novelty of ADAPTHISTLOSS, as a synthesis of HISTLOSS and WEIGHTADAPT, is admittedly minor: parameter fine tuning is a simple and obvious strategy in many areas of machine learning.
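Indeed, the fine-tuning strategy is simple to state. The following is a minimal sketch, not the paper's implementation: we assume a linear embedding and substitute a simplified pull-to-prototype (within-class scatter) objective for the actual PROTONET and HISTLOSS losses; all function names are ours. A pretrained embedding is adapted on the k labeled target examples per class, and queries are then classified by nearest class prototype:

```python
import numpy as np

def prototypes(Z, y):
    """Class means ("prototypes") of embedded points Z, one per label in y."""
    classes = np.unique(y)
    return classes, np.stack([Z[y == c].mean(axis=0) for c in classes])

def predict(W, X, classes, protos):
    """Label each row of X by its nearest prototype in the embedding space."""
    Z = X @ W.T
    d2 = ((Z[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

def adapt(W, X, y, steps=50, lr=0.01):
    """Fine-tune embedding W on the k-shot support set (X, y) by gradient
    descent on within-class scatter: each embedded point is pulled toward
    its (re-computed) class prototype."""
    W = W.copy()
    for _ in range(steps):
        Z = X @ W.T
        classes, protos = prototypes(Z, y)
        target = protos[np.searchsorted(classes, y)]  # prototype for each point
        grad = 2.0 * (Z - target).T @ X / len(X)      # d(scatter)/dW
        W -= lr * grad
    return W
```

In the full method the embedding is a deep network trained by backpropagation; here the gradient of a single linear layer is written out by hand. Note that a pull-only toy objective like this one can collapse the embedding if over-trained, which is exactly the role of the natural stopping criterion that HISTLOSS provides.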
What makes our work a valuable contribution is the non-obvious and impressive magnitude of the improvements obtained by this obvious strategy. Many articles in metric learning and few-shot learning justify and differentiate methods based on tiny percentage error reductions, as contrasted with the comparatively impressive 34% error reduction we obtain over the state of the art. By demonstrating gains of this magnitude, we hope to motivate a unification of research in weight transfer, few-shot learning, and deep metric learning.

Acknowledgements

We would like to thank Chenhao Tan for helpful discussions. This research was supported by National Science Foundation awards EHR-1631428 and SES-1461535. The code is available at https://github.com/tylersco/adapted_deep_embeddings.

References

[1] Amiriparian, S., Gerczuk, M., Ottl, S., Cummins, N., Freitag, M., Pugachevskiy, S., Baird, A., and Schuller, B. W. (2017). Snore sound classification using image-based deep spectrum features. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association.

[2] Bellet, A., Habrard, A., and Sebban, M. (2013). A Survey on Metric Learning for Feature Vectors and Structured Data. arXiv e-prints, 1306.6709.

[3] Caruana, R. (1997). Multitask Learning. Machine Learning, 28:41–75.

[4] Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Cole, R. and Fanty, M. (1994). ISOLET Dataset.

[6] Daumé, H. (2007). Domain adaptation vs. transfer learning.
https://nlpers.blogspot.com/2007/11/domain-adaptation-vs-transfer-learning.html.

[7] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Edwards, H. and Storkey, A. (2017). Towards a Neural Statistician. In International Conference on Learning Representations (ICLR 2017).

[9] Fei-Fei, L., Johnson, J., and Yeung, S. (2018). Tiny ImageNet Visual Recognition Challenge.

[10] Finn, C., Abbeel, P., and Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135.

[11] Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456.

[12] Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to Remember Rare Events. In International Conference on Learning Representations (ICLR 2017).

[13] Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv e-prints, 1412.6980.

[14] Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.

[15] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105.

[16] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

[17] LeCun, Y. and Cortes, C. (2010).
MNIST handwritten digit database.

[18] Li, W., Zhao, R., Xiao, T., and Wang, X. (2014). DeepReID: Deep Filter Pairing Neural Network for Person Re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Long, M., Cao, Y., Wang, J., and Jordan, M. (2015). Learning Transferable Features with Deep Adaptation Networks. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 97–105.

[20] Lu, J., Hu, J., and Zhou, J. (2017). Deep Metric Learning for Visual Understanding: An Overview of Recent Advances. IEEE Signal Processing Magazine, 34(6):76–84.

[21] Oh Song, H., Xiang, Y., Jegelka, S., and Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014). Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1717–1724.

[23] Pan, S. J. and Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

[24] Pratt, L. Y., Mostow, J., and Kamm, C. A. (1991). Direct Transfer of Learned Information Among Neural Networks. In Proceedings of the American Association for Artificial Intelligence, volume 91, pages 584–589.

[25] Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR 2017).

[26] Ridgeway, K. and Mozer, M. C. (2018). Learning Deep Disentangled Embeddings with the F-Statistic Loss. arXiv e-prints, 1802.05312.

[27] Rozantsev, A., Salzmann, M., and Fua, P. (2018).
Beyond Sharing Weights for Deep Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1.

[28] Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems 30, pages 4077–4087.

[30] TensorFlow (2018). TensorFlow Hub.

[31] Triantafillou, E., Zemel, R., and Urtasun, R. (2017). Few-Shot Learning Through an Information Retrieval Lens. In Advances in Neural Information Processing Systems 30, pages 2255–2265.

[32] Ustinova, E. and Lempitsky, V. (2016). Learning Deep Embeddings with Histogram Loss. In Advances in Neural Information Processing Systems 29, pages 4170–4178.

[33] Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems 29, pages 3630–3638.

[34] Wang, J., Zhou, F., Wen, S., Liu, X., and Lin, Y. (2017). Deep Metric Learning with Angular Loss. In IEEE International Conference on Computer Vision (ICCV 2017), pages 2612–2620.

[35] Yi, D., Lei, Z., Liao, S., and Li, S. Z. (2014). Deep Metric Learning for Person Re-identification. In 2014 22nd International Conference on Pattern Recognition, pages 34–39.

[36] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27, pages 3320–3328.

[37] Zhang, Z., Ning, G., and He, Z. (2017). Knowledge Projection for Deep Neural Networks. arXiv e-prints, 1710.09505.

[38] Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q.
(2015). Scalable Person Re-identification: A Benchmark. In IEEE International Conference on Computer Vision.