{"title": "Catastrophic Forgetting Meets Negative Transfer: Batch Spectral Shrinkage for Safe Transfer Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1908, "page_last": 1918, "abstract": "Before sufficient training data is available, fine-tuning neural networks pre-trained on large-scale datasets substantially outperforms training from random initialization. However, fine-tuning methods suffer from two dilemmas, catastrophic forgetting and negative transfer. While several methods with explicit attempts to overcome catastrophic forgetting have been proposed, negative transfer is rarely delved into. In this paper, we launch an in-depth empirical investigation into negative transfer in fine-tuning and find that, for the weight parameters and feature representations, transferability of their spectral components is diverse. For safe transfer learning, we present Batch Spectral Shrinkage (BSS), a novel regularization approach to penalizing smaller singular values so that untransferable spectral components are suppressed. BSS is orthogonal to existing fine-tuning methods and is readily pluggable to them. Experimental results show that BSS can significantly enhance the performance of representative methods, especially with limited training data.", "full_text": "Catastrophic Forgetting Meets Negative Transfer:\nBatch Spectral Shrinkage for Safe Transfer Learning\n\nXinyang Chen\u2217, Sinan Wang\u2217, Bo Fu, Mingsheng Long ((cid:66))\u2020, and Jianmin Wang\n\nSchool of Software, BNRist, Tsinghua University, China\nResearch Center for Big Data, Tsinghua University, China\nNational Engineering Laboratory for Big Data Software\n\n{chenxiny17,wang-sn17}@mails.tsinghua.edu.cn, {mingsheng,jimwang}@tsinghua.edu.cn\n\nAbstract\n\nBefore suf\ufb01cient training data is available, \ufb01ne-tuning neural networks pre-trained\non large-scale datasets substantially outperforms training from random initialization.\nHowever, \ufb01ne-tuning methods suffer from a dilemma across catastrophic forgetting\nand negative transfer. While several methods with explicit attempts to overcome\ncatastrophic forgetting have been proposed, negative transfer is rarely delved into.\nIn this paper, we launch an in-depth empirical investigation into negative transfer\nin \ufb01ne-tuning and \ufb01nd that, for the weight parameters and feature representations,\ntransferability of their spectral components is diverse. For safe transfer learning,\nwe present Batch Spectral Shrinkage (BSS), a novel regularization approach\nto penalizing smaller singular values so that untransferable spectral components\nare suppressed. BSS is orthogonal to existing \ufb01ne-tuning methods and is readily\npluggable into them. Experimental results show that BSS can signi\ufb01cantly enhance\nthe performance of state-of-the-art methods, especially in few training data regime.\n\n1\n\nIntroduction\n\nDeep learning has made revolutionary changes to diverse machine learning problems and applications.\nDuring the past few years, signi\ufb01cant improvements on various tasks have been achieved by deep\nneural networks [17, 33, 10, 35]. However, training deep neural networks from scratch is time-\nconsuming and laborious, and the excellent performance of such deep neural networks depends on\nlarge-scale labeled datasets which we may have no access to in many practical scenarios.\nFortunately, deep feature representations learned on large-scale datasets are transferable across several\ntasks and domains [25, 7, 45]. Thus, \ufb01ne-tuning, a simple yet effective method that exploits this\nnice property of deep representations, is widely adopted, especially before suf\ufb01cient training data is\navailable [9]. Under this well-established paradigm, deep neural networks are \ufb01rstly pre-trained on\nlarge-scale datasets and then \ufb01ne-tuned to target tasks, requiring relatively smaller training samples.\nTo a certain extent, \ufb01ne-tuning alleviates deep neural networks\u2019 hunger for data. However, adequate\namount of training data for target tasks is still a prerequisite for the effectiveness of vanilla \ufb01ne-tuning\nmethods. When the requirement of training data cannot be satis\ufb01ed, two hidden issues of \ufb01ne-tuning\nwill become extremely severe, seriously hampering the generalization performance of deep models.\nThe \ufb01rst is catastrophic forgetting [14], which is the tendency of the model to lose previous learnt\nknowledge abruptly while it may incorporate information relevant to target tasks, leading to over\ufb01tting.\nThe second is negative transfer [37]. Not all pre-trained knowledge is transferable across domains,\nand an indiscriminate transfer of all knowledge is detrimental to the model.\n\n\u2217Authors contributed equally\n\u2020Corresponding author: Mingsheng Long (mingsheng@tsinghua.edu.cn)\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fTable 1: Comparison of Different Fine-Tuning Methods ((cid:150): unknown)\nTechnical Challenge\n\nTarget Dataset Size\n\nMethod\n\nlarge medium small\n\ncatastrophic forgetting\n\nnegative transfer\n\nL2\n\nL2-SP [20]\nDELTA [19]\n\nBSS (Proposed)\n\n\u0013\n\u0013\n\u0013\n\u0013\n\n(cid:150)\n(cid:150)\n(cid:150)\n\u0013\n\n\u0017\n\u0017\n\u0017\n\u0013\n\n\u0017\n\u0013\n\u0013\n\u0017\n\n\u0017\n\u0017\n\u0017\n\u0013\n\nIncremental learning [30, 18, 21, 32, 44] extends the existing model\u2019s knowledge continuously with\ngradually available training data. Various measures have been taken to curb the tendency of forgetting\npreviously learnt knowledge while acquiring new knowledge. Note that the original motivation of\nmitigating catastrophic forgetting for incremental learning and \ufb01ne-tuning is quite different. In the\ncontext of incremental learning, the model performance on both old and new tasks makes sense, while\nwhen it comes to \ufb01ne-tuning, only target tasks are concerned. In this paper, catastrophic forgetting\nrefers speci\ufb01cally to forgetting the pre-trained knowledge bene\ufb01cial to target tasks. During the past\nfew years, a few transfer learning penalties [20, 19] have been proposed to constrain parameters on\nmaintaining pre-trained knowledge. Specially, L2-SP [20] considers that weight parameters should be\ndriven to pre-trained values instead of the origin and takes the advantage of all pre-trained weights to\nrefrain networks from forgetting useful information. DELTA [19] utilizes discriminative knowledge\nin feature maps and imposes feature map regularization by the attention mechanism.\nMethods above largely alleviate the problem of catastrophic forgetting by drawing weight parameters\nclose to pre-trained values or aligning transferable channels in feature maps. Still, negative transfer\nhas not been attached with enough importance and is often overlooked in deep methods. However,\nwhen the amount of training examples on the target domain is limited, overly retaining pre-trained\nknowledge will deteriorate target performance and negative transfer will become prominent. It is\nthereby apparent that catastrophic forgetting and negative transfer constitute a dilemma, which should\nbe solved jointly for safe transfer learning. In this paper, we explore \ufb01ne-tuning against negative\ntransfer and propose a novel regularization approach to restraining detrimental pre-trained knowledge\nduring \ufb01ne-tuning. A comparison of these \ufb01ne-tuning methods is presented in Table 1.\nBased on Singular Value Decomposition (SVD), we investigate which spectral components of weight\nparameters and feature representations are untransferable across domains, and make two observations.\nFor weight parameters, in high layers, the spectral components with small singular values are not\ntransferable. For feature representations, an interesting \ufb01nding is that with suf\ufb01cient training data,\nthe spectral components with small singular values are decayed autonomously during \ufb01ne-tuning.\nInspired by this inherent mechanism, we propose Batch Spectral Shrinkage (BSS), a general\napproach to inhibiting negative transfer by suppressing the spectral components with small singular\nvalues that correspond to detrimental pre-trained knowledge. BSS is orthogonal to existing methods\nfor mitigating catastrophic forgetting, and can be easily embedded into them to tackle the dilemma.\nExperiments con\ufb01rm the effectiveness of BSS in mitigating negative transfer, especially when the\namount of available training data is limited, yielding state-of-the-art results on several benchmarks.\n\n2 Related Work\n\nTransfer learning, an important machine learning paradigm, is committed to transferring knowledge\nobtained on a source domain to a target domain [2, 26]. There are several different scenarios of\ntransfer learning, such as domain adaptation [31] and multi-task learning [2], while inductive transfer\nlearning is the most practical one. In inductive transfer learning, 1) the target task is different from\nthe source task (different label spaces), and 2) there is labeled data in the target domain.\nFine-tuning is the de facto approach to inductive transfer of deep models, where we have a pre-trained\nmodel from the source domain but have no access to the source data. To utilize pre-trained knowledge\nobtained on the source domain, Donahue et al. [7] employed a label predictor to classify features\nextracted by the pre-trained model. This method directly reused a substantial part of the weight\nparameters, which inhibits catastrophic forgetting (relevant information eliminated) but exacerbates\nthe risk of negative transfer (irrelevant information retained). Later, deep networks proved to be able\n\n2\n\n\fto learn transferable representations [45]. To explore potential factors affecting deep transfer learning\nperformance, Huh et al.[12] empirically analyzed features extracted by various networks pre-trained\non ImageNet. Recently, numerous approaches were proposed to advance this \ufb01eld, including \ufb01lter\ndistribution constraining [1], sparse transfer [22], and \ufb01lter subset selection [8, 4]. Further, Simon et\nal. [15] empirically studied what factors impact inductive transfer of deep models.\nCatastrophic forgetting is an inevitable problem of incremental learning or lifelong learning [36]. To\novercome this limitation, incremental moment matching [18] and \u201chard attention to the task\u201d [32]\nhave been proposed. In inductive transfer learning, the pre-trained networks also have the tendency to\nlose previous learnt knowledge abruptly while incorporating information relevant to target tasks. By\ndriving weight parameters to initial pre-trained values, L2-SP [20] enhances model performance for\ntarget tasks while avoiding degradation in accuracy on pre-trained datasets. Inspired by knowledge\ndistillation for model compression [29, 11, 46, 43], Li et al. [19] proposed the idea of \u201cunactivated\nchannel re-usage\u201d and presented DELTA, a feature map regularization with attention.\nAbove methods have achieved remarkable performance gains and alleviated catastrophic forgetting to\nvarying degrees. However, negative transfer, a major challenge in domain adaptation [31, 34, 38, 40,\n41], has rarely been considered in inductive transfer learning. In this paper, from the perspective of\ninhibiting negative transfer during \ufb01ne-tuning, we propose Batch Spectral Shrinkage (BSS), a novel\nregularization approach orthogonal to existing methods, to enhance \ufb01ne-tuned models\u2019 performance.\n\n3 Catastrophic Forgetting Meets Negative Transfer\n\nIn inductive transfer learning (\ufb01ne-tuning), we have access to a target domain with n labeled examples\nand a network pre-trained on a source domain. Different from domain adaptation [26], in \ufb01ne-tuning\nthe source domain is inaccessible at training. For classi\ufb01cation tasks, typically, the network consists\nof two parts: the shared sub-network (feature extractor F ) and the task-speci\ufb01c architecture (classi\ufb01er\nG). We denote by F 0 and G0 the corresponding parts with pre-trained weights respectively.\nThere are two potential pitfalls inductive transfer learning may have. The \ufb01rst one is catastrophic\nforgetting, which refers to a tendency of the model to abruptly forget previously learnt knowledge\nupon acquiring new knowledge. The second is negative transfer, a process where the model transfers\nknowledge irrelevant to target tasks, and leads to negative impacts on model performance. Almost all\nexisting deep methods concentrate on the former. It is natural to raise the following questions: 1)\nDoes negative transfer really exist in \ufb01ne-tuning? 2) If it does, how does it affect model performance?\n\n3.1 Regularizations for Transfer Learning\n\nn(cid:88)\n\nWe \ufb01rst review existing inductive transfer learning methods. Almost all \ufb01ne-tuning methods can be\nformulated as follows:\n\n(1)\nwhere W refers to the weight parameters of models, L(\u00b7,\u00b7) denotes the loss function and \u2126(\u00b7) is the\nregularization term on the weights or on the features extracted by the model. Next we will discuss\nthree \ufb01ne-tuning penalties and their corresponding effects on mitigating catastrophic forgetting.\n\nmin\nW\n\ni=1\n\nL(G(F (xi)), yi) + \u2126(\u00b7),\n\nL2 penalty. The common penalty for transfer learning is L2 penalty, also known as weight decay:\n\n\u2126(W) =\n\n(cid:107)W(cid:107)2\n2 ,\n\n\u03b1\n2\n\n(2)\n\nwhere \u03b1 is a hyperparameter to control the strength of this regularization term. L2 penalty tries to\ndrive the network parameters to zero, without considering catastrophic forgetting or negative transfer.\n\nL2-SP. The key concept of L2-SP penalty [19] is \u201cstarting point as reference\u201d:\n\n(cid:13)(cid:13)WS \u2212 W0\n\nS\n\n(cid:13)(cid:13)2\n\n\u2126(W) = \u2126(W, W0) =\n\n\u03b2\n2\n\n2 +\n\n(cid:107)WS(cid:107)2\n2 ,\n\n\u03b1\n2\n\n(3)\n\nwhere W0\nS is the pre-trained weight parameters of the shared architecture (feature extractor F0), WS\nis weight parameters of F , WS is weight parameters of the task-speci\ufb01c classi\ufb01er G, \u03b2 is a trade-off\n\n3\n\n\fhyperparameter to control the strength of the penalty. L2-SP penalty tries to drive weight parameters\nto pre-trained values. Xuhong et al. [20] empirically proved that L2-SP reduces drop in accuracy of\nnetworks on source tasks after \ufb01ne-tuning, revealing that L2-SP can alleviate catastrophic forgetting.\n\nDELTA. Based on the key insight of \u201cunactivated channel re-usage\u201d, Li et al. [19] proposed a\nregularized transfer learning framework, DELTA. Speci\ufb01cally, DELTA selects the discriminative\nfeatures from higher layer outputs with a supervised attention mechanism. \u2126(W) is formulated as:\n\n\u2126(W) = \u2126(W, W0, xi, yi, z) = \u03b3 \u00b7 \u2126(cid:48)(W, W0, xi, yi, z) + \u03ba \u00b7 \u2126(cid:48)(cid:48)(W\\W0)\n\u2126(cid:48)(W, W0, xi, yi, z) =\n\nDj(z, W0, xi, yi) \u00b7(cid:13)(cid:13)(cid:13)FMj(z, W, xi) \u2212 FMj(z, W0, xi)\n\nN(cid:88)\n\n(cid:13)(cid:13)(cid:13)2\n\n2\n\n(4)\n\nj=1\n\nDj(z, W0, xi, yi) = softmax(L(z(xi, W0\\j), yi) \u2212 L(z(xi, W0), yi))\n\nwhere z is the model, \u2126(cid:48) is behavioral regularizer, \u2126(cid:48)(cid:48) constrains the L2-norm of the private parameters\nin W; Dj(z, W0, xi, yi) refers to the behavioral difference between the two feature maps (FM) and\nthe weight assigned to the jth \ufb01lter and the ith image (for 1 < j < N); \u03b3 and \u03ba are trade-off hyper-\nparameters to control the strength of the two regularization terms. DELTA alleviates catastrophic\nforgetting by aligning the behaviors of certain higher layers of the target network to the source one.\n\n3.2 Negative Transfer in Fine-tuning\n\nIn this section, we will investigate whether negative transfer exists and whether it has a negative\nimpact on the model\u2019s performance. We design an experiment based on L2 penalty and L2-SP penalty.\nResNet-50 [10] pre-trained on ImageNet is chosen as the backbone and MIT Indoors 67 [28] is the\ntarget dataset. The training details are consistent with Section 5. We sample the training datasets at\nthe rates of 15%, 30%, 50% and 100% to construct new training datasets of different sizes.\nContrary to what one might suppose, as shown in Figure 1(a), L2-SP penalty worsens the model\u2019s\nperformance, compared with L2 penalty, especially when the amount of training data is limited. L2-\nSP penalty explicitly promotes the similarity of the \ufb01nal solution with the initial model to alleviate\ncatastrophic forgetting, while L2 does not. Although only the behaviors of certain higher layers of\nthe target network are aligned to the source one, L2-SP still aggravates negative transfer, in that the\npre-trained knowledge irrelevant to the target tasks is still transferred forcefully.\nAs negative transfer does exist, further, we want to answer two questions: 1) Which part of weight\nparameters and feature representations causes negative transfer? 2) How to mitigate this problem?\n\n3.3 Why Negative Transfer?\n\nIn this section, we will explore which part of the weight parameters W and feature representations\nf = F (x) may not be transferable and may negatively impact the model accuracy. ResNet-50 [10]\npre-trained on ImageNet is chosen as the backbone and MIT Indoors 67 is the target dataset. Weight\nparameters and feature representations of both pre-trained and \ufb01ne-tuned networks are analyzed.\n\nCorresponding Angle. Principal angles [24] have been introduced to measure the similarity of\nsubspaces. However, it is unreasonable to calculate the principal angles by completing the pairing\nbetween whole eigenvectors in subspaces with the smallest angle, regardless of their relative singular\nvalues, because eigenvectors with large singular values and small singular values have different roles\nin matrices. Inspired by [3], we use corresponding angles, denoted by \u03b8. De\ufb01nitions are as follows:\n\nDe\ufb01nition 1 (Corresponding Angle) It is the angle between two eigenvectors which are equally\nimportant in their matrices. That is, they are related to the same index in the singular value matrices.\n\nThe cosine value of the corresponding angle is calculated as\n(cid:104)u1,i, u2,i(cid:105)\n(cid:107)u1,i(cid:107)(cid:107)u2,i(cid:107) ,\n\ncos(\u03b8i) =\n\n(5)\n\nwhere u1,i is the ith eigenvector with the ith largest singular value in one matrix, and similarly for\nu2,i in another matrix. We will use \u03b8 to measure the transferability of eigenvectors in weight matrices.\nIntuitively, eigenvectors with smaller corresponding angle across domains imply better transferability.\n\n4\n\n\f(a) error rate\n\n(b) cos(\u03b8)\n\n(c) singular value \u03c3\n\n(d) smaller half of \u03c3\n\nFigure 1: Analysis of negative transfer: (a) Error rates of \ufb01ne-tuned models with L2 and L2-SP\npenalties; (b) Cosine values of the corresponding angles between W and W0; (c) All singular values\nof feature matrices extracted on four con\ufb01gurations for the dataset MIT Indoors 67, with random\nsampling rates 15%, 30%, 50% and 100% respectively; (d) The smaller half of singular values in (c).\n\nWeights. We denote by W0 and W the pre-trained weight parameters of ResNet-50 on ImageNet\nand \ufb01ne-tuned weight parameters on MIT Indoors 67 respectively. For a conv2d layer, its parameters\nform a four-dimensional tensor with the shape of (ci+1, ci, kh, kw). We unfold this tensor to a matrix\nwith the shape (ci+1, ci \u00b7 kh \u00b7 kw) and perform SVD to obtain eigenvectors U and singular values \u03a3:\n(6)\nThen, following Equation (5), relative angles \u03b8 are calculated in every layers between W and W0.\nCorresponding angles in four lower layers (the \ufb01rst convolutional layer and three convolutional layers\nin the \ufb01rst residual block) and three higher layers (three convolutional layers in the last residual block)\nare shown in Figure 1(b), the former with solid lines and the latter with dotted lines. We can observe\nthat for the lower layers, eigenvectors in W and W0 have small relative angles, which means these\nweight parameters are transferable. However, in the higher layers, only eigenvectors corresponding to\nrelatively larger singular values have small corresponding angles. So aligning all weight parameters\nindiscriminately to the initial pre-trained values is risky to negative transfer.\n\nW = U\u03a3VT.\n\nFeatures. Analyzing feature representations, rather than weight parameters, is more straightforward.\nWe will analyze the characteristics of feature representations produced by models with different\ngeneralization performance. As the size of training dataset has a profound impact on model per-\nformance, we sample MIT Indoors 67 at the rates of 15%, 30%, 50% and 100% to construct new\ntraining datasets. We \ufb01ne-tune ImageNet pre-trained ResNet-50 on these four datasets and then\nobtain four models.\nThe feature extractor \ufb01ne-tuned on target datasets is denoted by F and the feature vector is calculated\nby fi = F (xi). Every feature matrix F = [f1 . . . fb] is composed of a batch size b of feature vectors.\nAgain, we apply SVD to compute all singular eigenvectors U and values \u03a3 of the feature matrices:\n(7)\nThe main diagonal elements [\u03c31, \u03c32..., \u03c3b] of the singular value matrix \u03a3 (a rectangular diagonal\nmatrix) are drawn in Figure 1(c) and Figure 1(d) in descending order, measuring the importance of\neigenvectors. Figure 1(c) contains all these singular values, and Figure 1(d) contains the smaller half\nof them. As justi\ufb01ed by [9], with suf\ufb01cient labeled data, \ufb01ne-tuning and training from scratch achieve\ncomparably best results. Hence models \ufb01ne-tuned on larger datasets can have stronger generalization\nperformance. It is important to observe that the relatively small singular values of features extracted\nby such models are suppressed signi\ufb01cantly, indicating that the spectral components corresponding to\nrelatively small singular values are relevant to the variation of training data that are less transferable.\nConsequently, promoting the similarity between these components will give rise to negative transfer.\n\nF = U\u03a3VT.\n\n4 Approach\n\nWe stress that catastrophic forgetting and negative transfer are equally important and constitute an\ninherent dilemma for \ufb01ne-tuning. While the previous section focuses on why negative transfer occurs,\nthis section presents how to alleviate negative transfer without casting aside pre-trained knowledge.\n\n5\n\n15%30%50%100%0.200.250.300.350.40Error RateL2L2-SP0510152025303540Index0.00.20.40.60.81.0Corresponding anglescosine values of corresponding anglesconv1layer0.0.conv1layer0.0.conv2layer0.0.conv3layer4.2.conv1layer4.2.conv2layer4.2.conv301020304050Index4681012141618Singular ValuesSingular Values15%30%50%100%253035404550Index345678Singular ValuesSingular Values15%30%50%100%\fFigure 2: The architecture of Batch Spectral Shrinkage (BSS). BSS is a new regularization approach\nto overcoming negative transfer in \ufb01ne-tuning, which is readily pluggable into existing methods and is\nend-to-end trainable with differentiable SVD natively supported in PyTorch (best viewed in color).\n\nThe analysis above shows that both weight parameters and feature representations are partially\ntransferable. For weight parameters, almost all eigenvectors in lower layers are transferable, while in\nhigher layers only eigenvectors with large singular values are transferable. For feature representations,\nan expanded dataset can enhance the performance of models and suppress the eigenvectors with\nsmall singular values of the feature matrices. This inspires us to suppress the importance of spectral\ncomponents that are untransferable, especially when the number of training data examples is limited.\nAs applying SVD to high-dimensional weight matrices is extremely costly, for untransferable layers\nwith huge weight parameters, we perform spectral component shrinkage on the feature matrices only.\n\n4.1 Batch Spectral Shrinkage\n\nThe above decomposition analysis of feature matrices brings us the key inspiration. We propose a\nnew regularization approach, Batch Spectral Shrinkage (BSS), to restrain negative transfer during\n\ufb01ne-tuning through directly suppressing the small singular values of the feature matrices. Detailed\nprocedures are as follows: 1) Constructing a feature matrix F from a batch size b of feature vectors f;\n2) Applying SVD to compute all singular values of F as Equation (7); 3) Penalizing the smallest k\nsingular values [\u03c31, \u03c32..., \u03c3b] in the diagonal of singular value matrix \u03a3 to mitigate negative transfer:\n\nLbss(F ) = \u03b7\n\n\u03c32\u2212i,\n\n(8)\n\ni=1\n\nwhere \u03b7 is a trade-off hyperparameter to control the strength of spectral shrinkage, k is the number of\nsingular values to be penalized, and \u03c3\u2212i refers to the i-th smallest singular value.\nComputational Complexity. For a p \u00d7 q matrix, the time complexity of full SVD that computes\nall singular values is O(min(p2q, pq2)). The time cost of performing SVD on a nearly squared matrix\nis unacceptable, e.g. weight matrices of deep networks. The complexity of BSS is O(b2d) where d is\nthe dimension of features and b is the batch size. Typically, as b is relatively small, say b = 48, the\noverall computational budget of BSS is nearly negligible in \ufb01ne-tuning through the mini-batch SGD.\n\n4.2 Models with Batch Spectral Shrinkage\n\nAlmost all of existing \ufb01ne-tuning methods concentrate on catastrophic forgetting. BSS, as a novel\nregularization approach we propose from another perspective, boosts \ufb01ne-tuning through inhibiting\nnegative transfer, making itself orthogonal to previous methods. BSS is lightweight and pluggable\nreadily into existing \ufb01ne-tuning methods, e.g. L2, L2-SP [20] and DELTA [19]. Figure 2 showcases\nthe architecture of L2+BSS. BSS embedded into existing \ufb01ne-tuning scenarios can be formulated as:\n\nk(cid:88)\n\nn(cid:88)\n\nmin\nW\n\ni=1\n\n6\n\nL(G(F (xi)), yi) + \u2126(W) + Lbss(F ).\n\n(9)\n\ncross-entropyLSVDbatch\ud835\udc9a\"\ud835\udc99FfFBSSLbssG\f5 Experiments\n\nWe embed BSS into representative inductive transfer learning methods mentioned above, including L2,\nL2-SP and DELTA, and evaluate these methods on several visual recognition benchmarks. Except\nthat, BSS is also explored in other scenarios, such as incremental learning and natural language\nprocessing. Code and datasets are available at github.com/thuml/Batch-Spectral-Shrinkage.\n\n5.1 Setup\n\nStanford Dogs [13] contains 20,580 images of 120 breeds of dogs from around the world. Each\ncategory is composed of exactly 100 training examples and around 72 testing examples.\nOxford-IIIT Pet [27] is a 37-category pet dataset with roughly 200 images for each class.\nCUB-200-2011 [42] is a dataset for \ufb01ne-grained visual recognition with 11,788 images in 200 bird\nspecies. It is an extended version of the CUB-200 dataset, roughly doubling the number of images.\nStanford Cars [16] contains 16,185 images of 196 classes of cars. Each category has been split\nroughly in a 50-50 split. There are 8,144 images for training and 8,041 images for testing.\nFGVC Aircraft [23] is a benchmark for the \ufb01ne-grained visual categorization of aircraft. The dataset\ncontains 10,200 images of aircraft, with 100 images for each of the 102 different aircraft variants.\nTo explore the impact of negative transfer with different numbers of training examples, we create\nfour con\ufb01gurations for each dataset, which respectively have 15%, 30%, 50%, and 100% randomly\nsampled training examples for each category. Following the previous protocols [20, 19], we employ\nResNet-50 [10] pre-trained on ImageNet [5] as the source model. The last fully connected layer is\ntrained from scratch, with learning rate set to be 10 times those of the \ufb01ne-tuned layers, which is\na de facto con\ufb01guration in \ufb01ne-tuning. We adopt mini-batch SGD with momentum of 0.95 using\nthe progressive training strategies in [20] except that the initial learning rate for the last layer is set\nto 0.01 or 0.001, depending on the tasks. We set batch size to 48. In all experiments with BSS, the\ntrade-off hyperparameter \u03b7 is \ufb01xed to 0.001 and k is set to 1. Each experiment is repeated \ufb01ve times,\nand the average top-1 accuracy is reported in Table 2.\n\n5.2 Results and Analyses\n\nResults. The top-1 classi\ufb01cation accuracies are shown in Table 2. It is observed that BSS produces\nboosts in accuracy with fewer training data for most methods on most datasets. However, performance\ngains on Stanford Dogs and Oxford-IIIT Pet are not very obvious, indicating that the transferability of\npre-trained knowledge across these datasets plays a major role and thus negative transfer impact is not\nas serious as expected. Embedding BSS into L2-SP and DELTA, L2-SP+BSS and DELTA+BSS\nalleviate negative transfer and catastrophic forgetting simultaneously to yield state-of-the-art results.\nNegative Transfer. To delve into BSS, we remove the spectral components corresponding to the\nsmallest r singular values, named Truncated SVD (TSVD). Formally, SVD is performed on mini-\nbatch feature matrix F, yielding b singular vectors and values. Then only the b \u2212 r column vectors of\nU and b\u2212 r row vectors of VT corresponding to the b\u2212 r largest singular values \u03a3b\u2212r are calculated.\nFinally, the rest of the matrix F is discarded, with an approximate feature matrix Fb\u2212r reconstructed:\n(10)\nResNet-50 pre-trained on ImageNet is employed as the base model and Stanford Dogs is the target\ndataset. We analyze the performance of TSVD (r = 1, 2, 4, 8) with L2 penalty. Results are shown in\nFigure 3(a). We \ufb01nd that when the dataset is relatively small, TSVD with a larger r leads to better\nperformance, which proves that spectral components corresponding to relatively small singular values\nhave negative impact on transfer learning. Thus, BSS is a reasonable approach to inhibiting negative\ntransfer. However, when suf\ufb01cient training data is available, a larger r may deteriorate the accuracy.\nSingular Values. Singular values of features extracted by the networks \ufb01ne-tuned with regulariza-\ntion L2+BSS and L2 are shown in Figure 3(b)\u20133(c). The former is with dotted line and the latter is\nwith solid line. Although k in Equation (8) is set to 1, more than one singular values are suppressed,\nindicating that feature matrices are capable of automatically adjusting singular value distributions.\nk = 1 is adequate for most cases, and a larger k may display equal effect with a larger trade-off\nhyperparameter \u03b7. BSS is effective in suppressing small singular values to combat negative transfer.\n\nF = U\u03a3VT, Fb\u2212r = Ub\u2212r\u03a3b\u2212rVT\n\nb\u2212r.\n\n7\n\n\fTable 2: Comparison of Top-1 Accuracy with Different Methods (Backbone: ResNet-50)\n\nDataset\n\nMethod\n\nStanford Dogs\n\nCUB-200-2011\n\nStanford Cars\n\nOxford-IIIT Pet\n\nFGVC Aircraft\n\nL2\n\nL2+BSS\nL2-SP [20]\nL2-SP+BSS\nDELTA [19]\nDELTA+BSS\n\nL2\n\nL2+BSS\nL2-SP [20]\nL2-SP+BSS\nDELTA [19]\nDELTA+BSS\n\nL2\n\nL2+BSS\nL2-SP [20]\nL2-SP+BSS\nDELTA [19]\nDELTA+BSS\n\nL2\n\nL2+BSS\nL2-SP [20]\nL2-SP+BSS\nDELTA [19]\nDELTA+BSS\n\nL2\n\nL2+BSS\nL2-SP [20]\nL2-SP+BSS\nDELTA [19]\nDELTA+BSS\n\nSampling Rates\n\n15%\n\n81.05\u00b10.18\n81.86\u00b10.19\n81.41\u00b10.23\n82.20\u00b10.27\n81.46\u00b10.18\n81.93\u00b10.29\n45.25\u00b10.12\n47.74\u00b10.23\n45.08\u00b10.19\n46.77\u00b10.19\n46.83\u00b10.21\n49.77\u00b10.07\n36.77\u00b10.12\n40.57\u00b10.12\n36.10\u00b10.30\n39.44\u00b10.18\n39.37\u00b10.34\n41.92\u00b10.16\n86.56\u00b10.21\n87.57\u00b10.13\n86.78\u00b10.21\n87.53\u00b10.36\n87.17\u00b10.23\n87.30\u00b10.23\n39.57\u00b10.20\n40.41\u00b10.12\n39.27\u00b10.24\n40.02\u00b10.15\n42.16\u00b10.21\n43.79\u00b10.19\n\n30%\n\n84.47\u00b10.23\n84.79\u00b10.18\n84.88\u00b10.15\n85.06\u00b10.17\n83.66\u00b10.29\n84.33\u00b10.16\n59.68\u00b10.21\n63.38\u00b10.29\n57.78\u00b10.24\n60.89\u00b10.28\n60.37\u00b10.25\n62.95\u00b10.18\n60.63\u00b10.18\n64.13\u00b10.18\n60.30\u00b10.28\n64.41\u00b10.19\n63.28\u00b10.27\n64.67\u00b10.28\n89.99\u00b10.35\n90.46\u00b10.21\n90.00\u00b10.23\n90.13\u00b10.21\n89.95\u00b10.25\n90.44\u00b10.12\n57.46\u00b10.12\n59.23\u00b10.31\n57.12\u00b10.27\n58.78\u00b10.26\n58.60\u00b10.29\n61.58\u00b10.17\n\n50%\n\n85.69\u00b10.21\n86.00\u00b10.22\n85.99\u00b10.18\n86.18\u00b10.05\n84.73\u00b10.16\n85.30\u00b10.30\n70.12\u00b10.29\n72.56\u00b10.17\n69.47\u00b10.29\n72.33\u00b10.26\n71.38\u00b10.20\n72.31\u00b10.38\n75.10\u00b10.21\n76.78\u00b10.21\n75.48\u00b10.22\n76.56\u00b10.28\n76.53\u00b10.24\n77.58\u00b10.33\n91.22\u00b10.19\n92.07\u00b10.29\n90.65\u00b10.18\n91.03\u00b10.09\n91.17\u00b10.19\n91.70\u00b10.30\n67.93\u00b10.28\n69.19\u00b10.13\n67.46\u00b10.26\n68.96\u00b10.21\n68.51\u00b10.25\n69.46\u00b10.29\n\n100%\n\n86.89\u00b10.32\n87.18\u00b10.14\n86.72\u00b10.20\n86.91\u00b10.19\n86.01\u00b10.22\n86.54\u00b10.14\n78.01\u00b10.16\n78.85\u00b10.31\n78.44\u00b10.17\n79.36\u00b10.12\n78.63\u00b10.18\n79.02\u00b10.21\n87.20\u00b10.19\n87.63\u00b10.27\n86.58\u00b10.26\n87.38\u00b10.23\n86.32\u00b10.20\n86.32\u00b10.25\n92.75\u00b10.25\n93.30\u00b10.14\n92.29\u00b10.22\n92.41\u00b10.18\n92.29\u00b10.12\n92.62\u00b10.27\n81.13\u00b10.21\n81.48\u00b10.18\n80.98\u00b10.29\n81.27\u00b10.31\n80.44\u00b10.20\n80.85\u00b10.17\n\nSensitivity Analysis. Sensitivity analysis of a larger k in Equation (8) is conducted on the Stanford\nDogs dataset, with results shown in Figure 3(d). When the amount of training data examples is small,\na larger k enhances the performance of \ufb01ne-tuned models. However, with relatively suf\ufb01cient training\ndata examples, a larger k leads to a slight decline in classi\ufb01cation accuracy. Thus k = 1 is generally\na good choice, since we always have dif\ufb01culty in determining the relative size of the training dataset.\n\n5.3 More Scenarios\n\nIncremental Learning. The \ufb01ne-tuning step is a special case of incremental learning that has only\none additional stage. Though source task is not considered by BSS, it is interesting to \ufb01nd how BSS\nin\ufb02uences the performance of incremental learning methods. We evaluate BSS embeded with EWC\n[14] on the permuted MNIST dataset. For this task, we use the same training strategies of [14] and\ntest the accuracy of both the source task and target task. Top-1 classi\ufb01cation accuracies are shown in\nTable 3. It is observed that BSS promotes the target task while slightly hurts the source task. This is\nan intuitive and reasonable result because BSS tries to alleviate the risk of negative transfer and does\nnot focus on remembering the previously-learnt knowledge of the source task.\n\n8\n\n\f(a) TSVD\n\n(b) \u03c3\n\n(c) smaller half of \u03c3\n\n(d) sensitivity analysis\n\nFigure 3: Analysis of TSVD, singular values and hyperparameter sensitivity: (a) Error rate of TSVD\nwith different r; (b) All singular values of feature matrices in four con\ufb01gurations for Stanford Dogs,\nwhich have random sampling rates 15%, 30%, 50% and 100% respectively, either with (w/) BSS and\nwithout (w/o) BSS; (c) Smaller half of singular values in (b); (d) Sensitivity analysis of different k.\n\nTable 3: BSS Embedded into EWC for Incremental Learning\n\nMethod (incremental learning)\n\n\ufb01ne-tuning + EWC [14]\n\n\ufb01ne-tuning + EWC [14] + BSS\n\ntask A\n96.60\n96.46\n\ntask B\n97.42\n98.04\n\nAvg\n97.01\n97.25\n\nTable 4: BSS Embedded into BERT for Nature Language Processing\n\nMethod (text classi\ufb01cation) MNLI-m\n\nBERTbase [6]\n\nBERTbase [6] + BSS\n\n84.4\n85.0\n\nQNLI MRPC\n86.7\n88.4\n89.6\n87.9\n\nSST-2\n92.7\n93.2\n\nAvg\n88.0\n88.9\n\nNatural Language Processing. Fine-tuning is an important technique to transfer knowledge from\nother sources or pre-trained models. Its effectiveness in visual recognition applications is shown\nin section 5.2. We further justify its power in natural language processing. The General Language\nUnderstanding Evaluation (GLUE) benchmark [39] is a collection of diverse natural language\nunderstanding tasks. MNLI-m, QNLI, MRPC and SST-2 in GLUE [39] are used to evaluate the effect\nof BSS. Considering that BERT [6] is a state-of-the-art NLP pre-trained model, we embed BSS into\nBERTbase. We use a batch size of 32 and \ufb01ne-tune for 3 epochs over the data for these four tasks.\nFor learning rate, we use the same strategies as [6]. Results on four tasks in the Dev sets are listed in\nTable 4. From the Table we \ufb01nd that BSS can also help \ufb01ne-tuning in natural language processing.\n\n6 Conclusion\n\nIn this paper, we studied \ufb01ne-tuning of deep models pre-trained on source tasks to substantially\ndifferent target tasks. We delved into this widely-successful inductive transfer learning scenario from\na new perspective: negative transfer. While existing deep methods mainly focus on alleviating the\nproblem of catastrophic forgetting for reusing pre-trained knowledge, we \ufb01nd that not all weight\nparameters or feature matrices are transferable and some spectral components in them are detrimental\nto the target tasks, especially with limited training data. Based on this observation, Batch Spectral\nShrinkage (BSS), a regularization approach based on spectral analysis of feature representations,\nis proposed to actively inhibit untransferable spectral components. BSS is pluggable into existing\n\ufb01ne-tuning methods and yields signi\ufb01cant performance gains. We expect that BSS will shed light\ninto potential future directions for safe transfer learning towards making inductive transfer never hurt.\n\nAcknowledgments\n\nWe thank Dr. Yuchen Zhang at Tsinghua University for helpful discussions. This work was supported\nby the National Key R&D Program of China (2017YFC1502003) and Natural Science Foundation of\nChina (61772299 and 71690231).\n\n9\n\n15%30%50%100%0.130.140.150.160.170.180.190.20Error Rate of Truncated SVDoriginTSVD,r=1TSVD,r=2TSVD,r=4TSVD,r=801020304050Index5101520253035Singular ValuesSingular Valuesw/o BSS:15%w/o BSS:30%w/o BSS:50%w/o BSS:100%BSS:15%BSS:30%BSS:50%BSS:100%253035404550Index46810121416Singular ValuesSingular Valuesw/o BSS:15%w/o BSS:30%w/o BSS:50%w/o BSS:100%BSS:15%BSS:30%BSS:50%BSS:100%15%30%50%100%0.120.130.140.150.160.170.180.19Error Rate (sensitivity analysis)k=1k=2k=4k=8\fReferences\n\n[1] M. Aygun, Y. Aytar, and H. Kemal Ekenel. Exploiting convolution \ufb01lter patterns for transfer\n\nlearning. In International Conference on Computer Vision (ICCV), pages 2674\u20132680, 2017.\n\n[2] R. Caruana. Multitask learning. Machine learning, 28(1):41\u201375, 1997.\n[3] X. Chen, S. Wang, M. Long, and J. Wang. Transferability vs. discriminability: Batch spectral\nIn International Conference on Machine\n\npenalization for adversarial domain adaptation.\nLearning (ICML), pages 1081\u20131090, 2019.\n\n[4] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. Large scale \ufb01ne-grained categorization\nand domain-speci\ufb01c transfer learning. In IEEE conference on computer vision and pattern\nrecognition (CVPR), pages 4109\u20134118, 2018.\n\n[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical\nimage database. In IEEE conference on computer vision and pattern recognition (CVPR), pages\n248\u2013255. Ieee, 2009.\n\n[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional\ntransformers for language understanding. In Annual Conference of the North American Chapter\nof the Association for Computational Linguistics (NAACL), 2019.\n\n[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep\nconvolutional activation feature for generic visual recognition. In International conference on\nmachine learning (ICML), pages 647\u2013655, 2014.\n\n[8] W. Ge and Y. Yu. Borrowing treasures from the wealthy: Deep transfer learning through\nselective joint \ufb01ne-tuning. In IEEE conference on computer vision and pattern recognition\n(CVPR), pages 1086\u20131095, 2017.\n\n[9] K. He, R. Girshick, and P. Doll\u00e1r. Rethinking imagenet pre-training. International Conference\n\non Computer Vision (ICCV), 2019.\n\n[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE\n\nconference on computer vision and pattern recognition (CVPR), pages 770\u2013778, 2016.\n\n[11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint\n\narXiv:1503.02531, 2015.\n\n[12] M. Huh, P. Agrawal, and A. A. Efros. What makes imagenet good for transfer learning? arXiv\n\npreprint arXiv:1608.08614, 2016.\n\n[13] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for \ufb01ne-grained image\ncategorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nColorado Springs, CO, June 2011.\n\n[14] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan,\nJ. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural\nnetworks. Proceedings of the national academy of sciences (PNAS), 114(13):3521\u20133526, 2017.\nIEEE\n\n[15] S. Kornblith, J. Shlens, and Q. V. Le. Do better imagenet models transfer better?\n\nConference on Computer Vision and Pattern Recognition (CVPR), 2019.\n\n[16] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for \ufb01ne-grained\ncategorization. In 4th International IEEE Workshop on 3D Representation and Recognition\n(3dRR-13), Sydney, Australia, 2013.\n\n[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classi\ufb01cation with deep convolutional\nneural networks. In Advances in neural information processing systems (NeurIPS), pages\n1097\u20131105, 2012.\n\n[18] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting\nIn Advances in neural information processing systems\n\nby incremental moment matching.\n(NeurIPS), pages 4652\u20134662, 2017.\n\n[19] X. Li, H. Xiong, H. Wang, Y. Rao, L. Liu, and J. Huan. Delta: Deep learning transfer using\nfeature map with attention for convolutional networks. In International Conference on Learning\nRepresentations (ICLR), 2019.\n\n[20] X. Li, G. Yves, and D. Franck. Explicit inductive bias for transfer learning with convolutional\nnetworks. In International Conference on Machine Learning (ICML), pages 2830\u20132839, 2018.\n[21] Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and\n\nmachine intelligence (TPAMI), 40(12):2935\u20132947, 2018.\n\n[22] J. Liu, Y. Wang, and Y. Qiao. Sparse deep transfer learning for convolutional neural network.\n\nIn AAAI Conference on Arti\ufb01cial Intelligence (AAAI), 2017.\n\n10\n\n\f[23] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classi\ufb01cation\n\n[24] J. Miao and A. Ben-Israel. On principal angles between subspaces in rn. Linear algebra and its\n\nof aircraft. Technical report, 2013.\n\napplications, 171:81\u201398, 1992.\n\n[25] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image\nrepresentations using convolutional neural networks. In IEEE conference on computer vision\nand pattern recognition (CVPR), pages 1717\u20131724, 2014.\n\n[26] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and\n\ndata engineering (TKDE), 22(10):1345\u20131359, 2009.\n\n[27] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar. Cats and dogs. In IEEE conference on\n\ncomputer vision and pattern recognition (CVPR), pages 3498\u20133505. IEEE, 2012.\n\n[28] A. Quattoni and A. Torralba. Recognizing indoor scenes. In IEEE Conference on Computer\n\nVision and Pattern Recognition (CVPR), pages 413\u2013420. IEEE, 2009.\n\n[29] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for\n\nthin deep nets. arXiv preprint arXiv:1412.6550, 2014.\n\n[30] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu,\nR. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671,\n2016.\n\n[31] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains.\n\nIn European conference on computer vision (ECCV), pages 213\u2013226. Springer, 2010.\n\n[32] J. Serra, D. Suris, M. Miron, and A. Karatzoglou. Overcoming catastrophic forgetting with\nhard attention to the task. In International Conference on Machine Learning (ICML), pages\n4555\u20134564, 2018.\n\n[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image\n\nrecognition. In International Conference on Learning Representations (ICLR), 2015.\n\n[34] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In\n\nEuropean Conference on Computer Vision (ECCV), pages 443\u2013450. Springer, 2016.\n\n[35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception archi-\ntecture for computer vision. In IEEE conference on computer vision and pattern recognition\n(CVPR), pages 2818\u20132826, 2016.\n\n[36] S. Thrun. A lifelong learning perspective for mobile robot control. In International Conference\n\non Intelligent Robots and Systems (IROS), pages 201\u2013214. Elsevier, 1995.\n\n[37] L. Torrey and J. Shavlik. Transfer learning. In Handbook of research on machine learning\napplications and trends: algorithms, methods, and techniques, pages 242\u2013264. IGI Global,\n2010.\n\n[38] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation.\nIn IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167\u20137176,\n2017.\n\n[39] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task\nbenchmark and analysis platform for natural language understanding. Conference on Empirical\nMethods in Natural Language Processing (EMNLP), page 353, 2018.\n\n[40] X. Wang, L. Li, W. Ye, M. Long, and J. Wang. Transferable attention for domain adaptation. In\n\nAAAI Conference on Arti\ufb01cial Intelligence (AAAI), 2019.\n\n[41] Z. Wang, Z. Dai, B. P\u00f3czos, and J. Carbonell. Characterizing and avoiding negative transfer.\n\nIEEE conference on computer vision and pattern recognition (CVPR), 2019.\n\n[42] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD\n\nBirds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.\n\n[43] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization,\nnetwork minimization and transfer learning. In IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), pages 4133\u20134141, 2017.\n\n[44] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable\n\nnetworks. In International Conference on Learning Representations (ICLR), 2018.\n\n[45] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural\n\nnetworks? In Advances in Neural Information Processing Systems (NeurIPS), 2014.\n\n[46] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance\nof convolutional neural networks via attention transfer. International Conference on Learning\nRepresentations (ICLR), 2017.\n\n11\n\n\f", "award": [], "sourceid": 1101, "authors": [{"given_name": "Xinyang", "family_name": "Chen", "institution": "Tsinghua University"}, {"given_name": "Sinan", "family_name": "Wang", "institution": "Tsinghua University"}, {"given_name": "Bo", "family_name": "Fu", "institution": "Tsinghua University"}, {"given_name": "Mingsheng", "family_name": "Long", "institution": "Tsinghua University"}, {"given_name": "Jianmin", "family_name": "Wang", "institution": "Tsinghua University"}]}