{"title": "Transfusion: Understanding Transfer Learning for Medical Imaging", "book": "Advances in Neural Information Processing Systems", "page_first": 3347, "page_last": 3357, "abstract": "Transfer learning from natural image datasets, particularly ImageNet, using standard large models and corresponding pretrained weights has become a de-facto method for deep learning applications to medical imaging. \nHowever, there are fundamental differences in data sizes, features and task specifications between natural image classification and the target medical tasks, and there is little understanding of the effects of transfer. In this paper, we explore properties of transfer learning for medical imaging. A performance evaluation on two large scale medical imaging tasks shows that surprisingly, transfer offers little benefit to performance, and simple, lightweight models can perform comparably to ImageNet architectures. Investigating the learned representations and features, we find that some of the differences from transfer learning are due to the over-parametrization of standard models rather than sophisticated feature reuse. We isolate where useful feature reuse occurs, and outline the implications for more efficient model exploration. We also explore feature independent benefits of transfer arising from weight scalings.", "full_text": "Transfusion: Understanding Transfer Learning for\n\nMedical Imaging\n\nMaithra Raghu\u2217\n\nCornell University and Google Brain\n\nmaithrar@gmail.com\n\nJon Kleinberg\u2020\nCornell University\n\nkleinber@cs.cornell.edu\n\nAbstract\n\nChiyuan Zhang\u2217\nGoogle Brain\n\nchiyuan@google.com\n\nSamy Bengio\u2020\nGoogle Brain\n\nbengio@google.com\n\nTransfer learning from natural image datasets, particularly IMAGENET, using stan-\ndard large models and corresponding pretrained weights has become a de-facto\nmethod for deep learning applications to medical imaging. 
However, there are fun-\ndamental differences in data sizes, features and task speci\ufb01cations between natural\nimage classi\ufb01cation and the target medical tasks, and there is little understanding of\nthe effects of transfer. In this paper, we explore properties of transfer learning for\nmedical imaging. A performance evaluation on two large scale medical imaging\ntasks shows that surprisingly, transfer offers little bene\ufb01t to performance, and\nsimple, lightweight models can perform comparably to IMAGENET architectures.\nInvestigating the learned representations and features, we \ufb01nd that some of the\ndifferences from transfer learning are due to the over-parametrization of standard\nmodels rather than sophisticated feature reuse. We isolate where useful feature\nreuse occurs, and outline the implications for more ef\ufb01cient model exploration. We\nalso explore feature independent bene\ufb01ts of transfer arising from weight scalings.\n\n1\n\nIntroduction\n\nWith the growth of deep learning, transfer learning has become integral to many applications \u2014\nespecially in medical imaging, where the present standard is to take an existing architecture designed\nfor natural image datasets such as IMAGENET, together with corresponding pretrained weights (e.g.\nResNet [10], Inception [27]), and then \ufb01ne-tune the model on the medical imaging data.\nThis basic formula has seen almost universal adoption across many different medical specialties.\nTwo prominent lines of research have used this methodology for applications in radiology, training\narchitectures like ResNet, DenseNet on chest x-rays [31, 24] and ophthalmology, training Inception-\nv3, ResNet on retinal fundus images [2, 9, 23, 4]. The research on ophthalmology has also culminated\nin FDA approval [28], and full clinical deployment [29]. 
Other applications include performing early detection of Alzheimer's Disease [5], identifying skin cancer from photographs at dermatologist-level accuracy [6], and even determining human embryo quality for IVF procedures [15].
Despite the immense popularity of transfer learning in medical imaging, there has been little work studying its precise effects, even as recent work on transfer learning in the natural image setting [11, 16, 20, 12, 7] has challenged many commonly held beliefs. For example, in [11] it is shown that

∗Equal Contribution.
†Equal Contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Example images from the IMAGENET, retinal fundus photograph, and CHEXPERT datasets, respectively. The fundus photographs and chest x-rays have much higher resolution than the IMAGENET images, and are classified by looking for small local variations in tissue.

transfer (even between similar tasks) does not necessarily result in performance improvements, while [16] illustrates that pretrained features may be less general than previously thought.
In the medical imaging setting, many such open questions remain. As described above, transfer learning is typically performed by taking a standard IMAGENET architecture along with its pretrained weights, and then fine-tuning on the target task. However, IMAGENET classification and medical image diagnosis have considerable differences.
First, many medical imaging tasks start with a large image of a bodily region of interest and use variations in local textures to identify pathologies. For example, in retinal fundus images, small red 'dots' are an indication of microaneurysms and diabetic retinopathy [1], and in chest x-rays local white opaque patches are signs of consolidation and pneumonia.
This is in contrast to natural image datasets like IMAGENET, where there is often a clear global subject of the image (Fig. 1). There is thus an open question of how much IMAGENET feature reuse is helpful for medical images.
Additionally, most medical datasets have larger images (to facilitate the search for local variations), but many fewer of them than IMAGENET, which has roughly one million images; medical datasets range from several thousand images [15] to a couple hundred thousand [9, 24].
Finally, medical tasks often have significantly fewer classes (5 classes for Diabetic Retinopathy diagnosis [9], 5 to 14 chest pathologies from x-rays [24]) than the standard IMAGENET classification setup of 1000 classes. As standard IMAGENET architectures have a large number of parameters concentrated at the higher layers for precisely this reason, the design of these models is likely to be suboptimal for the medical setting.
In this paper, we perform a fine-grained study on transfer learning for medical images. Our main contributions are:
[1] We evaluate the performance of standard architectures developed for natural image datasets such as IMAGENET, as well as a family of non-standard but smaller and simpler models, on two large scale medical imaging tasks, for which transfer learning is currently the norm. We find that (i) in all of these cases, transfer does not significantly help performance; (ii) smaller, simpler convolutional architectures perform comparably to standard IMAGENET models; and (iii) IMAGENET performance is not predictive of medical performance.
These conclusions also hold in the very small data regime.
[2] Given the comparable performance, we investigate whether using pretrained weights leads to different learned representations, by using (SV)CCA [22] to directly analyze the hidden representations. We find that pretraining does affect the hidden representations, but there is a confounding issue of model size, where the large, standard IMAGENET models do not change significantly through the fine-tuning process, as evidenced through surprising correlations between representational similarity at initialization and after convergence.
[3] Using further analysis and weight transfusion experiments, where we partially reuse pretrained weights, we isolate locations where meaningful feature reuse does occur, and explore hybrid approaches to transfer learning where a subset of pretrained weights are used, and other parts of the network are redesigned and made more lightweight.
[4] We show there are also feature-independent benefits to pretraining: reusing only the scaling of the pretrained weights but not the features can itself lead to large gains in convergence speed.

2 Datasets

Our primary dataset, the RETINA data, consists of retinal fundus photographs [9], large 587 × 587 images of the back of the eye. These images are used to diagnose a variety of eye diseases including Diabetic Retinopathy (DR) [3]. DR is graded on a five-class scale of increasing severity [1]. Grades 3 and up are referable DR (requiring immediate specialist attention), while grades 1 and 2 correspond to non-referable DR. As in prior work [9, 2] we evaluate via AUC-ROC on identifying referable DR.
We also study a second medical imaging dataset, CHEXPERT [14], which consists of chest x-ray images (resized to 224 × 224), which can be used to diagnose 5 different thoracic pathologies: atelectasis, cardiomegaly, consolidation, edema and pleural effusion.
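Concretely, the referable-DR evaluation binarizes the five grades (grade 3 and up is the positive class) and scores a model's predicted probabilities with AUC-ROC. A minimal sketch, using the pairwise-ranking definition of AUC; the grades and scores below are made-up toy values, not data from the paper:

```python
import numpy as np

def referable_dr_auc(grades, scores):
    """AUC-ROC for referable DR: grades >= 3 form the positive class.
    Computed directly as P(score_pos > score_neg) over all positive/negative
    pairs (ties count half), which equals the area under the ROC curve."""
    grades = np.asarray(grades)
    scores = np.asarray(scores)
    pos = scores[grades >= 3]
    neg = scores[grades < 3]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Toy example: 5-class DR grades and hypothetical model scores.
grades = [0, 1, 2, 3, 4]
scores = [0.1, 0.5, 0.3, 0.8, 0.4]
print(referable_dr_auc(grades, scores))  # 5/6 = 0.8333...
```

The pairwise form is quadratic in the number of examples but makes the metric's meaning explicit; a rank-based implementation is the usual choice at scale.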
We evaluate our models on the AUC of diagnosing each of these pathologies. Figure 1 shows some example images from both datasets and IMAGENET, demonstrating drastic differences in visual features among those datasets.

3 Models and Performance Evaluation of Transfer Learning

To lay the groundwork for our study, we select multiple neural network architectures and evaluate their performance when (1) training from random initialization and (2) doing transfer learning from IMAGENET. We train both standard, high performing IMAGENET architectures that have been popular for transfer learning, as well as a family of significantly smaller convolutional neural networks, which achieve comparable performance on the medical tasks.
As far as we are aware, there has been little work studying the effects of transfer learning from IMAGENET on smaller, non-standard IMAGENET architectures. (For example, [21] studies a different model, but does not evaluate the effect of transfer learning.) This line of investigation is especially important in the medical setting, where large, computationally expensive models might significantly impede mobile and on-device applications. Furthermore, in standard IMAGENET models, most of the parameters are concentrated at the top, to perform the 1000-class classification. However, medical diagnosis often has considerably fewer classes (both the retinal fundus images and chest x-rays have just 5 classes), likely meaning that IMAGENET models are highly overparametrized.
We find that across both datasets and all models, transfer learning does not significantly affect performance. Additionally, the family of smaller lightweight convolutional networks performs comparably to standard IMAGENET models, despite having significantly worse accuracy on IMAGENET: the IMAGENET task is not necessarily a good indication of success on medical datasets.
Finally, we observe that these conclusions also hold in the setting of very limited data.

3.1 Description of Models

For the standard IMAGENET architectures, we evaluate ResNet50 [10] and Inception-v3 [27], which have both been used extensively in medical transfer learning applications [2, 9, 31]. We also design a family of simple, smaller convolutional architectures. The basic building block for this family is the popular sequence of a (2d) convolution, followed by batch normalization [13] and a ReLU activation. Each architecture has four to five repetitions of this basic layer. We call this model family CBR. Depending on the choice of the convolutional filter size (fixed for the entire architecture), the number of channels and layers, we get a family of architectures with size ranging from a third of the standard IMAGENET model size (CBR-LargeT, CBR-LargeW) to one twentieth the size (CBR-Tiny). Full architecture details are in the Appendix.

3.2 Results

We evaluate three repetitions of the different models and initializations (random initialization vs pretrained weights) on the two medical tasks, with the results shown in Tables 1, 2. There are two possibilities for repetitions of transfer learning: we can have a fixed set of pretrained weights and multiple training runs from that initialization, or, for each repetition, first train from scratch on IMAGENET and then fine-tune on the medical task. We opt for evaluating the former, as that is the standard method used in practice. For all models except Inception-v3, we first train on IMAGENET to get the pretrained weights. For Inception-v3, we used the pretrained weights provided by [26].
Table 1 shows the model performances on the RETINA data (AUC of identifying moderate Diabetic Retinopathy (DR), described in Section 2), along with IMAGENET top 5 accuracy.
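As a rough sense of scale for the CBR family described in Section 3.1, the parameter count of a conv + batch-norm stack is easy to tally. The sketch below uses hypothetical channel widths, not the paper's actual CBR configurations:

```python
def conv_bn_params(k, c_in, c_out):
    """Parameters in one conv(k x k) + batch-norm block:
    k*k*c_in*c_out conv weights plus 2*c_out batch-norm scale/shift
    (no conv bias, as is typical when batch norm follows)."""
    return k * k * c_in * c_out + 2 * c_out

def cbr_params(k, channels):
    """Total parameters of a CBR-style stack (ignoring the small output head)."""
    total = 0
    c_in = 3  # RGB input
    for c_out in channels:
        total += conv_bn_params(k, c_in, c_out)
        c_in = c_out
    return total

# Hypothetical widths: doubling every layer's width roughly quadruples the count,
# since most parameters sit in the widest (topmost) convolutions.
small = cbr_params(3, [32, 64, 128, 256])   # 388,896
large = cbr_params(3, [64, 128, 256, 512])  # 1,551,936
print(small, large, large / small)
```

This is the same effect noted for IMAGENET models: parameters concentrate in the widest layers near the top, which is where slimming buys the most.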
Dataset | Model Architecture | Random Init  | Transfer     | Parameters | IMAGENET Top5
RETINA  | Resnet-50          | 96.4% ± 0.05 | 96.7% ± 0.04 | 23,570,408 | 92.% ± 0.06
RETINA  | Inception-v3       | 96.6% ± 0.13 | 96.7% ± 0.05 | 22,881,424 | 93.9%
RETINA  | CBR-LargeT         | 96.2% ± 0.04 | 96.2% ± 0.04 |  8,532,480 | 77.5% ± 0.03
RETINA  | CBR-LargeW         | 95.8% ± 0.04 | 95.8% ± 0.05 |  8,432,128 | 75.1% ± 0.3
RETINA  | CBR-Small          | 95.7% ± 0.04 | 95.8% ± 0.01 |  2,108,672 | 67.6% ± 0.3
RETINA  | CBR-Tiny           | 95.8% ± 0.03 | 95.8% ± 0.01 |  1,076,480 | 73.5% ± 0.05

Table 1: Transfer learning and random initialization perform comparably across both standard IMAGENET architectures and simple, lightweight CNNs for AUCs from diagnosing moderate DR. Both sets of models also have similar AUCs, despite significant differences in size and complexity. Model performance on DR diagnosis is also not closely correlated with IMAGENET performance, with the small models performing poorly on IMAGENET but very comparably on the medical task.

Model Architecture | Atelectasis | Cardiomegaly | Consolidation | Edema      | Pleural Effusion
Resnet-50          | 79.52±0.31  | 85.49±1.32   | 75.23±0.35    | 88.34±1.17 | 88.70±0.13
Resnet-50 (trans)  | 79.76±0.47  | 84.42±0.65   | 74.93±1.41    | 88.89±1.66 | 88.07±1.23
CBR-LargeT         | 81.52±0.25  | 88.12±0.25   | 74.83±1.66    | 87.97±1.40 | 88.37±0.01
CBR-LargeT (trans) | 80.89±1.68  | 86.15±0.71   | 76.84±0.87    | 89.03±0.74 | 88.44±0.84
CBR-LargeW         | 79.79±0.79  | 86.71±1.45   | 74.63±0.69    | 84.80±0.77 | 86.53±0.54
CBR-LargeW (trans) | 80.70±0.31  | 86.87±0.33   | 77.23±0.84    | 89.57±0.34 | 87.29±0.69
CBR-Small          | 80.43±0.72  | 88.07±0.60   | 74.36±1.06    | 86.20±1.35 | 86.14±1.78
CBR-Small (trans)  | 80.18±0.85  | 86.48±1.13   | 75.24±1.43    | 89.09±1.04 | 87.88±1.01
CBR-Tiny           | 80.81±0.55  | 85.31±0.82   | 75.17±0.73    | 84.87±1.13 | 85.56±0.89
CBR-Tiny (trans)   | 80.02±1.06  | 84.28±0.82   | 75.74±0.71    | 89.81±1.08 | 87.69±0.75

Table 2: Transfer learning provides mixed performance gains on chest x-rays. Performances (AUC%) of diagnosing different pathologies on the CHEXPERT dataset. Again we see that transfer learning does not help significantly, and much smaller models perform comparably.

Firstly, we see that transfer learning has minimal effect on performance, not helping the smaller CBR architectures at all, and only providing a fraction of a percent gain for Resnet and Inception. Next, we see that despite the significantly lower performance of the CBR architectures on IMAGENET, they perform very comparably to Resnet and Inception on the RETINA task. These same conclusions are seen in the chest x-ray results, Table 2, where we show the performance AUC for the five different pathologies (Section 2). We again observe mixed gains from transfer learning.
For Atelectasis, Cardiomegaly and Consolidation, transfer learning performs slightly worse, but helps with Edema and Pleural Effusion.

3.3 The Very Small Data Regime

We conducted additional experiments to study the effect of transfer learning in the very small data regime. Most medical datasets are significantly smaller than IMAGENET, which is also the case for our two datasets. However, our datasets still have around two hundred thousand examples, and other settings may only have a few thousand. To study the effects in this very small data regime, we trained models on only 5000 datapoints from the RETINA dataset, and examined the effect of transfer learning. The results, in Table 3, suggest that while transfer learning has a bigger effect with very small amounts of data, there is a confounding effect of model size: transfer primarily helps the large models (which are designed to be trained with a million examples) and smaller models again show little difference between transfer and random initialization.

Model      | Rand Init | Pretrained
Resnet50   | 92.2%     | 94.6%
CBR-LargeT | 93.6%     | 93.9%
CBR-LargeW | 93.6%     | 93.7%

Table 3: Benefits of transfer learning in the small data regime are largely due to architecture size. AUCs when training on the RETINA task with only 5000 datapoints. We see a bigger gap between random initialization and transfer learning for Resnet (a large model), but not for the smaller CBR models.

Figure 2: Pretrained weights give rise to different hidden representations than training from random initialization for large models. We compute CCA similarity scores between representations learned using pretrained weights and those from random initialization. We do this for the top two layers (or stages for Resnet, Inception) and average the scores, plotting the results in orange. In blue is a baseline similarity score, for representations trained from different random initializations.
We see that representations learned from random initialization are more similar to each other than those learned from pretrained weights for larger models, with less of a distinction for smaller models.

4 Representational Analysis of the Effects of Transfer

In Section 3 we saw that transfer learning and training from random initialization result in very similar performance across different neural architectures and tasks. This gives rise to some natural questions about the effect of transfer learning on the kinds of representations learned by the neural networks. Most fundamentally, does transfer learning in fact result in any representational differences compared to training from random initialization? Or are the effects of the initialization lost? Does feature reuse take place, and if so, where exactly? In this section, we provide some answers to these basic questions. Our approach directly analyzes and compares the hidden representations learned by different populations of neural networks, using (SV)CCA [22, 19], revealing an important dependence on model size, and differences in behavior between lower and higher layers. These insights, combined with the results of Section 5, suggest new, hybrid approaches to transfer learning.
Quantitatively Studying Hidden Representations with (SV)CCA  To understand how pretraining affects the features and representations learned by the models, we would like to (quantitatively) study the learned intermediate functions (latent layers). Analyzing latent representations is challenging due to their complexity and the lack of any simple mapping to inputs, outputs or other layers. A recent tool that effectively overcomes these challenges is (Singular Vector) Canonical Correlation Analysis, (SV)CCA [22, 19], which has been used to study latent representations through training, across different models, alternate training objectives, and other properties [22, 19, 25, 18, 8, 17, 30].
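As a rough numpy sketch of this kind of similarity score (a simplification: full SVCCA [22] additionally prunes each layer's activations with an SVD step), CCA compares two activation matrices, one row per input example and one column per neuron:

```python
import numpy as np

def mean_cca_similarity(x, y):
    """Mean canonical correlation between two sets of neuron activations.
    x: (n_examples, n_neurons_1), y: (n_examples, n_neurons_2)."""
    # Center each neuron's activation vector over the examples.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the two activation subspaces.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Canonical correlations are the singular values of qx^T @ qy.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return rho[: min(x.shape[1], y.shape[1])].mean()

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 8))
# Invariance to invertible linear maps: similarity stays at 1.
mixed = acts @ rng.normal(size=(8, 8))
print(mean_cca_similarity(acts, mixed))                       # close to 1.0
print(mean_cca_similarity(acts, rng.normal(size=(500, 8))))   # much lower
```

The invariance to invertible linear transforms is what makes CCA suitable for comparing layers of independently trained networks, whose neurons have no canonical alignment.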
Rather than working directly with the model parameters or neurons, CCA works with neuron activation vectors: the ordered collection of outputs of the neuron on a sequence of inputs. Given the activation vectors for two sets of neurons (say, corresponding to distinct layers), CCA seeks linear combinations of each that are as correlated as possible. We adapt existing CCA methods to prevent the size of the activation sets from overwhelming the computation in large models (details in Appendix C), and apply them to compare the latent representations of corresponding hidden layers of different pairs of neural networks, giving a CCA similarity score of the learned intermediate functions.
Transfer Learning and Random Initialization Learn Different Representations  Our first experiment uses CCA to compare the similarity of the hidden representations learned when training from pretrained weights to those learned when training from random initialization. We use the representations learned at the top two layers (for CBRs) or stages (for Resnet, Inception) before the output layer, averaging their similarity scores. As a baseline to compare to, we also look at CCA similarity scores for the same representations when training from random initialization with two different seeds (different initializations and gradient updates). The results are shown in Figure 2. For larger models (Resnet, Inception), there is a clear difference between representations, with the similarity of representations between training from random initialization and pretrained weights (orange) noticeably lower

Figure 3: Per-layer CCA similarities before and after training on the medical task.
For all models, we see that the lowest layers are most similar to their initializations, and this is especially evident for Resnet50 (a large model). We also see that feature reuse is mostly restricted to the bottom two layers (stages for Resnet), the only place where similarity with initialization is significantly higher for pretrained weights (grey dotted lines show the difference in similarity scores between pretrained and random initialization).

Figure 4: Large models move less through training at lower layers: similarity at initialization is highly correlated with similarity at convergence for large models. We plot CCA similarity of Resnet (conv1) initialized randomly and with pretrained weights at (i) initialization, against (ii) CCA similarity of the converged representations (top row, second from left). We also do this for two different random initializations (top row, left). In both cases (even for random initialization), we see a surprising, strong correlation between similarity at initialization and similarity after convergence (R² = 0.75, 0.84). This is not the case for the smaller CBR-Small model, illustrating the overparametrization of Resnet for the task. Higher layers must likely change much more for good task performance.

than representations learned independently from different random initializations (blue). However for smaller models (CBRs), the functions learned are more similar.
Larger Models Change Less Through Training  The reasons underlying this difference between larger and smaller models become apparent as we further study the hidden representations of all the layers. We find that larger models change much less during training, especially in the lowest layers. This is true even when they are randomly initialized, ruling out feature reuse as the sole cause, and implying their overparametrization for the task.
This is in line with other recent findings [33].
In Figure 3, we look at per-layer representational similarity before/after finetuning, which shows that the lowest layer in Resnet (a large model) is significantly more similar to its initialization than in the smaller models. This plot also suggests that any serious feature reuse is restricted to the lowest couple of layers, which is where similarity before/after training is clearly higher for pretrained weights vs random initialization. In Figure 4, we plot the CCA similarity scores between representations using pretrained weights and random initialization at initialization vs after training, for the lowest layer (conv1) as well as higher layers, for Resnet and CBR-Small. Large models changing less through training is evidenced by a surprising correlation between the CCA similarities for Resnet conv1, which is not true for higher layers or the smaller CBR-Small model.

[Figure 4 panel titles: Resnet50 conv1 CCA(Rand, Rand) R² = 0.838, CCA(Rand, ImNet) R² = 0.752; CBR-Small conv1 R² = 0.197, 0.160; Resnet50 higher layers R² = 0.008, 0.001; CBR-Small higher layers R² = 0.079, 0.322.]

(a) Resnet Init  (b) Resnet Final  (c) Res-trans Init  (d) Res-trans Final  (e) CBR-Small Init  (f) CBR-Small Final  (g) CBR Trans  (h) CBR Trans Final

Figure 5: Visualization of conv1 filters shows the remains of initialization after training in Resnet, and the lack of and erasing of Gabor filters in CBR-Small. We visualize the filters before and after training from random initialization and pretrained weights for Resnet (top row) and CBR-Small (bottom row). Comparing the similarity of (a) to (b) and (c) to (d) shows the limited movement of Resnet through training, while CBR-Small changes much more. We see that CBR does not learn Gabor filters when trained from scratch (f), and also erases some of the pretrained Gabors (compare (g) to (h)).

Filter Visualizations and the Absence of Gabors  As a final study of how pretraining affects the model representations, we visualize some of the filters of conv1 for Resnet and CBR-Small (both 7x7 kernels), before and after training on the RETINA task.
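Filter grids of this kind are typically rendered by rescaling each filter to the displayable [0, 1] range independently. A small numpy sketch; the function name and shapes are illustrative, not from the paper:

```python
import numpy as np

def filters_to_images(w):
    """Rescale each filter to [0, 1] independently for display.
    w: conv filter bank of shape (height, width, channels, n_filters)."""
    w = np.moveaxis(w, -1, 0)                    # (n_filters, h, w, c)
    lo = w.min(axis=(1, 2, 3), keepdims=True)
    hi = w.max(axis=(1, 2, 3), keepdims=True)
    return (w - lo) / np.maximum(hi - lo, 1e-12)

# A stand-in 7x7x3 filter bank with 16 filters (random, for illustration).
rng = np.random.default_rng(0)
w = rng.normal(size=(7, 7, 3, 16))
images = filters_to_images(w)
```

Per-filter (rather than global) rescaling is what makes structure like Gabor edges visible even in filters with small absolute weight magnitudes.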
The filters are shown in Figure 5, with visualizations for chest x-rays in the Appendix. These add evidence to the aforementioned observation: the Resnet filters change much less than those of CBR-Small. In contrast, CBR-Small moves further from its initialization, and its learned filters are more similar between the random and pretrained initializations. Interestingly, CBR-Small does not appear to learn Gabor filters when trained from scratch (bottom row, second column). Comparing the third and fourth columns of the bottom row, we see that CBR-Small even erases some of the Gabor filters that it is initialized with in the pretrained weights.

5 Convergence: Feature Independent Benefits and Weight Transfusion

In this section, we investigate the effects of transfer learning on convergence speed, finding that: (i) surprisingly, transfer offers feature independent benefits to convergence simply through better weight scaling; (ii) using pretrained weights from the lowest two layers/stages has the biggest effect on convergence, further supporting the finding in the previous section that any meaningful feature reuse is concentrated in these lowest two layers (Figure 3). These results suggest some hybrid approaches to transfer learning, where only a subset of the pretrained weights (lowest layers) are used, with a lightweight redesign to the top of the network, and even using entirely synthetic features, such as synthetic Gabor filters (Appendix F.3). We show these hybrid approaches capture most of the benefits of transfer and enable greater flexibility in its application.
Feature Independent Benefits of Transfer: Weight Scalings  We consistently observe that using pretrained weights results in faster convergence. One explanation for this speedup is that there is significant feature reuse.
However, the results of Section 4 illustrate that there are many confounding factors, such as model size, and feature reuse is likely limited to the lowest layers. We thus tested to see whether there were feature independent benefits of the pretrained weights, such as better scaling. In particular, we initialized iid weights from N(µ̃, σ̃²), where µ̃ and σ̃² are the mean and variance of W̃, the pretrained weights. Doing this for each layer separately inherits the scaling of the pretrained weights, but destroys all of the features. We called this the Mean Var init, and found that it significantly helps speed up convergence (Figure 6). Several additional experiments studying batch normalization, weight sampling, etc. are in the Appendix.

Figure 6: Using only the scaling of the pretrained weights (Mean Var Init) helps with convergence speed. The figures compare the standard transfer learning and the Mean Var initialization scheme to training from scratch. On both the RETINA data (a-b) and the CHEXPERT data (c) (with Resnet50 on the Consolidation disease), we see convergence speedups.

Figure 7: Reusing a subset of the pretrained weights (weight transfusion) further supports only the lowest couple of layers performing meaningful feature reuse. We initialize a Resnet with a contiguous subset of the layers using pretrained weights (weight transfusion), and the rest randomly, and train on the RETINA task. On the left, we show the convergence plots when transfusing up to conv1 (just one layer), up to block1 (conv1 and all the layers in block1), etc., up to full transfer. On the right, we plot the number of train steps taken to reach 91% AUC for different numbers of transfused weights. Consistent with findings in Section 4, we observe that reusing the lowest layers leads to the greatest gain in convergence speed.
Perhaps surprisingly, just reusing conv1 gives the greatest marginal convergence speedup, even though transfusing weights for a block means several new layers are using pretrained weights.

Weight Transfusions and Feature Reuse  We next study whether the conclusion suggested by Section 4 (that meaningful feature reuse is restricted to the lowest two layers/stages of the network) is supported by the effect on convergence speed. We do this via a weight transfusion experiment, transferring a contiguous set of some of the pretrained weights, randomly initializing the rest of the network, and training on the medical task. Plotting the training curves and steps taken to reach a threshold AUC in Figure 7 indeed shows that using pretrained weights for the lowest few layers gives the biggest training speedup. Interestingly, just using pretrained weights for conv1 for Resnet results in the largest gain, despite transfusion for a Resnet block meaning multiple layers are now reusing pretrained weights.
Takeaways: Hybrid Approaches to Transfer Learning  The transfusion results suggest some hybrid, more flexible approaches to transfer learning. Firstly, for larger models such as Resnet, we could consider reusing pretrained weights up to e.g. Block2, redesigning the top of the network (which has the bulk of the parameters) to be more lightweight, initializing these layers randomly, and training this new Slim model end to end. Seeing the disproportionate importance of conv1, we might also look at the effect of initializing conv1 with synthetic Gabor filters (see Appendix F.3 for details) and the rest of the network randomly. In Figure 8 we illustrate these hybrid approaches. Slimming the top of the network in this way offers the same convergence and performance as transfer learning, and using synthetic Gabors for conv1 has the same effect as pretrained weights for conv1.
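The two initialization schemes of this section can be sketched in a few lines. The layer names, shapes, and the He-style fallback initialization below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-layer pretrained weights (names and shapes are illustrative).
pretrained = {
    "conv1":  rng.normal(0.0, 0.05, size=(7, 7, 3, 64)),
    "block1": rng.normal(0.0, 0.02, size=(3, 3, 64, 64)),
    "block2": rng.normal(0.0, 0.02, size=(3, 3, 64, 128)),
}
layer_order = ["conv1", "block1", "block2"]

def mean_var_init(w, rng):
    """Mean Var init: iid samples from N(mean(w), var(w)). This keeps the
    per-layer scaling of the pretrained weights but destroys the features."""
    return rng.normal(w.mean(), w.std(), size=w.shape)

def transfuse(reuse_up_to, rng):
    """Weight transfusion: reuse pretrained weights for a contiguous prefix
    of layers and randomly re-initialize the rest (He-style here)."""
    cutoff = layer_order.index(reuse_up_to) + 1
    init = {}
    for i, name in enumerate(layer_order):
        w = pretrained[name]
        if i < cutoff:
            init[name] = w.copy()                # feature reuse
        else:
            fan_in = int(np.prod(w.shape[:-1]))  # k * k * c_in
            init[name] = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=w.shape)
    return init

# conv1 and block1 keep pretrained features; block2 is re-initialized.
weights = transfuse("block1", rng)
```

A Slim variant would additionally shrink the shapes of the randomly re-initialized upper layers before training end to end.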
These variants highlight many new, rich and flexible ways to use transfer learning.

Figure 8: Hybrid approaches to transfer learning: reusing a subset of the weights and slimming the remainder of the network, and using synthetic Gabors for conv1. For Resnet, we look at the effect of reusing pretrained weights up to Block2, slimming the remainder of the network (halving the number of channels), randomly initializing those layers, and training end to end. This matches the performance and convergence of full transfer learning. We also look at initializing conv1 with synthetic Gabor filters (so no use of pretrained weights), and the rest of the network randomly, which performs equivalently to reusing conv1 pretrained weights. This result generalizes to different architectures, e.g. CBR-LargeW on the right.

6 Conclusion

In this paper, we have investigated many central questions on transfer learning for medical imaging applications.
Having benchmarked both standard IMAGENET architectures and non-standard lightweight models (itself an underexplored question) on two large scale medical tasks, we find that transfer learning offers limited performance gains and much smaller architectures can perform comparably to the standard IMAGENET models. Our exploration of representational similarity and feature reuse reveals surprising correlations between similarities at initialization and after training for standard IMAGENET models, providing evidence of their overparametrization for the task. We also find that meaningful feature reuse is concentrated at the lowest layers, and explore more flexible, hybrid approaches to transfer suggested by these results, finding that such approaches maintain all the benefits of transfer and open up rich new possibilities. Finally, we demonstrate feature-independent benefits of transfer learning: better weight scaling and faster convergence.

References

[1] AAO. International Clinical Diabetic Retinopathy Disease Severity Scale Detailed Table. American Academy of Ophthalmology, 2002.

[2] Michael David Abràmoff, Yiyue Lou, Ali Erginay, Warren Clarida, Ryan Amelon, James C Folk, and Meindert Niemeijer. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Investigative ophthalmology & visual science, 57(13):5200–5206, 2016.

[3] Hasseb Ahsan. Diabetic retinopathy – biomolecules and multiple pathophysiology. Diabetes and Metabolic Syndrome: Clinical Research and Reviews, pages 51–54, 2015.

[4] Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O'Donoghue, Daniel Visentin, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease.
Nature medicine, 24(9):1342, 2018.

[5] Yiming Ding, Jae Ho Sohn, Michael G Kawczynski, Hari Trivedi, Roy Harnish, Nathaniel W Jenkins, Dmytro Lituiev, Timothy P Copeland, Mariam S Aboian, Carina Mari Aparici, et al. A deep learning model to predict a diagnosis of alzheimer disease by using 18f-fdg pet of the brain. Radiology, 290(2):456–464, 2018.

[6] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.

[7] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.

[8] Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. arXiv preprint arXiv:1810.13243, 2018.

[9] Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip Q Nelson, Jessica Mega, and Dale Webster. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[11] Kaiming He, Ross Girshick, and Piotr Doll\u00e1r. Rethinking imagenet pre-training. arXiv preprint\n\narXiv:1811.08883, 2018.\n\n[12] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer\n\nlearning? arXiv preprint arXiv:1608.08614, 2016.\n\n[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\n\nby reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[14] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute,\nHenrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large\nchest radiograph dataset with uncertainty labels and expert comparison. In Thirty-Third AAAI\nConference on Arti\ufb01cial Intelligence, 2019.\n\n[15] Pegah Khosravi, Ehsan Kazemi, Qiansheng Zhan, Marco Toschi, Jonas E Malmsten, Cristina\nHickman, Marcos Meseguer, Zev Rosenwaks, Olivier Elemento, Nikica Zaninovic, et al. Robust\nautomated assessment of human blastocyst quality using deep learning. bioRxiv, page 394882,\n2018.\n\n[16] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better?\n\narXiv preprint arXiv:1805.08974, 2018.\n\n[17] Sneha Reddy Kudugunta, Ankur Bapna, Isaac Caswell, Naveen Arivazhagan, and Orhan Firat.\nInvestigating multilingual nmt representations at scale. arXiv preprint arXiv:1909.02197, 2019.\n\n[18] Martin Magill, Faisal Qureshi, and Hendrick de Haan. Neural networks trained to solve differ-\nential equations learn general representations. In Advances in Neural Information Processing\nSystems, pages 4075\u20134085, 2018.\n\n[19] Ari S Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in\n\nneural networks with canonical correlation. 
arXiv preprint arXiv:1806.05759, 2018.\n\n[20] Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V Le, and Ruoming Pang.\nDomain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056,\n2018.\n\n[21] F Pasa, V Golkov, F Pfeiffer, D Cremers, and D Pfeiffer. Ef\ufb01cient deep network architectures\nfor fast chest x-ray tuberculosis screening and visualization. Scienti\ufb01c reports, 9(1):6268, 2019.\n\n[22] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vec-\ntor canonical correlation analysis for deep learning dynamics and interpretability. In Advances\nin Neural Information Processing Systems, pages 6076\u20136085, 2017.\n\n10\n\n\f[23] Maithra Raghu, Katy Blumer, Rory Sayres, Ziad Obermeyer, Sendhil Mullainathan, and Jon\nKleinberg. Direct uncertainty prediction with applications to healthcare. arXiv preprint\narXiv:1807.01771, 2018.\n\n[24] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy\nDing, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng.\nChexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. CoRR,\narXiv:1711.05225, 2017.\n\n[25] Naomi Saphra and Adam Lopez. Understanding learning dynamics of language models with\n\nsvcca. arXiv preprint arXiv:1811.00225, 2018.\n\n[26] Tensor\ufb02ow Slim. Tensor\ufb02ow slim inception-v3. https://github.com/tensorflow/\n\nmodels/tree/master/research/slim, 2017.\n\n[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,\nDumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1\u20139,\n2015.\n\n[28] Eric J Topol. 
High-performance medicine: the convergence of human and artificial intelligence. Nature medicine, 25(1):44, 2019.

[29] Amber A Van Der Heijden, Michael D Abramoff, Frank Verbraak, Manon V van Hecke, Albert Liem, and Giel Nijpels. Validation of automated screening for referable diabetic retinopathy with the idx-dr device in the hoorn diabetes care system. Acta ophthalmologica, 96(1):63–68, 2018.

[30] Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380, 2019.

[31] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3462–3471. IEEE, 2017.

[32] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.

[33] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all layers created equal? arXiv preprint arXiv:1902.01996, 2019.