{"title": "The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 1854, "page_last": 1862, "abstract": "Understanding if classifiers generalize to out-of-sample datasets is a central problem in machine learning. Microscopy images provide a standardized way to measure the generalization capacity of image classifiers, as we can image the same classes of objects under increasingly divergent, but controlled factors of variation. We created a public dataset of 132,209 images of mouse cells, COOS-7 (Cells Out Of Sample 7-Class). COOS-7 provides a classification setting where four test datasets have increasing degrees of covariate shift: some images are random subsets of the training data, while others are from experiments reproduced months later and imaged by different instruments. We benchmarked a range of classification models using different representations, including transferred neural network features, end-to-end classification with a supervised deep CNN, and features from a self-supervised CNN. While most classifiers perform well on test datasets similar to the training dataset, all classifiers failed to generalize their performance to datasets with greater covariate shifts. These baselines highlight the challenges of covariate shifts in image data, and establish metrics for improving the generalization capacity of image classifiers.", "full_text": "The Cells Out of Sample (COOS) dataset and\n\nbenchmarks for measuring out-of-sample\n\ngeneralization of image classi\ufb01ers\n\nAlex X. Lu\n\nComputer Science,\nUniversity of Toronto\n\nalexlu@cs.toronto.edu\n\nAmy X. 
Lu\n\nComputer Science,\nUniversity of Toronto\n\nVector Institute\n\namyxlu@cs.toronto.edu\n\nWiebke Schormann\nBiological Sciences,\n\nSunnybrook Research Institute\n\nwiebke.schormann@sri.utoronto.ca\n\nMarzyeh Ghassemi\n\nCIFAR AI Chair,\n\nUniversity of Toronto\n\nVector Institute\n\nmarzyeh@cs.toronto.edu\n\nDavid W. Andrews\nBiological Sciences,\n\nSunnybrook Research Institute\n\nBiochemistry and Medical Biophysics,\n\nUniversity of Toronto\n\ndavid.andrews@sri.utoronto.ca\n\nAlan M. Moses\n\nCell and Systems Biology\n\nComputer Science\n\nCAGEF\n\nUniversity of Toronto\n\nalan.moses@utoronto.ca\n\nAbstract\n\nUnderstanding if classi\ufb01ers generalize to out-of-sample datasets is a central prob-\nlem in machine learning. Microscopy images provide a standardized way to\nmeasure the generalization capacity of image classi\ufb01ers, as we can image the same\nclasses of objects under increasingly divergent, but controlled factors of variation.\nWe created a public dataset of 132,209 images of mouse cells, COOS-7 (Cells\nOut Of Sample 7-Class). COOS-7 provides a classi\ufb01cation setting where four\ntest datasets have increasing degrees of covariate shift: some images are random\nsubsets of the training data, while others are from experiments reproduced months\nlater and imaged by different instruments. We benchmarked a range of classi\ufb01ca-\ntion models using different representations, including transferred neural network\nfeatures, end-to-end classi\ufb01cation with a supervised deep CNN, and features from a\nself-supervised CNN. While most classi\ufb01ers perform well on test datasets similar to\nthe training dataset, all classi\ufb01ers failed to generalize their performance to datasets\nwith greater covariate shifts. 
These baselines highlight the challenges of covariate\nshifts in image data, and establish metrics for improving the generalization capacity\nof image classi\ufb01ers.\n\n1\n\nIntroduction\n\nFor a classi\ufb01er to be useful predictively, it must be able to accurately label out-of-sample data (new\ndata not seen during training). Researchers often estimate predictive performance by holding out a\nrandom subset of the training data, but this only simulates the condition where test and training data\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fare drawn from the same distribution. In practice, even small natural variations in data distributions\ncan challenge the generalization capacity of classi\ufb01ers: Recht et al. show that deep learning models\ntrained on CIFAR or ImageNet drop in classi\ufb01cation accuracy when evaluated on a new dataset\ncarefully curated with the same methods as the original datasets [1], suggesting that even state-of-the-\nart classi\ufb01ers are not robust to out-of-sample data from a more realistic setting.\nWhile understanding the robustness of classi\ufb01cation models to covariate shifts (situations where the\ndistribution of out-of-sample data differs from that of training data) is broadly applicable, biomedical\ndomains exemplify cases where the failure of image classi\ufb01ers to generalize can have serious\nconsequences. Diagnostic systems, such as those that predict pneumonia from chest radiographs, may\nnot generalize to data from different institutions [2]. In pharmaceutical research, drugs are screened\nbased on the effects they have on diseased cells [3]; models classifying these effects perform better on\nmicroscope images from the same sample than on reproduced experiments [4]. 
These issues are not\nexclusive to images, and are prevalent in many biomedical datasets: similar challenges include batch\neffects in genomics data [5], site effects in MRI data [6], and covariate shifts over time in medical\nrecords [7, 8]. Thus, validating models with realistic out-of-sample datasets is important not only for\nestimating performance in real use-cases where covariate shifts are unavoidable, but also for model\nselection: performance gains on randomly held-out test data may not translate to improvements on\ndatasets with covariate shifts [7].\nHere, we sought to create a standardized dataset for measuring the robustness of image classi\ufb01ers\nunder various degrees of covariate shift. We reasoned that microscopy experiments would allow us\nto image a large set of naturally variable objects (cells) under controlled factors of variation. Cells\nnaturally vary in aspects like shape or size [9]. While still stochastic, these variations are in\ufb02uenced\nby environmental factors like temperature or humidity [10], meaning that images taken on the same\nday are more likely to be similar than those taken on different days or seasons. Compounding these\nbiological variations are technical biases, such as microscope settings. Different instruments may\nproduce subtle illumination or contrast differences, which classi\ufb01ers can over\ufb01t [11, 12].\nWe introduce COOS-7 (Cells Out Of Sample 7-Class), a public dataset of 132,209 images of mouse\ncells. In addition to a training dataset of 41,456 images, COOS-7 is associated with four test datasets,\nrepresenting increasingly divergent factors of variation from the training dataset: some images are\nrandom subsets of the training data, while others are from experiments reproduced months later and\nimaged by different instruments. 
We benchmark a range of classi\ufb01cation models using different\nrepresentations, both classic and state-of-the-art, and show that all methods drop signi\ufb01cantly in\nclassi\ufb01cation performance on the most diverged datasets.\nThe full COOS-7 dataset is freely available at Zenodo (https://zenodo.org/record/3386336)\nunder a CC-BY-NC 4.0 license. We provide a script to unpack all images into directories of tiff \ufb01les.\n\n2 COOS-7\n\n2.1 Overview of images and classi\ufb01cation setting\n\nTo create COOS-7, we curated 132,209 images of mouse cells. Each image in COOS-7 is a 64x64\npixel crop centered around a unique mouse cell. Each image contains two channels. The \ufb01rst channel\nshows a \ufb02uorescent protein that targets a speci\ufb01c component of the cell. Each mouse cell is stained\nwith one of seven \ufb02uorescent proteins, which highlight distinct parts of the cell ranging from the\nER to the nuclear membrane. The goal of our classi\ufb01cation problem is to predict which \ufb02uorescent\nprotein a cell has been stained with: Table 1 summarizes our class labels and shows example images\nof the \ufb01rst channel for each class, from three of the datasets in COOS-7.\nThe second channel is a \ufb02uorescent dye that stains the nucleus, consistent across all cells in our\ndataset. On its own, the second nucleus channel is not expected to discriminate any of the classes in\nour dataset, but we provide this channel to help models learn useful correlations. For example, while\nthe Golgi (class 3) and the peroxisomes (class 5) are both characterized as bright dots in the cell, the\nGolgi tends to surround the nucleus, while the peroxisomes are distributed more evenly in the cell.\nAll images are stored in 16-bit, representing the raw intensity values acquired by the microscope,\nwhich we provide to maximize \ufb02exibility in preprocessing for methods on this dataset. 
For visualization purposes (and as preprocessing to the methods we benchmark), we rescale images, but users\nshould be aware that the raw images will look different from those presented in Table 1.\n\nTable 1: Classes and examples from COOS-7 (example image columns for Training, Test3, and Test4 are not reproduced in this text)\n\nLabel | Class\n0 | Endoplasmic Reticulum (ER)\n1 | Inner Mitochondrial Membrane (IMM)\n2 | Golgi\n3 | Peroxisomes\n4 | Early Endosome\n5 | Cytosol\n6 | Nuclear Envelope\n\n2.2 Overview of training and test datasets\n\nCOOS-7 is curated from a larger set of microscopy experiments, spanning the course of two years.\nIn these experiments, cells were grown on plates containing 384 wells. Each well corresponds to one fluorescent\nprotein, with the configuration differing from plate to plate. A robot-controlled microscope slides\nover the wells, taking 10-20 images for each well (see methods of [13] for a similar experimental\nset-up). The original images taken by the microscope typically contain multiple cells; we process\nthese into crops centered around individual cells by segmenting the second nucleus channel using a\ntrained mask-RCNN, YeastSpotter [14]. We systematically imaged 95 plates, typically with about\na week between plates. Four of these plates were grown and imaged on a different microscope in a\ncollaborating institution. All images were taken using a PerkinElmer OPERA\u00ae QHS spinning disk\nautomated confocal microscope with 40x water objectives (NA=0.9).\nWe exploited the structure of these experiments to produce five independent datasets with different\nfactors of variation (Table 2). We manually examined all candidate wells for our seven selected\nfluorescent proteins to ensure that they were free from experimental defect and contamination, and to\nverify the images were visually consistent with their label. 
With these filters, we curated a subset of\nhigh-quality images, spread over 7 plates for each class. We chose 8 wells for each plate, for a total\nof 56 wells per class. We roughly balanced the number of images by using a consistent number of\nraw microscope images per well (although there is still some class imbalance at the level of the cell\ncrop images due to differing proliferation rates).\nFinally, we divided these plates and wells into training and test datasets, as described in Table 2.\nWhere possible, we emphasized potential systematic biases in dividing our dataset. For Test2, we\nonly included wells at the borders of a plate, as these wells are more susceptible to environmental\neffects than interior wells [15]. For Test3, we chose the two chronologically latest experiments, to\nemphasize potential non-stationarities over time (in most classes, the Test3 experiments differ from\nthe Training dataset by a gap of months).\nIn our classification setting, all methods must be trained and optimized using the Training dataset\nexclusively, and their performance evaluated on each of the four test datasets. 
As shown in Table 1, while the morphology of cells and contrast of images may differ in the test datasets relative to the\ntraining datasets, their underlying classes are still visibly distinct and identifiable.\n\nTable 2: Description of datasets provided in COOS-7\n\nDataset | Description | Images\nTraining | Images from 4 independent plates for each class | 41,456\nTest1 | Randomly held-out images from the same plates in training dataset | 10,364\nTest2 | Images from the same plates, but different wells than training dataset | 17,021\nTest3 | Images from 2 independent plates for each class, reproduced on different days than training dataset | 32,596\nTest4 | Images from 1 plate for each class, reproduced on different day and imaged under different microscope than training dataset | 30,772\n\n2.3 Related Work\n\nCovariate shifts caused by differences in plate, well, and instruments are widely acknowledged as an\nissue in microscopy datasets [15] and have been demonstrated to cause classifiers on these images to\noverfit [11]. Previous work recognizes the importance of demonstrating robustness, and numerous\nmethods have been proposed [4, 12, 16, 17, 18]. However, previous works have been limited in their\nability to measure these effects due to lack of appropriate datasets. These studies rely on adapting\npre-existing datasets, which are not designed with the purpose of measuring covariate shifts. For\nexample, while [4] measure generalization under batch effects by holding out images from the same\nbatch during evaluation, this procedure reduces the number of classes evaluated, as some classes\nare only imaged within a single batch. 
Similarly, high-throughput microscopy databases like [19]\nsometimes provide replicates of screens under the same treatment, but the arrangement of proteins on\nplates generally remains the same each time, prohibiting the analysis of well effects.\nIn contrast, our experimental design centers around randomized plate arrangements and the replication\nof experiments over time, directly enabling strati\ufb01cation of the dataset by multiple kinds of covariate\nshifts. Our test sets encompass covariate shifts not typically seen in other datasets: few microscopy\ndatasets replicate experiments at different sites/microscopes. To our knowledge, COOS is the most\nextensive microscopy image dataset for measuring generalization under covariate shifts to date.\nCompared to other datasets in computer vision, COOS resembles those used for studying domain\nadaptation [20]. Unlike these datasets, the test datasets in COOS should not be considered as being\nfrom a different domain than the training dataset: the datasets encompass natural variation in cells that\nwould be dif\ufb01cult or impossible to control for in a realistic deployment situation. Our dataset is more\nsimilar to that of [1] in that we show that even in-domain generalization is challenging. Compared to\nthe natural images in their dataset, we consider our cell images to be simpler. In addition, we provide\nmultiple test datasets with controlled and known covariate shifts.\n\n3 Classi\ufb01cation Baselines for COOS-7\n\n3.1 Baselines from a wide range of classi\ufb01ers\n\nTo provide baselines for out-of-sample generalization on COOS-7, we extracted features and built\nclassi\ufb01ers using a variety of methods commonly used for microscopy images, both classic and\nstate-of-the-art. 
Unless otherwise stated, we followed practices outlined in previous work.\nFirst, as our classic computer vision baseline, we extracted Haralick texture features (abbreviated\nas Texture), which are popular for microscopy image analysis due to their rotation invariance [21].\n\n4\n\n\fWe rescaled the intensity of each image to the range [0, 1], and extracted texture features from the\n\ufb01rst channel at 5 scales. In addition, we extracted features representing the mean, sum, and standard\ndeviation of intensity of pixels in the \ufb01rst channel, and the correlation between the \ufb01rst and second\nchannels.\nSecond, we trained a fully supervised 11-layer CNN, DeepLoc, which has achieved state-of-the-art\nresults in classifying protein localization for 64x64 images of yeast cells [22]. We followed all\npreprocessing, parameterization, and model selection practices by the authors. We trained DeepLoc\nfor 10,000 iterations with a batch size of 128 on 80% of the Training dataset, and chose the iteration\n(at intervals of 500) with the best performance on the remaining 20% (strati\ufb01ed to preserve percentage\nof samples per class). We report end-to-end performance, and we also extracted features from the last\nfully-connected layer of our trained model for building further classi\ufb01ers.\nThird, we extracted features from pretrained CNNs on ImageNet (abbreviated as VGG16), which\nhas been shown to outperform classic unsupervised feature representation methods for cancer cell\nmorphology [23]. We used a pre-trained VGG16 model, as this was the best-performing previously-\nreported model that would accept 64x64 images. We converted and rescaled the \ufb01rst channel of our\nimages to 8-bit RGB. Contrary to the results of Pawlowski et al., we observed that including features\nfrom the second channel decreased performance, so we did not include models with these features\nin our \ufb01nal baselines. 
We extracted features from all layers (max-pooling convolutional layers), but\nonly report benchmarks for the top 3 overall performing layers.\nFourth, we extracted features learned from a self-supervised method designed for microscopy images\n(abbreviated as PCI), which has achieved unsupervised state-of-the-art results in classifying protein\nlocalization for 64x64 images of yeast cells and human cells [24]. We trained the model unsupervised\non the Training dataset exclusively, following practices by the authors. We extracted features from all\nlayers of the source cell encoder of this model (max-pooling convolutional layers), but only report\nbenchmarks for the top 3 overall performing layers.\nFor each feature set, we built three classi\ufb01ers on the extracted features for the Training dataset\nexclusively: a k-nearest neighbor classi\ufb01er (k = 11), a L1 Logistic Regression classi\ufb01er, and a\nRandom Forest classi\ufb01er. For all models, we centered and scaled features with the mean and standard\ndeviation of the Training dataset. To optimize our Random Forest classi\ufb01ers, we conducted a random\nsearch (100 samples) over a parameter grid (n_estimators = {20, 40, 60, 80, 120, 140, 160, 180, 200},\nmax_features = {\u2019log2\u2019, \u2019sqrt\u2019}, max_depth = {1, 13, 25, 37, 50, \u2019None\u2019}, min_samples = {2, 5, 10,\n20, 40}, min_samples_leaf = {1, 2, 4, 8, 16}), and selected the classi\ufb01er with the best performance\non a 5-fold cross-validation of the Training dataset. All classi\ufb01ers were implemented in Python with\nScikit-Learn [25].\nWe report the performance of all classi\ufb01ers on all datasets in Table 3. We report the balanced\nclassi\ufb01cation error, to control for differences in class balance from dataset to dataset. 
The best results\nfor each feature representation method on each test dataset are bolded.\n\n3.2 All classi\ufb01ers drop in performance on out-of-sample data with larger covariate shifts\n\nAll methods we tried performed well on the test datasets most similar to the Training dataset, Test1\nand Test2. Features from our deep learning models had as little as 1.1% error on these datasets, but\neven logistic regression classi\ufb01ers built on classic computer vision features achieved 6.8% error or\nlower. However, when attempting to generalize to the test datasets with larger covariate shifts, Test3\nand Test4, all classi\ufb01ers had large drops in performance (although the fully-supervised CNN achieved\nthe lowest error on these datasets.)\nWe observed that all classi\ufb01ers failed to generalize regardless of the complexity of the classi\ufb01cation\nmodel. The texture features bene\ufb01ted more from classi\ufb01cation models that weigh features (e.g.\nlogistic regression versus kNN) compared to the models that learn features speci\ufb01c to a dataset\n(DeepLoc, PCI). Otherwise, we saw little difference in performance or generalization capacity, most\nexempli\ufb01ed by our DeepLoc results: we achieved similar error with a kNN classi\ufb01er on the last layer\u2019s\nfeatures as we did with the CNN\u2019s fully-connected classi\ufb01er. 
These results suggest that performance\nand generalization are bounded by the quality of the representation, not by the complexity of the\nclassification model.\n\nTable 3: Class-Balanced Error (%) of Classification Models on COOS-7 Datasets\n\nFeatures | Model | Train | Test1 | Test2 | Test3 | Test4\nDeepLoc | End-to-End | 1.2 | 1.2 | 1.5 | 7.4 | 5.4\nDeepLoc (FC2) | kNN | 1.1 | 1.3 | 1.5 | 6.9 | 4.7\nDeepLoc (FC2) | L1 LR | 1.1 | 1.1 | 1.4 | 7.7 | 4.1\nDeepLoc (FC2) | RF | 0.0 | 1.1 | 1.4 | 6.8 | 5.0\nTexture | kNN | 10.4 | 11.8 | 11.2 | 17.6 | 25.6\nTexture | L1 LR | 6.4 | 6.8 | 6.5 | 12.0 | 12.1\nTexture | RF | 0.0 | 7.3 | 7.1 | 16.4 | 17.1\nPCI Conv3 | kNN | 2.2 | 2.4 | 2.7 | 8.9 | 8.0\nPCI Conv3 | L1 LR | 1.0 | 1.4 | 1.7 | 9.2 | 7.4\nPCI Conv3 | RF | 0.1 | 2.1 | 2.2 | 11.0 | 8.6\nPCI Conv4 | kNN | 2.1 | 2.4 | 2.5 | 10.1 | 7.8\nPCI Conv4 | L1 LR | 1.6 | 1.5 | 1.9 | 8.7 | 6.0\nPCI Conv4 | RF | 0.1 | 2.5 | 2.7 | 10.7 | 7.5\nPCI Conv5 | kNN | 2.6 | 2.7 | 2.9 | 12.1 | 8.9\nPCI Conv5 | L1 LR | 2.5 | 2.5 | 2.6 | 11.4 | 5.7\nPCI Conv5 | RF | 0.0 | 2.5 | 2.7 | 10.8 | 7.4\nVGG16 Conv3_3 | kNN | 6.8 | 7.9 | 8.2 | 12.4 | 10.0\nVGG16 Conv3_3 | L1 LR | 5.7 | 6.6 | 6.9 | 11.3 | 9.3\nVGG16 Conv3_3 | RF | 0.2 | 9.5 | 9.2 | 15.9 | 10.9\nVGG16 Conv4_1 | kNN | 6.5 | 7.3 | 7.6 | 9.3 | 8.4\nVGG16 Conv4_1 | L1 LR | 3.1 | 4.2 | 4.1 | 8.2 | 6.7\nVGG16 Conv4_1 | RF | 0.1 | 7.5 | 7.3 | 11.3 | 7.8\nVGG16 Conv4_2 | kNN | 6.6 | 7.8 | 7.8 | 9.1 | 8.4\nVGG16 Conv4_2 | L1 LR | 2.8 | 3.9 | 3.9 | 8.0 | 6.8\nVGG16 Conv4_2 | RF | 0.2 | 7.4 | 7.5 | 10.2 | 8.4\n\n3.3 Confusion matrices reveal non-uniform errors\n\nNext, to understand which specific classes our classifiers were failing to generalize on, we examined\nthe confusion matrices for some classifiers on various test datasets (Supplementary Tables 1-9). 
We\nobserved that covariate shifts sometimes have non-uniform effects on classification performance:\nin some cases, the majority of classes were predicted with very little error, with only a few classes\nsharply decreasing in performance.\nAcross the classifiers we examined, we observed that the two most common errors were classifying\nthe early endosome as the ER or the Golgi, or the Golgi as the IMM or the peroxisomes. These\nerrors were between the more visually similar classes in COOS-7; in contrast, the classes that were\ndistinct from any other class in our dataset, such as the cytosol or the nuclear envelope, were generally\nclassified well by all classifiers, across all datasets.\nAs an example of a case where errors were predominantly concentrated in one class, we observed\nthat for the DeepLoc classifiers on Test4, while every other class was classified with > 0.97 sensitivity,\nthe early endosome class was classified with only 0.684 sensitivity, compared to 0.989 sensitivity in\nTest1.\nThe non-uniform effects of covariate shifts observed here suggest that overall metrics may not always\nadequately describe how classifiers fail to generalize on out-of-sample data. Here, these effects are\ndetectable due to the small number of classes, but for classification problems with many classes (such\nas ImageNet), a large drop in performance on only a few classes may not be detectable from metrics\nlike the overall classification error.\n\n3.4 Comparing errors between test datasets reveals variable effects of covariate shifts\n\nIn comparing confusion matrices for the same models between datasets, we observed that the types of\nerrors classifiers made differed between datasets. 
For example, while all classi\ufb01ers we examined had\nlower sensitivity on the early endosome class in both Test3 and Test4, classi\ufb01ers mostly confused the\nearly endosome with the ER in Test3, and the early endosome with the Golgi in Test4. Qualitatively,\nwe noticed differences in the typical appearance of the early endosome class in our test datasets\n(as shown in the examples in Table 1), possibly due to systematic differences in morphology or\nmicroscope illumination.\nWe also observed that some models were robust to errors in the same classes in one dataset, but\nnot in another. For example, the logistic regression classi\ufb01er built on the VGG16 features had\nlower sensitivity on the Golgi class in both Test3 (0.848) and Test4 (0.866). In contrast, the logistic\nregression classi\ufb01er built on the self-supervised (PCI) features had lower sensitivity in Test3 (0.749),\nbut not in Test4 (0.984).\nThese results suggest that the exact nature of covariate shifts can differ in different out-of-sample\ndatasets. New out-of-sample datasets may challenge classi\ufb01ers in unpredictable ways, inducing errors\nnot seen in previous out-of-sample datasets. Thus, validation on a single out-of-sample dataset may\nnot be suf\ufb01cient to conclude that a classi\ufb01cation model is robust in general.\n\n4 Conclusion\n\nWe released a new public dataset, COOS-7, speci\ufb01cally designed to test the generalization capacity\nof image classi\ufb01ers under covariate shift. 
We demonstrated the challenge of generalizing image\nclassi\ufb01ers to out-of-sample data: no current state-of-the-art technique was able to fully compensate\nfor covariate shifts in the datasets most different from the training data.\nOur baselines highlight challenges in measuring out-of-sample generalization under covariate shift:\nwe showed that covariate shifts will have non-uniform effects on the class-speci\ufb01c performance of\nclassi\ufb01ers, and that the nature of covariate shifts can differ from dataset to dataset. These observations\nhave implications for how rigorous we need to be in validating the performance of machine learning\nmodels before deploying them into real-life applications where data distributions are not stable:\noverall classi\ufb01cation metrics may understate the true effects of covariate shifts and good performance\non a single out-of-sample dataset may not con\ufb01rm that the model is robust to all covariate shifts in\ngeneral.\nWe note that we intentionally designed COOS-7 to contain only visually distinct \ufb02uorescent proteins.\nWe carefully curated a subset of higher quality experiments from a larger dataset, and approximately\nclass-balanced the examples. These factors make COOS-7 particularly amenable to machine learning\nmethods, and easy for most classi\ufb01ers to achieve a high level of performance. Yet, even on this\ntoy example, covariate shifts greatly hamper out-of-sample generalization. It is unknown if these\ncovariate shifts will be exacerbated in a more realistic biological imaging setting, especially with the\ninclusion of even more visually similar classes. 
We plan to examine this problem by releasing further\ndatasets in the future, which will include more ambiguous classes and a greater range of experimental\nvariability.\nFinally, while we focused on the value of COOS-7 for methods development in this manuscript,\nfinding methods that improve the out-of-sample generalization baselines in this work will have\npractical implications for biologists working with microscopy images. Classifying protein localization\nis a major problem in cell biology, as a protein\u2019s localization is strongly related to its function [26].\nAmong other applications, accurately classifying protein localization in cells under stress or drug\ntreatments can lead to identification of the key proteins that drive disease [27]. Since these classifiers\nare usually intended to automate the labeling of new data, out-of-sample generalization is essential.\nHere, we structured our dataset to be a protein localization classification problem: although proteins\ncan have the same localization, we specifically chose proteins that were distinct examples of different\nprotein localizations. We therefore expect improvements to our baselines to yield methods for more\nrobust and generalizable prediction of protein localization, meaning that methodological work on this\ndataset will contribute to a more robust and generalizable understanding of protein biology.\n\n5 Acknowledgements\n\nAlex X. Lu is funded by a pre-doctoral award from NSERC. Amy X. Lu is funded by a Master\u2019s\naward from NSERC. Alan M. Moses holds a Tier II Canada Research Chair. Wiebke Schormann and\nDavid W. Andrews are funded by CIHR Foundation Grant FDN143312. David W. Andrews holds\na Tier 1 Canada Research Chair in Membrane Biogenesis. Marzyeh Ghassemi is funded in part by\nMicrosoft Research, a CIFAR AI Chair at the Vector Institute, a Canada Research Council Chair, and\nan NSERC Discovery Grant. 
This work was partially performed on a GPU donated by Nvidia.\n\nReferences\n[1] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet Classi\ufb01ers\n\nGeneralize to ImageNet? ICML, May 2019.\n\n[2] John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl\nOermann. Variable generalization performance of a deep learning model to detect pneumonia in chest\nradiographs: A cross-sectional study. PLOS Medicine, 15(11):e1002683, Nov 2018.\n\n[3] Mark-Anthony Bray, Sigrun M Gustafsdottir, Mohammad H Rohban, Shantanu Singh, Vebjorn Ljosa,\nKatherine L Sokolnicki, Joshua A Bittker, Nicole E Bodycombe, Vlado Danc\u00edk, Thomas P Hasaka, Cindy S\nHon, Melissa M Kemp, Kejie Li, Deepika Walpita, Mathias J Wawer, Todd R Golub, Stuart L Schreiber,\nPaul A Clemons, Alykhan F Shamji, and Anne E Carpenter. A dataset of images and morphological\npro\ufb01les of 30 000 small-molecule treatments using the Cell Painting assay. GigaScience, 6(12):1\u20135, 2017.\n[4] D. Michael Ando, Cory McLean, and Marc Berndl. Improving Phenotypic Measurements in High-Content\n\nImaging Screens. bioRxiv, page 161422, Jul 2017.\n\n[5] Wilson Wen Bin Goh, Wei Wang, and Limsoon Wong. Why Batch Effects Matter in Omics Data, and How\n\nto Avoid Them. Trends in Biotechnology, 35(6):498\u2013507, jun 2017.\n\n[6] Jiayu Chen, Jingyu Liu, Vince D Calhoun, Alejandro Arias-Vasquez, Marcel P Zwiers, Cota Navin Gupta,\nBarbara Franke, and Jessica A Turner. Exploration of scanning effects in multi-site structural MRI studies.\nJournal of neuroscience methods, 230:37\u201350, jun 2014.\n\n[7] Kenneth Jung and Nigam H. Shah. Implications of non-stationarity on predictive modeling using EHRs.\n\nJournal of Biomedical Informatics, 58:168\u2013174, Dec 2015.\n\n[8] Bret Nestor, Matthew B. A. McDermott, Willie Boag, Gabriela Berner, Tristan Naumann, Michael C.\nHughes, Anna Goldenberg, and Marzyeh Ghassemi. 
Feature Robustness in Non-stationary Health Records:\nCaveats to Deployable Model Performance in Common Clinical Machine Learning Tasks. arXiv Preprint,\nAug 2019.\n\n[9] Arjun Raj and Alexander van Oudenaarden. Nature, Nurture, or Chance: Stochastic Gene Expression and\n\nIts Consequences. Cell, 135(2):216\u2013226, Oct 2008.\n\n[10] Berend Snijder and Lucas Pelkmans. Origins of regulated cell-to-cell variability. Nature Reviews Molecular\n\nCell Biology, 12(2):119\u2013125, Feb 2011.\n\n[11] L Shamir. Assessing the ef\ufb01cacy of low-level image content descriptors for computer-based \ufb02uorescence\n\nmicroscopy image analysis. Journal of microscopy, 243(3):284\u201392, Sep 2011.\n\n[12] S Singh, M-A Bray, T R Jones, and A E Carpenter. Pipeline for illumination correction of images for\n\nhigh-throughput microscopy. Journal of microscopy, 256(3):231\u20136, Dec 2014.\n\n[13] Tony J. Collins, Jarkko Ylanko, Fei Geng, and David W. Andrews. A Versatile Cell Death Screening Assay\nUsing Dye-Stained Cells and Multivariate Image Analysis. Assay and Drug Development Technologies,\n13(9):547, nov 2015.\n\n[14] Alex X Lu, Taraneh Zarin, Ian S Hsu, and Alan M Moses. YeastSpotter: Accurate and parameter-free web\n\nsegmentation for microscopy images of yeast cells. Bioinformatics, May 2019.\n\n[15] Juan C Caicedo, Sam Cooper, Florian Heigwer, Scott Warchal, Peng Qiu, Csaba Molnar, Aliaksei S\nVasilevich, Joseph D Barry, Harmanjit Singh Bansal, Oren Kraus, Mathias Wawer, Lassi Paavolainen,\nMarkus D Herrmann, Mohammad Rohban, Jane Hung, Holger Hennig, John Concannon, Ian Smith,\nPaul A Clemons, Shantanu Singh, Paul Rees, Peter Horvath, Roger G Linington, and Anne E Carpenter.\nData-analysis strategies for image-based cell pro\ufb01ling. Nature Methods, 14(9):849\u2013863, Aug 2017.\n\n[16] Sonal Kothari, John H Phan, Todd H Stokes, Adeboye O Osunkoya, Andrew N Young, and May D Wang.\nRemoving batch effects from histopathological images for enhanced cancer diagnosis. 
IEEE Journal of Biomedical and Health Informatics, 18(3):765\u201372, May 2014.\n\n[17] Gil Tabak, Minjie Fan, Samuel J. Yang, Stephan Hoyer, and Geoff Davis. Correcting Nuisance Variation using Wasserstein Distance. arXiv, Nov 2017.\n\n[18] Alice Schoenauer-Sebag, Louise Heinrich, Marc Schoenauer, Michele Sebag, Lani F. Wu, and Steve J. Altschuler. Multi-Domain Adversarial Learning. ICLR 2019, Mar 2019.\n\n[19] Judice L Y Koh, Yolanda T Chong, Helena Friesen, Alan Moses, Charles Boone, Brenda J Andrews, and Jason Moffat. CYCLoPs: A Comprehensive Database Constructed from Automated Analysis of Protein Abundance and Subcellular Localization Patterns in Saccharomyces cerevisiae. G3 (Bethesda, Md.), 5(6):1223\u201332, Jun 2015.\n\n[20] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135\u2013153, Oct 2018.\n\n[21] Ben T Grys, Dara S Lo, Nil Sahin, Oren Z Kraus, Quaid Morris, Charles Boone, and Brenda J Andrews. Machine learning and computer vision approaches for phenotypic profiling. The Journal of Cell Biology, 216(1):65\u201371, Jan 2017.\n\n[22] Oren Z Kraus, Ben T Grys, Jimmy Ba, Yolanda Chong, Brendan J Frey, Charles Boone, and Brenda J Andrews. Automated analysis of high-content microscopy data with deep learning. Molecular Systems Biology, 13(4), Apr 2017.\n\n[23] Nick Pawlowski, Juan C Caicedo, Shantanu Singh, Anne E Carpenter, and Amos Storkey. Automating Morphological Profiling with Generic Deep Convolutional Networks. bioRxiv, page 085118, Nov 2016.\n\n[24] Alex X. Lu, Oren Z. Kraus, Sam Cooper, and Alan M. Moses. Learning unsupervised feature representations for single cell microscopy images with paired cell inpainting. PLOS Computational Biology, 15(9):e1007348, Sep 2019.\n\n[25] Fabian Pedregosa, Ga\u00ebl Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 
Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825\u20132830, 2011.\n\n[26] Ying-Ying Xu, Li-Xiu Yao, and Hong-Bin Shen. Bioimage-based protein subcellular location prediction: a comprehensive review. Frontiers of Computer Science, 12(1):26\u201339, Feb 2018.\n\n[27] Mien-Chie Hung and Wolfgang Link. Protein localization in disease and therapy. Journal of Cell Science, 124(Pt 20):3381\u201392, Oct 2011.", "award": [], "sourceid": 1069, "authors": [{"given_name": "Alex", "family_name": "Lu", "institution": "University of Toronto"}, {"given_name": "Amy", "family_name": "Lu", "institution": "University of Toronto/Vector Institute"}, {"given_name": "Wiebke", "family_name": "Schormann", "institution": "Sunnybrook Research Institute"}, {"given_name": "Marzyeh", "family_name": "Ghassemi", "institution": "University of Toronto, Vector Institute"}, {"given_name": "David", "family_name": "Andrews", "institution": "Sunnybrook Research Institute"}, {"given_name": "Alan", "family_name": "Moses", "institution": "University of Toronto"}]}