NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
---- Summary ----

The authors contribute a dataset of microscopy images to benchmark image classifier generalization. The dataset contains tens of thousands of 64x64-pixel images of cells. Each image shows cells labelled with a fluorescent protein; in total, seven different fluorescent proteins were employed. The authors consider the image analysis task of classifying images w.r.t. these seven fluorescent labels. The data is divided into a number of subsets according to properties that are hypothesized to yield significant covariate shifts, such as acquisition day, acquisition site and microscope, and the position of the cells within the imaged grid of wells on a plate (center vs. fringes). The authors quantitatively evaluate classical and state-of-the-art methods in terms of generalization from a training set (drawn from all subsets) to the individual subsets. They show that the performance of all evaluated methods drops considerably for all kinds of transfer. Furthermore, they show that a drop in performance when transferring to one subset is not necessarily indicative of the performance on another subset.

---- Comments ----

The paper is easy to follow, as it is clearly written and well-organized. The authors contribute a novel, large dataset of images and benchmark known methods for image classification. The dataset appears to be immensely useful for method development and benchmarking of transfer learning techniques.

That said, it did not become fully clear to me what the added benefit of the proposed dataset is compared to the cited related work from biology as well as computer vision. More images? Known reasons for covariate shift? Ready availability as a benchmark?

The authors exclusively consider the case of training on a training set (supervised or unsupervised) and then evaluating on their test sets. They do not evaluate any method for unsupervised transfer learning on the test sets. There are probably good reasons for limiting the evaluation this way; a discussion of these reasons would improve clarity.

Furthermore, the relevance of the considered classification task for applications in biology is not discussed. I am not aware of an application where the fluorescent stain used for sample prep is unknown after imaging and has to be classified automatically. Is the dataset intended purely for technical method development, or is there a related application in biology? Please discuss.

--- Post Author Feedback ---

The author feedback clarified that the biological application they consider is protein *localization*, not protein classification. The latter would be a toy task because the labelled proteins would be known a priori. The proteins used in the proposed dataset have known localizations, and hence in this case the classification task coincides with the localization task. If I understand correctly, the idea is that proteins different from the seven used in their dataset, with varying localizations indicative of cell function and stress, would be labelled in "real" application data. Is protein localization always unique in these cases? Isn't transfer to such "real" data potentially much harder than transfer within the different test sets of the studied data? In summary, I still don't fully understand to what extent the presented data and classification task are a "toy task" w.r.t. the targeted application of protein localization.
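To make the evaluation protocol summarized above concrete, the following is a minimal sketch of training a classifier on pooled data and measuring per-subset accuracy under covariate shift. The feature generator, subset names, and shift magnitudes are synthetic stand-ins, not the paper's actual data or pipeline, and the scikit-learn classifier is only a placeholder for the evaluated methods.

```python
# Minimal sketch of the per-subset evaluation protocol described in the
# review. Arrays are synthetic stand-ins for features extracted from the
# 64x64 microscopy images; all names and shift values are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_classes = 7  # one class per fluorescent label

def make_split(n, shift=0.0):
    """Return synthetic (features, labels); `shift` mimics a covariate shift."""
    y = rng.integers(0, n_classes, size=n)
    X = rng.normal(loc=y[:, None] + shift, scale=1.0, size=(n, 16))
    return X, y

# Training set drawn from all acquisition conditions.
X_train, y_train = make_split(5000)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Held-out test subsets, each with a different hypothesized covariate shift
# (e.g. acquisition day, microscope, position on the plate).
test_subsets = {
    "same_conditions": make_split(1000, shift=0.0),
    "new_day":         make_split(1000, shift=0.3),
    "new_microscope":  make_split(1000, shift=0.6),
}

for name, (X_test, y_test) in test_subsets.items():
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```

Under this setup, accuracy degrades as the injected shift grows, mirroring the per-subset performance drops the authors report; it also illustrates why a drop on one subset need not predict the drop on another, since each subset's shift is independent.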
Reviewer 2
This paper presents a strong motivation by identifying issues with existing out-of-sample detection work, and then proposes a new dataset that helps to evaluate the quality of out-of-sample detection results. Based on this new dataset, the paper further conducts an empirical study of existing baselines and shows the challenges of solving the out-of-sample problem. The value of this work lies mainly in the creation of the new dataset, which helps to address out-of-sample detection problems.
Reviewer 3
This work provides an original, high-quality, and clearly explained contribution to the field of machine learning, in the form of a valuable dataset. I expect this dataset to have a significant impact on the field, perhaps becoming a new benchmark.