{"title": "Generalisation in humans and deep neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7538, "page_last": 7550, "abstract": "We compare the robustness of humans and current convolutional deep neural networks (DNNs) on object recognition under twelve different types of image degradations. First, using three well known DNNs (ResNet-152, VGG-19, GoogLeNet) we find the human visual system to be more robust to nearly all of the tested image manipulations, and we observe progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker. Secondly, we show that DNNs trained directly on distorted images consistently surpass human performance on the exact distortion types they were trained on, yet they display extremely poor generalisation abilities when tested on other distortion types. For example, training on salt-and-pepper noise does not imply robustness on uniform white noise and vice versa. Thus, changes in the noise distribution between training and testing constitutes a crucial challenge to deep learning vision systems that can be systematically addressed in a lifelong machine learning approach. Our new dataset consisting of 83K carefully measured human psychophysical trials provide a useful reference for lifelong robustness against image degradations set by the human visual system.", "full_text": "Generalisation in humans and deep neural networks\n\nRobert Geirhos1-3\u2217\u00a7\n\nCarlos R. Medina Temme1\u2217\n\nJonas Rauber2,3\u2217\n\nHeiko H. Sch\u00fctt1,4,5\n\nMatthias Bethge2,6,7\u2217\n\nFelix A. 
Wichmann1,2,6,8\u2217\n\n1Neural Information Processing Group, University of T\u00fcbingen\n2Centre for Integrative Neuroscience, University of T\u00fcbingen\n\n3International Max Planck Research School for Intelligent Systems\n\n4Graduate School of Neural and Behavioural Sciences, University of T\u00fcbingen\n\n5Department of Psychology, University of Potsdam\n\n6Bernstein Center for Computational Neuroscience T\u00fcbingen\n\n7Max Planck Institute for Biological Cybernetics\n\n8Max Planck Institute for Intelligent Systems\n\n\u2217Joint \ufb01rst / joint senior authors\n\n\u00a7To whom correspondence should be addressed: robert.geirhos@bethgelab.org\n\nAbstract\n\nWe compare the robustness of humans and current convolutional deep neural networks\n(DNNs) on object recognition under twelve different types of image degradations.\nFirst, using three well known DNNs (ResNet-152, VGG-19, GoogLeNet)\nwe \ufb01nd the human visual system to be more robust to nearly all of the tested image\nmanipulations, and we observe progressively diverging classi\ufb01cation error-patterns\nbetween humans and DNNs when the signal gets weaker. Secondly, we show that\nDNNs trained directly on distorted images consistently surpass human performance\non the exact distortion types they were trained on, yet they display extremely poor\ngeneralisation abilities when tested on other distortion types. For example, training\non salt-and-pepper noise does not imply robustness on uniform white noise and\nvice versa. Thus, changes in the noise distribution between training and testing\nconstitute a crucial challenge to deep learning vision systems that can be systematically\naddressed in a lifelong machine learning approach. 
Our new dataset\nconsisting of 83K carefully measured human psychophysical trials provides a useful\nreference for lifelong robustness against image degradations set by the human\nvisual system.\n\n1 Introduction\n\n1.1 Deep neural networks as models of human object recognition\n\nThe visual recognition of objects by humans in everyday life is rapid and seemingly effortless, as\nwell as largely independent of viewpoint and object orientation [1]. The rapid and primarily foveal\nrecognition during a single \ufb01xation has been termed core object recognition (see [2] for a review).\nWe know, for example, that it is possible to reliably identify objects in the central visual \ufb01eld within a\nsingle \ufb01xation in less than 200 ms when viewing \u201cstandard\u201d images [2\u20134]. Based on the rapidness of\nobject recognition, core object recognition is often thought to be achieved with mainly feedforward\nprocessing although feedback connections are ubiquitous in the primate brain. Object recognition\nin the primate brain is believed to be realised by the ventral visual pathway, a hierarchical structure\nconsisting of areas V1-V2-V4-IT, with information from the retina reaching the cortex in V1 (e.g.\n[5]).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n(a) Super-human performance\n\n(b) Super-human performance\n\n(c) Chance level performance\n\nFigure 1: Classi\ufb01cation performance of ResNet-50 trained from scratch on (potentially distorted)\nImageNet images. (a) Classi\ufb01cation performance when trained on standard colour images and tested\non colour images is close to perfect (better than human observers). (b) Likewise, when trained and\ntested on images with additive uniform noise, performance is super-human. (c) Striking generalisation\nfailure: When trained on images with salt-and-pepper noise and tested on images with uniform noise,\nperformance is at chance level, even though the two noise types do not seem very different to human\nobservers.\n\nUntil a few years ago, animate visual systems were the only ones known to be capable of broad-ranging\nvisual object recognition. This has changed, however, with the advent of brain-inspired deep\nneural networks (DNNs) which, after having been trained on millions of labeled images, achieve\nhuman-level performance when classifying objects in images of natural scenes [6]. DNNs are now\nemployed on a variety of tasks and set the new state-of-the-art, sometimes even surpassing human\nperformance on tasks which only a few years ago were thought to be beyond an algorithmic solution\nfor decades to come [7, 8]. Since DNNs and humans achieve similar accuracy, a number of studies\nhave started investigating similarities and differences between DNNs and human vision [9\u201324]. On\nthe one hand, the network units are an enormous simpli\ufb01cation given the sophisticated nature and\ndiversity of neurons in the brain [25]. On the other hand, often the strength of a model lies not\nin replicating the original system but rather in its ability to capture the important aspects while\nabstracting from details of the implementation (e.g. [26, 27]).\nOne of the most remarkable properties of the human visual system is its ability to generalise\nrobustly. Humans generalise across a wide variety of changes in the input distribution, such as across\ndifferent illumination conditions and weather types. For instance, human object recognition is largely\nunimpaired even if there are rain drops or snow \ufb02akes in front of an object. 
While humans are\ncertainly exposed to a large number of such changes during their preceding lifetime (i.e., at \u201ctraining\ntime\u201d, as we would say for DNNs), there seems to be something very generic about the way the\nhuman visual system is able to generalise that is not limited to the same distribution one was exposed\nto previously. Otherwise we would not be able to make sense of a scene if there was some sort of\n\u201cnew\u201d, previously unseen noise. Even if one never had a shower of confetti before, one is still able to\neffortlessly recognise objects at a carnival parade. Naturally, such generic, robust mechanisms are\nnot only desirable for animate visual systems but also for solving virtually any visual task that goes\nbeyond a well-con\ufb01ned setting where one knows the exact test distribution already at training time.\nDeep learning for autonomous driving may be one prominent example: one would like to achieve\nrobust classi\ufb01cation performance in the presence of confetti, despite not having had any confetti\nexposure during training time. Thus, from a machine learning perspective, general noise robustness\ncan be used as a highly relevant example of lifelong machine learning [28] requiring generalisation\nthat does not rely on the standard assumption of independent, identically distributed (i.i.d.) samples\nat test time.\n\n1.2 Comparing generalisation abilities\n\nGeneralisation in DNNs usually works surprisingly well: First of all, DNNs are able to learn\nsuf\ufb01ciently general features on the training distribution to achieve a high accuracy on the i.i.d. test\ndistribution despite having suf\ufb01cient capacity to completely memorise the training data [29], and\nconsiderable effort has been devoted to understanding this phenomenon (e.g. [30\u201332]).1 Secondly,\nfeatures learned on one task often transfer to only loosely related tasks, such as from classi\ufb01cation to\nsaliency prediction [33], emotion recognition [34], medical imaging [35] and a large number of other\ntransfer learning tasks [36]. However, transfer learning still requires a substantial amount of training\nbefore it works on the new task. Here, we focus on a third setting that adopts the lifelong machine\nlearning point of view of generalisation [37]: How well can a visual learning system cope with a\nnew image degradation after it has learned to cope with a certain set of image distortions before? As\na measure of object recognition robustness we can test the ability of a classi\ufb01er or visual system\nto tolerate changes in the input distribution up to a certain degree, i.e., to achieve high recognition\nperformance despite being evaluated on a test distribution that differs to some degree from the training\ndistribution (testing under realistic, non-i.i.d. conditions). Using this approach we measure how well\nDNNs and human observers cope with parametric image manipulations that gradually distort the\noriginal image.\nFirst, we assess how top-performing DNNs that are trained on ImageNet (GoogLeNet [38], VGG-19 [39]\nand ResNet-152 [40]) compare against human observers when tested on twelve different\ndistortions such as additive noise or phase noise (see Figure 2 for an overview); in other words,\nhow well they generalise towards previously unseen distortions.2 In a second set of experiments,\nwe train networks directly on distorted images to see how well they can in general cope with noisy\ninput, and how much training on distortions as a form of data augmentation helps in dealing with\nother distortions. Psychophysical investigations of human behaviour on object recognition tasks,\nmeasuring accuracies depending on image colour (greyscale vs. colour), image contrast and the\namount of additive visual noise have been powerful means of exploring the human visual system,\nrevealing much about the internal computations and mechanisms at work [42\u201348]. As a consequence,\nsimilar experiments might yield equally interesting insights into the functioning of DNNs, especially\nas a comparison to high-quality measurements of human behaviour. In particular, human data\nfor our experiments were obtained using a controlled lab environment (instead of e.g. Amazon\nMechanical Turk without suf\ufb01cient control over presentation times, display calibration, viewing\nangles, and sustained attention of participants). Our carefully measured behavioural datasets (twelve\nexperiments encompassing a total number of 82,880 psychophysical trials), as well as materials and\ncode, are available online at https://github.com/rgeirhos/generalisation-humans-DNNs.\n\n2 Methods\n\nWe here report the core elements of the employed paradigm, procedure, image manipulations, observers\nand DNNs; this is aimed at giving the reader just enough information to understand experiments\nand results. For in-depth explanations we kindly refer to the comprehensive supplementary material,\nwhich seeks to provide exhaustive and reproducible experimental details.\n\n2.1 Paradigm, procedure & 16-class-ImageNet\n\nFor this study, we developed an experimental paradigm aimed at comparing human observers\nand DNNs as fairly as possible by using a forced-choice image categorisation task.3 Achieving\na fair psychophysical comparison comes with a number of challenges: First of all, many high-performing\nDNNs are trained on the ILSVRC 2012 database [50] with 1,000 \ufb01ne-grained categories\n(e.g., over a hundred different dog breeds). If humans are asked to name objects, however,\nthey most naturally categorise them into so-called entry-level categories (e.g. dog rather than\nGerman shepherd). 
We thus developed a mapping from 16 entry-level categories such as dog,\ncar or chair to their corresponding ImageNet categories using the WordNet hierarchy [51]. We\nterm this dataset \u201c16-class-ImageNet\u201d since it groups a subset of ImageNet classes into 16 entry-level\ncategories (airplane, bicycle, boat, car, chair, dog, keyboard, oven, bear,\nbird, bottle, cat, clock, elephant, knife, truck). In every experiment, then, an image\nwas presented on a computer screen and observers had to choose the correct category by clicking\non one of these 16 categories. For pre-trained DNNs, the sum of all softmax values mapping to\na certain entry-level category was computed. The entry-level category with the highest sum was\nthen taken as the network\u2019s decision. A second challenge is the fact that standard DNNs only use\nfeedforward computations at inference time, while recurrent connections are ubiquitous in the human\nbrain [52, 53].4 In order to prevent this discrepancy from playing a major confounding role in our\nexperimental comparison, presentation time for human observers was limited to 200 ms. An image\nwas immediately followed by a 200 ms presentation of a noise mask with 1/f spectrum, known to\nminimise, as much as psychophysically possible, feedback in\ufb02uence in the brain.\n\n1Still, DNNs usually need orders of magnitude more training data in comparison to humans, as explored by\nthe literature on one-shot or few-shot learning (see e.g. [23] for an overview).\n2We have reported a subset of these experiments on arXiv in an earlier version of this paper [41].\n3This is the same paradigm as reported in [49].\n\nFigure 2: Example stimulus image of class bird across all distortion types. From left to right, image\nmanipulations are: colour (undistorted), greyscale, low contrast, high-pass, low-pass (blurring), phase\nnoise, power equalisation. Bottom row: opponent colour, rotation, Eidolon I, II and III, additive\nuniform noise, salt-and-pepper noise. Example stimulus images across all used distortion levels are\navailable in the supplementary material.\n\n2.2 Observers & pre-trained deep neural networks\n\nData from human observers were compared against classi\ufb01cation performance of three pre-trained\nDNNs: VGG-19 [39], GoogLeNet [38] and ResNet-152 [40]. For each of the twelve experiments that\nwere conducted, either \ufb01ve or six observers participated (with the exception of the colour experiment,\nfor which only three observers participated since similar experiments had already been performed by\na number of studies [48, 55, 56]). Observers reported normal or corrected-to-normal vision.\n\n2.3 Image manipulations\n\nA total of twelve experiments were performed in a well-controlled psychophysical lab setting. In\nevery experiment, a (possibly parametric) distortion was applied to a large number of images, such\nthat the signal strength ranged from \u2018no distortion / full signal\u2019 to \u2018distorted / weak(er) signal\u2019.\nWe then measured how classi\ufb01cation accuracy changed as a function of signal strength. Three of\nthe employed image manipulations were dichotomous (colour vs. greyscale, true vs. opponent\ncolour, original vs. equalised power spectrum); one manipulation had four different levels (0, 90,\n180 and 270 degrees of rotation); one had seven levels (0, 30, ..., 180 degrees of phase noise)\nand the other distortions had eight different levels. Those manipulations were: uniform noise,\ncontrolled by the \u2018width\u2019 parameter indicating the bounds of pixel-wise additive uniform noise;\nlow-pass \ufb01ltering and high-pass \ufb01ltering (with different standard deviations of a Gaussian \ufb01lter);\ncontrast reduction (contrast levels from 100% to 1%) as well as three different manipulations from\nthe eidolon toolbox [57]. 
The three eidolon experiments correspond to different versions of a\nparametric image manipulation, with the \u2018reach\u2019 parameter controlling the strength of the distortion.\nAdditionally, for experiments with training on distortions, we also evaluated performance on stimuli\nwith salt-and-pepper noise (controlled by parameter p indicating the probability of setting a pixel to\neither black or white; p \u2208 [0, 10, 20, 35, 50, 65, 80, 95]%). More information about the different\nimage manipulations is provided in the supplementary material (Section Image preprocessing and\ndistortions), where we also show example images across all manipulations and distortion levels\n(Figures 10, 11, 12, 13, 14). For a brief overview, Figure 2 depicts one exemplary manipulation\nper distortion. Overall, the manipulations we used were chosen to re\ufb02ect a large variety of possible\ndistortions.\n\n4But see e.g. [54] for a critical assessment of this argument.\n\nFigure 3 panels, each with an accuracy plot and an entropy plot: (a) Colour vs. greyscale; (b) True vs. false colour; (c) Uniform noise; (d) Low-pass; (e) Contrast; (f) High-pass; (g) Eidolon I; (h) Phase noise; (i) Eidolon II; (j) Power equalisation; (k) Eidolon III; (l) Rotation.\n\nFigure 3: Classi\ufb01cation accuracy and response distribution entropy for GoogLeNet, VGG-19 and\nResNet-152 as well as for human observers. \u2018Entropy\u2019 indicates the Shannon entropy of the response/decision\ndistribution (16 classes). Here it is a measure of bias towards certain categories: using\na test dataset that is balanced with respect to the number of images per category, responding equally\nfrequently with all 16 categories elicits the maximum possible entropy of four bits. If a network\nor observer prefers some categories over others, entropy decreases (down to zero bits in\nthe extreme case of responding with one particular category all the time, irrespective of the ground\ntruth category). 
Human \u2018error bars\u2019 indicate the full range of results across participants. Image\nmanipulations are explained in Section 2.3 and visualised in Figures 10, 11, 12, 13 and 14.\n\n[Figure 3: plots of classi\ufb01cation accuracy (0 to 1) and response distribution entropy (0 to 4 bits) for participants (avg.), GoogLeNet, VGG-19 and ResNet-152. Horizontal axes: colour condition (colour vs. greyscale, true vs. opponent), uniform noise width, \ufb01lter standard deviation (low-pass and high-pass), log10 of contrast in percent, log2 of the \u2018reach\u2019 parameter (Eidolon I to III), phase noise width [\u00b0], power spectrum (original vs. equalised) and rotation angle [\u00b0].]\n\n2.4 Training on distortions\n\nBeyond evaluating standard pre-trained DNNs on distortions (results reported in Figure 3), we\nalso trained networks directly on distortions (Figure 4). These networks were trained on 16-class-ImageNet,\na subset of the standard ImageNet dataset as described in Section 2.1. This reduced the size\nof the unperturbed training set to approximately one \ufb01fth. To correct for the highly imbalanced number\nof samples per class, we weighted each sample in the loss function with a weight proportional to one\nover the number of samples of the corresponding class. All networks trained in these experiments\nhad a ResNet-like architecture that differed from a standard ResNet-50 only in the number of output\nneurons that we reduced from 1000 to 16 to match the 16 entry-level classes of the dataset. Weights\nwere initialised with a truncated normal distribution with zero mean and a standard deviation of\n1/\u221an, where n is the number of output neurons in a layer. While training from scratch, we performed\non-the-\ufb02y data augmentation using different combinations of the image manipulations. When training\na network on multiple types of image manipulations (models B1 to B9 as well as C1 and C2 of\nFigure 4), the type of manipulation (including unperturbed, i.e. standard colour images if applicable)\nwas drawn uniformly and we only applied one manipulation at a time (i.e., the network never saw a\nsingle image perturbed with multiple image manipulations simultaneously, except that some image\nmanipulations did include other manipulations per construction: uniform noise, for example, was\nalways added after conversion to greyscale and contrast reduction to 30%). 
For a given image\nmanipulation, the amount of perturbation was drawn uniformly from the levels used during test\ntime (cf. Figure 3). The remaining aspects of the training followed standard training procedures for\ntraining a ResNet on ImageNet: we used SGD with a momentum of 0.997, a batch size of 64, and\nan initial learning rate of 0.025. The learning rate was multiplied with 0.1 after 30, 60, 80 and 90\nepochs (when training for 100 epochs) or 60, 120, 160 and 180 epochs (when training for 200 epochs).\nTraining was done using TensorFlow 1.6.0 [58]. In the training experiments, all manipulations with\nmore than two levels were included except for the eidolon stimuli, since the generation of those\nstimuli is computationally too slow for ImageNet training. For comparison purposes, we additionally\nincluded colour vs. greyscale as well as salt-and-pepper noise (for which there is no human data, but\ninformal comparisons between uniform noise and salt-and-pepper noise strongly suggest that human\nperformance will be similar, see Figure 1c).\n\n3 Generalisation of humans and pre-trained DNNs towards distortions\n\nIn order to assess generalisation performance when the signal gets weaker, we tested twelve different\nways of degrading images. These images at various levels of signal strength were then shown to\nboth human observers in a lab and to pre-trained DNNs (ResNet-152, GoogLeNet and VGG-19)\nfor classi\ufb01cation. The results of this comparison are visualised in Figure 3. 
While human and\nDNN performance was similar for comparatively minor colour-related distortions such as conversion\nto greyscale or opponent colours, we \ufb01nd human observers to be more robust for all of the other\ndistortions: by a small margin for low contrast, power equalisation and phase noise images and by\na larger margin for uniform noise, low-pass, high-pass, rotation and all three eidolon experiments.\nFurthermore, there are strong differences in the error patterns as measured by the response distribution\nentropy (indicating biases towards certain categories). Human participants\u2019 responses were distributed\nmore or less equally amongst the 16 classes, whereas all three DNNs show increasing biases towards\ncertain categories when the signal gets weaker. These biases are not completely explained by the\nprior class probabilities, and differ from distortion to distortion. For instance, ResNet-152 almost\nsolely predicts class bottle for images with strong uniform noise (irrespective of the ground truth\ncategory),5 and classes dog or bird for images distorted by phase noise. One might think of\nsimple tricks to reduce the discrepancy between the response distribution entropy of DNNs and\nhumans. One possible way would be increasing the softmax temperature parameter and assuming that\nmodel decisions are sampled from the softmax distribution rather than taking the argmax. However,\nincreasing the DNN response distribution entropy in this way dramatically decreases classi\ufb01cation\naccuracy and thus comes with a trade-off (cf. Figure 8 in the supplementary material).\n\n5A category-level analysis of decision biases for the uniform noise experiment is provided in the supplementary\nmaterial, Figure 9.\n\nFigure 4: Classi\ufb01cation accuracy (in percent) for networks with potentially distorted training data.\nRows show different test conditions at an intermediate dif\ufb01culty (exact condition indicated in brackets,\nunits as in Figure 3). Columns correspond to differently trained networks (leftmost column: human\nobservers for comparison; no human data available for salt-and-pepper noise). All of the networks\nwere trained from scratch on (a potentially manipulated version of) 16-class-ImageNet. Manipulations\nincluded in the training data are indicated by a red rectangle; additionally \u2018greyscale\u2019 is underlined\nif it was part of the training data because a certain distortion encompasses greyscale images at full\ncontrast. Models A1 to A9: ResNet-50 trained on a single distortion (100 epochs). Models B1 to B9:\nResNet-50 trained on uniform noise plus one other distortion (200 epochs). Models C1 & C2:\nResNet-50 trained on all but one distortion (200 epochs). Chance performance is at 1/16 = 6.25%\naccuracy.\n\nThese results are in line with previous \ufb01ndings reporting human-like processing of chromatic information\nin DNNs [19] but strong decreases in DNN recognition accuracy for image degradations like\nnoise and blur [13, 14, 59\u201361]. Overall, DNNs seem to have many more problems generalising to\nweaker signals than humans, across a wide variety of image distortions. While the human visual\nsystem has been exposed to a number of distortions during evolution and lifetime, we clearly had no\nexposure whatsoever to many of the exact image manipulations that we tested here. Thus, our human\ndata show that a high level of generalisation is, in principle, possible. 
There may be many different\nreasons for the discrepancy between human and DNN generalisation performance that we \ufb01nd: Are\nthere limitations in terms of the currently used network architectures (as hypothesised by [60]), which\nmay be inferior to the human brain\u2019s intricate computations? Is it a problem of the training data (as\nsuggested by e.g. [61]), or are today\u2019s training methods / optimisers not suf\ufb01cient to solve robust and\ngeneral object recognition? In order to shed light on the dissimilarities we found, we performed a\nsecond batch of experiments by training networks directly on distorted images.\n\n4 Training DNNs directly on distorted images\n\nWe trained one network per distortion directly and from scratch on (potentially manipulated) 16-class-\nImageNet images. The results of this training are visualised in Figure 4 (models A1 to A9). We \ufb01nd\nthat these specialised networks consistently outperformed human observers, by a large margin, on\nthe image manipulation they were trained on (as indicated by strong network performance on the\ndiagonal). This is a strong indication that currently employed architectures (such as ResNet-50) and\ntraining methods (standard optimiser and training procedure) are suf\ufb01cient to \u2018solve\u2019 distortions under\ni.i.d. train/test conditions. We were able to not only close the human-DNN performance gap that was\nobserved by [13] (who \ufb01ne-tuned networks on distortions, reporting improved but not human-level\nDNN performance) but to surpass human performance in this respect. While the human visual system\ncertainly has a much more complicated structure [24], this does not seem to be necessary to deal with\neven strong image manipulations of the type employed here.\nHowever, as noted earlier, robust generalisation is primarily not about solving a speci\ufb01c problem\nknown exactly in advance. 
We therefore tested how networks trained on a certain distortion type\nperform when tested on other distortions. These results are visualised in Figure 4 by the off-diagonal\ncells of models A1 to A9.\n\n[Figure 4 heatmap, accuracy values in percent. Evaluation conditions: uniform noise (0.35), salt-and-pepper noise (0.2), rotation (90\u00b0), phase noise (90\u00b0), high-pass (std=0.7), low-pass (std=7), contrast (5%), greyscale, colour. Models: human observers, A1 to A9, B1 to B9, C1, C2.]\n\nOverall, we \ufb01nd that training on a certain distortion slightly improves\nperformance on other distortions in a few instances, but is detrimental in other cases (when compared\nto a vanilla ResNet-50 trained on colour images, model A1 in the \ufb01gure).6 Performance on salt-and-pepper\nnoise as well as uniform noise was close to chance level for all networks not trained on that noise type, even for a network\ntrained directly on the respective other noise type. This may be surprising given that these two\ntypes of noise do not seem very different to a human eye (as indicated in Figure 1c). 
Hence, training\na network on one distortion does not generally lead to improvements on other distortions.\nSince training on a single distortion alone does not seem to be suf\ufb01cient to evoke robust generalisation\nperformance in DNNs, we also trained the same architecture (ResNet-50) on two additional settings.\nModels B1 to B9 in Figure 4 show performance for training on one particular distortion in combination\nwith uniform noise (training consisted of 50% images from each manipulation). Uniform noise was\nchosen since it seemed to be one of the hardest distortions for all networks, and hence they might\nbene\ufb01t from including this particular distortion in the training data. Furthermore, we trained models\nC1 and C2 on all but one distortion (either uniform or salt-and-pepper noise was left out).\nWe \ufb01nd that object recognition performance of models B1 to B9 is improved compared to models A1\nto A9, both on the distortions they were actually trained on (diagonal entries with red rectangles in\nFigure 4) as well as on a few of the distortions that were not part of the training data. However, this\nimprovement may be largely due to the fact that models B1 to B9 were trained on 200 epochs instead\nof 100 epochs as for models A1 to A9, since the accuracy of model B9 (trained & tested on uniform\nnoise, 200 epochs) also shows an improvement towards model A9 (trained & tested on uniform\nnoise, 100 epochs). Hence, in the presence of heavy distortions, training longer may go a long way\nbut incorporating other distortions in the training does not seem to be generally bene\ufb01cial to model\nperformance. Furthermore, we \ufb01nd that it is possible even for a single model to reach high accuracies\non all of the eight distortions it was trained on (models C1 & C2), however for both left-out uniform\nand salt-and-pepper noise, object recognition accuracy stayed around 11 to 14%, which is by far\ncloser to chance level (approx. 
6%) than to the accuracy reached by a specialised network trained on\nthis exact distortion (above 70%, serving as a lower bound on the achievable performance).\nTaken together, these \ufb01ndings indicate that data augmentation with distortions alone may be insuf\ufb01-\ncient to overcome the generalisation problem that we \ufb01nd. It may be necessary to move from asking\n\u201cwhy are DNNs generalising so well (under i.i.d. settings)?\u201d [29] to \u201cwhy are DNNs generalising\nso poorly (under non-i.i.d. settings)?\u201d. It is up to future investigations to determine how DNNs that\nare currently being handled as computational models of human object recognition can solve this\nchallenge. At the exciting interface between cognitive science / visual perception and deep learning,\ninspiration and ideas may come from both \ufb01elds: While the computer vision sub-area of domain\nadaptation (see [63] for a review) is working on robust machine inference in spite of shifts in the\ninput distribution, the human vision community is accumulating strong evidence for the bene\ufb01ts of\nlocal gain control mechanisms. These normalisation processes seem to be crucial for many aspects\nof robust animal and human vision [46], are predictive for human vision data [21, 64] and have\nproven useful in the context of computer vision [65, 66]. It could be an interesting avenue for future\nresearch to determine whether there is a connection between neural normalisation processes and\nDNN generalisation performance. Furthermore, incorporating a shape bias in DNNs seems to be a\nvery promising avenue towards general noise robustness, strongly improving performance on many\ndistortions [67].\n\n5 Conclusion\n\nWe conducted a behavioural comparison of human and DNN object recognition robustness against\ntwelve different image distortions. 
In comparison to human observers, we \ufb01nd the classi\ufb01cation\nperformance of three well-known DNNs trained on ImageNet (ResNet-152, GoogLeNet and\nVGG-19) to decline rapidly with decreasing signal-to-noise ratio under image distortions. Additionally, we\n\ufb01nd progressively diverging patterns of classi\ufb01cation errors between humans and DNNs with weaker\nsignals. Our results, based on 82,880 psychophysical trials under well-controlled lab conditions,\ndemonstrate that there are still marked differences in the way humans and current DNNs process\nobject information.\n\n[Footnote 6: The no free lunch theorem [62] states that better performance on some input is necessarily accompanied\nby worse performance on other input; however we here are only interested in a narrow subset of the possible\ninput space (natural images corrupted by distortions). The high accuracies of human observers across distortions\nindicate that it is, in principle, possible to achieve good performance on many distortions simultaneously.]\n\nThese differences, in our setting, cannot be overcome by training on distorted\nimages (i.e., data augmentation): While DNNs cope perfectly well with the exact distortion they\nwere trained on, they still show a strong generalisation failure towards previously unseen distortions.\nSince the space of possible distortions is literally unlimited (both theoretically and in real-world\napplications), it is not feasible to train on all of them. DNNs have a generalisation problem when it\ncomes to settings that go beyond the usual (yet often unrealistic) i.i.d. assumption. 
We believe that\nsolving this generalisation problem will be crucial both for robust machine inference and towards\nbetter models of human object recognition, and we envision that our \ufb01ndings as well as our carefully\nmeasured and freely available behavioural data7 may provide a useful new benchmark for improving\nDNN robustness and a motivation for neuroscientists to identify mechanisms in the brain that may be\nresponsible for this remarkable robustness.\n\nAuthor contributions\n\nThe initial project idea of comparing humans against DNNs was developed by F.A.W. and R.G. All\nauthors jointly contributed towards designing the study and interpreting the data. R.G. and C.R.M.T.\ndeveloped the image manipulations and acquired the behavioural data with input from H.H.S. and\nF.A.W.; J.R. trained networks on distortions; experimental data and networks were evaluated by\nC.R.M.T., R.G. and J.R. with input from H.H.S., M.B. and F.A.W.; R.G. and C.R.M.T. worked on\nmaking our work reproducible (data, code and materials openly accessible; writing supplementary\nmaterial); R.G. wrote the paper with signi\ufb01cant input from all other authors.\n\nAcknowledgments\n\nThis work has been funded, in part, by the German Federal Ministry of Education and Research\n(BMBF) through the Bernstein Computational Neuroscience Program T\u00fcbingen (FKZ: 01GQ1002)\nas well as the German Research Foundation (DFG; Sachbeihilfe Wi 2103/4-1 and SFB 1233 on\n\u201cRobust Vision\u201d). The authors thank the International Max Planck Research School for Intelli-\ngent Systems (IMPRS-IS) for supporting R.G. and J.R.; J.R. acknowledges support by the Bosch\nForschungsstiftung (Stifterverband, T113/30057/17); M.B. 
acknowledges support by the Centre for\nIntegrative Neuroscience T\u00fcbingen (EXC 307) and by the Intelligence Advanced Research Projects\nActivity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number\nD16PC00003.\nWe would like to thank David Janssen for his invaluable contributions in shaping the early stage of\nthis project. Furthermore, we are very grateful to Tom Wallis for providing the MATLAB source code\nof one of his experiments, and for allowing us to use and modify it; Silke Gramer for administrative\nand Uli Wannek for technical support, as well as Britta Lewke for the method of creating response\nicons and Patricia Rubisch for help with testing human observers. Moreover, we would like to thank\nNikolaus Kriegeskorte, Jakob Macke and Tom Wallis for helpful feedback, and three anonymous\nreviewers for constructive suggestions.\n\nReferences\n[1] Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological\n\nReview, 94(2):115\u2013147, 1987.\n\n[2] James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition?\n\nNeuron, 73(3):415\u2013434, 2012.\n\n[3] Mary C Potter. Short-term conceptual memory for pictures. Journal of Experimental Psychology: human\n\nlearning and memory, 2(5):509, 1976.\n\n[4] Simon Thorpe, Denis Fize, and Catherine Marlot. Speed of processing in the human visual system. Nature,\n\n381(6582):520\u2013522, 1996.\n\n[5] Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in\n\nNeurosciences, 15(1):20\u201325, 1992.\n\n[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classi\ufb01cation with deep convolutional neural\n\nnetworks. In Advances in Neural Information Processing Systems, pages 1097\u20131105, 2012.\n\n[Footnote 7: https://github.com/rgeirhos/generalisation-humans-DNNs]\n\n[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
Delving deep into recti\ufb01ers: Surpassing\nhuman-level performance on ImageNet classi\ufb01cation. In Proceedings of the IEEE International Conference\non Computer Vision, pages 1026\u20131034, 2015.\n\n[8] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,\nJulian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik\nGrewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray\nKavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks\nand tree search. Nature, 529(7587):484\u2013489, 2016. ISSN 0028-0836.\n\n[9] Charles F. Cadieu, H. Hong, D. L. K. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and\nJ. J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object\nrecognition. PLoS Computational Biology, 10(12), 2014.\n\n[10] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo.\nPerformance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings\nof the National Academy of Sciences, 111(23):8619\u20138624, 2014.\n\n[11] Radoslaw Martin Cichy, Aditya Khosla, Dimitrios Pantazis, and Aude Oliva. Dynamics of scene represen-\ntations in the human brain revealed by magnetoencephalography and deep neural networks. NeuroImage,\n2016.\n\n[12] Saeed Reza Kheradpisheh, Masoud Ghodrati, Mohammad Ganjtabesh, and Timoth\u00e9e Masquelier.\nDeep networks resemble human feed-forward vision in invariant object recognition. arXiv preprint\narXiv:1508.03929, 2016.\n\n[13] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition\n\nperformance under visual distortions. arXiv preprint arXiv:1705.02498, 2017.\n\n[14] Samuel Dodge and Lina Karam. 
Can the early human visual system compete with deep neural networks?\n\narXiv preprint arXiv:1710.04744, 2017.\n\n[15] Ron Dekel. Human perception in computer vision. arXiv preprint arXiv:1701.04674, 2017.\n\n[16] RT Pramod and SP Arun. Do computational models differ systematically from human object perception?\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1601\u20131609,\n2016.\n\n[17] Hamid Karimi-Rouzbahani, Nasour Bagheri, and Reza Ebrahimpour. Invariant object recognition is a\npersonalized selection of invariant features in humans, not simply explained by hierarchical feed-forward\nvision models. Scienti\ufb01c reports, 7(1):14402, 2017.\n\n[18] Amir Rosenfeld, Markus D Solbach, and John K Tsotsos. Totally looks like-how humans compare,\n\ncompared to machines. arXiv preprint arXiv:1803.01485, 2018.\n\n[19] Alban Flachot and Karl R Gegenfurtner. Processing of chromatic information in a deep convolutional\n\nneural network. JOSA A, 35(4):B334\u2013B346, 2018.\n\n[20] Thomas SA Wallis, Christina M Funke, Alexander S Ecker, Leon A Gatys, Felix A Wichmann, and\nMatthias Bethge. A parametric texture model based on deep convolutional features closely matches texture\nappearance for humans. Journal of vision, 17(12):5\u20135, 2017.\n\n[21] Alexander Berardino, Valero Laparra, Johannes Ball\u00e9, and Eero Simoncelli. Eigen-distortions of hi-\nerarchical representations. In Advances in Neural Information Processing Systems, pages 3533\u20133542,\n2017.\n\n[22] Kamila M Jozwik, Nikolaus Kriegeskorte, Katherine R Storrs, and Marieke Mur. Deep convolutional neural\nnetworks outperform feature-based but not categorical models in explaining object similarity judgments.\nFrontiers in psychology, 8:1726, 2017.\n\n[23] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines\n\nthat learn and think like people. 
Behavioral and Brain Sciences, 40, 2017.\n\n[24] Tim C Kietzmann, Patrick McClure, and Nikolaus Kriegeskorte. Deep neural networks in computational\n\nneuroscience. bioRxiv, 2017.\n\n[25] Rodney J Douglas and Kevan A C Martin. Opening the grey box. Trends in Neurosciences, 14(7):286\u2013293,\n\n1991.\n\n[26] George E P Box. Science and statistics. Journal of the American Statistical Association, 71(356):791\u2013799,\n\n1976.\n\n[27] Nikolaus Kriegeskorte. Deep neural networks: A new framework for modeling biological vision and brain\n\ninformation processing. Annual Review of Vision Science, 1(15):417\u2013446, 2015.\n\n[28] Zhiyuan Chen and Bing Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, 2016. ISBN\n\n1627055010, 9781627055017.\n\n[29] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep\n\nlearning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.\n\n[30] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv\n\npreprint arXiv:1710.05468, 2017.\n\n[31] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in\n\ndeep learning. In Advances in Neural Information Processing Systems, pages 5949\u20135958, 2017.\n\n[32] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information.\n\narXiv preprint arXiv:1703.00810, 2017.\n\n[33] Matthias K\u00fcmmerer, Thomas SA Wallis, and Matthias Bethge. Deepgaze II: Reading \ufb01xations from deep\n\nfeatures trained on object recognition. arXiv preprint arXiv:1610.01563, 2016.\n\n[34] Hong-Wei Ng, Viet Dung Nguyen, Vassilios Vonikakis, and Stefan Winkler. Deep learning for emotion\nrecognition on small datasets using transfer learning. In Proceedings of the 2015 ACM on international\nconference on multimodal interaction, pages 443\u2013449. 
ACM, 2015.\n\n[35] Hayit Greenspan, Bram van Ginneken, and Ronald M Summers. Guest editorial deep learning in medical\nimaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical\nImaging, 35(5):1153\u20131159, 2016.\n\n[36] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell.\nDecaf: A deep convolutional activation feature for generic visual recognition. In International conference\non machine learning, pages 647\u2013655, 2014.\n\n[37] Sebastian Thrun. Is learning the n-th thing any easier than learning the \ufb01rst? In Advances in Neural\n\nInformation Processing Systems, pages 640\u2013646. The MIT Press, 1996.\n\n[38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of\nthe IEEE Conference on Computer Vision and Pattern Recognition, pages 1\u20139, 2015.\n\n[39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. arXiv preprint arXiv:1409.1556, 2015.\n\n[40] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016.\n\n[41] Robert Geirhos, David HJ Janssen, Heiko H Sch\u00fctt, Jonas Rauber, Matthias Bethge, and Felix A Wichmann.\nComparing deep neural networks against humans: object recognition when the signal gets weaker. arXiv\npreprint arXiv:1706.06969, 2017.\n\n[42] Jacob Nachmias and R V Sansbury. Grating contrast: Discrimination may be better than detection. Vision\n\nResearch, 14(10):1039\u20131042, 1974.\n\n[43] Denis G Pelli and Bart Farell. Why use noise? Journal of the Optical Society of America A, 16(3):647\u2013653,\n\n1999.\n\n[44] Felix A Wichmann. Some Aspects of Modelling Human Spatial Vision: Contrast Discrimination. 
PhD\n\nthesis, The University of Oxford, 1999.\n\n[45] G Bruce Henning, C M Bird, and Felix A Wichmann. Contrast discrimination with pulse trains in pink\n\nnoise. Journal of the Optical Society of America A, 19(7):1259\u20131266, 2002.\n\n[46] Matteo Carandini and David J Heeger. Normalization as a canonical neural computation. Nature Reviews\n\nNeuroscience, 13(1):51\u201362, 2012.\n\n[47] Matteo Carandini, David J Heeger, and J Anthony Movshon. Linearity and normalization in simple cells\n\nof the macaque primary visual cortex. The Journal of Neuroscience, 17(21):8621\u20138644, 1997.\n\n[48] Arnaud Delorme, Guillaume Richard, and Michele Fabre-Thorpe. Ultra-rapid categorisation of natural\nscenes does not rely on colour cues: a study in monkeys and humans. Vision Research, 40(16):2187\u20132200,\n2000.\n\n[49] Felix A Wichmann, David HJ Janssen, Robert Geirhos, Guillermo Aguilar, Heiko H Sch\u00fctt, Marianne\nMaertens, and Matthias Bethge. Methods and measurements to compare men against machines. Electronic\nImaging, Human Vision and Electronic Imaging, 2017(14):36\u201345, 2017.\n\n[50] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,\nAndrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet Large\nScale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[51] George A Miller. Wordnet: a lexical database for English. Communications of the ACM, 38(11):39\u201341,\n\n1995.\n\n[52] Victor AF Lamme, Hans Super, and Henk Spekreijse. Feedforward, horizontal, and feedback processing in\n\nthe visual cortex. Current opinion in neurobiology, 8(4):529\u2013535, 1998.\n\n[53] Olaf Sporns and Jonathan D Zwi. The small world of the cerebral cortex. Neuroinformatics, 2(2):145\u2013162,\n\n2004.\n\n[54] Wulfram Gerstner. How can the brain be so fast? In J. 
Leo van Hemmen and Terrence J Sejnowski, editors,\n\n23 Problems in Systems Neuroscience, pages 135\u2013142. Oxford University Press, 2005.\n\n[55] Jonas Kubilius, Stefania Bracci, and Hans P Op de Beeck. Deep neural networks as a computational model\n\nfor human shape sensitivity. PLoS Computational Biology, 12(4):e1004896, 2016.\n\n[56] Felix A Wichmann, Doris I Braun, and Karl R Gegenfurtner. Phase noise and the classi\ufb01cation of natural\n\nimages. Vision Research, 46(8):1520\u20131529, 2006.\n\n[57] Jan Koenderink, Matteo Valsecchi, Andrea van Doorn, Johan Wagemans, and Karl Gegenfurtner. Eidolons:\n\nNovel stimuli for vision research. Journal of Vision, 17(2):7\u20137, 2017.\n\n[58] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,\nM. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,\nM. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,\nB. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals,\nP. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-Scale Machine Learning\non Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467, 2016.\n\n[59] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich. Examining the Impact of Blur on Recognition by\n\nConvolutional Networks. arXiv preprint arXiv:1611.05760, 2016.\n\n[60] Samuel Dodge and Lina Karam. Understanding how image quality affects deep neural networks. In Quality\nof Multimedia Experience (QoMEX), 2016 Eighth International Conference on, pages 1\u20136. IEEE, 2016.\n\n[61] Yiren Zhou, Sibo Song, and Ngai-Man Cheung. On classi\ufb01cation of distorted images with deep convolu-\ntional neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International\nConference on, pages 1213\u20131217. IEEE, 2017.\n\n[62] David H Wolpert and William G Macready. 
No free lunch theorems for optimization. IEEE transactions\n\non evolutionary computation, 1(1):67\u201382, 1997.\n\n[63] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Visual domain adaptation: A\n\nsurvey of recent advances. IEEE signal processing magazine, 32(3):53\u201369, 2015.\n\n[64] Heiko H Sch\u00fctt and Felix A Wichmann. An image-computable psychophysical spatial vision model.\n\nJournal of vision, 17(12):12\u201312, 2017.\n\n[65] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object\nrecognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146\u20132153. IEEE,\n2009.\n\n[66] Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H Sinz, and Richard S Zemel. Normalizing the\nnormalizers: Comparing and extending network normalization schemes. arXiv preprint arXiv:1611.04520,\n2016.\n\n[67] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland\nBrendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and\nrobustness. arXiv preprint arXiv:1811.12231, 2018.\n\n[68] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical\n\nComputing, Vienna, Austria, 2016. URL https://www.R-project.org/.\n\n[69] David H Brainard. The psychophysics toolbox. Spatial Vision, 10:433\u2013436, 1997.\n\n[70] Mario Kleiner, David Brainard, Denis Pelli, Allen Ingling, Richard Murray, and Christopher Broussard.\n\nWhat\u2019s new in Psychtoolbox-3. Perception, 36(14):1, 2007.\n\n[71] Eleanor Rosch. Principles of categorization. Concepts: core readings, 189, 1999.\n\n[72] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r,\nand C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on\nComputer Vision, pages 740\u2013755. 
Springer, 2015.\n\n[73] Stefan Van der Walt, Johannes L Sch\u00f6nberger, Juan Nunez-Iglesias, Fran\u00e7ois Boulogne, Joshua D Warner,\nNeil Yager, Emmanuelle Gouillart, and Tony Yu. scikit-image: image processing in Python. PeerJ, 2:e453,\n2014.\n\n[74] A. M. Derrington, J. Krauskopf, and P. Lennie. Chromatic mechanisms in lateral geniculate nucleus of\n\nmacaque. The Journal of Physiology, 357:241\u2013265, 1984.\n\n[75] Andrew Stockman and Lindsay T Sharpe. The spectral sensitivities of the middle-and long-wavelength-\nsensitive cones derived from measurements in observers of known genotype. Vision research, 40(13):\n1711\u20131737, 2000.\n\n[76] David H Brainard. Human color vision, chapter Cone Contrast and Opponent Modulation Color Spaces.\n\nOptical Society of America, Washington, DC, 2 edition, 1996.\n\n[77] A. van der Schaaf and J.H. van Hateren. Modelling the power spectra of natural images: Statistics and\ninformation. Vision Research, 36(17):2759 \u2013 2770, 1996. ISSN 0042-6989. doi: http://dx.doi.org/10.1016/\n0042-6989(96)00002-8.\n\n[78] Felix A Wichmann, Jan Drewes, Pedro Rosas, and Karl R Gegenfurtner. Animal detection in natural\n\nscenes: Critical features revisited. Journal of Vision, 10(4:6):1\u201327, 2010.\n\n[79] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty\n\nin deep learning. In international conference on machine learning, pages 1050\u20131059, 2016.\n", "award": [], "sourceid": 3743, "authors": [{"given_name": "Robert", "family_name": "Geirhos", "institution": "University of T\u00fcbingen"}, {"given_name": "Carlos R. 
M.", "family_name": "Temme", "institution": "University of T\u00fcbingen"}, {"given_name": "Jonas", "family_name": "Rauber", "institution": "University of T\u00fcbingen"}, {"given_name": "Heiko H.", "family_name": "Sch\u00fctt", "institution": "University of T\u00fcbingen"}, {"given_name": "Matthias", "family_name": "Bethge", "institution": "University of T\u00fcbingen"}, {"given_name": "Felix A.", "family_name": "Wichmann", "institution": "University of T\u00fcbingen"}]}