{"title": "Sanity Checks for Saliency Maps", "book": "Advances in Neural Information Processing Systems", "page_first": 9505, "page_last": 9515, "abstract": "Saliency methods have emerged as a popular tool to highlight features in an input\ndeemed relevant for the prediction of a learned model. Several saliency methods\nhave been proposed, often guided by visual appeal on image data. In this work, we\npropose an actionable methodology to evaluate what kinds of explanations a given\nmethod can and cannot provide. We find that reliance, solely, on visual assessment\ncan be misleading. Through extensive experiments we show that some existing\nsaliency methods are independent both of the model and of the data generating\nprocess. Consequently, methods that fail the proposed tests are inadequate for\ntasks that are sensitive to either data or model, such as, finding outliers in the data,\nexplaining the relationship between inputs and outputs that the model learned,\nand debugging the model. We interpret our findings through an analogy with\nedge detection in images, a technique that requires neither training data nor model.\nTheory in the case of a linear model and a single-layer convolutional neural network\nsupports our experimental findings.", "full_text": "Sanity Checks for Saliency Maps\n\nJulius Adebayo\u2217, Justin Gilmer(cid:93), Michael Muelly(cid:93), Ian Goodfellow(cid:93), Moritz Hardt(cid:93)\u2020, Been Kim(cid:93)\n\njuliusad@mit.edu, {gilmer,muelly,goodfellow,mrtz,beenkim}@google.com\n\n(cid:93)Google Brain\n\n\u2020University of California Berkeley\n\nAbstract\n\nSaliency methods have emerged as a popular tool to highlight features in an input\ndeemed relevant for the prediction of a learned model. Several saliency methods\nhave been proposed, often guided by visual appeal on image data. In this work, we\npropose an actionable methodology to evaluate what kinds of explanations a given\nmethod can and cannot provide. 
We find that reliance solely on visual assessment can be misleading. Through extensive experiments we show that some existing saliency methods are independent both of the model and of the data generating process. Consequently, methods that fail the proposed tests are inadequate for tasks that are sensitive to either data or model, such as finding outliers in the data, explaining the relationship between inputs and outputs that the model learned, and debugging the model. We interpret our findings through an analogy with edge detection in images, a technique that requires neither training data nor model. Theory in the case of a linear model and a single-layer convolutional neural network supports our experimental findings.2\n\n1 Introduction\n\nAs machine learning grows in complexity and impact, much hope rests on explanation methods as tools to elucidate important aspects of learned models [1, 2]. Explanations could potentially help satisfy regulatory requirements [3], help practitioners debug their model [4, 5], and perhaps reveal bias or other unintended effects learned by a model [6, 7]. Saliency methods3 are an increasingly popular class of tools designed to highlight relevant features in an input, typically an image. Despite much excitement and significant recent contributions [8–21], the valuable effort of explaining machine learning models faces a methodological challenge: the difficulty of assessing the scope and quality of model explanations. A paucity of principled guidelines confounds the practitioner when deciding between an abundance of competing methods.\n\nWe propose an actionable methodology based on randomization tests to evaluate the adequacy of explanation approaches. We instantiate our analysis on several saliency methods for image classification with neural networks; however, our methodology applies in generality to any explanation approach. 
Critically, our proposed randomization tests are easy to implement, and can help assess the suitability of an explanation method for a given task at hand.\n\nIn a broad experimental sweep, we apply our methodology to numerous existing saliency methods, model architectures, and data sets. To our surprise, some widely deployed saliency methods are independent of both the data the model was trained on and the model parameters. Consequently, these methods are incapable of assisting with tasks that depend on the model, such as debugging the model, or tasks that depend on the relationships between inputs and outputs present in the data.\n\n∗Work done during the Google AI Residency Program.\n2All code to replicate our findings will be available here: https://goo.gl/hBmhDt\n3We refer here to the broad category of visualization and attribution methods aimed at interpreting trained models. These methods are often used for interpreting deep neural networks, particularly on image data.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Saliency maps for some common methods compared to an edge detector. Saliency masks for 3 inputs for an Inception v3 model trained on ImageNet. We see that an edge detector produces outputs that are strikingly similar to the outputs of some saliency methods. In fact, edge detectors can also produce masks that highlight features which coincide with what appears to be relevant to a model’s class prediction. We find that the methods most similar (see Appendix for SSIM metric) to an edge detector, i.e., Guided Backprop and its variants, show minimal sensitivity to our randomization tests.\n\nTo illustrate the point, Figure 1 compares the output of standard saliency methods with those of an edge detector. The edge detector does not depend on model or training data, and yet produces results that bear visual similarity with saliency maps. 
This goes to show that visual inspection is a poor guide in judging whether an explanation is sensitive to the underlying model and data.\n\nOur methodology derives from the idea of a statistical randomization test, comparing the natural experiment with an artificially randomized experiment. We focus on two instantiations of our general framework: a model parameter randomization test, and a data randomization test.\n\nThe model parameter randomization test compares the output of a saliency method on a trained model with the output of the saliency method on a randomly initialized untrained network of the same architecture. If the saliency method depends on the learned parameters of the model, we should expect its output to differ substantially between the two cases. Should the outputs be similar, however, we can infer that the saliency map is insensitive to properties of the model, in this case, the model parameters. In particular, the output of the saliency map would not be helpful for tasks such as model debugging that inevitably depend on the model parameters.\n\nThe data randomization test compares a given saliency method applied to a model trained on a labeled data set with the method applied to the same model architecture but trained on a copy of the data set in which we randomly permuted all labels. If a saliency method depends on the labeling of the data, we should again expect its outputs to differ significantly in the two cases. An insensitivity to the permuted labels, however, reveals that the method does not depend on the relationship between instances (e.g. images) and labels that exists in the original data.\n\nSpeaking more broadly, any explanation method admits a set of invariances, i.e., transformations of data and model that do not change the output of the method. If we discover an invariance that is incompatible with the requirements of the task at hand, we can safely reject the method. 
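Both tests ultimately reduce to comparing the saliency map from the original model against the map from its randomized counterpart, for instance with the Spearman rank correlation used later in the paper. A minimal sketch of that comparison harness, where the model-independent `edge_like` explanation is a hypothetical stand-in (not one of the methods evaluated here):

```python
import numpy as np
from scipy.stats import spearmanr

def compare_maps(map_a, map_b):
    """Spearman rank correlation between two saliency maps,
    with absolute values (ABS) and without (diverging)."""
    a, b = map_a.ravel(), map_b.ravel()
    rank_abs = spearmanr(np.abs(a), np.abs(b)).correlation
    rank_div = spearmanr(a, b).correlation
    return rank_abs, rank_div

def edge_like(model, x):
    """Hypothetical model-independent 'explanation': ignores the model entirely."""
    return np.abs(np.gradient(x)[0])

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28))

# Model parameter randomization test, in skeleton form: the trained model and
# its randomly re-initialized twin should yield decorrelated maps.
map_trained = edge_like("trained_model", x)
map_random = edge_like("reinitialized_model", x)
rank_abs, rank_div = compare_maps(map_trained, map_random)
# A method that ignores the model scores a perfect correlation and fails the check.
```

A method passing the check would drive both correlations toward zero once the model is randomized.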
As such, our tests can be thought of as sanity checks to perform before deploying a method in practice.\n\nOur contributions\n\n1. We propose two concrete, easy to implement tests for assessing the scope and quality of explanation methods: the model parameter randomization test, and the data randomization test. These tests apply broadly to explanation methods.\n2. We conduct extensive experiments with several explanation methods across data sets and model architectures, and find, consistently, that some of the methods tested are independent of both the model parameters and the labeling of the data that the model was trained on.\n3. Of the methods we tested, Gradients & GradCAM pass the sanity checks, while Guided BackProp & Guided GradCAM fail. In the other cases, we observe a visual perception versus ranking dichotomy, which we describe in our results.\n4. Consequently, our findings imply that the saliency methods that fail our tests are incapable of supporting tasks that require explanations that are faithful to the model or the data generating process.\n5. We interpret our findings through a series of analyses of linear models and a simple 1-layer convolutional sum-pooling architecture, as well as a comparison with edge detectors.\n\n2 Methods and Related Work\n\nIn our formal setup, an input is a vector x ∈ R^d. A model describes a function S : R^d → R^C, where C is the number of classes in the classification problem. An explanation method provides an explanation map E : R^d → R^d that maps inputs to objects of the same shape.\n\nWe now briefly describe some of the explanation methods we examine. The supplementary materials contain an in-depth overview of these methods. 
Our goal is not to exhaustively evaluate all prior explanation methods, but rather to highlight how our methods apply to several cases of interest.\n\nThe gradient explanation for an input x is E_grad(x) = ∂S/∂x [8, 22, 23]. The gradient quantifies how much a change in each input dimension would change the prediction S(x) in a small neighborhood around the input.\n\nGradient ⊙ Input. Another form of explanation is the element-wise product of the input and the gradient, denoted x ⊙ ∂S/∂x, which can address “gradient saturation”, and reduce visual diffusion [13].\n\nIntegrated Gradients (IG) also addresses gradient saturation by summing over scaled versions of the input [14]. IG for an input x is defined as E_IG(x) = (x − x̄) × ∫_0^1 ∂S(x̄ + α(x − x̄))/∂x dα, where x̄ is a “baseline input” that represents the absence of a feature in the original input x.\n\nGuided Backpropagation (GBP) [9] builds on the “DeConvNet” explanation method [10] and corresponds to the gradient explanation where negative gradient entries are set to zero while back-propagating through a ReLU unit.\n\nGuided GradCAM. Introduced by Selvaraju et al. [19], GradCAM explanations correspond to the gradient of the class score (logit) with respect to the feature map of the last convolutional unit of a DNN. For pixel-level granularity GradCAM can be combined with Guided Backpropagation through an element-wise product.\n\nSmoothGrad (SG) [16] seeks to alleviate noise and visual diffusion [14, 13] for saliency maps by averaging over explanations of noisy copies of an input. For a given explanation map E, SmoothGrad is defined as E_sg(x) = (1/N) Σ_{i=1}^{N} E(x + g_i), where noise vectors g_i ∼ N(0, σ²) are drawn i.i.d. from a normal distribution.\n\n2.1 Related Work\n\nOther Methods & Similarities. 
Aside from gradient-based approaches, other methods “learn” an explanation per sample for a model [20, 17, 12, 15, 11, 21]. More recently, Ancona et al. [24] showed that for ReLU networks (with zero baseline and no biases) the ε-LRP and DeepLift (Rescale) explanation methods are equivalent to the input ⊙ gradient. Similarly, Lundberg and Lee [18] proposed SHAP explanations, which approximate the Shapley value and unify several existing methods.\n\nFragility. Ghorbani et al. [25] and Kindermans et al. [26] both present attacks against saliency methods, showing that it is possible to manipulate derived explanations in unintended ways. Nie et al. [27] theoretically assessed backpropagation-based methods and found that Guided BackProp and DeconvNet, under certain conditions, are invariant to network reparameterizations, particularly random Gaussian initialization. Specifically, they show that Guided BackProp and DeconvNet both seem to be performing partial input recovery. Our findings are similar for Guided BackProp and its variants. Further, our work differs in that we propose actionable sanity checks for assessing explanation approaches. Along similar lines, Mahendran and Vedaldi [28] also showed that some backpropagation-based saliency methods lack neuron discriminativity.\n\nCurrent assessment methods. Both Samek et al. [29] and Montavon et al. [30] proposed an input perturbation procedure for assessing the quality of saliency methods. Dabkowski and Gal [17] proposed an entropy-based metric to quantify the amount of relevant information an explanation mask captures. Performance of a saliency map on an object localization task has also been used for assessing saliency methods. Montavon et al. [30] discuss explanation continuity and selectivity as measures of assessment.\n\nRandomization. Our label randomization test was inspired by the work of Zhang et al. 
[31], although we use the test for an entirely different purpose.\n\n2.2 Visualization & Similarity Metrics\n\nWe discuss our visualization approach and overview the set of metrics used in assessing similarity between two explanations.\n\nVisualization. We visualize saliency maps in two ways. In the first case, absolute-value (ABS), we take absolute values of a normalized4 map. For the second case, diverging visualization, we leave the map as is, and use different colors to show positive and negative importance.\n\nSimilarity Metrics. For quantitative comparison, we rely on the following metrics: Spearman rank correlation with absolute value (absolute value), Spearman rank correlation without absolute value (diverging), the structural similarity index (SSIM), and the Pearson correlation of the histogram of gradients (HOGs) derived from two maps. We compute the SSIM and HOGs similarity metrics on ImageNet examples without absolute values.5 These metrics capture a broad notion of similarity; however, quantifying human visual perception is still an active area of research.\n\n3 Model Parameter Randomization Test\n\nThe parameter settings of a model encode what the model has learned from the data during training, and determine test set performance. Consequently, for a saliency method to be useful for debugging a model, it ought to be sensitive to model parameters.\n\nAs an illustrative example, consider a linear function of the form f(x) = w1x1 + w2x2 with input x ∈ R^2. A gradient-based explanation for the model’s behavior for input x is given by the parameter values (w1, w2), which correspond to the sensitivity of the function to each of the coordinates. Changes in the model parameters therefore change the explanation.\n\nOur proposed model parameter randomization test assesses an explanation method’s sensitivity to model parameters. We conduct two kinds of randomization. 
First, we randomly re-initialize all weights of the model both completely and in a cascading fashion. Second, we independently randomize a single layer at a time while keeping all others fixed. In both cases, we compare the resulting explanation from a network with random weights to the one obtained with the model’s original weights.\n\n3.1 Cascading Randomization\n\nOverview. In the cascading randomization, we randomize the weights of a model starting from the top layer, successively, all the way to the bottom layer. This procedure destroys the learned weights from the top layers to the bottom ones. Figure 2 visualizes the cascading randomization for several saliency methods. In Figures 3 and 4, we show the Spearman metrics as well as the SSIM and HOGs similarity metrics.\n\nThe gradient shows sensitivity while Guided BackProp is invariant. We find that the gradient map is sensitive to model parameters. We also observe sensitivity for the GradCAM masks. On the other hand, across all architectures and datasets, Guided BackProp and Guided GradCAM show no change regardless of model degradation.\n\n4We normalize the maps to the range [−1.0, 1.0]. Normalizing in this manner potentially ignores peculiar characteristics of some saliency methods. For example, Integrated Gradients has the property that the attributions sum up to the output value. This property cannot usually be visualized. We contend that such properties will not affect the manner in which the output visualizations are perceived.\n5See appendix for a discussion on calibration of these metrics.\n\nFigure 2: Cascading randomization on Inception v3 (ImageNet). Figure shows the original explanations (first column) for the Junco bird. Progression from left to right indicates complete randomization of network weights (and other trainable variables) up to that ‘block’ inclusive. We show images for 17 blocks of randomization. 
Coordinate (Gradient, mixed_7b) shows the gradient explanation for the network in which the top layers starting from Logits up to mixed_7b have been reinitialized. The last column corresponds to a network with completely reinitialized weights.\n\nThe danger of the visual assessment. On visual inspection, we find that Integrated Gradients and gradient⊙input show a remarkable visual similarity to the original mask. In fact, from Figure 2, it is still possible to make out the structure of the bird even after multiple blocks of randomization. This visual similarity is reflected in the rank correlation with absolute value (Figure 3-Top), SSIM, and the HOGs metric (Figure 4). However, re-initialization disrupts the sign of the map, so that the Spearman rank correlation without absolute values goes to zero (Figure 3-Bottom) almost as soon as the top layers are randomized. This observed visual perception versus numerical ranking dichotomy indicates that naive visual inspection of the masks does not distinguish networks of similar structure but widely differing parameters. We explain the source of this phenomenon in our discussion section.\n\n3.2 Independent Randomization\n\nOverview. As a different form of the model parameter randomization test, we conduct an independent layer-by-layer randomization with the goal of isolating the dependence of the explanations by layer. Consequently, we can assess the dependence of saliency masks on lower versus higher layer weights.\n\nResults. We observe a correspondence between the results from the cascading and independent layer randomization experiments (see the corresponding figures in the Appendix). As previously observed, Guided Backprop and Guided GradCAM masks remain almost unchanged regardless of the layer that is independently randomized across all networks. 
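For intuition, the cascading procedure can be sketched on a toy stack of linear layers, where the gradient explanation has a closed form (the product of the weight matrices); the layer sizes and initialization scale below are arbitrary stand-ins, not the networks used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(sizes):
    return [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def grad_map(layers, x):
    """Gradient of the scalar output of a purely linear stack w.r.t. the input:
    for f(x) = x @ W1 @ W2 @ ... @ Wk, the gradient is the product of the W's."""
    g = np.eye(len(x))
    for W in layers:
        g = g @ W
    return g.ravel()

layers = init_layers([16, 8, 4, 1])   # stand-in for a trained network
x = rng.normal(size=16)
original = grad_map(layers, x)

# Cascading randomization: re-initialize weights from the top (output) layer
# down to the bottom layer, recomputing the explanation after each step.
randomized = [W.copy() for W in layers]
maps = []
for i in reversed(range(len(randomized))):
    randomized[i] = rng.normal(scale=0.1, size=randomized[i].shape)
    maps.append(grad_map(randomized, x))
# A parameter-sensitive method decorrelates from `original` along the cascade.
```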
Similarly, we observe that the structure of the input is maintained, visually, for the gradient⊙input and Integrated Gradients methods.\n\nFigure 3: Similarity Metrics for Cascading Randomization. We show results for Inception v3 on ImageNet, CNN on Fashion MNIST, and MLP on MNIST. See appendix for MLP on Fashion MNIST and CNN on MNIST. In all plots, the y axis is the rank correlation between the original explanation and the randomized explanation derived for randomization up to that layer/block, while the x axis corresponds to the layers/blocks of the DNN starting from the output layer. The vertical black dashed line indicates where successive randomization of the network begins, which is at the top layer. Top: Spearman rank correlation with absolute values. Bottom: Spearman rank correlation without absolute values. Caption Note: For Inception v3 on ImageNet without ABS, the IG, gradient⊙input, and gradient curves all coincide. For MLP-MNIST, IG and gradient⊙input coincide.\n\nFigure 4: Similarity Metrics for Cascading Randomization. Figure showing HOGs similarity and SSIM between original input masks and the masks generated as the Inception v3 is randomized in a cascading manner. Caption Note: For SSIM: Inception v3 - ImageNet, IG and gradient⊙input coincide, while GradCAM, Guided GradCAM, and Guided BackProp are clustered together at the top.\n\n4 Data Randomization Test\n\nThe feasibility of accurate prediction hinges on the relationship between instances (e.g., images) and labels encoded by the data. 
If we artificially break this relationship by randomizing the labels, no predictive model can do better than random guessing. Our data randomization test evaluates the sensitivity of an explanation method to the relationship between instances and labels. An explanation method insensitive to randomizing labels cannot possibly explain mechanisms that depend on the relationship between instances and labels present in the data generating process. For example, if an explanation did not change after we randomly assigned diagnoses to CT scans, then evidently it did not explain anything about the relationship between a CT scan and the correct diagnosis in the first place (see [32] for an application of Guided BackProp as part of a pipeline for shadow detection in 2D ultrasound).\n\nIn our data randomization test, we permute the training labels and train a model on the randomized training data. A model achieving high training accuracy on the randomized training data is forced to memorize the randomized labels without being able to exploit the original structure in the data. As it turns out, state-of-the-art deep neural networks can easily fit random labels, as was shown in Zhang et al. 
[31].\n\nIn our experiments, we permute the training labels for each model and data set pair, and train the model to greater than 95% training set accuracy. Note that the test accuracy is never better than randomly guessing a label (up to sampling error). For each resulting model, we then compute explanations on the same test bed of inputs for a model trained with true labels and the corresponding model trained on randomly permuted labels.\n\nFigure 5: Explanation for a true model vs. model trained on random labels. Top Left: Absolute-value visualization of masks for digit 0 from the MNIST test set for a CNN. Top Right: Saliency masks for digit 0 from the MNIST test set for a CNN shown in diverging color. Bottom Left: Spearman rank correlation (with absolute values) bar graph for saliency methods. We compare the similarity of explanations derived from a model trained on random labels, and one trained on real labels. Bottom Right: Spearman rank correlation (without absolute values) bar graph for saliency methods for MLP. See appendix for corresponding figures for CNN, and MLP on Fashion MNIST.\n\nGradient is sensitive. We find, again, that the gradient and its SmoothGrad variant undergo substantial changes. In addition, the GradCAM masks also change, becoming more disconnected.\n\nSole reliance on visual inspection can be misleading. For Guided BackProp, we observe a visual change; however, we find that the masks still highlight portions of the input that would seem plausible, given correspondence with the input, on naive visual inspection. For example, from the diverging masks (Figure 5-Right), we see that the Guided BackProp mask still assigns positive relevance across most of the digit for the network trained on random labels.\n\nFor gradient⊙input and Integrated Gradients, we also observe visual changes in the masks obtained, particularly in the sign of the attributions. 
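The data randomization protocol can be illustrated end-to-end on a toy problem where the gradient explanation of a linear-in-input model is simply its weight vector; the logistic model and synthetic data below are illustrative stand-ins, not the CNN/MLP experiments reported above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with a genuine input-label relationship: the label is the sign of feature 0.
X = rng.normal(size=(500, 10))
y_true = (X[:, 0] > 0).astype(float)
y_rand = rng.permutation(y_true)      # data randomization test: permute the labels

def train_logreg(X, y, steps=500, lr=0.5):
    """Plain gradient-descent logistic regression; returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = np.clip(X @ w, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# For a linear-in-x model the gradient saliency is just w, so the test compares
# the weights learned on true labels against those learned on permuted labels.
w_true = train_logreg(X, y_true)
w_rand = train_logreg(X, y_rand)
# Only the label-respecting model concentrates attribution on feature 0;
# a saliency method that reports the same map in both cases would fail the test.
```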
Despite this, the input structure is still clearly prevalent in the masks. The effect observed is particularly prominent for sparse inputs like MNIST where the background is zero; however, we observe similar effects for Fashion MNIST (see Appendix). With visual inspection alone, it is not inconceivable that an analyst could confuse the Integrated Gradients and gradient⊙input masks derived from a network trained on random labels as legitimate.\n\n5 Discussion\n\nWe now take a step back to interpret our findings. First, we discuss the influence of the model architecture on explanations derived from NNs. Second, we consider methods that approximate an element-wise product of input and gradient, as several local explanations do [33, 18]. We show, empirically, that the input “structure” dominates the gradient, especially for sparse inputs. Third, we explain the observed behavior of the gradient explanation with an appeal to linear models. We then consider a single 1-layer convolution with sum-pooling architecture, and show that saliency explanations for this model mostly capture edges. Finally, we return to the edge detector and make comparisons between the methods that fail our sanity checks and an edge detector.\n\n5.1 The role of model architecture as a prior\n\nThe architecture of a deep neural network has an important effect on the representation derived from the network. A number of results speak to the strength of randomly initialized models as classification priors [34, 35]. 
Moreover, randomly initialized networks trained on a single input can perform tasks like denoising, super-resolution, and in-painting [36] without additional training data. These prior works speak to the fact that randomly initialized networks correspond to non-trivial representations.\n\nExplanations that do not depend on model parameters or training data might still depend on the model architecture and thus provide some useful information about the prior incorporated in the model architecture. However, in this case, the explanation method should only be used for tasks where we believe that knowledge of the model architecture on its own is sufficient for giving useful explanations.\n\n5.2 Element-wise input-gradient products\n\nA number of methods, e.g., ε-LRP, DeepLift, and Integrated Gradients, approximate the element-wise product of the input and the gradient (on a piecewise linear function like ReLU). To gain further insight into our findings, we can look at what happens to the input-gradient product E(x) = x ⊙ ∂S/∂x if the input is kept fixed, but the gradient is randomized. To do so, we conduct the following experiment. For an input x, sample two random vectors u, v (we consider both the truncated normal and uniform distributions) and consider the element-wise product of x with u and v, respectively, i.e., x ⊙ u and x ⊙ v. We then look at the similarity, for all the metrics considered, between x ⊙ u and x ⊙ v as noise increases. We conduct this experiment on ImageNet samples. We observe that the input does indeed dominate the product (see the corresponding figure in the Appendix). We also observe that the input dominance persists even as the noisy gradient vectors change drastically. 
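This dominance effect is easy to reproduce in isolation. A small sketch with a sparse synthetic input and two independent random "gradients" (the dimension and sparsity level are arbitrary assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Sparse MNIST-like input: mostly zero background with a few active pixels.
x = np.zeros(784)
active = rng.choice(784, size=120, replace=False)
x[active] = rng.uniform(0.5, 1.0, size=120)

# Two independent random vectors standing in for unrelated "gradients".
u = rng.normal(size=784)
v = rng.normal(size=784)

# The element-wise products inherit the input's support (zeros stay zero),
# so their rank correlation stays high even though u and v are unrelated.
rho_products = spearmanr(np.abs(x * u), np.abs(x * v)).correlation
rho_gradients = spearmanr(np.abs(u), np.abs(v)).correlation
```

With these settings `rho_products` is close to 1 while `rho_gradients` hovers near 0: the sparsity pattern of `x`, not the gradient, drives the similarity.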
This experiment indicates that methods that approximate the “input-times-gradient” could conceivably mostly return the input in cases where the gradients look visually noisy, as they tend to do.\n\n5.3 Analysis for simple models\n\nTo better understand our findings, we analyze the output of the saliency methods tested on two simple models: a linear model and a 1-layer sum-pooling convolutional network. We find that the output of the saliency methods, on a linear model, returns a coefficient that intuitively measures the sensitivity of the model with respect to that variable. However, these methods applied to a random convolution seem to result in visual artifacts that are akin to an edge detector.\n\nFigure 6: Explanations derived for the 1-layer Sum-Pooling Convolution architecture. We show gradient, SmoothGrad, Integrated Gradients, and Guided BackProp explanations. (See Appendix for Similarity Metrics.)\n\nLinear Model. Consider a linear model f : R^d → R defined as f(x) = w · x where w ∈ R^d are the model weights. For gradients we have E_grad(x) = ∂(w · x)/∂x = w. Similarly for SmoothGrad we have E_sg(x) = w (the gradient is independent of the input, so averaging gradients over noisy inputs yields the same model weight). Integrated Gradients reduces to “gradient ⊙ input” for this case:\n\nE_IG(x) = (x − x̄) ⊙ ∫_0^1 ∂f(x̄ + α(x − x̄))/∂x dα = (x − x̄) ⊙ ∫_0^1 wα dα = (x − x̄) ⊙ w/2.\n\nConsequently, we see that the application of the basic gradient method to a linear model will pass our sanity check. Gradients on a random model will return an image of white noise, while Integrated Gradients will return a noisy version of the input image. 
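The closed form above can be checked numerically; a short sketch that averages the interpolated gradient αw over a uniform grid of α values and compares it with the w/2 expression (the dimension, zero baseline, and grid size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=d)      # linear model f(x) = w . x
x = rng.normal(size=d)
x_bar = np.zeros(d)         # zero baseline

# Differentiating f(x_bar + alpha * (x - x_bar)) with respect to x gives
# alpha * w, so integrating over alpha in [0, 1] yields w / 2.
alphas = np.linspace(0.0, 1.0, 100001)
integral = (alphas[:, None] * w[None, :]).mean(axis=0)  # uniform-grid average
e_ig = (x - x_bar) * integral

# Matches the closed form E_IG(x) = (x - x_bar) * w / 2 above.
```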
We did not consider Guided Backprop and GradCAM here because both methods are not defined for the linear model considered above.\n\n1-Layer Sum-Pool Conv Model. We now show that the application of these same methods to a 1-layer convolutional network may result in visual artifacts that can be misleading unless further analysis is done. Consider a single-layer convolutional network applied to a grey-scale image x ∈ R^{n×n}. Let w ∈ R^{3×3} denote the 3 × 3 convolutional filter, indexed as w_ij for i, j ∈ {−1, 0, 1}. We denote by w ∗ x ∈ R^{n×n} the output of the convolution operation on the image x. Then the output of this network can be written as l(x) = Σ_{i=1}^{n} Σ_{j=1}^{n} σ((w ∗ x)_ij), where σ is the ReLU non-linearity applied point-wise. In particular, this network applies a single 3x3 convolutional filter to the input image, then applies a ReLU non-linearity and finally sum-pools over the entire convolutional layer for the output. This is a similar architecture to the one considered in [34]. As shown in Figure 6, we see that different saliency methods do act like edge detectors. This suggests that the convolutional structure of the network is responsible for the edge-detecting behavior of some of these saliency methods.\n\nTo understand why saliency methods applied to this simple architecture visually appear to be edge detectors, we consider the closed form of the gradient ∂l(x)/∂x_ij. Let a_ij = 1{(w ∗ x)_ij ≥ 0} indicate the activation pattern of the ReLU units in the convolutional layer. 
Then for i, j ∈ [2, n − 1] we have

∂l(x)/∂x_{ij} = Σ_{k=−1}^{1} Σ_{l=−1}^{1} σ′((w ∗ x)_{i+k,j+l}) w_{kl} = Σ_{k=−1}^{1} Σ_{l=−1}^{1} a_{i+k,j+l} w_{kl}

(recall that σ′(x) = 0 if x < 0 and 1 otherwise). This implies that the 3 × 3 activation pattern local to pixel x_{ij} uniquely determines ∂l(x)/∂x_{ij}. It is now clear why edges will be visible in the produced saliency mask: regions in the image corresponding to an "edge" will have a distinct activation pattern from surrounding pixels. In contrast, pixel regions of the image which are more uniform will all have the same activation pattern, and thus the same value of ∂l(x)/∂x_{ij}. Perhaps a similar principle applies for stacked convolutional layers.

5.4 The case of edge detectors

An edge detector, roughly speaking, is a classical tool to highlight sharp transitions in an image. Notably, edge detectors are typically untrained and do not depend on any predictive model; they are solely a function of the given input image. Like some of the saliency methods we examined, edge detection is invariant under model and data transformations.

In Figure 1 we saw that edge detectors produce images that are strikingly similar to the outputs of some saliency methods. In fact, edge detectors can also produce pictures that highlight features which coincide with what appears to be relevant to a model's class prediction. However, here the human observer is at risk of confirmation bias when interpreting the highlighted edges as an explanation of the class prediction. In the Appendix, we show a qualitative comparison of saliency maps of an input image with the same input image multiplied element-wise by the output of an edge detector.
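A comparison of this kind can be reproduced with any off-the-shelf edge detector. Below is a small self-contained numpy sketch (our own illustration, not the paper's code) that computes a 3 × 3 Sobel edge map of a toy image and multiplies it element-wise with the input:

```python
import numpy as np

def sobel_edges(img):
    """Gradient-magnitude edge map via 3x3 Sobel filters (pure numpy)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    n, m = img.shape
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # Correlate each interior pixel's 3x3 neighborhood with both filters.
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    return np.hypot(gx, gy)

# A toy image: uniform background with a bright square.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0

edges = sobel_edges(img)
masked = img * edges   # input multiplied element-wise by the edge map

# The edge map is zero over the uniform interior and background and
# nonzero only at the square's boundary, so `masked` highlights only
# the transition pixels.
```

Like the gradient of the sum-pool network, the result is driven entirely by local intensity transitions, with no model in the loop.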
The result indeed looks strikingly similar, illustrating that saliency methods mostly use the edges of the image.

While edge detection is a fundamental and useful image processing technique, it is typically not thought of as an explanation method, simply because it involves no model or training data. In light of our findings, it is not unreasonable to interpret some saliency methods as implicitly implementing unsupervised image processing techniques, akin to edge detection, segmentation, or denoising. To differentiate such methods from model-sensitive explanations, visual inspection is insufficient.

6 Conclusion and future work

The goal of our experimental method is to give researchers guidance in assessing the scope of model explanation methods. We envision these methods to serve as sanity checks in the design of new model explanations. Our results show that visual inspection of explanations alone can favor methods that may provide compelling pictures, but lack sensitivity to the model and the data generating process. Invariances in explanation methods give a concrete way to rule out the adequacy of the method for certain tasks. We primarily focused on invariance under model randomization and label randomization. Many other transformations are worth investigating and can shed light on various methods we did and did not evaluate. Along these lines, we hope that our paper is a stepping stone towards a more rigorous evaluation of new explanation methods, rather than a verdict on existing methods.

Acknowledgments

We thank the Google PAIR team for the open-source implementation of the methods used in this work. We thank Martin Wattenberg and other members of the Google Brain team for critical feedback that helped improve the work. Lastly, we thank anonymous reviewers for feedback that helped improve the manuscript.

References

[1] Alfredo Vellido, José David Martín-Guerrero, and Paulo JG Lisboa.
Making machine learning models interpretable. In ESANN, volume 12, pages 163–172. Citeseer, 2012.

[2] Finale Doshi-Velez, Mason Kortz, Ryan Budish, Chris Bavitz, Sam Gershman, David O'Brien, Stuart Schieber, James Waldo, David Weinberger, and Alexandra Wood. Accountability of AI under the law: The role of explanation. arXiv preprint arXiv:1711.01134, 2017.

[3] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". arXiv preprint arXiv:1606.08813, 2016.

[4] Gabriel Cadamuro, Ran Gilad-Bachrach, and Xiaojin Zhu. Debugging machine learning models. In ICML Workshop on Reliable Machine Learning in the Wild, 2016.

[5] Jorge Casillas, Oscar Cordón, Francisco Herrera Triguero, and Luis Magdalena. Interpretability issues in fuzzy modeling, volume 128. Springer, 2013.

[6] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Interpretable & explorable approximations of black box models. arXiv preprint arXiv:1707.01154, 2017.

[7] Fulton Wang and Cynthia Rudin. Causal falling rule lists. arXiv preprint arXiv:1510.05189, 2015.

[8] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[9] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

[10] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[11] Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Erhan, Been Kim, and Sven Dähne. Learning how to explain neural networks: PatternNet and PatternAttribution. International Conference on Learning Representations, 2018.
URL https://openreview.net/forum?id=Hkn7CBaTW.

[12] Luisa M Zintgraf, Taco S Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.

[13] Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.

[14] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/sundararajan17a.html.

[15] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.

[16] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

[17] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pages 6970–6979, 2017.

[18] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4768–4777, 2017.

[19] Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-CAM: Why did you say that? arXiv preprint arXiv:1611.07450, 2016.

[20] Ruth C Fong and Andrea Vedaldi.
Interpretable explanations of black boxes by meaningful perturbation. arXiv preprint arXiv:1704.03296, 2017.

[21] Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An information-theoretic perspective on model interpretation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 883–892, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/chen18j.html.

[22] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

[23] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.

[24] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. International Conference on Learning Representations (ICLR 2018), 2018.

[25] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.

[26] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867, 2017.

[27] Weili Nie, Yang Zhang, and Ankit Patel. A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In ICML, 2018.

[28] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In European Conference on Computer Vision, pages 120–135.
Springer, 2016.

[29] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2017.

[30] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2017.

[31] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proc. 5th ICLR, 2017.

[32] Qingjie Meng, Christian Baumgartner, Matthew Sinclair, James Housden, Martin Rajchl, Alberto Gomez, Benjamin Hou, Nicolas Toussaint, Jeremy Tan, Jacqueline Matthew, et al. Automatic shadow detection in 2D ultrasound. 2018.

[33] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In Proc. 6th ICLR, 2018.

[34] Andrew M Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On random weights and unsupervised feature learning. In ICML, pages 1089–1096, 2011.

[35] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

[36] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky.
Deep image prior. arXiv preprint arXiv:1711.10925, 2017.