{"title": "A Benchmark for Interpretability Methods in Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 9737, "page_last": 9748, "abstract": "We propose an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks. Our results across several large-scale image classification datasets show that many popular interpretability methods produce estimates of feature importance that are not better than a random designation of feature importance. Only certain ensemble based approaches---VarGrad and SmoothGrad-Squared---outperform such a random assignment of importance. The manner of ensembling remains critical, we show that some approaches do no better then the underlying method but carry a far higher computational burden.", "full_text": "A Benchmark for Interpretability Methods in Deep\n\nNeural Networks\n\nSara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim\n\nshooker,dumitru,pikinder,beenkim@google.com\n\nGoogle Brain\n\nAbstract\n\nWe propose an empirical measure of the approximate accuracy of feature impor-\ntance estimates in deep neural networks. Our results across several large-scale\nimage classi\ufb01cation datasets show that many popular interpretability methods pro-\nduce estimates of feature importance that are not better than a random designation\nof feature importance. Only certain ensemble based approaches\u2014VarGrad and\nSmoothGrad-Squared\u2014outperform such a random assignment of importance. The\nmanner of ensembling remains critical, we show that some approaches do no better\nthen the underlying method but carry a far higher computational burden.\n\n1\n\nIntroduction\n\nIn a machine learning setting, a question of great interest is estimating the in\ufb02uence of a given input\nfeature to the prediction made by a model. 
Understanding what features are important helps improve\nour models, builds trust in the model's predictions, and isolates undesirable behavior. Unfortunately,\nit is challenging to evaluate whether an explanation of model behavior is reliable. First, there is no\nground truth. If we knew what was important to the model, we would not need to estimate feature\nimportance in the \ufb01rst place. Second, it is unclear which of the numerous proposed interpretability\nmethods for estimating feature importance one should select [6, 5, 43, 30, 37, 33, 39, 36, 19, 22, 11,\n9, 40, 32, 41, 27, 34, 2]. Many feature importance estimators have interesting theoretical properties,\ne.g. preservation of relevance [5] or implementation invariance [37]. However, even these methods\nneed to be con\ufb01gured correctly [22, 37] and it has been shown that using the wrong con\ufb01guration\ncan easily render them ineffective [18]. For this reason, it is important that we build a framework to\nempirically validate the relative merits and reliability of these methods.\n\nA commonly used strategy is to remove the supposedly informative features from the input and\nlook at how the classi\ufb01er degrades [29]. This method is cheap to evaluate but comes with a signi\ufb01cant\ndrawback. Samples where a subset of the features are removed come from a different distribution\n(as can be seen in Fig. 1). Therefore, this approach clearly violates one of the key assumptions\nin machine learning: the training and evaluation data come from the same distribution. Without\nre-training, it is unclear whether the degradation in model performance comes from the distribution\nshift or because the features that were removed are truly informative [9, 11].\n\nFor this reason, we decided to verify how much information can be removed in a typical dataset before the\naccuracy of a retrained model breaks down completely. In this experiment, we applied ResNet-50 [16],\none of the most commonly used models, to ImageNet. 
It turns out that removing information is\nquite hard. With 90% of the inputs removed, the network still achieves 63.53% accuracy compared to\n76.68% on clean data. This implies that a strong performance degradation without re-training might\nbe caused by a shift in distribution instead of removal of information.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nInstead, in this work we evaluate interpretability methods by verifying how the accuracy of a retrained\nmodel degrades as features estimated to be important are removed. We term this approach ROAR,\nRemOve And Retrain. For each feature importance estimator, ROAR replaces the fraction of the\npixels estimated to be most important with a \ufb01xed uninformative value. This modi\ufb01cation (shown in\nFig. 1) is repeated for each image in both the training and test set. To measure the change to model\nbehavior after the removal of these input features, we separately train new models on the modi\ufb01ed\ndataset such that train and test data come from a similar distribution. More accurate estimators\nwill identify as important input pixels whose subsequent removal causes the sharpest degradation in\naccuracy. We also compare each method's performance to a random assignment of importance and a\nSobel edge \ufb01lter [35]. Both of these control variants produce rankings that are independent of the\nproperties of the model we aim to interpret. Given that these methods do not depend upon the model,\nthe performance of these variants represents a lower bound on the accuracy that an interpretability method\ncould be expected to achieve. In particular, a random baseline allows us to answer the question: is the\ninterpretability method more accurate than a random guess as to which features are important? 
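As a rough illustration of the two model-independent control rankings just described, here is a hypothetical numpy sketch (not the paper's code; the Sobel kernel is the standard 3x3 variant and the helper names are ours):

```python
import numpy as np

def random_ranking(shape, rng):
    """Model-independent control: a random importance score per pixel."""
    return rng.uniform(size=shape)

def sobel_ranking(gray):
    """Model-independent control: Sobel edge magnitude as an importance ranking."""
    kx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
    ky = kx.T
    p = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Cross-correlate the two Sobel kernels with the image.
    gx = sum(kx[i, j] * p[i:i + h, j:j + w] for i in range(3) for j in range(3))
    gy = sum(ky[i, j] * p[i:i + h, j:j + w] for i in range(3) for j in range(3))
    return np.hypot(gx, gy)

rng = np.random.default_rng(0)
img = rng.uniform(size=(8, 8))          # stand-in grayscale image
rand_scores = random_ranking(img.shape, rng)
edge_scores = sobel_ranking(img)
```

Neither ranking looks at the model, so any estimator that depends on the trained weights should be expected to beat both.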
In\nSection 3 we will elaborate on the motivation and the limitations of ROAR.\n\nWe applied ROAR in a broad set of experiments across three large-scale, open-source image datasets:\nImageNet [10], Food 101 [8] and Birdsnap [7]. In our experiments we show the following.\n\n\u2022 Training performance is quite robust to removing input features. For example, after randomly\nreplacing 90% of all ImageNet input features, we can still train a model that achieves\n63.53% \u00b1 0.13 (average across 5 independent runs). This implies that a small subset of\nfeatures is suf\ufb01cient for the actual decision making. Our observation is consistent across\ndatasets.\n\n\u2022 The base methods we evaluate are no better than, or on par with, a random estimate at \ufb01nding\nthe core set of informative features. However, we show that SmoothGrad-Squared (an\nunpublished variant of Classic SmoothGrad [34]) and VarGrad [2], methods which ensemble\na set of estimates produced by basic methods, far outperform both the underlying method\nand a random guess. These results are consistent across datasets and methods.\n\n\u2022 Not all ensemble estimators improve performance. Classic SmoothGrad [34] is worse than\na single estimate despite being more computationally intensive.\n\n2 Related Work\n\nInterpretability research is diverse, and many different approaches are used to gain intuition about the\nfunction implemented by a neural network. For example, one can distill or constrain a model into\na functional form that is considered more interpretable [4, 12, 38, 28]. Other methods explore the\nrole of neurons or activations in hidden layers of the network [24, 26, 23, 42], while others use high-level\nconcepts to explain prediction results [17]. Finally, there are also the input feature importance\nestimators that we evaluate in this work. 
These methods estimate the importance of an input feature\nfor a speci\ufb01ed output activation.\n\nWhile there is no clear way to measure \u201ccorrectness\u201d, comparing the relative merit of different\nestimators is often based upon human studies [30, 27, 21] which interrogate whether the ranking is\nmeaningful to a human. Recently, there have been efforts to evaluate whether interpretability methods\nare both reliable and meaningful to humans. For example, in [18] a unit test for interpretability\nmethods is proposed which detects whether the explanation can be manipulated by factors that do\nnot affect the decision-making process. Another approach considers a set of sanity checks that\nmeasure the change to an estimate as parameters in a model or dataset labels are randomized [2].\nClosely related to this manuscript are the modi\ufb01cation-based evaluation measures proposed originally\nby [29] with subsequent variations [19, 25]. In this line of work, one replaces the inputs estimated\nto be most important with a value considered meaningless to the task. These methods measure the\nsubsequent degradation to the trained model at inference time. Recursive feature elimination methods\n[15] are a greedy search where the algorithm is trained on an iteratively altered subset of features.\nRecursive feature elimination does not scale to high-dimensional datasets (one would have to retrain\nremoving one pixel at a time) and, unlike our work, it is a method to estimate feature importance (rather\nthan a way to evaluate existing interpretability methods).\n\nTo the best of our knowledge, unlike prior modi\ufb01cation-based evaluation measures, our benchmark\nrequires retraining the model from random initialization on the modi\ufb01ed dataset rather than re-scoring\nthe modi\ufb01ed image at inference time. 
Without this step, we argue that one cannot decouple whether\nthe model\u2019s degradation in performance is due to artifacts introduced by the value used to replace\nthe pixels that are removed or due to the approximate accuracy of the estimator. Our work considers\nseveral large-scale datasets, whereas all previous evaluations have involved a far smaller subset of\ndata [3, 29].\n\nFigure 1: A single ImageNet image modi\ufb01ed according to the ROAR framework. The fraction of\npixels estimated to be most important by each interpretability method is replaced with the mean.\nAbove each image, we include the average test-set accuracy for 5 ResNet-50 models independently\ntrained on the modi\ufb01ed dataset. From left to right: base estimators (gradient heatmap (GRAD),\nIntegrated Gradients (IG), Guided Backprop (GB)), derivative approaches that ensemble a set of\nestimates (SmoothGrad Integrated Gradients (SG-IG), SmoothGrad-Squared Integrated Gradients\n(SG-SQ-IG), VarGrad Integrated Gradients (Var-IG)) and control variants (random modi\ufb01cation\n(Random) and a Sobel edge \ufb01lter (Sobel)). This image is best visualized in digital format.\n\n3 ROAR: Remove And Retrain\n\nTo evaluate a feature importance estimate using ROAR, we sort the input dimensions according to the\nestimated importance. We compute an estimate e of feature importance for every input in the training\nand test set. We rank the entries of each e into an ordered set {e_i^o}_{i=1}^N. For the top t fraction of this ordered set, we\nreplace the corresponding pixels in the raw image with the per-channel mean. We generate new train\nand test datasets at different degradation levels t = [0, 10, . . . , 100] (where t is a percentage of all\nfeatures modi\ufb01ed). 
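A minimal sketch of this masking step (a hypothetical numpy helper with names of our choosing; the paper's pipeline applies it to full ImageNet-scale datasets before retraining a ResNet-50):

```python
import numpy as np

def degrade_dataset(images, estimates, t_percent):
    """Replace the top t% most-important pixels of each image with the per-channel mean."""
    out = images.astype(float).copy()
    channel_mean = images.mean(axis=(0, 1, 2))   # dataset-wide mean per channel
    n_pix = images.shape[1] * images.shape[2]
    k = int(round(t_percent / 100.0 * n_pix))    # number of pixels to replace
    if k == 0:
        return out
    # Rank the pixels of each image by estimated importance, descending.
    order = np.argsort(estimates.reshape(len(images), -1), axis=1)[:, ::-1]
    for img, idx in zip(out, order[:, :k]):
        r, c = np.unravel_index(idx, images.shape[1:3])
        img[r, c] = channel_mean
    return out

rng = np.random.default_rng(1)
train = rng.uniform(size=(8, 4, 4, 3))           # stand-in "dataset" of tiny images
scores = rng.uniform(size=(8, 4, 4))             # stand-in importance estimates
degraded = {t: degrade_dataset(train, scores, t) for t in [0, 10, 30, 50, 70, 90]}
```

Each degraded copy would then be used to retrain a model from scratch, so that train and test data again come from the same distribution.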
Afterwards, the model is re-trained from random initialization on the new dataset\nand evaluated on the new test data.\n\nOf course, because re-training can result in slightly different models, it is essential to repeat the\ntraining process multiple times to ensure that the variance in accuracy is low. To control for this, we\nrepeat training 5 times for each interpretability method e and level of degradation t. We introduce the\nmethodology and motivation for ROAR in the context of linear models and deep neural networks.\nHowever, we note that the properties of ROAR differ given an algorithm that explicitly uses feature\nselection (e.g. L1 regularization or any mechanism which limits the features available to the model at\ninference time). In this case, one should of course mask the inputs that are known to be ignored by\nthe model, before re-training. This will prevent them from being used after re-training, which could\notherwise corrupt the ROAR metric. For the remainder of this paper, we focus on the performance of\nROAR given deep neural networks and linear models which do not present this limitation.\n\nWhat would happen without re-training? The re-training is the most computationally expensive\naspect of ROAR. One should question whether it is actually needed. We argue that re-training is\nneeded because machine learning models typically assume that the train and the test data come from\na similar distribution.\n\nThe replacement value c can only be considered uninformative if the model is trained to learn it as\nsuch. Without retraining, it is unclear whether degradation in performance is due to the introduction of\nartifacts outside of the original training distribution or because we actually removed information. This\nis made explicit in our experiment in Section 4.3.1, where we show that without retraining the degradation\nis far higher than the modest decrease in performance observed with re-training. 
This suggests that\nretraining better controls for artifacts introduced by the modi\ufb01cation.\n\nAre we evaluating the right aspects? Re-training does have limitations. For one, while the\narchitecture is the same, the model used during evaluation is not the same as the model on which the\nfeature importance estimates were originally obtained. To understand why ROAR is still meaningful,\nwe have to think about what happens when the accuracy degrades, especially when we compare it to\na random baseline. The possibilities are:\n\n1. We remove input dimensions and the accuracy drops. In this case, it is very likely\nthat the removed inputs were informative to the original model. ROAR thus gives a good\nindication that the importance estimate is of high quality.\n\n2. We remove inputs and the accuracy does not drop. This can be explained as either:\n\n(a) It could be caused by removal of an input that was uninformative to the model. This\nincludes the case where the input might have been informative but not in a way that is\nuseful to the model, for example, when a linear model is used and the relation between\nthe feature and the output is non-linear. Since in such a case the information was not\nused by the model and does not show up in ROAR, we can assume ROAR behaves as\nintended.\n\n(b) There might be redundancy in the inputs. The same information could be represented in\nanother feature. This behavior can be detected with ROAR as we will show in our toy\ndata experiment.\n\nValidating the behavior of ROAR on arti\ufb01cial data. To demonstrate the difference between\nROAR and an approach without re-training in a controlled environment, we generate a 16-dimensional\ndataset with 4 informative features. Each datapoint x and its label y were generated as follows:\n\nx = az/10 + d\u03b7 + \u01eb/10,\n\ny = (z > 0).\n\nAll random variables were sampled from a standard normal distribution. 
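Written out as code, this generative process might look like the sketch below (variable names are ours; as described in the text, a and d are sampled once, with only the first four entries of a nonzero):

```python
import numpy as np

def make_toy_data(n, rng):
    """x = a*z/10 + d*eta + eps/10, y = (z > 0), with 16 features."""
    a = np.zeros(16)
    a[:4] = rng.normal(size=4)         # only the first 4 features carry signal
    d = rng.normal(size=16)            # distractor direction, sampled once
    z = rng.normal(size=(n, 1))        # latent variable that defines the label
    eta = rng.normal(size=(n, 1))      # per-example distractor coefficient
    eps = rng.normal(size=(n, 16))     # per-example noise
    x = a * z / 10 + d * eta + eps / 10
    y = (z[:, 0] > 0).astype(int)
    return x, y

x, y = make_toy_data(1000, np.random.default_rng(0))
```

Since eta is independent of z, the distractor direction d carries no label information, which is what makes the ground-truth importance ranking unambiguous here.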
The vectors a and d are 16-dimensional\nvectors that were sampled once to generate the dataset. In a, only the \ufb01rst 4 entries are\nnonzero, to ensure that there are exactly 4 informative features. The values \u03b7, \u01eb were sampled\nindependently for each example.\n\nWe use a least-squares model as this problem can be solved linearly. We compare three rankings: the\nground truth importance ranking, a random ranking, and the inverted ground truth ranking (the worst\npossible estimate of importance). In the left plot of Fig. 2 we can observe that without re-training, the\nworst-case estimator is shown to degrade performance relatively quickly. In contrast, ROAR shows\nno degradation until informative features begin to be removed at 75%. This correctly shows that this\nestimator has ranked feature importance poorly (ranked uninformative features as most important).\n\nFinally, we consider ROAR performance given a set of variables that are completely redundant. We\nnote that ROAR might not decrease until all of them are removed. To account for this, we measure\nROAR at different levels of degradation, with the expectation that across this interval we would be\nable to detect in\ufb02ection points in performance that would indicate a set of redundant features. If this\nhappens, we believe that it could be detected easily by the sharp decrease as shown in Fig. 2. Now\nthat we have validated ROAR in a controlled setup, we can move on to our large-scale experiments.\n\nFigure 2: A comparison between not retraining and ROAR on arti\ufb01cial data. In the case where the\nmodel is not retrained, test-set accuracy quickly erodes even under the worst-case ranking, which marks uninformative\nfeatures as most important. This incorrectly evaluates a completely wrong feature ranking as being\ninformative. ROAR is far better at identifying this worst-case estimator, showing no degradation until\nthe features which are informative are removed at 75%. 
This plot also shows the limitation of ROAR:\nan accuracy decrease might not happen until a complete set of fully redundant features is removed.\nTo account for this, we measure ROAR at different levels of degradation, with the expectation that\nacross this interval we would be able to account for performance given a set of redundant features.\n\n4 Large scale experiments\n\n4.1 Estimators under consideration\n\nOur evaluation is constrained to a subset of estimators of feature importance. We selected these\nbased on the availability of open-source code, consistent guidelines on how to apply them, and the\nease of implementation given a ResNet-50 architecture [16]. Due to the breadth of the experimental\nsetup, it was not possible to include additional methods. However, we welcome the opportunity to\nconsider additional estimators in the future, and in order to make it easy to apply ROAR to additional\nestimators we have open-sourced our code: https://bit.ly/2ttLLZB. Below, we brie\ufb02y introduce\neach of the methods we evaluate.\n\nBase estimators are estimators that compute a single estimate of importance (as opposed to ensemble\nmethods). We note that guided backprop and integrated gradients are examples of signal\nand attribution methods respectively; however, the performance of these estimators should not be considered\nrepresentative of other methods, which should be evaluated separately.\n\n\u2022 Gradients or Sensitivity heatmaps [33, 6] (GRAD) are the gradient of the output activation\nof interest A^l_n with respect to x_i:\n\ne = \u2202A^l_n / \u2202x_i\n\n\u2022 Guided Backprop [36] (GB) is an example of a signal method that aims to visualize the input\npatterns that cause the neuron activation A^l_n in higher layers [36, 39, 19]. 
GB computes this\nby using a modi\ufb01ed backpropagation step that stops the \ufb02ow of gradients that are less than\nzero at a ReLU gate.\n\n\u2022 Integrated Gradients [37] (IG) is an example of an attribution method which assigns importance\nto input features by decomposing the output activation A^l_n into contributions from\nthe individual input features [5, 37, 22, 31, 19]. Integrated Gradients interpolates a set of\nestimates for values between a non-informative reference point x^0 and the actual input x.\nThis integral can be approximated by summing a set of k points at small intervals between\nx^0 and x:\n\ne_i = (x_i \u2212 x_i^0) \u00d7 \u2211_{j=1}^{k} \u2202f_w(x^0 + (j/k)(x \u2212 x^0)) / \u2202x_i \u00d7 (1/k)\n\nThe \ufb01nal estimate e will depend upon both the choice of k and the reference point x^0. As\nsuggested by [37], we use a black image as the reference point and set k to be 25.\n\nEnsembling methods In addition to the base approaches, we also evaluate three ensembling methods\nfor feature importance. For all the ensemble approaches that we describe below (SG, SG-SQ, Var),\nwe average over a set of 15 estimates as suggested in the original SmoothGrad publication [34].\n\n\u2022 Classic SmoothGrad (SG) [34] averages a set of J noisy estimates of feature importance\n(constructed by injecting a single input with Gaussian noise \u03b7 independently J times):\n\ne = (1/J) \u2211_{i=1}^{J} g_i(x + \u03b7, A^l_n)\n\n\u2022 SmoothGrad-Squared (SG-SQ) is an unpublished variant of classic SmoothGrad (SG) which squares\neach estimate before averaging:\n\ne = (1/J) \u2211_{i=1}^{J} g_i(x + \u03b7, A^l_n)^2\n\nAlthough SG-SQ is not described in the original publication, it is the default behavior of the\nopen-source implementation of SG: https://bit.ly/2Hpx5ob.\n\n\u2022 VarGrad (Var) [2] employs the same methodology as classic SmoothGrad (SG) to construct\na set of J noisy estimates. 
However, VarGrad aggregates the estimates by computing the\nvariance of the noisy set rather than the mean.\n\ne = Var_i(g_i(x + \u03b7, A^l_n))\n\nControl Variants As a control, we compare each estimator to two rankings (a random assignment\nof importance and a Sobel edge \ufb01lter) that do not depend at all on the model parameters. These\ncontrols represent a lower bound on performance that we would expect all interpretability methods to\noutperform.\n\n\u2022 Random A random estimator g_R assigns each feature a random binary importance e_i \u2208 {0, 1}.\nThis amounts to a binary vector e \u223c Bernoulli(1 \u2212 t), where (1 \u2212 t) is the probability that\ne_i = 1. The formulation of g_R does not depend on either the model parameters or the input\nimage (beyond the number of pixels in the image).\n\n\u2022 Sobel Edge Filter convolves a hard-coded, separable, integer \ufb01lter over an image to produce\na mask of derivatives that emphasizes the edges in an image. A Sobel mask treated as a\nranking e will assign a high score to areas of the image with a high gradient (likely edges).\n\n4.2 Experimental setup\n\nWe use a ResNet-50 model for both generating the feature importance estimates and subsequent\ntraining on the modi\ufb01ed inputs. ResNet-50 was chosen because of the public code implementations\n(in both PyTorch [14] and TensorFlow [1]) and because it can be trained to give near state-of-the-art\nperformance in a reasonable amount of time [13].\n\nFor all train and validation images in the dataset, we \ufb01rst apply test-time pre-processing as used by\nGoyal et al. [13]. We evaluate ROAR on three open-source image datasets: ImageNet, Birdsnap and\nFood 101. For each dataset and estimator, we generate new train and test sets that each correspond to\na different fraction of feature modi\ufb01cation t = [0, 10, 30, 50, 70, 90]. 
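The three ensembling rules of Section 4.1 (SG, SG-SQ, Var) differ only in how the J noisy estimates are aggregated. A minimal numpy sketch, using a toy stand-in for the underlying saliency function g (hypothetical names, not the paper's code):

```python
import numpy as np

def ensemble_estimates(grad_fn, x, n_samples=15, sigma=0.15, seed=0):
    """Aggregate noisy importance estimates as in SG, SG-SQ and VarGrad."""
    rng = np.random.default_rng(seed)
    noisy = np.stack([grad_fn(x + sigma * rng.normal(size=x.shape))
                      for _ in range(n_samples)])
    sg = noisy.mean(axis=0)             # Classic SmoothGrad: average the estimates
    sg_sq = (noisy ** 2).mean(axis=0)   # SmoothGrad-Squared: square, then average
    var = noisy.var(axis=0)             # VarGrad: variance of the noisy set
    return sg, sg_sq, var

# Toy stand-in for a saliency function g(x): input-dependent, so the noisy
# estimates actually vary across samples.
grad_fn = lambda x: x * np.array([3.0, 0.0, -1.0, 0.5])
sg, sg_sq, var = ensemble_estimates(grad_fn, np.ones(4))
# var == sg_sq - sg**2, so when the mean estimate is near zero, VarGrad and
# SmoothGrad-Squared coincide (as discussed in Section 4.3.3).
```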
We evaluate 18 estimators in\ntotal (this includes the base estimators, a set of ensemble approaches wrapped around each base, and\n\ufb01nally a set of squared estimates). In total, we generate 540 large-scale modi\ufb01ed image datasets in\norder to consider all experiment variants (180 new test/train for each original dataset).\n\nWe independently train 5 ResNet-50 models from random initialization on each of these modi\ufb01ed\ndatasets and report test accuracy as the average of these 5 runs. In the base implementation, the\nResNet-50 trained on an unmodi\ufb01ed ImageNet dataset achieves a mean accuracy of 76.68%. This\nis comparable to the performance reported by [13]. On Birdsnap and Food 101, our unmodi\ufb01ed\ndatasets achieve 66.65% and 84.54% respectively (average of 10 independent runs). This baseline\nperformance is comparable to that reported by Kornblith et al. [20].\n\n4.3 Experimental results\n\n4.3.1 Evaluating the random ranking\n\nComparing estimators to the random ranking allows us to answer the question: is the estimate\nof importance more accurate than a random guess? It is \ufb01rst worth noting that model\nperformance is remarkably robust to random modi\ufb01cation. After replacing a large portion of all\ninputs with a constant value, the model not only trains but still retains most of the original predictive\npower. For example, on ImageNet, when only 10% of all features are retained, the trained model\nstill attains 63.53% accuracy (relative to the unmodi\ufb01ed baseline of 76.68%). The ability of the model to\nextract a meaningful representation from a small random fraction of inputs suggests that\nmany inputs are likely redundant. The nature of our input (an image, where correlations between\npixels are expected) provides one possible reason for this redundancy.\n\nThe results for our random baseline provide additional support for the need to re-train. We can\ncompare the random ranking on ROAR vs. 
a traditional deletion metric [29], i.e. the setting where we\ndo not retrain. These results are given in Fig. 3. Without retraining, a random modi\ufb01cation of 90% of the inputs\ndegrades accuracy to 0.5%. Keep in mind that on clean data\nwe achieve 76.68% accuracy. This large discrepancy illustrates that without retraining the model,\nit is not possible to decouple the performance of the ranking from the degradation caused by the\nmodi\ufb01cation itself.\n\n4.3.2 Evaluating Base Estimators\n\nNow that we have established the baselines, we can start evaluating the base estimators: GB,\nIG, GRAD. Surprisingly, the left inset of Fig. 4 shows that these estimators consistently perform\nworse than the random assignment of feature importance across all datasets and for all thresholds\nt = [0.1, 0.3, 0.5, 0.7, 0.9]. Furthermore, our estimators fall further behind the accuracy of a random\nguess as a larger fraction t of inputs is modi\ufb01ed. The gap is widest when t = 0.9.\n\nOur base estimators also do not compare favorably to the performance of a Sobel edge \ufb01lter (SOBEL).\nBoth the Sobel \ufb01lter and the random ranking have formulations that are entirely independent of the\nmodel parameters. All the base estimators that we consider have formulations that depend upon the\ntrained model weights, and thus we would expect them to have a clear advantage in outperforming\nthe control variants. However, across all datasets and thresholds t, the base estimators GB, IG, GRAD\nperform on par with or worse than SOBEL.\n\nBase estimators perform within a very narrow range. Despite the very different formulations of the base\nestimators that we consider, their performance falls within a\nstrikingly narrow range. For example, as can be seen in the left column of Fig. 4, for Birdsnap, the\ndifference in accuracy between the best and worst base estimator at t = 90% is only 4.22%. 
This\nrange remains narrow for both Food101 and ImageNet, with a gap of 5.17% and 3.62%, respectively.\nOur base estimator results are remarkably consistent across datasets, methods, and all\nfractions of t considered. The variance is very low across independent runs for all datasets and\nestimators. The maximum variance observed for ImageNet was 1.32%, using SG-SQ-GRAD at 70% of inputs removed. On Birdsnap the highest variance was 0.12% using VAR-GRAD\nat 90% removed. For Food101 it was 1.52% using SG-SQ-GRAD at 70% removed.\n\nFinally, we compare performance of the base estimators using ROAR re-training vs. a traditional\ndeletion metric [29], again, the setting where we do not retrain. In Fig. 3 we see a behavior for the\nbase estimators on all datasets that is similar to the behavior of the inverse (worst possible) ranking\non the toy data in Fig. 2. The base estimators appear to be working when we do not retrain, but they\nare clearly not better than the random baseline when evaluated using ROAR. This provides additional\nsupport for the need to re-train.\n\n4.3.3 Evaluating Ensemble Approaches\n\nSince the base estimators do not appear to perform well, we move on to ensemble estimators.\nEnsemble approaches inevitably carry a higher computational cost, as the methodology requires\nthe aggregation of a set of individual estimates. However, these methods are often preferred by humans\nbecause they appear to produce \u201cless noisy\u201d explanations. Yet there is limited theoretical\nunderstanding of what these methods are actually doing or how this is related to the accuracy of the explanation.\n\nFigure 3: On the left we evaluate three base estimators and the random baseline without retraining.\nAll of the methods appear to reduce accuracy at quite a high rate. On the right, we see, using ROAR,\nthat after re-training most of the information is actually still present. 
It is also striking that in this\ncase the base estimators perform worse than the random baseline.\n\nFigure 4: Left: Grad (GRAD), Integrated Gradients (IG) and Guided Backprop (GB) perform worse\nthan a random assignment of feature importance. Middle: SmoothGrad (SG) is less accurate than\na random assignment of importance and often worse than a single estimate (in the case of raw\ngradients SG-Grad and Integrated Gradients SG-IG). Right: SmoothGrad Squared (SG-SQ) and\nVarGrad (VAR) produce a dramatic improvement in approximate accuracy and far outperform the\nother methods in all datasets considered, regardless of the underlying estimator.\n\nWe evaluate ensemble estimators and produce results that are remarkably consistent\nacross datasets, methods, and all fractions of t considered.\n\nClassic SmoothGrad is less accurate than or on par with a single estimate. In the middle column of Fig. 4,\nwe evaluate Classic SmoothGrad (SG). It averages 15 estimates computed according to an underlying\nbase method (GRAD, IG or GB). However, despite the averaging, SG still degrades test-set accuracy\nless than a random guess does. In addition, for GRAD and IG, SmoothGrad performs worse than a single\nestimate.\n\nSmoothGrad-Squared and VarGrad produce large gains in accuracy. In the right inset of Fig. 4,\nwe show that both VarGrad (VAR) and SmoothGrad-Squared (SG-SQ) far outperform the two control\nvariants. In addition, for all the interpretability methods we consider, VAR or SG-SQ far outperform\nthe approximate accuracy of a single estimate. However, while VAR and SG-SQ bene\ufb01t the accuracy\nof all base estimators, the overall ranking of estimator performance differs by dataset. For ImageNet\nand Food101, the best-performing estimators are VAR or SG-SQ when wrapped around GRAD.\nHowever, for the Birdsnap dataset, the most approximately accurate estimates are these ensemble\napproaches wrapped around GB. 
This suggests that while VAR and SG-SQ consistently improve\nperformance, the choice of the best underlying estimator may vary by task.\n\nNow, why do both of these methods work so well? First, these methods are highly similar. If the\naverage gradient over the noisy samples is zero, then VAR and SG-SQ reduce to the same\nmethod (the variance is then exactly the mean of the squared estimates). For many images, it appears that the mean gradient is much smaller than the mean squared\ngradient. This implies that the \ufb01nal output should be similar. Qualitatively, this seems to be the case\nas well. In Fig. 1 we observe that both methods appear to remove whole objects. The other methods\nremove inputs that are less concentrated and more widely spread over the image. It is important to\nnote that these methods were not forced to behave as such. It is emergent behavior. Understanding\nwhy this happens and why this is bene\ufb01cial should be the focus of future work.\n\nSquaring estimates The \ufb01nal question we consider is why SmoothGrad-Squared (SG-SQ) dramatically\nimproves upon the performance of SmoothGrad (SG) despite little difference in formulation. The\nonly difference between the two estimates is that SG-SQ squares the estimates before averaging. We\nconsider the effect of only squaring estimates (no ensembling). We \ufb01nd that while squaring improves\nthe accuracy of all estimators, the transformation does not adequately explain the large gains that we\nobserve when applying VAR or SG-SQ. When base estimators are squared, they slightly outperform\nthe random baseline (all results included in the supplementary materials).\n\n5 Conclusion and Future Work\n\nIn this work, we propose ROAR to evaluate the quality of input feature importance estimators.\nSurprisingly, we \ufb01nd that the commonly used base estimators, Gradients, Integrated Gradients and\nGuided BackProp, are worse than or on par with a random assignment of importance. 
Furthermore,
certain ensemble approaches such as SmoothGrad are far more computationally intensive but do not improve upon a single estimate (and in some cases are worse). However, we do find that VarGrad and SmoothGrad-Squared strongly improve the quality of these methods and far outperform a random guess. While the low effectiveness of many methods could be seen as a negative result, we view the remarkable effectiveness of SmoothGrad-Squared and VarGrad as important progress within the community. Our findings are particularly pertinent for sensitive domains where the accuracy of an explanation of model behavior is paramount. While we venture some initial consideration of why certain ensemble methods far outperform the other estimators, the divergence in performance between the ensemble estimators is an important direction for future research.

Acknowledgments

We thank Gabriel Bender, Kevin Swersky, Andrew Ross, Douglas Eck, Jonas Kemp, Melissa Fabros, Julius Adebayo, Simon Kornblith, Prajit Ramachandran, Niru Maheswaranathan, Gamaleldin Elsayed, Hugo Larochelle, and Varun Vasudevan for their thoughtful feedback on earlier iterations of this work. 
In
addition, thanks to Sally Jesmonth, Dan Nanas and Alexander Popper for institutional support and encouragement.

References

[1] Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dandelion, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, January 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] Adebayo, Julius, Gilmer, Justin, Muelly, Michael, Goodfellow, Ian J., Hardt, Moritz, and Kim, Been. Sanity checks for saliency maps. In NeurIPS, 2019.

[3] Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for Deep Neural Networks. ArXiv e-prints, November 2017.

[4] Ba, Lei Jimmy and Caruana, Rich. Do deep nets really need to be deep? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp. 2654–2662, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969123.

[5] Bach, Sebastian, Binder, Alexander, Montavon, Grégoire, Klauschen, Frederick, Müller, Klaus-Robert, and Samek, Wojciech. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. 
PloS one, 10(7):e0130140, 2015.

[6] Baehrens, David, Schroeter, Timon, Harmeling, Stefan, Kawanabe, Motoaki, Hansen, Katja, and Müller, Klaus-Robert. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.

[7] Berg, Thomas, Liu, Jiongxin, Lee, Seung Woo, Alexander, Michelle L., Jacobs, David W., and Belhumeur, Peter N. Birdsnap: Large-scale fine-grained visual categorization of birds. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2019–2026, 2014.

[8] Bossard, Lukas, Guillaumin, Matthieu, and Van Gool, Luc. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.

[9] Dabkowski, P. and Gal, Y. Real Time Image Saliency for Black Box Classifiers. ArXiv e-prints, May 2017.

[10] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[11] Fong, Ruth C. and Vedaldi, Andrea. Interpretable explanations of black boxes by meaningful perturbation. In ICCV, pp. 3449–3457. IEEE Computer Society, 2017.

[12] Frosst, N. and Hinton, G. Distilling a Neural Network Into a Soft Decision Tree. ArXiv e-prints, November 2017.

[13] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. ArXiv e-prints, June 2017.

[14] Gross, S. and Wilber, M. Training and investigating Residual Nets, January 2017. URL https://github.com/facebook/fb.resnet.torch.

[15] Guyon, Isabelle, Weston, Jason, Barnhill, Stephen, and Vapnik, Vladimir. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1):389–422, Jan 2002. ISSN 1573-0565. doi: 10.1023/A:1012487302797. 
URL https://doi.org/10.1023/A:1012487302797.

[16] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. ArXiv e-prints, December 2015.

[17] Kim, Been, Wattenberg, Martin, Gilmer, Justin, Cai, Carrie, Wexler, James, Viegas, Fernanda, and Sayres, Rory. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Dy, Jennifer and Krause, Andreas (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2668–2677, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/kim18d.html.

[18] Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (Un)reliability of saliency methods. ArXiv e-prints, November 2017.

[19] Kindermans, Pieter-Jan, Schütt, Kristof T., Alber, Maximilian, Müller, Klaus-Robert, Erhan, Dumitru, Kim, Been, and Dähne, Sven. Learning how to explain neural networks: PatternNet and PatternAttribution. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hkn7CBaTW.

[20] Kornblith, Simon, Shlens, Jonathon, and Le, Quoc V. Do Better ImageNet Models Transfer Better? arXiv e-prints, art. arXiv:1805.08974, May 2018.

[21] Lage, Isaac, Slavin Ross, Andrew, Kim, Been, Gershman, Samuel J., and Doshi-Velez, Finale. Human-in-the-Loop Interpretability Prior. arXiv e-prints, art. arXiv:1805.11571, May 2018.

[22] Montavon, Grégoire, Lapuschkin, Sebastian, Binder, Alexander, Samek, Wojciech, and Müller, Klaus-Robert. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.

[23] Morcos, A. S., Barrett, D. G. T., Rabinowitz, N. C., and Botvinick, M. On the importance of single directions for generalization. 
ArXiv e-prints, March 2018.

[24] Olah, Chris, Mordvintsev, Alexander, and Schubert, Ludwig. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.

[25] Petsiuk, Vitali, Das, Abir, and Saenko, Kate. RISE: randomized input sampling for explanation of black-box models. CoRR, abs/1806.07421, 2018. URL http://arxiv.org/abs/1806.07421.

[26] Raghu, Maithra, Gilmer, Justin, Yosinski, Jason, and Sohl-Dickstein, Jascha. SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, pp. 6078–6087, 2017.

[27] Ross, Andrew Slavin and Doshi-Velez, Finale. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. CoRR, abs/1711.09404, 2017.

[28] Ross, Andrew Slavin, Hughes, Michael C., and Doshi-Velez, Finale. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 2662–2670, 2017.

[29] Samek, W., Binder, A., Montavon, G., Lapuschkin, S., and Müller, K. R. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, Nov 2017. ISSN 2162-237X. doi: 10.1109/TNNLS.2016.2599820.

[30] Selvaraju, Ramprasaath R., Cogswell, Michael, Das, Abhishek, Vedantam, Ramakrishna, Parikh, Devi, and Batra, Dhruv. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[31] Shrikumar, A., Greenside, P., Shcherbina, A., and Kundaje, A. Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. 
ArXiv e-prints, May 2016.

[32] Shrikumar, A., Greenside, P., and Kundaje, A. Learning Important Features Through Propagating Activation Differences. ArXiv e-prints, April 2017.

[33] Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[34] Smilkov, Daniel, Thorat, Nikhil, Kim, Been, Viégas, Fernanda, and Wattenberg, Martin. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

[35] Sobel, Irwin. An isotropic 3x3 image gradient operator. Presentation at Stanford A.I. Project 1968, 02 2014.

[36] Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. In ICLR, 2015.

[37] Sundararajan, Mukund, Taly, Ankur, and Yan, Qiqi. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

[38] Wu, M., Hughes, M. C., Parbhoo, S., Zazzi, M., Roth, V., and Doshi-Velez, F. Beyond Sparsity: Tree Regularization of Deep Models for Interpretability. ArXiv e-prints, November 2017.

[39] Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.

[40] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. ArXiv e-prints, November 2016.

[41] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Object Detectors Emerge in Deep Scene CNNs. ArXiv e-prints, 2014.

[42] Zhou, Bolei, Sun, Yiyou, Bau, David, and Torralba, Antonio. Revisiting the importance of individual units in cnns via ablation. CoRR, abs/1806.02891, 2018. URL http://arxiv.org/abs/1806.02891.

[43] Zintgraf, Luisa M, Cohen, Taco S, Adel, Tameem, and Welling, Max. Visualizing deep neural network decisions: Prediction difference analysis. 
In ICLR, 2017.