{"title": "Representer Point Selection for Explaining Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 9291, "page_last": 9301, "abstract": "We propose to explain the predictions of a deep neural network, by pointing to the set of what we call representer points in the training set, for a given test point prediction. Specifically, we show that we can decompose the pre-activation prediction of a neural network into a linear combination of activations of training points, with the weights corresponding to what we call representer values, which thus capture the importance of that training point on the learned parameters of the network. But it provides a deeper understanding of the network than simply training point influence: with positive representer values corresponding to excitatory training points, and negative values corresponding to inhibitory points, which as we show provides considerably more insight. Our method is also much more scalable, allowing for real-time feedback in a manner not feasible with influence functions.", "full_text": "Representer Point Selection for\n\nExplaining Deep Neural Networks\n\nChih-Kuan Yeh\u2217\n\nJoon Sik Kim \u2217\n\nIan E.H. Yen\nMachine Learning Department\nCarnegie Mellon University\n\nPradeep Ravikumar\n\n{cjyeh, joonsikk, eyan, pradeepr}@cs.cmu.edu\n\nPittsburgh, PA 15213\n\nAbstract\n\nWe propose to explain the predictions of a deep neural network, by pointing to the\nset of what we call representer points in the training set, for a given test point pre-\ndiction. Speci\ufb01cally, we show that we can decompose the pre-activation prediction\nof a neural network into a linear combination of activations of training points, with\nthe weights corresponding to what we call representer values, which thus capture\nthe importance of that training point on the learned parameters of the network. 
This decomposition provides a deeper understanding of the network than training point influence alone: positive representer values correspond to excitatory training points, and negative values to inhibitory points, which as we show provides considerably more insight. Our method is also much more scalable, allowing for real-time feedback in a manner not feasible with influence functions.\n\n1 Introduction\n\nAs machine learning systems come into wider use, we are starting to care not just about the accuracy and speed of the predictions, but also about why a system made its specific predictions. While we need not always care about the why of a complex system in order to trust it, especially if we observe that the system has high accuracy, such trust typically hinges on the belief that some other expert has a richer understanding of the system. For instance, while we might not know exactly how planes fly in the air, we trust that some experts do. In the case of machine learning models, however, even machine learning experts do not have a clear understanding of why, say, a deep neural network makes a particular prediction. Our work proposes to address this gap by focusing on improving the understanding of experts, in addition to lay users. In particular, expert users could then use these explanations to further fine-tune the system (e.g., dataset/model debugging), as well as to suggest different approaches for model training, so that it achieves better performance.\nOur key approach to do so is via a representer theorem for deep neural networks, which might be of independent interest even outside the context of explainable ML. We show that we can decompose the pre-activation prediction values into a linear combination of training point activations, with the weights corresponding to what we call representer values, which can be used to measure the importance each training point has on the learned parameters of the model. 
Using these representer values, we select representer points \u2013 training points that have large positive or negative representer values \u2013 that could aid the understanding of the model's prediction.\nSuch representer points provide a richer understanding of the deep neural network than other approaches that provide influential training points, in part because of the meta-explanation underlying our explanation: a positive representer value indicates that similarity to that training point is excitatory, while a negative representer value indicates that similarity to that training point is inhibitory, for the prediction at the given test point.\n\n\u2217Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIt is on these inhibitory training points that our approach provides considerably more insight compared to other approaches: specifically, what would cause the model to not make a particular prediction? In one of our examples, we see that the model makes an error in labeling an antelope as a deer. Looking at its most inhibitory training points, we see that the dataset is rife with training images that contain antelopes alongside some other animal, and that are labeled with the other animal. Such images contribute inhibitory effects for small antelopes appearing with other big objects: an insight that we, as machine learning experts, found deeply useful, and which is difficult to obtain via other explanatory approaches. We demonstrate the utility of our class of representer point explanations through a range of theoretical and empirical investigations.\n\n2 Related Work\n\nThere are two main classes of approaches to explaining the prediction of a model. The first class of approaches points to important input features. Ribeiro et al. 
[1] provide such feature-based explanations that are model-agnostic, explaining the decision locally around a test instance by fitting a local linear model in that region. Ribeiro et al. [2] introduce Anchors, which are locally sufficient conditions on features that \u201chold down\u201d the prediction so that it does not change in a local neighborhood. Such feature-based explanations are particularly natural in computer vision tasks, since they enable visualizing the regions of the input pixel space that cause the classifier to make certain predictions. There are numerous works along this line, particularly focusing on gradient-based methods that provide saliency maps in the pixel space [3, 4, 5, 6].\nThe second class of approaches is sample-based: these identify training samples that have the most influence on the model's prediction at a test point. Among model-agnostic sample-based explanations are prototype selection methods [7, 8] that provide a set of \u201crepresentative\u201d samples chosen from the data set. Kim et al. [9] provide criticisms alongside prototypes to explain what is not captured by prototypes. Usually such prototype and criticism selection is model-agnostic and used to accelerate training for classification. Model-aware sample-based explanations identify influential training samples that are the most helpful for reducing the objective loss or making the prediction. Recently, Koh and Liang [10] provided tractable approximations of influence functions that characterize the influence of each sample in terms of change in the loss. Anirudh et al. [11] propose a generic approach to influential sample selection via a graph constructed using the samples.\nOur approach is based on a representer theorem for deep neural network predictions. 
Representer theorems [12] in machine learning contexts have focused on non-parametric regression, specifically in reproducing kernel Hilbert spaces (RKHS), and loosely state that under certain conditions the minimizer of a loss functional over an RKHS can be expressed as a linear combination of kernel evaluations at training points. There have been recent efforts at leveraging such insights in compositional contexts [13, 14], though these largely focus on connections to non-parametric estimation. Bohn et al. [13] extend the representer theorem to compositions of kernels, while Unser [14] draws connections between deep neural networks and such deep kernel estimation, specifically deep spline estimation. In our work, we consider the much simpler problem of explaining pre-activation neural network predictions in terms of activations of training points, which, while less illuminating from a non-parametric estimation standpoint, is arguably much more explanatory and useful from an explainable ML standpoint.\n\n3 Representer Point Framework\n\nConsider a classification problem of learning a mapping from an input space X \u2286 Rd (e.g., images) to an output space Y \u2286 R (e.g., labels), given training points x1, x2, ..., xn and corresponding labels y1, y2, ..., yn. We consider a neural network as our prediction model, which takes the form \u02c6yi = \u03c3(\u03a6(xi, \u0398)) \u2208 Rc, where \u03a6(xi, \u0398) = \u03981 fi \u2208 Rc and fi = \u03a62(xi, \u03982) \u2208 Rf is the last intermediate layer feature in the neural network for input xi. Note that c is the number of classes, f is the dimension of the feature, \u03981 \u2208 Rc\u00d7f is a matrix, and \u03982 comprises all the parameters used to generate the last intermediate layer from the input xi. Thus \u0398 = {\u03981, \u03982} are all the parameters of our neural network model. 
The parameterization above connotes a splitting of the model into a feature model \u03a62(xi, \u03982) and a prediction network with parameters \u03981. Note that the feature model \u03a62(xi, \u03982) can be arbitrarily deep, or simply the identity function, so our setup above is applicable to general feed-forward networks.\nOur goal is to understand to what extent one particular training point xi affects the prediction \u02c6yt of a test point xt, as well as the learned weight parameter \u0398. Let L(x, y, \u0398) be the loss, and (1/n) \u2211_{i=1}^n L(xi, yi, \u0398) be the empirical risk. To indicate the form of a representer theorem, suppose we solve for the optimal parameters \u0398\u2217 = arg min_\u0398 {(1/n) \u2211_{i=1}^n L(xi, yi, \u0398) + g(||\u0398||)} for some non-decreasing g. We would then like our pre-activation predictions \u03a6(xt, \u0398) to have the decomposition \u03a6(xt, \u0398\u2217) = \u2211_{i=1}^n \u03b1i k(xt, xi). Given such a representer theorem, \u03b1i k(xt, xi) can be seen as the contribution of the training data xi to the testing prediction \u03a6(xt, \u0398). However, such representer theorems have only been developed for non-parametric predictors, specifically where \u03a6 lies in a reproducing kernel Hilbert space. Moreover, unlike the typical RKHS setting, finding a global minimum for the empirical risk of a deep network is difficult, if not impossible. In the following, we provide a representer theorem that addresses these two points: it holds for deep neural networks, and for any stationary point solution.\nTheorem 3.1. Let us denote the neural network prediction function by \u02c6yi = \u03c3(\u03a6(xi, \u0398)), where \u03a6(xi, \u0398) = \u03981 fi and fi = \u03a62(xi, \u03982). Suppose \u0398\u2217 is a stationary point of the optimization problem arg min_\u0398 {(1/n) \u2211_{i=1}^n L(xi, yi, \u0398) + g(||\u03981||)}, where g(||\u03981||) = \u03bb||\u03981||^2 for some \u03bb > 0. Then we have the decomposition\n\n\u03a6(xt, \u0398\u2217) = \u2211_{i=1}^n k(xt, xi, \u03b1i),\n\nwhere \u03b1i = -(1/(2\u03bbn)) \u2202L(xi, yi, \u0398)/\u2202\u03a6(xi, \u0398) and k(xt, xi, \u03b1i) = \u03b1i fi^T ft, which we call a representer value for xi given xt.\n\nProof. Note that for any stationary point, the gradient of the objective with respect to \u03981 is equal to 0. We therefore have\n\n(1/n) \u2211_{i=1}^n \u2202L(xi, yi, \u0398)/\u2202\u03981 + 2\u03bb\u0398\u2217_1 = 0 \u21d2 \u0398\u2217_1 = -(1/(2\u03bbn)) \u2211_{i=1}^n \u2202L(xi, yi, \u0398)/\u2202\u03981 = \u2211_{i=1}^n \u03b1i fi^T, (1)\n\nwhere \u03b1i = -(1/(2\u03bbn)) \u2202L(xi, yi, \u0398)/\u2202\u03a6(xi, \u0398) by the chain rule. We thus have that\n\n\u03a6(xt, \u0398\u2217) = \u0398\u2217_1 ft = \u2211_{i=1}^n k(xt, xi, \u03b1i), (2)\n\nwhere k(xt, xi, \u03b1i) = \u03b1i fi^T ft, by simply plugging expression (1) into (2).\n\nWe note that \u03b1i can be seen as the resistance of the training example feature fi toward minimizing the norm of the weight matrix \u03981. Therefore, \u03b1i can be used to evaluate the importance that the training data xi has on \u03981. Note that for any class j, \u03a6(xt, \u0398\u2217)_j = \u0398\u2217_{1j} ft = \u2211_{i=1}^n k(xt, xi, \u03b1i)_j holds by (2). Moreover, we can observe that for k(xt, xi, \u03b1i)_j to have a significant value, two conditions must be satisfied: (a) \u03b1ij should have a large value, and (b) fi^T ft should have a large value. Therefore, we interpret the pre-activation value \u03a6(xt, \u0398)_j as a weighted sum of the feature similarity fi^T ft with weight \u03b1ij. When ft is close to fi with a large positive weight \u03b1ij, the prediction score for class j is increased. 
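The decomposition in Theorem 3.1 can be sanity-checked numerically on a small model. The sketch below (not an experiment from the paper; all names and hyperparameters are illustrative) fits an L2-regularized softmax regression (i.e., the feature model \u03a62 is the identity) by plain gradient descent, computes \u03b1i = -(1/(2\u03bbn)) \u2202L/\u2202\u03a6(xi) at the resulting stationary point, and verifies that \u03981 ft equals \u2211_i \u03b1i fi^T ft at a test point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, lam = 100, 5, 3, 0.01        # training points, feature dim, classes, L2 strength
F = rng.normal(size=(n, d))           # features f_i (identity feature model, f_i = x_i)
y = rng.integers(0, c, size=n)
Y = np.eye(c)[y]                      # one-hot labels

# minimize (1/n) sum_i CrossEntropy(y_i, softmax(Theta1 f_i)) + lam * ||Theta1||^2
Theta1 = np.zeros((c, d))
for _ in range(20000):                # plain gradient descent to a stationary point
    P = np.exp(F @ Theta1.T)
    P /= P.sum(axis=1, keepdims=True)
    grad = (P - Y).T @ F / n + 2 * lam * Theta1
    Theta1 -= 0.5 * grad

# representer coefficients: alpha_i = -1/(2*lam*n) * dL/dPhi(x_i) = -(p_i - y_i)/(2*lam*n)
P = np.exp(F @ Theta1.T)
P /= P.sum(axis=1, keepdims=True)
alpha = -(P - Y) / (2 * lam * n)      # shape (n, c)

f_t = rng.normal(size=d)              # an arbitrary test feature
direct = Theta1 @ f_t                                  # pre-softmax prediction
decomposed = (alpha * (F @ f_t)[:, None]).sum(axis=0)  # sum_i alpha_i * f_i^T f_t
print(np.allclose(direct, decomposed, atol=1e-4))      # True
```

Because the check relies only on stationarity of the regularized objective, the same verification applies when fi comes from a deep feature network, provided the fine-tuning of \u03981 has converged.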
On the other hand, when ft is close to fi with a large negative weight \u03b1ij, the prediction score for class j is decreased.\nWe can thus interpret training points with negative representer values as inhibitory points that suppress the activation value, and those with positive representer values as excitatory examples that do the opposite. We demonstrate this notion with examples further in Section 4.2. We note that such excitatory and inhibitory points provide a richer understanding of the behavior of the neural network: they provide insight both as to why the neural network prefers a particular prediction, as well as why it does not, which is typically difficult to obtain via other sample-based explanations.\n\n3.1 Training an Interpretable Model by Imposing L2 Regularization.\n\nTheorem 3.1 works for any model that performs a linear matrix multiplication before the activation \u03c3, which is quite general and applies to most neural-network-like structures. By simply introducing an L2 regularizer on the weight with a fixed \u03bb > 0, we can easily decompose the pre-softmax prediction value as a finite linear combination of a function between the test and train data. We now state our main algorithm. First we solve the following optimization problem:\n\n\u0398\u2217 = arg min_\u0398 {(1/n) \u2211_{i=1}^n L(yi, \u03a6(xi, \u0398)) + \u03bb||\u03981||^2}. (3)\n\nNote that for the representer point selection to work, we need to achieve a stationary point with high precision. In practice, we find that using a gradient descent solver with line search, or an LBFGS solver, to fine-tune after converging with SGD can achieve a highly accurate stationary point. Note that we can perform the fine-tuning step only on \u03981, which is usually efficient to compute. We can then decompose \u03a6(xt, \u0398) = \u2211_{i=1}^n k(xt, xi, \u03b1i) by Theorem 3.1 for any arbitrary test point xt, where k(xt, xi, \u03b1i) is the contribution of training point xi to the pre-softmax prediction \u03a6(xt, \u0398). We emphasize that imposing L2 weight decay is a common practice to avoid overfitting in deep neural networks, so we do not sacrifice accuracy while achieving a more interpretable model.\n\n3.2 Generating Representer Points for a Given Pre-trained Model.\n\nWe are also interested in finding representer points for a given model \u03a6(\u0398given) that has already been trained, potentially without imposing the L2 regularizer. While it is possible to add the L2 regularizer and retrain the model, the retrained model may converge to a different stationary point and behave differently from the given model, in which case we cannot use the resulting representer points as explanations. Accordingly, we learn the parameters \u0398 while imposing the L2 regularizer, but under the additional constraint that \u03a6(xi, \u0398) be close to \u03a6(xi, \u0398given). In this case, our learning target becomes \u03a6(xi, \u0398given) instead of yi, and our loss L(xi, yi, \u0398) can be written as L(\u03a6(xi, \u0398given), \u03a6(xi, \u0398)).\nDefinition 3.1. We say that a convex loss function L(\u03a6(xi, \u0398given), \u03a6(xi, \u0398)) is \u201csuitable\u201d to an activation function \u03c3 if it holds that for any \u0398\u2217 \u2208 arg min_\u0398 L(\u03a6(xi, \u0398given), \u03a6(xi, \u0398)), we have \u03c3(\u03a6(xi, \u0398\u2217)) = \u03c3(\u03a6(xi, \u0398given)).\nAssume that we are given such a loss function L that is \u201csuitable to\u201d the activation function \u03c3. 
We can then solve the following optimization problem:\n\n\u0398\u2217 \u2208 arg min_\u0398 {(1/n) \u2211_{i=1}^n L(\u03a6(xi, \u0398given), \u03a6(xi, \u0398)) + \u03bb||\u03981||^2}. (4)\n\nThe optimization problem can be seen to be convex under the assumptions on the loss function. The parameter \u03bb > 0 controls the trade-off between the closeness of \u03c3(\u03a6(X, \u0398)) and \u03c3(\u03a6(X, \u0398given)) on the one hand and the computational cost on the other. For a small \u03bb, \u03c3(\u03a6(X, \u0398)) can be arbitrarily close to \u03c3(\u03a6(X, \u0398given)), while the convergence time may be long. We note that the learning task in Eq. (4) can be seen as learning from a teacher network \u0398given while imposing a regularizer that makes the student model \u0398 capable of generating representer points. In practice, we may take \u0398given as an initialization for \u0398 and perform a simple line-search gradient descent with respect to \u03981 in (4). In our experiments, we find that the training for (4) converges to a stationary point in a short period of time, as demonstrated in Section 4.5.\nWe now discuss our design for the loss function mentioned in (4). When \u03c3 is the softmax activation, we choose the softmax cross-entropy loss Lsoftmax(\u03a6(xi, \u0398given), \u03a6(xi, \u0398)), which computes the cross entropy between \u03c3(\u03a6(xi, \u0398given)) and \u03c3(\u03a6(xi, \u0398)). When \u03c3 is the ReLU activation, we choose LReLU(\u03a6(xi, \u0398given), \u03a6(xi, \u0398)) = (1/2) max(\u03a6(xi, \u0398), 0) \u2299 \u03a6(xi, \u0398) - max(\u03a6(xi, \u0398given), 0) \u2299 \u03a6(xi, \u0398), where \u2299 is the element-wise product. In the following proposition, we show that Lsoftmax and LReLU are convex and satisfy the desired suitability property of Definition 3.1. The proof is provided in the supplementary material.\nProposition 3.1. 
The loss functions Lsoftmax and LReLU are both convex in \u03981. Moreover, Lsoftmax is \u201csuitable to\u201d the softmax activation, and LReLU is \u201csuitable to\u201d the ReLU activation, following Definition 3.1.\n\nFigure 1: Pearson correlation between the actual and approximated softmax output (expressed as a linear combination) for train (left) and test (right) data in the CIFAR-10 dataset. The correlation is almost 1 in both cases.\n\nAs a sanity check, we perform experiments on the CIFAR-10 dataset [15] with a pre-trained VGG-16 network [16]. We first solve (4) with loss Lsoftmax(\u03a6(xi, \u0398), \u03a6(xi, \u0398given)) for \u03bb = 0.001, and then calculate \u03a6(xt, \u0398\u2217) = \u2211_{i=1}^n k(xt, xi, \u03b1i) as in (2) for all train and test points. We note that the whole procedure takes less than a minute, given the pre-trained model. We compute the Pearson correlation coefficient between the actual output \u03c3(\u03a6(xt, \u0398)) and the predicted output \u03c3(\u2211_{i=1}^n k(xt, xi, \u03b1i)) for multiple points and plot them in Figure 1. The correlation is almost 1 for both train and test data, and most points lie at both ends of the y = x line.\nWe note that Theorem 3.1 can be applied to any hidden layer with ReLU activation by defining a sub-network whose input is x and whose output is the hidden layer of interest. The training can be done in a similar fashion by replacing Lsoftmax with LReLU. In general, any activation can be used with a derived \"suitable\" loss.\n\n4 Experiments\n\nWe perform a number of experiments on multiple datasets to evaluate our method's performance and compare it with that of influence functions.2 The goal of these experiments is to demonstrate that selecting the representer points is efficient and insightful in several ways. 
Additional experiments discussing the differences between our method and the influence function are included in the supplementary material.\n\n4.1 Dataset Debugging\n\nFigure 2: Dataset debugging performance for several methods. By inspecting the training points using the representer value, we are able to recover the same amount of mislabeled training points as the influence function (right), with the highest test accuracy compared to other methods (left).\n\n2Source code available at github.com/chihkuanyeh/Representer_Point_Selection.\n\nTo evaluate the influence of the samples, we consider a scenario where humans need to inspect the dataset quality to ensure an improvement of the model's performance on the test data. Real-world data is bound to be noisy, and the bigger the dataset becomes, the more difficult it will be for humans to look for and fix mislabeled data points. It is crucial to know which data points are more important than others to the model, so that prioritizing the inspection can facilitate the debugging process.\nTo show how well our method does in dataset debugging, we run a simulated experiment on the CIFAR-10 dataset [17] with a task of binary classification with logistic regression for the classes automobiles and horses. The dataset is initially corrupted, with 40 percent of the data having flipped labels, which naturally results in a low test accuracy of 0.55. A simulated user checks some fraction of the training data, ordered by one of several metrics including ours, and fixes the labels. With the corrected version of the dataset, we retrain the model and record the test accuracy for each metric. For our method, we train an explainable model by minimizing (3) as explained in Section 3.1. The L2 weight decay is set to 1e-2 for all methods for fair comparison. All experiments are repeated for 5 random splits and we report the average result. 
In Figure 2 we report the results for the following metrics: \u201cours\u201d picks the points with the largest |\u03b1ij| for training instance i and its corresponding label j; \u201cinfluence\u201d prioritizes the training points with the largest influence function value; and \u201crandom\u201d picks random points. We observe that our method recovers the same amount of corrupted training data as the influence function while achieving higher test accuracy, and both methods perform better than random selection.\n\n4.2 Excitatory (Positive) and Inhibitory (Negative) Examples\n\nWe visualize the training points with high representer values (both positive and negative) for some test points in the Animals with Attributes (AwA) dataset [18] and compare the results with those of the influence functions. We use a pre-trained ResNet-50 [19] model and fine-tune it on the AwA dataset to reach over 90 percent test accuracy. We then generate representer points as described in Section 3.2. For computing the influence functions, as described in [10], we froze all layers except the last one and trained only that layer. We report the top three points for two test points in Figures 3 and 4. In Figure 3, for an image of three grizzly bears, our method correctly returns three images in the same class with similar looks, similar to the results from the influence function. The positive examples excite the activation values for a particular class and support the decision the model is making. For the negative examples, just like the influence functions, our method returns images that look like the test image but are labeled as a different class. 
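Concretely, once the coefficients \u03b1i from Theorem 3.1 are available, selecting excitatory and inhibitory examples for a test point reduces to sorting the per-class representer values k(xt, xi, \u03b1i)_j = \u03b1ij fi^T ft. A minimal sketch of this selection step (the features and coefficients below are toy values, purely illustrative, not from the paper's experiments):

```python
import numpy as np

def representer_points(alpha, F, f_t, cls, top=3):
    """Rank training points by the representer value k(x_t, x_i, alpha_i)_cls = alpha_{i,cls} * f_i^T f_t."""
    k = alpha[:, cls] * (F @ f_t)           # one representer value per training point
    order = np.argsort(k)
    return order[-top:][::-1], order[:top]  # most excitatory, most inhibitory

# toy setup: 6 training features (4-dim) and their coefficients for 2 classes
F = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.9, 0.1, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.9, 0.1, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
alpha = np.array([[0.50, -0.1], [0.40, -0.2], [-0.30, 0.6],
                  [-0.20, 0.5], [0.06, 0.1], [0.10, 0.0]])
f_t = np.array([1.0, 0.5, 0.25, 0.125])     # test feature, most similar to points 0 and 1
excitatory, inhibitory = representer_points(alpha, F, f_t, cls=0)
print(excitatory, inhibitory)               # points pushing toward / away from class 0
```

In the paper's setting, F would hold last-layer features of the fine-tuned network and \u03b1 would come from the gradient formula in Theorem 3.1; the selection itself being a single sort is what makes real-time feedback feasible.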
In Figure 4, for an image of a rhino, the influence function could not recover useful training points, while ours does: among the negatives it returns similar-looking elephants and zebras that might be confused with rhinos. The negative examples work as inhibitory examples for the model \u2013 they suppress the activation values for a particular class at the given test point because they are in a different class despite their striking similarity to the test image. Such inhibitory points thus provide a richer understanding, even to machine learning experts, of the behavior of deep neural networks, since they explicitly indicate training points that lead the network away from a particular label for the given test point. More examples can be found in the supplementary material.\n\nFigure 3: Comparison of the top three positive and negative influential training images for a test point (left-most column) using our method (left columns) and influence functions (right columns).\n\nFigure 4: Here we can observe that our method provides clearer positive and negative examples while the influence function fails to do so.\n\n4.3 Understanding Misclassified Examples\n\nThe representer values can be used to understand the model's mistakes on a test image. Consider the test image of an antelope predicted as a deer in the left-most panel of Figure 5. Among 181 test images of antelopes, 15 are misclassified, and of these, 12 are misclassified as deer. All 12 of these test images had the four training images shown in Figure 5 among their top inhibitory examples. Notice that we can spot antelopes even in the images labeled as zebra or elephant. Such noise in the labels of the training data confuses the model \u2013 while the model sees both an elephant and an antelope, the label forces it to focus on just the elephant. 
The model thus learns to inhibit the antelope class given an image with small antelopes and other large objects. This insight suggests, for instance, that we could use multi-label prediction to train the network, or clean the dataset to remove training examples that would be confusing to humans as well. Interestingly, the model makes the same mistake (predicting deer instead of antelope) on the second training image shown (third from the left in Figure 5), which suggests that among the training points we should also expect most of the misclassifications to be deer. And indeed, among 863 training images of antelopes, 8 are misclassified, and among them 6 are misclassified as deer.\n\nFigure 5: A misclassified test image (left) and the set of four training images that had the most negative representer values for almost all test images on which the model made the same mistake. The negative influential images all contain antelopes, despite each label being a different animal.\n\n4.4 Sensitivity Map Decomposition\n\nFrom Theorem 3.1, we have seen that the pre-softmax output of the neural network can be decomposed as a weighted sum of products of training point features and the test point feature: \u03a6(xt, \u0398\u2217) = \u2211_{i=1}^n \u03b1i fi^T ft. If we take the gradient with respect to the test input xt on both sides, we get \u2202\u03a6(xt, \u0398\u2217)/\u2202xt = \u2211_{i=1}^n \u03b1i \u2202(fi^T ft)/\u2202xt. Notice that the left-hand side is the widely used notion of a sensitivity map (gradient-based attribution), and the right-hand side suggests that we can decompose this sensitivity map into a weighted sum of sensitivity maps that are native to each i-th training point. This gives us insight into how the sensitivities of training points contribute to the sensitivity of the given test image.\nIn Figure 6, we demonstrate two such examples, one from the class zebra and one from the class moose, from the AwA dataset. 
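This decomposition of the gradient can be checked exactly when the feature model is linear, f = W2 x, since then \u2202(fi^T ft)/\u2202xt = W2^T fi is constant in xt and has a closed form. The following sketch (shapes and values are illustrative; the paper's experiments instead use SmoothGrad on a deep network) verifies that the test point's gradient map is the \u03b1-weighted sum of per-training-point maps:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h, c = 8, 6, 4, 3                   # train points, input dim, feature dim, classes
W2 = rng.normal(size=(h, d))              # linear feature model: f = W2 @ x
F = rng.normal(size=(n, h))               # training features f_i
alpha = rng.normal(size=(n, c))           # representer coefficients (illustrative values)
Theta1 = alpha.T @ F                      # Theta1* = sum_i alpha_i f_i^T, as in Theorem 3.1

j = 0                                     # class whose sensitivity map we decompose
total = (Theta1 @ W2)[j]                  # d(Phi_j)/dx_t computed directly (constant in x_t here)
per_point = alpha[:, j, None] * (F @ W2)  # row i: alpha_ij * d(f_i^T f_t)/dx_t = alpha_ij * (W2^T f_i)^T
print(np.allclose(total, per_point.sum(axis=0)))  # True
```

For a deep feature model the per-training-point maps vary with xt and are obtained with automatic differentiation instead, but the weighted-sum structure is the same.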
The first column shows the test images whose sensitivity maps we wish to decompose. For each example, in the following columns we show the top four influential representer points in the top row, and visualize the decomposed sensitivity maps in the bottom row. We used SmoothGrad [20] to obtain the sensitivity maps.\nFor the first example of a zebra, the sensitivity map on the test image mainly focuses on the face of the zebra. This means that infinitesimally changing the pixels around the face of the zebra would cause the greatest change in the neuron output. Notice that the focus on the head of the zebra is distinctly the strongest in the fourth representer point (last column), where the training image manifests clearer facial features compared to the other training points. For the rest of the training images, which are less demonstrative of the facial features, the decomposed sensitivity maps accordingly show relatively more focus on the background than on the face. For the second example of a moose, a similar trend can be observed \u2013 when the training image exhibits more distinctive bodily features of the moose than of the background (first, second, and third representer points), the decomposed sensitivity map highlights the portion of the moose on the test image more, compared to training images with more features of the background (last representer point). This provides critical insight into the contribution of the representer points to the neuron output that might not be obvious just from looking at the images themselves.\n\nFigure 6: Sensitivity map decomposition using representer points, for the classes zebra (top two rows) and moose (bottom two rows). The sensitivity map on the test image in the first column can be readily seen as the weighted sum of the sensitivity maps for each training point. 
The less a training point displays spurious background features, and the more it displays features of the object of interest, the more its decomposed sensitivity map focuses on the region that the test sensitivity map mainly highlights.\n\n4.5 Computational Cost and Numerical Instabilities\n\nComputation time is a particular issue when computing influence function values [10] for a large dataset, which is very costly for each test point. We report a comparison of computation times in Table 1, measured on the CIFAR-10 and AwA datasets. We randomly selected 50 test points, computed the values for all training data, and recorded the average and standard deviation of the computation time. Note that the influence function does not need the fine-tuning step when given a pre-trained model, hence those values are 0, while our method first optimizes for \u0398\u2217 using line search and then computes the representer values. However, the fine-tuning step is a one-time cost, while the computation time is spent for every test image we analyze. Our method significantly outperforms the influence function, and this advantage grows as more data points are involved. In particular, our approach could be used for real-time explanations of test points, which would be difficult with the influence function approach.\n\nDataset   | Influence Function (Fine-tuning / Computation) | Ours (Fine-tuning / Computation)\nCIFAR-10 | 0 / 267.08 \u00b1 248.20 | 7.09 \u00b1 0.76 / 0.10 \u00b1 0.08\nAwA      | 0 / 172.71 \u00b1 32.63 | 12.41 \u00b1 2.37 / 0.19 \u00b1 0.12\n\nTable 1: Time required for computing an influence function / representer value for all training points and a test point, in seconds. 
The computation of Hessian-vector products for the influence function alone took longer than our combined computation time.

While ranking the training points according to their influence function values, we observed numerical instabilities, discussed further in the supplementary material. For CIFAR-10, over 30 percent of the test images had all-zero training point influences, so the influence function was unable to provide positive or negative influential examples. The distribution of the values is shown in Figure 7, where we plot the histogram of the maximum of the absolute values for each test point in CIFAR-10. Notice that over 300 test points out of 1,000 lie in the first bin for the influence functions (right); we verified that all data in the first bin had the exact value of 0. More than 200 further points lie in the range [10^-40, 10^-28], values which may create numerical instabilities in computations. In contrast, our method (left) returns non-trivial and more numerically stable values across all test points.

Figure 7: The distribution of influence/representer values for 1,000 randomly selected test points in CIFAR-10. While our values (left) are larger and more evenly spread out across test points, the influence function values (right) can be extremely small or exactly zero for some points, as seen in the left-most bin.

5 Conclusion and Discussion

In this work we proposed a novel method for selecting representer points, the training examples that are influential to the model's prediction. To do so we introduced a modified representer theorem that generalizes to most deep neural networks, which allows us to linearly decompose the prediction (activation) value into a sum of representer values.
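As a concrete illustration, the decomposition can be sketched in a few lines of NumPy. This is a minimal sketch with assumed shapes and random values, not the authors' code: in the paper, the global sample importance alpha_i comes from the loss gradient at the fine-tuned optimum, whereas here it is randomly drawn purely to show that summing the representer values alpha_i f_i^T f_t over training points reproduces the linear last-layer prediction.

```python
import numpy as np

# Minimal sketch of the representer decomposition (illustrative values only;
# alpha in the paper is derived from loss gradients, not sampled at random).
rng = np.random.default_rng(0)
n, d, c = 100, 16, 10             # training points, feature dim, classes

F = rng.normal(size=(n, d))       # last-layer features f_i of the training points
alpha = rng.normal(size=(n, c))   # global sample importance alpha_i, one row per point
f_t = rng.normal(size=d)          # last-layer feature of the test point

# Representer value of training point i for class j: alpha_ij * (f_i^T f_t)
representer_values = alpha * (F @ f_t)[:, None]   # shape (n, c)

# The pre-activation prediction is the sum of representer values over training points
phi = representer_values.sum(axis=0)              # shape (c,)

# Equivalently, Theta = alpha^T F plays the role of the learned last-layer
# weights, so the decomposition reproduces the usual linear prediction exactly.
theta = alpha.T @ F
assert np.allclose(phi, theta @ f_t)

# Positive representer values mark excitatory points, negative ones inhibitory:
most_excitatory = np.argsort(representer_values[:, 0])[-4:]   # for class 0
```

Sorting the representer values per class, as in the last line, is how the top excitatory and inhibitory training examples shown in the figures would be selected under this sketch.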
The optimization procedure for learning these representer values is tractable and efficient, especially when compared against the influence functions proposed in [10]. We have demonstrated our method's advantages and performance on several large-scale models and image datasets, along with insights into how these values allow users to understand the behavior of the model.
An interesting direction from here would be to use the representer values for data poisoning, as in [10]. To see whether our method is applicable to domains beyond image datasets and to different types of neural networks, we also plan to extend it to NLP datasets with recurrent neural networks. The result of a preliminary experiment is included in the supplementary material.

Acknowledgements

We acknowledge the support of DARPA via FA87501720152, and Zest Finance.

References

[1] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144. ACM, 2016.

[2] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence, 2018.

[3] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[4] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.

[5] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

[6] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek.
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.

[7] Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, pages 2403-2424, 2011.

[8] Been Kim, Cynthia Rudin, and Julie A Shah. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In Advances in Neural Information Processing Systems, pages 1952-1960, 2014.

[9] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280-2288, 2016.

[10] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885-1894, 2017.

[11] Rushil Anirudh, Jayaraman J Thiagarajan, Rahul Sridhar, and Timo Bremer. Influential sample selection: A graph signal processing approach. arXiv preprint arXiv:1711.05407, 2017.

[12] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416-426. Springer, 2001.

[13] Bastian Bohn, Michael Griebel, and Christian Rieger. A representer theorem for deep kernel learning. arXiv preprint arXiv:1709.10441, 2017.

[14] Michael Unser. A representer theorem for deep neural networks. arXiv preprint arXiv:1802.09210, 2018.

[15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[16] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[18] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning: A comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600, 2017.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[20] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

[21] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 142-150. Association for Computational Linguistics, 2011.