NeurIPS 2020

Attribution Preservation in Network Compression for Reliable Network Interpretation


Review 1

Summary and Contributions: Update based on author feedback: The authors addressed both of my main concerns, and so, as promised, I am raising my score. On the subject of novelty:

1) I agree with other reviewers that the similarities to Zagoruyko et al. reduce novelty (particularly since R2 pointed out that they failed to clarify that Zagoruyko et al. also included sensitivity-based regularizers). However, in Zagoruyko et al. they seem to have recommended the activation-based regularization as the preferred method (they wrote "we also trained a network with activation-based AT in the same training conditions, which resulted in the best performance among all methods"). Thus, I think it is meaningful that these authors have demonstrated that a sensitivity-based approach works better.

2) As R2 noted based on the author response, a key difference between the approach in this work and the approach in Zagoruyko et al. is that Zagoruyko et al. takes the gradient w.r.t. the loss but in this work they take the gradient w.r.t. an output logit. In addition to losing task-specific information, the gradient of a categorical cross-entropy loss tends to zero for examples that are predicted confidently & correctly - thus, taking the gradient w.r.t. the loss would focus on examples where the model's predictions are most *incorrect* (those examples would both have the largest loss and the largest loss gradients); by contrast, taking the gradient w.r.t. the task logit (as these authors do) would avoid both those issues. I think this difference is nontrivial (a short worked derivation is sketched after this summary).

3) I somewhat disagree with the line in the author response re. Zagoruyko et al. where they say "since they only match the gradients at the input level, the information in the intermediate layers (and thus their decision processes) are not appropriately transferred" because the idea of matching at higher layers is very explicitly acknowledged in Zagoruyko et al., who write "in this work we consider only gradients w.r.t. the input layer, but in general one might have the proposed attention transfer and symmetry constraints w.r.t. higher layers of the network."

4) I agree with the authors of the present work that the matter of framing is important; in Zagoruyko et al., they focused specifically on improving performance. As many reviewers noted, this work seems to be the first to point out that even when performance is retained, the attribution maps can become distorted during compression. That in itself is a valuable observation; without work like this, people may not have thought to perform attribution regularization if they felt that the performance was fine.

Ideally, the authors would compare to the sensitivity-based regularizer from Zagoruyko et al. and show that their proposed approach works better. However, given that the authors of Zagoruyko et al. recommended the activation-based regularizer as the best, and that the authors of the present work showed that they outperformed the activation-based regularizer, I personally feel that the work is above the acceptance threshold.

-------------

The paper introduces the problem that a compressed model may have distorted attributions relative to the parent model. The authors propose a solution to this problem wherein the attributions of the child model are regularized to match those of the parent model.
Using image segmentation for ground-truth explanations, they empirically verify that, across three different compression strategies (distillation, structured pruning and unstructured pruning), the resulting compressed models both resemble the parent model in terms of the attributions and also have higher-quality attributions compared to naively-compressed models.
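To make point 2) above concrete, a short worked derivation (standard softmax cross-entropy algebra, not taken from the paper): with logits $z$, softmax probabilities $p$, one-hot label $y$, and true class $c$,

    \frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_k} = p_k - y_k
    \qquad\Rightarrow\qquad
    \frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_c} = p_c - 1 \;\to\; 0 \quad \text{as } p_c \to 1,

so loss gradients (and anything backpropagated from them) vanish for confidently correct examples, whereas the gradient of the logit $z_c$ itself carries no such $(p_c - 1)$ factor and remains informative for exactly those examples.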

Strengths: - To my knowledge, the problem formulation is novel; I don't believe other works have specifically investigated the issue of distorted attributions in compressed models. - The methodology makes sense, and the empirical results seem convincing in terms of matching the attributions to the parent model and also obtaining higher-quality attributions w.r.t. the image segmentation ground truth.

Weaknesses: (1) My primary concern is that while the authors have demonstrated that models compressed using their proposed strategy obtain better results with respect to quantifying the quality of the attributions w.r.t. a ground-truth segmentation task, the authors stated that "if we evaluate the attribution performance on the entire test set, models with low predictive performance are naturally in a disadvantage. To compensate for this effect and compare the attributions of all models on the same ground truth, we only consider the samples that each model predicted correctly". When compressing models, we *do* also care about whether the distilled model retains good predictive performance, and from the tables/figures in the main text I did not get a sense of whether the **predictions** of the models distilled using the proposed method are also more accurate (the figures and tables all appear to report measures of attribution performance, not of prediction performance). The authors have listed the trustworthiness of a model as a reason for using the proposed method - the trustworthiness of a model also depends on having reliable predictions, so I think it is essential to report this as well. I noted that Tables 3 and 4 in the supplement have a column labeled "accuracy" that shows that the accuracies of SWA & SSWA are not lower relative to "Naive" - can the authors confirm whether "accuracy" in these columns refers to prediction accuracy? If so, could the authors report a similar measure of prediction performance for the tables in the main text? **If the authors can confirm that predictive performance is still maintained when compared to the naive compression strategy, I would be willing to revise my score towards acceptance.**

(2) (This is a more minor concern as I don't expect it to be likely, but I think it is still worth thinking about) The authors evaluated the saliency maps w.r.t. (a) whether the saliency maps of the child model resemble those of the parent model, and (b) whether the saliency maps match an image segmentation ground truth - however, I did not see a measure of whether the saliency maps are **indicative of the decision-making process of the child model**. It has been shown before that saliency maps are "fragile" in that they are susceptible to adversarial perturbations (https://arxiv.org/abs/1710.10547), so it is conceivable to me that a child model could generate a saliency map (via a method like Grad-CAM) that superficially mimics the saliency map of the parent model but which is not reflective of the child model's decision-making process (particularly because Grad-CAM relies on gradients, which don't incorporate saturation effects as discussed in the DeepLIFT/IntegratedGradients papers). In such a situation, the saliency maps would be **misleading** with respect to the child model. I do not expect it to be the case that this is happening, but I think it is worth addressing for completeness. The perturbation metric proposed in the FullGrad paper, where the *least important* pixels are perturbed and the change in the model performance is quantified, may be a good way to measure the extent to which the saliency map is faithful to the child model's decision-making process: https://arxiv.org/abs/1905.00780
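For concreteness, a minimal sketch (in PyTorch-style Python, assuming a `model`, an input `image` of shape (C, H, W), a spatial `saliency` map of shape (H, W), and an integer `label`; none of these names come from the paper or the FullGrad code) of the suggested least-important-pixel perturbation check:

    import torch

    def least_important_perturbation_curve(model, image, saliency, label, steps=10):
        # Progressively zero out the *least* salient pixels and record the model's
        # confidence in the original label; a faithful saliency map should leave
        # the confidence largely unchanged.
        order = torch.argsort(saliency.flatten())          # least important pixels first
        mask = torch.ones_like(saliency).flatten()
        chunk = order.numel() // steps
        confidences = []
        for s in range(steps):
            mask[order[s * chunk:(s + 1) * chunk]] = 0.0   # remove the next chunk of pixels
            perturbed = image * mask.view_as(saliency)     # broadcast the mask over channels
            with torch.no_grad():
                prob = torch.softmax(model(perturbed.unsqueeze(0)), dim=1)[0, label]
            confidences.append(prob.item())
        return confidences

A rapid drop in this curve for the child model would indicate that its saliency maps, however well they match the parent's, are not faithful to its own decision process.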

Correctness: My main concern about the methodology, as described in the previous section, is that the child model may be superficially mimicking the saliency map of the parent model without actually adopting the parent model's decision-making process (particularly if a "local" explanation method like Grad-CAM is used). Perturbation experiments could address this concern. However, I do not expect this phenomenon to be likely, so it is not a very major concern. Another small concern: in line 193, the authors describe applying a ReLU operation to the channel importance obtained from Grad-CAM. In general, discarding negative gradients (as is done in, e.g., Guided Backprop and DeconvNet) has been shown to diminish the quality of the attributions (e.g. by making them prone to failing sanity checks; https://arxiv.org/abs/1912.09818). I am thus somewhat concerned that the authors felt the need to discard negative channel importance here, because negative importance can still be relevant for classification.
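For reference, a minimal Grad-CAM-style sketch of the channel-importance computation this concern refers to (the generic Grad-CAM recipe, not necessarily the authors' exact implementation), with an optional rectification of the per-channel weights marking the step being questioned:

    import torch
    import torch.nn.functional as F

    def gradcam(activations, logit, rectify_weights=True):
        # activations: (C, H, W) feature maps of the chosen layer, still attached to
        # the graph that produced the scalar target-class logit.
        grads, = torch.autograd.grad(logit, activations, retain_graph=True)
        weights = grads.mean(dim=(1, 2))          # global-average-pooled gradients, one per channel
        if rectify_weights:
            weights = F.relu(weights)             # discards negatively contributing channels
        cam = F.relu((weights.view(-1, 1, 1) * activations).sum(dim=0))
        return weights, cam

Setting rectify_weights=False keeps negative channel importance, which is what the review suggests can still be relevant for classification.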

Clarity: For the most part, yes. One piece I was unclear on was whether only correctly-predicted examples from the parent model were used for regularization during training (the authors wrote that "we only consider the samples that each model correctly predicted" in the context of the test set because those are the attributions that are likely to be reliable; I was unsure whether this was also leveraged during training).

Relation to Prior Work: To my knowledge, yes (I am not very familiar with the model compression literature, hence my lower confidence rating).

Reproducibility: Yes

Additional Feedback: I have mentioned some suggestions under "Weaknesses"; what's listed here are more minor issues:

(1) Assuming that the Grad-CAM backpropagation was started w.r.t. the logits of the softmax layer, it may be a good idea to normalize the logits so that their mean across all classes is zero. Normalizing the logits of a softmax does not change the output of the softmax, but it would change the attributions (in that, if a particular channel has the same contribution to all softmax logits, it is effectively contributing to none of the softmax logits). This is also mentioned in the section "Adjustments for softmax layers" in the DeepLIFT paper: https://arxiv.org/pdf/1704.02685.pdf (A tiny self-contained check illustrating this invariance follows these comments.)

(2) I think it is worth reflecting on the extent to which image segmentation is a good "ground truth" explanation, because I think background pixels can often be relevant for a class prediction (for example, if the background is green, then a prediction of "cow" is more likely than if the background is pink). That said, I agree that, broadly speaking, the "pointing game" measure is likely valid (i.e. the peak attribution should fall within the segmented region).

(3) (minor) I would be curious how Grad-CAM (which averages the gradients over a channel) performs relative to simply doing "activation*gradient" at each individual neuron in the convolutional layer.
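To illustrate the invariance mentioned in (1), a tiny self-contained check (illustrative only, not code from the paper): mean-centering the logits leaves the softmax probabilities unchanged while changing the individual logit values that gradient-based attributions start from.

    import torch

    logits = torch.tensor([4.0, 1.0, 1.0])
    centered = logits - logits.mean()            # shift so the class-wise mean is zero

    print(torch.softmax(logits, dim=0))          # identical probabilities ...
    print(torch.softmax(centered, dim=0))
    print(logits[0].item(), centered[0].item())  # ... but different logits to backpropagate from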


Review 2

Summary and Contributions: The paper starts from the observation that compressed networks can produce attribution maps significantly different from the corresponding original uncompressed networks, despite having comparable accuracy. The authors argue that this is problematic, as similar accuracy does not necessarily mean that the two networks process information in the same way. They propose an attribution-based regularization term to steer the fine-tuning towards local minima that have both high predictive accuracy and good matching of attributions between the original and the compressed network.

Strengths: As neural networks will be increasingly used in safety-critical domains, the problem of understanding how they process input information is important. To the best of my knowledge, the observation that compression techniques might shift the attention of the network towards less relevant input features, despite preserving the model accuracy, is novel and therefore potentially relevant for the XAI and security community. The authors show empirically on VOC and ImageNet that it is possible to mitigate the problem of "attribution shift" by employing attributions as a regularization term, and that this often produces better results than the simple activation matching proposed by Zagoruyko et al. 2017. The paper and the proposed method are easy to understand. The proposed regularization technique is based on a well-known attribution method and easy to implement. The framework can be readily applied to several compression techniques, such as structured/unstructured pruning and KD.

Weaknesses: I have two main concerns regarding this paper:

1) As the goal is to match the attributions between two networks, the idea of adding an attribution-based regularization term to the cost function seems like a trivial and straightforward solution to me. Moreover, a very similar regularization term was previously proposed by Zagoruyko et al. 2017, who investigated not only activation matching but also the use of gradient-based attributions (in particular, the sensitivity maps of Simonyan et al.). While Zagoruyko et al. formulated the problem as "attention transfer", practically their motivation was the same: ensuring that a student model "pays attention" to the same features as the parent. Although it is true that they were only interested in improving the network accuracy, I believe this paper does not add a significant contribution to the method. The authors suggest the use of Grad-CAM as an attribution method, but there is no theoretical or empirical evidence that it provides better results than the sensitivity maps suggested by Zagoruyko et al. or other gradient-based attribution methods (Gradient x Input, Integrated Gradients, DeepLIFT, or others). Finally, stochastic matching does not seem to have a theoretical justification. What is the rationale for dropping randomly selected channels, and why should this work better than using actual attributions? The connection to dropout is not clear to me.

2) The experimental section does not provide error bounds. As the performance gap between the different methods seems marginal (in particular between EWA and (S)SWA), I wonder if the difference is significant at all. The results might be affected by some stochasticity in the pipeline (e.g. SGD and channel sampling in SSWA). Without providing any standard deviation for these results across different runs, it is impossible to assess the significance of the results.
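For concreteness, a minimal sketch of what a regularizer in this general family looks like (the names, the normalization, and the choice of an L2 distance are illustrative assumptions, not the exact loss of either this paper or Zagoruyko et al.):

    import torch
    import torch.nn.functional as F

    def attribution_matching_loss(student_map, teacher_map, eps=1e-8):
        # student_map, teacher_map: (B, H, W) attribution maps (e.g. Grad-CAM-style),
        # computed on the same inputs and target classes for both networks.
        s = student_map.flatten(1)
        t = teacher_map.flatten(1)
        s = s / (s.norm(dim=1, keepdim=True) + eps)   # normalize so only the spatial pattern matters
        t = t / (t.norm(dim=1, keepdim=True) + eps)
        return F.mse_loss(s, t)

    # total_loss = task_loss + beta * attribution_matching_loss(student_cam, teacher_cam)

The open question raised above is which attribution (sensitivity maps, Grad-CAM, Gradient x Input, etc.) should be plugged in as the map, and whether that choice matters empirically.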

Correctness: Some claims require clarification. In particular, the connection between Stochastic matching and dropout. Claims such as "preserves the interpretation of the original networks" and "significant performance gains" cannot be assessed without error bounds in the experimental section.

Clarity: While the paper is generally understandable, I have the feeling that it would benefit from professional proofreading, as several sentences sound odd to me (as a non-native speaker).

Relation to Prior Work: The paper should better explain what is inherited from Zagoruyko et al.: currently, it suggests that Zagoruyko et al. only investigated equally weighted activation matching, while in fact they also investigated sensitivity-based regularizers. There is also a line of work [1-3] that investigated training neural networks using attributions as regularizers. The authors might want to compare and contrast with these works. [1] https://arxiv.org/abs/1703.03717 [2] https://arxiv.org/abs/1906.10670 [3] https://arxiv.org/pdf/1909.13584.pdf

Reproducibility: Yes

Additional Feedback: Due to the lack of novelty in the method and empirical results that are not particularly strong, I believe this paper requires some more work. However, I still believe that both the motivation of the paper and the observation of the attribution-shift phenomenon during compression are relevant. I would suggest the authors do some more in-depth analysis of the phenomenon. This could include, for example, investigating why it occurs in the first place, showing attributions computed by other methods (is the problem evident with methods other than Grad-CAM?), discussing (possibly from a theoretical point of view) how attributions can be so different despite the accuracy being similar (are the logits preserved? What about the activations of the hidden layers?), and discussing possible robustness implications (if attributions look wrong despite high accuracy, does it mean that the network is actually less robust and more sensitive to wrong areas of the input? This could be investigated with ablation tests).

Some minor comments:
- Table 1: "AUC" is not defined. Even if a person understands that this is the area under the curve, it is not clear which dimensions define the curve until the reader reaches page 7. The authors might consider making the caption of the figure more explicit.
- l. 166: "V() is a rectification function" - does this mean V() is a ReLU? Probably not, as in (3) it is chosen to be the identity. The authors might want to change the definition of V() to avoid confusion.
- It is not clear to me whether the weight U of stochastic matching is purely based on the Bernoulli distribution (as it seems from the first section of page 6) or whether instead the stochasticity is added on top of the gradient-based weights obtained by (4), as line 223 seems to suggest ("its stochastic version"). I believe it is the former, but then why call it SSWA?
- Page 7: "mAP", used in Figure 3 and other tables, is nowhere defined.
- l. 277: is the full network trained from scratch or only fine-tuned?
- There is often a missing whitespace before an opening bracket "(", e.g. line 290.

==============================================

I increase my score after reading the author response. The newly provided results with deviations and the clarification about the differences with Zagoruyko et al. are convincing. On the other hand, the novelty of the method remains limited, and I believe this could be a much stronger contribution with a discussion/comparison of other gradient-based attribution methods and with the loss function used by Zagoruyko et al. I still believe that a more in-depth analysis of the phenomenon of attribution shift (with some of the open questions mentioned above) would be very interesting and could make the work stronger.


Review 3

Summary and Contributions: This paper highlights the surprising fact that network compression, while maintaining the accuracy of the original network, changes the regions of attention of the network, making it less explainable. This is addressed by introducing a regularization term that encourages the attribution maps of the student network to match those of the teacher network.

Strengths: - As far as I am aware, this paper is the first work to notice that the regions on which a network focuses are affected by compression/distillation. I find this surprising and interesting. - The experiments demonstrate convincingly that the proposed SWA regularizer addresses this issue. - The paper is clearly written and could be relatively easily reproduced (not to mention that the code is provided).

Weaknesses: Technical novelty: As acknowledged by the authors, [4] proposed a very similar regularizer (see Eq. 2 in [4]). In fact, the form of Eq. 2 in [4] is quite general, as any function F() could potentially be used. In practice, the authors of [4] studied several functions, i.e., not only the one referred to as EWA in this submission, although this one was the best-performing one in [4]. Altogether, I acknowledge that the motivation behind [4] was different from the one here and that the proposed formulation is somewhat more general and more effective than the one in [4]. However, I feel that the technical novelty remains on the weak side.

Presentation: While the paper is clearly written, it could benefit from some additional analysis. In particular: - As mentioned above, I do appreciate the interest of observing that attribution maps are affected by compression. However, I feel that the authors fail to study and explain why this happens. In particular, in the context of pruning, I find it particularly surprising that the fine-tuning stage does not address this issue. I would be glad to hear some hypothetical explanations from the authors. - What is the motivation behind the rectification function V()? Why does one need it, and why is a ReLU an appropriate choice (better than alternatives)? - What is the motivation behind the stochastic matching approach?

Experiments: The experiments are in general convincing. However: - It would be interesting to study the sensitivity to \beta. - The additional results on ImageNet in the supplementary material (Table 3) show that the compressed networks have a higher AUC than the full network. Can the authors explain this?

#### POST-REBUTTAL COMMENTS ####
I would like to thank the authors for their responses. I acknowledge that there are some differences w.r.t. [4]. However, I still feel that the similarities leave the novelty on the weak side for NeurIPS. Furthermore, while the rebuttal indeed clarifies a few points, others remain unclear, such as why fine-tuning post-pruning doesn't solve the problem by itself, the motivation behind the function V(.), and the influence of \beta. Therefore, while this paper is essentially borderline, I tend to remain slightly on the rejection side.

Correctness: The claims and methodology are correct.

Clarity: The paper is clearly written, but one point nonetheless bothers me: At the beginning of Section 4, the authors mention that Grad-Cam is used to generate the attribution maps. However, it seems to me that these maps depend on the regularizer used, i.e., they are generated using Eq. 3 for EWA, using Eq. 4 for SWA, and using the stochastic variant for SSWA. Is Grad-Cam used for some other purpose?

Relation to Prior Work: The relation to prior work is acknowledged, although, as discussed above, the technical novelty over [4] is limited.

Reproducibility: Yes

Additional Feedback: - Strictly speaking, experimental evaluation is not a contribution and should thus not be listed as such in the introduction. - In unstructured pruning, is the regularizer used in every fine-tuning step?


Review 4

Summary and Contributions: This paper aims to compress neural networks while preserving visual attributions. The authors observe that existing network compression methods only focus on matching the performance of the target network, so their attributions do not match those of the target network. An attribution-aware compression method is proposed and evaluated on the PASCAL VOC 2012 and ImageNet datasets, under several network compression techniques: structured pruning, unstructured pruning, and knowledge distillation.

Strengths: + This paper is well-written and easy to follow. + It is interesting to find that the existing network compression methods do not preserve the attribution map, and the method to address the problem is well-motivated. + Evaluation is done on several network compression techniques and several datasets.

Weaknesses:
- Novelty: Finding that existing network compression methods do not preserve attributions is interesting, but this problem has already been partially addressed in [4]. Even [4] considers gradient-based attention (with respect to the input image x).
- Differentiable attribution methods: The requirement that the attribution method be differentiable seems nontrivial. Methods that do not use gradients seem difficult to apply, for example: Fong et al., Interpretable explanations of black boxes by meaningful perturbation; Fong et al., Understanding deep networks via extremal perturbations and smooth masks; Schulz et al., Restricting the Flow: Information Bottlenecks for Attribution; Chang et al., Explaining Image Classifiers by Counterfactual Generation. (A minimal sketch of such a perturbation-based method follows this list.)
- Generalization for different attribution methods: An experiment on generalization across different attribution methods is missing. It would be interesting to add experiments on how well a network, trained with a regularizer that preserves attribution maps from Grad-CAM, preserves the attributions obtained by other attribution methods. Especially, if some methods cannot be applied within this framework (because of differentiability), please verify that the interpretations from those (non-differentiable) methods are preserved by the proposed method trained with Grad-CAM.
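As a concrete illustration of why perturbation-based attribution methods are awkward to use as training regularizers, a minimal occlusion-style sketch (a deliberate simplification, not any of the cited methods): the map is built from many forward passes with masked inputs rather than from a single differentiable expression, so matching it between teacher and student during training is not as straightforward as matching a gradient-based map.

    import torch

    def occlusion_attribution(model, image, label, patch=16):
        # image: (C, H, W). Attribution = drop in the target-class score when each
        # patch is masked out; computed with repeated forward passes, not gradients.
        _, H, W = image.shape
        attribution = torch.zeros(H // patch, W // patch)
        with torch.no_grad():
            base = model(image.unsqueeze(0))[0, label]
            for i in range(H // patch):
                for j in range(W // patch):
                    masked = image.clone()
                    masked[:, i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
                    attribution[i, j] = base - model(masked.unsqueeze(0))[0, label]
        return attribution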

Correctness: Their findings are reasonable and well visualized. The method was proposed to solve the problems, and evaluated appropriately.

Clarity: This paper is well-written and easy to follow.

Relation to Prior Work: 1. It is necessary to analyze more attribution methods and study whether they can be used with the proposed method.

Reproducibility: Yes

Additional Feedback: Minor comments: (1) Results on the ImageNet dataset are shown in the supplementary material, but the reviewer thinks it would be better to include them in the main paper. (2) Among the raw images in Figure 1, only the third image contains a white contour. (3) In Table 3 (supplementary), at prune ratio 60%, EWA has higher point accuracy than SSWA, but the result of SSWA is mistakenly bolded. (4) It would be interesting to show results (in Tables 2 and 3) for lightweight networks trained from scratch, to observe the localization ability of those networks.

======= Post author feedback =======

Thanks a lot for the authors' reply. I have read all the comments from the other reviewers and the author feedback, and I would keep my original rating. (1) I still feel that the novelty is limited: the modification relative to Zagoruyko et al. is minor. This paper provides a discussion different from the existing method, but considering the high standard of NeurIPS, I think this modification is not sufficient. (2) I was concerned about whether non-differentiable attribution methods, i.e. methods other than gradient-based ones, could be applied, but this was not addressed in the rebuttal. The generalization experiment in the rebuttal is also performed only with gradient-based methods.