NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5401 CXPlain: Causal Explanations for Model Interpretation under Uncertainty

### Reviewer 1

I'm not very familiar with recent approaches to explanation methods in ML, hence my low confidence.

1) I am not entirely convinced that an amortized explanation model is a reasonable thing to consider; this may stem from my incomplete understanding of the major use cases of an explanation model. I imagine it to be most useful in practice for investigating outliers / failure cases of the system in question. If this is correct, the explanation method does not necessarily need to be very fast, as it is only used in rare failure cases. Furthermore, under this use-case assumption, the explanation model would only be useful if it matches the true model on (catastrophic) misclassification examples; this is inherently hard to guarantee if it is not taken into account during the design and training of the explanation model. A straightforward computation of feature importance on the ground-truth system would, I presume, be more appropriate in this case.

2) Eqn (5): This definition of feature importance only considers contributions of features towards the ground-truth label. It does not attribute importance to features that potentially massively change how the model goes wrong, as long as the error stays the same; e.g., on ImageNet it would not assign importance to pixels that change the prediction from one wrong class to another, although intuitively this could be an interesting piece of information for debugging a classifier. Could the authors comment on this choice?

3) I might be misunderstanding the evaluation of the feature importance uncertainty in Section 4.2. Why can the authors not compute the ground-truth feature importance on the test set and check whether the predictive uncertainty is well calibrated to these held-out values? I.e., why is the rank-based method necessary?
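For concreteness, the Eqn (5)-style importance the review discusses can be sketched as the normalized increase in loss when each feature is masked in turn. This is a minimal illustration under assumptions: `f`, `loss`, and zero-masking here are hypothetical stand-ins, not the paper's exact masking operation or code.

```python
import numpy as np

def granger_importance(f, loss, x, y):
    """Sketch of a Granger-causal feature importance: for each feature i,
    measure how much the loss against the ground-truth label y increases
    when feature i is removed (zeroed here, as one masking choice),
    then normalize the non-negative increases to sum to one."""
    base = loss(y, f(x))                  # error with all features present
    deltas = np.empty(len(x))
    for i in range(len(x)):
        x_masked = x.copy()
        x_masked[i] = 0.0                 # mask feature i only
        deltas[i] = loss(y, f(x_masked)) - base
    deltas = np.clip(deltas, 0.0, None)   # keep non-negative contributions
    total = deltas.sum()
    return deltas / total if total > 0 else deltas

# Toy linear model: only the first two features matter.
f = lambda x: 2.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]
loss = lambda y, y_hat: (y_hat - y) ** 2
omega = granger_importance(f, loss, np.array([1.0, 1.0, 1.0]), 3.0)
# omega ≈ [0.8, 0.2, 0.0]: importance tracks the coefficient magnitudes
```

Note that, as the review's point (2) observes, this quantity is defined relative to the ground-truth label `y`: masking a feature that merely swaps one wrong prediction for another (leaving the loss unchanged) receives zero importance.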

### Reviewer 2

*Update after author response:* Thanks to the authors for their thoughtful and detailed responses. Most of the responses were quite clarifying (esp. regarding the note that there is a separate $\Omega$ for each $x$!) and I'll increase my score from a (4) to a (5). I still think much more clarity is needed in describing the methodology, the overall goal, and defining carefully what is meant by "causal." Other referees noted that the interest in this paper is Granger causality and not in understanding what might have happened under intervention (the authors mention this as well in their response, noting that "our goal is not to estimate what would happen if a particular feature's value changed"). In its present form, I worry that most readers will think the paper is about this second notion and get quite confused.

---

The authors set a rather ambitious goal for themselves: to produce fast and accurate estimates of variable importance that can be applied to *any* machine learning model. They do this by framing the question causally, i.e., identifying which inputs causally affect the outputs of a machine learning model. They note, importantly, that their procedure does not generally provide any information about the causal mechanisms that generated the data; all it does is provide an explanation for a trained model's predictions.

That being said, I am not sure that I completely understood the specifics of the proposed procedure. In what follows, I'll attempt to summarize my understanding, and I would greatly appreciate any clarification from the authors. The authors begin by assuming that one has pairs $(x_n, y_n)$ of covariates $x_n$ and observations $y_n$, and that one has access to a prediction machine $\hat{f}$. Further, one is able to use this prediction machine to obtain $\hat{y}_n$ and $\hat{y}_{n,-i}$, the predicted outcomes for observation $n$ given the full set of predictors $x_n$ and given all the predictors but the $i$-th, respectively.
Based on these two predictions, the authors define a discrepancy $\Delta \epsilon_{X,i}$ as the difference in score/loss between the two predictions. To draw an analogy to standard linear model theory, this discrepancy plays a role similar to the partial F-statistic in linear models: it measures how much is added, from a predictive standpoint, by including a specific covariate in a model that already contains all of the others. From these discrepancies, the authors then derive a single vector of importance weights, $\Omega$. Note that this vector is specific to the prediction model $\hat{f}$, and one can similarly define another set of importance weights for a different prediction model.

As far as I understood, the central idea of CXPlain is to train an explanation model so that the corresponding set of importance weights is as close as possible to the true set of importance weights computed using the original prediction model. If so, it would be helpful to make this point explicit in the exposition. Additionally, can the authors clarify what is being averaged in the definition of the causal objective? If I understood correctly, the importance weights $\omega$ are computed for the entire dataset and not for an individual datapoint. As such, it does not initially make sense why we would average over the entire dataset, as suggested by the notation. If this is not the case, I would ask the authors to be a bit more precise in their notation.

Notwithstanding these minor notational points, I am still confused about why one needs to do anything after computing $\Omega$. After all, the quantity $\omega(i)$ precisely measures the relative gain in prediction from adding predictor $i$ to a model that already includes the other predictors.
Furthermore, if the goal is to determine what might happen to our predictions if we change a particular feature slightly while keeping all others fixed, I don't see any role for the explanation model: one can simply compute the new prediction. Finally, if we take the explanation model to be exactly the original prediction model, we may trivially minimize the proposed causal objective. In light of this, I would appreciate some additional clarification about what is gained by learning $A$ rather than just reporting $\Omega$ directly. Some additional clarity on why the authors are using a KL discrepancy is also merited: why not use, say, the Euclidean distance between the vector $\Omega$ and the importance weights derived from the explanation model?

---

Originality: The authors note that the causal objective was first introduced in reference 14. The main contribution therefore seems to be a different architecture for the explanation model.

Clarity: The paper was well-written but somewhat terse in motivating the specific methodology proposed.

Quality & Significance: I am unable to comment on the quality or significance, as it is not clear to me why the explanation model is needed in general.
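The causal objective questioned in the review can be sketched as an average of per-sample KL divergences, assuming (as the authors clarified in their response) one target importance vector $\Omega_n$ per sample $x_n$, so that averaging over $n$ is meaningful. The function and variable names below are illustrative, not the authors' code.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two normalized importance vectors;
    eps guards against log of zero."""
    p = np.asarray(p) + eps
    q = np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def causal_objective(Omega, A):
    """Average per-sample KL between target importances Omega[n]
    (one vector per sample x_n) and the explanation model's
    predicted importances A[n]."""
    return float(np.mean([kl(w, a) for w, a in zip(Omega, A)]))

# Two samples, each with its own target importance vector.
Omega = [np.array([0.8, 0.2]), np.array([0.5, 0.5])]
perfect = causal_objective(Omega, Omega)                  # ~0 when A matches
off = causal_objective(Omega, [np.array([0.5, 0.5])] * 2) # > 0 otherwise
```

To probe the review's Euclidean-distance question empirically, one could swap `kl` for a squared distance in `causal_objective`; the KL choice treats each importance vector as a probability distribution over features rather than a point in Euclidean space.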