Reviews: Disentangling Influence: Using disentangled representations to audit model predictions

Reviewer 1

Originality: While somewhat similar to [Adler et al 2018], which they compare against, this submission does seem to make novel contributions both in their theoretical framing of the problem and in their practical solution, which feel like solid improvements.

Quality: The submission seems technically sound, and the authors seem pretty honest about both strengths and weaknesses. I appreciated the error analyses and the demonstration on toy data (Figure 2), which shows that the method isn't perfect but still captures something important. I do have a few suggestions and points of criticism which I will list below.

Clarity: The paper is mostly well-written and well-organized, though I have a few suggestions for improvements here too.

Significance: I have mixed feelings here. I do feel like this paper provides one of the first mostly-fleshed-out solutions to an extremely important problem, and that the method it provides should work on datasets which are simple enough. However, I worry that on the majority of real-world datasets (e.g. very high-dimensional tabular data or big images), the autoencoder training step is just not going to work, since discriminative models are a lot easier to train than generative ones. In that case, something like TCAV [https://arxiv.org/pdf/1711.11279.pdf] might be a lot more useful for auditing. So I think that limits the significance of this work. But it still could be an important building block, and it does provide quality measures that ensure it doesn't "fail silently."

Updates based on author feedback and reviewer discussion: Overall, I was happy with the author feedback and think the rewrites/reorganizations which were planned will strengthen the paper. Accordingly, I am strengthening my score. I do want to echo one of R2's points, though, which is that, although this is a "local" explanation method, the disentanglement objective is in some sense "global." While I don't think this makes your method equivalent to demographic (dis)parity-based approaches (since one can have demographic parity but still have significant indirect influence by sensitive attributes that happens to average out to zero), it might be worth considering whether there is any notion of "local" disentanglement you could use instead. However, that's pretty difficult, and overall I feel this paper will help advance the field and enhance our ability to address important problems.
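The parenthetical point above, that demographic parity can hold while a sensitive attribute still has large indirect influence that averages out to zero, can be illustrated with a toy construction. The sketch below is not taken from the submission: the XOR classifier, the proxy q that copies p, and the uniform four-person population are all assumptions chosen only to make the effect exact.

```python
from itertools import product

def proxy(p):
    # Assumed proxy feature: q simply copies the sensitive attribute p.
    return p

def model(q, z):
    # Hypothetical classifier that never sees p, only the proxy q
    # and an unrelated binary feature z.
    return q ^ z

# Uniform population over (p, z) in {0, 1} x {0, 1}.
population = list(product([0, 1], [0, 1]))

def positive_rate(group):
    # Fraction of positive predictions among individuals with p == group.
    rows = [(p, z) for p, z in population if p == group]
    return sum(model(proxy(p), z) for p, z in rows) / len(rows)

def signed_effect(z):
    # Change in prediction when p is switched from 0 to 1 and the
    # proxy is allowed to follow it, holding z fixed.
    return model(proxy(1), z) - model(proxy(0), z)

rate_gap = positive_rate(1) - positive_rate(0)       # demographic parity gap
effects = [signed_effect(z) for _, z in population]  # per-individual effects
mean_signed = sum(effects) / len(effects)            # effects cancel on average
mean_abs = sum(abs(e) for e in effects) / len(effects)

print(rate_gap, mean_signed, mean_abs)
```

In this construction the parity gap and the mean signed effect are both zero, yet changing p flips the prediction for every individual, which is the situation a per-individual audit would need to detect.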

Reviewer 2

The authors are interested in computing the influence of a feature p on the prediction of model M, both in terms of individual predictions and global trends. While previous work in interpreting model predictions has focused on computing feature importance via saliency maps and sensitivity analysis of the model to input perturbations, the authors are concerned with the total influence of p on the model predictions, including through model sensitivity to features besides p which nonetheless serve as proxies for p. I believe this is a useful direction of research in model interpretability. I also believe that the idea of learning an invariant/factorized intermediate representation is interesting.

However, in my assessment the submission is not up to the standard required for publication. There are improvements to be made in terms of properly defining and motivating the problem, reducing formalisms that are not used by the implemented method, and improving the experiments. I offer some concrete suggestions below. The experimental justification for the proposed method in particular requires some attention (again, I will elaborate below). Put briefly, the authors deviate from the standard approach to modeling dSprites with disentanglement priors (e.g., in data dimensionality, architectures, and metrics) and their baseline comparisons are not enlightening. As an overall impression, I felt that the paper was not very clearly written, and the claim from the abstract that "our method is more powerful than existing methods for ascertaining feature influence" was not substantiated.

I hope that the authors are not too discouraged by this feedback, since I suspect there is a useful contribution to be had in this direction given further empirical work and a rewrite of the paper. If the authors are interested in more of a fairness spin to this work, I would suggest including at least one additional tabular dataset besides Adult.
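To make the distinction drawn in this review concrete, between sensitivity of M to p itself and total influence of p through proxies, here is a minimal sketch. It is not the authors' method: the linear model, the proxy relation x = 2p, and the function names are hypothetical choices for illustration only.

```python
def proxy(p):
    # Assumed data-generating relation: the proxy feature x is
    # fully determined by the sensitive feature p.
    return 2.0 * p

def model(p, x):
    # Hypothetical model M: it ignores p and uses only the proxy x.
    return 3.0 * x

def direct_effect(p, delta=1.0):
    # Perturb p alone and hold the proxy at its original value,
    # as a plain input-perturbation sensitivity analysis would.
    x = proxy(p)
    return model(p + delta, x) - model(p, x)

def total_effect(p, delta=1.0):
    # Perturb p and let the proxy change with it, so influence
    # routed through x is counted as well.
    return model(p + delta, proxy(p + delta)) - model(p, proxy(p))

direct = direct_effect(1.0)
total = total_effect(1.0)
print(direct, total)
```

Under these assumptions the direct perturbation reports no influence at all, while the total effect is nonzero because all of the influence of p reaches the model through x.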

Reviewer 3

Overall Comments: One significant difficulty in a lot of feature importance/significance assessments is the issue of multicollinearity, which this paper refers to as proxy features. The literature on disentangling could be useful in this setting, and it is interesting that the paper leverages this connection to help address the issue. In my opinion, this connection has not been fully explored in the literature.

Originality: As far as I am aware, this work is the first to combine work from the disentangling literature with the SHAP method for interpretability. One could contend that disentangling is essentially clustering and that feature selection (which is what SHAP does) on top of clustering is not novel; however, I would still say that this work is.

Quality: I consider this paper a comprehensive assessment of the thesis that the paper sets forth, i.e., disentangling helps to better interpret when there is proxy influence. Overall the paper is well written with the exception of section 2.1. I also commend the authors for the overall clarity. I note some points about section 2.1 later in this review.

Significance: The issue of dealing with proxy features and multicollinearity is a timely problem and a particularly potent one in the fairness + ML literature. Here it is important to properly disambiguate the impact/influence of a protected attribute (race, etc.) on the outcome. The goal of the disentangled influence work directly addresses this issue. My biggest issue with this work is with the disentangled representations portion. It is hard to judge the output of these methods and to properly assess whether representations are disentangled, or even to know what it means for something to be disentangled. I'll return to this later in my review. In spite of this, I contend that this is still a useful contribution on the part of the authors.
Some gripes

- Disentangled representations: I have followed this literature lately, and it seems there is significant doubt whether most of the methods in this area work. See in particular the recent ICML paper: https://arxiv.org/abs/1811.12359.pdf. First, it is not clear for what kinds of datasets it is possible to learn a disentangled representation. Second, even if it is possible, the field is not clear on what metrics are useful for assessing this disentanglement. If such disentanglement were possible and verifiable, I agree with the authors that marrying feature selection with disentangled representations is a significant undertaking that should help solve multicollinearity issues. This is my biggest gripe with this line of work; however, since several of the previous papers were accepted at top conferences, I think this work is interesting as a proof of concept of what is possible along these lines.

After the somewhat philosophical rant, this section pertains to some methodological choices the paper makes:

- Why SHAP? It seems to me that if one has disentangled representations, and if the representations are 'truly' disentangled, then a pipeline with any reasonable feature selection method should work here. Here is what I mean: right now the authors propose disentangled representations + SHAP; I argue that disentangled representations + {SHAP, LIME, Correlation Coef, etc} should suffice. I do understand that SHAP has been shown to provably subsume LIME and a whole host of other methods.
- Why not compare your encoder + decoder scheme with betaVAE or some other class of methods that can learn disentangled representations?
- Section 2.1: I read this section a few times, and I understand that it is supposed to provide theoretical justification for your work; however, I don't quite see how it fits with the rest of the paper. Proposition 1 follows pretty straightforwardly from the definition of a disentangled representation, so I am not sure why it is needed.
- The synthetic example (section 3.1) is very good, and indeed demonstrates that disentangled representations + {SHAP} can capture proxy influence from variables 2x, y^2, etc.
- Does the paper address disentanglement of p, the sensitive attribute, specifically, or of all variables? I was confused on this issue. Does the method seek a representation where p has been 'removed', or does the method seek to factor all 'features' in the input into 'independent' components? If it is just to remove the variable p, then it makes sense to not test against disentangled representation methods more generally.
- The paper's section on error analyses gets to the core of my biggest issue with the disentanglement literature. Do the authors envision that people will perform this kind of error analysis before using their method?

Overall, I consider this work to be interesting since it bridges two interesting areas of research. The authors also perform a pretty comprehensive assessment of the proposed method on real datasets.

Update: I have read the author rebuttal, and it indeed clarified some questions that were brought up as part of the review. I am planning to maintain my accept rating for this work.

Paper ID:	2530
Title:	Disentangling Influence: Using disentangled representations to audit model predictions
