NeurIPS 2020

Understanding Global Feature Contributions With Additive Importance Measures

Review 1

Summary and Contributions: The authors propose a feature importance method, Shapley additive global importance (SAGE), by summarizing additive importance measures using Shapley values.

Strengths: The use of Shapley values to summarize global feature importance is interesting. Results seem sound and correct, and the problem has high relevance in ML.

Weaknesses: The main limitation of the proposed method is its lack of novelty. Shapley values have been used before, for example in SHAP, so the main contribution here seems to be the approximation of SAGE using sampling. The paper should therefore have focused more on this particular contribution and on how it could be better than prior related methods. On a related note, the paper spends too much space on conceptual review and pushes most of the results and discussion to the supplement. It is not clear what the real advantage of SAGE is. Computing Shapley values directly is computationally prohibitive even for a reasonable number of features, so the authors resort to sampling. What sample size is needed to obtain consistent Shapley values? In the empirical study, for instance, it is hard to assess SAGE for this same reason: since SAGE scores are random, the authors should have at least included error bars in the results, say, in Figure 1. In any case, the proposed method doesn't seem to perform significantly better than permutation tests and SHAP.
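To make the error-bar request concrete, here is a minimal sketch of a permutation-sampling Shapley estimator that reports a standard error per player. It is not the authors' SAGE implementation: the toy cooperative game `toy_value` and the helper names `sampled_shapley` and `exact_shapley` are hypothetical, and a SAGE-style estimator would additionally sample data points and impute held-out features.

```python
import math
import random
from itertools import permutations


def sampled_shapley(value_fn, n_players, n_samples, seed=0):
    """Estimate Shapley values from random permutations.

    Returns (means, standard_errors), one entry per player.
    Assumes n_samples >= 2.
    """
    rng = random.Random(seed)
    sums = [0.0] * n_players
    sq_sums = [0.0] * n_players
    players = list(range(n_players))
    for _ in range(n_samples):
        rng.shuffle(players)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in players:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            delta = cur - prev  # marginal contribution of p in this ordering
            sums[p] += delta
            sq_sums[p] += delta * delta
            prev = cur
    means = [s / n_samples for s in sums]
    std_errors = []
    for m, sq in zip(means, sq_sums):
        # Unbiased sample variance of the marginal contributions.
        var = max(sq / n_samples - m * m, 0.0) * n_samples / (n_samples - 1)
        std_errors.append(math.sqrt(var / n_samples))
    return means, std_errors


def exact_shapley(value_fn, n_players):
    """Exact Shapley values by enumerating every ordering (small games only)."""
    totals = [0.0] * n_players
    orderings = list(permutations(range(n_players)))
    for ordering in orderings:
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in ordering:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev
            prev = cur
    return [t / len(orderings) for t in totals]


def toy_value(coalition):
    """Hypothetical game: player 0 is worth 1 alone; players 1 and 2 are worth 2 together."""
    value = 0.0
    if 0 in coalition:
        value += 1.0
    if 1 in coalition and 2 in coalition:
        value += 2.0
    return value


estimates, std_errors = sampled_shapley(toy_value, n_players=4, n_samples=2000)
exact = exact_shapley(toy_value, n_players=4)
for i, (est, se, ref) in enumerate(zip(estimates, std_errors, exact)):
    print(f"player {i}: estimate {est:.3f} +/- {1.96 * se:.3f} (exact {ref:.3f})")
```

Plotting `1.96 * se` around each estimate gives approximate 95% error bars, and rerunning with a larger `n_samples` shows how quickly the intervals shrink, which is one way to answer the sample-size question.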

Correctness: The results seem to be correct.

Clarity: The paper is well written and clear.

Relation to Prior Work: Adequate.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper proposes SAGE, a new model-agnostic method for feature importance that takes feature interactions into account (these are usually ignored by other methods). SAGE builds on concepts from prior work, namely the SHAP method, but extends it to global (as opposed to local) explanations for any model. Additionally, the implementation of the proposed algorithm has efficiency advantages over prior work.

Strengths: 1. The method appears to be a solid new tool for feature selection, which is of great significance for the proper and effective use of ML in any domain. 2. Being model-agnostic makes the method widely applicable. 3. The writing is very clear and well organized. 4. The overview and discussion of related work are nicely done. 5. Extensive experiments are provided to support the main claims.

Weaknesses: Some questions I have regarding the paper: 1. Is it possible to provide uncertainty estimates for the estimated feature importance? 2. Since it is difficult to assess feature importance on real data sets, I was wondering whether the authors considered simulation studies with synthetic data, where the ground truth general model would be available? 3. What would be the implication of using a conditional quantile in Eq. 2 instead of the conditional mean? Would this make the method more robust to asymmetric, heteroscedastic distributions, and would it still satisfy the required properties in 3.1? 4. Can this method be useful/applicable as a regularizer over latent-space features or embeddings in deep models? Or would the computational overhead make it inefficient? 5. Is there an experiment where the method is applied to the same dataset but with different models underlying SAGE? It would be interesting to see whether the selected features overlap across different implementations.
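As an illustration of question 2, the following sketch builds a synthetic regression problem where the ground-truth importances are known in closed form. Everything here is an assumption made for the example rather than the paper's setup: the data are linear with independent standard-normal features, the model is the true linear function, and the value of a feature subset is taken to be the reduction in mean squared error when only that subset is visible (held-out features are replaced by their mean of 0, which equals the conditional expectation only because of the linearity and independence assumptions). Under these assumptions the Shapley share of feature i should be close to beta_i squared. The helper names are hypothetical.

```python
import random
from itertools import permutations


def make_data(beta, n, noise_sd, seed=0):
    """Draw independent N(0, 1) features and a noisy linear response."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, y


def mse_with_subset(beta, X, y, subset):
    """MSE of the true linear model when only features in `subset` are visible.

    Hidden features are replaced by their mean (0), so they drop out of the sum.
    """
    total = 0.0
    for row, target in zip(X, y):
        pred = sum(beta[i] * row[i] for i in subset)
        total += (target - pred) ** 2
    return total / len(y)


def exact_shapley(value_fn, d):
    """Exact Shapley values by enumerating all orderings (fine for small d)."""
    cache = {}

    def v(members):
        key = frozenset(members)
        if key not in cache:
            cache[key] = value_fn(key)
        return cache[key]

    totals = [0.0] * d
    orderings = list(permutations(range(d)))
    for ordering in orderings:
        members = set()
        prev = v(members)
        for i in ordering:
            members.add(i)
            cur = v(members)
            totals[i] += cur - prev
            prev = cur
    return [t / len(orderings) for t in totals]


beta = [2.0, 1.0, 0.0]  # the third feature is pure noise by construction
X, y = make_data(beta, n=20000, noise_sd=0.5)
base_mse = mse_with_subset(beta, X, y, frozenset())

# Value of a subset = how much it lowers the MSE relative to seeing no features.
importance = exact_shapley(
    lambda s: base_mse - mse_with_subset(beta, X, y, s), len(beta)
)
ground_truth = [b * b for b in beta]

for i, (est, truth) in enumerate(zip(importance, ground_truth)):
    print(f"feature {i}: estimated {est:.3f}, ground truth {truth:.3f}")
```

A check of this kind only validates the estimator in a setting where the answer is known. Correlated features or a misspecified model would need a different ground-truth derivation.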

Correctness: Yes.

Clarity: Yes, I find the paper very well written. A couple of suggestions: - The presentation could be improved by including a graphical motivating example in the introduction (something along the lines of the intuition provided on line 125). - In section 2.3, the paragraphs for the 3 different subgroups could be titled (just bold the text at the beginning); this might improve readability. - I think $m$ was not defined/mentioned prior to Thm 1.

Relation to Prior Work: Yes, I appreciate Table 1 which summarizes prior work. Missing references that could be discussed: - Stefan Depeweg, José Miguel Hernández-Lobato, Steffen Udluft, and Thomas A. Runkler. Sensitivity analysis for predictive uncertainty. In ESANN, 2017. - Schwab, Patrick, and Walter Karlen. "CXPlain: Causal explanations for model interpretation under uncertainty." Advances in Neural Information Processing Systems. 2019.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: In order to understand the inner workings of machine learning models, the paper assesses the role of individual input features in a global sense. (1) This paper proposes a new feature importance method, named SAGE. SAGE applies Shapley values to represent the predictive power of subsets of features. (2) This work also introduces a framework of additive importance measures. The framework unifies many existing methods. (3) An efficient sampling-based approximation method is proposed. The method is faster than the naive calculation.

Strengths: (1) The paper proposes a new feature importance method, Shapley additive global importance (SAGE), which is model-agnostic. (2) SAGE is evaluated on eight datasets, and the quantitative metrics show that SAGE is more representative of the predictive power associated with each feature.

Weaknesses: Some important claims are not clearly described, such as: (1) why the method SAGE is model-agnostic. (2) How the unifying framework of additive importance measures is used in machine learning methods. Some references are missing, e.g., on line 145 of the paper, "we apply a game-theoretic solution": I cannot find the reference for the game-theoretic solution. On line 184 of the paper, "the model $f$ is optimal": why is it optimal? I cannot find a description.
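For context, the game-theoretic solution concept that Shapley-value-based methods build on is the Shapley value (Shapley, 1953). In its textbook form it reads as follows. The notation is assumed here for illustration and may not match the paper's: $D$ is the set of $d$ features, $v$ is a value function on feature subsets, and $\phi_i$ is the share assigned to feature $i$.

```latex
\phi_i(v) = \sum_{S \subseteq D \setminus \{i\}}
    \frac{|S|!\,(d - |S| - 1)!}{d!}
    \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

The sum has $2^{d-1}$ terms per feature, which is why exact computation is impractical beyond a small number of features.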

Correctness: Some claims need to be described in detail.

Clarity: Well written.

Relation to Prior Work: The paper clearly discusses how it differs from previous contributions.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The paper introduces a new method for computing feature importance based on Shapley values. The two points that the paper emphasizes are additivity and feature interactions.

Strengths: The paper is theoretically rigorous and provides consistency and convergence theorems for an approximation algorithm for scaling.

Weaknesses: The experiments do not demonstrate any practical value. First of all, the datasets are really trivial. The ML algorithms used (MLP, SVM, Logistic Regression) are either outdated or not used in the right context. For example, logistic regression is used on high-dimensional datasets. The authors are missing the most popular methods for tabular data, namely boosted trees (XGBoost, CatBoost, LightGBM). It would also be absolutely necessary to use datasets from Kaggle competitions. More importantly, they need to replicate winning solutions that do heavy feature engineering with aggregations and conditionals.

Correctness: The theoretical claims are correct, but the empirical results are insignificant.

Clarity: The paper is well written.

Relation to Prior Work: The related work is properly analyzed and compared to other approaches. The authors provide a very informative table summarizing the comparison with other methods.

Reproducibility: No

Additional Feedback: