NeurIPS 2020

Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests

Review 1

Summary and Contributions: The paper proposes a feature shift detection algorithm based on conditional distribution tests.

Strengths: The problem is interesting and novel.

Weaknesses: More experiments based on real applications are required to justify the effectiveness of the proposed method. - The experiments on real-world data are insufficient, and the results seem poor compared to the simulations. More explanation is required. - In Table 1, please explain why Marginal-KS performs so badly. - In Table 1, what does the first column, Rec, represent? I've read the response and the comments from the other reviewers. The response addressed my concerns well, as the authors added more real-world experiments, and the results seem promising. Thus, I will increase my score.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This manuscript proposes to detect feature shift during observation of a multivariate signal. In particular, the authors propose a novel approach based on the score function. This approach is appealing for its computational properties and for the possibility of relying on flexible generative models to fit the data. A set of experiments, focusing on a compelling time-series setting, shows that the model actually identifies which covariates shifted.

Strengths: This manuscript posits a novel and interesting extension of the outlier detection problem, with added interpretability constraints where one needs to identify which latent variables shifted. The proposed method is also simple and appealing, as it requires fitting only one multivariate black-box density model (and not one per hypothesis). The simulated experiments based on multivariate Gaussian distributions are well designed and compelling.

Weaknesses: The manuscript heavily emphasizes the possibility of using this method for arbitrary dependence structures, in particular those modelled with a deep density model (e.g., normalizing flows). This is mentioned in the abstract, in the introduction, and in Section 2. However, it does not appear in the experiments (even the real-world data case appears to be treated with a multivariate Gaussian). The discussion of statistical significance deserves to be extended further. In lines 255-258, it is mentioned that the bootstrap is used for simulating the null hypothesis. + Does that mean that the model must be fit multiple times, or that only the score function must be computed with respect to these bootstrap datasets? If it is just the score function, why do we expect the density to be accurate in those regions, especially in the case of a deep density model? + How does this method perform at controlling the False Discovery Rate?
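The distinction the bootstrap question draws can be made concrete with a small sketch (all names hypothetical, and a toy difference-in-means statistic standing in for the paper's score-based one): the model is fit once, and only the statistic is recomputed on each bootstrap resample of the pooled data.

```python
import numpy as np

# Hedged sketch, not the authors' procedure: the null distribution is
# simulated by resampling the pooled data and recomputing ONLY the test
# statistic per resample, i.e. no model refitting.
def bootstrap_p_value(statistic, X_ref, X_query, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = statistic(X_ref, X_query)
    pooled = np.vstack([X_ref, X_query])
    n = len(X_ref)
    null_stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(pooled), size=len(pooled))
        resampled = pooled[idx]
        null_stats.append(statistic(resampled[:n], resampled[n:]))
    null_stats = np.array(null_stats)
    # standard "add one" bootstrap p-value
    return (1 + np.sum(null_stats >= observed)) / (1 + n_boot)

# toy statistic: absolute difference in means of the first feature
stat = lambda a, b: abs(a[:, 0].mean() - b[:, 0].mean())
rng = np.random.default_rng(1)
X_ref = rng.normal(size=(200, 2))
X_shift = rng.normal(size=(200, 2))
X_shift[:, 0] += 1.0  # shift only the first feature
p = bootstrap_p_value(stat, X_ref, X_shift)
```

Under this reading, only the (cheap) statistic touches the bootstrap datasets, which is exactly why the reviewer's concern about density accuracy in resampled regions applies.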

Correctness: What is in this manuscript seems reasonable to me

Clarity: The paper is well written and clear. Minor points: KNN is not defined before being used. A space is missing in line 175. Words are missing or added in lines 222 and 224. Typo in line 248. Missing reference in line 356.

Relation to Prior Work: To my knowledge, related work is cited

Reproducibility: Yes

Additional Feedback: After author feedback ------ I would like to thank the authors for running more experiments, in particular for the deep density model. I think these make the paper stronger and more relevant!

Review 3

Summary and Contributions: The paper studies the question of which features lead to a distribution shift. The authors formalize this problem as multiple conditional distribution hypothesis tests and propose both non-parametric and parametric statistical tests. In particular, they build on the idea of the density model score function to construct flexible statistics.

Strengths: The paper studies the important problem of attributing distribution shift to specific features. Formulating this task as a statistical problem of multiple conditional distribution hypothesis tests opens the door to many existing algorithms in conditional testing. The resulting proposal leverages this connection and utilizes a computationally efficient density model score function. Notably, this statistic can be computed for all dimensions in a single forward and backward pass. Moreover, it inherits the flexibility of current density estimators. The formulation of the task of distribution shift attribution is an interesting and important contribution. The development of a computationally efficient test statistic makes it applicable to modern applications in complicated settings.
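The "all dimensions at once" property the review praises can be illustrated with a toy sketch (not the authors' code: a multivariate Gaussian stands in for the density model, since its score grad_x log p(x) = -inv(Sigma) (x - mu) is analytic and already yields one value per feature in a single evaluation).

```python
import numpy as np

def gaussian_score(x, mu, cov):
    """Score grad_x log N(x; mu, cov) = -inv(cov) (x - mu), all dims at once."""
    return -np.linalg.solve(cov, (x - mu).T).T

def per_feature_statistic(X_clean, X_query, mu, cov):
    """Difference of average squared score per feature; large values
    flag candidate shifted features (a toy stand-in for the paper's test)."""
    s_clean = gaussian_score(X_clean, mu, cov)
    s_query = gaussian_score(X_query, mu, cov)
    return (s_query ** 2).mean(axis=0) - (s_clean ** 2).mean(axis=0)

rng = np.random.default_rng(0)
mu, cov = np.zeros(3), np.eye(3)
X_clean = rng.normal(size=(500, 3))
X_query = rng.normal(size=(500, 3))
X_query[:, 1] += 2.0  # shift only feature 1
stat = per_feature_statistic(X_clean, X_query, mu, cov)
shifted_feature = int(np.argmax(stat))
```

With a neural density model the same idea applies: one backward pass of log p(x) with respect to x returns the full score vector, so the per-feature statistic costs a single forward and backward pass.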

Weaknesses: While the test statistic is computationally tractable and flexible, it is unclear how the use of flexible density estimators may affect the power of the tests. In particular, the proposed statistic is compatible with any density model, including deep density models such as normalizing flows or autoregressive models. However, these flexible density models are known to require a large number of samples to produce good density estimates. When the sample size is limited, this may decrease the power of the statistical tests for distribution shift attribution. This aspect of using flexible density estimators is worth discussing in the paper. The paper has also extended the proposal to a time-series setting. However, the results in Tables 3 and 4 appear quite sensitive to the choice of window size. A discussion of how to choose the window size in time-series settings would be very helpful, especially since such distribution shift tasks commonly occur in time series.

Correctness: The paper appears correct.

Clarity: The paper is quite well-written.

Relation to Prior Work: The paper adequately discussed prior work.

Reproducibility: Yes

Additional Feedback: See above. -------------- Thank you to the authors for the rebuttal. I have read the rebuttal and my evaluation stays the same.

Review 4

Summary and Contributions: The paper addresses distribution shift detection, casting it as a conditional shift problem designed for multivariate settings. Through the use of the density model score function, an efficient algorithm is given that uses just a single forward and backward pass and can be combined with modern density models based on neural networks. A key differentiator for this work is the desire to localise a shift (e.g., which sensor in a sensor network) as well as detect it. ========= Post rebuttal: It's commendable that the authors ran more experiments using deep density models, in response to all reviewers. Pleasingly, the Deep-SM model seems to do even better, although I found the table in the rebuttal a little hard to parse. The authors also answered my technical points satisfactorily. I've raised my score accordingly.

Strengths: - Neat application of the score function method to statistical testing via the Fisher divergence - The attack model is well constructed, and the range of Gaussian copula models used in the simulation study is well thought out

Weaknesses: - The KNN approach to building a conditional density seems slightly strange. It would seem that other non-parametric approaches, such as K-D trees, might be better suited to this task. - One of the purported advantages of the score function approach is the ability to use modern density models. It's therefore a pity that these aren't used in the paper, for example neural-kernelized conditional density estimation [1] or the methods in [2]. - The real-world experiments are very preliminary. [1] Sasaki, Hiroaki, and Aapo Hyvärinen. "Neural-kernelized conditional density estimation." arXiv preprint arXiv:1806.01754 (2018). [2] Rothfuss, Jonas, et al. "Conditional density estimation with neural networks: Best practices and benchmarks." arXiv preprint arXiv:1903.00954 (2019).

Correctness: Method is correct. Empirical methodology seems solid.

Clarity: Mostly well written and clear. There's no conclusions section, presumably due to lack of space, which together with the brief discussion of the real-world experiments gives an "unfinished" feel to the paper. In the model-free approach, I can see that A and B describe whether the sample is among the nearest neighbours from both p and q, but what is the distance function phi? Is it simply the indicator function?
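One plausible reading of the phi question above, sketched with hypothetical names rather than the paper's exact definition: if phi is an indicator of same-sample membership, the model-free statistic reduces to the fraction of each point's k nearest neighbours in the pooled data that come from the same sample.

```python
import numpy as np

# Sketch of a KNN two-sample reading (hypothetical, not the paper's
# definition): phi(i, j) = 1 if points i and j come from the same sample.
def knn_same_sample_fraction(X_p, X_q, k=5):
    pooled = np.vstack([X_p, X_q])
    labels = np.array([0] * len(X_p) + [1] * len(X_q))
    fractions = []
    for i, x in enumerate(pooled):
        d = np.linalg.norm(pooled - x, axis=1)
        d[i] = np.inf  # exclude the point itself
        nn = np.argsort(d)[:k]
        fractions.append(np.mean(labels[nn] == labels[i]))  # indicator phi
    # ~0.5 when p = q; approaches 1 under a strong shift
    return float(np.mean(fractions))

rng = np.random.default_rng(2)
same = knn_same_sample_fraction(rng.normal(size=(100, 2)),
                                rng.normal(size=(100, 2)))
shifted = knn_same_sample_fraction(rng.normal(size=(100, 2)),
                                   rng.normal(size=(100, 2)) + 3.0)
```

Under this reading phi is indeed just the indicator function, applied to neighbourhood membership rather than acting as a distance.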

Relation to Prior Work: The related work on shift detection is well described. The novelties in terms of the score function-based inference and the localization of shifts are clearly positioned relative to previous works.

Reproducibility: Yes

Additional Feedback: - KS: expand on first use (L185). - Duplicate citations: [19] & [20].