Review for NeurIPS paper: Benchmarking Deep Learning Interpretability in Time Series Predictions

NeurIPS 2020

Benchmarking Deep Learning Interpretability in Time Series Predictions

Meta Review

This work introduces a set of benchmarks, with corresponding metrics, for evaluating time series saliency methods. The authors run a number of empirical evaluations, draw conclusions about why certain methods fail, and propose a new saliency method based on those conclusions.

There are several things I like about this work, which the reviewers pointed out as well. There is a definite lack of datasets with ground-truth saliency, so constructing such a dataset (and associated metrics) is a worthy contribution by itself, though perhaps not enough on its own to meet the bar for acceptance at NeurIPS. In general, everyone agreed that this part of the paper is good.

The more controversial question is whether the subsequent analysis is interesting and novel enough. I can see the arguments on both sides. On the one hand, the reliance on mostly time series of *image* data is somewhat limiting in my opinion. I understand that the rebuttal proposes a new dataset, but it is unclear to me how relevant that dataset is. There was also some discussion during the rebuttal period about whether the two-step approach is actually that novel or useful for time series, and whether the experiments show this. There is merit to the criticism that the synthetic results do not necessarily translate into conclusions on real-world datasets. On the other hand, I do think that the idea of using these synthetic benchmarks as a kind of "unit test" could be appealing. The authors also seem to have made a genuine effort with their fMRI results to show real-world relevance.

All in all, I believe this work could appeal to the NeurIPS audience and enrich the literature on this topic, especially if the authors heed the following advice for the camera-ready version: add more bona fide interpretability methods (rather than just saliency methods), make more of an effort to understand how this analysis applies to more realistic datasets, and convince readers of the value of TSR (via extensive ablations if appropriate).