Reviews: Fooling Neural Network Interpretations via Adversarial Model Manipulation

This work addresses one of the most important problems in AI today: the explainability of AI systems, which becomes more important as more systems interact with people. Perhaps the most notable example is the use of machine learning in parole decisions (https://www.researchgate.net/publication/315886656_An_impact_assessment_of_machine_learning_risk_forecasts_on_parole_board_decisions_and_recidivism).

On the negative side, the examples the authors provided in their submission are far from sufficient. Discrimination should be demonstrated on real-life cases, as the authors did in their rebuttal (on a small-scale problem). While the image examples are insightful, they are unfortunately not aligned with the paper's motivation and main insight, and this makes the paper significantly weaker. This matters all the more because parole decisions and credit underwriting decisions do not use these deep learning architectures, so the attacks and defenses will be quite different.

To conclude, we agree that this submission is a teaser that has yet to be proved, but we prefer to see such work at NeurIPS. We would also like the authors to refer to the following:

[1] Gu, Tianyu, Brendan Dolan-Gavitt, and Siddharth Garg. "Badnets: Identifying vulnerabilities in the machine learning model supply chain." arXiv preprint arXiv:1708.06733 (2017).
[2] Adi, Yossi, et al. "Turning your weakness into a strength: Watermarking deep neural networks by backdooring." 27th USENIX Security Symposium (USENIX Security 18). (2018).

Paper ID:	1687