NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 926
Title: A New Defense Against Adversarial Images: Turning a Weakness into a Strength

Reviewer 1

I like the paper: it is mostly nicely written and well motivated, and the method seems solid. However, there are some issues with the evaluation. While the authors clearly worked hard to produce good evaluations, adversarial defenses are very hard to evaluate, and I feel more is needed to convince me of the merits of this method.

Detailed remarks:
- The authors claim to use BPDA, which is a method for overcoming non-differentiable transformations, but there is nothing non-differentiable in objective 3 or 4 (unlike the number of steps). One can backpropagate through the gradients themselves, as is commonly done in second-order methods (see the sketch below); these results should be added as well.
- As is easily seen, in this work and others, there can be great variance in robustness against various attacks. To show the validity of a defense, it needs to be tested against more attacks (including attacks targeted to fool the detector). I would be much more convinced by robustness against different strong attacks, such as the boundary attack, which can easily be modified to work against detection.
- While the results are okay, they are still below the performance of robust networks such as Madry et al., so I am unsure about the practical significance (I know this is about detection, not robust classification, so the comparison isn't exact, but still).
- The main idea combines two known properties of neural networks for detection, so the work has somewhat limited novelty.
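To make the BPDA point concrete, here is a minimal sketch, assuming a PyTorch classifier; `detector_objective` is only an illustrative stand-in for a gradient-norm criterion like objectives 3/4, not the authors' actual code. The point is that `create_graph=True` keeps the inner gradient differentiable, so an attacker can take a second derivative directly instead of approximating with BPDA:

```python
import torch
import torch.nn.functional as F

def detector_objective(model, x, y):
    # Illustrative stand-in for a gradient-based criterion (cf. objectives 3/4);
    # the exact form is hypothetical, not taken from the paper.
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    # create_graph=True keeps this gradient in the autograd graph,
    # so it can itself be differentiated -- no BPDA needed.
    (grad_x,) = torch.autograd.grad(ce, x, create_graph=True)
    return grad_x.flatten(1).norm(dim=1).mean()

def attack_step(model, x_adv, y, step_size=1e-2):
    x_adv = x_adv.clone().detach().requires_grad_(True)  # leaf tensor
    loss = detector_objective(model, x_adv, y)
    (grad,) = torch.autograd.grad(loss, x_adv)  # second-order gradient
    return (x_adv - step_size * grad.sign()).detach()
```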

Reviewer 2

This paper addresses the problem of detecting adversarial attacks on image classifiers. This is a very important problem, and its solution lies at the heart of one of the main challenges that must be overcome before this kind of classifier can be used in real-world applications. The proposed solution is novel in that the existence of so-called adversarial perturbations -- usually considered to be the main problem -- is used as the main building block of a defense mechanism against attacks. The material is clearly presented, and experimental results show that the approach achieves good results on a challenging dataset.

The main question, on which the authors should comment in the rebuttal, concerns one of the basic tenets of the method. In particular, it is not clear whether the criterion regarding closeness to the decision boundary is universal -- is it valid for any kind of classifier? A discussion of the limitations of the approach with respect to this issue seems necessary.

Another question concerns the white-box adversary: what prevents the design of an attack that uses the authors' method as a black box, essentially iterating the generation of adversarial images until one passes the detection method (see the sketch below)?

Finally, in the conclusion the authors mention in passing that running time is a limitation; could you please provide some details? As it is, it is not possible to determine whether this is a major issue or just a minor drawback.
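Concretely, the loop I have in mind looks roughly like the following sketch; `generate_adversarial` and `detector_flags` are hypothetical placeholders standing in for any off-the-shelf attack and for the authors' detector, not functions from the paper:

```python
def evade_detector(model, detector_flags, generate_adversarial, x, y, max_tries=100):
    """Black-box rejection loop: keep resampling adversarial candidates
    until the detector accepts one. All callables are hypothetical."""
    for seed in range(max_tries):
        x_adv = generate_adversarial(model, x, y, seed=seed)  # e.g. PGD with a random restart
        misclassified = (model(x_adv).argmax(dim=1) != y).all()
        if misclassified and not detector_flags(x_adv):
            return x_adv  # fools the classifier and slips past the detector
    return None  # every candidate was flagged; the defense held for this input
```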

Reviewer 3

Though the proposed method has achieved improvements on standard datasets, there are some questions about the methodology. Why should data close to the decision boundaries be considered adversarial examples? Though the density of data near the decision boundaries is relatively sparse, should this be the core principle for finding adversarial examples?

The writing quality of the paper is questionable. There are many grammar issues that may prevent readers from reading the paper smoothly. For example, line 8: "an image tampered with" -- what is the object following "with"? And in line 35, there should be a comma after "on one hand".