NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5038
Title: Certified Adversarial Robustness with Additive Noise

Reviewer 1

*** FURTHER UPDATE: After discussion with the AC, I am happy to change my score to 5, as we agreed that the novelty of this paper should be judged relative to the work available before Cohen et al. was published. ***

*** UPDATE: I thank the authors for their response. I have a few further comments.

The certified bound for L_infty = 0.3 on MNIST shown in Figure 2 appears to give approximately 70% accuracy, whereas TRADES seems to be closer to 100% and Gowal et al. is above 90%. This seems low compared to the numbers I am used to, and might be due to the bound being too loose.

I definitely agree that the goal of the adversary is to find an image whose difference from the original is imperceptible to the human eye. However, when the perturbation radius is larger, we should be less sure that **all** images within this space are imperceptibly different from the original.

To test for gradient obfuscation, looking only at black-box attacks is often not enough. Here I would refer the authors to Section 3.1 of the paper "Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples".

Again, I thank the authors for the response, but given the concerns I still have, I will keep my original score. ***

Overall I found the paper slightly hard to read. My main comments are below:

- The initial figure (Figure 1) has a mismatched caption and seems to be missing some curves that are referenced in the caption. It is unclear to me how to interpret the figure. The caption mentions a pink line showing TRADES, but I cannot find a pink line on the plot. I guess the brown and purple lines in the caption should be the green and orange lines shown in the figure? The mismatch between the caption and the figure itself makes the results presented very confusing. On top of this, the experimental section in general is hard to follow, which weakens the paper. For example, in Table 1 the authors list a result of “1.58 (69.0%)”. This suggests either a certified bound of 1.58 in an unclear metric (with a corresponding accuracy of 69%), which would not be a very good result, or a robustness bound (measured in error), in which case the accuracy contradicts the bound.

- The results are not good compared to the state of the art, and it is unclear how the theoretical result can help to push the boundaries of current research in the field. For example: ‘To interpret, the results, for example on MNIST, when sigma=0.7, Algorithm 1 achieves at least 51% accuracy under any attack whose l2_norm size is smaller than 1.4.’ This seems to be well off the state of the art, such as Gowal et al., ‘On the effectiveness of Interval Bound Propagation for Training Verifiably Robust Models’, which has verified accuracy of 92% on MNIST for an l_infinity norm of 0.3.

- Figure 2 shows that TRADES gives better results for the L infinity norm considered. More worryingly, STN does not seem to perform well for the L2 norm when epsilon is small (namely, when the perturbation is most imperceptible). The authors also commented that TRADES is expensive to train: TRADES used 10 FGSM steps to train the network, which means it would be approximately 10X more expensive than standard ERM. The authors alluded to their method being cheaper; can they quantify how much cheaper it is in comparison to TRADES?

- The authors also discussed gradient masking. They described what they did to combat it, but there is no strong empirical evidence that gradient masking is indeed not occurring. To provide such evidence, can the authors show the adversarial accuracy of the trained networks under a very strong attack, e.g. PGD with 1000 steps and maybe 20 different random initializations of the initial perturbation?

Other comments:

- Can the authors comment on whether this theoretical result can be extended to other L_p norms? The authors alluded in the conclusion to the fact that the theoretical lower bound is not tight; how would this change if we consider other norms?

- As a side note, I would prefer the figures to have corresponding tables in the supplementary material, which would allow us to see the exact adversarial accuracy achieved.

- A general note on reporting adversarial accuracy under PGD: the authors should always report the step size and the number of PGD steps performed, as the adversarial accuracy is highly dependent on these parameters. In fact, many more details are needed to describe the attack: if you are using PGD, is the optimizer just standard SGD, or Momentum, or Adam? To know how strong your results are, you need to specify these parameters so that others can judge how strong the attack is.
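To make the requested attack details concrete, the following is a minimal sketch of an L_inf PGD attack in which the step size, the number of steps, and the number of random restarts are all explicit parameters. The toy linear classifier with logistic loss, the signed-gradient update, and all function names are assumptions chosen for illustration; they are not the authors' model or attack setup.

```python
import math
import random


def _sign(g):
    """Sign of a scalar: -1.0, 0.0, or 1.0."""
    return 0.0 if g == 0 else math.copysign(1.0, g)


def predict(w, b, x):
    """Toy linear classifier (an assumption): label is +1 or -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def logistic_loss_grad(w, b, x, y):
    """Gradient of log(1 + exp(-y * (w.x + b))) with respect to the input x."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def pgd_linf(w, b, x, y, eps, step_size, num_steps, num_restarts, seed=0):
    """Search the L_inf ball of radius eps around x for a misclassified point.

    Returns the first adversarial point found, or None if every restart fails.
    """
    rng = random.Random(seed)
    for _ in range(num_restarts):
        # Random initialization of the perturbation inside the eps-ball.
        x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
        for _ in range(num_steps):
            grad = logistic_loss_grad(w, b, x_adv, y)
            # Signed-gradient ascent step on the loss.
            x_adv = [xa + step_size * _sign(g) for xa, g in zip(x_adv, grad)]
            # Project back onto the eps-ball around the clean input.
            x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        if predict(w, b, x_adv) != y:
            return x_adv
    return None
```

Reporting `eps`, `step_size`, `num_steps`, and `num_restarts` alongside the adversarial accuracy is what lets a reader judge the strength of the attack; a momentum or Adam variant would replace the signed-gradient step and should be stated just as explicitly.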

Reviewer 2

The paper studies a novel method for making neural networks robust to norm-bounded adversarial perturbations, in particular obtaining provable guarantees against perturbations in the L2 norm and empirical results showing that the networks are also robust to perturbations in the Linf norm. The paper is well written and the algorithmic and theoretical contributions are clearly outlined. My specific concerns center on the originality of the paper (distinction from prior work) and the quality of some of the experimental results obtained.

Originality: The paper improves upon the analysis of Lecuyer et al and develops a training method for producing certifiably robust randomized classifiers that is novel as far as I know. However, it is unclear to me how much the paper improves upon the analysis of Lecuyer et al and what the main source of the improvement is. Further, there is more recent work by Cohen et al that appeared on arXiv in February (a revised version was later published at ICML). I understand that the authors may have missed this work at the time of submission, but since the paper was already online before the submission deadline and published soon after, I would appreciate clarifications from the authors regarding the novelty in the rebuttal phase. In particular, I would like to understand:
1) The developed certificates only apply to L2 norm perturbations. Can the authors' framework provide any guarantees for perturbations in other norms?
2) How does the certificate derived compare to that in Lecuyer et al? In particular, is the certificate always guaranteed to be tighter than the one from Lecuyer et al? If not, what are the regimes where it is tighter?
3) How does the work compare to that of Cohen et al? Cohen et al claim that their certificate is the tightest one can obtain for the L2 norm for binary classifiers, given that the only information known about the classifier is the probability of correct classification under random Gaussian perturbations. Given this, how does the certificate derived by the authors compare?

Quality: The proofs of the mathematical results are correct in my assessment. The experimental results are interesting and indicate that the method developed indeed produces provably robust classifiers. However, there are a few issues with the experimental evaluation that I would like clarified:
1) The authors mention that they multiply the confidence interval by 95% to obtain the corresponding accuracy. This seems very confusing to me. What accuracy is referred to here? If this is the accuracy in terms of the fraction of test examples certified to be adversarially robust, I find this a bit confusing, since the two probability spaces (sampling over the data distribution vs sampling over the Gaussian perturbations) are unconnected.
2) The improvements over PixelDP seem to come largely from the training method (i.e., the blue curves are well above the orange curves in Figure 1). What if the PixelDP bound were evaluated on the classifier trained by STN?
3) Table 1 is rather confusing. I assume the certified bound is the radius of the perturbation being certified. Why was this specific value chosen? Was it to maintain consistency with [17]?
4) Since obtaining certificates on ImageNet is a significant advance, I would advise the authors to include these results in the main paper rather than the appendix.

Clarity: The paper is well written overall and the details are clear and easy to follow.

Significance: I think
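As a reference point for question 3 under Originality, the L2 certificate of Cohen et al. has the closed form R = (sigma / 2) * (Phi^{-1}(p_A) - Phi^{-1}(p_B)), where Phi^{-1} is the inverse standard normal CDF, p_A is a lower bound on the probability of the top class under Gaussian noise, and p_B is an upper bound on the probability of the runner-up class. The sketch below only evaluates that formula; the function and argument names are hypothetical, and it does not implement the certificate derived in the paper under review.

```python
from statistics import NormalDist


def cohen_l2_radius(sigma, p_a_lower, p_b_upper):
    """Certified L2 radius from the closed form of Cohen et al. (2019).

    sigma:      standard deviation of the Gaussian smoothing noise
    p_a_lower:  lower bound on the top-class probability under noise
    p_b_upper:  upper bound on the runner-up class probability under noise
    Returns 0.0 when the bounds do not separate the two classes.
    """
    if not (0.0 < p_a_lower < 1.0 and 0.0 < p_b_upper < 1.0):
        raise ValueError("probability bounds must lie strictly between 0 and 1")
    if p_a_lower <= p_b_upper:
        return 0.0
    inv_cdf = NormalDist().inv_cdf
    return 0.5 * sigma * (inv_cdf(p_a_lower) - inv_cdf(p_b_upper))
```

With the common simplification p_B = 1 - p_A, the expression reduces to sigma * Phi^{-1}(p_A), which makes the dependence of the radius on the noise level and on the estimated class probability easy to compare against another certificate evaluated on the same classifier.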

Reviewer 3

***EDITED REVIEW*** I thank the authors for addressing my concerns in the rebuttal. I thank the authors for including a comparison to Cohen et al, and I am surprised to see that their method outperforms Cohen et al. Cohen et al argue that it is not possible to surpass their bound with a randomized smoothing approach. If stability training is indeed that much more powerful than Gaussian data augmentation, I think the results are interesting to the adversarial community, since Gaussian data augmentation is still a very common baseline (e.g., and others). The authors acknowledge that the bound of Cohen et al improves on theirs; therefore, the adversarial community would benefit most from a combination of Cohen's better bound and stability training. For this reason, the paper should be rewritten, since the main contribution stems from stability training. However, stability training itself is not a novel contribution, as it was introduced in "Improving the robustness of deep neural networks via stability training", and it was already demonstrated in reference [32] of this paper that stability training works better than Gaussian noise for increasing performance on noise corruptions. Following the discussion with another reviewer, I agree that the results on CIFAR-10, l_inf are not convincing and that there are other methods that are faster than TRADES and achieve better results (e.g. I have changed my score to a 6, because I find it interesting that the results of Cohen et al are outperformed. I do think that the authors should rewrite the paper, focussing more on the improvement due to stability training and maybe using Cohen's better bound. ********

The authors derive and prove adversarial robustness bounds based on the Renyi divergence and utilizing stability training. They build upon the methods presented in Lecuyer et al [2018] and show higher robustness bounds. The paper is written in the context of randomization-based defenses where noise is added to inputs; if the inputs have been perturbed adversarially before that, the predictions of a classifier are smoothed by the added noise.

Originality, Quality and Significance: My main comment is that the current state of the art in certified adversarial robustness, namely Cohen et al. [2019], is neither discussed nor compared against. Cohen et al. use a very similar approach, as they also add Gaussian noise to adversarial examples to “drown out” adversarial perturbations, and they even prove that it is impossible to certify l2-robustness with a radius larger than their derived R. In more detail, Cohen et al use Gaussian data augmentation during training, while this paper uses stability training, where an additional stability term is included during training. The stability term is the cross-entropy loss between f(x) and f(x’), where x is clean data and x’ is data with added Gaussian noise. The original publication on stability training [Zheng et al, 2016] reported that data augmentation with Gaussian noise leads to underfitting compared to their stability training approach. Another recent study demonstrates better model performance of stability training compared to data augmentation, as well as decreased sensitivity to hyperparameters [Laermann et al, 2019]. I would be interested to see whether this difference (stability training is better than data augmentation) leads to a better robustness bound compared to the approach using data augmentation. It would be crucial for the authors of this paper to compare to Cohen et al., as the approaches are very similar and it is not clear whether the approach of Cohen et al. can be superseded at all by another randomization-based method, given the tight-bounds argument of Cohen et al. Additionally, Cohen et al. provide results for ImageNet, and the authors should include them in their discussion/experiments section. Instead, the authors compare their results to Wong et al. [2018], which is a baseline that was already beaten by Cohen et al. If the results from comparing to Cohen et al are worse, I would suggest rejecting this paper, as the approaches are very similar. On the other hand, the authors achieve better performance than TRADES on l2, both for MNIST and CIFAR10, with a less time-consuming approach. Cohen et al have not compared their method with TRADES, so it is not clear which would perform better. If this approach performs better than Cohen et al (and, as already shown, better than TRADES), the paper would be worth accepting.

Clarity: The paper is well written and easy to read. The form of the presentation is therefore good and acceptable for NeurIPS. I have several minor comments:
- In Figure 1, the colors in the caption do not match the colors in the image.
- In Table 1, I could not find the cited results in Wong et al. [2018]. In Table 4 of Wong et al. [2018], the results look similar to the numbers in Table 1, but they are not exact. For example, in Table 1, Cifar-10, Single, the robust accuracy is displayed as 53%, which would correspond to an error of 47%. This robust accuracy, however, does not occur in Table 4 of Wong et al. [2018]. The natural accuracy seems off by 10% for CIFAR-10. The other numbers are also slightly off. Did the authors of this paper use the code of Wong et al. [2018] to rerun their simulations, which would naturally result in slightly different numbers? Also, which model was used in this comparison, given that there are 3 different models for both CIFAR-10 and MNIST in Table 4 of Wong et al. [2018]?
- In Figure 2, the orange curve displays the Natural case. Please state whether this is a vanilla-trained model and how it was attacked.
- There are some typos in the text, but not many, e.g. page 3, line 110: ‘robustss’ instead of ‘robustness’.
- In the Appendix, there are a few typos in D1, such as in the sentence ‘Since only PixelDP [18] is able to obtain non-trivial certified bound on ImageNet, we compare out bound to theirs’ -> 1) obtain a non-trivial… and 2) our bound.

References:
Lecuyer, Mathias, et al. “On the Connection between Differential Privacy and Adversarial Robustness in Machine Learning”, arXiv:1802.03471 (2018).
Cohen, Jeremy, et al. “Certified Adversarial Robustness via Randomized Smoothing”, arXiv:1902.02918 (2019).
Wong, Eric, et al. “Scaling provable adversarial defenses”, arXiv:1805.12514 (2018).
Zheng, Stephan, et al. “Improving the Robustness of Deep Neural Networks via Stability Training” (2016).
Laermann, Jan, et al. “Achieving Generalizable Robustness of Deep Neural Networks by Stability Training” (2019).
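
The stability term described in this review (the cross-entropy between the prediction on clean data x and the prediction on x’ with added Gaussian noise) can be illustrated with a minimal sketch. The toy linear softmax model, the weighting factor `lam`, and all function names below are assumptions for illustration only; the sketch follows the reviewer's verbal description rather than the authors' actual training code.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k) for two probability vectors."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)


def linear_logits(weights, x):
    """Toy model (an assumption): one row of weights per class."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in weights]


def stability_training_loss(weights, x, label, sigma, lam, rng):
    """Classification loss on clean x plus lam times the stability term.

    Returns (total, classification, stability).
    """
    # x' is the clean input with i.i.d. Gaussian noise of std sigma added.
    x_noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
    p_clean = softmax(linear_logits(weights, x))
    p_noisy = softmax(linear_logits(weights, x_noisy))
    one_hot = [1.0 if k == label else 0.0 for k in range(len(p_clean))]
    classification = cross_entropy(one_hot, p_clean)
    # Stability term: cross-entropy between clean and noisy predictions.
    stability = cross_entropy(p_clean, p_noisy)
    return classification + lam * stability, classification, stability
```

In contrast, Gaussian data augmentation as described for Cohen et al would apply only the classification loss, evaluated on the noisy input, with no term tying the noisy prediction to the clean one.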