{"title": "Accurate, reliable and fast robustness evaluation", "book": "Advances in Neural Information Processing Systems", "page_first": 12861, "page_last": 12871, "abstract": "Throughout the past five years, the susceptibility of neural networks to minimal adversarial perturbations has moved from a peculiar phenomenon to a core issue in Deep Learning. Despite much attention, however, progress towards more robust models is significantly impaired by the difficulty of evaluating the robustness of neural network models. Today's methods are either fast but brittle (gradient-based attacks), or they are fairly reliable but slow (score- and decision-based attacks). We here develop a new set of gradient-based adversarial attacks which (a) are more reliable in the face of gradient-masking than other gradient-based attacks, (b) perform better and are more query efficient than current state-of-the-art gradient-based attacks, (c) can be flexibly adapted to a wide range of adversarial criteria and (d) require virtually no hyperparameter tuning. These findings are carefully validated across a diverse set of six different models and hold for L0, L1, L2 and Linf in both targeted as well as untargeted scenarios. Implementations will soon be available in all major toolboxes (Foolbox, CleverHans and ART). 
We hope that this class of attacks will make robustness evaluations easier and more reliable, thus contributing to more signal in the search for more robust machine learning models.", "full_text": "Accurate, reliable and fast robustness evaluation\n\nWieland Brendel1,3\n\nJonas Rauber1-3 Matthias K\u00fcmmerer1-3\n\nIvan Ustyuzhaninov1-3\n\nMatthias Bethge1,3,4\n\n1Centre for Integrative Neuroscience, University of T\u00fcbingen\n\n2International Max Planck Research School for Intelligent Systems\n\n3Bernstein Center for Computational Neuroscience T\u00fcbingen\n\n4Max Planck Institute for Biological Cybernetics\n\nwieland.brendel@bethgelab.org\n\nAbstract\n\nThroughout the past \ufb01ve years, the susceptibility of neural networks to minimal\nadversarial perturbations has moved from a peculiar phenomenon to a core issue in\nDeep Learning. Despite much attention, however, progress towards more robust\nmodels is signi\ufb01cantly impaired by the dif\ufb01culty of evaluating the robustness of\nneural network models. Today\u2019s methods are either fast but brittle (gradient-based\nattacks), or they are fairly reliable but slow (score- and decision-based attacks).\nWe here develop a new set of gradient-based adversarial attacks which (a) are\nmore reliable in the face of gradient-masking than other gradient-based attacks, (b)\nperform better and are more query ef\ufb01cient than current state-of-the-art gradient-\nbased attacks, (c) can be \ufb02exibly adapted to a wide range of adversarial criteria\nand (d) require virtually no hyperparameter tuning. These \ufb01ndings are carefully\nvalidated across a diverse set of six different models and hold for L0, L1, L2 and\nL\u221e in both targeted as well as untargeted scenarios. Implementations will soon\nbe available in all major toolboxes (Foolbox, CleverHans and ART). 
We hope that\nthis class of attacks will make robustness evaluations easier and more reliable, thus\ncontributing to more signal in the search for more robust machine learning models.\n\n1\n\nIntroduction\n\nManipulating just a few pixels in an input can easily derail the predictions of a deep neural network\n(DNN). This susceptibility threatens deployed machine learning models and highlights a gap between\nhuman and machine perception. This phenomenon has been intensely studied since its discovery in\nDeep Learning [Szegedy et al., 2014] but progress has been slow [Athalye et al., 2018a].\nOne core issue behind this lack of progress is the shortage of tools to reliably evaluate the robustness\nof machine learning models. Almost all published defenses against adversarial perturbations have\nlater been found to be ineffective [Athalye et al., 2018a]: the models just appeared robust on the\nsurface because standard adversarial attacks failed to \ufb01nd the true minimal adversarial perturbations\nagainst them. State-of-the-art attacks like PGD [Madry et al., 2018] or C&W [Carlini and Wagner,\n2016] may fail for a number of reasons, ranging from (1) suboptimal hyperparameters over (2) an\ninsuf\ufb01cient number of optimization steps to (3) masking of the backpropagated gradients.\nIn this paper, we adopt ideas from the decision-based boundary attack [Brendel et al., 2018] and\ncombine them with gradient-based estimates of the boundary. The resulting class of gradient-based\nattacks surpasses current state-of-the-art methods in terms of attack success, query ef\ufb01ciency and\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Schematic of our approach. Consider a two-pixel input which a model either interprets as a\ndog (shaded region) or as a cat (white region). Given a clean dog image (solid triangle), we search for\nthe closest image classi\ufb01ed as a cat. 
Standard gradient-based attacks start somewhere near the clean\nimage and perform gradient descent towards the boundary (left). Our attacks start from an adversarial\nimage far away from the clean image and walk along the boundary towards the closest adversarial\n(middle). In each step, we solve an optimization problem to \ufb01nd the optimal descent direction along\nthe boundary that stays within the valid pixel bounds and the trust region (right).\n\nreliability. Like the decision-based boundary attack, but unlike existing gradient-based attacks,\nour attacks start from a point far away from the clean input and follow the boundary between the\nadversarial and non-adversarial region towards the clean input, Figure 1 (middle). This approach has\nseveral advantages: \ufb01rst, we always stay close to the decision boundary of the model, the most likely\nregion to feature reliable gradient information. Second, instead of minimizing some surrogate loss\n(e.g. a weighted combination of the cross-entropy and the distance loss), we can formulate a clean\nquadratic optimization problem. Its solution relies on the local plane of the boundary to estimate\nthe optimal step towards the clean input under the given Lp norm and the pixel bounds, see Figure\n1 (right). Third, because we always stay close to the boundary, our method features only a single\nhyperparameter (the trust region) but no other trade-off parameters as in C&W or a \ufb01xed Lp norm ball\nas in PGD. We tested our attacks against the current state-of-the-art in the L0, L1, L2 and L\u221e metric\non two conditions (targeted and untargeted) on six different models (of which all but the vanilla\nResNet-50 are defended) across three different data sets. To make all comparisons as fair as possible,\nwe conducted a large-scale hyperparameter tuning for each attack. 
In almost all cases tested, we \ufb01nd\nthat our attacks outperform the current state-of-the-art in terms of attack success, query ef\ufb01ciency and\nrobustness to suboptimal hyperparameter settings. We hope that these improvements will facilitate\nprogress towards robust machine learning models.\n\n2 Related work\n\nGradient-based attacks are the most widely used tools to evaluate model robustness due to their\nef\ufb01ciency and success rate relative to other classes of attacks with less model information (like\ndecision-based, score-based or transfer-based attacks, see [Brendel et al., 2018]). This class includes\nmany of the best-known attacks such as L-BFGS [Szegedy et al., 2014], FGSM [Goodfellow et al.,\n2015], JSMA [Papernot et al., 2016], DeepFool [Moosavi-Dezfooli et al., 2016], PGD [Kurakin et al.,\n2016, Madry et al., 2018], C&W [Carlini and Wagner, 2016], EAD [Chen et al., 2017] and SparseFool\n[Modas et al., 2019]. Nowadays, the two most important ones are PGD with a random starting point\n[Madry et al., 2018] and C&W [Carlini and Wagner, 2016]. They are usually considered the state of\nthe art for L\u221e (PGD) and L2 (CW). The other ones are either much weaker (FGSM, DeepFool) or\nminimize other norms, e.g. L0 (JSMA, SparseFool) or L1 (EAD). More recently, there have been\nsome improvements to PGD that aim at making it more effective and/or more query-ef\ufb01cient by\nchanging its update rule to Adam [Uesato et al., 2018] or momentum [Dong et al., 2018].\n\n3 Attack algorithm\n\nOur attacks are inspired by the decision-based boundary attack [Brendel et al., 2018] but use gradients\nto estimate the local boundary between adversarial and non-adversarial inputs. 
We will refer to this boundary as the adversarial boundary for the rest of this manuscript. In a nutshell, the attack starts from an adversarial input \u02dcx0 (which may be far away from the clean sample) and then follows the adversarial boundary towards the clean input x, see Figure 1 (middle). To compute the optimal step in each iteration, Figure 1 (right), we solve a quadratic trust region optimization problem. The goal of this optimization problem is to find a step \u03b4k such that (1) the updated perturbation \u02dcxk = \u02dcxk\u22121 + \u03b4k has minimal Lp distance to the clean input x, (2) the size ||\u03b4k||_2^2 of the step is smaller than a given trust region radius r, (3) the updated perturbation stays within the box-constraints of the valid input value range (e.g. [0, 1] or [0, 255] for the input) and (4) the updated perturbation \u02dcxk is approximately placed on the adversarial boundary.

Optimization problem In mathematical terms, this optimization problem can be phrased as

    min_\u03b4 ||x \u2212 \u02dcxk\u22121 \u2212 \u03b4k||_p   s.t.   0 \u2264 \u02dcxk\u22121 + \u03b4k \u2264 1  \u2227  bk\u22a4\u03b4k = ck  \u2227  ||\u03b4k||_2^2 \u2264 r,     (1)

where ||.||_p denotes the Lp norm and bk denotes the estimate of the normal vector of the local boundary (see Figure 1) around \u02dcxk\u22121 (see below for details).
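For the L2 metric (and, for clarity of exposition, ignoring the box constraints), the trust-region subproblem of Eq. (1) admits a simple closed-form solution: decompose the step into the minimum-norm point on the hyperplane bk\u22a4\u03b4 = ck plus a component orthogonal to bk, and shrink the latter into the trust region. The following is a minimal NumPy sketch of this idea, not the released implementation; the function name, the sign convention for c and the toy example are ours.

```python
import numpy as np

def l2_step(x, x_tilde, b, c, r):
    """One trust-region step for the L2 version of Eq. (1), ignoring the
    box constraints for clarity: minimize ||x - x_tilde - delta||_2
    subject to b @ delta = c and delta @ delta <= r."""
    v = x - x_tilde                       # direction towards the clean input
    bb = b @ b
    delta0 = (c / bb) * b                 # minimum-norm point on the hyperplane b @ delta = c
    w = v - (b @ v / bb) * b              # unconstrained optimum, orthogonal to b
    budget = max(r - delta0 @ delta0, 0.0)
    if w @ w > budget:                    # trust region active: shrink the tangential part
        w = w * np.sqrt(budget / (w @ w))
    return delta0 + w

# Two-pixel toy example in the spirit of Figure 1: a linear boundary
# x0 + x1 = 1 separates the two classes; points with adv(z) > 0 are
# adversarial and adv(z) = 0 is the adversarial boundary.
adv = lambda z: z[0] + z[1] - 1.0
grad_adv = lambda z: np.array([1.0, 1.0])

x = np.array([0.2, 0.2])                  # clean input
x_tilde = np.array([1.0, 0.3])            # adversarial starting point
for _ in range(50):
    b = grad_adv(x_tilde)
    # linearization adv(x_tilde) + b @ delta = 0 keeps the step on the boundary
    x_tilde = x_tilde + l2_step(x, x_tilde, b, c=-adv(x_tilde), r=0.25)
```

Because the toy boundary is exactly linear, the iterates land on the boundary after one step and then slide along it towards (0.5, 0.5), the boundary point closest to the clean input.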
The problem can be solved for p = 0, 1, \u221e with off-the-shelf solvers like ECOS [Domahidi et al., 2013] or SCS [O\u2019Donoghue et al., 2016], but the runtime of these solvers as well as their numerical instabilities in high dimensions prohibits their use in practice. We therefore derived efficient iterative algorithms based on the dual of (1) to solve Eq. (1) for L0, L1, L2 and L\u221e. The additional optimization step (1) has little impact on the runtime of our attack compared to standard iterative gradient-based attacks like PGD. We report the details of the derivation and the resulting algorithms in the supplementary materials.

Adversarial criterion Our attacks move along the adversarial boundary to minimize the distance to the clean input. We assume that this boundary can be defined by a differentiable equality constraint adv(\u02dcx) = 0, i.e. the manifold that defines the boundary is given by the set of inputs {\u02dcx | adv(\u02dcx) = 0}. No other assumptions about the adversarial boundary are being made. Common choices for adv(.) are targeted or untargeted adversarials, defined by perturbations that switch the model prediction from the ground-truth label y to either a specified target label t (targeted scenario) or any other label t \u2260 y (untargeted scenario). More precisely, let m(\u02dcx) \u2208 R^C be the class-conditional log-probabilities predicted by model m(.) on the input \u02dcx.
Then adv(\u02dcx) = m(\u02dcx)y \u2212 m(\u02dcx)t is the criterion for targeted adversarials and adv(\u02dcx) = min_{t \u2260 y} (my(\u02dcx) \u2212 mt(\u02dcx)) for untargeted adversarials.

The direction of the boundary bk in step k at point \u02dcxk\u22121 is defined as the derivative of adv(.),

    bk = \u2207\u02dcxk\u22121 adv(\u02dcxk\u22121).     (2)

Hence, any step \u03b4k for which bk\u22a4\u03b4k = adv(\u02dcxk\u22121) will move the perturbation \u02dcxk = \u02dcxk\u22121 + \u03b4k onto the adversarial boundary (if the linearity assumption holds exactly). In Eq. (1), we defined ck \u2261 adv(\u02dcxk\u22121) for brevity. Finally, we note that in the targeted and untargeted scenarios, we compute gradients for the same loss found to be most effective in Carlini and Wagner [2016]. In our case, this loss is naturally derived from a geometric perspective of the adversarial boundary.

Starting point The algorithm always starts from a point \u02dcx0 that is typically far away from the clean image and lies in the adversarial region. There are several straightforward ways to find such starting points, e.g. by (1) sampling random noise inputs, (2) choosing a real sample that is part of the adversarial region (e.g. one that is classified as a given target class) or (3) choosing the output of another adversarial attack.

In all experiments presented in this paper, we choose the starting point as the closest sample (in terms of the L2 norm) to the clean input which was classified differently (in untargeted settings) or classified as the desired target class (in targeted settings) by the given model. After finding a suitable starting point, we perform a binary search with a maximum of 10 steps between the clean input and the starting point to find the adversarial boundary. From this point, we perform an iterative descent along the boundary towards the clean input.
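The untargeted criterion and the binary search onto the boundary described above can be sketched in a few lines. This is an illustrative toy: the linear two-class "model", all names and the number of steps are our own choices, not the paper's code.

```python
import numpy as np

def untargeted_adv(logits, y):
    """Untargeted criterion: adv = m_y - max over t != y of m_t.
    Values <= 0 mean the input is adversarial; adv = 0 is the boundary."""
    other = np.delete(logits, y)
    return logits[y] - other.max()

def boundary_binary_search(model, y, x_clean, x_start, steps=10):
    """Binary search on the line between the clean input and an adversarial
    starting point; returns a point close to (and on the adversarial side
    of) the adversarial boundary."""
    lo, hi = 0.0, 1.0                     # interpolation weight: 0 = clean, 1 = start
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        x_mid = (1 - mid) * x_clean + mid * x_start
        if untargeted_adv(model(x_mid), y) <= 0:
            hi = mid                      # x_mid is still adversarial: move towards x_clean
        else:
            lo = mid                      # back on the clean side: move towards x_start
    return (1 - hi) * x_clean + hi * x_start

# Toy linear two-class model: class 1 wins iff x[0] > 0.5.
model = lambda x: np.array([1.0 - x[0], x[0]])
x_clean = np.array([0.2, 0.2])            # classified as class 0 (ground truth)
x_start = np.array([0.9, 0.9])            # classified as class 1, i.e. adversarial
x0 = boundary_binary_search(model, y=0, x_clean=x_clean, x_start=x_start)
```

After 10 bisection steps, x0 sits within about 1e-3 of the decision boundary at x[0] = 0.5, still on the adversarial side, ready for the iterative descent along the boundary.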
Algorithm 1 provides a compact summary of the attack procedure.

Algorithm 1: Schematic of our attacks.
Data: clean input x, differentiable adversarial criterion adv(.), adversarial starting point \u02dcx0
Result: adversarial example \u02dcx such that the distance d(x, \u02dcxk) = ||x \u2212 \u02dcxk||_p is minimized
begin
    k \u2190 0
    b0 \u2190 0
    if no \u02dcx0 is given: \u02dcx0 \u223c U(0, 1) s.t. \u02dcx0 is adversarial (or sample from adv. class)
    while k < maximum number of steps do
        bk := \u2207\u02dcxk\u22121 adv(\u02dcxk\u22121)    // estimate local geometry of adversarial boundary
        ck := adv(\u02dcxk\u22121)           // estimate distance to adversarial boundary
        \u03b4k \u2190 solve optimization problem Eq. (1) for given Lp norm
        \u02dcxk \u2190 \u02dcxk\u22121 + \u03b4k
        k \u2190 k + 1
    end
end

4 Methods

We extensively compare the proposed attack against current state-of-the-art attacks in a range of different scenarios. This includes six different models (varying in model architecture, defense mechanism and data set), two different adversarial categories (targeted and untargeted) and four different metrics (L0, L1, L2 and L\u221e). In addition, we perform a large-scale hyperparameter tuning for all attacks we compare against in order to be as fair as possible. The full analysis pipeline is built on top of Foolbox [Rauber et al., 2017].

Attacks We compare against several attacks which are considered to be state-of-the-art in L0, L1, L2 and L\u221e:

\u2022 Projected Gradient Descent (PGD) [Madry et al., 2018]. Iterative gradient attack that optimizes L\u221e by minimizing a cross-entropy loss under a fixed L\u221e norm constraint enforced in each step.

\u2022 Projected Gradient Descent with Adam (AdamPGD) [Uesato et al., 2018]. Same as PGD but with the Adam optimiser for the update steps.

\u2022 C&W [Carlini and Wagner, 2016]. 
L2 iterative gradient attack that relies on the Adam optimizer,\na tanh-nonlinearity to respect pixel-constraints and a loss function that weighs a classi\ufb01cation loss\nwith the distance metric to be minimized.\n\n\u2022 Decoupling Direction and Norm Attack (DDN) [Rony et al., 2018]. L2 iterative gradient attack\npitched as a query-ef\ufb01cient alternative to the C&W attack that requires less hyperparameter\ntuning.\n\n\u2022 EAD [Chen et al., 2018]. Variation of C&W adapted for elastic net metrics. We run the attack\n\nwith high regularisation value (1e\u22122) to approach the optimal L1 performance.\n\n\u2022 Saliency-Map Attack (JSMA) [Papernot et al., 2016]. L0/L1 attack that iterates over saliency\n\nmaps to discover pixels with the highest potential to change the decision of the classi\ufb01er.\n\n\u2022 Sparse-Fool [Modas et al., 2019]. A sparse version of DeepFool, which uses a local linear\napproximation of the geometry of the adversarial boundary to estimate the optimal step towards\nthe boundary.\n\nModels We test all attacks on all models regardless as to whether the models have been speci\ufb01cally\ndefended against the distance metric the attacks are optimizing. The sole goal is to evaluate all attacks\non a maximally broad set of different models to ensure their wide applicability. For all models, we\nused the of\ufb01cial implementations of the authors as available in the Foolbox model zoo [Rauber et al.,\n2017].\n\u2022 Madry-MNIST [Madry et al., 2018]: Adversarially trained model on MNIST. Claim: 89.62% (\n\nL\u221e perturbation \u2264 0.3). Best third-party evaluation: 88.42% [Wang et al., 2018].\n\n4\n\n\f\u2022 Madry-CIFAR [Madry et al., 2018]: Adversarially trained model on CIFAR-10. Claim: 47.04%\n\n(L\u221e perturbation \u2264 8/255). Best third-party evaluation: 44.71% [Zheng et al., 2018].\n\n\u2022 Distillation [Papernot et al., 2015]: Defense (MNIST) with increased softmax temperature. Claim:\n99.06% (L0 perturbation \u2264 112). 
Best third-party evaluation: 3.6% [Carlini and Wagner, 2016].

\u2022 Logitpairing [Kannan et al., 2018]: Variant of adversarial training on downscaled ImageNET (64 x 64 pixels) using the logit vector instead of cross-entropy. Claim: 27.9% (L\u221e perturbation \u2264 16/255). Best third-party evaluation: 0.6% [Engstrom et al., 2018].

\u2022 Kolter & Wong [Kolter and Wong, 2017]: Provable defense that considers a convex outer approximation of the possible hidden activations within an Lp ball to optimize a worst-case adversarial loss over this region. MNIST claim: 94.2% (L\u221e perturbations \u2264 0.1).

\u2022 ResNet-50 [He et al., 2016]: Standard vanilla ResNet-50 model trained on ImageNET that reaches 50% for L2 perturbations \u2264 1 \u00d7 10\u22127 [Brendel et al., 2018].

Adversarial categories We test all attacks in two common attack scenarios: untargeted and targeted attacks. In other words, perturbed inputs are classified as adversarials if they are classified differently from the ground-truth label (untargeted) or are classified as a given target class (targeted).

Hyperparameter tuning We ran all attacks on each model/attack combination and each sample with five repetitions and a large range of potentially interesting hyperparameter settings, resulting in between one (SparseFool) and 96 (C&W) hyperparameter settings tested for each attack. In the appendix we list all hyperparameters and hyperparameter ranges for each attack.

Evaluation The success of an L\u221e attack is typically quantified as the attack success rate within a given L\u221e norm ball. In other words, the attack is allowed to perturb the clean input with a maximum L\u221e norm of \u03b5 and one measures the classification accuracy of the model on the perturbed inputs. The smaller the classification accuracy, the better the attack performed. 
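The success-rate scheme just described, and the median-perturbation scheme used below for minimization attacks, can both be computed from the per-sample minimal perturbation sizes an attack returns. A small hypothetical sketch (the numbers and function names are made up for illustration):

```python
import numpy as np

# Hypothetical minimal adversarial perturbation sizes for six samples;
# np.inf marks a sample for which the attack found no adversarial at all.
dists = np.array([0.05, 0.12, 0.30, 0.08, np.inf, 0.20])

def accuracy_at_eps(dists, eps):
    """Success-rate scheme (Linf-style): a sample still counts as correctly
    classified iff no adversarial with perturbation norm <= eps was found,
    so accuracy is the fraction of samples with dist > eps."""
    return float(np.mean(dists > eps))

def median_perturbation(dists):
    """Minimal-perturbation scheme (L0/L1/L2-style): the median is robust
    to the occasional sample on which the attack failed entirely."""
    return float(np.median(dists))

acc = accuracy_at_eps(dists, eps=0.1)     # 4 of the 6 samples survive
med = median_perturbation(dists)
```

Lower is better for both numbers: a stronger attack pushes the accuracy at a fixed \u03b5 down and the median perturbation size towards zero.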
PGD [Madry et al., 2018] and AdamPGD [Uesato et al., 2018] are highly adapted to this scenario and expect \u03b5 as an input.

This contrasts with most L0, L1 and L2 attacks like C&W [Carlini and Wagner, 2016] or SparseFool [Modas et al., 2019], which are designed to find minimal adversarial perturbations. In such scenarios, it is more natural to measure the success of an attack as the median over the adversarial perturbation sizes across all tested samples [Schott et al., 2019]. The smaller the median perturbation, the better the attack.

Our attacks also seek minimal adversarials and thus lend themselves to both evaluation schemes. To make the comparison to the current state-of-the-art as fair as possible, we adopt the success rate criterion on L\u221e and the median perturbation distance on L0, L1 and L2.

All results reported have been evaluated on 1000 validation samples. For the L\u221e evaluation, we chose \u03b5 for each model and each attack scenario such that the best attack performance reaches roughly 50% accuracy. This makes it easier to compare the performance of different attacks (compared to thresholds at which model accuracy is close to zero or close to clean performance). We chose \u03b5 = 0.33, 0.15, 0.1, 0.03, 0.0015, 0.0006 in the untargeted and \u03b5 = 0.35, 0.2, 0.15, 0.06, 0.04, 0.002 in the targeted scenario for Madry-MNIST, Kolter & Wong, Distillation, Madry-CIFAR, Logitpairing and ResNet-50, respectively.

5 Results

5.1 Attack success

In both targeted as well as untargeted attack scenarios, our attacks surpass the current state-of-the-art on every single model we tested, see Table 1 (untargeted) and Table 2 (targeted) (with Logitpairing in the targeted L\u221e scenario being the only exception). 
While the gains are small on some model/metric combinations like Distillation or Madry-CIFAR on L2, we reach quite substantial gains on many others: on Madry-MNIST, our untargeted L2 attack reaches median perturbation sizes of 1.15 compared to 3.24 for C&W. In the targeted scenario, the difference is even more pronounced (1.70 vs 4.79). On L\u221e, our attack further reduces the model accuracy by 0.1% to 14.0% relative to PGD. On L1 and L0 our gains are particularly drastic: while the SaliencyMap attack and SparseFool often fail on the defended models, our attack reaches close to 100% attack success on all models while finding adversarials that are up to one to two orders of magnitude smaller. Even the current state-of-the-art on L1, EAD [Chen et al., 2018], is up to a factor of six worse than our attack. Adversarial examples produced by our attacks are visualized in Figure 2 (for L2 and L\u221e) and in the supplementary material (for L1 and L0).

Figure 2: Randomly selected adversarial examples found by our L2 and L\u221e attacks for each model. The top part shows adversarial examples that minimize the L\u221e norm while the bottom part shows adversarial examples that minimize the L2 norm. Adversarial examples optimised with our L0 and L1 attacks are displayed in the appendix.

Table 1: Attack success in untargeted scenario. Model accuracies (first block) and median adversarial perturbation distance (all other blocks) in untargeted attack scenarios. Smaller is better. SparseFool and SaliencyMap attacks did not always find sufficiently many adversarials to compute an overall score.

              | Madry-MNIST | K&W     | Distillation | Madry-CIFAR | LP      | ResNet-50
PGD           | 60.1%       | 76.5%   | 32.1%        | 57.1%       | 53.3%   | 51.0%
AdamPGD       | 53.4%       | 72.5%   | 31.3%        | 57.1%       | 53.5%   | 50.2%
Ours-L\u221e   | 49.1%       | 69.5%   | 31.2%        | 57.0%       | 42.5%   | 37.0%
C&W           | 3.24        | 2.78    | 1.09         | 0.75        | 0.10    | 0.14
DDN           | 1.59        | 1.95    | 1.07         | 0.73        | 0.15    | 0.24
Ours-L2       | 1.15        | 1.62    | 1.07         | 0.72        | 0.09    | 0.13
EAD           | 0.01931     | 0.02346 | 0.00768      | 0.00285     | 0.00013 | \u2014
SparseFool    | 0.11393     | 0.32114 | 0.48129      | 0.47687     | 0.49915 | \u2014
SaliencyMap   | 0.04114     | 0.03730 | 0.02482      | 0.00292     | 0.00297 | \u2014
Ours-L1       | 0.00377     | 0.00707 | 0.00698      | 0.00116     | 0.00008 | \u2014
SparseFool    | 1.00000     | 1.00000 | 1.00000      | 1.00000     | 1.00000 | \u2014
SaliencyMap   | 0.22832     | 0.14732 | 0.08291      | 0.03483     | 0.00647 | \u2014
Ours-L0       | 0.07143     | 0.06250 | 0.00765      | 0.00228     | 0.00024 | \u2014

Figure 3: Query-Success curves for all model/attack combinations in the targeted and untargeted scenario for the L2 and L\u221e metric (see supplementary information for the L0 and L1 metric). Each curve shows the attack success either in terms of model accuracy (for L\u221e, left part) or median adversarial perturbation size (for L2, right part) over the number of queries to the model. In both cases, lower is better. For each point on the curve, we selected the optimal hyperparameter. If no line is shown, the attack success was lower than 50%. For all other points with an attack success below 99%, the line is drawn 50% transparent.

Table 2: Attack success in targeted scenario. Model accuracies (first block) and median adversarial perturbation distance (all other blocks) in targeted attack scenarios. Smaller is better. SparseFool and SaliencyMap attacks did not always find sufficiently many adversarials to compute an overall score.

              | Madry-MNIST | K&W     | Distillation | Madry-CIFAR | LP      | ResNet-50
PGD           | 65.6%       | 46.4%   | 53.2%        | 39.7%       | 1.5%    | 47.4%
AdamPGD       | 59.8%       | 40.2%   | 52.3%        | 39.9%       | 0.6%    | 44.1%
Ours-L\u221e   | 56.0%       | 39.8%   | 50.5%        | 37.6%       | 0.9%    | 37.0%
C&W           | 4.79        | 4.06    | 2.09         | 1.20        | 0.70    | 0.44
DDN           | 2.22        | 2.89    | 2.09         | 1.19        | 0.58    | 0.64
Ours-L2       | 1.70        | 2.31    | 2.05         | 1.16        | 0.51    | 0.40
EAD           | 0.03648     | 0.04019 | 0.01808      | 0.00698     | 0.00221 | \u2014
SparseFool    | \u2014      | \u2014  | \u2014       | \u2014      | \u2014  | \u2014
SaliencyMap   | 0.05740     | \u2014  | 0.03160      | 0.00872     | \u2014  | \u2014
Ours-L1       | 0.00499     | 0.00904 | 0.00925      | 0.00146     | 0.00085 | \u2014
SparseFool    | \u2014      | \u2014  | \u2014       | \u2014      | \u2014  | \u2014
SaliencyMap   | 0.13074     | 0.17793 | 0.12117      | 0.04036     | \u2014  | \u2014
Ours-L0       | 0.08929     | 0.07908 | 0.01020      | 0.00293     | 0.01147 | \u2014

Figure 4: Sensitivity of our method to the number of repetitions and suboptimal hyperparameters.

5.2 Query efficiency

On L2, our attack is significantly more query efficient than C&W and at least on par with DDN, see the query-distortion curves in Figure 3. Each curve represents the maximal attack success (either in terms of model accuracy or median perturbation size) as a function of the query budget. For each query (i.e. each point of the curve) and each model, we select the optimal hyperparameter. This ensures that we tease out how well each attack can perform in limited-query scenarios. We find that our L2 attack generally requires only about 10 to 20 queries to get close to convergence while C&W often needs several hundred iterations. Our attack performs particularly well on adversarially trained models like Madry-MNIST.

Similarly, our L\u221e attack generally surpasses PGD and AdamPGD in terms of attack success after around 10 queries. The first few queries are typically required by our attack to find a suitable initial point on the adversarial boundary. This gives PGD a slight advantage at the very beginning.

5.3 Hyperparameter robustness

In Figure 4, we show the results of an ablation study on L2 and L\u221e. In the full case (8 params + 5
In the full case (8 params + 5\nreps), we run all our attacks against C&W as well as PGD with all hyperparameter values and with\n\ufb01ve repetitions for 1000 steps on each sample and model. We then choose the smallest adversarial\ninput across all hyperparameter values and all repetitions. This is the baseline we compare all\nablations against. The results are as follows:\n\n8\n\nL\u221eL2\f\u2022 Like PGD or C&W, our attacks experience only a 4% performance drop if a single hyperpa-\n\nrameter is used instead of eight.\n\n\u2022 Our attacks experience around 15% - 19% drop in performance for a single hyperparameter\n\nand only one instead of \ufb01ve repetitions, similar to PGD and C&W.\n\n\u2022 We can even choose the same trust region hyperparameter across all models with no further\ndrop in performance. C&W, in comparison, experiences a further 16% drop in performance,\nmeaning it is more sensitive to per-model hyperparameter tuning.\n\n\u2022 Our attack is extremely insensitive to suboptimal hyperparameter tuning: changing the\noptimal trust region two orders of magnitude up or down changes performance by less than\n15%. In comparison, just one order of magnitude deteriorates C&W performance by almost\n50%. Larger deviations from the optimal learning rate disarm C&W completely. PGD is\nless sensitive than C&W but still experiences large drops if the learning rate gets too small.\n\n6 Discussion & Conclusion\n\nAn important obstacle slowing down the search for robust machine learning models is the lack of\nreliable evaluation tools: out of roughly two hundred defenses proposed and evaluated in the literature,\nless than a handful are widely accepted as being effective. 
A more reliable evaluation of adversarial\nrobustness has the potential to more clearly distinguish effective defenses from ineffective ones, thus\nproviding more signal and thereby accelerating progress towards robust models.\nIn this paper, we introduced a novel class of gradient-based attacks that outperforms the current\nstate-of-the-art in terms of attack success, query ef\ufb01ciency and reliability on L0, L1, L2 and L\u221e.\nBy moving along the adversarial boundary, our attacks stay in a region with fairly reliable gradient\ninformation. Other methods like C&W which move through regions far away from the boundary\nmight get stuck due to obfuscated gradients, a common issue for robustness evaluation [Athalye et al.,\n2018b].\nFurther extensions to other metrics (e.g. elastic net) are possible as long as the optimization problem\nEq. (1) can be solved ef\ufb01ciently. Extensions to other adversarial criteria are trivial as long as the\nboundary between the adversarial and the non-adversarial region can be described by a differentiable\nequality constraint. This makes the attack more suitable to scenarios other than targeted or untargeted\nclassi\ufb01cation tasks.\nTaken together, our methods set a new standard for adversarial attacks that is useful for practitioners\nand researchers alike to \ufb01nd more robust machine learning models.\n\nAcknowledgments\n\nThis work has been funded, in part, by the German Federal Ministry of Education and Research\n(BMBF) through the Bernstein Computational Neuroscience Program T\u00fcbingen (FKZ: 01GQ1002)\nas well as the German Research Foundation (DFG CRC 1233 on \u201cRobust Vision\u201d) and the T\u00fcbingen\nAI Center (FKZ: 01IS18039A). The authors thank the International Max Planck Research School\nfor Intelligent Systems (IMPRS-IS) for supporting J.R., M.K. and I.U.; J.R. acknowledges support\nby the Bosch Forschungsstiftung (Stifterverband, T113/30057/17); M.B. 
acknowledges support by\nthe Centre for Integrative Neuroscience T\u00fcbingen (EXC 307); W.B. and M.B. were supported by\nthe Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior\nBusiness Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to\nreproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation\nthereon. Disclaimer: The views and conclusions contained herein are those of the authors and should\nnot be interpreted as necessarily representing the of\ufb01cial policies or endorsements, either expressed\nor implied, of IARPA, DoI/IBC, or the U.S. Government.\n\nReferences\nAnish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of\nsecurity: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018a. URL\nhttp://arxiv.org/abs/1802.00420.\n\n9\n\n\fAnish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of\nsecurity: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018b. URL\nhttp://arxiv.org/abs/1802.00420.\n\nW. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against\nblack-box machine learning models. In International Conference on Learning Representations,\n2018. URL https://arxiv.org/abs/1712.04248.\n\nNicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR,\n\nabs/1608.04644, 2016. URL http://arxiv.org/abs/1608.04644.\n\nPin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. Ead: elastic-net attacks to\n\ndeep neural networks via adversarial examples. arXiv preprint arXiv:1709.04114, 2017.\n\nPin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: elastic-net attacks\nto deep neural networks via adversarial examples. 
In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 10–17, 2018.

Alexander Domahidi, Eric Chun-Pu Chu, and Stephen P. Boyd. ECOS: An SOCP solver for embedded systems. In 2013 European Control Conference (ECC), pages 3071–3076, 2013.

Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing. CoRR, abs/1807.10272, 2018. URL http://arxiv.org/abs/1807.10272.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.

Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018. URL http://arxiv.org/abs/1803.06373.

J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. CoRR, abs/1711.00851, 2017. URL http://arxiv.org/abs/1711.00851.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio.
Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. SparseFool: a few pixels make a big difference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, June 2016. URL http://stanford.edu/~boyd/papers/scs.html.

N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387, March 2016. doi: 10.1109/EuroSP.2016.36.

Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. CoRR, abs/1511.04508, 2015. URL http://arxiv.org/abs/1511.04508.

Jonas Rauber, Wieland Brendel, and Matthias Bethge.
Foolbox v0.8.0: A Python toolbox to benchmark the robustness of machine learning models. CoRR, abs/1707.04131, 2017. URL http://arxiv.org/abs/1707.04131.

Jérôme Rony, Luiz G. Hafemann, Luis S. Oliveira, Ismail Ben Ayed, Robert Sabourin, and Eric Granger. Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. arXiv preprint arXiv:1811.09600, 2018.

Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1EHOsC9tX.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.

Jonathan Uesato, Brendan O'Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666, 2018.

Shiqi Wang, Yizheng Chen, Ahmed Abdou, and Suman Jana. MixTrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625, 2018.

Tianhang Zheng, Changyou Chen, and Kui Ren. Distributionally adversarial attack. CoRR, abs/1808.05537, 2018.
URL http://arxiv.org/abs/1808.05537.