{"title": "With Friends Like These, Who Needs Adversaries?", "book": "Advances in Neural Information Processing Systems", "page_first": 10749, "page_last": 10759, "abstract": "The vulnerability of deep image classification networks to adversarial attack is now well known, but less well understood. Via a novel experimental analysis, we illustrate some facts about deep convolutional networks for image classification that shed new light on their behaviour and how it connects to the problem of adversaries. In short, the celebrated performance of these networks and their vulnerability to adversarial attack are simply two sides of the same coin: the input image-space directions along which the networks are most vulnerable to attack are the same directions which they use to achieve their classification performance in the first place. We develop this result in two main steps. The first uncovers the fact that classes tend to be associated with specific image-space directions. This is shown by an examination of the class-score outputs of nets as functions of 1D movements along these directions. This provides a novel perspective on the existence of universal adversarial perturbations. The second is a clear demonstration of the tight coupling between classification performance and vulnerability to adversarial attack within the spaces spanned by these directions. Thus, our analysis resolves the apparent contradiction between accuracy and vulnerability. It provides a new perspective on much of the prior art and reveals profound implications for efforts to construct neural nets that are both accurate and robust to adversarial attack.", "full_text": "With Friends Like These, Who Needs Adversaries?\n\nSaumya Jetley\u22171\n\nNicholas A. Lord\u2217 1,2\n\nPhilip H.S. 
Torr1,2\n\n1Department of Engineering Science, University of Oxford\n\n2Oxford Research Group, FiveAI Ltd.\n\n{sjetley, nicklord, phst}@robots.ox.ac.uk\n\nAbstract\n\nThe vulnerability of deep image classi\ufb01cation networks to adversarial attack is\nnow well known, but less well understood. Via a novel experimental analysis, we\nillustrate some facts about deep convolutional networks for image classi\ufb01cation that\nshed new light on their behaviour and how it connects to the problem of adversaries.\nIn short, the celebrated performance of these networks and their vulnerability to\nadversarial attack are simply two sides of the same coin: the input image-space\ndirections along which the networks are most vulnerable to attack are the same\ndirections which they use to achieve their classi\ufb01cation performance in the \ufb01rst\nplace. We develop this result in two main steps. The \ufb01rst uncovers the fact that\nclasses tend to be associated with speci\ufb01c image-space directions. This is shown\nby an examination of the class-score outputs of nets as functions of 1D movements\nalong these directions. This provides a novel perspective on the existence of\nuniversal adversarial perturbations. The second is a clear demonstration of the\ntight coupling between classi\ufb01cation performance and vulnerability to adversarial\nattack within the spaces spanned by these directions. Thus, our analysis resolves\nthe apparent contradiction between accuracy and vulnerability. It provides a new\nperspective on much of the prior art and reveals profound implications for efforts\nto construct neural nets that are both accurate and robust to adversarial attack.1\n\n1\n\nIntroduction\n\nThose studying deep networks \ufb01nd themselves forced to confront an apparent paradox. On the one\nhand, there is the demonstrated success of networks in learning class distinctions on training sets that\nseem to generalise well to unseen test data. 
On the other, there is the vulnerability of the very same\nnetworks to adversarial perturbations that produce dramatic changes in class predictions despite being\ncounter-intuitive or even imperceptible to humans. A common understanding of the issue can be\nstated as follows: \u201cWhile deep networks have proven their ability to distinguish between their target\nclasses so as to generalise over unseen natural variations, they curiously possess an Achilles heel\nwhich must be defended.\u201d In fact, efforts to formulate attacks and counteracting defences of networks\nhave led to a dedicated competition [1] and a body of literature already too vast to summarise in total.\nIn the current work we attempt to demystify this phenomenon at a fundamental level. We base our\nwork on the geometric decision boundary analysis of [2], which we reinterpret and extend into a\nframework that we believe is simpler and more illuminating with regards to the aforementioned\nparadoxical behaviour of deep convolutional networks (DCNs) for image classi\ufb01cation. Through\na fairly straightforward set of experiments and explanations, we clarify what it is that adversarial\nexamples represent, and indeed, what it is that modern DCNs do and do not currently do. In doing so,\nwe tie together work which has focused on adversaries per se with other work which has sought to\ncharacterise the feature spaces learned by these networks.\n\n\u2217S. Jetley and N.A. Lord have contributed equally and assert joint \ufb01rst authorship.\n1Source code for replicating all experiments is provided at https://github.com/torrvision/whoneedsadversaries.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Plots of the \u2018frog\u2019 class score \u02dcFfrog(s|in, dj, \u03b8) for the Network-in-Network [3] architecture trained\non CIFAR10, associated with two speci\ufb01c image-space directions d1 and d2 respectively. 
These directions\nare visualised as 2D images in the row below; the method of estimating them is explained in Sec. 3. Each plot\ncorresponds to a randomly selected CIFAR10 test image in. Adding or subtracting components along d1 causes\nthe network to change its prediction to \u2018frog\u2019: as can be seen, a \u2018deer\u2019 with a mild diamond striping added to\nit gets classi\ufb01ed as a \u2018frog\u2019. This happens with little regard for the choice of input image in itself. Likewise,\nperturbations along d2 change any \u2018frog\u2019 to a \u2018non-frog\u2019 class: notice the predicted labels for the sample images\nalong the red curve in the second plot. These class-transition phenomena are predicted by the framework\ndeveloped in this paper. While simplistic functions along directions d1 and d2 are used by the network to\naccomplish the task of classi\ufb01cation, perturbations along the very same directions constitute adversarial attacks.\n\nLet \u02c7i represent vectorised input images and \u00afi be the average vector-image over a given dataset. Then,\nthe mean-normalised version of the dataset is denoted by I = {i1, i2,\u00b7\u00b7\u00b7 iN}, where the nth image\nin = \u02c7in \u2212\u00afi. We de\ufb01ne the perturbation of the image in in the direction dj as: \u02dcin \u2190 in + s\u02c6dj, where\ns is the perturbation scaling factor and \u02c6dj is the unit-norm vector in the direction dj. The image\nis fed through a network parameterised by \u03b8 and the output score2 for a speci\ufb01c class c is given\nby Fc(\u02dci|\u03b8). This class-score function can be rewritten as Fc(in + s\u02c6dj|\u03b8), which we equivalently\ndenote by \u02dcFc(s|in, dj, \u03b8). Our work examines the nature of \u02dcFc as a function of movement s in\nspeci\ufb01c image-space directions dj starting from randomly sampled natural images in, for a variety\nof classi\ufb01cation DCNs. 
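As a concrete (purely illustrative) sketch of this 1D scan, the function \u02dcFc(s|in, dj, \u03b8) can be traced in a few lines of numpy. The toy linear score function below is a hypothetical stand-in for a real net's logit, not one of the networks studied in the paper:

```python
import numpy as np

def scan_class_score(score_fn, image, direction, scales):
    """Evaluate the class-score function F~_c(s | i_n, d_j) at each
    perturbation scale s, i.e. score_fn(image + s * unit(direction))."""
    d_hat = direction / np.linalg.norm(direction)  # unit-norm direction d^_j
    return np.array([score_fn(image + s * d_hat) for s in scales])

# Toy stand-in for a net's logit for one class: a fixed linear functional.
w = np.array([1.0, -2.0, 0.5])
score_fn = lambda i: float(w @ i)

image = np.array([0.2, 0.1, -0.3])          # a mean-normalised "image"
scales = np.linspace(-3, 3, 7)              # range of scaling factors s
curve = scan_class_score(score_fn, image, direction=w, scales=scales)
# For a linear score the curve is affine (hence monotonic) in s; for the
# real nets the analogous curves are those plotted in Fig. 1.
```

For a trained DCN, `score_fn` would be the logit of the class under study and `direction` one of the estimated image-space directions.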
With this novel analysis, we uncover three noteworthy observations about\nthese functions that relate directly to the phenomenon of adversarial vulnerability in these nets, all of\nwhich are on display in Fig. 1. We now discuss these observations in more detail.\nBefore we begin, we note that these directions dj are obtained via the method explained in Sec. 3\nand by design exhibit either positive or negative association with a speci\ufb01c class. In Fig. 1 we\nstudy two such directions for the \u2018frog\u2019 class: similar directions exist for all other classes. Firstly,\nnotice that the score of the corresponding class c (\u2018frog\u2019, in this case) as a function of s is often\napproximately symmetrical about some point s0, i.e. \u02dcFc(s\u2212s0|in, dj, \u03b8) \u2248 \u02dcFc(\u2212s\u2212s0|in, dj, \u03b8) \u2200s,\nand monotonic in both half-lines. This means that simply increasing the magnitude of correlation\nbetween the input image and a single direction causes the net to believe that more (or less) of the\nclass c is present. In other words, the image-space direction sends all images either towards or\naway from the class c. In the former scenario, the direction represents a class-speci\ufb01c universal\nadversarial perturbation (UAP). Second, let id = i \u00b7 \u02c6d, and let id\u22a5 be the projection of i onto the\nspace normal to \u02c6d, such that id\u22a5 = i \u2212 id\u02c6d. Then, our results illustrate that there exists a basis of\nimage space containing \u02c6d such that the class-score function is approximately additively separable\ni.e. Fc(i|\u03b8) = Fc([id, id\u22a5 ]|\u03b8) \u2248 G(id) + H(id\u22a5 ) for some functions G and H. This means that the\ndirections under study can be used to alter the nets\u2019 predictions almost independently of each other.\nHowever, despite these facts, their 2D visualisation reveals low-level structures that are devoid of\na clear semantic link to the associated classes, as shown in Fig. 1. 
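The orthogonal decomposition used in the second observation, id = i \u00b7 \u02c6d together with the residual id\u22a5 = i \u2212 id\u02c6d, can be sketched directly (a minimal numpy illustration; the vectors below are toy placeholders):

```python
import numpy as np

def decompose(i, d):
    """Split image i into its scalar component along unit(d) and the
    residual lying in the subspace orthogonal to d."""
    d_hat = d / np.linalg.norm(d)
    i_d = float(i @ d_hat)        # component along the direction
    i_perp = i - i_d * d_hat      # projection onto the normal space
    return i_d, i_perp

i = np.array([3.0, 4.0, 0.0])
d = np.array([1.0, 0.0, 0.0])
i_d, i_perp = decompose(i, d)
# The two parts are orthogonal and reassemble the original exactly:
# i == i_d * unit(d) + i_perp.
```

Under the approximate separability Fc(i|\u03b8) \u2248 G(id) + H(id\u22a5), perturbing along d changes only the G term, which is why the per-sample curves in Fig. 1 differ mainly by a vertical shift.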
Thus, we demonstrate that the\nlearned functions encode a more simplistic notion of class identity than DCNs are commonly assumed\nto represent, albeit one that generalises to the test distribution to an extent. Unsurprisingly, this\ndoes not align with the way in which the human visual system makes use of these data dimensions:\n\u2018adversarial vulnerability\u2019 is simply the name given to this disparity and the phenomena derived from\nit, with the universal adversarial perturbations of [4] being a particularly direct example of this fact.\nFinally, we show that nets\u2019 classi\ufb01cation performance and adversarial vulnerability are inextricably\nlinked by the way they make use of the above directions, on a variety of architectures. Consequently,\n\n2The output of the layer just before the softmax operation, commonly known as the logit layer.\n\nefforts to improve robustness by \u201csuppressing\u201d nets\u2019 responses to components in these directions\n(e.g. [5]) cannot simultaneously retain full classi\ufb01cation accuracy. The features and functions thereof\nthat DCNs currently rely on to solve the classi\ufb01cation problem are, in a sense, their own worst\nadversaries.\n\n2 Related Work\n\n2.1 Fundamental developments in attack methods\n\nSzegedy et al. coined the term \u2018adversarial example\u2019 in [7], demonstrating the use of box-constrained\nL-BFGS to estimate a minimal \u21132-norm additive perturbation to an input image to cause its label to\nchange to a target class while keeping the resulting image within intensity bounds. Strikingly, they\nlocate a small-norm (imperceptible) perturbation at every point, for every network tested. 
Further, the\nadversaries thus generated are able to fool nets trained differently to one another, even when trained\nwith different subsets of the data. Goodfellow et al. [8] subsequently proposed the \u2018fast gradient\nsign method\u2019 (FGSM) to demonstrate the effectiveness of the local linearity assumption in producing\nthe same result, calculating the gradient of the cost function and perturbing with a \ufb01xed-size step in\nthe direction of its sign (optimal under the linearity assumption and an \u2113\u221e-norm constraint). The\nDeepFool method of Moosavi-Dezfooli et al. [9] retains the \ufb01rst-order framework of FGSM, but\ntailors itself precisely to the goal of \ufb01nding the perturbation of minimum norm that changes the class\nlabel of a given natural image to any label other than its own. Through iterative attempts to cross the\nnearest (linear) decision boundary by a tiny margin, this method records successful perturbations with\nnorms that are even smaller than those of [8]. In [4], Moosavi-Dezfooli & Fawzi et al. propose an\niterative aggregation of DeepFool perturbations that produces \u201cuniversal\u201d adversarial perturbations:\nsingle images which function as adversaries over a large fraction of an entire dataset for a targeted\nnet. While these perturbations are typically much larger than individual DeepFools, they do not\ncorrespond to human perception, and indicate that there are \ufb01xed image-space directions along which\nnets are vulnerable to deception independently of the image-space locations at which they are applied.\nThey also demonstrate some generalisation over network architectures.\nSabour & Cao et al. [10] pose an interesting variant of the problem: instead of \u201clabel adversaries\u201d,\nthey target \u201cfeature adversaries\u201d which minimise the distance from a particular guide image in a\nselected network feature space, subject to a constraint on the \u2113\u221e-norm of image-space distance from\na source image. Despite this constraint, the adversarial image mimics the guide very closely: not\nonly is it nearly always assigned to the guide\u2019s class, but it appears to be an inlier with respect to the\nguide-class distribution in the chosen feature space. Finally, while \u201cadversaries\u201d are conceived of as\nsmall perturbations applied to natural images such that the resulting images are still recognisable to\nhumans, the \u201cfooling images\u201d of Nguyen et al. [11] are completely unrecognisable to humans and yet\ncon\ufb01dently predicted by deep networks to be of particular classes. Such images are easily obtained\nby both evolutionary algorithms and gradient ascent, under direct encoding of pixel intensities\n(appearing to consist mostly of noise) and under CPPN [12]-regularised encoding (appearing as\nabstract mid-level patterns).\n\n2.2 Analysis of adversarial vulnerability and proposed defences\n\nIn [13], Wang et al. propose a nomenclature and theoretical framework with which to discuss the\nproblem of adversarial vulnerability in the abstract, agnostic of any actual net or attack thereof. They\ndenote an oracle relative to whose judgement robustness and accuracy must be assessed, and illustrate\nthat a classi\ufb01er can only be both accurate and robust (invulnerable to attack) relative to its oracle if it\nlearns to use exactly the same feature space that the oracle does. Otherwise, a network is vulnerable to\nadversarial attack in precisely the directions in which its feature space departs from that of the oracle.\nUnder the assumption that a net\u2019s feature space contains some spurious directions, Gao et al. 
[5]\npropose a subtractive scheme of suppressing the neuronal activations (i.e. feature responses) which\nchange signi\ufb01cantly between the natural and adversarial inputs. Notably, the increase in robustness is\naccompanied by a loss of performance accuracy. An alternative to network feature suppression is the\ncompression of input image data explored in e.g. [14, 15, 16].\nGoodfellow et al. [8] hypothesise that the high dimensionality and excessive linearity of deep\nnetworks explain their vulnerability. Tanay and Grif\ufb01n [17] begin by taking issue with the above\nvia illustrative toy problems. They then advance an explanation based on the angle of intersection\nof the separating boundary with the data manifold which rests on over\ufb01tting and calls for effective\nregularisation - which they note is neither solved nor known to be solvable for deep nets. A variety\nof training-based [8, 18, 19, 20] methods are proposed to address the premise of the preceding\nanalyses. Hardening methods [8, 18] investigate the use of adversarial examples to train more robust\ndeep networks. Detection-based methods [19, 20] view adversaries as outliers to the training data\ndistribution and train detectors to identify them as such in the intermediate feature spaces of nets.\nNotably, these methods [19, 20] have not been evaluated on the feature adversaries of Sabour & Cao\net al. [10]. Further, data augmentation schemes such as that of Zhang et al. [21], wherein convex\ncombinations of input images are mapped to convex combinations of their labels, attempt to enable\nthe nets to learn smoother decision boundaries. 
While their approach [21] offers improved resistance\nto single-step gradient sign attacks, it is no more robust to iterative attacks of the same type.\nOver the course of the line of work in [2], [22], [23], and [24], the authors build up an image-space\nanalysis of the geometry of deep networks\u2019 decision boundaries, and its connection with adversarial\nvulnerability. In [23], they note that the DeepFool perturbations of [9] tend to evince relatively\nhigh components in the subspace spanned by the directions in which the decision boundary has a\nhigh curvature. Also, the sign of the mean curvature of the decision boundary in the vicinity of a\nDeepFooled image is typically reversed with respect to that of the corresponding natural image, which\nprovides a simple scheme to identify and undo the attack. They conclude that a majority of image-\nspace directions correspond to near-\ufb02atness of the decision boundary and are insensitive to attack,\nbut along the remaining directions, those of signi\ufb01cant curvature, the network is indeed vulnerable.\nFurther, the directions in question are observed to be shared over sample images. They illustrate in\n[2] why a hypothetical network which possessed this property would theoretically be predicted to be\nvulnerable to universal adversaries, and note that the analysis suggests a direct construction method\nfor such adversaries as an alternative to the original randomised iterative approach of [4]: they can be\nconstructed as random vectors in the subspace of shared high-curvature dimensions.\n\n3 Method\n\nThe analysis begins as in [2], with the extraction of the principal directions and principal curvatures\nof the classi\ufb01er\u2019s image-space class decision boundaries. Put simply, a principal direction vector\nand its associated principal curvature tell you how much a surface curves as you move along it in a\nparticular direction, from a particular point. 
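As a rough, self-contained illustration of this idea (not the paper's implementation: the finite-difference Hessian and the toy score-difference function below are stand-ins for a net's class-score difference), the mean-Hessian eigendecomposition underlying the geometry estimation can be sketched as:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Symmetric finite-difference Hessian of a scalar function f at x."""
    n = x.size
    H = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            ea, eb = np.eye(n)[a] * eps, np.eye(n)[b] * eps
            H[a, b] = (f(x + ea + eb) - f(x + ea - eb)
                       - f(x - ea + eb) + f(x - ea - eb)) / (4 * eps ** 2)
    return H

def principal_curvature_basis(f, boundary_points):
    """Average the Hessian of the score-difference function over the
    near-boundary sample points, then eigendecompose: eigenvectors play
    the role of principal directions, eigenvalues of curvature scores."""
    H = sum(numerical_hessian(f, p) for p in boundary_points) / len(boundary_points)
    vals, vecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    return vecs, vals

# Toy score-difference with one curved direction and one flat direction.
f = lambda x: x[0] ** 2 + 0.0 * x[1]
vecs, vals = principal_curvature_basis(f, [np.zeros(2), np.array([0.1, 0.3])])
# One eigenvalue is near 2 (the curved direction), the other near 0 (flat).
```

In the actual method the Hessians are taken at DeepFool-estimated boundary points of a real net, as detailed in Alg. 1 of Sec. 3.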
Now, it takes many decision boundaries to characterise\nthe classi\ufb01cation behaviour of a multiclass net: C(C \u2212 1)/2 for a C-class classi\ufb01er. However, in order to\nunderstand the boundary properties that are useful for discriminating a given class from all others, it\nshould suf\ufb01ce to analyse only the C 1-vs.-all decision boundaries. Thus, for each class c, the method\nproceeds by locating samples very near to the decision boundary (Fc \u2212 F\u02c6c) = 0 between c and the\nunion of remaining classes \u02c6c \u2260 c. In practice, for each sample, this corresponds to the decision\nboundary between c and the closest neighbouring class \u02dcc, which is arrived at by perturbing the sample\nfrom the latter (\u201csource\u201d) to the former (\u201ctarget\u201d). Then, the geometry of the decision boundary is\nestimated as outlined in Alg. 1 below3, closely following the approach of [2]:\nAlgorithm 1 Computes mean principal directions and principal curvatures for a net\u2019s image-space decision surface.\nInput: network class score function F, dataset I = {i1, i2,\u00b7\u00b7\u00b7 iN}, target class label c\nOutput: principal curvature basis matrix Vb and corresponding principal curvature-score vector vs\nprocedure PRINCIPALCURVATURES(F, I, c)\n    H \u2190 null\n    for each sample in \u2208 I s.t. argmaxk(Fk(in)) \u2260 c do\n        \u02c6c \u2190 argmaxk(Fk(in))    \u25b7 network predicts in to be of class \u02c6c\n        Hc\u02c6c: de\ufb01ne as Hessian of function (Fc \u2212 F\u02c6c)    \u25b7 subscripts select class scores\n        \u02dcin \u2190 DEEPFOOL(in, c)    \u25b7 approximate nearest boundary point to in\n        H \u2190 H + Hc\u02c6c(\u02dcin)    \u25b7 accumulate Hessian at sample boundary point\n    H \u2190 H/|I|    \u25b7 normalise mean Hessian by number of samples\n    (Vb, vs) = EIGS(H)    \u25b7 compute eigenvectors and eigenvalues of mean Hessian\n    return (Vb, vs)\n\nThe authors of [2] advance a hypothesis connecting positively curved directions with the universal\nadversarial perturbations of [4]. Essentially, they demonstrate that if the normal section of a net\u2019s\ndecision surface along a given direction can be locally bounded on the outside by a circular arc of\na particular positive curvature in the vicinity of a sample image point, then geometry accordingly\ndictates an upper bound on the distance between that point and the boundary in that direction. If\nsuch directions and bounds turn out to be largely common across sample image points (which they\ndo), then the existence of universal adversaries follows directly, with higher curvature implying\nlower-norm adversaries. This argument is depicted visually in the supplementary material. It is\nfrom this point that we move beyond the prior art and begin an iterative loop of further analysis,\nexperimentation, and demonstration, as follows.\n\n3For more discussion about the implementation and associated concepts, refer to the supplementary material.\n\n4 Experiments and Analysis\n\nProvided only that the second-order boundary approximation holds up well over a suf\ufb01ciently wide\nperturbation range and variety of images, the model implies that the distance of such adversaries\nfrom the decision boundary should increase as a function of their norm. 
Also, the attack along any\npositively curved direction should in that case be associated with the corresponding target class: the\nclass c in the call to Alg. 1. And while positively curved directions may be of primary interest in [2],\nthe extension of the above geometric argument to the negative-curvature case points to an important\ncorollary: as suf\ufb01cient steps along positive-curvature directions should perturb increasingly into\nclass c, so should steps along negative-curvature directions perturb increasingly away from class c.\nFinally, the plethora of approximately zero-curvature (\ufb02at) directions identi\ufb01ed in [23, 2] should have\nnegligible effect on class identity.\n\nFigure 2: Selected class scores plotted as functions of the scaling factor s of the perturbation along the most\npositively curved direction per net. The \u2018Median class score\u2019 plot compares the score of a randomly selected\ntarget class with the supremum of the scores for the non-target classes. Each curve represents the median of the\nclass scores over the associated dataset, bracketed below by the 30th-percentile score and above by the 70th.\nThe \u2018Transition into target class\u2019 plot depicts the fraction of the dataset not originally of the target class, but\nwhich is transitioned into the target class by the perturbation. Alongside, we graph that population\u2019s median\nsoftmax target-class score. The black dashed line represents the fraction of the population originally of the\ntarget class that remains in the target class under the perturbation. The image grid on the right illustrates the 2D\nvisualisations of the two most-positively curved directions for randomly selected target classes: the columns\ncorrespond, from left to right, with the four net-dataset pairs under study. 
To observe class scores as functions of\nthe norms of the perturbations along the most negatively curved and \ufb02at directions, refer to the supplement.\n\n4.1 Class identity as a function of the component in speci\ufb01c image-space directions\n\nTo test how well the above conjectures hold in practice, we graph statistics of the target and non-\ntarget class scores over the dataset as a function of the magnitude of the perturbation applied in\ndirections identi\ufb01ed as above. The results are depicted in Fig. 2, in which the predicted phenomena\nare readily evident. Along the selected positive-curvature directions, as the perturbation magnitude\nincreases (with either sign), the population\u2019s target class score approaches and then surpasses the\nhighest non-target class score. The monotonicity of this effect is laid bare by graphing the fraction of\nnon-target samples perturbed into the target class, alongside the median target class softmax score.\nNote, again, that the link between the directions in question and the target class identity is established\na priori by Alg. 1. We continue in the supplementary material and show that, as predicted, the same\nphenomenon is evident in reverse when using negative-curvature directions instead. All that changes\nis that it is the population\u2019s non-target class scores that overtake its target class score with increasing\nperturbation norm, with natural samples of the target class accordingly being perturbed out of it. 
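The population statistic plotted in Fig. 2, the fraction of samples not originally of the target class whose prediction flips to it at a given scale s, can be sketched as follows (a toy numpy illustration; the two-pixel "classifier" and data below are hypothetical stand-ins for a real net and dataset):

```python
import numpy as np

def transition_fraction(predict, images, direction, target, s):
    """Fraction of samples not originally predicted as `target` whose
    prediction flips to `target` after adding s * unit(direction)."""
    d_hat = direction / np.linalg.norm(direction)
    non_target = [i for i in images if predict(i) != target]
    if not non_target:
        return 0.0
    flipped = sum(predict(i + s * d_hat) == target for i in non_target)
    return flipped / len(non_target)

# Toy 2-class "net": predicts class 1 iff the first pixel exceeds 0.
predict = lambda i: int(i[0] > 0)
images = [np.array([-2.0, 0.0]), np.array([-0.5, 1.0]), np.array([3.0, 0.0])]
d = np.array([1.0, 0.0])

frac_small = transition_fraction(predict, images, d, target=1, s=1.0)
frac_large = transition_fraction(predict, images, d, target=1, s=3.0)
# A larger step along the class-associated direction transitions a larger
# fraction of the non-target population into the target class.
```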
We\nalso illustrate the point that \ufb02atness of the decision boundary manifests as \ufb02atness of both target\nand non-target class scores: over a wide range of magnitudes, these directions do not in\ufb02uence the\nnetwork in any way. While Fig. 2 illustrates these effects at the level of the population, Fig. 1 shows\na disaggregation into individual sample images, with one response curve per sample from a large\nset. The population-level trends remain evident, but another fact becomes apparent: empirically, the\nshapes of the curves change very little between most samples. They shift vertically to re\ufb02ect the\nclass score contribution of the orthonormal components, but they themselves do not otherwise much\ndepend on those components. That is to say that at least some key components are approximately\nadditively separable from one another. This fact connects directly to the fact that such directions are\n\u201cshared\u201d across samples in the \ufb01rst place, and thus identi\ufb01able by Alg. 1.\nA more intuitive picture of what the networks are actually doing begins to emerge: they are identifying\nthe high-curvature image-space directions as features associated with respective class identities, with\nthe curvature magnitude representing the sensitivity of class identity to the presence of that feature.\nBut if this is true, it suggests that what we have thus identi\ufb01ed are actually the directions which the net\nrelies on generally in predicting the classes of natural images, with the curvatures-cum-sensitivities\nrepresenting their relative weightings. 
Accordingly, it should be possible to disregard the \u201c\ufb02at\u201d\ndirections of near-zero curvature without any noticeable change in the network\u2019s class predictions.\n\n4.2 Network classi\ufb01cation performance versus effective data dimensionality\n\nTo con\ufb01rm the above hypothesis regarding the relative importance of different image-space directions\nfor classi\ufb01cation, we plot the training and test accuracies of a sample of nets as a function of the\nsubspace onto which their input images are projected. The input subspace is parametrised by a\ndimensionality parameter d, which controls the number of basis vectors selected per class. We use\nfour variants of selection: the d most positively curved directions per class (yielding the subspace\nSpos); the d most negatively curved directions per class (yielding the subspace Sneg); the union of\nthe previous two (subspace Sneg \u222a pos); and the d least curved (\ufb02attest) directions per class (subspace\nS\ufb02at). The subspace S so obtained is represented by the orthonormalised basis matrix Qd (obtained\nby QR decomposition of the aggregated directions), and each input image i is then projected4 onto S\nas id = Qd Qd\u22a4 i. Accuracies on {id} as a function of d are shown in the top row of Fig. 3.\nThe outcome is striking: it is evident that in many cases, classi\ufb01cation decisions have effectively\nalready been made based on a relatively small number of features, corresponding to the most curved\ndirections. The sensitivity of the nets along these directions, then, is clearly learned purposefully from\nthe training data, and does largely generalise in testing, as seen. Note also that at this level of analysis,\nit essentially does not matter whether positively or negatively curved directions are chosen. Another\nimportant point emerges here. Since it is the high-curvature directions that are largely responsible\nfor determining the nets\u2019 classi\ufb01cation decisions, the nets should be vulnerable to adversarial attack\nalong precisely these directions.\n\n4.3 Link between classi\ufb01cation and adversarial directions\n\nIt has already been noted in [23] that adversarial attack vectors evince high components in subspaces\nspanned by high-curvature directions. We expand the analysis by repeating the procedure of Sec. 4.2\nfor various attack methods, to determine whether existing attacks are indeed exploiting the directions\nin accordance with the classi\ufb01er\u2019s reliance on them. Results are displayed in the bottom row of\nFig. 3, and should be compared against the row above. The graphs in these \ufb01gures illustrate the\ndirect relationship between the fraction of adversarial norm in given subspaces and the corresponding\nusefulness of those subspaces for classi\ufb01cation. The inclusion of the saliency images of [25] alongside\nthe attack methods makes explicit the fact that adversaries are themselves an exposure of the net\u2019s\nnotion of saliency.\nBy now, two results hint at a simpler and more direct way of identifying bases of classi\ufb01ca-\ntion/adversarial directions. First, a close inspection of the class-score curves sampled and displayed\nin Fig. 1 reveals a direct connection between the curvature of a direction near the origin and its\nderivative magnitude over a fairly large interval around it. Second, this observation is made more\nclear in Fig. 3 where it can be seen that the directions obtained by boundary curvature analysis in\nAlg. 1 correspond to the directions exploited by various \ufb01rst-order methods.\n\n4The mean training-set orthogonal component (I \u2212 Qd Qd\u22a4)\u00afi can be added, but is approximately 0 in practice\nfor data normalised by mean subtraction, as is the case here.\n\nFigure 3: Top row: Training and test classi\ufb01cation accuracies for various DCNs on image sets projected onto\nthe subspaces described in Sec. 4.2, as a function of their dimensionality parameter d (from 0 until the input\nspace is fully spanned). The principal directions de\ufb01ning the subspaces are obtained by applying Alg. 1 once\nfor each possible choice of target class c and retaining d directions per class. Note the relationship between the\nordering of curvature magnitudes and classi\ufb01cation accuracy by comparing the S\ufb02at curves to the others. Bottom\nrow: Mean \u21132-norms of various adversarial perturbations (DeepFool [9], FGSM [8] and UAP [4]) and saliency\nmaps [25] when projected onto the same subspaces as above, as a fraction of their original norms.\n\nFigure 4: Classi\ufb01cation accuracies on image sets projected onto subspaces of the spans of their corresponding\nDeepFool perturbations. For each net-dataset pair, DeepFool perturbations are computed over the image set\nand assembled into a matrix that is decomposed into its SVD. The singular vectors are ordered as per their\nsingular values: Shi represents the high-to-low ordering, Slo the low-to-high, and d the number of vectors\nretained. Compare this \ufb01gure to Fig. 3 (while noticing how d now counts the total number of directions). For the\nImageNet experiments, owing to memory constraints, the SVD is performed on downsampled DeepFools of\nsize 100 \u00d7 100 \u00d7 3 and 120 \u00d7 120 \u00d7 3, respectively. The resulting singular vectors span the entire effective\nclassi\ufb01cation space of correspondingly downsampled images. This is evinced by the fact that the classi\ufb01cation\naccuracy of images projected onto the singular vectors\u2019 subspace saturates to the same performance as that\nyielded when the net is tested directly on the downsampled images.\n
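The subspace projection used in Sec. 4.2, id = Qd Qd\u22a4 i with Qd from a QR decomposition of the aggregated directions, can be sketched in numpy as follows (the direction vectors and image below are toy placeholders, not the estimated directions of a real net):

```python
import numpy as np

def subspace_projector(directions):
    """Orthonormalise the aggregated direction vectors (as columns) via
    QR decomposition and return Q_d, whose column span is the subspace S."""
    Q, _ = np.linalg.qr(np.column_stack(directions))
    return Q

def project(Q, i):
    """Project a (mean-normalised) image onto S: i_d = Q_d Q_d^T i."""
    return Q @ (Q.T @ i)

dirs = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0])]  # span x-y plane
Q = subspace_projector(dirs)
i = np.array([2.0, -3.0, 5.0])
i_d = project(Q, i)
# The in-subspace components survive; the out-of-subspace component is
# removed, mirroring how the "flat" directions can be discarded.
```

Classification accuracy on the projected set {id} as a function of d is then what the top row of Fig. 3 reports.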
Thus, we hypothesise that to identify such a basis, one need actually only perform SVD on a matrix of stacked class-score gradients⁵. Here, we implement this using a collection of DeepFool perturbations to provide the required gradient information, and repeat the analysis of Sec. 4.2, using singular values to order the vectors. The results, in Fig. 4, neatly replicate the previously seen classification accuracy trends for high-to-low and low-to-high curvature traversal of image-space directions. Henceforth, we use these directions directly, simplifying analysis and allowing us to analyse ImageNet networks.

⁵ In fact, this analysis is begun in [4], but only the singular values are examined.

Figure 5: Blue curves depict the mean ℓ2-norms of "confined DeepFool" perturbations: those that are calculated under strict confinement to the respective subspaces of Fig. 4, also detailed in Sec. 4.3. Note the differences in scale of the y-axes of the different plots.
For MNIST and CIFAR, we also plot (in red) the mean norms of the projections of the input images onto those subspaces: observe the inverse relationship between the two curves. The columns on the right visualise, from top to bottom, sample images at the indicated points on the curves in the CIFAR100-AlexNet plots, from left to right: blue-bordered images represent confined DeepFool perturbations (rescaled for display), with their red-bordered counterparts displaying the projection of the corresponding sample CIFAR image onto the same subspace. Observe that when the human-recognisable object appearance is captured in any given subspace, the corresponding DeepFool perturbation becomes maximally effective (i.e. small-norm). Likewise, when the projected image is not readily recognisable to a human, the DeepFool perturbation is large. The feature space per se does not account for adversariality: the issue is in the net's response to the features.

While Fig. 3 displays the magnitudes of components of pre-computed adversarial perturbations in different subspaces, we also design a variation on the analysis to illustrate how effective an efficient attack method (DeepFool) is when confined to the respective subspaces. This is implemented by simply projecting the gradient vectors used in solving DeepFool's linearised problem onto each subspace before otherwise solving the problem as usual. The results, displayed in Fig. 5, thus represent DeepFool's "earnest" attempts to attack the network as efficiently as possible within each given subspace.
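The subspace machinery underlying Figs. 4 and 5 reduces to a few lines of linear algebra. The following is a minimal NumPy sketch with synthetic stand-in data (the array sizes, names, and random "perturbations" are purely illustrative, not the authors' code): the right singular vectors of the stacked perturbations give an ordered orthonormal basis, and confining an attack to a subspace amounts to projecting its gradients onto that basis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: n "perturbations" in a D-dimensional input space.
# In the paper these would be DeepFool perturbations flattened into rows.
D, n, d = 64, 200, 8
perturbations = rng.normal(size=(n, D))

# SVD of the stacked perturbations; the rows of Vt are an orthonormal basis
# of input space, ordered by singular value (high to low).
_, _, Vt = np.linalg.svd(perturbations, full_matrices=False)
Q_hi = Vt[:d].T   # top-d directions ("S_hi")
Q_lo = Vt[-d:].T  # bottom-d directions ("S_lo")

def project(v, Q):
    """Orthogonal projection of v onto the column span of Q (Q orthonormal)."""
    return Q @ (Q.T @ v)

# Fraction of an attack vector's norm captured by a subspace
# (the quantity plotted in the bottom row of Fig. 3).
delta = rng.normal(size=D)
frac_hi = np.linalg.norm(project(delta, Q_hi)) / np.linalg.norm(delta)

# "Confined DeepFool" (Fig. 5): project the gradient onto the subspace
# before taking the usual linearised DeepFool step.
grad = rng.normal(size=D)
grad_confined = project(grad, Q_hi)
```

Because Q has orthonormal columns, the projection is idempotent and can only shrink a vector's norm, which is why the plotted fractions lie between 0 and 1.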
It is evident that the attack must exploit genuine classification directions in order to achieve low norm.

d_low   E_n{ℓ2 norm(i_n)}   E_n{ℓ2 norm(δ_in)}   Accuracy (%)   Fooling rate (%)
                                                                 f=1      f=2      f=3      f=4      f=5      f=10
227     26798.72            63.96                 57.75          100.00   100.00   100.00   100.00   100.00   100.00
200     26515.20            53.19                 55.80          32.75    77.25    88.95    92.20    94.35    97.65
150     26327.03            46.86                 53.50          35.55    58.35    77.90    85.95    89.25    95.65
120     26159.98            41.92                 51.75          36.15    49.80    66.20    76.90    82.95    92.90
100     26008.02            37.98                 48.10          41.65    49.25    59.95    68.05    74.80    88.30

Table 1: The images i_n used to train AlexNet operate at the scale of d_orig = 227 (pixels on a side). In the pre-processing step, these images are downsized to d_low, before being upsampled back to the original scale. The reconstructed DeepFool perturbations δ_in lose some of their effectiveness, as seen in the fooling-rate column for f = 1. When the effect of downsampling is countered by increasing the value of the ℓ2-norms of these perturbations (using higher values of f), their efficacy is steadily restored. Note that the mean norms of images and perturbations are estimated in the upscaled space, as are the classification accuracies. The accuracy values for d_low = {100, 120} should be compared to those at convergence in Fig. 4. Any difference in the performance scores is strictly due to the random selection of the subset of 2000 test images used for evaluation.

4.4 On image compression and robustness to adversarial attack

The above observations have made it clear that the most effective directions of adversarial attack are also the directions that contribute the most to the DCNs' classification performance.
Hence, any attempt to mitigate adversarial vulnerability by discarding these directions, either by compressing the input data [14, 15, 16] or by suppressing specific components of image representations at intermediate network layers [5], must effect a loss in the classification accuracy. Further, our framework anticipates the fact that the nets must remain just as vulnerable to attack along the remaining directions that continue to determine classification decisions, given that the corresponding class-score functions, which possess the properties discussed earlier, remain unchanged. We use image downsampling as an example data compression technique to illustrate this effect on ImageNet.

We proceed by inserting a pre-processing unit between the DCN and its input at test time. This unit downsamples the input image i_n to a lower size d_low before upsampling it back to the original input size d_orig. The resizing (by bicubic interpolation) serves to reduce the effective dimensionality of the input data. For a randomly selected set of 2000 ImageNet [26] test images, we observe the change in classification accuracy over different values of d_low, shown in column 4 of Table 1. The fooling rates⁶ for the downsampled versions of these natural images' adversarial counterparts, produced by applying DeepFool to the original network (without the resampling unit), follow in column 5 of the table. At first glance, it appears that the downsampling-based pre-processing unit has afforded an increase in the network robustness at a moderate cost in accuracy. Results pertaining to this tradeoff have been widely reported [14, 15, 5].
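As a concrete illustration of the pre-processing unit just described, here is a minimal pure-NumPy sketch. It uses bilinear rather than the paper's bicubic interpolation, purely to keep the code short, and the function names are our own, not from the authors' implementation:

```python
import numpy as np

def resize_bilinear(img, new_h, new_w):
    """Bilinear resize of an H x W (x C) array in pure NumPy.
    (The paper uses bicubic interpolation; bilinear keeps this sketch short.)"""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - 1, new_h)   # sampling positions in source rows
    xs = np.linspace(0, w - 1, new_w)   # sampling positions in source cols
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]             # vertical interpolation weights
    wx = (xs - x0)[None, :]             # horizontal interpolation weights
    if img.ndim == 3:                   # broadcast over the channel axis
        wy = wy[..., None]
        wx = wx[..., None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def preprocess(img, d_low, d_orig=227):
    """Down-then-up resampling unit inserted before the net at test time:
    reduces the effective input dimensionality while preserving image size."""
    low = resize_bilinear(img, d_low, d_low)
    return resize_bilinear(low, d_orig, d_orig)
```

The output of `preprocess` has the same pixel dimensions as the input, so it can be fed to the unmodified network, but it lives on a lower-dimensional manifold determined by d_low.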
Here, we take the analysis a step further.

To start, we note the fact that the methodology just described represents a transfer attack from the original net to the net as modified by the inclusion of the resampling unit. As DeepFool perturbations δ_in are not designed to transfer in this manner, we first augment them by simply increasing their ℓ2-norm by a scalar factor f. We adjust f from unity up to a point at which the mean DeepFool perturbation norm is still a couple of orders of magnitude smaller than the mean image norm, such that the perturbations remain largely imperceptible. The corresponding fooling rates grow steadily with respect to f, as is observable in Table 1. Hence, although the original full-resolution perturbations may be suboptimal attacks on the resampling variants of the network (as some components are effectively lost to projection onto the compressed space), sufficient rescaling restores their effectiveness. On the other hand, the modified net continues to be equally vulnerable along the remaining effective classification directions, and can easily be attacked directly. To go about this, we simply take the SVD of the stack of downsampled DeepFool perturbations, for d_low values of 100 and 120 (owing to memory constraints). The resulting singular vectors span the entire space of classification/adversarial directions of the corresponding resampling network, as can be seen from the accuracy values in the rightmost subplot of Fig. 4. More crucially, lower-norm DeepFools can be obtained by restricting the attack's iterative linear optimisation procedure to the space spanned by these compressed perturbations, exactly as described in Sec. 4.3 and displayed in Fig. 5. This subspace-confined optimisation is analogous to designing a white-box DeepFool attack for the new network architecture inclusive of the resampling unit, instead of the original network, and is as effective as before.
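The fooling-rate metric of footnote 6, and the effect of the rescaling factor f, can be mimicked on a toy linear "net". Everything below is a hypothetical stand-in, not the AlexNet pipeline; we only use the fact that for a linear binary classifier the minimal ℓ2 perturbation to the boundary has a closed form (the linear case underlying DeepFool):

```python
import numpy as np

rng = np.random.default_rng(0)

def fooling_rate(predict, images, deltas, f=1.0):
    """Fraction of samples whose predicted label changes under f * delta
    (footnote 6's metric, expressed as a fraction rather than a percentage)."""
    return np.mean(predict(images) != predict(images + f * deltas))

# Hypothetical linear binary classifier standing in for the network.
w = rng.normal(size=50)
predict = lambda X: (X @ w > 0).astype(int)
X = rng.normal(size=(200, 50))

# Minimal L2 perturbation towards the linear decision boundary, with a
# small overshoot (factor 1.02) so that every sample actually crosses it.
deltas = -1.02 * (X @ w)[:, None] * w[None, :] / (w @ w)

# Full perturbations fool every sample; "weakened" ones (mimicking the
# component loss caused by down/up-sampling) stop working, and rescaling
# by f restores their efficacy, as in Table 1.
weakened = 0.6 * deltas
rate_full = fooling_rate(predict, X, deltas)          # 1.0
rate_weak = fooling_rate(predict, X, weakened)        # 0.0
rate_rescaled = fooling_rate(predict, X, weakened, f=2.0)  # 1.0
```

The arithmetic is transparent here: adding c·deltas scales each sample's margin by (1 − 1.02c), so the prediction flips exactly when c > 1/1.02, which is why scaling the weakened perturbations restores a full fooling rate.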
Note that this observation is consistent with the results reported in [16], where the strength of the examined gradient-based attack methods increases progressively as the targeted model better approximates the defending model.

5 Conclusion

In this work, we expose a collection of directions along which a given net's class-score output functions exhibit striking similarity across sample images. These functions are nonlinear, but are de facto of a relatively constrained form: roughly axis-symmetric⁷ and typically monotonic over large ranges. We illustrate a close relationship between these directions and class identity: many such directions effectively encode the extent to which the net believes that a particular target class is or is not present. Thus, as it stands, the predictive power and adversarial vulnerability of the studied nets are intertwined owing to the fact that they base their classification decisions on rather simplistic responses to components of the input images in specific directions, irrespective of whether the source of those components is natural or adversarial. Clearly, any gain in robustness obtained by suppressing the net's response to these components must come at the cost of a corresponding loss of accuracy. We demonstrate this experimentally. We also note that these robustness gains may be lower than they appear, as the network actually remains vulnerable to a properly designed attack along the remaining directions it continues to use. A discussion including some nuanced observations and connections to existing work that follow from our study can be found in the supplementary material. To conclude, we believe that for any scheme to be truly effective against the problem of adversarial vulnerability, it must lead to a fundamentally more insightful (and likely complicated) use of features than presently occurs.
Until then, those features will continue to be the nets' own worst adversaries.

⁶ Measured as a percentage of samples from the dataset that undergo a change in their predicted label.
⁷ Though not necessarily so for MNIST, because of its constraints: see supplementary material.

Acknowledgements. This work was supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to acknowledge the Royal Academy of Engineering, FiveAI, and extend our thanks to Seyed-Mohsen Moosavi-Dezfooli for providing his research code for curvature analysis of decision boundaries of DCNs.

References
[1] NIPS: 2017 competition on adversarial attacks and defenses. https://www.kaggle.com/nips-2017-adversarial-learning-competition (2017) accessed: 2018-03-12.
[2] Moosavi-Dezfooli*, S.M., Fawzi*, A., Fawzi, O., Frossard, P., Soatto, S.: Robustness of classifiers to universal perturbations: A geometric perspective. In: International Conference on Learning Representations. (2018)
[3] Lin, M., Chen, Q., Yan, S.: Network in network. International Conference on Learning Representations (2013)
[4] Moosavi-Dezfooli*, S.M., Fawzi*, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 86–94
[5] Gao, J., Wang, B., Lin, Z., Xu, W., Qi, Y.: DeepCloak: Masking deep neural network models for robustness against adversarial samples. In: International Conference on Learning Representations. (2017)
[6] Zhao, Q., Griffin, L.D.: Suppressing the unusual: towards robust CNNs using symmetric activation functions. CoRR abs/1603.05145 (2016)
[7] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations. (2014)
[8] Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations. (2015)
[9] Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Number EPFL-CONF-218057 (2016)
[10] Sabour*, S., Cao*, Y., Faghri, F., Fleet, D.J.: Adversarial manipulation of deep representations. In: International Conference on Learning Representations. (2016)
[11] Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 427–436
[12] Stanley, K.O.: Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines 8(2) (2007) 131–162
[13] Wang, B., Gao, J., Qi, Y.: A theoretical framework for robustness of (deep) classifiers under adversarial noise. arXiv preprint arXiv:1612.00334 (2016)
[14] Maharaj, A.V.: Improving the adversarial robustness of convnets by reduction of input dimensionality (2015)
[15] Das, N., Shanbhogue, M., Chen, S., Hohman, F., Chen, L., Kounavis, M.E., Chau, D.H.: Keeping the bad guys out: Protecting and vaccinating deep learning with JPEG compression. CoRR abs/1705.02900 (2017)
[16] Xie, C., Wang, J., Zhang, Z., Ren, Z., Yuille, A.L.: Mitigating adversarial effects through randomization. CoRR abs/1711.01991 (2017)
[17] Tanay, T., Griffin, L.: A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690 (2016)
[18] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks.
International Conference on Learning Representations (2018)
[19] Lu, J., Issaranon, T., Forsyth, D.: SafetyNet: Detecting and rejecting adversarial examples robustly. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 446–454
[20] Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267 (2017)
[21] Zhang, H., Cisse, M., Dauphin, Y., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations. (2018)
[22] Fawzi*, A., Moosavi-Dezfooli*, S.M., Frossard, P.: Robustness of classifiers: from adversarial to random noise. In: Advances in Neural Information Processing Systems. (2016) 1632–1640
[23] Fawzi*, A., Moosavi-Dezfooli*, S.M., Frossard, P., Soatto, S.: Classification regions of deep neural networks. arXiv preprint arXiv:1705.09552 (2017)
[24] Fawzi, A., Moosavi-Dezfooli, S.M., Frossard, P.: The robustness of deep networks: A geometrical perspective. IEEE Signal Processing Magazine 34(6) (2017) 50–62
[25] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
[26] Russakovsky*, O., Deng*, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
[27] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Volume 1., IEEE (2005) 886–893
[28] do Carmo, M.: Differential Geometry of Curves and Surfaces.
Prentice-Hall (1976)