{"title": "Lower bounds on the robustness to adversarial perturbations", "book": "Advances in Neural Information Processing Systems", "page_first": 804, "page_last": 813, "abstract": "The input-output mappings learned by state-of-the-art neural networks are significantly discontinuous. It is possible to cause a neural network used for image recognition to misclassify its input by applying very specific, hardly perceptible perturbations to the input, called adversarial perturbations. Many hypotheses have been proposed to explain the existence of these peculiar samples as well as several methods to mitigate them. A proven explanation remains elusive, however. In this work, we take steps towards a formal characterization of adversarial perturbations by deriving lower bounds on the magnitudes of perturbations necessary to change the classification of neural networks. The bounds are experimentally verified on the MNIST and CIFAR-10 data sets.", "full_text": "Lower bounds on the robustness to adversarial\n\nperturbations\n\nJonathan Peck1,2, Joris Roels2,3, Bart Goossens3, and Yvan Saeys1,2\n\n1Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, 9000, Belgium\n\n2Data Mining and Modeling for Biomedicine, VIB In\ufb02ammation Research Center, Ghent, 9052, Belgium\n3Department of Telecommunications and Information Processing, Ghent University, Ghent, 9000, Belgium\n\nAbstract\n\nThe input-output mappings learned by state-of-the-art neural networks are sig-\nni\ufb01cantly discontinuous. It is possible to cause a neural network used for image\nrecognition to misclassify its input by applying very speci\ufb01c, hardly perceptible\nperturbations to the input, called adversarial perturbations. Many hypotheses have\nbeen proposed to explain the existence of these peculiar samples as well as several\nmethods to mitigate them, but a proven explanation remains elusive. 
In this work,\nwe take steps towards a formal characterization of adversarial perturbations by\nderiving lower bounds on the magnitudes of perturbations necessary to change the\nclassi\ufb01cation of neural networks. The proposed bounds can be computed ef\ufb01ciently,\nrequiring time at most linear in the number of parameters and hyperparameters\nof the model for any given sample. This makes them suitable for use in model\nselection, when one wishes to \ufb01nd out which of several proposed classi\ufb01ers is\nmost robust to adversarial perturbations. They may also be used as a basis for\ndeveloping techniques to increase the robustness of classi\ufb01ers, since they enjoy the\ntheoretical guarantee that no adversarial perturbation could possibly be any smaller\nthan the quantities provided by the bounds. We experimentally verify the bounds\non the MNIST and CIFAR-10 data sets and \ufb01nd no violations. Additionally, the\nexperimental results suggest that very small adversarial perturbations may occur\nwith non-zero probability on natural samples.\n\n1\n\nIntroduction\n\nDespite their big successes in various AI tasks, neural networks are basically black boxes: there is no\nclear fundamental explanation how they are able to outperform the more classical approaches. This\nhas led to the identi\ufb01cation of several unexpected and counter-intuitive properties of neural networks.\nIn particular, Szegedy et al. [2014] discovered that the input-output mappings learned by state-of-the-\nart neural networks are signi\ufb01cantly discontinuous. It is possible to cause a neural network used for\nimage recognition to misclassify its input by applying a very speci\ufb01c, hardly perceptible perturbation\nto the input. Szegedy et al. 
[2014] call these perturbations adversarial perturbations, and the inputs\nresulting from applying them to natural samples are called adversarial examples.\nIn this paper, we hope to shed more light on the nature and cause of adversarial examples by\nderiving lower bounds on the magnitudes of perturbations necessary to change the classi\ufb01cation of\nneural network classi\ufb01ers. Such lower bounds are indispensable for developing rigorous methods\nthat increase the robustness of classi\ufb01ers without sacri\ufb01cing accuracy. Since the bounds enjoy the\ntheoretical guarantee that no adversarial perturbation could ever be any smaller, a method which\nincreases these lower bounds potentially makes the classi\ufb01er more robust. They may also aid model\nselection: if the bounds can be computed ef\ufb01ciently, then one can use them to compare different\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fmodels with respect to their robustness to adversarial perturbations and select the model that scores\nthe highest in this regard without the need for extensive empirical tests.\nThe rest of the paper is organized as follows. Section 2 discusses related work that has been done\non the phenomenon of adversarial perturbations; Section 3 details the theoretical framework used\nto prove the lower bounds; Section 4 proves lower bounds on the robustness of different families of\nclassi\ufb01ers to adversarial perturbations; Section 5 empirically veri\ufb01es that the bounds are not violated;\nSection 6 concludes the paper and provides avenues for future work.\n\n2 Related work\n\nSince the puzzling discovery of adversarial perturbations, several hypotheses have been proposed to\nexplain why they exist, as well as a number of methods to make classi\ufb01ers more robust to them.\n\n2.1 Hypotheses\n\nThe leading hypothesis explaining the cause of adversarial perturbations is the linearity hypothesis by\nGoodfellow et al. 
[2015]. According to this view, neural network classifiers tend to act very linearly on their input data despite the presence of non-linear transformations within their layers. Since the input data on which modern classifiers operate is often very high in dimensionality, such linear behavior can cause minute perturbations to the input to have a large impact on the output. In this vein, Lou et al. [2016] propose a variant of the linearity hypothesis which claims that neural network classifiers operate highly linearly on certain regions of their inputs, but non-linearly in other regions. Rozsa et al. [2016] conjecture that adversarial examples exist because of evolutionary stalling: during training, the gradients of samples that are classified correctly diminish, so the learning algorithm "stalls" and does not create significantly flat regions around the training samples. As such, most of the training samples will lie close to some decision boundary, and only a small perturbation is required to push them into a different class.

2.2 Proposed solutions

Gu and Rigazio [2014] propose the Deep Contractive Network, which includes a smoothness penalty in the training procedure inspired by the Contractive Autoencoder. This penalty encourages the Jacobian of the network to have small components, thus making the network robust to small changes in the input. Based on their linearity hypothesis, Goodfellow et al. [2015] propose the fast gradient sign method for efficiently generating adversarial examples. They then use this method as a regularizer during training in an attempt to make networks more robust. Lou et al. [2016] use their "local linearity hypothesis" as the basis for training neural network classifiers using foveations, i.e. a transformation which selects certain regions from the input and discards all other information. Rozsa et al.
[2016] introduce Batch-Adjusted Network Gradients (BANG) based on their idea of evolutionary stalling. BANG normalizes the gradients on a per-minibatch basis so that even correctly classified samples retain significant gradients and the learning algorithm does not stall.

The hypotheses discussed above provide attractive intuitive explanations for the cause of adversarial examples, and empirical results seem to suggest that the methods based on them are effective at eliminating them. However, none of the hypotheses on which these methods are based have been formally proven. Hence, even with the protections discussed above, it may still be possible to generate adversarial examples for classifiers using techniques which defy the proposed hypotheses. As such, there is a need to formally characterize the nature of adversarial examples. Fawzi et al. [2016] take a step in this direction by deriving precise bounds on the norms of adversarial perturbations of arbitrary classifiers in terms of the curvature of the decision boundary. Their analysis encourages imposing geometric constraints on this curvature in order to improve robustness. However, it is not obvious how such constraints relate to the parameters of the models and hence how one would best implement them in practice. In this work, we derive lower bounds on the robustness of neural networks directly in terms of their model parameters. We consider only feedforward networks composed of convolutional layers, pooling layers, fully-connected layers and softmax layers.

3 Theoretical framework

The theoretical framework used in this paper draws heavily from Fawzi et al. [2016] and Papernot et al. [2016]. In the following, ‖·‖ denotes the Euclidean norm and ‖·‖F denotes the Frobenius norm. We assume we want to train a classifier f : R^d → {1, . . .
, C} to correctly assign one of C different classes to input vectors x from a d-dimensional Euclidean space. Let µ denote the probability measure on R^d and let f* be an oracle that always returns the correct label for any input. The distribution µ is assumed to be of bounded support, i.e. P_{x∼µ}(x ∈ X) = 1 with X = {x ∈ R^d | ‖x‖ ≤ M} for some M > 0.

Formally, adversarial perturbations are defined relative to a classifier f and an input x. A perturbation η is called an adversarial perturbation of x for f if f(x + η) ≠ f(x) while f*(x + η) = f*(x). An adversarial perturbation η is called minimal if no other adversarial perturbation ξ for x and f satisfies ‖ξ‖ < ‖η‖. In this work, we will focus on minimal adversarial perturbations.

The robustness of a classifier f is defined as the expected norm of the smallest perturbation necessary to change the classification of an arbitrary input x sampled from µ:

\[ \rho_{\mathrm{adv}}(f) = \mathbb{E}_{x \sim \mu}[\Delta_{\mathrm{adv}}(x; f)], \qquad \text{where} \qquad \Delta_{\mathrm{adv}}(x; f) = \min_{\eta \in \mathbb{R}^d} \{ \|\eta\| \mid f(x + \eta) \ne f(x) \}. \]

A multi-index is a tuple of non-negative integers, generally denoted by Greek letters such as α and β. For a multi-index α = (α_1, . . . , α_n) and a function f we define

\[ |\alpha| = \alpha_1 + \cdots + \alpha_n, \qquad \partial^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}}. \]

The Jacobian matrix of a function f : R^n → R^m : x ↦ [f_1(x), . . . , f_m(x)]^T is defined as

\[ \frac{\partial}{\partial x} f = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}. \]

3.1 Families of classifiers

The derivation of the lower bounds will be built up incrementally. We will start with the family of linear classifiers, which are among the simplest. Then, we extend the analysis to Multi-Layer Perceptrons, which are the oldest neural network architectures. Finally, we analyze Convolutional Neural Networks. In this section, we introduce each of these families of classifiers in turn.

A linear classifier is a classifier f of the form

\[ f(x) = \arg\max_{i=1,\dots,C} w_i \cdot x + b_i. \]

The vectors w_i are called weights and the scalars b_i are called biases.

A Multi-Layer Perceptron (MLP) is a classifier given by

\[ f(x) = \arg\max_{i=1,\dots,C} \mathrm{softmax}(h_L(x))_i, \qquad h_L(x) = g_L(V_L h_{L-1}(x) + b_L), \quad \dots, \quad h_1(x) = g_1(V_1 x + b_1). \]

An MLP is nothing more than a series of linear transformations V_l h_{l−1}(x) + b_l followed by non-linear activation functions g_l (e.g. a ReLU [Glorot et al., 2011]). Here, softmax is the softmax function:

\[ \mathrm{softmax}(y)_i = \frac{\exp(w_i \cdot y + b_i)}{\sum_j \exp(w_j \cdot y + b_j)}. \]

This function is a popular choice as the final layer for an MLP used for classification, but it is by no means the only possibility. Note that having a softmax as the final layer essentially turns the network into a linear classifier of the output of its penultimate layer, h_L(x).

A Convolutional Neural Network (CNN) is a neural network that uses at least one convolution operation. For an input tensor X ∈ R^{c×d×d} and a kernel tensor W ∈ R^{k×c×q×q}, the discrete convolution of X and W is given by

\[ (X \star W)_{ijk} = \sum_{n=1}^{c} \sum_{m=1}^{q} \sum_{l=1}^{q} w_{i,n,m,l}\, x_{n,\,m+s(j-1),\,l+s(k-1)}. \]

Here, s is the stride of the convolution. The output of such a layer is a 3D tensor of size k × t × t, where t = (d − q)/s + 1. After the convolution operation, usually a bias b ∈ R^k is added to each of the feature maps. The different components (W ⋆ X)_i constitute the feature maps of this convolutional layer. In a slight abuse of notation, we will write W ⋆ X + b to signify the tensor W ⋆ X where each of the k feature maps has its respective bias added in:

\[ (W \star X + b)_{ijk} = (W \star X)_{ijk} + b_i. \]

CNNs also often employ pooling layers, which perform a sort of dimensionality reduction. If we write the output of a pooling layer as Z(X), then we have

\[ z_{ijk}(X) = p(\{ x_{i,\,n+s(j-1),\,m+s(k-1)} \mid 1 \le n, m \le q \}). \]

Here, p is the pooling operation, s is the stride and q is a parameter. The output tensor Z(X) has dimensions c × t × t. For ease of notation, we assume each pooling operation has an associated function I such that

\[ z_{ijk}(X) = p(\{ x_{inm} \mid (n, m) \in I(j, k) \}). \]

In the literature, the set I(j, k) is referred to as the receptive field of the pooling layer. Each receptive field corresponds to some q × q region in the input X. Common pooling operations include taking the maximum of all inputs, averaging the inputs and taking an Lp norm of the inputs.

4 Lower bounds on classifier robustness

Comparing the architectures of several practical CNNs such as LeNet [Lecun et al., 1998], AlexNet [Krizhevsky et al., 2012], VGGNet [Simonyan and Zisserman, 2015], GoogLeNet [Szegedy et al., 2015] and ResNet [He et al., 2016], it would seem the only workable approach is a "modular" one. If we succeed in lower-bounding the robustness of some layer given the robustness of the next layer, we can work our way backwards through the network, starting at the output layer and continuing until we reach the input layer. That way, our approach can be applied to any feedforward neural network as long as the robustness bounds of the different layer types have been established.
To be precise, if a given layer computes a function h of its input y and if the following layer has a robustness bound of κ, in the sense that any adversarial perturbation to that layer has a Euclidean norm of at least κ, then we want to find a perturbation r such that

\[ \|h(y + r)\| = \|h(y)\| + \kappa. \]

This is clearly a necessary condition for any adversarial perturbation to the given layer. Hence, any adversarial perturbation q to this layer will satisfy ‖q‖ ≥ ‖r‖. Of course, the output layer of the network will require special treatment. For softmax output layers, κ is the norm of the smallest perturbation necessary to change the maximal component of the classification vector.

The obvious downside of this idea is that we most likely introduce cumulative approximation errors which increase with the number of layers of the network. In turn, however, we get a flexible and efficient framework which can handle any feedforward architecture composed of known layer types.

4.1 Softmax output layers

We now want to find the smallest perturbation r to the input x of a softmax layer such that f(x + r) ≠ f(x). It can be proven (Theorem A.3) that any such perturbation satisfies

\[ \|r\| \ge \min_{c' \ne c} \frac{|(w_{c'} - w_c) \cdot x + b_{c'} - b_c|}{\|w_{c'} - w_c\|}, \]

where f(x) = c. Moreover, there exist classifiers for which this bound is tight (Theorem A.4).

4.2 Fully-connected layers

To analyze the robustness of fully-connected layers to adversarial perturbations, we assume the next layer has a robustness of κ (this will usually be the softmax output layer; however, there exist CNNs which employ fully-connected layers in locations other than just at the end [Lin et al., 2014]).
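The softmax-layer bound of Section 4.1 is a simple computation over the final layer's weights: the distance from x to the nearest decision hyperplane. A minimal NumPy sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def softmax_layer_bound(W, b, x):
    """Lower bound on ||r|| needed to change the argmax of W x + b.

    W: (C, n) weight matrix, b: (C,) bias vector, x: (n,) input to the
    softmax layer. Implements the bound of Theorem A.3: the minimum over
    all competing classes c' of the distance from x to the hyperplane
    where class c' overtakes the current class c.
    """
    scores = W @ x + b
    c = int(np.argmax(scores))  # currently predicted class
    bounds = []
    for cp in range(len(b)):
        if cp == c:
            continue
        dw = W[cp] - W[c]
        # |(w_c' - w_c) . x + b_c' - b_c| / ||w_c' - w_c||
        bounds.append(abs(dw @ x + b[cp] - b[c]) / np.linalg.norm(dw))
    return min(bounds)
```

For example, with an identity weight matrix, zero biases and x = (1, 0), the nearest decision boundary is the diagonal x1 = x2, at Euclidean distance 1/√2 from x.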
We then want to find a perturbation r such that

\[ \|h_L(x + r)\| = \|h_L(x)\| + \kappa. \]

We find

Theorem 4.1. Let h_L : R^d → R^n be twice differentiable with second-order derivatives bounded by M. Then for any x ∈ R^d,

\[ \|r\| \ge \frac{\sqrt{\|J(x)\|^2 + 2M\sqrt{n}\,\kappa} - \|J(x)\|}{M\sqrt{n}}, \tag{1} \]

where J(x) is the Jacobian matrix of h_L at x.

The proof can be found in Appendix A. In Theorem A.5 it is proved that the assumptions on h_L are usually satisfied in practice. The proof of this theorem also yields an efficient algorithm for approximating M, a task which otherwise might involve a prohibitively expensive optimization problem.

4.3 Convolutional layers

The next layer of the network is assumed to have a robustness bound of κ, in the sense that any adversarial perturbation Q to X must satisfy ‖Q‖F ≥ κ. We can now attempt to bound the norm of a perturbation R to X such that

\[ \|\mathrm{ReLU}(W \star (X + R) + b)\|_F = \|\mathrm{ReLU}(W \star X + b)\|_F + \kappa. \]

We find

Theorem 4.2. Consider a convolutional layer with filter tensor W ∈ R^{k×c×q×q} and stride s whose input consists of a 3D tensor X ∈ R^{c×d×d}. Suppose the next layer has a robustness bound of κ; then any adversarial perturbation to the input of this layer must satisfy

\[ \|R\|_F \ge \frac{\kappa}{\|W\|_F}. \tag{2} \]

The proof of Theorem 4.2 can be found in Appendix A.

4.4 Pooling layers

To facilitate the analysis of the pooling layers, we make the following assumption, which is satisfied by the most common pooling operations (see Appendix B):

Assumption 4.3. The pooling operation satisfies

\[ z_{ijk}(X + R) \le z_{ijk}(X) + z_{ijk}(R). \]

We have

Theorem 4.4. Consider a pooling layer whose operation satisfies Assumption 4.3.
Let the input be of size c × d × d and the receptive field of size q × q. Let the output be of size c × t × t. If the robustness bound of the next layer is κ, then the following bounds hold for any adversarial perturbation R:

• MAX or average pooling:

\[ \|R\|_F \ge \frac{\kappa}{t}. \tag{3} \]

• Lp pooling:

\[ \|R\|_F \ge \frac{\kappa}{t\,q^{2/p}}. \tag{4} \]

The proof can be found in Appendix A.

Figure 1: Illustration of the LeNet architecture. Image taken from Lecun et al. [1998].

Table 1: Normalized summary of norms of adversarial perturbations found by FGS on MNIST and CIFAR-10 test sets

Data set    Mean       Median     Std         Min        Max
MNIST       0.933448   0.884287   0.4655439   0.000023   3.306903
CIFAR-10    0.0218984  0.0091399  0.06103627  0.0000012  1.6975207

5 Experimental results

We tested the theoretical bounds on the MNIST and CIFAR-10 test sets using the Caffe [Jia et al., 2014] implementation of LeNet [Lecun et al., 1998]. The MNIST data set [LeCun et al., 1998] consists of 70,000 28 × 28 images of handwritten digits; the CIFAR-10 data set [Krizhevsky and Hinton, 2009] consists of 60,000 32 × 32 RGB images of various natural scenes, each belonging to one of ten possible classes. The architecture of LeNet is depicted in Figure 1. The kernels of the two convolutional layers will be written as W1 and W2, respectively. The output sizes of the two pooling layers will be written as t1 and t2. The function computed by the first fully-connected layer will be denoted by h with Jacobian J. The last fully-connected layer has a weight matrix V and bias vector b.
For an input sample x, the theoretical lower bound on the adversarial robustness of the network with respect to x is given by κ_1, where

\[ \kappa_6 = \min_{c' \ne c} \frac{|(v_{c'} - v_c) \cdot x + b_{c'} - b_c|}{\|v_{c'} - v_c\|}, \qquad \kappa_5 = \frac{\sqrt{\|J(x)\|^2 + 2M\sqrt{500}\,\kappa_6} - \|J(x)\|}{M\sqrt{500}}, \]

\[ \kappa_4 = \frac{\kappa_5}{t_2}, \qquad \kappa_3 = \frac{\kappa_4}{\|W_2\|_F}, \qquad \kappa_2 = \frac{\kappa_3}{t_1}, \qquad \kappa_1 = \frac{\kappa_2}{\|W_1\|_F}. \]

Because our method only computes norms and does not provide a way to generate actual adversarial perturbations, we used the fast gradient sign method (FGS) [Goodfellow et al., 2015] to adversarially perturb each sample in the test sets in order to assess the tightness of our theoretical bounds. FGS linearizes the cost function of the network to obtain an estimated perturbation

\[ \eta = \varepsilon\,\mathrm{sign}\,\nabla_x L(x, \theta). \]

Here, ε > 0 is a parameter of the algorithm, L is the loss function and θ is the set of parameters of the network. The magnitudes of the perturbations found by FGS depend on the choice of ε, so we had to minimize this value in order to obtain the smallest perturbations the FGS method could supply. This was accomplished using a simple binary search for the smallest value of ε which still resulted in misclassification. As the MNIST and CIFAR-10 samples have pixel values within the range [0, 255], we upper-bounded ε by 100.

No violations of the bounds were detected in our experiments. Figure 2 shows histograms of the norms of adversarial perturbations found by FGS and Table 1 summarizes their statistics. Histograms of the theoretical bounds of all samples in the test set are shown in Figure 3; their statistics are summarized in Table 2.
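The chain κ_6 → κ_1 above is a mechanical backward pass over the per-layer bounds of Sections 4.1–4.4. A sketch of this composition for a LeNet-style network; all function and argument names are our own, and ‖J(x)‖ is taken to be the spectral norm as an assumption:

```python
import numpy as np

def lenet_robustness_bound(V, bias, h_x, J, M, W2, W1, t2, t1, n=500):
    """Chain the per-layer lower bounds backwards through a LeNet-style net.

    V, bias: weights/biases of the final fully-connected (softmax) layer;
    h_x: its input (penultimate activations); J: Jacobian of the first
    fully-connected layer at the sample, M: bound on its second derivatives;
    W2, W1: conv kernel tensors; t2, t1: pooling output sizes; n: width of
    the first fully-connected layer (500 in the Caffe LeNet).
    """
    # kappa_6: softmax output layer (distance to nearest decision hyperplane)
    scores = V @ h_x + bias
    c = int(np.argmax(scores))
    k6 = min(
        abs((V[cp] - V[c]) @ h_x + bias[cp] - bias[c])
        / np.linalg.norm(V[cp] - V[c])
        for cp in range(len(bias)) if cp != c
    )
    # kappa_5: fully-connected layer (Theorem 4.1)
    Jn = np.linalg.norm(J, 2)          # spectral norm of the Jacobian
    k5 = (np.sqrt(Jn**2 + 2 * M * np.sqrt(n) * k6) - Jn) / (M * np.sqrt(n))
    # kappa_4, kappa_2: pooling layers (Theorem 4.4, max/average pooling)
    # kappa_3, kappa_1: convolutional layers (Theorem 4.2)
    k4 = k5 / t2
    k3 = k4 / np.linalg.norm(W2)       # Frobenius norm of the kernel tensor
    k2 = k3 / t1
    k1 = k2 / np.linalg.norm(W1)
    return k1
```

Each step only divides by a norm or a size, so for a fixed sample the whole chain costs time linear in the number of model parameters, as claimed in the abstract.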
Note that the statistics of Tables 1 and 2 have been normalized by dividing them by the dimensionality of their respective data sets (i.e. 28 × 28 for MNIST and 3 × 32 × 32 for CIFAR-10) to allow for a meaningful comparison between the two networks. Figure 4 provides histograms of the per-sample log-ratio between the norms of the adversarial perturbations and their corresponding theoretical lower bounds.

Figure 2: Histograms of norms of adversarial perturbations found by FGS on the MNIST and CIFAR-10 test sets. Panels: (a) MNIST, (b) CIFAR-10.

Figure 3: Histograms of theoretical bounds on the MNIST and CIFAR-10 test sets. Panels: (a) MNIST, (b) CIFAR-10.

Although the theoretical bounds on average deviate considerably from the perturbations found by FGS, one has to take into consideration that the theoretical bounds were constructed to provide a worst-case estimate for the norms of adversarial perturbations. These worst-case estimates need not be tight for all (or even most) input samples. Furthermore, the smallest perturbations we were able to generate on the two data sets have norms that are much closer to the theoretical bound than their averages (0.0179 for MNIST and 0.0000012 for CIFAR-10). This indicates that the theoretical bound is not necessarily very loose, but rather that very small adversarial perturbations occur with non-zero probability on natural samples. Note also that the FGS method does not necessarily generate minimal perturbations even with the smallest choice of ε: the method depends on the linearity hypothesis and uses a first-order Taylor approximation of the loss function.
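The ε-minimization described above can be sketched independently of any particular framework. A hypothetical implementation of the binary search over ε (the `grad_sign` and `predicts_same` callables, and the function name, are our own stand-ins for the network-specific pieces):

```python
import numpy as np

def smallest_fgs_epsilon(x, grad_sign, predicts_same, eps_hi=100.0, iters=30):
    """Binary search for the smallest eps such that the FGS perturbation
    eps * sign(grad_x L(x, theta)) changes the classification of x.

    grad_sign: precomputed sign of the loss gradient at x.
    predicts_same(x_adv): True iff x_adv receives the same label as x.
    Returns (eps, perturbation norm), or None if even eps_hi fails to fool.
    """
    if predicts_same(x + eps_hi * grad_sign):
        return None                 # the largest allowed eps does not fool
    lo, hi = 0.0, eps_hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if predicts_same(x + mid * grad_sign):
            lo = mid                # still classified correctly: go larger
        else:
            hi = mid                # misclassified: try a smaller eps
    eta = hi * grad_sign            # smallest fooling perturbation found
    return hi, np.linalg.norm(eta)
```

With 30 iterations the search resolves ε to about eps_hi / 2^30, which is far below the perturbation magnitudes reported in Table 1.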
Higher-order methods may find much smaller perturbations by exploiting non-linearities in the network, but these are generally much less efficient than FGS.

Table 2: Normalized summary of theoretical bounds on MNIST and CIFAR-10 test sets

Data set    Mean        Median      Std            Min         Max
MNIST       7.274e−8    6.547e−8    4.229566e−8    4.073e−10   2.932e−7
CIFAR-10    4.812e−13   4.445e−13   2.605381e−13   7.563e−15   2.098e−12

Figure 4: Histograms of the per-sample log-ratio between adversarial perturbation and lower bound for the MNIST and CIFAR-10 test sets. A higher ratio indicates a bigger deviation of the theoretical bound from the empirical norm. Panels: (a) MNIST, (b) CIFAR-10.

There is a striking difference in magnitude between MNIST and CIFAR-10 in both the empirical and theoretical perturbations: the perturbations on MNIST are much larger than the ones found for CIFAR-10. This result can be explained by the linearity hypothesis of Goodfellow et al. [2015]. The input samples of CIFAR-10 are much larger in dimensionality than MNIST samples, so the linearity hypothesis correctly predicts that networks trained on CIFAR-10 are more susceptible to adversarial perturbations due to the highly linear behavior these classifiers are conjectured to exhibit. However, these differences may also be related to the fact that LeNet achieves much lower accuracy on the CIFAR-10 data set than it does on MNIST (over 99% on MNIST compared to about 60% on CIFAR-10).

6 Conclusion and future work

Despite attracting a significant amount of research interest, a precise characterization of adversarial examples remains elusive. In this paper, we derived lower bounds on the norms of adversarial perturbations in terms of the model parameters of feedforward neural network classifiers consisting of convolutional layers, pooling layers, fully-connected layers and softmax layers.
The bounds can be\ncomputed ef\ufb01ciently and thus may serve as an aid in model selection or the development of methods\nto increase the robustness of classi\ufb01ers. They enable one to assess the robustness of a classi\ufb01er\nwithout running extensive tests, so they can be used to compare different models and quickly select\nthe one with highest robustness. Furthermore, the bounds enjoy a theoretical guarantee that no\nadversarial perturbation could ever be smaller, so methods which increase these bounds may make\nclassi\ufb01ers more robust. We tested the validity of our bounds on MNIST and CIFAR-10 and found no\nviolations. Comparisons with adversarial perturbations generated using the fast gradient sign method\nsuggest that these bounds can be close to the actual norms in the worst case.\nWe have only derived lower bounds for feedforward networks consisting of fully-connected layers,\nconvolutional layers and pooling layers. Extending this analysis to recurrent networks and other types\nof layers such as Batch Normalization [Ioffe and Szegedy, 2015] and Local Response Normalization\n[Krizhevsky et al., 2012] is an obvious avenue for future work.\nIt would also be interesting to quantify just how tight the above bounds really are. In the absence\nof a precise characterization of adversarial examples, the only way to do this would be to generate\nadversarial perturbations using optimization techniques that make no assumptions on their underlying\ncause. Szegedy et al. [2014] use a box-constrained L-BFGS approach to generate adversarial examples\nwithout any assumptions, so using this method for comparison could provide a more accurate picture\nof how tight the theoretical bounds are. It is much less ef\ufb01cient than the FGS method, however.\nThe analysis presented here is a \u201cmodular\u201d one: we consider each layer in isolation, and derive bounds\non their robustness in terms of the robustness of the next layer. 
However, it may also be insightful to\nstudy the relationship between the number of layers, the breadth of each layer and the robustness of\nthe network. Providing estimates on the approximation errors incurred by this layer-wise approach\ncould also be useful.\n\n8\n\n\fFinally, there is currently no known precise characterization of the trade-off between classi\ufb01er robust-\nness and accuracy. Intuitively, one might expect that as the robustness of the classi\ufb01er increases, its\naccuracy will also increase up to a point since it is becoming more robust to adversarial perturbations.\nOnce the robustness exceeds a certain threshold, however, we expect the accuracy to drop because the\ndecision surfaces are becoming too \ufb02at and the classi\ufb01er becomes too insensitive to changes. Having\na precise characterization of this relationship between robustness and accuracy may aid methods\ndesigned to protect classi\ufb01ers against adversarial examples while also maintaining state-of-the-art\naccuracy.\n\nReferences\nA. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classi\ufb01ers: from adversarial to\nrandom noise. In Proceedings of Advances in Neural Information Processing Systems 29, pages\n1632\u20131640. Curran Associates, Inc., 2016.\n\nX. Glorot, A. Bordes, and Y. Bengio. Deep sparse recti\ufb01er neural networks. In Proceedings of\nthe Fourteenth International Conference on Arti\ufb01cial Intelligence and Statistics, volume 15 of\nProceedings of Machine Learning Research, pages 315\u2013323, Fort Lauderdale, FL, USA, 11\u201313\nApr 2011. PMLR.\n\nI. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In\nProceedings of the Third International Conference on Learning Representations, volume 3 of\nProceedings of the International Conference on Learning Representations, San Diego, CA, USA,\n7\u20139 May 2015. ICLR.\n\nS. Gu and L. Rigazio. 
Towards deep neural network architectures robust to adversarial examples.\n\nNIPS Workshop on Deep Learning and Representation Learning, 2014.\n\nK. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, Las Vegas,\nNV, USA, 26 Jun \u2013 1 Jul 2016. CVPR.\n\nS. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing\ninternal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning,\nvolume 37 of Proceedings of the International Conference on Machine Learning, pages 448\u2013456,\nLille, 6\u201311 Jul 2015. JMLR.\n\nY. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell.\nCaffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM\nInternational Conference on Multimedia, pages 675\u2013678. ACM, 2014.\n\nA. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical\n\nreport, University of Toronto, 2009.\n\nA. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classi\ufb01cation with deep convolutional neural\nnetworks. In Proceedings of the 25th International Conference on Advances in Neural Information\nProcessing Systems, volume 25 of Advances in Neural Information Processing Systems, pages\n1097\u20131105, Lake Tahoe, USA, 3\u20138 Dec 2012. NIPS.\n\nY. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document\n\nrecognition. Proceedings of the IEEE, 86(11):2278\u20132324, Nov 1998.\n\nY. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits. http://yann.\n\nlecun.com/exdb/mnist/, 1998. Accessed 2017-04-17.\n\nM. Lin, Q. Chen, and S. Yan. Network in network. Proceedings of International Conference on\n\nLearning Representations, 2014.\n\nY. Lou, X. Boix, G. Roig, T. Poggio, and Q. Zhao. 
Foveation-based mechanisms alleviate adversarial\n\nexamples. arXiv preprint arXiv:1511.06292, 2016.\n\nN. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial\nperturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy\n(SP), pages 582\u2013597, May 2016.\n\n9\n\n\fA. Rozsa, M. Gunther, and T. E. Boult. Towards robust deep neural networks with BANG. arXiv\n\npreprint arXiv:1612.00138, 2016.\n\nK. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\nIn Proceedings of the Third International Conference on Learning Representations, volume 3 of\nProceedings of the International Conference on Learning Representations, San Diego, CA, USA,\n7\u20139 May 2015. ICLR.\n\nC. Szegedy, W. Zaremba, and I. Sutskever. Intriguing properties of neural networks. In Proceedings\nof the Second International Conference on Learning Representations, volume 2 of Proceedings\nof the International Conference on Learning Representations, Banff, Canada, 14\u201316 Apr 2014.\nICLR.\n\nC. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and\nA. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition, Boston, MA, USA, 7-12 Jun 2015. CVPR.\n\n10\n\n\f", "award": [], "sourceid": 532, "authors": [{"given_name": "Jonathan", "family_name": "Peck", "institution": "Ghent University"}, {"given_name": "Joris", "family_name": "Roels", "institution": "Ghent University"}, {"given_name": "Bart", "family_name": "Goossens", "institution": "Ghent University"}, {"given_name": "Yvan", "family_name": "Saeys", "institution": "Ghent University"}]}