{"title": "Sparse DNNs with Improved Adversarial Robustness", "book": "Advances in Neural Information Processing Systems", "page_first": 242, "page_last": 251, "abstract": "Deep neural networks (DNNs) are computationally/memory-intensive and vulnerable to adversarial attacks, making them prohibitive in some real-world applications. By converting dense models into sparse ones, pruning appears to be a promising solution to reducing the computation/memory cost. This paper studies classification models, especially DNN-based ones, to demonstrate that there exists intrinsic relationships between their sparsity and adversarial robustness. Our analyses reveal, both theoretically and empirically, that nonlinear DNN-based classifiers behave differently under $l_2$ attacks from some linear ones. We further demonstrate that an appropriately higher model sparsity implies better robustness of nonlinear DNNs, whereas over-sparsified models can be more difficult to resist adversarial examples.", "full_text": "Sparse DNNs with Improved Adversarial Robustness\n\nYiwen Guo1, 2\u2217 Chao Zhang3\u2217 Changshui Zhang2 Yurong Chen1\n\n1 Intel Labs China\n\n2 Institute for Arti\ufb01cial Intelligence, Tsinghua University (THUAI),\n\nState Key Lab of Intelligent Technologies and Systems,\n\nBeijing National Research Center for Information Science and Technology (BNRis),\n\nDepartment of Automation, Tsinghua University\n\n3 Academy for Advanced Interdisciplinary Studies, Center for Data Science, Peking University\nzcs@mail.tsinghua.edu.cn\n{yiwen.guo, yurong.chen}@intel.com\n\npkuzc@pku.edu.cn\n\nAbstract\n\nDeep neural networks (DNNs) are computationally/memory-intensive and vulnera-\nble to adversarial attacks, making them prohibitive in some real-world applications.\nBy converting dense models into sparse ones, pruning appears to be a promising\nsolution to reducing the computation/memory cost. 
This paper studies classification models, especially DNN-based ones, to demonstrate that there exist intrinsic relationships between their sparsity and adversarial robustness. Our analyses reveal, both theoretically and empirically, that nonlinear DNN-based classifiers behave differently under l2 attacks from some linear ones. We further demonstrate that an appropriately higher model sparsity implies better robustness for nonlinear DNNs, whereas over-sparsified models are more difficult to defend against adversarial examples.

1 Introduction

Although deep neural networks (DNNs) have advanced the state of the art of many artificial intelligence techniques, some undesired properties may hinder them from being deployed in real-world applications. With the continued proliferation of deep-learning-powered applications, one major concern raised recently is the heavy computation and storage burden that DNN models lay upon mobile platforms. Such burden stems from substantially redundant feature representations and parameterizations [6]. To address this issue and make DNNs less resource-intensive, a variety of solutions have been proposed. In particular, it has been reported that more than 90% of the connections in a well-trained DNN can be removed using pruning strategies [14, 13, 28, 21, 23], while no accuracy loss is observed. Such remarkable network sparsity leads to considerable compression and speedups on both GPUs and CPUs [25]. Aside from being efficient, sparse representations are theoretically attractive [2, 8] and have made their way into numerous applications over the past decade.

Orthogonal to the inefficiency issue, it has also been discovered that DNN models are vulnerable to adversarial examples—maliciously generated images which are perceptually similar to benign ones but can fool classifiers into making arbitrary predictions [26, 3].
Furthermore, generic regularizations (e.g., dropout and weight decay) do not really help in resisting adversarial attacks [11]. Such an undesirable property may prohibit DNNs from being applied to security-sensitive applications. The cause of this phenomenon seems mysterious and remains an open question. One reasonable explanation is the local linearity of modern DNNs [11]. Many attempts, including adversarial training [11, 27, 19], knowledge distillation [24], detecting and rejecting [18], and gradient masking techniques like randomization [31], have been made to ameliorate this issue and defend against adversarial attacks.

* The first two authors contributed equally to this work.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

It is crucial to study potential relationships between the inefficiency (i.e., redundancy) and the adversarial robustness of classifiers, in the hope of avoiding "robbing Peter to pay Paul" where possible. Towards shedding light on such relationships, especially for DNNs, we provide comprehensive analyses in this paper from both theoretical and empirical perspectives. By introducing reasonable metrics, we reveal, somewhat surprisingly, that there is a discrepancy between the robustness of sparse linear classifiers and that of nonlinear DNNs under l2 attacks. Our results also demonstrate that an appropriately higher sparsity implies better robustness for nonlinear DNNs, whereas over-sparsified models are more difficult to defend against adversarial examples, under both the l∞ and l2 circumstances.

2 Related Works

In light of the "Occam's razor" principle, we presume there exist intrinsic relationships between the sparsity and robustness of classifiers, and we thus perform a comprehensive study in this paper.
Our theoretical and empirical analyses shall cover both linear classifiers and nonlinear DNNs, in which the middle-layer activations and connection weights can all become sparse.

The (in)efficiency and robustness of DNNs have seldom been discussed together, especially from a theoretical point of view. Very recently, Gopalakrishnan et al. [12, 20] proposed to sparsify the input representations as a defense and provided provable evidence of resisting l∞ attacks. Though intriguing, their theoretical analyses are limited to linear and binary classification cases. Contemporaneous with our work, Wang et al. [29] and Ye et al. [32] experimentally discuss how pruning affects the robustness of some DNNs but, surprisingly, draw opposite conclusions. Galloway et al. [9] focus on binary DNNs instead of sparse ones and show that performing adversarial attacks on binary DNNs remains as difficult as training them.

To some extent, several very recent defense methods also utilize the sparsity of DNNs. For improved model robustness, Gao et al. [10] attempt to detect the feature activations exclusive to adversarial examples and prune them away. Dhillon et al. [7] choose an alternative way that prunes activations stochastically to mask gradients. These methods focus only on the sparsity of middle-layer activations and pay little attention to the sparsity of connections.

3 Sparsity and Robustness of Classifiers

This paper aims at analyzing and exploring potential relationships between the sparsity and robustness of classifiers under untargeted white-box adversarial attacks, from both theoretical and practical perspectives. To be more specific, we consider models which learn parameterized mappings $x_i \mapsto y_i$ when given a set of labelled training samples $\{(x_i, y_i)\}$ for supervision.
Similar to many other theoretical efforts, our analyses start from linear classifiers and will be generalized to nonlinear DNNs later in Section 3.2.

Generally, the sparsity of a DNN model can be considered in two aspects: the sparsity of connections among neurons and the sparsity of neuron activations. In particular, the sparsity of activations covers both the middle-layer activations and the inputs, the latter of which can be treated as a special case. Knowing that input sparsity has been discussed previously [12], we shall focus primarily on the weight and activation sparsity for nonlinear DNNs and study just the weight sparsity for linear models.

3.1 Linear Models

For simplicity of notation, we first give theoretical results for binary classifiers with $\hat{y}_i = \mathrm{sgn}(w^T x_i)$, in which $w, x_i \in \mathbb{R}^n$. We also omit the bias term $b$ for clarity. Notice that $w^T x + b$ can simply be rewritten as $\tilde{w}^T [x; 1]$ with $\tilde{w} = [w; b]$, so all our theoretical results in the sequel apply directly to linear cases with bias. Given ground-truth labels $y_i \in \{+1, -1\}$, a classifier can be effectively trained by minimizing the empirical loss $\sum_i \tau(-y_i \cdot w^T x_i)$ using the softplus function $\tau(\cdot) = \log(1 + \exp(\cdot))$ [11].

Adversarial attacks typically minimize an $l_p$ norm (e.g., $l_2$, $l_\infty$, $l_1$, or $l_0$) of the required perturbation under certain (box) constraints. Though not completely equivalent to the distinctions in our visual domain, such norms play a crucial role in evaluating adversarial robustness. We study both the $l_\infty$ and $l_2$ attacks in this paper.
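As a concrete toy illustration of the training setup just described (the data, dimensions, and learning rate below are arbitrary choices for the example, not the paper's settings), a binary linear classifier can be fitted by gradient descent on the softplus loss:

```python
import numpy as np

def softplus_loss_grad(w, X, y):
    """Gradient of sum_i log(1 + exp(-y_i * w^T x_i)) w.r.t. w."""
    z = -y * (X @ w)                      # margins, shape (m,)
    s = 1.0 / (1.0 + np.exp(-z))          # sigmoid(z) = d loss / d z
    return X.T @ (-y * s)

rng = np.random.default_rng(0)
n, m = 20, 400
w_true = rng.normal(size=n)
X = rng.normal(size=(m, n))
y = np.sign(X @ w_true)                   # linearly separable toy labels

w = np.zeros(n)
for _ in range(200):                      # plain gradient descent on the mean loss
    w -= 0.01 / m * softplus_loss_grad(w, X, y)

acc = np.mean(np.sign(X @ w) == y)        # training accuracy of sgn(w^T x)
```

Because the toy labels are generated by a linear rule, a few hundred descent steps already align `w` with the separating direction.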
To evaluate the ability to resist these two kinds of attacks in aggregate, we propose the following metrics for the robustness of linear models:

Binary: $r_\infty := \mathbb{E}_{x,y}\left(1_{y = \mathrm{sgn}(w^T \check{x})}\right), \qquad r_2 := \mathbb{E}_{x,y}\left(1_{y = \hat{y}} \cdot d(x, \tilde{x})\right).$  (1)

Here we let $\check{x} = x - \epsilon y \cdot \mathrm{sgn}(w)$ and $\tilde{x} = x - w(w^T x)/\|w\|_2^2$ be the adversarial examples generated by applying the fast gradient sign (FGS) [11] and DeepFool [22] methods as representatives. Without box constraints on the image domain, they can be regarded as the optimal $l_\infty$ and $l_2$ attacks targeting linear classifiers [20, 22]. The function $d$ calculates the Euclidean distance between two $n$-dimensional vectors, and we know that $d(x, \tilde{x}) = |w^T x|/\|w\|_2$.

The two introduced metrics evaluate the robustness of classifiers from two different perspectives: $r_\infty$ calculates the expected accuracy on (FGS) adversarial examples, while $r_2$ measures a decision margin between benign examples from the two classes. For both of them, a higher value indicates stronger adversarial robustness. Note that unlike some metrics calculating (maybe normalized) Euclidean distances between all pairs of benign and adversarial examples, our $r_2$ omits the originally misclassified examples, which makes more sense if the classifiers are imperfect in terms of prediction accuracy. We will refer to $\mu_k := \mathbb{E}(x \mid y = k, \hat{y} = k)$, which is the conditional expectation for class $k$.

Be aware that although there exist attack-agnostic guarantees on model robustness [16, 30], they are all instance-specific. Instead of generalizing them to the entire input space for analysis, we focus on the proposed statistical metrics and present their connections to the guarantees later in Section 3.2. Some other experimentally feasible metrics shall be involved in Section 4. The following theorem sheds light on intrinsic relationships between the described robustness metrics and the $l_p$ norms of $w$.

Theorem 3.1. (The sparsity and robustness of binary linear classifiers). Suppose that $P_y(k) = 1/2$ for $k = \pm 1$, and an obtained linear classifier achieves the same expected accuracy $t$ on different classes; then we have

$r_2 = \frac{t}{2} \cdot \frac{w^T(\mu_{+1} - \mu_{-1})}{\|w\|_2} \qquad \text{and} \qquad r_\infty \le \frac{t}{2} \cdot \frac{w^T(\mu_{+1} - \mu_{-1})}{\epsilon \|w\|_1}.$  (2)

Proof. For $r_\infty$, we first rewrite it in the form of $\Pr(y \cdot w^T \check{x} > 0)$. We know from the assumptions that $\Pr(\hat{y} = k \mid y = k) = t$ and $\Pr(y = k) = 1/2$, so we further get

$r_\infty = \frac{t}{2} \sum_{k = \pm 1} \Pr\left(k \cdot w^T x > \epsilon \|w\|_1 \;\middle|\; y = k, \hat{y} = k\right),$  (3)

by using the law of total probability and substituting $\check{x}$ with $x - \epsilon y \cdot \mathrm{sgn}(w)$. Lastly, the result follows after applying Markov's inequality.

As for $r_2$, the proof is straightforward by similarly casting its definition into a sum of conditional expectations. That is,

$r_2 = \frac{t}{2} \sum_{k = \pm 1} \mathbb{E}_{x \mid y, \hat{y}}\left(\frac{|w^T x|}{\|w\|_2} \;\middle|\; y = k, \hat{y} = k\right).$  (4)

Theorem 3.1 indicates clear relationships between the sparsity and robustness of linear models. In terms of $r_\infty$, maximizing its upper bound gives rise to a sparse solution of $w$. By duality, maximizing the squared upper bound of $r_\infty$ also resembles solving a sparse PCA problem [5]. Reciprocally, we might also concur that a highly sparse $w$ implies relatively robust classification results. Nevertheless, the defined $r_2$ seems to have nothing to do with the sparsity of $w$.
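The closed-form quantities used above can be checked numerically. The sketch below (illustrative only; all values are synthetic) verifies that the DeepFool point $\tilde{x}$ lies exactly on the decision boundary with $\|x - \tilde{x}\|_2 = |w^T x|/\|w\|_2$, and that the FGS step lowers the signed score by exactly $\epsilon\|w\|_1$, the quantity appearing in Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=50)
x = rng.normal(size=50)

# Optimal l2 attack on a linear classifier: project x onto the hyperplane w^T x = 0.
x_tilde = x - w * (w @ x) / (w @ w)

margin = abs(w @ x) / np.linalg.norm(w)   # d(x, x_tilde) in closed form
boundary_score = w @ x_tilde              # should vanish: x_tilde sits on w^T x = 0

# FGS (the optimal l_infty attack here): x_check = x - eps * y * sign(w).
eps, y = 0.1, np.sign(w @ x)
x_check = x - eps * y * np.sign(w)
# FGS reduces the signed score y * w^T x by exactly eps * ||w||_1.
drop = y * (w @ x) - y * (w @ x_check)
```

The three identities hold exactly (up to floating-point error), independent of the draw of `w` and `x`.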
The metric $r_2$ is maximized iff $w$ approaches $\mu_{+1} - \mu_{-1}$ or $\mu_{-1} - \mu_{+1}$; however, sparsifying $w$ probably does not help in reaching this goal. In fact, under some assumptions about the data distribution, the dense reference model can be nearly optimal in the sense of $r_2$. We will see that this phenomenon remains in multi-class linear classification in Theorem 3.2 but does not remain in nonlinear DNNs in Section 3.2. One can check Sections 4.1 and 4.2 for experimental discussions in more detail.

Having realized that the $l_\infty$ robustness of binary linear classifiers is closely related to $\|w\|_1$, we now turn to multi-class cases with ground truth $y_i \in \{1, \ldots, c\}$ and prediction $\hat{y}_i = \arg\max_k (w_k^T x_i)$, in which $w_k = W[:, k]$ indicates the $k$-th column of a matrix $W \in \mathbb{R}^{n \times c}$. Here the training objective $f$ calculates the cross-entropy loss between the ground-truth labels and the outputs of a softmax function. The two introduced metrics shall be slightly modified to:

Multi-class: $r_\infty := \mathbb{E}_{x,y}\left(1_{y = \arg\max_k (w_k^T \check{x})}\right), \qquad r_2 := \mathbb{E}_{x,y}\left(1_{y = \hat{y}} \cdot d(x, \tilde{x})\right).$  (5)

Likewise, $\check{x} = x + \epsilon \cdot \mathrm{sgn}(\nabla f(x))$ and $\tilde{x} = x - w_\delta (w_\delta^T x)/\|w_\delta\|_2^2$ are the FGS and DeepFool adversarial examples under multi-class circumstances, in which $w_\delta = w_{\hat{y}} - w_e$ and $e \in \{1, \ldots, c\} - \{\hat{y}\}$ is carefully chosen such that $|(w_{\hat{y}} - w_e)^T x| / \|w_{\hat{y}} - w_e\|_2$ is minimized. Denoting an averaged classifier by $\bar{w} := \sum_k w_k / c$, we provide upper bounds for both $r_\infty$ and $r_2$ in the following theorem.

Theorem 3.2. (The sparsity and robustness of multi-class linear classifiers). Suppose that $P_y(k) = 1/c$ for $k \in \{1, \ldots, c\}$, and an obtained linear classifier achieves the same expected accuracy $t$ on different classes; then we have

$r_2 \le \frac{t}{c} \sum_{k=1}^{c} \frac{(w_k - \bar{w})^T \mu_k}{\|w_k - \bar{w}\|_2} \qquad \text{and} \qquad r_\infty \le \frac{t}{c} \sum_{k=1}^{c} \frac{(w_k - \bar{w})^T \mu_k}{\epsilon \|w_k - \bar{w}\|_1}$  (6)

under two additional assumptions: (I) FGS achieves higher per-class success rates than a weaker perturbation like $-\epsilon \cdot \mathrm{sgn}(w_y - \bar{w})$, and (II) the FGS perturbation does not correct misclassifications.

We present in Theorem 3.2 bounds for multi-class classifiers similar to those provided in Theorem 3.1, under some mild assumptions. Our proof is deferred to the supplementary material. We emphasize that the two additional assumptions are intuitively acceptable. First, increasing the classification loss in a more principled way, say using FGS, ought to diminish the expected accuracy more effectively. Second, with high probability, an original misclassification cannot be fixed using the FGS method, as one intends to do precisely the opposite.

Similarly, the presented bound for $r_\infty$ also implies sparsity, though it is the sparsity of $w_k - \bar{w}$. In fact, this is directly related to the sparsity of $w_k$, considering that the classifiers can simultaneously be post-processed to subtract their average, whilst the classification decision won't change for any possible input. Particularly, Theorem 3.2 also partially suits linear DNN-based classification. Letting the classifier $g_k$ be factorized in the form $w_k^T = (w_k')^T W_{d-1}^T \cdots W_1^T$, it is evident that higher sparsity of the multipliers encourages higher probability of a sparse $w_k$.

3.2 Deep Neural Networks

A nonlinear feedforward DNN is usually specified by a directed acyclic graph $G = (V, E)$ [4] with a single root node for final outputs. According to the forward propagation rule, the activation value of each internal (and also output) node is calculated based on its incoming nodes and the learnable weights corresponding to the edges. Nonlinear activation functions are incorporated to ensure capacity. With biases, some nodes output a special value of one; we omit them for simplicity, as before. Classification is performed by comparing the prediction scores corresponding to different classes, which means $\hat{y} = \arg\max_{k \in \{1, \ldots, c\}} g_k(x)$. Benefiting from some very recent theoretical efforts [16, 30], we can directly utilize well-established robustness guarantees for nonlinear DNNs. Let us first denote by $B_p(x, R)$ a closed ball centred at $x$ with radius $R$, and then denote by $L^k_{q,x}$ the (best) local Lipschitz constant of the function $g_{\hat{y}}(x) - g_k(x)$ over a fixed $B_p(x, R)$, if one exists. It has been proven that the following lemma offers a reasonable lower bound on the required $l_p$ norm of instance-specific perturbations when all classifiers are Lipschitz continuous [30].

Proposition 3.1. [30] Let $\hat{y} = \arg\max_{k \in \{1, \ldots, c\}} g_k(x)$ and $\frac{1}{p} + \frac{1}{q} = 1$, with $p \in \mathbb{R}^+$ and a set of Lipschitz continuous functions $\{g_k : \mathbb{R}^n \mapsto \mathbb{R}\}$; then for any $\Delta x \in B_p(0, R)$ with

$\|\Delta x\|_p \le \min\left\{\min_{k \ne \hat{y}} \frac{g_{\hat{y}}(x) - g_k(x)}{L^k_{q,x}},\; R\right\} := \gamma,$  (7)

it holds that $\hat{y} = \arg\max_{k \in \{1, \ldots, c\}} g_k(x + \Delta x)$, which means the classification decision does not change on $B_p(x, \gamma)$.

Here the introduced $\gamma$ is basically an instance-specific lower bound that guarantees the robustness of multi-class classifiers. We shall later discuss its connections with our $r_p$'s, for $p \in \{\infty, 2\}$, and for now we try providing a local Lipschitz constant (which may not be the smallest) of the function $g_{\hat{y}}(x) - g_k(x)$, to help us delve deeper into the robustness of nonlinear DNNs. Without loss of generality, the following discussion is made under a fixed radius $R > 0$ and a given instance $x \in \mathbb{R}^n$.

Some modern DNNs can be structurally very complex. Let us simply consider a multi-layer perceptron (MLP) parameterized by a series of weight matrices $W_1 \in \mathbb{R}^{n_0 \times n_1}, \ldots, W_d \in \mathbb{R}^{n_{d-1} \times n_d}$, in which $n_0 = n$ and $n_d = c$. Discussions about networks with more advanced architectures like convolutions, pooling, and skip connections can be generalized directly [1]. Specifically, we have

$g_k(x_i) = w_k^T \sigma(W_{d-1}^T \sigma(\ldots \sigma(W_1^T x_i))),$  (8)

in which $w_k = W_d[:, k]$ and $\sigma$ is the nonlinear activation function. Here we mostly focus on "ReLU networks" with rectified-linear-flavoured nonlinearity, so the neuron activations in middle layers are naturally sparse. For clarity, we discuss the weight and activation sparsities separately. Mathematically, we let $a_0 = x$ and $a_j = \sigma(W_j^T a_{j-1})$ for $0 < j < d$ be the layer-wise activations. We will refer to

$D_j(x) := \mathrm{diag}\left(1_{W_j[:,1]^T a_{j-1} > 0}, \ldots, 1_{W_j[:,n_j]^T a_{j-1} > 0}\right),$  (9)

which is a diagonal matrix whose entries taking value one correspond to nonzero activations within the $j$-th layer, and to $M_j \in \{0, 1\}^{n_{j-1} \times n_j}$, which is a binary mask corresponding to each (possibly sparse) $W_j$. Along with some analyses, the following lemma and theorem present intrinsic relationships between the adversarial robustness and the (both weight and activation) sparsity of nonlinear DNNs.

Lemma 3.1. (A local Lipschitz constant for ReLU networks). Let $\frac{1}{p} + \frac{1}{q} = 1$; then for any $x \in \mathbb{R}^n$, $k \in \{1, \ldots, c\}$ and $q \in \{1, 2\}$, the local Lipschitz constant of the function $g_{\hat{y}}(x) - g_k(x)$ satisfies

$L^k_{q,x} \le \|w_{\hat{y}} - w_k\|_q \sup_{x' \in B_p(x, R)} \prod_{j=1}^{d-1} \left(\|D_j(x')\|_p \|W_j\|_p\right).$  (10)

Theorem 3.3. (The sparsity and robustness of nonlinear DNNs). Let the weight matrices be represented as $W_j = W_j' \circ M_j$, in which $\{M_j[u, v]\}$ are independent Bernoulli $B(1, 1 - \alpha_j)$ random variables and $0 \notin \{W_j'[u, v]\}$, for $j \in \{1, \ldots, d-1\}$. Then for any $x \in \mathbb{R}^n$ and $k \in \{1, \ldots, c\}$, it holds that

$\mathbb{E}_{M_1, \ldots, M_{d-1}}(L^k_{2,x}) \le c_2 \cdot (1 - \eta(\alpha_1, \ldots, \alpha_{d-1}; x))$  (11)

and

$\mathbb{E}_{M_1, \ldots, M_{d-1}}(L^k_{1,x}) \le c_1 \cdot (1 - \eta(\alpha_1, \ldots, \alpha_{d-1}; x)),$  (12)

in which the function $\eta$ is monotonically increasing w.r.t. each $\alpha_j$, and $c_2 = \|w_{\hat{y}} - w_k\|_2 \prod_j \|W_j'\|_F$ and $c_1 = \|w_{\hat{y}} - w_k\|_1 \prod_j \|W_j'\|_{1,\infty}$ are two constants.

Proof Sketch. The function $\prod_j \|D_j(\cdot)\|_p \|W_j\|_p$ defined on $\mathbb{R}^n$ is bounded from above and below, thus we know there exists an $\hat{x} \in B_p(x, R)$ satisfying

$L^k_{q,x} \le \|w_{\hat{y}} - w_k\|_q \prod_j \|D_j(\hat{x})\|_p \|W_j\|_p.$  (13)

In particular, $\prod_j \|D_j(\hat{x})\|_p \ne 0$ is fulfilled iff $\|D_{d-1}(\hat{x})\|_p \ne 0$ (i.e., it equals 1 for $q \in \{1, 2\}$). Under the assumptions on $M_j$, we know that the entries of $W_j$ are independent of each other, thus

$\Pr_{M_1, \ldots, M_{d-1}}(D_{d-1}(\hat{x})[u, u] = 0) = \Pr_{M_1, \ldots, M_{d-1}}(W_{d-1}[:, u]^T a_{d-2} \le 0) \ge \prod_{u'} (\alpha_{d-1} + \xi_{d-2, u'} - \alpha_{d-1}\xi_{d-2, u'}),$  (14)

in which $\xi_{d-2, u'}$ is a newly introduced scalar that is equal to or less than the probability of the $u'$-th neuron being deactivated. In this manner, we can recursively define the function $\eta$, and it is easy to validate its monotonicity. Additionally, we prove that $c_q \ge \|w_{\hat{y}} - w_k\|_q \, \mathbb{E}\left(\prod_j \|W_j\|_p \,\middle|\, \|D_{d-1}(\hat{x})\|_p = 1\right)$ holds for $q \in \{1, 2\}$, and the result follows. See the supplementary material for a detailed proof.

In Lemma 3.1 we introduce probably smaller local Lipschitz constants than the commonly known ones (i.e., $c_2$ and $c_1$), and subsequently in Theorem 3.3 we build theoretical relationships between $L^k_{q,x}$ and the network sparsity, for $q \in \{1, 2\}$ (i.e., $p \in \{\infty, 2\}$). Apparently, $L^k_{q,x}$ is prone to getting smaller if any weight matrix gets more sparse.

[Figure 1: The robustness of linear classifiers with varying weight sparsity. Upper: binary classification between "1"s and "7"s; lower: multi-class classification on the whole MNIST test set.]
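To build intuition for the Bernoulli masking model in Theorem 3.3 (a Monte-Carlo illustration, not the paper's proof), the sketch below estimates $\mathbb{E}[\prod_j \|W_j' \circ M_j\|_F]$ at two drop rates $\alpha$ and compares it against the dense product $\prod_j \|W_j'\|_F$ entering $c_2$; higher drop rates shrink the factor, mirroring the $1 - \eta(\cdot)$ attenuation (the layer shapes and weights are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
shapes = [(30, 40), (40, 40), (40, 10)]            # a small MLP's weight shapes
Ws = [rng.normal(size=s) for s in shapes]

def expected_product(alpha, trials=200):
    """Monte-Carlo estimate of E[prod_j ||W_j o M_j||_F] with drop rate alpha."""
    total = 0.0
    for _ in range(trials):
        prod = 1.0
        for W in Ws:
            M = rng.random(W.shape) >= alpha       # keep each entry with prob. 1 - alpha
            prod *= np.linalg.norm(W * M, 'fro')
        total += prod
    return total / trials

dense = 1.0
for W in Ws:                                       # the dense factor prod_j ||W'_j||_F
    dense *= np.linalg.norm(W, 'fro')

p_low, p_high = expected_product(0.3), expected_product(0.9)
```

Since $\mathbb{E}\|W \circ M\|_F \approx \sqrt{1-\alpha}\,\|W\|_F$ per layer, the masked product falls well below the dense one and decreases monotonically with $\alpha$.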
It is worth noting that the local Lipschitz constant is of great importance in evaluating the robustness of DNNs, and it is effective to regularize DNNs by simply minimizing $L^k_{q,x}$, or equivalently $\|\nabla g_{\hat{y}}(x) - \nabla g_k(x)\|_q$ for differentiable continuous functions [16]. Thus we reckon that, when the network is over-parameterized, an appropriately higher weight sparsity implies a larger $\gamma$ and stronger robustness. Similar conclusions hold if $a_j$ gets more sparse.

Recall that in the linear binary case, we apply the DeepFool adversarial example $\tilde{x}$ when evaluating the robustness using $r_2$. It is not difficult to validate that the equality $d(x, \tilde{x}) = |(w_{\hat{y}} - w_{k \ne \hat{y}})^T x| / L^k_{2,x}$ holds for such $\tilde{x}$ and $w_{\pm 1} := \pm w$, which means the DeepFool perturbation ideally minimizes the Euclidean norm and helps us measure a lower bound in this regard. This can be directly generalized to multi-class classifiers. Unlike $r_2$, which represents a margin, our $r_\infty$ is basically an expected accuracy. Nevertheless, we also know that a perturbation of $-\epsilon y \cdot \mathrm{sgn}(w)$ shall successfully fool the classifiers if $\epsilon \ge |(w_{\hat{y}} - w_{k \ne \hat{y}})^T x| / L^k_{1,x}$.

4 Experimental Results

In this section, we conduct experiments to verify our theoretical results. To be consistent, we still start from linear models and turn to nonlinear DNNs afterwards. As previously discussed, we perform both $l_\infty$ and $l_2$ attacks on the classifiers to evaluate their adversarial robustness. In addition to the FGS [11] and DeepFool [22] attacks, which have been thoroughly discussed in Section 3, we introduce two more attacks in this section for extensive comparisons of model robustness.

Adversarial attacks. We use the FGS and randomized FGS (rFGS) [27] methods to perform $l_\infty$ attacks. As a famous $l_\infty$ attack, FGS has been widely exploited in the literature.
In order to generate adversarial examples, it calculates the gradient of the training loss w.r.t. the benign input and uses its sign as the perturbation, in an element-wise manner. The rFGS attack is a computationally efficient alternative to multi-step $l_\infty$ attacks, with an ability to break adversarial-training-based defences. We keep its hyper-parameters fixed for all experiments in this paper. For $l_2$ attacks, we choose DeepFool and the C&W's attack [3]. DeepFool linearizes nonlinear classifiers locally and approximates the optimal perturbations iteratively. The C&W's method casts the construction of adversarial examples as optimizing an objective function without constraints, such that some recent gradient-descent-based solvers can be adopted. On the basis of the different attacks, four $r_2$ and $r_\infty$ values can be calculated for each classification model.

4.1 The Sparse Linear Classifier Behaves Differently under $l_\infty$ and $l_2$ Attacks

In our experiments on linear classifiers, both the binary and multi-class scenarios shall be evaluated. We choose the well-established MNIST dataset as a benchmark, which consists of 70,000 28×28 images of handwritten digits. According to the official test protocol, 10,000 of them are used for performance evaluation and the remaining 60,000 for training. For experiments in the binary cases, we randomly choose a pair of digits (e.g., "0" and "8", or "1" and "7") as the positive and negative classes. Linear classifiers are trained following our previous discussions, utilizing the softplus function: $\min_{w,b} \sum_i \log(1 + \exp(-y_i(w^T x_i + b)))$. Parameters $w$ and $b$ are randomly initialized and learnt by means of stochastic gradient descent with momentum. For the "1" and "7" classification case, we train 10 reference models from different initializations and achieve a prediction accuracy of 99.17 ± 0.00% on the benign test set. For the classification of all 10 classes, we train 10 references similarly and achieve a test-set accuracy of 92.26 ± 0.08%.

[Figure 2: The robustness of nonlinear DNNs with varying weight sparsity. (a)-(b): LeNet-300-100, (c)-(d): LeNet-5, (e)-(f): the VGG-like network, (g)-(h): ResNet-32.]

To produce models with different weight sparsities, we use a progressive pruning strategy [14]. That is, we follow a pipeline of iterative pruning and re-training. Within each iteration, a portion ($\rho$) of the nonzero entries of $w$, whose magnitudes are relatively small in comparison with the others, are directly set to zero and shall never be activated again. After $m$ rounds of such "pruning", we collect $10(m+1)$ models from all 10 dense references. Here we set $m = 16$ and $\rho = 1/3$, so the achieved final percentage of zero weights should be 99.74% ≈ $1 - (1 - \rho)^m$. We calculate the prediction accuracies on adversarial examples (i.e., $r_\infty$) under different $l_\infty$ attacks, and the average Euclidean norm of the required perturbations (i.e., $r_2$) under different $l_2$ attacks, to evaluate the adversarial robustness of the different models in practice. For the $l_\infty$ attacks, we set $\epsilon = 0.1$.

Figure 1 illustrates how our metrics of robustness vary with the weight sparsity. We only demonstrate the variability of the first 12 points (from left to right) on each curve, to make the bars more resolvable. The upper and lower subfigures correspond to the binary and multi-class cases, respectively. Obviously, the experimental results are consistent with our previous theoretical ones.
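The progressive magnitude-pruning schedule described above can be sketched as follows (a minimal version without the interleaved re-training; the weight vector and its size are placeholders):

```python
import numpy as np

def prune_smallest(w, rho):
    """Zero out the rho-fraction of currently nonzero entries with smallest magnitude."""
    nz = np.flatnonzero(w)
    k = int(round(rho * nz.size))                  # how many weights to remove this round
    if k > 0:
        drop = nz[np.argsort(np.abs(w[nz]))[:k]]   # smallest-magnitude nonzero entries
        w[drop] = 0.0                              # pruned weights never come back
    return w

rng = np.random.default_rng(4)
w = rng.normal(size=100_000)

m, rho = 16, 1 / 3
for _ in range(m):                                 # re-training would happen between rounds
    w = prune_smallest(w, rho)

sparsity = np.mean(w == 0.0)
expected = 1.0 - (1.0 - rho) ** m                  # = 1 - (2/3)**16, about 0.9985
```

Each round removes a third of the surviving weights, so the zero-weight fraction after $m$ rounds follows $1 - (1-\rho)^m$ up to per-round rounding.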
While sparse linear models are prone to be more robust in the sense of $r_\infty$, their $r_2$ robustness remains similar to, or becomes even slightly weaker than, that of the dense references, until inevitable accuracy degradation emerges on benign examples (i.e., when $r_\infty$ may drop as well). We also observe from Figure 1 that, in both the binary and multi-class cases, $r_2$ starts decreasing much earlier than the benign-set accuracy. Though very slight in the binary case, the degradation of $r_2$ actually occurs after the first round of pruning (from 2.0103 ± 0.0022 to 2.0009 ± 0.0016 with DeepFool incorporated, and from 2.3151 ± 0.0023 to 2.3061 ± 0.0023 with the C&W's attack).

4.2 Sparse Nonlinear DNNs Can Be Consistently More Robust

Regarding nonlinear DNNs, we follow the same experimental pipeline as described in Section 4.1. We train MLPs with 2 hidden fully-connected layers, and convolutional networks with 2 convolutional layers, 2 pooling layers, and 2 fully-connected layers, as references on MNIST, following the "LeNet-300-100" and "LeNet-5" architectures in network compression papers [14, 13, 28, 21]. We also follow the training policy suggested by Caffe [17] and train the network models for 50,000 iterations with a batch size of 64, such that the training cross-entropy loss no longer decreases. The well-trained reference models achieve much higher prediction accuracies (LeNet-300-100: 98.20 ± 0.07% and LeNet-5: 99.11 ± 0.04%) on the benign test set than the previously tested linear ones.

Weight sparsity. We then prune the dense references and illustrate some major results regarding the robustness and weight sparsity in Figure 2 (a)-(d). (See Figure 3 in our supplementary material for results under rFGS and the C&W's attack.) The weight matrices/tensors within each layer are uniformly pruned, so the network sparsity should be approximately equal to the layer-wise sparsity.
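This uniform layer-wise scheme can be sketched as below (hypothetical layer shapes and random weights; one-shot magnitude pruning rather than the iterative pipeline), confirming that pruning each layer to the same rate makes the overall network sparsity match the layer-wise sparsity:

```python
import numpy as np

rng = np.random.default_rng(6)
layers = [rng.normal(size=s) for s in [(784, 300), (300, 100), (100, 10)]]

def prune_layer(W, sparsity):
    """Zero the given fraction of smallest-magnitude weights inside one layer."""
    k = int(sparsity * W.size)
    thresh = np.sort(np.abs(W), axis=None)[k - 1] if k > 0 else -1.0
    out = W.copy()
    out[np.abs(out) <= thresh] = 0.0               # exactly k entries fall at or below thresh
    return out

target = 0.9
pruned = [prune_layer(W, target) for W in layers]

layer_sp = [np.mean(W == 0.0) for W in pruned]     # per-layer zero fractions
net_sp = sum((W == 0.0).sum() for W in pruned) / sum(W.size for W in pruned)
```

With continuous random weights there are no magnitude ties, so every layer lands exactly at the target rate and so does the whole network.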
As expected, we observe results similar to the previous linear cases in terms of our $r_\infty$, but significantly different results in terms of $r_2$. Unlike the linear models, which behave differently under $l_\infty$ and $l_2$ attacks, nonlinear DNN models show a consistent trend of adversarial robustness with respect to sparsity. In particular, we observe increased $r_\infty$ and $r_2$ values under different attacks when continually pruning the models, until the sparsity reaches some threshold and leads to inevitable capacity degradation. For additional verification, we calculate the CLEVER [30] scores that approximate attack-agnostic lower bounds on the $l_p$ norms of the required perturbations (see Table 3 in the supplementary material).

Experiments are also conducted on CIFAR-10, where deeper nonlinear networks can be involved. We train 10 VGG-like network models [23] (each incorporating 12 convolutional layers and 2 fully-connected layers) and 10 ResNet models [15] (each incorporating 31 convolutional layers and a single fully-connected layer) from scratch. Such deep architectures lead to average prediction accuracies of 93.01% and 92.89%. Still, we prune the dense network models in the progressive manner and illustrate quantitative relationships between the robustness and weight sparsity in Figure 2 (e)-(h). The first and last layers in each network are kept dense to avoid early accuracy degradation on the benign set. The same observations can be made. Note that the ResNets are capable of resisting some DeepFool examples, for which the second and subsequent iterations make little sense and can be disregarded.

Activation sparsity. Having verified the relationship between the robustness and weight sparsity of nonlinear DNNs, we now examine the activation sparsity. As previously mentioned, the middle-layer activations of ReLU-incorporated DNNs are naturally sparse.
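This natural sparsity is easy to observe even in an untrained network. The sketch below (random Gaussian weights in LeNet-300-100-like shapes, purely illustrative) measures the fraction of exactly-zero ReLU activations:

```python
import numpy as np

rng = np.random.default_rng(3)
W1, W2 = rng.normal(size=(784, 300)), rng.normal(size=(300, 100))
x = rng.normal(size=(64, 784))                     # a batch of 64 toy inputs

relu = lambda z: np.maximum(z, 0.0)
a1 = relu(x @ W1)                                  # first hidden layer
a2 = relu(a1 @ W2)                                 # second hidden layer

# Fraction of exactly-zero neurons, i.e. the "natural" activation sparsity.
sparsity_a1 = np.mean(a1 == 0.0)
```

With zero-mean weights and inputs, each pre-activation is symmetric around zero, so roughly half the neurons are deactivated before any regularization is applied.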
We simply add an l1-norm regularization of the weight matrices/tensors to the learning objective to encourage higher sparsities, and calculate r∞ and r2 accordingly. Experiments are conducted on MNIST. Table 1 summarizes the results, in which "Sparsity (a)" indicates the percentage of deactivated (i.e., zero) neurons feeding into the last fully-connected layer. Here the r∞ and r2 values are calculated using the FGS and DeepFool attacks, respectively. Apparently, we still observe positive correlations between the robustness and the (activation) sparsity within a certain range.

Table 1: The robustness of DNNs regularized using the l1 norm of weight matrices/tensors.

Network          r∞              r2              Accuracy        Sparsity (a)
LeNet-300-100    0.2862±0.0113   1.3213±0.0207   98.20±0.07%     45.25±1.14%
                 0.3993±0.0079   1.5903±0.0240   98.27±0.04%     75.92±0.54%
                 0.2098±0.0133   1.1440±0.0402   97.96±0.07%     95.22±0.18%
LeNet-5          0.7388±0.0188   2.7831±0.1490   99.11±0.04%     51.26±1.88%
                 0.7729±0.0081   3.1688±0.1203   99.19±0.05%     97.54±0.10%
                 0.6741±0.0162   2.0799±0.0522   99.10±0.06%     99.64±0.02%

4.3 Avoid "Over-pruning"

We discover from Figure 2 that the sharp decrease of the adversarial robustness, especially in the sense of r2, may occur in advance of the benign-set accuracy degradation. Hence, it can be necessary to evaluate the adversarial robustness of DNNs during an aggressive surgery, even though the prediction accuracy of compressed models may remain competitive with their references on benign test sets. To further explore this, we collect some off-the-shelf sparse models (including a 56× compressed LeNet-300-100 and a 108× compressed LeNet-5) [13] and their corresponding dense references from the Internet, and hereby evaluate their r∞ and r2 robustness. Table 2 compares the robustness of the different models.
Obviously, these extremely sparse models are more vulnerable to the DeepFool attack; what is worse, the over-100× pruned LeNet-5 seems also more vulnerable to FGS, which suggests that researchers should take care and avoid "over-pruning" whenever possible. Similar observations might also be made with other pruning methods.

Table 2: The robustness of pre-compressed nonlinear DNNs and their provided dense references.

Model                  r∞       r2       Sparsity (W)
LeNet-300-100 dense    0.2663   1.3899   0.00%
LeNet-300-100 sparse   0.3823   1.1058   98.21%
LeNet-5 dense          0.7887   2.7226   0.00%
LeNet-5 sparse         0.6791   1.7383   99.07%

5 Conclusions

In this paper, we study some intrinsic relationships between the adversarial robustness and the sparsity of classifiers, both theoretically and empirically. By introducing plausible metrics, we demonstrate that, unlike some linear models which behave differently under l∞ and l2 attacks, sparse nonlinear DNNs can be consistently more robust to both of them than their corresponding dense references, until their sparsity reaches certain thresholds and inevitably harms the network capacity. Our results also demonstrate that such sparsity, including sparse connections and middle-layer neuron activations, can be effectively imposed using network pruning and l1 regularization of weight tensors.

Acknowledgement

We would like to thank the anonymous reviewers for their constructive suggestions. Changshui Zhang is supported by NSFC (Grant No. 61876095, No. 61751308 and No. 61473167) and Beijing Natural Science Foundation (Grant No. L172037).

References

[1] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, 2017.

[2] Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information.
IEEE Transactions on Information Theory, 52(2):489–509, 2006.

[3] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In SP, 2017.

[4] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017.

[5] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. JMLR, 9(July):1269–1294, 2008.

[6] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando De Freitas. Predicting parameters in deep learning. In NIPS, 2013.

[7] Guneet S Dhillon, Kamyar Azizzadenesheli, Zachary C Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. In ICLR, 2018.

[8] David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[9] Angus Galloway, Graham W Taylor, and Medhat Moussa. Attacking binarized neural networks. In ICLR, 2018.

[10] Ji Gao, Beilun Wang, Zeming Lin, Weilin Xu, and Yanjun Qi. Deepcloak: Masking deep neural network models for robustness against adversarial samples. In ICLR Workshop, 2017.

[11] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

[12] Soorya Gopalakrishnan, Zhinus Marzi, Upamanyu Madhow, and Ramtin Pedarsani. Combating adversarial attacks using sparse representations. In ICLR Workshop, 2018.

[13] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NIPS, 2016.

[14] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In CVPR, 2016.

[16] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In NIPS, 2017.

[17] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

[18] Jiajun Lu, Theerasit Issaranon, and David Forsyth. Safetynet: Detecting and rejecting adversarial examples robustly. In ICCV, 2017.

[19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

[20] Zhinus Marzi, Soorya Gopalakrishnan, Upamanyu Madhow, and Ramtin Pedarsani. Sparsity-based defense against adversarial attacks on linear classifiers. arXiv preprint arXiv:1801.04695, 2018.

[21] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In ICML, 2017.

[22] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, 2016.

[23] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In NIPS, 2017.

[24] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In SP, 2016.

[25] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. Faster CNNs with direct sparse convolutions and guided pruning. In ICLR, 2017.

[26] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks.
In ICLR, 2014.

[27] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In ICLR, 2018.

[28] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. In ICLR, 2017.

[29] Luyu Wang, Gavin Weiguang Ding, Ruitong Huang, Yanshuai Cao, and Yik Chau Lui. Adversarial robustness of pruned neural networks. In ICLR Workshop submission, 2018.

[30] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In ICLR, 2018.

[31] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. In ICLR, 2018.

[32] Shaokai Ye, Siyue Wang, Xiao Wang, Bo Yuan, Wujie Wen, and Xue Lin. Defending DNN adversarial attacks with pruning and logits augmentation. In ICLR Workshop submission, 2018.